A report on a Higher Education Dataset

Introduction

In this report, I explore the ‘higher.edu’ dataset from the Estonian statistics website. Additional information can be found here: https://raw.githubusercontent.com/Maria-13/DataAnalytics_R/main/Assignments/higher_edu_in_Estonia-1.csv

The ‘higher.edu’ data frame includes six variables:

- ‘Level.of.study’
- ‘Study.programme.group’
- ‘Estonian’
- ‘Russian’
- ‘Other.mother.tongue’
- ‘Mother.tongue.unknown’

First, I load the packages needed to explore the dataset.

There are 108 entries and 6 variables. The ‘Level.of.study’ variable has four levels and the ‘Study.programme.group’ variable has 21 levels. The ‘Estonian’, ‘Russian’, ‘Other.mother.tongue’, and ‘Mother.tongue.unknown’ variables are integer counts with 108 values each; as numeric variables they do not have levels.

Performing EDA

I will explore the structure and content of the dataset, including the number of rows and columns, the data types of each variable, and the presence of missing values.

I will also use the summary function to generate descriptive statistics for each numeric variable: the minimum, quartiles, median, mean, and maximum. For the character columns, the ‘Mode’ entry in the output is the storage mode, not the statistical mode.

## tibble [108 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Level.of.study       : chr [1:108] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
##  $ Study.programme.group: chr [1:108] "Journalism and information" "Architecture and construction" "Biological and environmental sciences" "Physical sciences" ...
##  $ Estonian             : int [1:108] 82 31 215 173 124 571 25 266 206 67 ...
##  $ Russian              : int [1:108] 3 7 45 53 24 165 0 84 40 22 ...
##  $ Other.mother.tongue  : int [1:108] 0 0 1 20 5 14 0 7 8 1 ...
##  $ Mother.tongue.unknown: int [1:108] 0 0 0 11 9 23 0 0 23 0 ...
##  Level.of.study     Study.programme.group    Estonian         Russian      
##  Length:108         Length:108            Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character      1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character      Median : 12.50   Median :  1.00  
##                                           Mean   : 66.64   Mean   : 12.57  
##                                           3rd Qu.: 74.25   3rd Qu.: 13.00  
##                                           Max.   :571.00   Max.   :165.00  
##  Other.mother.tongue Mother.tongue.unknown
##  Min.   :  0.000     Min.   : 0.000       
##  1st Qu.:  0.000     1st Qu.: 0.000       
##  Median :  0.000     Median : 0.000       
##  Mean   :  7.231     Mean   : 6.583       
##  3rd Qu.:  7.250     3rd Qu.: 4.000       
##  Max.   :103.000     Max.   :88.000
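The report's code chunks are not shown, so the following is only a minimal sketch of how output like the above could be produced with base R. The data frame is rebuilt by hand from the first four rows visible in the `str()` output, so its summary will not match the full 108-row dataset.

```r
# Assumption: only the first four rows, copied from the str() output above.
higher.edu <- data.frame(
  Level.of.study        = rep("Bachelor's study", 4),
  Study.programme.group = c("Journalism and information",
                            "Architecture and construction",
                            "Biological and environmental sciences",
                            "Physical sciences"),
  Estonian              = c(82L, 31L, 215L, 173L),
  Russian               = c(3L, 7L, 45L, 53L),
  Other.mother.tongue   = c(0L, 0L, 1L, 20L),
  Mother.tongue.unknown = c(0L, 0L, 0L, 11L),
  stringsAsFactors = FALSE
)

print(dim(higher.edu))      # number of rows and columns
str(higher.edu)             # data type of each variable
print(summary(higher.edu))  # min, quartiles, median, mean, max for numeric columns

# Count missing values in each column
na_counts <- colSums(is.na(higher.edu))
print(na_counts)
```

With the full CSV, the same four calls would apply unchanged to the data frame read from the file.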

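Using the ‘dplyr’ package, I am building a summary of ‘Estonian’ by level of study. The values shown are consistent with medians rather than means: the Master’s value of 53 equals the Master’s median, whereas the Master’s mean is 103.9.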

##                         Level.of.study Estonian
## 1                     Bachelor's study       90
## 2                       Doctoral study        4
## 3 Integrated Bachelor's/Master's study        0
## 4                       Master's study       53
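A grouped summary of this kind can be sketched as follows. The report uses ‘dplyr’; this sketch swaps in base R's `aggregate()` to stay dependency-free. The `toy` data frame is an assumption: it holds only a handful of ‘Estonian’ values that appear elsewhere in the report's output, so the Bachelor's result will differ from the table above.

```r
# Assumption: a small illustrative subset, not the full dataset.
toy <- data.frame(
  Level.of.study = c("Bachelor's study", "Bachelor's study", "Bachelor's study",
                     "Master's study", "Master's study", "Master's study"),
  Estonian       = c(82L, 31L, 215L, 12L, 53L, 198L),
  stringsAsFactors = FALSE
)

# One median per level of study
by_level <- aggregate(Estonian ~ Level.of.study, data = toy, FUN = median)
print(by_level)
```

Replacing `median` with `mean` gives the per-level mean instead.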

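In the same way, I am building a summary by level of study for a second language column. The output below is for ‘Other.mother.tongue’; the ‘Russian’ column can be summarised in the same manner.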

##                         Level.of.study Other.mother.tongue
## 1                     Bachelor's study                   1
## 2                       Doctoral study                   1
## 3 Integrated Bachelor's/Master's study                   0
## 4                       Master's study                   9

I am choosing three variables to work with: Level of study (restricted to Master’s study, 27 rows), Estonian, and Russian. The mean for Estonian speakers is 103.9 and the median is 53, while the mean for Russian speakers is 16.22 and the median is 7.

##  Level.of.study        Estonian        Russian     
##  Length:27          Min.   :  0.0   Min.   : 0.00  
##  Class :character   1st Qu.: 28.0   1st Qu.: 1.50  
##  Mode  :character   Median : 53.0   Median : 7.00  
##                     Mean   :103.9   Mean   :16.22  
##                     3rd Qu.:113.5   3rd Qu.:18.50  
##                     Max.   :531.0   Max.   :85.00
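Selecting the Master's rows and the three working variables could look like the sketch below, using base R's `subset()`. The rows are an assumption: two Bachelor's and three Master's (Estonian, Russian) pairs read off the report's output, plus a placeholder column filled with `NA` to show that unused columns are dropped.

```r
# Assumption: a few rows only; Other.mother.tongue is a placeholder column.
higher.edu <- data.frame(
  Level.of.study      = c("Bachelor's study", "Bachelor's study",
                          "Master's study", "Master's study", "Master's study"),
  Estonian            = c(82L, 31L, 531L, 198L, 12L),
  Russian             = c(3L, 7L, 73L, 37L, 0L),
  Other.mother.tongue = NA_integer_,
  stringsAsFactors = FALSE
)

# Keep only Master's rows and the three variables of interest
masters <- subset(higher.edu,
                  Level.of.study == "Master's study",
                  select = c(Level.of.study, Estonian, Russian))

print(summary(masters))
```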

I am visualising the selected variable.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Getting a Research Question

Based on the dataset, I formulate my research question for a chi-square test of independence: is there a significant relationship between the numbers of Estonian and Russian speakers at the Master’s level of study?

To answer this question, we need to create a contingency table for these two variables:

## `geom_smooth()` using formula = 'y ~ x'

Carrying out Hypothesis Testing

In the dataset, I explored the Master’s study level in Level.of.study for Estonian and Russian language speakers in Estonia. I will analyse the data to check whether there is a relationship between the two variables.

##      
##        0  1  2  3  4  7  9 11 13 18 19 20 22 37 73 79 85 Sum
##   0    2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   2
##   7    0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   12   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   18   0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   22   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   26   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   30   0  1  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0   2
##   41   0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0   1
##   43   0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0   1
##   46   0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0   1
##   50   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   53   0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0   1
##   63   0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0   1
##   70   0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0   1
##   74   0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0   1
##   75   0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0   1
##   105  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0   1
##   106  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0   1
##   121  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0   1
##   125  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0   1
##   150  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0   1
##   198  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0   1
##   353  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0   1
##   455  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1   1
##   531  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0   1
##   Sum  6  1  2  1  2  2  2  1  2  1  1  1  1  1  1  1  1  27
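A table of this shape can be sketched with `table()` and `addmargins()`. The six (Estonian, Russian) pairs below are read off the table above and are only a subset of the 27 Master's rows; the object names are hypothetical. The `Sum` row and column added by `addmargins()` are totals for display and are not part of the observed counts.

```r
# Assumption: six of the 27 Master's rows, taken from the printed table.
masters <- data.frame(
  Estonian = c(531L, 455L, 353L, 198L, 12L, 7L),
  Russian  = c(73L, 85L, 79L, 37L, 0L, 2L)
)

# Cross-tabulate the two count variables
obs <- table(Estonian = masters$Estonian, Russian = masters$Russian)

# Append row and column totals for display
obs_with_sums <- addmargins(obs)
print(obs_with_sums)

print(dim(obs))            # observed counts only
print(dim(obs_with_sums))  # one extra row and column for the totals
```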

Let us visualise the contingency table with a balloon plot: a graphical matrix in which each cell contains a dot whose size reflects the relative magnitude of the corresponding value.

## Warning in chisq.test(obs.table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  obs.table
## X-squared = 411.75, df = 425, p-value = 0.6687
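The reported figures can be checked by hand. A chi-square test of independence on a table with r rows and c columns has (r − 1)(c − 1) degrees of freedom, and the p-value is the upper-tail probability of the statistic. The sketch below applies this to the table dimensions printed earlier; the helper `chisq_df` is a hypothetical name. Note that 425 = 25 × 17 corresponds to a 26 × 18 table, which suggests the `Sum` row and column may have been included in the tested table; the 25 × 17 table of counts alone would give 384.

```r
# Degrees of freedom for a test of independence: (rows - 1) * (cols - 1)
chisq_df <- function(n_rows, n_cols) (n_rows - 1) * (n_cols - 1)

df_counts_only  <- chisq_df(25, 17)  # distinct Estonian x Russian values only
df_with_margins <- chisq_df(26, 18)  # same table plus the Sum row and column

print(c(counts_only = df_counts_only, with_margins = df_with_margins))

# Upper-tail probability for the reported statistic and degrees of freedom
p_value <- pchisq(411.75, df = 425, lower.tail = FALSE)
print(round(p_value, 4))
```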

The plot shows that there are more Estonian-speaking than Russian-speaking students at the Master’s level of study.

Null hypothesis: The row and column variables of the contingency table are independent and have no relationship.

Alternative hypothesis: The row and column variables are dependent and there is a relationship between Estonian-speaking and Russian-speaking Master’s students.

A Report on Collected Data and Sample Size

The selected data consist of three variables: level of study (restricted to Master’s study), Estonian, and Russian. There are 27 observations. The minimum value for both languages is 0; the maximum is 531 for Estonian speakers and 85 for Russian speakers. The mean is 103.9 for Estonian-speaking and 16.22 for Russian-speaking Master’s students, and the medians are 53 and 7 respectively.

##  Level.of.study        Estonian        Russian     
##  Length:27          Min.   :  0.0   Min.   : 0.00  
##  Class :character   1st Qu.: 28.0   1st Qu.: 1.50  
##  Mode  :character   Median : 53.0   Median : 7.00  
##                     Mean   :103.9   Mean   :16.22  
##                     3rd Qu.:113.5   3rd Qu.:18.50  
##                     Max.   :531.0   Max.   :85.00

Making Assumption and Reporting Results

It is possible that there is some kind of relationship between the numbers of Estonian-speaking and Russian-speaking students at the Master’s level of study. To investigate this, I used a chi-square test of independence. The test gives a p-value of 0.669, which is greater than the significance level of 0.05. I therefore fail to reject the null hypothesis that there is no relationship between Master’s Estonian-speaking and Russian-speaking students. The warning in the test output (‘Chi-squared approximation may be incorrect’) indicates that some expected cell counts are small, so this p-value should be read with caution.

p-value = 0.669, which is above the 5% significance level (p > 0.05).
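The decision rule used here can be written out explicitly. This is a small sketch with hypothetical variable names; the p-value is the one reported in the test output and the 0.05 threshold is the significance level stated in the text.

```r
alpha   <- 0.05    # significance level chosen in the report
p_value <- 0.6687  # p-value reported by the chi-square test

# Reject the null hypothesis only when the p-value falls below alpha
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```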

Conclusion:

A chi-square test of independence was performed to examine the relationship between Estonian-speaking and Russian-speaking students at the Master’s level. The test found no statistically significant relationship between the two (X-squared = 411.75, df = 425, p = 0.669).