Introduction

We will take a look at the dataset of students admitted across different specialties in 2021, taken from the Estonian statistics website. The dataset contains 108 rows and 6 columns showing how many students with each mother tongue are in each study level and programme group, as follows:

## Rows: 108
## Columns: 6
## $ `Level of study`        <chr> "Bachelor's study", "Bachelor's study", "Bache…
## $ `Study programme group` <chr> "Journalism and information", "Architecture an…
## $ Estonian                <dbl> 82, 31, 215, 173, 124, 571, 25, 266, 206, 67, …
## $ Russian                 <dbl> 3, 7, 45, 53, 24, 165, 0, 84, 40, 22, 0, 14, 1…
## $ `Other mother tongue`   <dbl> 0, 0, 1, 20, 5, 14, 0, 7, 8, 1, 0, 5, 1, 1, 0,…
## $ `Mother tongue unknown` <dbl> 0, 0, 0, 11, 9, 23, 0, 0, 23, 0, 0, 11, 0, 0, …

Level of study is a nominal categorical variable (nominal because Integrated Bachelor's/Master's study does not fit clearly into an order with the other levels). The dataset includes 4 levels:

## [1] "Bachelor's study"                    
## [2] "Integrated Bachelor's/Master's study"
## [3] "Master's study"                      
## [4] "Doctoral study"

Then we have 27 study programme groups, also a nominal categorical variable:

##  [1] "Journalism and information"                       
##  [2] "Architecture and construction"                    
##  [3] "Biological and environmental sciences"            
##  [4] "Physical sciences"                                
##  [5] "Humanities"                                       
##  [6] "Information and Communication Technologies (ICTs)"
##  [7] "Personal services"                                
##  [8] "Languages and cultures"                           
##  [9] "Arts"                                             
## [10] "Mathematics and statistics"                       
## [11] "Medicine"                                         
## [12] "Music and performing arts"                        
## [13] "Psychology"                                       
## [14] "Agriculture, forestry and fishery"                
## [15] "Military and defence"                             
## [16] "Protection of persons and property"               
## [17] "Social sciences"                                  
## [18] "Social services"                                  
## [19] "Sports"                                           
## [20] "Engineering, manufacturing and technology"        
## [21] "Health care"                                      
## [22] "Transport services"                               
## [23] "Religion and theology"                            
## [24] "Veterinary"                                       
## [25] "Law"                                              
## [26] "Teacher training and education science"           
## [27] "Business and administration"

Finally, Estonian, Russian, Other mother tongue, and Mother tongue unknown are discrete numerical variables giving the number of students with each mother tongue.

Research question

I want to test whether there is a relationship between level of study and mother tongue among students admitted in Estonia in 2021.

EDA

Let's have a look at the summary statistics of the mother tongue columns.

##     Estonian         Russian       Other mother tongue Mother tongue unknown
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000     Min.   : 0.000       
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000     1st Qu.: 0.000       
##  Median : 12.50   Median :  1.00   Median :  0.000     Median : 0.000       
##  Mean   : 66.64   Mean   : 12.57   Mean   :  7.231     Mean   : 6.583       
##  3rd Qu.: 74.25   3rd Qu.: 13.00   3rd Qu.:  7.250     3rd Qu.: 4.000       
##  Max.   :571.00   Max.   :165.00   Max.   :103.000     Max.   :88.000

Interestingly, Estonian and Russian have a 1st quartile of 0, while Other mother tongue and Mother tongue unknown have a median of 0. This shows that a lot of records have a value of 0 in these columns. So I plotted a histogram of the number of students for each mother tongue to see the distribution. As expected, it is strongly right-skewed, as shown below.

Data cleaning and transformation

From the dataset, I want to

  1. Remove the Study programme group and Mother tongue unknown columns, since they are not related to the research question.
  2. Remove the word "study" from the Level of study column, because it is redundant.
  3. Rename the Other mother tongue column to Other, to make it concise.
  4. Pivot the language columns to long format, with the columns Mother tongue language and Number of students.
  5. Change the column data types (factors for the two categorical columns, integer for the counts).
  6. Group the data by Level of study and Mother tongue language, and sum the number of students in each group.
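The steps above can be sketched in a few lines. The outputs in this report come from R; the sketch below uses plain Python only as an illustration, and it is not the code that produced the results. It runs on the first two rows of the raw table, with counts taken from the glimpse output in the Introduction. The second programme name is truncated in that output and is assumed here; the helper name `tidy` is hypothetical.

```python
from collections import defaultdict

# First two raw rows; counts are taken from the glimpse output in the Introduction.
rows = [
    {"Level of study": "Bachelor's study",
     "Study programme group": "Journalism and information",
     "Estonian": 82, "Russian": 3,
     "Other mother tongue": 0, "Mother tongue unknown": 0},
    {"Level of study": "Bachelor's study",
     # Programme name is truncated in the glimpse output; assumed here.
     "Study programme group": "Architecture and construction",
     "Estonian": 31, "Russian": 7,
     "Other mother tongue": 0, "Mother tongue unknown": 0},
]

# Steps 1 and 3: keep only the three language columns and rename
# "Other mother tongue" to "Other". Columns not listed here are dropped.
LANGUAGE_COLUMNS = {
    "Estonian": "Estonian",
    "Russian": "Russian",
    "Other mother tongue": "Other",
}


def tidy(raw_rows):
    """Return {(level, language): number of students} in long, grouped form."""
    totals = defaultdict(int)
    for row in raw_rows:
        # Step 2: drop the redundant word "study" from the level name.
        level = row["Level of study"].replace(" study", "")
        # Step 4: one (level, language) record per language column.
        for column, language in LANGUAGE_COLUMNS.items():
            # Steps 5 and 6: store counts as integers and sum them per group.
            totals[(level, language)] += int(row[column])
    return dict(totals)


result = tidy(rows)
for (level, language), students in result.items():
    print(level, language, students)
```

With all 108 rows as input, the same grouping would yield the 12 combinations of level and language.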

Here is the final result:

## # A tibble: 12 × 3
## # Groups:   Level of study, Mother tongue language [12]
##    `Level of study`               `Mother tongue language` `Number of students`
##    <fct>                          <fct>                                   <int>
##  1 Bachelor's                     Estonian                                 3705
##  2 Bachelor's                     Russian                                   806
##  3 Bachelor's                     Other                                     115
##  4 Master's                       Estonian                                 2804
##  5 Master's                       Russian                                   438
##  6 Master's                       Other                                     526
##  7 Integrated Bachelor's/Master's Estonian                                  519
##  8 Integrated Bachelor's/Master's Russian                                    91
##  9 Integrated Bachelor's/Master's Other                                      50
## 10 Doctoral                       Estonian                                  169
## 11 Doctoral                       Russian                                    23
## 12 Doctoral                       Other                                      90

The number of students for each mother tongue at each level of study is shown in the following chart:

Estonian dominates at every study level, as expected, but it is not clear from the chart whether there is a relationship between level of study and mother tongue.

Hypothesis testing

Our dataset comes from the Estonian statistics website. The sample size (the total number of students, excluding those with an unknown mother tongue) is

## [1] 9336

I want to test for a relationship between study level and mother tongue. Both variables are categorical, so the chi-square test of independence is appropriate. The assumptions of the test are met, as checked below.

First, I create a contingency table from the dataset.

##                                 Mother tongue language
## Level of study                   Estonian Russian Other
##   Bachelor's                         3705     806   115
##   Master's                           2804     438   526
##   Integrated Bachelor's/Master's      519      91    50
##   Doctoral                            169      23    90
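The cell-count assumption of the chi-square test is usually stated in terms of expected counts, which can be derived from this table. The following is a minimal sketch in plain Python (the report's own outputs come from R), with the observed counts copied from the table above; the variable names are illustrative.

```python
# Observed counts copied from the contingency table above.
# Rows: Bachelor's, Master's, Integrated Bachelor's/Master's, Doctoral.
# Columns: Estonian, Russian, Other.
observed = [
    [3705, 806, 115],
    [2804, 438, 526],
    [519, 91, 50],
    [169, 23, 90],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

smallest_expected = min(min(row) for row in expected)
print(grand_total)                  # total number of students in the table
print(round(smallest_expected, 2))  # smallest expected cell count
```

The smallest expected count belongs to the Doctoral and Other cell, which combines the smallest row total with the smallest column total.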

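No cell has an expected count below 5 (the smallest, for Doctoral and Other, is about 23.6), so the chi-square approximation should be accurate. I therefore run the chi-square test at a 99% confidence level (significance level 0.01).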

## 
##  Pearson's Chi-squared test
## 
## data:  ctable
## X-squared = 596.06, df = 6, p-value < 2.2e-16
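As a cross-check, the statistic and degrees of freedom reported above can be recomputed by hand from the contingency table. This is a sketch in plain Python, not the R call that produced the output; the function name `chi_square_statistic` is illustrative.

```python
# Observed counts copied from the contingency table
# (rows: levels of study, columns: Estonian, Russian, Other).
observed = [
    [3705, 806, 115],
    [2804, 438, 526],
    [519, 91, 50],
    [169, 23, 90],
]


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the two variables were independent.
            expected_count = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected_count) ** 2 / expected_count

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


statistic, df = chi_square_statistic(observed)
print(round(statistic, 2), df)
```

Most of the statistic comes from the Other column, where the Bachelor's count is far below its expected value and the Master's and Doctoral counts are far above theirs.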

According to the chi-square table, the critical value for a significance level of 0.01 and 6 degrees of freedom is 16.812. The test statistic (596.06) is far above the critical value and the p-value is far below 0.01, so we can reject the null hypothesis that level of study and mother tongue are independent.
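The critical value and the p-value can both be checked without a lookup table, because the upper-tail probability of the chi-square distribution has a simple closed form when the degrees of freedom are even: for df = 2k, P(X >= x) = exp(-x/2) * sum over i from 0 to k-1 of (x/2)^i / i!. The following is a sketch in plain Python under that assumption; the function name `chi2_sf_even_df` is illustrative and is not part of any library.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0   # current term (x/2)^i / i!, starting at i = 0
    total = 1.0  # running sum of the terms
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Tail probability at the tabulated critical value: should be close to 0.01.
alpha_at_critical = chi2_sf_even_df(16.812, 6)

# p-value of the observed statistic reported by the test.
p_value = chi2_sf_even_df(596.06, 6)

print(alpha_at_critical)
print(p_value)
```

The first value confirms that 16.812 corresponds to a significance level of 0.01, and the second is consistent with the reported p-value < 2.2e-16.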

Conclusion

The chi-square test shows that there is a relationship between level of study and mother tongue. However, the test alone does not tell us how strong the relationship is or what form it takes; that could be the research question for a further study.