We will look at a dataset of students admitted across different specialties in 2021, taken from the Estonian statistics website. The dataset contains 108 rows and 6 columns showing how many students with each mother tongue were admitted to each study level and programme group, as follows:
## Rows: 108
## Columns: 6
## $ `Level of study` <chr> "Bachelor's study", "Bachelor's study", "Bache…
## $ `Study programme group` <chr> "Journalism and information", "Architecture an…
## $ Estonian <dbl> 82, 31, 215, 173, 124, 571, 25, 266, 206, 67, …
## $ Russian <dbl> 3, 7, 45, 53, 24, 165, 0, 84, 40, 22, 0, 14, 1…
## $ `Other mother tongue` <dbl> 0, 0, 1, 20, 5, 14, 0, 7, 8, 1, 0, 5, 1, 1, 0,…
## $ `Mother tongue unknown` <dbl> 0, 0, 0, 11, 9, 23, 0, 0, 23, 0, 0, 11, 0, 0, …
Level of study is nominal categorical data (nominal rather than ordinal, because Integrated Bachelor's/Master's study has no clear position in the order of the other levels). The dataset includes 4 levels:
## [1] "Bachelor's study"
## [2] "Integrated Bachelor's/Master's study"
## [3] "Master's study"
## [4] "Doctoral study"
Study programme group is also nominal categorical data, with 27 groups:
## [1] "Journalism and information"
## [2] "Architecture and construction"
## [3] "Biological and environmental sciences"
## [4] "Physical sciences"
## [5] "Humanities"
## [6] "Information and Communication Technologies (ICTs)"
## [7] "Personal services"
## [8] "Languages and cultures"
## [9] "Arts"
## [10] "Mathematics and statistics"
## [11] "Medicine"
## [12] "Music and performing arts"
## [13] "Psychology"
## [14] "Agriculture, forestry and fishery"
## [15] "Military and defence"
## [16] "Protection of persons and property"
## [17] "Social sciences"
## [18] "Social services"
## [19] "Sports"
## [20] "Engineering, manufacturing and technology"
## [21] "Health care"
## [22] "Transport services"
## [23] "Religion and theology"
## [24] "Veterinary"
## [25] "Law"
## [26] "Teacher training and education science"
## [27] "Business and administration"
Finally, Estonian, Russian, Other mother tongue, and Mother tongue unknown are discrete numerical data giving the number of students with each mother tongue.
I want to test whether there is a relationship between level of study and mother tongue among students admitted in Estonia in 2021.
Let's look at the summary statistics of the mother tongue columns.
## Estonian Russian Other mother tongue Mother tongue unknown
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 12.50 Median : 1.00 Median : 0.000 Median : 0.000
## Mean : 66.64 Mean : 12.57 Mean : 7.231 Mean : 6.583
## 3rd Qu.: 74.25 3rd Qu.: 13.00 3rd Qu.: 7.250 3rd Qu.: 4.000
## Max. :571.00 Max. :165.00 Max. :103.000 Max. :88.000
Interestingly, Estonian and Russian have a 1st quartile of 0, while Other mother tongue and Mother tongue unknown have a median of 0. This shows that many records have a value of 0 in these columns. I therefore plotted a histogram of the number of students for each mother tongue to see the distribution. As expected, it is strongly skewed to the right, as shown below.
From the dataset, I want to:
- Remove the `Study programme group` and `Mother tongue unknown` columns, since they are not related to the research question.
- Remove the word "study" from the values of the `Level of study` column because it is redundant.
- Rename the `Other mother tongue` column to `Other` to make it concise.
- Reshape the language columns into `Mother tongue language` and `Number of students`.
- Convert `Level of study` and `Mother tongue language` to factors (levels) because each has a limited set of values.
- Convert `Number of students` to integer because we don’t need decimal points.
- Sum the number of students by `Level of study` and `Mother tongue language`.

Here is the final result:
## # A tibble: 12 × 3
## # Groups: Level of study, Mother tongue language [12]
## `Level of study` `Mother tongue language` `Number of students`
## <fct> <fct> <int>
## 1 Bachelor's Estonian 3705
## 2 Bachelor's Russian 806
## 3 Bachelor's Other 115
## 4 Master's Estonian 2804
## 5 Master's Russian 438
## 6 Master's Other 526
## 7 Integrated Bachelor's/Master's Estonian 519
## 8 Integrated Bachelor's/Master's Russian 91
## 9 Integrated Bachelor's/Master's Other 50
## 10 Doctoral Estonian 169
## 11 Doctoral Russian 23
## 12 Doctoral Other 90
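The preparation steps listed above can be sketched in a few lines. The report's own code is not shown, so this is only a plain-Python illustration of the same idea, not the author's implementation. The first input row is taken from the dataset preview; the other two rows, the `tidy` function and the `LANGUAGES` mapping are hypothetical and exist only for this example. Factor conversion has no direct counterpart here, so plain strings stand in for the category labels.

```python
from collections import defaultdict

rows = [
    # First row of the dataset preview.
    {"Level of study": "Bachelor's study",
     "Study programme group": "Journalism and information",
     "Estonian": 82.0, "Russian": 3.0,
     "Other mother tongue": 0.0, "Mother tongue unknown": 0.0},
    # Hypothetical rows, invented for illustration only.
    {"Level of study": "Bachelor's study",
     "Study programme group": "Programme A (hypothetical)",
     "Estonian": 10.0, "Russian": 2.0,
     "Other mother tongue": 1.0, "Mother tongue unknown": 4.0},
    {"Level of study": "Master's study",
     "Study programme group": "Programme B (hypothetical)",
     "Estonian": 5.0, "Russian": 1.0,
     "Other mother tongue": 7.0, "Mother tongue unknown": 0.0},
]

# Language columns to keep, mapped to their concise names.
# "Study programme group" and "Mother tongue unknown" are simply never read,
# which has the same effect as dropping them.
LANGUAGES = {
    "Estonian": "Estonian",
    "Russian": "Russian",
    "Other mother tongue": "Other",
}


def tidy(rows):
    """Return {(level, language): total students} from wide-format rows."""
    totals = defaultdict(int)
    for row in rows:
        # "Bachelor's study" -> "Bachelor's": the word "study" is redundant.
        level = row["Level of study"].removesuffix(" study")
        # Reshape wide to long: one (level, language) entry per language column,
        # summing over programme groups as we go.
        for column, language in LANGUAGES.items():
            totals[(level, language)] += int(row[column])  # counts as integers
    return dict(totals)


result = tidy(rows)
for (level, language), count in result.items():
    print(f"{level:<12}{language:<10}{count}")
```

Applied to the full 108-row dataset, the same grouping would give one total per combination of level and language, as in the 12-row table above.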
The number of students for each mother tongue language in each level of study is shown in the following chart:
Estonian dominates at every study level, as expected, but it is not clear from the chart whether there is a relationship between level of study and mother tongue language.
Our dataset from the Estonian statistics website includes
Our sample size (total number of students) is
## [1] 9336
I want to test the relationship between study level and mother tongue language. Both variables are categorical, so a chi-square test of independence is appropriate. The assumptions of the chi-square test are met, as checked below.
First, I create a contingency table from the dataset.
## Mother tongue language
## Level of study Estonian Russian Other
## Bachelor's 3705 806 115
## Master's 2804 438 526
## Integrated Bachelor's/Master's 519 91 50
## Doctoral 169 23 90
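The usual rule of thumb for the chi-square test is stated in terms of expected counts rather than observed counts. As a sketch (plain Python, with the hypothetical helper name `expected_counts`), the expected count of a cell under independence is its row total times its column total divided by the grand total. The numbers are copied from the contingency table above.

```python
# Observed counts from the contingency table
# (columns: Estonian, Russian, Other).
observed = {
    "Bachelor's": [3705, 806, 115],
    "Master's": [2804, 438, 526],
    "Integrated Bachelor's/Master's": [519, 91, 50],
    "Doctoral": [169, 23, 90],
}


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


expected = expected_counts(observed)
smallest = min(min(r) for r in expected)
print(f"smallest expected count: {smallest:.2f}")
```

The smallest expected count is about 23.6 (Doctoral, Other), well above 5.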
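No cell has a count below 5 (strictly, the rule applies to the expected counts, which also all exceed 5 here), so we should get an accurate result. I therefore run the chi-square test at a 99% confidence level, that is, a significance level of 0.01.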
##
## Pearson's Chi-squared test
##
## data: ctable
## X-squared = 596.06, df = 6, p-value < 2.2e-16
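The statistic in the output above can be reproduced by hand from the contingency table. The following is a minimal plain-Python sketch of Pearson's chi-square statistic, the sum of (observed - expected)^2 / expected over all cells. The function name `chi_square` is hypothetical, and this is not the code that produced the report.

```python
# Observed counts (rows: Bachelor's, Master's, Integrated, Doctoral;
# columns: Estonian, Russian, Other).
observed = [
    [3705, 806, 115],
    [2804, 438, 526],
    [519, 91, 50],
    [169, 23, 90],
]


def chi_square(rows):
    """Return Pearson's chi-square statistic and its degrees of freedom."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(rows):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            statistic += (o - e) ** 2 / e
    df = (len(rows) - 1) * (len(rows[0]) - 1)
    return statistic, df


statistic, df = chi_square(observed)
print(f"X-squared = {statistic:.2f}, df = {df}")
```

With 4 levels and 3 languages, the degrees of freedom are (4 - 1) × (3 - 1) = 6, matching the output.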
According to the chi-square table, the critical value at a significance level of 0.01 with 6 degrees of freedom is 16.812. The test statistic (596.06) is far above the critical value and the p-value is very small, so we can reject the null hypothesis that level of study and mother tongue are independent.
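The critical value and the p-value can be cross-checked without a lookup table. For an even number of degrees of freedom, the upper-tail probability of the chi-square distribution has a closed form: exp(-x/2) multiplied by the sum of (x/2)^k / k! for k from 0 to df/2 - 1. The sketch below uses only the standard library; `chi2_sf_even_df` is a hypothetical helper and works only for even df, which happens to cover this case (df = 6).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    # Accumulate sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Tail probability at the tabulated critical value: should be close to 0.01.
alpha_at_critical = chi2_sf_even_df(16.812, 6)
# p-value for the reported test statistic.
p_value = chi2_sf_even_df(596.06, 6)
print(f"P(X > 16.812) = {alpha_at_critical:.4f}")
print(f"p-value is below 2.2e-16: {p_value < 2.2e-16}")
```

The first value confirms that 16.812 cuts off roughly 1% of the distribution, and the second is consistent with the `p-value < 2.2e-16` shown in the test output.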
The chi-square test shows that there is a relationship between level of study and students' mother tongue language. However, because of the limitations of the chi-square test, we do not know how strong the relationship is or what kind of relationship it is; that could be a research question for further study.