Download the dataset from the Estonian statistics website into an
R Markdown session. It records the number of students admitted to
different specialties in 2021, broken down by Study programme group,
Level of study, and Mother tongue.
To download the dataset, use the following R code:
dplyr package and
tidyr when necessary).
Through View() we found no NA values in the data set. I first converted the data frame into a tibble, and then used the mutate function to add a column holding the total number of students in each programme group.
Then I used the split function to divide the large tibble into four tibbles, one per level of study.
To form a contingency table, I had to calculate the column totals in each of the four tibbles. Wrapping this step in a function avoids repeating the same code. Finally, I built a table that holds a count for each possible pair of values of the two variables.
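Inspecting the data with View() works for a small table, but the same check can be done programmatically. The following is only a minimal sketch on a hypothetical stand-in data frame (`toy_na` and its values are invented for illustration and are not the downloaded data); it uses the base R functions anyNA() and is.na().

```r
# Hypothetical stand-in for the downloaded data; the column names mimic the real table
toy_na <- data.frame(
  Level.of.study = c("Bachelor's study", "Master's study"),
  Estonian = c(82L, 46L),
  Russian = c(3L, NA)  # NA inserted deliberately to show what the check reports
)

anyNA(toy_na)           # is there at least one missing value anywhere?
colSums(is.na(toy_na))  # number of missing values in each column
```

Applied to the real data frame, a FALSE from anyNA() would confirm what View() suggested.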
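library(dplyr)  # provides %>%, mutate(), as_tibble() and sample_n()
# Convert to a tibble and add the row total over the four mother-tongue columns (3 to 6)
higher.edu.sum <- as_tibble(higher.edu) %>%
  mutate(Spop_by_pro = rowSums(higher.edu[3:6]))
higher.edu.sum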
## # A tibble: 108 × 7
## Level.of.study Study.programme.gr…¹ Eston…² Russian Other…³ Mothe…⁴ Spop_…⁵
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Bachelor's study Journalism and info… 82 3 0 0 85
## 2 Bachelor's study Architecture and co… 31 7 0 0 38
## 3 Bachelor's study Biological and envi… 215 45 1 0 261
## 4 Bachelor's study Physical sciences 173 53 20 11 257
## 5 Bachelor's study Humanities 124 24 5 9 162
## 6 Bachelor's study Information and Com… 571 165 14 23 773
## 7 Bachelor's study Personal services 25 0 0 0 25
## 8 Bachelor's study Languages and cultu… 266 84 7 0 357
## 9 Bachelor's study Arts 206 40 8 23 277
## 10 Bachelor's study Mathematics and sta… 67 22 1 0 90
## # … with 98 more rows, and abbreviated variable names ¹Study.programme.group,
## # ²Estonian, ³Other.mother.tongue, ⁴Mother.tongue.unknown, ⁵Spop_by_pro
split_HES <- split(higher.edu.sum, f = higher.edu.sum$Level.of.study)
split_HES
## $`Bachelor's study`
## # A tibble: 27 × 7
## Level.of.study Study.programme.gr…¹ Eston…² Russian Other…³ Mothe…⁴ Spop_…⁵
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Bachelor's study Journalism and info… 82 3 0 0 85
## 2 Bachelor's study Architecture and co… 31 7 0 0 38
## 3 Bachelor's study Biological and envi… 215 45 1 0 261
## 4 Bachelor's study Physical sciences 173 53 20 11 257
## 5 Bachelor's study Humanities 124 24 5 9 162
## 6 Bachelor's study Information and Com… 571 165 14 23 773
## 7 Bachelor's study Personal services 25 0 0 0 25
## 8 Bachelor's study Languages and cultu… 266 84 7 0 357
## 9 Bachelor's study Arts 206 40 8 23 277
## 10 Bachelor's study Mathematics and sta… 67 22 1 0 90
## # … with 17 more rows, and abbreviated variable names ¹Study.programme.group,
## # ²Estonian, ³Other.mother.tongue, ⁴Mother.tongue.unknown, ⁵Spop_by_pro
##
## $`Doctoral study`
## # A tibble: 27 × 7
## Level.of.study Study.programme.group Eston…¹ Russian Other…² Mothe…³ Spop_…⁴
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Doctoral study Journalism and inform… 1 0 0 0 1
## 2 Doctoral study Architecture and cons… 4 0 1 2 7
## 3 Doctoral study Biological and enviro… 19 2 15 7 43
## 4 Doctoral study Physical sciences 15 1 8 4 28
## 5 Doctoral study Humanities 8 0 3 1 12
## 6 Doctoral study Information and Commu… 16 5 18 2 41
## 7 Doctoral study Personal services 0 0 0 0 0
## 8 Doctoral study Languages and cultures 10 3 6 0 19
## 9 Doctoral study Arts 9 1 1 0 11
## 10 Doctoral study Mathematics and stati… 3 0 0 0 3
## # … with 17 more rows, and abbreviated variable names ¹Estonian,
## # ²Other.mother.tongue, ³Mother.tongue.unknown, ⁴Spop_by_pro
##
## $`Integrated Bachelor's/Master's study`
## # A tibble: 27 × 7
## Level.of.study Study…¹ Eston…² Russian Other…³ Mothe…⁴ Spop_…⁵
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Integrated Bachelor's/Master… Journa… 0 0 0 0 0
## 2 Integrated Bachelor's/Master… Archit… 226 39 0 0 265
## 3 Integrated Bachelor's/Master… Biolog… 0 0 0 0 0
## 4 Integrated Bachelor's/Master… Physic… 0 0 0 0 0
## 5 Integrated Bachelor's/Master… Humani… 0 0 0 0 0
## 6 Integrated Bachelor's/Master… Inform… 0 0 0 0 0
## 7 Integrated Bachelor's/Master… Person… 0 0 0 0 0
## 8 Integrated Bachelor's/Master… Langua… 0 0 0 0 0
## 9 Integrated Bachelor's/Master… Arts 0 0 0 0 0
## 10 Integrated Bachelor's/Master… Mathem… 0 0 0 0 0
## # … with 17 more rows, and abbreviated variable names ¹Study.programme.group,
## # ²Estonian, ³Other.mother.tongue, ⁴Mother.tongue.unknown, ⁵Spop_by_pro
##
## $`Master's study`
## # A tibble: 27 × 7
## Level.of.study Study.programme.group Eston…¹ Russian Other…² Mothe…³ Spop_…⁴
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 Master's study Journalism and inform… 46 4 0 0 50
## 2 Master's study Architecture and cons… 75 13 14 6 108
## 3 Master's study Biological and enviro… 105 18 18 3 144
## 4 Master's study Physical sciences 53 9 40 8 110
## 5 Master's study Humanities 43 4 19 6 72
## 6 Master's study Information and Commu… 353 79 88 63 583
## 7 Master's study Personal services 12 0 0 0 12
## 8 Master's study Languages and cultures 106 11 14 11 142
## 9 Master's study Arts 74 9 20 45 148
## 10 Master's study Mathematics and stati… 7 2 5 4 18
## # … with 17 more rows, and abbreviated variable names ¹Estonian,
## # ²Other.mother.tongue, ³Mother.tongue.unknown, ⁴Spop_by_pro
# Column totals of the four mother-tongue columns and the row-total column (3 to 7)
pro_by_lang <- function(x) {
  colSums(x[3:7])
}
Ba_lang <- pro_by_lang(split_HES$`Bachelor's study`)
In_lang <- pro_by_lang(split_HES$`Integrated Bachelor's/Master's study`)
Ma_lang <- pro_by_lang(split_HES$`Master's study`)
Do_lang <- pro_by_lang(split_HES$`Doctoral study`)
new_df <- data.frame(cbind(Ba_lang, In_lang, Ma_lang, Do_lang))
new_df <- new_df %>% mutate(Spop_by_lang = c(rowSums(new_df[1:4])))
new_df
## Ba_lang In_lang Ma_lang Do_lang Spop_by_lang
## Estonian 3705 519 2804 169 7197
## Russian 806 91 438 23 1358
## Other.mother.tongue 115 50 526 90 781
## Mother.tongue.unknown 263 11 403 34 711
## Spop_by_pro 4889 671 4171 316 10047
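The totals in the Spop_by_lang column can be turned into shares with prop.table(). This is a small sketch that copies the four totals from the output above; the object names `by_lang` and `shares` are introduced here for illustration only.

```r
# Totals copied from the Spop_by_lang column of new_df above
by_lang <- c(Estonian = 7197, Russian = 1358,
             Other.mother.tongue = 781, Mother.tongue.unknown = 711)

# Share of each mother-tongue group among all admitted students
shares <- prop.table(by_lang)
round(100 * shares, 1)  # shares in per cent
```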
The new data frame “new_df” shows that students whose mother tongue is Estonian, the ethnic majority, made up the largest share of admissions to higher education in 2021: 7197 of 10047 students, or roughly 72 per cent. Students whose mother tongue is Russian (1358) form the second-largest group, followed by students whose native language is neither Estonian nor Russian (781) and students whose mother tongue is unknown (711). Let’s use a mosaic plot to display the proportion of students speaking different mother tongues.
mosaicplot(new_df[1:4, 1:4], shade = TRUE, las = 2, main = "The proportion of students speaking different mother tongues")
Blue shading indicates that the observed count is higher than the expected count, while red shading indicates that the observed count is lower than expected. The plot shows that bachelor’s students consist mainly of Estonian and Russian speakers. Estonian speakers still make up the largest share of Master’s students; however, among postgraduate students (Master’s and doctoral, not including integrated degrees), those whose mother tongue is neither Russian nor Estonian outnumber those whose mother tongue is Russian.
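With shade = TRUE, mosaicplot() colours each cell by its Pearson residual under the independence model. The sketch below recomputes those residuals by hand from the counts printed in new_df; the names `counts`, `expected_full` and `pearson_resid` are introduced here for illustration only.

```r
# Counts copied from the new_df output (totals row and column excluded)
counts <- matrix(
  c(3705, 519, 2804, 169,
     806,  91,  438,  23,
     115,  50,  526,  90,
     263,  11,  403,  34),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Estonian", "Russian", "Other.mother.tongue", "Mother.tongue.unknown"),
    c("Ba_lang", "In_lang", "Ma_lang", "Do_lang")
  )
)

# Expected counts under independence: row total * column total / grand total
expected_full <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residuals: positive where observed exceeds expected (blue cells),
# negative where observed falls short of expected (red cells)
pearson_resid <- (counts - expected_full) / sqrt(expected_full)
round(pearson_resid, 2)
```

Cells with a large positive residual correspond to the blue tiles, and cells with a large negative residual to the red ones.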
I assume that there is an association between students’ mother tongue and the level of study they enrolled in in 2021. The research questions, therefore, are whether there is an association between the level of study and students’ mother tongue, and whether the proportion of postgraduate students whose mother tongue is unknown differs from the proportion of postgraduate students whose mother tongue is neither Estonian nor Russian. I decided to use a Chi-square independence test and inference for two population proportions.
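We first sample 13 of the 27 programme groups from each level of study with the function sample_n:
# Column totals of the four mother-tongue columns and the row-total column (3 to 7)
pro_by_lang <- function(x) {
  colSums(x[3:7])
}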
bachelor <- pro_by_lang(sample_n(split_HES$`Bachelor's study`, size = 13))
integrate <- pro_by_lang(sample_n(split_HES$`Integrated Bachelor's/Master's study`, size = 13))
master <- pro_by_lang(sample_n(split_HES$`Master's study`, size = 13))
doctor <- pro_by_lang(sample_n(split_HES$`Doctoral study`, size = 13))
new_f <- data.frame(cbind(bachelor, integrate, master, doctor))
new_f
## bachelor integrate master doctor
## Estonian 1877 293 1878 75
## Russian 367 52 319 11
## Other.mother.tongue 51 50 343 48
## Mother.tongue.unknown 144 11 272 20
## Spop_by_pro 2439 406 2812 154
new_f <- new_f %>% mutate(Spop_by_lang = c(rowSums(new_f[1:4])))
new_f
## bachelor integrate master doctor Spop_by_lang
## Estonian 1877 293 1878 75 4123
## Russian 367 52 319 11 749
## Other.mother.tongue 51 50 343 48 492
## Mother.tongue.unknown 144 11 272 20 447
## Spop_by_pro 2439 406 2812 154 5811
new_f_table <- as.table(as.matrix(new_f))
new_f_table
## bachelor integrate master doctor Spop_by_lang
## Estonian 1877 293 1878 75 4123
## Russian 367 52 319 11 749
## Other.mother.tongue 51 50 343 48 492
## Mother.tongue.unknown 144 11 272 20 447
## Spop_by_pro 2439 406 2812 154 5811
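Because the rows are drawn at random, every run of the code above produces a different table. One way to make such a draw repeatable is to fix the seed first. The sketch below assumes a hypothetical 27-row table (`toy_level`) and a hypothetical helper (`draw_rows`), and uses base R sample() as a stand-in for sample_n(); the seed value 2021 is arbitrary.

```r
# Hypothetical table with 27 rows, standing in for one level of study
toy_level <- data.frame(group = paste0("programme_", 1:27), n = 1:27)

draw_rows <- function(df, size, seed) {
  set.seed(seed)                # fix the state of the random number generator
  df[sample(nrow(df), size), ]  # base R counterpart of sample_n(df, size)
}

first_draw  <- draw_rows(toy_level, size = 13, seed = 2021)
second_draw <- draw_rows(toy_level, size = 13, seed = 2021)
identical(first_draw, second_draw)  # do both draws contain the same rows?
```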
In the first test, the Chi-square independence test, the null hypothesis is that the mother tongue reported by students and the level of study they enrolled in are not associated; the alternative hypothesis is that the two are associated.
Three assumptions underlie the Chi-square independence test: 1) all expected frequencies are 1 or greater; 2) at most 20 per cent of the expected frequencies are less than 5; 3) the data come from a simple random sample.
Hence, after running the test with the function chisq.test, we have to extract the expected frequencies to check whether the test is appropriate.
chisq <- chisq.test(new_f_table[1:4, 1:4])
chisq
##
## Pearson's Chi-squared test
##
## data: new_f_table[1:4, 1:4]
## X-squared = 355.45, df = 9, p-value < 2.2e-16
chisq$expected
## bachelor integrate master doctor
## Estonian 1730.5106 288.06367 1995.1602 109.26553
## Russian 314.3712 52.33075 362.4485 19.84960
## Other.mother.tongue 206.5028 34.37481 238.0836 13.03872
## Mother.tongue.unknown 187.6154 31.23077 216.3077 11.84615
Since the P-value is far below 0.001, we have very strong evidence at the 5 per cent significance level to reject the null hypothesis in favour of the alternative hypothesis; in other words, the data strongly support an association between mother tongue and level of study.
The extracted expected frequencies show that assumptions 1 and 2 are met (the smallest is about 11.8); thus, the Chi-square test is appropriate for exploring the association between the two categorical variables.
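The two conditions on the expected frequencies can also be checked in code rather than by eye. The following sketch rebuilds the sampled table from the output of new_f_table[1:4, 1:4] and recomputes the test statistic by hand; the names `sampled`, `expected_sampled` and `x_squared` are introduced here for illustration only.

```r
# Sampled counts copied from new_f_table[1:4, 1:4]
sampled <- matrix(
  c(1877, 293, 1878, 75,
     367,  52,  319, 11,
      51,  50,  343, 48,
     144,  11,  272, 20),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Estonian", "Russian", "Other.mother.tongue", "Mother.tongue.unknown"),
    c("bachelor", "integrate", "master", "doctor")
  )
)

expected_sampled <- chisq.test(sampled)$expected

all(expected_sampled >= 1)          # assumption 1: every expected frequency is at least 1
mean(expected_sampled < 5) <= 0.20  # assumption 2: at most 20 per cent are below 5

# Pearson statistic by hand: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((sampled - expected_sampled)^2 / expected_sampled)
x_squared
```

The hand-computed value should agree with the X-squared reported by chisq.test for the same table.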
In the second hypothesis test, the inference for two population proportions, the null hypothesis is that the percentage of postgraduate students whose mother tongue is unknown does not differ from that of postgraduate students who reported a definite mother tongue that is neither Estonian nor Russian; the alternative hypothesis is that the two percentages differ.
Three assumptions underlie the two-proportions test: 1) in each of the two groups, the number of postgraduate students and the remainder (the group size minus the number of postgraduate students) are all 5 or greater; 2) the samples are independent; 3) the samples are simple random samples.
Based on the findings above, we are confident that the three assumptions are met.
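The first assumption can be verified directly from the counts passed to prop.test below. This is a minimal sketch; the names `x_count`, `n_group` and `count_check` are introduced here for illustration, and the "successes"/"failures" labels are generic terminology rather than the author's.

```r
# Counts copied from the prop.test() call below
x_count <- c(238, 214)  # number of "successes" in each group
n_group <- c(354, 422)  # size of each group

# Successes and failures per group; all four values should be at least 5
count_check <- rbind(successes = x_count, failures = n_group - x_count)
count_check
all(count_check >= 5)
```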
prop <- prop.test(x = c(238, 214), n = c(354, 422))
prop
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(238, 214) out of c(354, 422)
## X-squared = 20.931, df = 1, p-value = 4.761e-06
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.09430225 0.23611251
## sample estimates:
## prop 1 prop 2
## 0.6723164 0.5071090
Since the P-value (4.761e-06) is far below 0.001, we have very strong evidence at the 5 per cent significance level to reject the null hypothesis in favour of the alternative hypothesis; in other words, the proportion of postgraduate students whose mother tongue is unknown differs significantly from the proportion of postgraduate students whose clearly reported mother tongue is neither Estonian nor Russian.
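For reference, the statistic behind prop.test can be reproduced by hand as a pooled two-proportion z-test. The sketch below uses the same counts as the call above but omits the continuity correction, so its statistic is somewhat larger than the corrected X-squared printed above; all object names are introduced here for illustration only.

```r
# Counts copied from the prop.test() call above
x_count <- c(238, 214)
n_group <- c(354, 422)

p_hat  <- x_count / n_group           # sample proportions of the two groups
p_pool <- sum(x_count) / sum(n_group) # pooled proportion under the null hypothesis

# Standard error of the difference under the null hypothesis
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n_group))

z <- (p_hat[1] - p_hat[2]) / se
p_value <- 2 * pnorm(-abs(z))         # two-sided P-value

# Without the continuity correction, prop.test's X-squared equals z^2
uncorrected <- prop.test(x_count, n_group, correct = FALSE)
c(z_squared = z^2, x_squared = unname(uncorrected$statistic))
```

Both versions lead to the same conclusion here, since the correction only shrinks the statistic slightly at these sample sizes.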