Homework Assignment #8

Assignment Description

Download the dataset from the website on Estonian statistics into an R Markdown session. The dataset records the number of students admitted across different specialties in 2021, broken down by study programme group, level of study, and mother tongue.

To download the dataset, use the following R code:

Introduction

Perform a quick EDA and pick the variables you want to explore in more depth.
Prepare a data set that includes only the variables of interest in a format suitable for analysis (use the dplyr package, and tidyr when necessary).

Inspecting the data with View() showed no NA values. I first converted the data frame into a tibble and then used the mutate function to add a column holding the total number of students in each programme group.
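The missing-value check and the row totals described above can also be done programmatically. The following is only a minimal sketch: `toy` is a hypothetical two-row stand-in for the downloaded data frame, with the same column layout as the tibble printed in this report.

```r
# `toy` is a hypothetical stand-in for the downloaded data frame.
toy <- data.frame(
  Level.of.study        = c("Bachelor's study", "Bachelor's study"),
  Study.programme.group = c("Humanities", "Arts"),
  Estonian              = c(124L, 206L),
  Russian               = c(24L, 40L),
  Other.mother.tongue   = c(5L, 8L),
  Mother.tongue.unknown = c(9L, 23L)
)

# TRUE if any cell in the data frame is missing
any_missing <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Row totals over the four mother-tongue columns (columns 3 to 6)
toy$Spop_by_pro <- rowSums(toy[3:6])
toy
```

Unlike View(), this check does not depend on scrolling through the whole table by eye.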

Then I used the split function to divide the large tibble into four tibbles according to the level of study.

To form a contingency table, I had to calculate the column totals in each of the four tibbles. Wrapping this step in a function avoids repeating the same code. Finally, I combined the totals into a table that holds every pair of values of the two variables.

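library(dplyr)  # provides %>%, mutate() and as_tibble()
# Add the row total over the four mother-tongue columns (columns 3 to 6)
higher.edu.sum <- as_tibble(higher.edu) %>%
  mutate(Spop_by_pro = rowSums(higher.edu[3:6]))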
higher.edu.sum

## # A tibble: 108 × 7
##    Level.of.study   Study.programme.gr…¹ Eston…² Russian Other…³ Mothe…⁴ Spop_…⁵
##    <chr>            <chr>                  <int>   <int>   <int>   <int>   <dbl>
##  1 Bachelor's study Journalism and info…      82       3       0       0      85
##  2 Bachelor's study Architecture and co…      31       7       0       0      38
##  3 Bachelor's study Biological and envi…     215      45       1       0     261
##  4 Bachelor's study Physical sciences        173      53      20      11     257
##  5 Bachelor's study Humanities               124      24       5       9     162
##  6 Bachelor's study Information and Com…     571     165      14      23     773
##  7 Bachelor's study Personal services         25       0       0       0      25
##  8 Bachelor's study Languages and cultu…     266      84       7       0     357
##  9 Bachelor's study Arts                     206      40       8      23     277
## 10 Bachelor's study Mathematics and sta…      67      22       1       0      90
## # … with 98 more rows, and abbreviated variable names ¹Study.programme.group,
## #   ²Estonian, ³Other.mother.tongue, ⁴Mother.tongue.unknown, ⁵Spop_by_pro

split_HES <- split(higher.edu.sum, f = higher.edu.sum$Level.of.study)
split_HES

## $`Bachelor's study`
## # A tibble: 27 × 7
##    Level.of.study   Study.programme.gr…¹ Eston…² Russian Other…³ Mothe…⁴ Spop_…⁵
##    <chr>            <chr>                  <int>   <int>   <int>   <int>   <dbl>
##  1 Bachelor's study Journalism and info…      82       3       0       0      85
##  2 Bachelor's study Architecture and co…      31       7       0       0      38
##  3 Bachelor's study Biological and envi…     215      45       1       0     261
##  4 Bachelor's study Physical sciences        173      53      20      11     257
##  5 Bachelor's study Humanities               124      24       5       9     162
##  6 Bachelor's study Information and Com…     571     165      14      23     773
##  7 Bachelor's study Personal services         25       0       0       0      25
##  8 Bachelor's study Languages and cultu…     266      84       7       0     357
##  9 Bachelor's study Arts                     206      40       8      23     277
## 10 Bachelor's study Mathematics and sta…      67      22       1       0      90
## # … with 17 more rows, and abbreviated variable names ¹Study.programme.group,
## #   ²Estonian, ³Other.mother.tongue, ⁴Mother.tongue.unknown, ⁵Spop_by_pro
## 
## $`Doctoral study`
## # A tibble: 27 × 7
##    Level.of.study Study.programme.group  Eston…¹ Russian Other…² Mothe…³ Spop_…⁴
##    <chr>          <chr>                    <int>   <int>   <int>   <int>   <dbl>
##  1 Doctoral study Journalism and inform…       1       0       0       0       1
##  2 Doctoral study Architecture and cons…       4       0       1       2       7
##  3 Doctoral study Biological and enviro…      19       2      15       7      43
##  4 Doctoral study Physical sciences           15       1       8       4      28
##  5 Doctoral study Humanities                   8       0       3       1      12
##  6 Doctoral study Information and Commu…      16       5      18       2      41
##  7 Doctoral study Personal services            0       0       0       0       0
##  8 Doctoral study Languages and cultures      10       3       6       0      19
##  9 Doctoral study Arts                         9       1       1       0      11
## 10 Doctoral study Mathematics and stati…       3       0       0       0       3
## # … with 17 more rows, and abbreviated variable names ¹Estonian,
## #   ²Other.mother.tongue, ³Mother.tongue.unknown, ⁴Spop_by_pro
## 
## $`Integrated Bachelor's/Master's study`
## # A tibble: 27 × 7
##    Level.of.study                Study…¹ Eston…² Russian Other…³ Mothe…⁴ Spop_…⁵
##    <chr>                         <chr>     <int>   <int>   <int>   <int>   <dbl>
##  1 Integrated Bachelor's/Master… Journa…       0       0       0       0       0
##  2 Integrated Bachelor's/Master… Archit…     226      39       0       0     265
##  3 Integrated Bachelor's/Master… Biolog…       0       0       0       0       0
##  4 Integrated Bachelor's/Master… Physic…       0       0       0       0       0
##  5 Integrated Bachelor's/Master… Humani…       0       0       0       0       0
##  6 Integrated Bachelor's/Master… Inform…       0       0       0       0       0
##  7 Integrated Bachelor's/Master… Person…       0       0       0       0       0
##  8 Integrated Bachelor's/Master… Langua…       0       0       0       0       0
##  9 Integrated Bachelor's/Master… Arts          0       0       0       0       0
## 10 Integrated Bachelor's/Master… Mathem…       0       0       0       0       0
## # … with 17 more rows, and abbreviated variable names ¹Study.programme.group,
## #   ²Estonian, ³Other.mother.tongue, ⁴Mother.tongue.unknown, ⁵Spop_by_pro
## 
## $`Master's study`
## # A tibble: 27 × 7
##    Level.of.study Study.programme.group  Eston…¹ Russian Other…² Mothe…³ Spop_…⁴
##    <chr>          <chr>                    <int>   <int>   <int>   <int>   <dbl>
##  1 Master's study Journalism and inform…      46       4       0       0      50
##  2 Master's study Architecture and cons…      75      13      14       6     108
##  3 Master's study Biological and enviro…     105      18      18       3     144
##  4 Master's study Physical sciences           53       9      40       8     110
##  5 Master's study Humanities                  43       4      19       6      72
##  6 Master's study Information and Commu…     353      79      88      63     583
##  7 Master's study Personal services           12       0       0       0      12
##  8 Master's study Languages and cultures     106      11      14      11     142
##  9 Master's study Arts                        74       9      20      45     148
## 10 Master's study Mathematics and stati…       7       2       5       4      18
## # … with 17 more rows, and abbreviated variable names ¹Estonian,
## #   ²Other.mother.tongue, ³Mother.tongue.unknown, ⁴Spop_by_pro

# Column totals of the four mother-tongue columns and the row-total column
pro_by_lang <- function(x) {
  colSums(x[3:7])
}

Ba_lang <- pro_by_lang(split_HES$`Bachelor's study`)
In_lang <- pro_by_lang(split_HES$`Integrated Bachelor's/Master's study`)
Ma_lang <- pro_by_lang(split_HES$`Master's study`)
Do_lang <- pro_by_lang(split_HES$`Doctoral study`)

new_df <- data.frame(cbind(Ba_lang, In_lang, Ma_lang, Do_lang))
new_df <- new_df %>% mutate(Spop_by_lang = c(rowSums(new_df[1:4])))
new_df

##                       Ba_lang In_lang Ma_lang Do_lang Spop_by_lang
## Estonian                 3705     519    2804     169         7197
## Russian                   806      91     438      23         1358
## Other.mother.tongue       115      50     526      90          781
## Mother.tongue.unknown     263      11     403      34          711
## Spop_by_pro              4889     671    4171     316        10047
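The same kind of table can be built without splitting the tibble first. The sketch below is an alternative, not the method used in this report: it relies on base R's `rowsum()` and on `mini`, a hypothetical four-row miniature of the data with two levels of study.

```r
# `mini` is a hypothetical miniature of higher.edu.sum.
mini <- data.frame(
  Level.of.study        = c("Bachelor's study", "Bachelor's study",
                            "Master's study", "Master's study"),
  Study.programme.group = c("Humanities", "Arts", "Humanities", "Arts"),
  Estonian              = c(124, 206, 43, 74),
  Russian               = c(24, 40, 4, 9),
  Other.mother.tongue   = c(5, 8, 19, 20),
  Mother.tongue.unknown = c(9, 23, 6, 45)
)

# rowsum() adds up the numeric columns within each level of study
by_level <- rowsum(mini[3:6], group = mini$Level.of.study)

# Transpose so that mother tongues are rows and levels of study are columns
contingency <- t(as.matrix(by_level))
contingency
```

Because the grouping is done in one call, no helper function and no repeated assignments per level are needed.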

The new data frame “new_df” shows that students whose mother tongue is Estonian made up the largest share of admissions to higher education in 2021: 7197 of 10047 students. Russian speakers form the second-largest group (1358), while the number of students whose mother tongue is neither Estonian nor Russian (781) is similar to the number whose mother tongue is unknown (711). Let’s use a mosaic plot to display the proportion of students with each mother tongue at each level of study.

mosaicplot(new_df[1:4, 1:4], shade = TRUE, las = 2, main = "Proportion of students by mother tongue and level of study")

Blue indicates that the observed value is higher than the expected value, while red indicates that the observed value is lower than the expected value. The plot tells us that Bachelor’s students consist mainly of Estonian and Russian speakers. Estonian speakers still make up the largest share of Master’s students; however, among postgraduate students (Master’s and doctoral, not including integrated degrees), those whose mother tongue is neither Russian nor Estonian outnumber those whose mother tongue is Russian.
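The shading of the mosaic plot is driven by residuals from the independence model. As a rough sketch, the Pearson residuals can be computed by hand from the counts printed in new_df; the short row and column labels below are my own abbreviations, and the cut points of 2 and 4 mentioned in the comment are the defaults as I understand them from the mosaicplot help page.

```r
# Counts copied from new_df (rows 1 to 4, columns 1 to 4), filled column by column
tab <- matrix(
  c(3705, 806, 115, 263,    # Bachelor's study
     519,  91,  50,  11,    # Integrated Bachelor's/Master's study
    2804, 438, 526, 403,    # Master's study
     169,  23,  90,  34),   # Doctoral study
  nrow = 4,
  dimnames = list(
    c("Estonian", "Russian", "Other", "Unknown"),
    c("Bachelor", "Integrated", "Master", "Doctoral")
  )
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals: positive where observed > expected (blue in the plot),
# negative where observed < expected (red); shading starts at |residual| > 2
pearson <- (tab - expected) / sqrt(expected)
round(pearson, 1)
```

Large positive residuals for the "Other" row in the Master and Doctoral columns correspond to the pattern described above.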

Analytic hypothesis and methods

Based on the dataset, formulate your research question(s). Think of questions that you can answer with the help of either a chi-square test or some type of t-test.

I assume that there is an association between students’ mother tongue and the level of study they enrolled in in 2021; I also assume that the proportion of postgraduate students whose mother tongue is unknown differs from the proportion of postgraduate students whose mother tongue is neither Estonian nor Russian. The research questions, therefore, are whether there is an association between level of study and mother tongue, and whether these two proportions differ. I decided to use the Chi-square independence test and inference for two population proportions.

We first draw a sample with the function sample_n, taking 13 of the 27 programme groups at each level of study:

pro_by_lang <- function(x) {
  colSums(x[3:7])
}

bachelor <- pro_by_lang(sample_n(split_HES$`Bachelor's study`, size = 13))
integrate <- pro_by_lang(sample_n(split_HES$`Integrated Bachelor's/Master's study`, size = 13))
master <- pro_by_lang(sample_n(split_HES$`Master's study`, size = 13))
doctor <- pro_by_lang(sample_n(split_HES$`Doctoral study`, size = 13))

new_f <- data.frame(cbind(bachelor, integrate, master, doctor))
new_f

##                       bachelor integrate master doctor
## Estonian                  1877       293   1878     75
## Russian                    367        52    319     11
## Other.mother.tongue         51        50    343     48
## Mother.tongue.unknown      144        11    272     20
## Spop_by_pro               2439       406   2812    154
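Because the rows are drawn at random, the totals above change every time the document is knitted unless a seed is fixed. The sketch below illustrates the idea with base R's `sample()` rather than sample_n; `groups` is a hypothetical stand-in for one element of split_HES, and the seed value is arbitrary.

```r
# `groups` is a hypothetical stand-in for one 27-row element of split_HES.
groups <- data.frame(id = 1:27, Estonian = 101:127)

set.seed(2021)  # arbitrary seed, chosen only for illustration
first_draw <- groups[sample(nrow(groups), 13), ]

set.seed(2021)  # same seed, so the same 13 rows are drawn again
second_draw <- groups[sample(nrow(groups), 13), ]

# TRUE: both draws contain the same rows in the same order
identical(first_draw, second_draw)
```

Calling set.seed() once before the sample_n calls should have the same effect on the report's own sampling code.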

new_f <- new_f %>% mutate(Spop_by_lang = c(rowSums(new_f[1:4])))
new_f

##                       bachelor integrate master doctor Spop_by_lang
## Estonian                  1877       293   1878     75         4123
## Russian                    367        52    319     11          749
## Other.mother.tongue         51        50    343     48          492
## Mother.tongue.unknown      144        11    272     20          447
## Spop_by_pro               2439       406   2812    154         5811

new_f_table <- as.table(as.matrix(new_f))
new_f_table

##                       bachelor integrate master doctor Spop_by_lang
## Estonian                  1877       293   1878     75         4123
## Russian                    367        52    319     11          749
## Other.mother.tongue         51        50    343     48          492
## Mother.tongue.unknown      144        11    272     20          447
## Spop_by_pro               2439       406   2812    154         5811

Chi-square independence test

In the first Chi-square independence test, the null hypothesis is that the mother tongue reported by students and the level of study they enrolled in are not associated; the alternative hypothesis is that the two are associated.

The Chi-square independence test rests on three assumptions: 1) all expected frequencies are 1 or greater; 2) at most 20 per cent of the expected frequencies are less than 5; 3) the data come from a simple random sample.

Hence, after we run the test with the function chisq.test, we have to extract the expected frequencies to check whether the test is valid.

chisq <- chisq.test(new_f_table[1:4, 1:4])
chisq

## 
##  Pearson's Chi-squared test
## 
## data:  new_f_table[1:4, 1:4]
## X-squared = 355.45, df = 9, p-value < 2.2e-16

chisq$expected

##                        bachelor integrate    master    doctor
## Estonian              1730.5106 288.06367 1995.1602 109.26553
## Russian                314.3712  52.33075  362.4485  19.84960
## Other.mother.tongue    206.5028  34.37481  238.0836  13.03872
## Mother.tongue.unknown  187.6154  31.23077  216.3077  11.84615

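The expected frequencies show that assumptions 1 and 2 are met: the smallest expected frequency is 11.85, so none falls below 5. It is therefore reasonable to rely on the Chi-square test. With X-squared = 355.45, df = 9 and a p-value below 2.2e-16, we reject the null hypothesis at the 5 per cent significance level and conclude that mother tongue and level of study are associated.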
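The expected frequencies and the two numeric assumptions can be checked by hand. The following sketch re-enters the sampled counts from new_f_table[1:4, 1:4] as a matrix (the object name `obs` is my own) and recomputes the test statistic from its definition.

```r
# Observed counts copied from new_f_table[1:4, 1:4], filled column by column
obs <- matrix(
  c(1877, 367,  51, 144,    # bachelor
     293,  52,  50,  11,    # integrate
    1878, 319, 343, 272,    # master
      75,  11,  48,  20),   # doctor
  nrow = 4,
  dimnames = list(
    c("Estonian", "Russian", "Other.mother.tongue", "Mother.tongue.unknown"),
    c("bachelor", "integrate", "master", "doctor")
  )
)

# Expected frequency of a cell: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Assumption 1: every expected frequency is at least 1
all_at_least_one <- all(expected >= 1)

# Assumption 2: at most 20 per cent of expected frequencies are below 5
share_below_five <- mean(expected < 5)

# Pearson statistic, degrees of freedom and p-value
statistic <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(statistic, df, lower.tail = FALSE)
```

The values of `expected`, `statistic` and `df` should agree with the chisq.test output shown above.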

Inference for two population proportions

In the second hypothesis test, the inference for two population proportions, the null hypothesis is that the proportion of postgraduate students whose mother tongue is unknown does not differ from the proportion of postgraduate students who reported a definite mother tongue that is neither Estonian nor Russian; the alternative hypothesis is that the two proportions differ.

The inference for two population proportions rests on three assumptions: 1) in each of the two groups, the number of postgraduate students and the number of remaining students (the sample size minus the number of postgraduate students) are both 5 or greater; 2) the samples are independent; 3) the samples are simple random samples.

Based on the findings above, we are confident that the three assumptions are met.

prop <- prop.test(x = c(238, 214), n = c(354, 422))
prop

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(238, 214) out of c(354, 422)
## X-squared = 20.931, df = 1, p-value = 4.761e-06
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.09430225 0.23611251
## sample estimates:
##    prop 1    prop 2 
## 0.6723164 0.5071090

Since the p-value (4.761e-06) is far below 0.001, we have very strong evidence at the 5 per cent significance level to reject the null hypothesis in favour of the alternative hypothesis; in other words, the proportion of postgraduate students whose mother tongue is unknown differs significantly from the proportion of postgraduate students whose clearly reported mother tongue is neither Estonian nor Russian.
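For reference, the test statistic can be traced back to the pooled two-proportion z statistic. The sketch below takes the counts passed to prop.test at face value, treating the first vector as "successes" and the second as group sizes; the variable names are my own.

```r
x <- c(238, 214)  # "successes" in each group, as passed to prop.test
n <- c(354, 422)  # group sizes

# Assumption 1: successes and failures are all 5 or greater
counts <- cbind(success = x, failure = n - x)
counts_ok <- all(counts >= 5)

# Pooled z statistic for the difference between the two proportions
p_hat  <- x / n
p_pool <- sum(x) / sum(n)
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z      <- (p_hat[1] - p_hat[2]) / se

# Without the continuity correction, X-squared from prop.test equals z^2
uncorrected <- unname(prop.test(x, n, correct = FALSE)$statistic)

# With the default continuity correction the statistic is slightly smaller
corrected <- unname(prop.test(x, n)$statistic)

c(z_squared = z^2, uncorrected = uncorrected, corrected = corrected)
```

The corrected value is the X-squared reported in the output above; the continuity correction makes the test slightly more conservative but does not change the conclusion here.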