The dataset I will explore is the number of admitted students across different specialties, recorded for the year 2021. The numbers of admitted students are presented by Study programme group, Level of study, and Mother tongue. There are six variables in the dataset: Level of study, Study programme group, Estonian, Russian, Other mother tongue, and Mother tongue unknown. All records refer to the same year, 2021.
## # A tibble: 6 × 6
## Level.of.study Study.programme.group Eston…¹ Russian Other…² Mothe…³
## <chr> <chr> <int> <int> <int> <int>
## 1 Bachelor's study Journalism and information 82 3 0 0
## 2 Bachelor's study Architecture and construction 31 7 0 0
## 3 Bachelor's study Biological and environmental… 215 45 1 0
## 4 Bachelor's study Physical sciences 173 53 20 11
## 5 Bachelor's study Humanities 124 24 5 9
## 6 Bachelor's study Information and Communicatio… 571 165 14 23
## # … with abbreviated variable names ¹​Estonian, ²​Other.mother.tongue,
## # ³​Mother.tongue.unknown
1. What are the observed population, the observation unit and the reference period?
The observed population is all students admitted to the different specialties in Estonia, broken down by mother tongue. The observation unit is a student, although each row of the dataset is an aggregated count for one combination of level of study and study programme group. The reference period is 2021.
2. What are the data types of the variables? Do we need to change them?
The dataset includes 6 variables. Their data types are shown below:
## tibble [108 × 6] (S3: tbl_df/tbl/data.frame)
## $ Level.of.study : chr [1:108] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
## $ Study.programme.group: chr [1:108] "Journalism and information" "Architecture and construction" "Biological and environmental sciences" "Physical sciences" ...
## $ Estonian : int [1:108] 82 31 215 173 124 571 25 266 206 67 ...
## $ Russian : int [1:108] 3 7 45 53 24 165 0 84 40 22 ...
## $ Other.mother.tongue : int [1:108] 0 0 1 20 5 14 0 7 8 1 ...
## $ Mother.tongue.unknown: int [1:108] 0 0 0 11 9 23 0 0 23 0 ...
We don’t need to change the data types of the variables.
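As a minimal sketch of the same check, the column classes can be listed directly. The data frame `higher_head` below is a hypothetical stand-in built from the first five rows printed in the preview, not the full 108-row table.

```r
# Hypothetical stand-in: first five rows of the preview shown earlier.
higher_head <- data.frame(
  Level.of.study = rep("Bachelor's study", 5),
  Study.programme.group = c(
    "Journalism and information",
    "Architecture and construction",
    "Biological and environmental sciences",
    "Physical sciences",
    "Humanities"
  ),
  Estonian = c(82L, 31L, 215L, 173L, 124L),
  Russian = c(3L, 7L, 45L, 53L, 24L),
  Other.mother.tongue = c(0L, 0L, 1L, 20L, 5L),
  Mother.tongue.unknown = c(0L, 0L, 0L, 11L, 9L),
  stringsAsFactors = FALSE
)

# One class per column: character for the two labels, integer for the counts.
col_types <- sapply(higher_head, class)
print(col_types)
```

Counts stored as integers and labels stored as characters are what the later steps need, which is consistent with leaving the types unchanged.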
3. What is the mean for each variable?
Here is the mean for each variable.
## Level.of.study Study.programme.group Estonian Russian
## Length:108 Length:108 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 12.50 Median : 1.00
## Mean : 66.64 Mean : 12.57
## 3rd Qu.: 74.25 3rd Qu.: 13.00
## Max. :571.00 Max. :165.00
## Other.mother.tongue Mother.tongue.unknown
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000
## Mean : 7.231 Mean : 6.583
## 3rd Qu.: 7.250 3rd Qu.: 4.000
## Max. :103.000 Max. :88.000
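A mean is only defined for the four count variables, so a shorter alternative to the full summary is to average those columns only. The sketch below uses a hypothetical data frame `counts_head` holding the first five rows of the preview, so its means differ from the ones reported for all 108 rows.

```r
# Hypothetical stand-in: count columns of the first five preview rows.
counts_head <- data.frame(
  Estonian = c(82L, 31L, 215L, 173L, 124L),
  Russian = c(3L, 7L, 45L, 53L, 24L),
  Other.mother.tongue = c(0L, 0L, 1L, 20L, 5L),
  Mother.tongue.unknown = c(0L, 0L, 0L, 11L, 9L)
)

# Column-wise arithmetic mean of each count variable.
count_means <- colMeans(counts_head)
print(count_means)
```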
5. Are there any NULL / NA values?
There are no NULL / NA values.
sum(is.na(higher.edu))
## [1] 0
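A per-column count can show where missing values sit if the total is ever non-zero. This is a minimal sketch on a hypothetical stand-in, `higher_head`, built from the first three rows of the preview.

```r
# Hypothetical stand-in: first three rows of the preview.
higher_head <- data.frame(
  Level.of.study = rep("Bachelor's study", 3),
  Estonian = c(82L, 31L, 215L),
  Russian = c(3L, 7L, 45L),
  Other.mother.tongue = c(0L, 0L, 1L),
  Mother.tongue.unknown = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(higher_head))
print(na_per_column)

# Single yes/no answer for the whole table.
has_missing <- anyNA(higher_head)
print(has_missing)
```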
I wanted to explore the variables Level of study and Mother tongue (Estonian, Russian) and the relationship between them.
## `summarise()` has grouped output by 'Level.of.study'. You can override using
## the `.groups` argument.
## # A tibble: 6 × 3
## # Groups: Level.of.study [3]
## Level.of.study Mother_Tongue sum_of_student
## <chr> <chr> <int>
## 1 Bachelor's study Estonian 3705
## 2 Bachelor's study Russian 806
## 3 Doctoral study Estonian 169
## 4 Doctoral study Russian 23
## 5 Integrated Bachelor's/Master's study Estonian 519
## 6 Integrated Bachelor's/Master's study Russian 91
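The grouped sums above come from reshaping the two language columns into one long column and then summing per group. The report's own code is not shown (the message suggests dplyr's `summarise()`), so the following is only a base R sketch of the same idea. It runs on a hypothetical stand-in, `higher_head`, holding the first five preview rows, so the totals are smaller than those in the report.

```r
# Hypothetical stand-in: first five rows of the preview (all Bachelor's study).
higher_head <- data.frame(
  Level.of.study = rep("Bachelor's study", 5),
  Estonian = c(82L, 31L, 215L, 173L, 124L),
  Russian = c(3L, 7L, 45L, 53L, 24L),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (level of study, mother tongue) count.
long <- data.frame(
  Level.of.study = rep(higher_head$Level.of.study, times = 2),
  Mother_Tongue = rep(c("Estonian", "Russian"), each = nrow(higher_head)),
  sum_of_student = c(higher_head$Estonian, higher_head$Russian),
  stringsAsFactors = FALSE
)

# Sum the counts within each level of study and mother tongue.
sums <- aggregate(sum_of_student ~ Level.of.study + Mother_Tongue,
                  data = long, FUN = sum)
print(sums)
```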
RQ: Is there a relationship between the level of study and the students' mother tongue (Estonian or Russian)?
To answer the research question, I will use a chi-square test of independence. First, we need to create a contingency table for these variables:
## Mother_Tongue
## Level.of.study Estonian Russian
## Bachelor's study 3705 806
## Doctoral study 169 23
## Integrated Bachelor's/Master's study 519 91
## Master's study 2804 438
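For readers who want to reproduce the table without the raw data, a minimal sketch is to type the eight counts in directly. The object name `Ctable` matches the test output later in the report. Building it from a matrix is an assumption, since the original construction code is not shown.

```r
# Counts copied from the contingency table above, filled row by row.
Ctable <- as.table(matrix(
  c(3705, 806,
     169,  23,
     519,  91,
    2804, 438),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study",
                       "Master's study"),
    Mother_Tongue = c("Estonian", "Russian")
  )
))

# Row, column and grand totals appended as a "Sum" row and column.
with_totals <- addmargins(Ctable)
print(with_totals)
```

The grand total in the bottom-right cell is the number of students that enter the test.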
Here I will visualize the contingency table.
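From the plot and the table above we can see that the share of students whose mother tongue is Estonian is larger at the higher levels of study: about 82% at Bachelor's level, compared with about 86% at Master's level and 88% at Doctoral level.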
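Because the four levels differ greatly in size, row shares are easier to compare than raw counts. The sketch below retypes the counts from the contingency table (the matrix construction is an assumption, as the original code is not shown) and converts each row to proportions.

```r
# Counts copied from the contingency table above.
Ctable <- as.table(matrix(
  c(3705, 806, 169, 23, 519, 91, 2804, 438),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study",
                       "Master's study"),
    Mother_Tongue = c("Estonian", "Russian")
  )
))

# margin = 1 divides every cell by its row total, so each row sums to 1.
row_share <- prop.table(Ctable, margin = 1)
print(round(row_share, 3))
```

A call such as `mosaicplot(Ctable)` from base graphics is one way to draw the same comparison, though it may not be the plot used in the report.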
Null hypothesis: the level of study and mother tongue are independent.
Alternative hypothesis: the level of study and mother tongue are dependent.
The data cover students at all study levels in Estonia in 2021 whose mother tongue is Estonian or Russian. The sample size is 8555, the sum of all cells in the contingency table.
All cell counts in the contingency table are well above 5 (the smallest is 23), and the expected counts reported with the test are also all above 5, so the assumption of the chi-square test is not violated. Here I perform the chi-square test.
##
## Pearson's Chi-squared test
##
## data: Ctable
## X-squared = 29.587, df = 3, p-value = 1.685e-06
## Mother_Tongue
## Level.of.study Estonian Russian
## Bachelor's study 3794.9348 716.06523
## Doctoral study 161.5224 30.47762
## Integrated Bachelor's/Master's study 513.1701 96.82992
## Master's study 2727.3728 514.62724
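The output above can be reproduced from the contingency table alone. In this sketch the counts are retyped into a matrix (an assumption about how `Ctable` was built), `chisq.test()` from the stats package is run, and the expected counts are recomputed by hand as row total times column total divided by the grand total.

```r
# Counts copied from the contingency table above.
Ctable <- as.table(matrix(
  c(3705, 806, 169, 23, 519, 91, 2804, 438),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study",
                       "Master's study"),
    Mother_Tongue = c("Estonian", "Russian")
  )
))

# Pearson chi-square test of independence (no continuity correction is
# applied for tables larger than 2 x 2).
test <- chisq.test(Ctable)
print(test)

# Expected counts under independence: row total * column total / N.
expected_manual <- outer(rowSums(Ctable), colSums(Ctable)) / sum(Ctable)
print(round(expected_manual, 2))

# Assumption check: every expected count should be at least 5.
smallest_expected <- min(test$expected)
print(smallest_expected)
```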
According to the result, the total chi-square statistic is 29.587. Next, I explore which cells contribute most to this total.
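From the residuals plot and the expected counts above we can see the main contributors, which are the two Russian cells: Bachelor's students whose mother tongue is Russian (806 observed against about 716 expected) and Master's students whose mother tongue is Russian (438 observed against about 515 expected). The corresponding Estonian cells deviate in the opposite direction but contribute less to the statistic.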
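Cell contributions can also be read off numerically. The sketch below retypes the counts (the matrix construction is an assumption), takes the Pearson residuals returned by `chisq.test()`, and expresses each squared residual as a percentage of the total statistic.

```r
# Counts copied from the contingency table above.
Ctable <- as.table(matrix(
  c(3705, 806, 169, 23, 519, 91, 2804, 438),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study",
                       "Master's study"),
    Mother_Tongue = c("Estonian", "Russian")
  )
))

test <- chisq.test(Ctable)

# Pearson residuals: (observed - expected) / sqrt(expected).
# Positive means more students than expected under independence.
pearson_res <- test$residuals
print(round(pearson_res, 2))

# Each squared residual is that cell's share of the chi-square statistic.
contrib_pct <- 100 * pearson_res^2 / unname(test$statistic)
print(round(contrib_pct, 1))
```

The sign of a residual gives the direction of the deviation, while the percentage gives its weight in the overall result.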
The p-value, 1.685e-06, is much smaller than the 0.05 significance level, so we reject the null hypothesis and conclude that the level of study and mother tongue are dependent.
In order to examine the relationship between level of study and mother tongue, a chi-square test of independence was performed. The relationship between these variables was statistically significant: the chi-square statistic is 29.587 (df = 3), and the p-value of 1.685e-06 is much smaller than the 0.05 significance level. The level of study and mother tongue are dependent.