Introduction

The dataset I will explore records the number of admitted students across different specialties in 2021. The counts are broken down by study programme group, level of study, and mother tongue. The loaded data has six variables: Level of study, Study programme group, Estonian, Russian, Other mother tongue, and Mother tongue unknown. The year is the same (2021) for every row.

## # A tibble: 6 × 6
##   Level.of.study   Study.programme.group         Eston…¹ Russian Other…² Mothe…³
##   <chr>            <chr>                           <int>   <int>   <int>   <int>
## 1 Bachelor's study Journalism and information         82       3       0       0
## 2 Bachelor's study Architecture and construction      31       7       0       0
## 3 Bachelor's study Biological and environmental…     215      45       1       0
## 4 Bachelor's study Physical sciences                 173      53      20      11
## 5 Bachelor's study Humanities                        124      24       5       9
## 6 Bachelor's study Information and Communicatio…     571     165      14      23
## # … with abbreviated variable names ¹​Estonian, ²​Other.mother.tongue,
## #   ³​Mother.tongue.unknown

Questions for EDA

1. What are the observed population, the observation unit and the reference period?

The observed population is all students admitted to higher education in Estonia, across the different specialties. The observation unit is a student; the dataset reports students as counts per level of study, study programme group, and mother tongue. The reference period is 2021.

2. What are the data types of the variables? Do we need to change them?

The dataset includes 6 variables. Their data types are shown below:

## tibble [108 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Level.of.study       : chr [1:108] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
##  $ Study.programme.group: chr [1:108] "Journalism and information" "Architecture and construction" "Biological and environmental sciences" "Physical sciences" ...
##  $ Estonian             : int [1:108] 82 31 215 173 124 571 25 266 206 67 ...
##  $ Russian              : int [1:108] 3 7 45 53 24 165 0 84 40 22 ...
##  $ Other.mother.tongue  : int [1:108] 0 0 1 20 5 14 0 7 8 1 ...
##  $ Mother.tongue.unknown: int [1:108] 0 0 0 11 9 23 0 0 23 0 ...

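We don’t need to change the data types of the variables: the two categorical variables are stored as character and the four count variables as integer.

3. What is the mean for each variable?

The summary below reports the mean, together with the minimum, quartiles and maximum, for each numeric variable. The mean number of admitted students per row is 66.64 for Estonian, 12.57 for Russian, 7.23 for Other mother tongue and 6.58 for Mother tongue unknown.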

##  Level.of.study     Study.programme.group    Estonian         Russian      
##  Length:108         Length:108            Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character      1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character      Median : 12.50   Median :  1.00  
##                                           Mean   : 66.64   Mean   : 12.57  
##                                           3rd Qu.: 74.25   3rd Qu.: 13.00  
##                                           Max.   :571.00   Max.   :165.00  
##  Other.mother.tongue Mother.tongue.unknown
##  Min.   :  0.000     Min.   : 0.000       
##  1st Qu.:  0.000     1st Qu.: 0.000       
##  Median :  0.000     Median : 0.000       
##  Mean   :  7.231     Mean   : 6.583       
##  3rd Qu.:  7.250     3rd Qu.: 4.000       
##  Max.   :103.000     Max.   :88.000
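
If only the means are needed, they can be computed directly for the numeric columns. The sketch below is an illustration, not the code used for the report: `preview` is a hypothetical data frame holding just the six rows shown in the preview at the start of the report, so its means differ from the full-data means above.

```r
# Hypothetical stand-in for the full data: the six preview rows only.
preview <- data.frame(
  Estonian              = c(82L, 31L, 215L, 173L, 124L, 571L),
  Russian               = c(3L, 7L, 45L, 53L, 24L, 165L),
  Other.mother.tongue   = c(0L, 0L, 1L, 20L, 5L, 14L),
  Mother.tongue.unknown = c(0L, 0L, 0L, 11L, 9L, 23L)
)

# Keep only the numeric columns, then take the mean of each one.
numeric_cols <- sapply(preview, is.numeric)
count_means  <- sapply(preview[numeric_cols], mean)

print(round(count_means, 2))
```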

5. Are there any Null / NA values?

There are no Null / NA values.

sum(is.na(higher.edu))
## [1] 0

Description of data cleaning and transformation

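I wanted to explore the level of study, the mother tongue (Estonian or Russian), and the relationship between them. To do this, I summed the number of admitted students for each combination of level of study and mother tongue. The first rows of the result are shown below.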

## `summarise()` has grouped output by 'Level.of.study'. You can override using
## the `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   Level.of.study [3]
##   Level.of.study                       Mother_Tongue sum_of_student
##   <chr>                                <chr>                  <int>
## 1 Bachelor's study                     Estonian                3705
## 2 Bachelor's study                     Russian                  806
## 3 Doctoral study                       Estonian                 169
## 4 Doctoral study                       Russian                   23
## 5 Integrated Bachelor's/Master's study Estonian                 519
## 6 Integrated Bachelor's/Master's study Russian                   91

Research Question

RQ: Is there a relationship between students' level of study and their mother tongue (Estonian or Russian)?

Both variables are categorical, so to answer the research question I will use a chi-square test of independence. First, we need to create a contingency table for these variables:

##                                       Mother_Tongue
## Level.of.study                         Estonian Russian
##   Bachelor's study                         3705     806
##   Doctoral study                            169      23
##   Integrated Bachelor's/Master's study      519      91
##   Master's study                           2804     438
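
One way to build such a table from summarised counts is `xtabs()`. The sketch below assumes a long-format data frame with the column names from the summary output; `long` is a hypothetical name and the counts are copied from the table above.

```r
levels_of_study <- c("Bachelor's study", "Doctoral study",
                     "Integrated Bachelor's/Master's study", "Master's study")

# Long format: one row per (level of study, mother tongue) combination.
long <- data.frame(
  Level.of.study = rep(levels_of_study, each = 2),
  Mother_Tongue  = rep(c("Estonian", "Russian"), times = 4),
  sum_of_student = c(3705, 806, 169, 23, 519, 91, 2804, 438)
)

# The left-hand side of the formula supplies the cell counts.
Ctable <- xtabs(sum_of_student ~ Level.of.study + Mother_Tongue, data = long)

# Row and column totals are useful for checking the sample size.
print(addmargins(Ctable))
```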

Here I will visualize the contingency table.

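From the plot we can see that Estonian speakers outnumber Russian speakers at every level of study. The share of Russian speakers is highest among Bachelor's students (about 18%) and lower at the other levels: about 15% in integrated Bachelor's/Master's study, 14% in Master's study and 12% in Doctoral study.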
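
The comparison between levels is easier to read as row percentages than as raw counts. The following is a minimal sketch, assuming the counts from the contingency table above; the mosaic plot is one possible visualization and not necessarily the plot used in the report, and `shares` is a hypothetical name.

```r
Ctable <- as.table(matrix(
  c(3705, 806, 169, 23, 519, 91, 2804, 438),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study", "Master's study"),
    Mother_Tongue  = c("Estonian", "Russian")
  )
))

# margin = 1 makes each row (level of study) sum to 1.
shares <- prop.table(Ctable, margin = 1)
print(round(100 * shares, 1))

# Tile areas are proportional to the cell counts.
mosaicplot(Ctable, color = TRUE,
           main = "Admitted students by level of study and mother tongue")
```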

Hypothesis testing

State the null hypothesis and the alternative hypothesis

Null hypothesis: the level of study and mother tongue are independent.

Alternative hypothesis: the level of study and mother tongue are dependent.

Report on collected data and sample size.

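The data cover all students admitted in Estonia in 2021 whose mother tongue is Estonian or Russian, across the four levels of study. The sample size is 8555 (7197 Estonian and 1358 Russian speakers), which is the total of the contingency table.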

Check the assumption of the chosen statistical test. Perform the required statistical test.

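The chi-square test requires the expected counts under independence to be at least 5. Here all expected counts are well above 5 (the smallest is about 30.5, for Doctoral students whose mother tongue is Russian), so the assumption is not violated. Below I perform the chi-square test; the output shows the test result followed by the expected counts.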

## 
##  Pearson's Chi-squared test
## 
## data:  Ctable
## X-squared = 29.587, df = 3, p-value = 1.685e-06
##                                       Mother_Tongue
## Level.of.study                          Estonian   Russian
##   Bachelor's study                     3794.9348 716.06523
##   Doctoral study                        161.5224  30.47762
##   Integrated Bachelor's/Master's study  513.1701  96.82992
##   Master's study                       2727.3728 514.62724
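
The output above can be reproduced with `chisq.test()`. The sketch below assumes the counts from the contingency table and reuses the name `Ctable` from the test output; `test` is a hypothetical name.

```r
Ctable <- as.table(matrix(
  c(3705, 806, 169, 23, 519, 91, 2804, 438),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study", "Master's study"),
    Mother_Tongue  = c("Estonian", "Russian")
  )
))

# Pearson's chi-squared test of independence
# (no continuity correction is applied to tables larger than 2 x 2).
test <- chisq.test(Ctable)
print(test)

# Assumption check: every expected count should be at least 5.
print(test$expected)
print(all(test$expected >= 5))
```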

According to the result, the total chi-square statistic is 29.587 with 3 degrees of freedom.
Next, I explore which cells contribute most to this total.

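Comparing the observed and expected counts, the main contributors are the Russian cells: there are more Bachelor's students whose mother tongue is Russian than expected under independence (806 observed against about 716 expected) and fewer Master's students whose mother tongue is Russian than expected (438 against about 515). Correspondingly, Master's students whose mother tongue is Estonian are over-represented (2804 against about 2727).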
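
Each cell's contribution can be quantified as its squared Pearson residual expressed as a percentage of the total statistic. This is a sketch under the same assumptions as before (counts copied from the contingency table); `contrib` is a hypothetical name, and the residuals plot in the report may have been drawn differently.

```r
Ctable <- as.table(matrix(
  c(3705, 806, 169, 23, 519, 91, 2804, 438),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Level.of.study = c("Bachelor's study", "Doctoral study",
                       "Integrated Bachelor's/Master's study", "Master's study"),
    Mother_Tongue  = c("Estonian", "Russian")
  )
))

test <- chisq.test(Ctable)

# Pearson residuals: (observed - expected) / sqrt(expected).
# The sign shows whether a cell is above or below its expected count.
print(round(test$residuals, 2))

# Squared residuals sum to the chi-square statistic,
# so this gives each cell's percentage share of the total.
contrib <- 100 * test$residuals^2 / unname(test$statistic)
print(round(contrib, 1))
```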

Decide whether to reject or fail to reject your null hypothesis, report selected significance level.

The selected significance level is 0.05. The p-value, 1.685e-06, is much smaller than this level, so we reject the null hypothesis and conclude that the level of study and mother tongue are dependent.

Interpret and report the results.

To examine the relationship between level of study and mother tongue, a chi-square test of independence was performed. The relationship between these variables was statistically significant: the chi-square statistic is 29.587 with 3 degrees of freedom, and the p-value of 1.685e-06 is much smaller than the 0.05 significance level. The level of study and mother tongue are therefore dependent.