HW8_Habilainen

Introduction

The dataset I am going to analyse comes from the Estonian statistics website and shows the number of admitted students across different specialties for the year 2021. The students are broken down by study programme group, level of study, and mother tongue.

Questions for EDA

I am creating separate datasets to see the mean number of admitted Estonian- and Russian-speaking students for each level of study.

##                         Level.of.study   Estonian
## 1                     Bachelor's study 137.222222
## 2                       Doctoral study   6.259259
## 3 Integrated Bachelor's/Master's study  19.222222
## 4                       Master's study 103.851852

##                         Level.of.study    Russian
## 1                     Bachelor's study 29.8518519
## 2                       Doctoral study  0.8518519
## 3 Integrated Bachelor's/Master's study  3.3703704
## 4                       Master's study 16.2222222
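The two tables above could be produced with a grouped mean. The following is a minimal sketch, assuming a data frame with the columns `Level.of.study`, `Estonian`, and `Russian` seen in the output. The data frame `toy` and its values are invented for illustration and are not the real admissions data.

```r
# Toy data frame: column names follow the output above, values are invented.
toy <- data.frame(
  Level.of.study = c("Bachelor's study", "Bachelor's study",
                     "Master's study", "Master's study"),
  Estonian = c(120, 80, 60, 40),
  Russian  = c(20, 10, 8, 4)
)

# Mean number of admitted students per level of study, one table per language.
toy_estonian_means <- aggregate(Estonian ~ Level.of.study, data = toy, FUN = mean)
toy_russian_means  <- aggregate(Russian ~ Level.of.study, data = toy, FUN = mean)

print(toy_estonian_means)
print(toy_russian_means)
```

With the real data, `toy` would be replaced by the dataset loaded from the statistics website.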

There looks to be a large difference between the mean numbers of admitted Estonian- and Russian-speaking students, with Estonian speakers being more abundant at every level of study. There are several possible reasons for this difference: there are more Estonian-speaking people in Estonia overall, and education is mainly in Estonian. It should also be noted that the data covers only the year 2021 and does not show whether the means have increased or decreased over time for either group.

Description of data transformation

The new dataset includes 108 observations of 3 variables: ‘level of study’, ‘Estonian’, and ‘Russian’.

I created histograms for Estonian and Russian speaking students.

Both histograms are positively skewed.
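Positive skew can also be checked numerically. The sketch below is an illustration only: `skewness` is a hypothetical helper written here (not a base R function), and `toy_counts` holds invented values rather than the real admissions counts.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Invented counts with many small values and a few large ones.
toy_counts <- c(0, 1, 2, 2, 3, 5, 8, 15, 40, 120)

print(skewness(toy_counts))
# In a positively skewed sample the mean usually lies above the median.
print(mean(toy_counts) > median(toy_counts))

# Bin counts without opening a graphics device; drop plot = FALSE to draw it.
toy_hist <- hist(toy_counts, plot = FALSE)
print(toy_hist$counts)
```

Applied to the real `Estonian` and `Russian` columns, a skewness value above zero would agree with what the histograms show.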

Research question:

Is there a difference between the number of admitted Estonian- and Russian-speaking students across different specialties in the year 2021?

Hypothesis testing

H0: There is no difference between the overall number of admitted Estonian- and Russian-speaking students.

H1: There is a difference between the overall number of admitted Estonian- and Russian-speaking students.

The sample is quite large: 7197 Estonian-speaking and 1358 Russian-speaking students. It is already apparent that far more Estonian-speaking students were admitted across the different specialties.
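These totals can be cross-checked against the level means printed in the EDA tables. The sketch below assumes that the 108 observations are split evenly over the 4 levels of study (27 rows per level), which is inferred from the numbers rather than stated in the data description. Under that assumption, multiplying each mean by 27 and summing recovers 7197 and 1358.

```r
# Level means as printed in the EDA tables.
est_level_means <- c(137.222222, 6.259259, 19.222222, 103.851852)
rus_level_means <- c(29.8518519, 0.8518519, 3.3703704, 16.2222222)

# Assumption: 108 observations split evenly over 4 levels of study.
rows_per_level <- 108 / 4

# mean * number of rows = total per level; round() removes print truncation.
estonian_total <- sum(round(est_level_means * rows_per_level))
russian_total  <- sum(round(rus_level_means * rows_per_level))

print(c(Estonian = estonian_total, Russian = russian_total))
```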

I am choosing to use the Chi-square test because I want to see if there is a significant difference between the two variables I chose. The Chi-square test also suits this data because it works with counts and does not require the two groups to be of equal size.

## 
##  Pearson's Chi-squared test
## 
## data:  data$Estonian and data$Russian
## X-squared = 3200.6, df = 1922, p-value < 2.2e-16

The Chi-square test gave a p-value < 2.2e-16, which is below the 0.05 significance level. Therefore H0 can be rejected.
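When `chisq.test` is given two vectors, it coerces them to factors and cross-tabulates them, which is where df = 1922 in the output above comes from. As a cross-check that compares the two overall totals directly, a goodness-of-fit test can be sketched as follows. This is an alternative formulation and not the call used in this report; it assumes that "no difference" in H0 means equal expected shares for the two groups.

```r
# Overall totals reported in the text.
totals <- c(Estonian = 7197, Russian = 1358)

# Goodness-of-fit test: with a single vector of counts, chisq.test
# compares them against equal proportions by default (df = 2 - 1 = 1).
totals_test <- chisq.test(totals)
print(totals_test)
```

A p-value below 0.05 here would point to the same decision about H0 as the test above.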

Conclusion:

The Chi-square test revealed a significant difference (p < 0.05) between the numbers of admitted students who speak Estonian and those who speak Russian as their mother tongue, across different specialties in the year 2021.