Download the dataset from the Estonian statistics website into an
R Markdown session. The number of admitted students across
different specialties is recorded for the year 2021. The numbers of admitted
students are presented by Study programme group, Level of
study, and Mother tongue.
Perform a quick EDA and pick the variables you want to explore in more depth.
Prepare a data set that includes only the variables of
interest in a suitable format for analysis (use the dplyr
package, and tidyr when necessary).
Based on the dataset, formulate your Research Question(s). Think of questions that you can answer with either a chi-square test or some type of t-test.
Hypothesis testing.
4.1 State the null hypothesis and the alternative hypothesis.
4.2 Report on collected data and sample size.
4.3 Check the assumptions of the chosen statistical test. Perform the
required statistical test.
4.4 Decide whether to reject or fail to reject your null hypothesis, and
report the selected significance level.
4.5 Interpret and report the results.
Create a report in R Markdown with clear sections,
using headings and subheadings.
… and deleting the second column ‘Indicator’
## # A tibble: 108 × 6
## Level.of.study Study.programme.group Eston…¹ Russian Other…² Mothe…³
## <chr> <chr> <int> <int> <int> <int>
## 1 Bachelor's study Journalism and information 82 3 0 0
## 2 Bachelor's study Architecture and constructi… 31 7 0 0
## 3 Bachelor's study Biological and environmenta… 215 45 1 0
## 4 Bachelor's study Physical sciences 173 53 20 11
## 5 Bachelor's study Humanities 124 24 5 9
## 6 Bachelor's study Information and Communicati… 571 165 14 23
## 7 Bachelor's study Personal services 25 0 0 0
## 8 Bachelor's study Languages and cultures 266 84 7 0
## 9 Bachelor's study Arts 206 40 8 23
## 10 Bachelor's study Mathematics and statistics 67 22 1 0
## # … with 98 more rows, and abbreviated variable names ¹Estonian,
## # ²Other.mother.tongue, ³Mother.tongue.unknown
We are looking at the data set from the website stat.ee which includes the number of admitted students across different specialties recorded for Year 2021. The data set includes the numbers of admitted students for Study programme group, their Level of study, and their Mother tongue.
After deleting the second column ‘Indicator’ (it is not needed), the data set is a tibble (108 × 6) with the columns ‘Level.of.study’, ‘Study.programme.group’, ‘Estonian’, ‘Russian’, ‘Other.mother.tongue’ and ‘Mother.tongue.unknown’.
There are no NA values in the data set.
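A check along the following lines would confirm the absence of missing values. This is only a minimal sketch: the data frame name `admitted` is an assumption, and three rows of the printed tibble are retyped by hand so that the example runs on its own.

```r
# Three rows copied from the printed tibble; `admitted` is a hypothetical name
admitted <- data.frame(
  Level.of.study        = rep("Bachelor's study", 3),
  Study.programme.group = c("Journalism and information",
                            "Physical sciences",
                            "Humanities"),
  Estonian              = c(82L, 173L, 124L),
  Russian               = c(3L, 53L, 24L),
  Other.mother.tongue   = c(0L, 20L, 5L),
  Mother.tongue.unknown = c(0L, 11L, 9L)
)

# Total number of missing values and the count per column
na_total      <- sum(is.na(admitted))
na_per_column <- colSums(is.na(admitted))

print(na_total)
print(na_per_column)
```

On the full data set the same two calls would be applied to the imported tibble instead of the hand-typed rows.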
To check the data types of the variables, I can look at the structure of the data set
with the str() function:
I want to change the data type of Level.of.study and Study.programme.group to factor:
## # A tibble: 108 × 6
## Level.of.study Study.programme.group Eston…¹ Russian Other…² Mothe…³
## <fct> <fct> <int> <int> <int> <int>
## 1 Bachelor's study Journalism and information 82 3 0 0
## 2 Bachelor's study Architecture and constructi… 31 7 0 0
## 3 Bachelor's study Biological and environmenta… 215 45 1 0
## 4 Bachelor's study Physical sciences 173 53 20 11
## 5 Bachelor's study Humanities 124 24 5 9
## 6 Bachelor's study Information and Communicati… 571 165 14 23
## 7 Bachelor's study Personal services 25 0 0 0
## 8 Bachelor's study Languages and cultures 266 84 7 0
## 9 Bachelor's study Arts 206 40 8 23
## 10 Bachelor's study Mathematics and statistics 67 22 1 0
## # … with 98 more rows, and abbreviated variable names ¹Estonian,
## # ²Other.mother.tongue, ³Mother.tongue.unknown
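The inspection and conversion described above could be done roughly as follows with dplyr. This is a sketch under assumptions: the names `admitted` and `admitted_fct` are hypothetical, and only three hand-typed rows stand in for the full data set.

```r
library(dplyr)

# Three rows copied from the printed tibble; `admitted` is a hypothetical name
admitted <- data.frame(
  Level.of.study        = rep("Bachelor's study", 3),
  Study.programme.group = c("Journalism and information",
                            "Physical sciences",
                            "Humanities"),
  Estonian              = c(82L, 173L, 124L),
  Russian               = c(3L, 53L, 24L),
  stringsAsFactors      = FALSE
)

# Before the conversion both text columns are character vectors
str(admitted)

# Convert the two categorical columns to factors in one step
admitted_fct <- admitted %>%
  mutate(across(c(Level.of.study, Study.programme.group), as.factor))

# After the conversion they are reported as factors
str(admitted_fct)
```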
This does not really matter for the following analysis: we have nominal data, so we cannot use parametric tests anyway.
I specifically want to look at the study programme groups “Journalism and information” and “Sports” and compare whether they differ regarding the mother tongue of students (Estonian or Russian; there are no students with other or unknown mother tongue in these programmes). Both variables are measured at the nominal level, and I later want to use the chi-square test of independence to compare the groups.
Therefore, I will now prepare a new data set that only includes the variables I want to look at:
## # A tibble: 8 × 4
## Level.of.study Study.programme.group Eston…¹ Russian
## <fct> <fct> <int> <int>
## 1 Bachelor's study Journalism and informati… 82 3
## 2 Integrated Bachelor's/Master's study Journalism and informati… 0 0
## 3 Master's study Journalism and informati… 46 4
## 4 Doctoral study Journalism and informati… 1 0
## 5 Bachelor's study Sports 90 5
## 6 Integrated Bachelor's/Master's study Sports 0 0
## 7 Master's study Sports 18 2
## 8 Doctoral study Sports 3 0
## # … with abbreviated variable name ¹Estonian
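A subset like the one printed above could be produced with dplyr's `filter()` and `select()`. The sketch below is illustrative only: `admitted` and `selected` are hypothetical names, and the data frame is retyped by hand from the printed output, with one extra “Humanities” row added so that the filter has something to drop.

```r
library(dplyr)

levels_of_study <- c("Bachelor's study",
                     "Integrated Bachelor's/Master's study",
                     "Master's study",
                     "Doctoral study")

# Hand-typed stand-in for the full data set (hypothetical name `admitted`).
# The zeros in the last two columns follow the statement that these two
# programme groups have no students with other or unknown mother tongue.
admitted <- data.frame(
  Level.of.study        = c(levels_of_study, levels_of_study, "Bachelor's study"),
  Study.programme.group = c(rep("Journalism and information", 4),
                            rep("Sports", 4),
                            "Humanities"),
  Estonian              = c(82L, 0L, 46L, 1L, 90L, 0L, 18L, 3L, 124L),
  Russian               = c(3L, 0L, 4L, 0L, 5L, 0L, 2L, 0L, 24L),
  Other.mother.tongue   = c(rep(0L, 8), 5L),
  Mother.tongue.unknown = c(rep(0L, 8), 9L)
)

# Keep the two programme groups of interest and the two language columns
selected <- admitted %>%
  filter(Study.programme.group %in% c("Journalism and information", "Sports")) %>%
  select(Level.of.study, Study.programme.group, Estonian, Russian)

print(selected)
```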
For the chi-square test of independence, I will need a contingency table:
## Estonian Russian
## Sports 111 7
## Journalism.and.information 129 7
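One way to arrive at such a contingency table is to sum the counts over all levels of study within each programme group. The sketch below uses base R's `rowsum()`; the object names are hypothetical and the eight rows are retyped from the output above. Note that `rowsum()` sorts the groups alphabetically, so the row order differs from the printed table, which does not affect the test.

```r
# Eight rows retyped from the prepared data set (hypothetical name `selected`)
selected <- data.frame(
  Study.programme.group = rep(c("Journalism and information", "Sports"), each = 4),
  Estonian              = c(82, 0, 46, 1, 90, 0, 18, 3),
  Russian               = c(3, 0, 4, 0, 5, 0, 2, 0)
)

# Sum the counts per programme group across all levels of study
counts <- rowsum(selected[, c("Estonian", "Russian")],
                 group = selected$Study.programme.group)

# chisq.test() accepts a numeric matrix of counts
contingency <- as.matrix(counts)

print(contingency)
print(sum(contingency))  # total number of students in the sample
```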
My research question is: Does the distribution of mother tongue (Estonian or Russian) differ between “Sports” and “Journalism and information” students?
H0 (null hypothesis): “Sports” and “Journalism and information” students do not differ regarding their mother tongue.
H1 (alternative hypothesis): “Sports” and “Journalism and information” students do differ regarding their mother tongue.
The collected data is from the website stat.ee and it includes the number of admitted students across different specialties recorded for Year 2021. The data set includes the numbers of admitted students for Study programme group, Level of study, and Mother tongue. The sample I am looking at includes information for students of “Sports” as well as “Journalism and information” of all levels of study and their mother tongue. It includes information on 254 students.
I will test for significance at the 5 % level.
##
## Pearson's Chi-squared test
##
## data: data
## X-squared = 0.074785, df = 1, p-value = 0.7845
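The test and its main assumption (expected counts of at least 5 in every cell) can be checked with a short sketch like the one below. The matrix is typed in directly from the contingency table above. The argument `correct = FALSE` is an assumption: the output header reads “Pearson's Chi-squared test” without mention of Yates' continuity correction, which R applies to 2 × 2 tables by default, so the correction appears to have been switched off.

```r
# Contingency table typed in from the report
contingency <- matrix(
  c(111, 7,
    129, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Sports", "Journalism.and.information"),
                  c("Estonian", "Russian"))
)

# Pearson's chi-square test without continuity correction (assumed setting)
result <- chisq.test(contingency, correct = FALSE)

# Assumption check: all expected cell counts should be at least 5
print(result$expected)
assumption_ok <- all(result$expected >= 5)
print(assumption_ok)

# Test statistic, degrees of freedom and p-value
print(result$statistic)
print(result$parameter)
print(result$p.value)
```

With these counts the smallest expected value is about 6.5, so the rule of thumb is met, and the statistic and p-value match the output reported above.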
Chi-squared is 0.074785 with 1 degree of freedom. The p-value is 0.7845, which is greater than 0.05 and therefore not significant.
With a p-value of 0.7845, which is above the 5 % significance level, I fail to reject the null hypothesis. The chi-square test of independence does not indicate a significant difference between “Sports” and “Journalism and information” students regarding their mother tongue. This is consistent with the null hypothesis (“Sports” and “Journalism and information” students do not differ regarding their mother tongue), although a non-significant result does not prove that the null hypothesis is true. Based on this sample, there is no evidence that the two groups differ in the population.