Homework Assignment 8

Assignment Description

Download the dataset from the Statistics Estonia website into an R Markdown session. The number of admitted students across different specialties is recorded for the year 2021. The numbers of admitted students are presented by Study programme group, Level of study, and Mother tongue.

  1. Perform a quick EDA and pick the variables you want to explore in more depth.

  2. Prepare a data set that includes only the variables of interest in a format suitable for analysis (use the dplyr package, and tidyr when necessary).

  3. Based on the dataset, formulate your Research Question(s). Think of questions that you can answer with the help of either chi-square statistical test or some type of t-test.

  4. Hypothesis testing.
    4.1 State the null hypothesis and the alternative hypothesis.
    4.2 Report on collected data and sample size.
    4.3 Check the assumption of the chosen statistical test. Perform the required statistical test.
    4.4 Decide whether to reject or fail to reject your null hypothesis, report selected significance level.
    4.5 Interpret and report the results.

  5. Create a report in R Markdown with clear sections, using headings and subheadings.

Downloading data set

… and deleting the second column ‘Indicator’

## # A tibble: 108 × 6
##    Level.of.study   Study.programme.group        Eston…¹ Russian Other…² Mothe…³
##    <chr>            <chr>                          <int>   <int>   <int>   <int>
##  1 Bachelor's study Journalism and information        82       3       0       0
##  2 Bachelor's study Architecture and constructi…      31       7       0       0
##  3 Bachelor's study Biological and environmenta…     215      45       1       0
##  4 Bachelor's study Physical sciences                173      53      20      11
##  5 Bachelor's study Humanities                       124      24       5       9
##  6 Bachelor's study Information and Communicati…     571     165      14      23
##  7 Bachelor's study Personal services                 25       0       0       0
##  8 Bachelor's study Languages and cultures           266      84       7       0
##  9 Bachelor's study Arts                             206      40       8      23
## 10 Bachelor's study Mathematics and statistics        67      22       1       0
## # … with 98 more rows, and abbreviated variable names ¹​Estonian,
## #   ²​Other.mother.tongue, ³​Mother.tongue.unknown
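
The loading step can be sketched as follows. This is only an illustration, not the code behind the output above: the real file name and export format are not shown in this report, so the sketch reads three of the printed rows from an inline CSV string, and the values in the 'Indicator' column are placeholders. The object name `admitted` is also an assumption.

```r
library(dplyr)

# Stand-in for the downloaded file: three rows copied from the printed tibble.
# The 'Indicator' values are placeholders; the real export is not shown here.
csv_text <- "Level.of.study,Indicator,Study.programme.group,Estonian,Russian,Other.mother.tongue,Mother.tongue.unknown
Bachelor's study,Admitted,Journalism and information,82,3,0,0
Bachelor's study,Admitted,Physical sciences,173,53,20,11
Bachelor's study,Admitted,Humanities,124,24,5,9"

admitted <- read.csv(text = csv_text) %>%
  as_tibble() %>%
  select(-Indicator)   # drop the second column, which is not needed

admitted
```

With the real download, `read.csv(text = csv_text)` would be replaced by a call that reads the actual file.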

Introduction (and some EDA)

We are looking at the data set from the website stat.ee which includes the number of admitted students across different specialties recorded for Year 2021. The data set includes the numbers of admitted students for Study programme group, their Level of study, and their Mother tongue.

After deleting the second column ‘Indicator’ (it is not needed), the data set is a tibble with 108 rows and 6 columns: ‘Level.of.study’, ‘Study.programme.group’, ‘Estonian’, ‘Russian’, ‘Other.mother.tongue’ and ‘Mother.tongue.unknown’.

Questions for EDA

Are there any NA values?

There are no NA values in the data set.
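
A minimal sketch of such a check in base R, using a two-row stand-in for the data (the object name `admitted` is an assumption; the values are taken from the rows printed earlier):

```r
# Small stand-in for the admissions table.
admitted <- data.frame(
  Level.of.study = c("Bachelor's study", "Bachelor's study"),
  Study.programme.group = c("Journalism and information", "Humanities"),
  Estonian = c(82L, 124L),
  Russian = c(3L, 24L)
)

anyNA(admitted)            # TRUE if at least one value is missing
colSums(is.na(admitted))   # number of missing values in each column
```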

What are the data types of the variables? Do we need to change them?

To answer that question, I can look at the structure of the data set with the str() function:

  • Level.of.study: character
  • Study.programme.group: character
  • Estonian: integer
  • Russian: integer
  • Other.mother.tongue: integer
  • Mother.tongue.unknown: integer

I want to change the data type of Level.of.study and Study.programme.group to factor:

## # A tibble: 108 × 6
##    Level.of.study   Study.programme.group        Eston…¹ Russian Other…² Mothe…³
##    <fct>            <fct>                          <int>   <int>   <int>   <int>
##  1 Bachelor's study Journalism and information        82       3       0       0
##  2 Bachelor's study Architecture and constructi…      31       7       0       0
##  3 Bachelor's study Biological and environmenta…     215      45       1       0
##  4 Bachelor's study Physical sciences                173      53      20      11
##  5 Bachelor's study Humanities                       124      24       5       9
##  6 Bachelor's study Information and Communicati…     571     165      14      23
##  7 Bachelor's study Personal services                 25       0       0       0
##  8 Bachelor's study Languages and cultures           266      84       7       0
##  9 Bachelor's study Arts                             206      40       8      23
## 10 Bachelor's study Mathematics and statistics        67      22       1       0
## # … with 98 more rows, and abbreviated variable names ¹​Estonian,
## #   ²​Other.mother.tongue, ³​Mother.tongue.unknown
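
One way to inspect and convert the types is sketched below. It assumes the data sits in an object called `admitted` (a hypothetical name) and uses a three-row stand-in so that it runs on its own; `across()` requires dplyr 1.0 or later.

```r
library(dplyr)

# Three-row stand-in for the admissions table.
admitted <- data.frame(
  Level.of.study = c("Bachelor's study", "Master's study", "Doctoral study"),
  Study.programme.group = c("Sports", "Sports", "Sports"),
  Estonian = c(90L, 18L, 3L),
  Russian = c(5L, 2L, 0L)
)

str(admitted)   # the two text columns are reported as chr

# Convert both grouping columns to factors in one step.
admitted <- admitted %>%
  mutate(across(c(Level.of.study, Study.programme.group), as.factor))

str(admitted)   # the same two columns are now factors
```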

How is the distribution of the data?

This does not really matter for the following analysis. The variables of interest are nominal categories and the values are counts, so parametric tests that assume normally distributed data are not appropriate anyway.

Visualisation of data

Description of data transformation

I specifically want to look at the study programme groups “Journalism and information” and “Sports” and compare whether they differ regarding the mother tongue of their students (Estonian or Russian; there are no students with other or unknown mother tongue in these programme groups). Both variables are measured at the nominal level, and I later want to use the chi-square test of independence to compare the groups.

Therefore, I will now prepare a new data set that only includes the variables I want to look at:

## # A tibble: 8 × 4
##   Level.of.study                       Study.programme.group     Eston…¹ Russian
##   <fct>                                <fct>                       <int>   <int>
## 1 Bachelor's study                     Journalism and informati…      82       3
## 2 Integrated Bachelor's/Master's study Journalism and informati…       0       0
## 3 Master's study                       Journalism and informati…      46       4
## 4 Doctoral study                       Journalism and informati…       1       0
## 5 Bachelor's study                     Sports                         90       5
## 6 Integrated Bachelor's/Master's study Sports                          0       0
## 7 Master's study                       Sports                         18       2
## 8 Doctoral study                       Sports                          3       0
## # … with abbreviated variable name ¹​Estonian
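
The subsetting step could look roughly like this with dplyr. The sketch rebuilds the eight rows by hand and adds one “Humanities” row from the earlier printout so that the filter has something to remove; the object names `admitted` and `subset_data` are assumptions.

```r
library(dplyr)

levels_of_study <- c("Bachelor's study", "Integrated Bachelor's/Master's study",
                     "Master's study", "Doctoral study")

# Stand-in data: the eight rows of interest plus one row that should be dropped.
admitted <- data.frame(
  Level.of.study = c(rep(levels_of_study, times = 2), "Bachelor's study"),
  Study.programme.group = c(rep(c("Journalism and information", "Sports"), each = 4),
                            "Humanities"),
  Estonian = c(82L, 0L, 46L, 1L, 90L, 0L, 18L, 3L, 124L),
  Russian = c(3L, 0L, 4L, 0L, 5L, 0L, 2L, 0L, 24L),
  Other.mother.tongue = c(rep(0L, 8), 5L),
  Mother.tongue.unknown = c(rep(0L, 8), 9L)
)

# Keep only the two programme groups and the two mother-tongue columns.
subset_data <- admitted %>%
  filter(Study.programme.group %in% c("Journalism and information", "Sports")) %>%
  select(Level.of.study, Study.programme.group, Estonian, Russian)

subset_data
```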

For the chi-square test of independence, I will need a contingency table:

##                            Estonian Russian
## Sports                          111       7
## Journalism.and.information      129       7
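
The counts in this table are the column sums over all levels of study within each programme group. A base R sketch of that aggregation, using the eight rows printed above (the object names are assumptions, and the row order and labels may differ from the table shown):

```r
# The eight rows of the prepared data set, typed in by hand.
subset_data <- data.frame(
  Study.programme.group = rep(c("Journalism and information", "Sports"), each = 4),
  Estonian = c(82, 0, 46, 1, 90, 0, 18, 3),
  Russian = c(3, 0, 4, 0, 5, 0, 2, 0)
)

# Sum the counts within each programme group and convert to a matrix,
# which is the input format chisq.test() expects for a contingency table.
tab <- as.matrix(
  rowsum(subset_data[, c("Estonian", "Russian")],
         group = subset_data$Study.programme.group)
)

tab
```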

Research Question

My research question is: “Does the distribution of mother tongue (Estonian or Russian) differ between students of “Sports” and students of “Journalism and information”?”

Hypothesis testing

H0 and H1

H0 (null hypothesis): “Sports” and “Journalism and information” students do not differ regarding their mother tongue.

H1 (alternative hypothesis): “Sports” and “Journalism and information” students do differ regarding their mother tongue.

Collected data and sample size

The collected data is from the website stat.ee and it includes the number of admitted students across different specialties recorded for Year 2021. The data set includes the numbers of admitted students for Study programme group, Level of study, and Mother tongue. The sample I am looking at includes information for students of “Sports” as well as “Journalism and information” of all levels of study and their mother tongue. It includes information on 254 students.
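
The sample size can be checked directly from the contingency table. A short sketch (the object name `tab` is an assumption; the counts are those of the contingency table above):

```r
tab <- matrix(c(111, 7, 129, 7), nrow = 2, byrow = TRUE,
              dimnames = list(c("Sports", "Journalism.and.information"),
                              c("Estonian", "Russian")))

sum(tab)       # total number of students: 254
rowSums(tab)   # students per programme group: 118 and 136
colSums(tab)   # students per mother tongue: 240 and 14
```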

Assumptions of the chi-square test of independence

  • data should be frequencies / counts: assumption is met
  • categories are mutually exclusive: assumption is met
  • each subject contributes data to only one cell of the table: assumption is met
  • groups must be independent: assumption is met
  • two variables measured as categories (nominal or ordinal data): assumption is met
  • expected frequencies should be at least 5 in at least 80 % of the cells: assumption is met
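
The last assumption refers to the expected counts under independence, not the observed ones. A sketch of how they can be inspected with `chisq.test()` (the object name `tab` is an assumption):

```r
tab <- matrix(c(111, 7, 129, 7), nrow = 2, byrow = TRUE,
              dimnames = list(c("Sports", "Journalism.and.information"),
                              c("Estonian", "Russian")))

# Expected count per cell = row total * column total / grand total.
expected <- chisq.test(tab, correct = FALSE)$expected
round(expected, 2)

# Share of cells with an expected count of at least 5 (should be >= 0.8).
mean(expected >= 5)
```

For this table the smallest expected count is about 6.5 (“Sports”, Russian), so all four cells are above 5.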

Conducting chi-square test of independence

I will test for significance at the 5 % level.

## 
##  Pearson's Chi-squared test
## 
## data:  data
## X-squared = 0.074785, df = 1, p-value = 0.7845
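
The output above is consistent with a call to `chisq.test()` without Yates' continuity correction; for a 2 x 2 table R applies the correction by default and would say so in the title line. The sketch below assumes that `correct = FALSE` was used and that the contingency table is stored in an object called `data`, as the `data:  data` line suggests. It also recomputes the statistic by hand as a check.

```r
data <- matrix(c(111, 7, 129, 7), nrow = 2, byrow = TRUE,
               dimnames = list(c("Sports", "Journalism.and.information"),
                               c("Estonian", "Russian")))

# Pearson's chi-squared test without continuity correction (assumed setting).
res <- chisq.test(data, correct = FALSE)
res

# Manual check: sum over all cells of (observed - expected)^2 / expected.
expected <- outer(rowSums(data), colSums(data)) / sum(data)
x_squared <- sum((data - expected)^2 / expected)
x_squared

# Decision at the 5 % level: TRUE would mean rejecting the null hypothesis.
alpha <- 0.05
res$p.value < alpha
```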

Results of chi-square test of independence

The chi-squared statistic is 0.074785 with 1 degree of freedom. The p-value is 0.7845, which is above the 5 % significance level, so the result is not significant.

Reporting and interpretation of results

The p-value of 0.7845 is above the 5 % significance level, so I fail to reject the null hypothesis. The chi-square test of independence (X-squared = 0.074785, df = 1, N = 254) does not indicate a significant difference between the groups of “Sports” and “Journalism and information” regarding students’ mother tongue. This does not prove that the null hypothesis is true; it only means that this sample provides no evidence that the two groups differ. Because the test is not significant, we cannot conclude that such a difference exists in the population.