A report on a Higher Education Dataset

Introduction

In this report I explore the higher.edu dataset, which is hosted in a GitHub repository at “https://raw.githubusercontent.com/Maria-13/DataAnalytics_R/main/Assignments/higher_edu_in_Estonia-1.csv”.

The higher.edu data frame includes six variables:
- Level.of.study
- Study.programme.group
- Estonian
- Russian
- Other.mother.tongue
- Mother.tongue.unknown

First, I load the dataset from the GitHub repository into R so that it can be explored.
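A minimal sketch of the loading step, assuming the file is a comma-separated CSV with a header row. The helper name `load_higher_edu` is hypothetical; only the URL and the data frame name higher.edu come from this report.

```r
# URL given in the introduction of this report
data_url <- "https://raw.githubusercontent.com/Maria-13/DataAnalytics_R/main/Assignments/higher_edu_in_Estonia-1.csv"

# Hypothetical helper (not from the original report): read the CSV into a
# data frame. Assumes a comma-separated file with a header row.
load_higher_edu <- function(path = data_url) {
  read.csv(path, stringsAsFactors = FALSE)
}

# Usage (needs network access):
# higher.edu <- load_higher_edu()
# str(higher.edu)      # structure: column types and first values
# summary(higher.edu)  # per-column summary statistics
```

Keeping the path as an argument makes it possible to point the same function at a local copy of the file.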

Observation: There are 6 columns, one for each of the variables listed above, and 108 entries.
- Level.of.study has 4 levels
- Study.programme.group has 21 levels
- Estonian, Russian, Other.mother.tongue and Mother.tongue.unknown are integer counts with 108 values each

Quick EDA and variable identification for more in-depth exploration

Using the dplyr package, I build a summary of the median number of Estonian speakers for each level of study.

##                         Level.of.study Estonian
## 1                     Bachelor's study       90
## 2                       Doctoral study        4
## 3 Integrated Bachelor's/Master's study        0
## 4                       Master's study       53
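The per-level summary could be produced along the following lines. This is a sketch, not the report's actual code: it assumes `median()` was the summary function, because the Bachelor's value printed here (90) equals the Bachelor's median shown in the data preparation output rather than the mean (137.2). The data frame `toy` and its values are illustrative only.

```r
library(dplyr)

# Illustrative data: a few made-up rows, NOT the real higher.edu dataset
toy <- data.frame(
  Level.of.study = rep(c("Bachelor's study", "Master's study"), each = 3),
  Estonian = c(82, 31, 215, 10, 40, 25),
  Russian  = c(3, 7, 45, 0, 6, 2)
)

# One row per level of study, holding the median of each language column
median_by_level <- toy %>%
  group_by(Level.of.study) %>%
  summarise(Estonian = median(Estonian), Russian = median(Russian)) %>%
  as.data.frame()

print(median_by_level)
```

Summarising both language columns in one call avoids repeating the grouping step for the Russian speakers.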

Next, I view the structure and summary statistics of the entire dataset, which include the Estonian column across all levels of study.

## tibble [108 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Level.of.study       : chr [1:108] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
##  $ Study.programme.group: chr [1:108] "Journalism and information" "Architecture and construction" "Biological and environmental sciences" "Physical sciences" ...
##  $ Estonian             : int [1:108] 82 31 215 173 124 571 25 266 206 67 ...
##  $ Russian              : int [1:108] 3 7 45 53 24 165 0 84 40 22 ...
##  $ Other.mother.tongue  : int [1:108] 0 0 1 20 5 14 0 7 8 1 ...
##  $ Mother.tongue.unknown: int [1:108] 0 0 0 11 9 23 0 0 23 0 ...
##  Level.of.study     Study.programme.group    Estonian         Russian      
##  Length:108         Length:108            Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character      1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character      Median : 12.50   Median :  1.00  
##                                           Mean   : 66.64   Mean   : 12.57  
##                                           3rd Qu.: 74.25   3rd Qu.: 13.00  
##                                           Max.   :571.00   Max.   :165.00  
##  Other.mother.tongue Mother.tongue.unknown
##  Min.   :  0.000     Min.   : 0.000       
##  1st Qu.:  0.000     1st Qu.: 0.000       
##  Median :  0.000     Median : 0.000       
##  Mean   :  7.231     Mean   : 6.583       
##  3rd Qu.:  7.250     3rd Qu.: 4.000       
##  Max.   :103.000     Max.   :88.000

In the same way, I build a summary of the median number of Russian speakers for each level of study.

##                         Level.of.study Russian
## 1                     Bachelor's study      14
## 2                       Doctoral study       0
## 3 Integrated Bachelor's/Master's study       0
## 4                       Master's study       7

The structure and summary of the entire dataset shown earlier also cover the Russian column: across all levels of study it has a median of 1, a mean of 12.57 and a maximum of 165.

Data Preparation

I use the dplyr package to transform the dataset and create a new dataset named myworkdata, which keeps only the Bachelor’s study rows and the Level.of.study, Estonian and Russian columns.

## tibble [27 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Level.of.study: chr [1:27] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
##  $ Estonian      : int [1:27] 82 31 215 173 124 571 25 266 206 67 ...
##  $ Russian       : int [1:27] 3 7 45 53 24 165 0 84 40 22 ...
##  Level.of.study        Estonian        Russian      
##  Length:27          Min.   :  0.0   Min.   :  0.00  
##  Class :character   1st Qu.: 27.5   1st Qu.:  2.00  
##  Mode  :character   Median : 90.0   Median : 14.00  
##                     Mean   :137.2   Mean   : 29.85  
##                     3rd Qu.:200.0   3rd Qu.: 46.50  
##                     Max.   :571.0   Max.   :165.00
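A sketch of how myworkdata might be built with dplyr, assuming `filter()` and `select()` were used. The stand-in higher.edu below has the report's column names but only four illustrative rows, so the output will not match the 27 rows shown above.

```r
library(dplyr)

# Illustrative stand-in for higher.edu (four made-up rows, real column names)
higher.edu <- data.frame(
  Level.of.study = c("Bachelor's study", "Bachelor's study",
                     "Master's study", "Doctoral study"),
  Study.programme.group = c("Journalism and information", "Physical sciences",
                            "Journalism and information", "Physical sciences"),
  Estonian = c(82L, 173L, 40L, 4L),
  Russian = c(3L, 53L, 6L, 0L),
  Other.mother.tongue = c(0L, 20L, 1L, 0L),
  Mother.tongue.unknown = c(0L, 11L, 0L, 0L)
)

# Keep only Bachelor's rows and the three columns used in the analysis
myworkdata <- higher.edu %>%
  filter(Level.of.study == "Bachelor's study") %>%
  select(Level.of.study, Estonian, Russian)

str(myworkdata)
summary(myworkdata)
```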

I am visualising the selected variable

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
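The message above is what ggplot2 prints when a histogram is drawn without an explicit bin setting. A sketch that sets `binwidth` is shown here; the value 50 is an arbitrary illustrative choice, and the data are only the first ten Bachelor's values of Estonian printed by `str()` in this report.

```r
library(ggplot2)

# First ten Bachelor's values of Estonian, as printed by str() in this report
bachelor <- data.frame(
  Estonian = c(82, 31, 215, 173, 124, 571, 25, 266, 206, 67)
)

# binwidth = 50 is an illustrative choice, not the report's setting
p_hist <- ggplot(bachelor, aes(x = Estonian)) +
  geom_histogram(binwidth = 50, colour = "white") +
  labs(x = "Estonian-speaking Bachelor students", y = "Number of rows")

print(p_hist)
```

Setting the bin width explicitly silences the message and makes the histogram reproducible.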

Research Question:

For the analysis, the following research question is formulated:
Is there a statistically significant relationship between the number of Estonian-speaking Bachelor students and the number of Russian-speaking Bachelor students?

To answer this question, I first plot the two variables against each other and then create a contingency table for them:

## `geom_smooth()` using formula = 'y ~ x'
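The `geom_smooth()` message above mentions only the formula, which is what ggplot2 prints when a method is given explicitly. The sketch below therefore assumes `method = "lm"`; that choice, the object names and the use of only the first ten Bachelor's rows from `str()` are assumptions for illustration.

```r
library(ggplot2)

# First ten Bachelor's rows of Estonian and Russian, as printed by str()
bachelor <- data.frame(
  Estonian = c(82, 31, 215, 173, 124, 571, 25, 266, 206, 67),
  Russian  = c(3, 7, 45, 53, 24, 165, 0, 84, 40, 22)
)

# Scatter plot with a fitted straight line (method = "lm" is an assumption)
p_scatter <- ggplot(bachelor, aes(x = Estonian, y = Russian)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Estonian-speaking students", y = "Russian-speaking students")

print(p_scatter)
```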

From the higher.edu dataset I explored the Estonian and Russian speakers studying for a Bachelor's degree in Estonia. I want to know whether there is a relationship between the number of Estonian speakers and the number of Russian speakers at Bachelor level.

##      
##        0  1  3  5  7 11 14 21 22 24 29 40 45 48 52 53 62 84 92 165 Sum
##   0    5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0   5
##   20   0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   25   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   30   0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   31   0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   54   0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0   0   1
##   55   0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   67   0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0   0   1
##   82   0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   90   0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   113  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   124  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0   0   1
##   165  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0   0   1
##   173  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0   0   1
##   193  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0   0   1
##   194  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0   0   1
##   206  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0   0   1
##   215  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0   0   1
##   266  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0   0   1
##   298  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0   0   1
##   362  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1   0   1
##   371  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0   0   1
##   571  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   1   1
##   Sum  6  1  1  1  1  1  3  1  1  1  1  1  1  1  1  1  1  1  1   1  27
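A table of this shape, with a Sum row and column, can be built with base R's `table()` and `addmargins()`. The sketch below uses only the first ten Bachelor's values printed by `str()`, so it yields a smaller table than the one above; the name `with_margins` is illustrative.

```r
# First ten Bachelor's values, as printed by str() in this report
estonian <- c(82L, 31L, 215L, 173L, 124L, 571L, 25L, 266L, 206L, 67L)
russian  <- c(3L, 7L, 45L, 53L, 24L, 165L, 0L, 84L, 40L, 22L)

# Cross-tabulate: rows are distinct Estonian counts, columns distinct Russian counts
obs.table <- table(estonian, russian)

# Add the "Sum" row and column for display only
with_margins <- addmargins(obs.table)

print(with_margins)
```

Because almost every count is unique, each row and column contains a single 1, which is why the table above is so sparse.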

The contingency table is then visualised with a balloon plot (balloonplot), a graphical matrix in which each cell contains a dot whose size reflects the relative magnitude of the corresponding value.
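A sketch of such a plot, assuming the `balloonplot()` function from the gplots package was used. The title, axis labels and the `label` and `show.margins` settings are illustrative choices, and the data are again only the first ten Bachelor's values printed by `str()`.

```r
library(gplots)

# First ten Bachelor's values, as printed by str() in this report
estonian <- c(82L, 31L, 215L, 173L, 124L, 571L, 25L, 266L, 206L, 67L)
russian  <- c(3L, 7L, 45L, 53L, 24L, 165L, 0L, 84L, 40L, 22L)
obs.table <- table(estonian, russian)

# Each cell becomes a dot sized by its count; numeric labels and margins hidden
balloonplot(obs.table,
            main = "Estonian vs Russian speakers (Bachelor's study)",
            xlab = "Estonian", ylab = "Russian",
            label = FALSE, show.margins = FALSE)
```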

## Warning in chisq.test(obs.table): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  obs.table
## X-squared = 513, df = 460, p-value = 0.04398
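A sketch of the test call on a small subset (the first ten Bachelor's values from `str()`), showing how the degrees of freedom depend on the table that is passed in. One point that may be worth checking in the original code: the contingency table printed above has 23 rows and 20 columns of counts, for which (23 - 1)(20 - 1) = 418 degrees of freedom would be expected, whereas the reported df = 460 equals (24 - 1)(21 - 1). That pattern would arise if the table given to `chisq.test()` still contained the Sum margins. This is an inference from the output, not something the report states.

```r
# First ten Bachelor's values, as printed by str() in this report
estonian <- c(82L, 31L, 215L, 173L, 124L, 571L, 25L, 266L, 206L, 67L)
russian  <- c(3L, 7L, 45L, 53L, 24L, 165L, 0L, 84L, 40L, 22L)
obs.table <- table(estonian, russian)

# Test on the observed counts only (margins excluded).
# suppressWarnings() hides the small-expected-count warning for this demo.
res <- suppressWarnings(chisq.test(obs.table))

# Same call on a table that still includes the Sum row and column
res_margins <- suppressWarnings(chisq.test(addmargins(obs.table)))

# df = (rows - 1) * (columns - 1) of whatever table is passed in
unname(res$parameter)          # (10 - 1) * (10 - 1) = 81
unname(res_margins$parameter)  # (11 - 1) * (11 - 1) = 100
```

When expected counts are small, `chisq.test()` also accepts `simulate.p.value = TRUE`, which replaces the chi-squared approximation with a Monte Carlo p-value.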

From the plot we can see that there are more Bachelor students speaking Estonian than students speaking Russian.

Hypothesis testing.

Null hypothesis: the row and column variables of the contingency table are independent, that is, there is no relationship between Estonian-speaking Bachelor’s students and Russian-speaking Bachelor’s students. Alternative hypothesis: the row and column variables are dependent, that is, there is a relationship between Estonian-speaking Bachelor’s students and Russian-speaking Bachelor’s students.

Report on collected data and sample size.

The data consist of 3 variables: Level.of.study (Bachelor's study only), Estonian and Russian. There are 27 observations. The minimum value for both languages is 0; the maximum is 571 for Estonian speakers and 165 for Russian speakers. The mean is 137.2 for Estonian-speaking Bachelor students and 29.85 for Russian-speaking Bachelor students. The median is 90 for Estonian speakers and 14 for Russian speakers.

##  Level.of.study        Estonian        Russian      
##  Length:27          Min.   :  0.0   Min.   :  0.00  
##  Class :character   1st Qu.: 27.5   1st Qu.:  2.00  
##  Mode  :character   Median : 90.0   Median : 14.00  
##                     Mean   :137.2   Mean   : 29.85  
##                     3rd Qu.:200.0   3rd Qu.: 46.50  
##                     Max.   :571.0   Max.   :165.00

Assumption of the chosen statistical test.

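It is possible that there is some kind of relationship between the number of Bachelor students who speak Estonian and the number who speak Russian. To investigate this, I use a chi-square test of independence. The chi-square approximation assumes that the expected cell counts are large enough (a common rule of thumb is at least 5). With only 27 observations spread over the contingency table this does not hold, which is why R warns that the approximation may be incorrect.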

Result

The test gave a chi-square statistic of 513 with 460 degrees of freedom and a p-value of 0.04398.

Results Interpretation

The p-value is less than the significance level of 0.05. Hence, I reject the null hypothesis that there is no relationship between Estonian-speaking Bachelor students and Russian-speaking Bachelor students.

Significance: p = 0.04398 (p < 0.05)

Conclusion:

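A chi-square test of independence was performed to examine the relationship between Estonian-speaking and Russian-speaking students studying for a Bachelor's degree. The result was statistically significant at the 5% level (X-squared = 513, df = 460, p = 0.04398), so the null hypothesis of independence is rejected: the test indicates a relationship between Estonian-speaking Bachelor students and Russian-speaking Bachelor students.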