In this report I explore the higher.edu dataset,
which is available from the GitHub repository at “https://raw.githubusercontent.com/Maria-13/DataAnalytics_R/main/Assignments/higher_edu_in_Estonia-1.csv”.
The higher.edu data frame includes six variables:
- Level.of.study
- Study.programme.group
- Estonian
- Russian
- Other.mother.tongue
- Mother.tongue.unknown
First, I download the dataset from the GitHub repository and load it into R so that I can explore it.
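The loading step could look roughly like the sketch below. The URL is the one given above; the helper name `read_higher_edu` and the choice of `read.csv()` followed by `tibble::as_tibble()` are my assumptions, since the original code is not shown in this report.

```r
# URL as given in the report.
data_url <- "https://raw.githubusercontent.com/Maria-13/DataAnalytics_R/main/Assignments/higher_edu_in_Estonia-1.csv"

# Hypothetical helper (not from the report): read a CSV from a URL or a
# local path and return it as a tibble. read.csv() makes header names
# syntactic, so a header with spaces would come back with dots instead.
read_higher_edu <- function(src) {
  tibble::as_tibble(read.csv(src, stringsAsFactors = FALSE))
}

# The call itself needs network access, so it is left commented out:
# higher.edu <- read_higher_edu(data_url)
# str(higher.edu)
```

Keeping the reader in a small function makes it easy to point the same code at a local copy of the file when the repository is not reachable.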
#Observation
There are 6 columns, holding the variables listed above, and 108 entries.
- Level.of.study is a character variable with 4 levels
- Study.programme.group is a character variable with 21 levels
- Estonian, Russian, Other.mother.tongue and Mother.tongue.unknown are integer count variables with 108 values each
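A quick way to check the dimensions and the number of distinct values per column is sketched here. To stay self-contained it uses only the first four rows that appear in the `str()` output of this report, so the counts it prints describe that small sample (named `sample_rows`, my own label) and not the full 108-row dataset.

```r
# First four rows exactly as printed in the str() output of the report.
sample_rows <- data.frame(
  Level.of.study        = rep("Bachelor's study", 4),
  Study.programme.group = c("Journalism and information",
                            "Architecture and construction",
                            "Biological and environmental sciences",
                            "Physical sciences"),
  Estonian              = c(82L, 31L, 215L, 173L),
  Russian               = c(3L, 7L, 45L, 53L),
  Other.mother.tongue   = c(0L, 0L, 1L, 20L),
  Mother.tongue.unknown = c(0L, 0L, 0L, 11L),
  stringsAsFactors = FALSE
)

# Rows and columns of the sample.
print(dim(sample_rows))

# Number of distinct values in every column.
n_levels <- sapply(sample_rows, function(x) length(unique(x)))
print(n_levels)
```

Running the same two calls on the full higher.edu data frame would give the figures listed in the observation above.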
Using the dplyr package, I build a summary to view the
median number of Estonian speakers for each level of study
## Level.of.study Estonian
## 1 Bachelor's study 90
## 2 Doctoral study 4
## 3 Integrated Bachelor's/Master's study 0
## 4 Master's study 53
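A summary of this shape can be produced with `group_by()` and `summarise()`. The sketch below assumes the statistic is the median, because the Bachelor's value of 90 matches the median reported later for that level. The data frame `toy` is a made-up illustration, not the real dataset, so its numbers differ from the table above.

```r
library(dplyr)

# Made-up illustrative rows; only the column names and level labels
# follow the report.
toy <- tibble(
  Level.of.study = c("Bachelor's study", "Bachelor's study", "Bachelor's study",
                     "Master's study", "Master's study", "Master's study",
                     "Doctoral study"),
  Estonian = c(82L, 31L, 215L, 40L, 60L, 53L, 4L)
)

# One row per level of study, holding the median of the Estonian column.
by_level <- toy %>%
  group_by(Level.of.study) %>%
  summarise(Estonian = median(Estonian), .groups = "drop")

print(by_level)
```

The same pattern works for any other count column by swapping the column name inside `summarise()`.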
To view the structure and summary statistics of the entire dataset, including the Estonian speakers:
## tibble [108 × 6] (S3: tbl_df/tbl/data.frame)
## $ Level.of.study : chr [1:108] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
## $ Study.programme.group: chr [1:108] "Journalism and information" "Architecture and construction" "Biological and environmental sciences" "Physical sciences" ...
## $ Estonian : int [1:108] 82 31 215 173 124 571 25 266 206 67 ...
## $ Russian : int [1:108] 3 7 45 53 24 165 0 84 40 22 ...
## $ Other.mother.tongue : int [1:108] 0 0 1 20 5 14 0 7 8 1 ...
## $ Mother.tongue.unknown: int [1:108] 0 0 0 11 9 23 0 0 23 0 ...
## Level.of.study Study.programme.group Estonian Russian
## Length:108 Length:108 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 12.50 Median : 1.00
## Mean : 66.64 Mean : 12.57
## 3rd Qu.: 74.25 3rd Qu.: 13.00
## Max. :571.00 Max. :165.00
## Other.mother.tongue Mother.tongue.unknown
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000
## Mean : 7.231 Mean : 6.583
## 3rd Qu.: 7.250 3rd Qu.: 4.000
## Max. :103.000 Max. :88.000
In the same way, I build a summary to view the median number of Russian speakers for each level of study
## Level.of.study Russian
## 1 Bachelor's study 14
## 2 Doctoral study 0
## 3 Integrated Bachelor's/Master's study 0
## 4 Master's study 7
##Data Preparation
I am using the dplyr package to transform the dataset and
create a new dataset named myworkdata that keeps only the
Bachelor’s study rows and the Level.of.study, Estonian and Russian columns
## tibble [27 × 3] (S3: tbl_df/tbl/data.frame)
## $ Level.of.study: chr [1:27] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
## $ Estonian : int [1:27] 82 31 215 173 124 571 25 266 206 67 ...
## $ Russian : int [1:27] 3 7 45 53 24 165 0 84 40 22 ...
## Level.of.study Estonian Russian
## Length:27 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 27.5 1st Qu.: 2.00
## Mode :character Median : 90.0 Median : 14.00
## Mean :137.2 Mean : 29.85
## 3rd Qu.:200.0 3rd Qu.: 46.50
## Max. :571.0 Max. :165.00
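The transformation described above can be sketched with `filter()` and `select()`. The input `toy` is a small made-up stand-in for higher.edu (its Master's row is invented for illustration), and the exact calls used in the original report are an assumption on my part.

```r
library(dplyr)

# Made-up stand-in for higher.edu with two levels of study.
toy <- tibble(
  Level.of.study        = c("Bachelor's study", "Bachelor's study", "Master's study"),
  Study.programme.group = c("Journalism and information",
                            "Architecture and construction",
                            "Physical sciences"),
  Estonian              = c(82L, 31L, 40L),
  Russian               = c(3L, 7L, 5L)
)

# Keep only Bachelor's rows, then only the three columns of interest.
myworkdata <- toy %>%
  filter(Level.of.study == "Bachelor's study") %>%
  select(Level.of.study, Estonian, Russian)

str(myworkdata)
summary(myworkdata)
```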
I am visualising the distribution of the selected variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
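The `stat_bin()` message above is what ggplot2 prints when a histogram is drawn without an explicit bin width. A minimal sketch that sets one is shown here; it uses the ten Bachelor's values visible in the `str()` output, and the bin width of 50 and the object names are my own choices, not values from the report.

```r
library(ggplot2)

# First ten Bachelor's rows as printed in the str() output of the report.
bachelor <- data.frame(
  Estonian = c(82, 31, 215, 173, 124, 571, 25, 266, 206, 67),
  Russian  = c(3, 7, 45, 53, 24, 165, 0, 84, 40, 22)
)

# Histogram of Estonian speakers; binwidth = 50 is an illustrative choice
# that also silences the "Pick better value" message.
p_hist <- ggplot(bachelor, aes(x = Estonian)) +
  geom_histogram(binwidth = 50) +
  labs(x = "Estonian speakers per programme group", y = "Count")

print(p_hist)
```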
For the analysis, the following research question is
formulated:
Is there a statistically significant relationship between Estonian-speaking
Bachelor’s students and Russian-speaking Bachelor’s students?
To answer this question, I first plot the two variables against each other and then create a contingency table for them:
## `geom_smooth()` using formula = 'y ~ x'
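As far as I know, ggplot2 prints the `geom_smooth()` message above when a smoothing method is given without a formula, so a scatterplot along the following lines may have produced it. The linear method and the object names are assumptions; the data are the ten Bachelor's rows visible in the `str()` output.

```r
library(ggplot2)

# First ten Bachelor's rows as printed in the str() output of the report.
bachelor <- data.frame(
  Estonian = c(82, 31, 215, 173, 124, 571, 25, 266, 206, 67),
  Russian  = c(3, 7, 45, 53, 24, 165, 0, 84, 40, 22)
)

# Scatterplot with a fitted straight line; giving the formula explicitly
# avoids the informational message.
p_scatter <- ggplot(bachelor, aes(x = Estonian, y = Russian)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Estonian speakers", y = "Russian speakers")

print(p_scatter)
```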
In the higher.edu dataset I explored the Estonian and Russian speakers studying for a Bachelor’s degree in Estonia. I want to know whether there is a relationship between the number of Estonian-speaking Bachelor’s students and the number of Russian-speaking Bachelor’s students.
##
## 0 1 3 5 7 11 14 21 22 24 29 40 45 48 52 53 62 84 92 165 Sum
## 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5
## 20 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 25 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 30 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 31 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 54 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
## 55 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 67 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
## 82 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 90 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 113 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 124 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
## 165 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 173 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
## 193 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
## 194 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
## 206 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## 215 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
## 266 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
## 298 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
## 362 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
## 371 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
## 571 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## Sum 6 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 27
The contingency table is then visualised with a balloon plot (balloonplot), a graphical matrix in which each cell contains a dot whose size reflects the relative magnitude of the corresponding component.
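The table and the balloon plot could be reproduced roughly as follows. The 27 value pairs are read off the contingency table above (each non-zero cell gives one Estonian/Russian pair), so their order differs from the original data. The use of `table()`, `addmargins()` and `gplots::balloonplot()` is my assumption about how the output was generated, and the plot is only drawn if gplots is installed.

```r
# The 27 Bachelor's observations, reconstructed from the contingency table.
estonian <- c(0, 0, 0, 0, 0, 20, 25, 30, 31, 54, 55, 67, 82, 90, 113, 124,
              165, 173, 193, 194, 206, 215, 266, 298, 362, 371, 571)
russian  <- c(0, 0, 0, 0, 0, 1, 0, 14, 7, 29, 14, 22, 3, 5, 14, 24,
              11, 53, 48, 21, 40, 45, 84, 52, 92, 62, 165)

# Cross-tabulate the two variables, then append the "Sum" row and column.
counts    <- table(estonian, russian)
obs.table <- addmargins(counts)
print(obs.table)

# Balloon plot of the cell counts (without margins), if gplots is available.
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::balloonplot(counts, main = "Estonian vs Russian speakers",
                      xlab = "", ylab = "", label = FALSE,
                      show.margins = FALSE)
}
```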
## Warning in chisq.test(obs.table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: obs.table
## X-squared = 513, df = 460, p-value = 0.04398
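A sketch of the test call is given below, using the 27 value pairs read off the contingency table. The reported df = 460 equals (24 - 1) x (21 - 1), which suggests the test was run on obs.table with its Sum row and column included; the sketch therefore shows that call next to one on the table without margins, where df is (23 - 1) x (20 - 1) = 418. Variable names other than obs.table are my own.

```r
# The 27 Bachelor's observations, reconstructed from the contingency table.
estonian <- c(0, 0, 0, 0, 0, 20, 25, 30, 31, 54, 55, 67, 82, 90, 113, 124,
              165, 173, 193, 194, 206, 215, 266, 298, 362, 371, 571)
russian  <- c(0, 0, 0, 0, 0, 1, 0, 14, 7, 29, 14, 22, 3, 5, 14, 24,
              11, 53, 48, 21, 40, 45, 84, 52, 92, 62, 165)

counts    <- table(estonian, russian)   # 23 x 20 table of cell counts
obs.table <- addmargins(counts)         # 24 x 21 table with "Sum" margins

# Both calls warn that the chi-squared approximation may be incorrect,
# because the expected counts in this sparse table are very small.
with_margins    <- chisq.test(obs.table)
without_margins <- chisq.test(counts)

print(with_margins)
print(without_margins)

# Smallest expected cell count, to illustrate why R issues the warning.
print(min(without_margins$expected))
```

The two calls give the same statistic but different degrees of freedom, so the p-value depends on whether the margins are passed to `chisq.test()`.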
#From the plot we can see that there are more Bachelor students speaking Estonian than students speaking Russian.
4.1 Null hypothesis: the row and column variables of the contingency table are independent, i.e. there is no relationship between them. Alternative hypothesis: the row and column variables are dependent, i.e. there is a relationship between Estonian-speaking Bachelor’s students and Russian-speaking Bachelor’s students.
The data consist of 3 variables: the Bachelor’s level of study and the numbers of Estonian-speaking and Russian-speaking students. There are 27 observations. The minimum value for both languages is 0; the maximum is 571 for Estonian speakers and 165 for Russian speakers. The mean is 137.2 for Estonian speakers and 29.85 for Russian speakers, and the median is 90 for Estonian speakers and 14 for Russian speakers.
## Level.of.study Estonian Russian
## Length:27 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 27.5 1st Qu.: 2.00
## Mode :character Median : 90.0 Median : 14.00
## Mean :137.2 Mean : 29.85
## 3rd Qu.:200.0 3rd Qu.: 46.50
## Max. :571.0 Max. :165.00
It is possible that there is some kind of relationship between the Bachelor’s students who speak Estonian and the Bachelor’s students who speak Russian. To investigate this, I use a chi-square test of independence.
The test gave a chi-square statistic of 513 with 460 degrees of freedom and a p-value of 0.04398.
#4.5 Results Interpretation. The p-value is less than the significance level of 0.05. I therefore reject the null hypothesis that there is no relationship between Estonian-speaking Bachelor’s students and Russian-speaking Bachelor’s students.
#p-value = 0.04398 (p < 0.05)
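A chi-square test of independence was performed to examine the
relation between Estonian-speaking and Russian-speaking
Bachelor’s students. The result (X-squared = 513, df = 460, p = 0.04398)
indicates a statistically significant relationship between the two groups
at the 5% level. Because R warned that the chi-squared approximation
may be incorrect, this result should be interpreted with caution.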