In this report, I explore the ‘higher.edu’ dataset from the Estonian statistics website. Additional information can be found here: https://raw.githubusercontent.com/Maria-13/DataAnalytics_R/main/Assignments/higher_edu_in_Estonia-1.csv
The ‘higher.edu’ data frame includes six variables:
- ‘Level.of.study’
- ‘Study.programme.group’
- ‘Estonian’
- ‘Russian’
- ‘Other.mother.tongue’
- ‘Mother.tongue.unknown’
First, I load the packages needed to explore the dataset.
There are 108 entries and 6 variables. The ‘Level.of.study’ variable has four levels and the ‘Study.programme.group’ variable has 21 levels. The ‘Estonian’, ‘Russian’, ‘Other.mother.tongue’, and ‘Mother.tongue.unknown’ variables are integer counts with 108 values each.
I will explore the structure and content of the dataset, including the number of rows and columns, the data types of each variable, and the presence of missing values.
I will also use the summary function to generate descriptive statistics for each variable, including the minimum, quartiles, median, mean, and maximum.
## tibble [108 × 6] (S3: tbl_df/tbl/data.frame)
## $ Level.of.study : chr [1:108] "Bachelor's study" "Bachelor's study" "Bachelor's study" "Bachelor's study" ...
## $ Study.programme.group: chr [1:108] "Journalism and information" "Architecture and construction" "Biological and environmental sciences" "Physical sciences" ...
## $ Estonian : int [1:108] 82 31 215 173 124 571 25 266 206 67 ...
## $ Russian : int [1:108] 3 7 45 53 24 165 0 84 40 22 ...
## $ Other.mother.tongue : int [1:108] 0 0 1 20 5 14 0 7 8 1 ...
## $ Mother.tongue.unknown: int [1:108] 0 0 0 11 9 23 0 0 23 0 ...
## Level.of.study Study.programme.group Estonian Russian
## Length:108 Length:108 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 12.50 Median : 1.00
## Mean : 66.64 Mean : 12.57
## 3rd Qu.: 74.25 3rd Qu.: 13.00
## Max. :571.00 Max. :165.00
## Other.mother.tongue Mother.tongue.unknown
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000
## Mean : 7.231 Mean : 6.583
## 3rd Qu.: 7.250 3rd Qu.: 4.000
## Max. :103.000 Max. :88.000
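The inspection step that produces output like the above can be sketched as follows. This is only an illustration: the four rows are the first four observations as printed by `str()` in this report, the object name `sample_rows` is hypothetical, and the commented `read.csv()` call assumes a comma-separated file.

```r
# First four rows of the data, copied from the str() output in this report.
# `sample_rows` is a hypothetical name used only for this sketch.
sample_rows <- data.frame(
  Level.of.study = rep("Bachelor's study", 4),
  Study.programme.group = c(
    "Journalism and information",
    "Architecture and construction",
    "Biological and environmental sciences",
    "Physical sciences"
  ),
  Estonian = c(82L, 31L, 215L, 173L),
  Russian = c(3L, 7L, 45L, 53L),
  Other.mother.tongue = c(0L, 0L, 1L, 20L),
  Mother.tongue.unknown = c(0L, 0L, 0L, 11L)
)

# For the full data set one would read the CSV instead, for example:
# higher.edu <- read.csv("<path or URL of the CSV>")  # separator is an assumption

str(sample_rows)       # dimensions and the type of each column
summary(sample_rows)   # min, quartiles, median, mean, max for numeric columns

# Number of missing values per column
na_counts <- colSums(is.na(sample_rows))
print(na_counts)
```

On the full data frame the same three calls give the dimensions, column types, descriptive statistics and missing-value counts in one pass.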
Using the ‘dplyr’ package, I build a summary of ‘Estonian’ by level of study. The values shown match the median rather than the mean (for example, 53 for Master’s study, where the mean is 103.9).
## Level.of.study Estonian
## 1 Bachelor's study 90
## 2 Doctoral study 4
## 3 Integrated Bachelor's/Master's study 0
## 4 Master's study 53
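A grouped summary of this kind might look like the sketch below. Two assumptions are worth flagging: `median()` is used because the value 53 for Master's study equals the median reported later in this report, and the data frame `edu` is a partial reconstruction (the 27 Master's values are read off the contingency table in this report, while the Bachelor's group contains only the four rows printed by `str()`), so only the Master's figure is expected to match.

```r
library(dplyr)

# Partial, reconstructed data: `edu` is a hypothetical name for this sketch.
edu <- data.frame(
  Level.of.study = c(rep("Bachelor's study", 4), rep("Master's study", 27)),
  Estonian = c(
    82, 31, 215, 173,                                   # first four Bachelor's rows
    0, 0, 7, 12, 18, 22, 26, 30, 30, 41, 43, 46, 50, 53, # Master's rows
    63, 70, 74, 75, 105, 106, 121, 125, 150, 198, 353, 455, 531
  )
)

# One row per level of study, holding the median Estonian count
by_level <- edu %>%
  group_by(Level.of.study) %>%
  summarise(Estonian = median(Estonian), .groups = "drop")

print(by_level)
```

Swapping `median()` for `mean()` gives the group averages instead; for skewed counts like these the two can differ a lot.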
In the same way, I build a summary of ‘Other.mother.tongue’ by level of study.
## Level.of.study Other.mother.tongue
## 1 Bachelor's study 1
## 2 Doctoral study 1
## 3 Integrated Bachelor's/Master's study 0
## 4 Master's study 9
I choose three variables to work with: Level of study, Estonian and Russian, restricted to Master’s study (27 observations). In this subset, the mean for Estonian speakers is 103.9 and the median is 53, while the mean for Russian speakers is 16.22 and the median is 7.
## Level.of.study Estonian Russian
## Length:27 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 28.0 1st Qu.: 1.50
## Mode :character Median : 53.0 Median : 7.00
## Mean :103.9 Mean :16.22
## 3rd Qu.:113.5 3rd Qu.:18.50
## Max. :531.0 Max. :85.00
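As a cross-check, the subsetting step can be sketched in base R. The 27 Master's pairs below are reconstructed from the contingency table shown later in this report; the four Bachelor's rows come from the `str()` output; the names `edu` and `masters` are hypothetical. `subset()` is used here as a stand-in, since the report does not show how the selection was actually done.

```r
# Reconstructed (Estonian, Russian) pairs for Master's study, read off the
# contingency table in this report.
estonian <- c(0, 0, 7, 12, 18, 22, 26, 30, 30, 41, 43, 46, 50, 53,
              63, 70, 74, 75, 105, 106, 121, 125, 150, 198, 353, 455, 531)
russian  <- c(0, 0, 2, 0, 2, 0, 0, 1, 7, 7, 4, 4, 0, 9,
              3, 13, 9, 13, 18, 11, 22, 19, 20, 37, 79, 85, 73)

edu <- data.frame(
  Level.of.study = c(rep("Bachelor's study", 4), rep("Master's study", 27)),
  Estonian = c(82, 31, 215, 173, estonian),
  Russian  = c(3, 7, 45, 53, russian)
)

# Keep only Master's rows and the three variables of interest
masters <- subset(
  edu,
  Level.of.study == "Master's study",
  select = c(Level.of.study, Estonian, Russian)
)

summary(masters)
```

The reconstructed subset reproduces the reported means and medians, which suggests the pairs were read off the table correctly.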
I visualise the selected variable with a histogram.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
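The message above appears when a histogram is drawn without an explicit bin width. A minimal sketch of setting one is shown here, assuming the plotted variable is the Master's ‘Estonian’ count (the report does not say which variable was plotted) and using an arbitrary, hypothetical `binwidth` of 50. The values are reconstructed from the contingency table in this report.

```r
library(ggplot2)

# Master's 'Estonian' counts, reconstructed from the contingency table
masters <- data.frame(
  Estonian = c(0, 0, 7, 12, 18, 22, 26, 30, 30, 41, 43, 46, 50, 53,
               63, 70, 74, 75, 105, 106, 121, 125, 150, 198, 353, 455, 531)
)

# Setting binwidth explicitly avoids the default of 30 bins;
# 50 is an illustrative choice, not a recommendation.
p <- ggplot(masters, aes(x = Estonian)) +
  geom_histogram(binwidth = 50, boundary = 0) +
  labs(
    x = "Estonian-speaking students per programme group",
    y = "Number of programme groups"
  )

print(p)
```

A wider or narrower `binwidth` changes how much of the right skew is visible, so it is worth trying a few values.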
Based on the dataset, I formulate my research question for a chi-square test of independence: is there a significant relationship between the numbers of Estonian and Russian speakers at the Master’s level of study?
To answer this question, we need to create a contingency table for these two variables:
## `geom_smooth()` using formula = 'y ~ x'
In the dataset, I explored the Master’s study in Level.of.study for Estonian and Russian language speakers in Estonia. I will be analysing the data to check if there is a relationship between both variables.
##
## 0 1 2 3 4 7 9 11 13 18 19 20 22 37 73 79 85 Sum
## 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
## 7 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 12 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 18 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 22 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 26 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 30 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2
## 41 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
## 43 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
## 46 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
## 50 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 53 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
## 63 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 70 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## 74 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
## 75 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
## 105 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
## 106 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
## 121 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
## 125 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
## 150 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
## 198 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
## 353 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
## 455 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## 531 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
## Sum 6 1 2 1 2 2 2 1 2 1 1 1 1 1 1 1 1 27
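A table of this shape can be built with `table()` and `addmargins()`. The sketch below uses the 27 Master's pairs as read off the table above; the names `tab` and `with_margins` are hypothetical.

```r
# (Estonian, Russian) pairs for Master's study, read off the table above
estonian <- c(0, 0, 7, 12, 18, 22, 26, 30, 30, 41, 43, 46, 50, 53,
              63, 70, 74, 75, 105, 106, 121, 125, 150, 198, 353, 455, 531)
russian  <- c(0, 0, 2, 0, 2, 0, 0, 1, 7, 7, 4, 4, 0, 9,
              3, 13, 9, 13, 18, 11, 22, 19, 20, 37, 79, 85, 73)

# Cross-tabulate: rows are distinct Estonian counts, columns distinct Russian counts
tab <- table(estonian, russian)

# Add the "Sum" row and column for display
with_margins <- addmargins(tab)
print(with_margins)
```

Because each distinct count is treated as its own category, almost every cell is 0 or 1, which is why the table is so sparse.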
Let us visualise the contingency table with a balloon plot, a graphical matrix in which each cell contains a dot whose size reflects the relative magnitude of the corresponding value.
## Warning in chisq.test(obs.table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: obs.table
## X-squared = 411.75, df = 425, p-value = 0.6687
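The test can be reproduced from the contingency table with the sketch below. One detail is worth checking: df = 425 equals (26 − 1) × (18 − 1), which is what results when the "Sum" row and column are part of the tested table, whereas the 25 × 17 table without margins has df = (25 − 1) × (17 − 1) = 384. Whether `obs.table` actually included the margins is an inference from that number, not something the report states. The names `tab`, `res_no_margins` and `res_margins` are hypothetical.

```r
# (Estonian, Russian) pairs for Master's study, read off the contingency table
estonian <- c(0, 0, 7, 12, 18, 22, 26, 30, 30, 41, 43, 46, 50, 53,
              63, 70, 74, 75, 105, 106, 121, 125, 150, 198, 353, 455, 531)
russian  <- c(0, 0, 2, 0, 2, 0, 0, 1, 7, 7, 4, 4, 0, 9,
              3, 13, 9, 13, 18, 11, 22, 19, 20, 37, 79, 85, 73)

tab <- table(estonian, russian)

# Test on the table itself (no margins). suppressWarnings() only hides the
# "approximation may be incorrect" warning; the small expected counts remain.
res_no_margins <- suppressWarnings(chisq.test(tab))

# Test on the table with the Sum row and column included, for comparison
res_margins <- suppressWarnings(chisq.test(addmargins(tab)))

res_no_margins$parameter   # df = 384
res_margins$parameter      # df = 425, as in the output above
res_no_margins$p.value
res_margins$p.value
```

The statistic is 411.75 either way, and both p-values stay above 0.05, so the decision about the null hypothesis is the same; only the degrees of freedom and the exact p-value differ.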
The plot shows that there are more Estonian-speaking than Russian-speaking students at the Master’s level of study.
Null hypothesis: the row and column variables of the contingency table are independent, that is, there is no relationship between them.
Alternative hypothesis: the row and column variables are dependent, that is, there is a relationship between Estonian-speaking and Russian-speaking Master’s students.
The selected dataset consists of 3 variables: Level of study (restricted to Master’s), Estonian and Russian. It includes 27 observations. The minimum value for both languages is 0; the maximum is 531 for Estonian speakers and 85 for Russian speakers. The mean is 103.9 for Estonian-speaking and 16.22 for Russian-speaking Master’s students, and the medians are 53 and 7 respectively.
## Level.of.study Estonian Russian
## Length:27 Min. : 0.0 Min. : 0.00
## Class :character 1st Qu.: 28.0 1st Qu.: 1.50
## Mode :character Median : 53.0 Median : 7.00
## Mean :103.9 Mean :16.22
## 3rd Qu.:113.5 3rd Qu.:18.50
## Max. :531.0 Max. :85.00
It is possible that there is some relationship between the numbers of Estonian and Russian speakers at the Master’s level of study. To investigate this, I used a chi-square test of independence. The test gives a p-value of 0.669, which is greater than the significance level of 0.05. I therefore fail to reject the null hypothesis that there is no relationship between Estonian-speaking and Russian-speaking Master’s students. Because R warned that the chi-squared approximation may be incorrect, this result should be interpreted with caution.
Test result: p = 0.669, which is above the 5% significance level.
A chi-square test of independence was performed to examine the relation between Estonian-speaking and Russian-speaking Master’s students. The test found no statistically significant relationship between the two groups (X-squared = 411.75, df = 425, p = 0.669).