R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Your full name: Zhenyu Wang
# Date: 18th Oct 2022

####### TASK 1: 5 points. Short EDA report ###############

# The main aim of the task is to import data into R, perform brief exploratory analysis, build at least one plot,
# export data using R Markdown and 'knit' by creating an HTML file
?suppressWarnings
# 1. Go to https://andmed.stat.ee/en/stat and create a data set you want to explore. Import the data into R.
df.edu_in_EST <- read.csv("~/Downloads/HT71_20221018-130159.csv", header = FALSE, comment.char = "#")
# 2. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.

# EDA is the process of analysing and visualising the data to get a better understanding of it and glean insight from it. Common steps: import, clean, process, and visualise the data
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
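# These packages are already attached by library(tidyverse) above; the explicit calls are redundant but harmless
library(ggplot2)
library(readr)
library(tibble)
library(tidyr)
library(dplyr)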

?read.csv
?read_csv
# read_csv() imports data into R as a tibble, while read.csv() returns a regular data frame. read_csv() is usually faster on large files, and tibbles are convenient because they never change input types or variable names (non-syntactic names such as `Other mother tongue` are kept as they are), make list columns easy to use, and never create row names
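
To illustrate the point about variable names, here is a minimal base R sketch. The CSV text is a made-up excerpt shaped like the dropout data, not the real file. By default read.csv() rewrites a header that is not a syntactic R name, whereas read_csv() kept `Other mother tongue` unchanged in the column specification printed below.

```r
# Hypothetical two-row excerpt shaped like the dropout data (not the real file)
csv_text <- "Year,Other mother tongue\n2014,57\n2015,29"

# Default behaviour: check.names = TRUE passes the header through make.names(),
# so the spaces become dots: "Year" and "Other.mother.tongue"
default_names <- names(read.csv(text = csv_text))
default_names

# check.names = FALSE keeps the header exactly as written:
# "Year" and "Other mother tongue"
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))
kept_names
```

With the renamed column, `df$Other.mother.tongue` would be needed instead of the backtick form used in this report.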
df.edu_dropouts <- read_csv("Downloads/HT307_20221018-204420.csv")
## Rows: 16 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Sex, Level of study
## dbl (4): Year, Estonian, Russian, Other mother tongue
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
?View
?view
## Help on topic 'view' was found in the following packages:
## 
##   Package               Library
##   tibble                /Library/Frameworks/R.framework/Versions/4.1/Resources/library
##   dplyr                 /Library/Frameworks/R.framework/Versions/4.1/Resources/library
## 
## 
## Using the first match ...
view(df.edu_dropouts)
View(df.edu_dropouts)

Including Plots

You can also embed plots, for example:

# Q: What metadata is available for this data set? 
#    Are the descriptions of variables provided? 
#    What do we know about the population / sampling?
#    What are the observed population, the observation unit and the reference period?
#   What are the data types of the variables? Do we need to change them?
#   What are the frequency distributions of these variables? 
#   What are the measures of central tendency and dispersion? Anything that surprises you?
#   Are there any outliers? Are there any values that look like errors?
#   What is the mean for each variable?
#   Are there any Null / NA values? 

# I selected the following questions.
# What are the data types of the variables? Do we need to change them?
# There are two data types: double and character. We don't need to change them because the tibble shows that every column's type matches the values it holds

# What is the mean for each variable?
browseVignettes("tibble")
## starting httpd help server ... done
row_sex <- factor(df.edu_dropouts$Sex)
data_row_males <- df.edu_dropouts[row_sex == "Males", ]
mean_row_males_EST <- mean(data_row_males$Estonian)
# 545
mean_row_males_RUS <- mean(data_row_males$Russian)
# 107
mean_row_males_OML <- mean(data_row_males$`Other mother tongue`)
# 52
data_row_females <- df.edu_dropouts[row_sex == "Females", ]
mean_row_females_EST <- mean(data_row_females$Estonian)
# 811
mean_row_females_RUS <- mean(data_row_females$Russian)
# 140
mean_row_females_OML <- mean(data_row_females$`Other mother tongue`)
# 19

# another way to calculate means quickly
colMeans(data_row_males[, 4:6])
##            Estonian             Russian Other mother tongue 
##              544.75              106.75               52.25
colMeans(data_row_females[, 4:6])
##            Estonian             Russian Other mother tongue 
##              811.25              139.75               19.25
sd(data_row_males$Estonian)
## [1] 77.73168
# 78
sd(data_row_males$Russian)
## [1] 35.6561
# 36
sd(data_row_males$`Other mother tongue`)
## [1] 19.84763
# 20
sd(data_row_females$Estonian)
## [1] 98.14676
# 98
sd(data_row_females$Russian)
## [1] 40.61228
# 41
sd(data_row_females$`Other mother tongue`)
## [1] 8.631338
# 9

# Everything above calculates means and standard deviations with base R functions, without using ggplot2

# What are the measures of central tendency and dispersion? Anything that surprises you?
# Are there any outliers? Are there any values that look like errors?
boxplot(data_row_males$Estonian)

# From the boxplot we can learn that there are no outliers. The range is around 250 (max is around 700; min is around 450). The median is around 525. The data appear right-skewed: extreme values are more likely at the high end than at the low end.
boxplot(data_row_males$Russian)

# From the boxplot we can learn that there are no outliers. The range is around 100 (max is around 160; min is around 60). The median is around 100. The data appear right-skewed: extreme values are more likely at the high end than at the low end.
boxplot(data_row_males$`Other mother tongue`)

# From the boxplot we can learn that there are no outliers. The range is around 55 (max is around 85; min is around 30). The median is around 48. The data appear right-skewed: extreme values are more likely at the high end than at the low end.
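
The readings taken from the three male boxplots can be cross-checked numerically. The sketch below re-enters the male rows from the tibble printed further down in this report (the data frame name `males` and the short column name `Other` are only for illustration) and compares mean and median as a rough skewness check. A mean above the median is consistent with right skew, though with only eight observations per series this is an indication rather than a test.

```r
# Male master's dropouts, 2014 to 2021, copied from the printed tibble
males <- data.frame(
  Estonian = c(606, 695, 567, 501, 518, 490, 446, 535),
  Russian  = c(154, 161, 120, 96, 83, 80, 61, 99),
  Other    = c(57, 29, 44, 33, 48, 45, 77, 85)
)

# One column of summary figures per language group
summary_males <- sapply(males, function(x) {
  c(min = min(x), median = median(x), mean = mean(x), max = max(x),
    range = max(x) - min(x))
})
summary_males

# TRUE for a column when its mean exceeds its median
summary_males["mean", ] > summary_males["median", ]
```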
boxplot(data_row_females$Estonian)

# From the boxplot we can learn that there are no outliers. The range is around 290 (max is around 950; min is around 660). The median is around 780. The data appear right-skewed: extreme values are more likely at the high end than at the low end. Compared with male dropouts, this data set has a higher median (around 780 versus around 525) and a wider range (around 290 versus around 250).
boxplot(data_row_females$Russian)

# From the boxplot we can learn that there are no outliers. The range is around 120 (max is around 210; min is around 90). The median is around 125. The data appear right-skewed: extreme values are more likely at the high end than at the low end.
boxplot(data_row_females$`Other mother tongue`)

# From the boxplot we can learn that there is one outlier: the maximum value, which is around 40. The range is 28. The median is around 17. The data appear right-skewed: extreme values are more likely at the high end than at the low end.

# The data for female dropouts whose native language is neither Estonian nor Russian are generally skewed left; by contrast, the data for female dropouts whose mother tongue is Russian are approximately skewed right.

# Are there any Null / NA values?
# There are no NULL / NA values
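
The absence of missing values can be confirmed in code rather than by eye. A minimal sketch, using the first three male rows printed further down plus a deliberately damaged copy (`with_gap`, a made-up object) to show what a positive result looks like:

```r
# First three male rows, copied from the printed tibble
excerpt <- data.frame(
  Year     = c(2014, 2015, 2016),
  Estonian = c(606, 695, 567),
  Russian  = c(154, 161, 120)
)
anyNA(excerpt)    # FALSE: no missing values in the excerpt

# Made-up copy with one value removed, for illustration only
with_gap <- excerpt
with_gap$Russian[2] <- NA
anyNA(with_gap)   # TRUE

# Number of missing values in each column
na_per_column <- colSums(is.na(with_gap))
na_per_column
```

Applied to the full data set, the same check would be `colSums(is.na(df.edu_dropouts))`.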


# 3. Provide brief descriptive statistical analysis of your data set (like measures of central tendency and dispersion).

# 4. Include at least one plot into your report. 
data_row_males
## # A tibble: 8 × 6
##    Year Sex   `Level of study` Estonian Russian `Other mother tongue`
##   <dbl> <chr> <chr>               <dbl>   <dbl>                 <dbl>
## 1  2014 Males Master's study        606     154                    57
## 2  2015 Males Master's study        695     161                    29
## 3  2016 Males Master's study        567     120                    44
## 4  2017 Males Master's study        501      96                    33
## 5  2018 Males Master's study        518      83                    48
## 6  2019 Males Master's study        490      80                    45
## 7  2020 Males Master's study        446      61                    77
## 8  2021 Males Master's study        535      99                    85
theme_set(theme_minimal())

num_data_males <- data_row_males %>% 
  select(Year, Estonian, Russian, `Other mother tongue`) %>%
  pivot_longer(cols = c("Estonian", "Russian", "Other mother tongue"),
               names_to = "Native language",
               values_to = "the number of dropouts")
num_data_males
## # A tibble: 24 × 3
##     Year `Native language`   `the number of dropouts`
##    <dbl> <chr>                                  <dbl>
##  1  2014 Estonian                                 606
##  2  2014 Russian                                  154
##  3  2014 Other mother tongue                       57
##  4  2015 Estonian                                 695
##  5  2015 Russian                                  161
##  6  2015 Other mother tongue                       29
##  7  2016 Estonian                                 567
##  8  2016 Russian                                  120
##  9  2016 Other mother tongue                       44
## 10  2017 Estonian                                 501
## # … with 14 more rows
# %>% (the pipe operator) is not built into base R; it comes from the magrittr package and is re-exported by dplyr and tidyr. It forwards a value, or the result of an expression, into the next function call as its first argument
# select() is used here to keep only the named columns of the data set
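
A minimal sketch of what the pipe does, assuming the magrittr package is installed (it comes with the tidyverse). The numbers are arbitrary and only serve to show that the piped and nested forms are the same computation.

```r
library(magrittr)

values <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- sum(sqrt(values))

# Piped form: read from left to right; each result becomes
# the first argument of the next call
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```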

?gather
?pivot_longer

# gather() has been superseded by pivot_longer(). pivot_longer() increases the number of rows and decreases the number of columns. Had the columns only been selected, without this function, there would have been four columns. With it, the data set is reorganised into three columns with new names
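
The reshaping can be seen on a small scale. The sketch below assumes tidyr is installed and uses only two years and two language columns taken from the male table. The object names `wide` and `long` are illustrative.

```r
library(tidyr)

# Two rows, one column per language (wide form)
wide <- data.frame(
  Year     = c(2014, 2015),
  Estonian = c(606, 695),
  Russian  = c(154, 161)
)

# One row per year and language (long form)
long <- pivot_longer(wide,
                     cols = c("Estonian", "Russian"),
                     names_to = "Native language",
                     values_to = "the number of dropouts")

dim(wide)   # 2 rows, 3 columns
dim(long)   # 4 rows, 3 columns
long
```

The long form is what lets a single `aes(color = ...)` mapping draw one line per language.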


# line graph is used to visualise a trend in data over intervals of time

ggplot(data = num_data_males, aes(x = Year, y = `the number of dropouts`)) +
  geom_line(aes(color = `Native language`)) +
  scale_color_manual(values = c("blue", "red", "black"))

# An aesthetic is a visual property of the objects in the plot: which variables are shown on the x- and y-axes (and hence the default axis names), the x and y positions of points and lines, and the size, shape, or colour of points. Particular sizes, shapes, and colours can be called the levels of these aesthetic properties

# The line graph shows how the number of dropouts by mother tongue rose and fell over the 8 years. Dropouts whose mother tongue is Russian or another language stayed below 200 throughout. The number of dropouts whose native language is Russian decreased from 2015 to 2020 and rose again to about 100 in 2021. By contrast, the number of dropouts whose mother tongue is neither Estonian nor Russian increased overall from 2015, after an initial decrease. However, the fluctuations in both groups are far smaller than those among dropouts whose mother tongue is Estonian. That group peaked in 2015, when around 700 students left their master's studies. From 2015 to 2017 the trend shows a sharp decrease; in 2018 the number of dropouts increased slightly. Over the following 2 years the number decreased quickly until 2020. In 2021 it rose to around 550.

# If ggplot2 is too complicated for you now, create a plot with R base functions.
# 5. Create a pdf or html file with short EDA of your data set. Send a pdf file directly to me, send html code
# through a gist on GitHub.
num_data_females <- data_row_females %>%
  select(Year, Estonian, Russian, `Other mother tongue`) %>%
  pivot_longer(cols = c("Estonian", "Russian", "Other mother tongue"),
               names_to = "Native language",
               values_to = "the number of dropouts")
num_data_females
## # A tibble: 24 × 3
##     Year `Native language`   `the number of dropouts`
##    <dbl> <chr>                                  <dbl>
##  1  2014 Estonian                                 894
##  2  2014 Russian                                  179
##  3  2014 Other mother tongue                       17
##  4  2015 Estonian                                 973
##  5  2015 Russian                                  211
##  6  2015 Other mother tongue                       16
##  7  2016 Estonian                                 880
##  8  2016 Russian                                  164
##  9  2016 Other mother tongue                       12
## 10  2017 Estonian                                 770
## # … with 14 more rows
ggplot(data = num_data_females, aes(x = Year, y = `the number of dropouts`)) +
  geom_line(aes(color = `Native language`)) +
  scale_color_manual(values = c("blue", "red", "black"))

# Unlike the numbers of male dropouts whose native language is Estonian, the numbers of female dropouts over the 8 years were almost always above 750. The trends are similar to those of the male dropouts.
# Here is another way to visualise boxplots, using ggplot2 
ggplot(data = num_data_males, 
       aes(x = `Native language`, y = `the number of dropouts`, fill = `Native language`)) +
  geom_boxplot(alpha = 0.3)

# Visualising the data with ggplot2 produces a result that conflicts with what we obtained from the boxplot() calls above. ggplot2 marks an outlier in the group of students whose native language is Estonian, while boxplot() displays none
# Unlike colouring scatter plots and line graphs, colouring the boxes in boxplots needs the argument fill inside the function aes()
# alpha refers to the opacity of a geom, ranging from 0 to 1. The lower the value, the more transparent the colour
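
The outlier discrepancy noted above most likely comes from how the box edges are computed. boxplot() uses Tukey's hinges (as returned by fivenum()), while the ggplot2 documentation states that geom_boxplot() uses the first and third quartiles, which can differ for small samples. The sketch below applies both rules to the eight male Estonian values from the printed tibble. It assumes ggplot2 computes its quartiles with the default quantile() method.

```r
estonian <- c(606, 695, 567, 501, 518, 490, 446, 535)

# Rule used by boxplot(): hinges from fivenum(), fence at 1.5 * hinge spread
hinges <- fivenum(estonian)[c(2, 4)]                    # 495.5 and 586.5
upper_fence_base <- hinges[2] + 1.5 * diff(hinges)      # 723
outliers_base <- estonian[estonian > upper_fence_base]  # none

# Quartile rule (assumed to match geom_boxplot()): quantile() defaults
quartiles <- unname(quantile(estonian, c(0.25, 0.75)))  # 498.25 and 576.75
upper_fence_q <- quartiles[2] + 1.5 * diff(quartiles)   # 694.5
outliers_q <- estonian[estonian > upper_fence_q]        # 695

outliers_base
outliers_q
```

Under the quartile rule the 2015 value of 695 falls just above the fence of 694.5, which would explain why only ggplot2 draws it as an outlier.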
ggplot(data = num_data_females, 
       aes(x = `Native language`, y = `the number of dropouts`, fill = `Native language`)) +
  geom_boxplot(alpha = 0.3)

# The visualisation with the package ggplot2 is the same as the visualisation with the function boxplot()
# Here is a way to visualise means and standard deviations with the ggplot2 package
ggplot(data = num_data_males, 
       aes(x = `Native language`, y = `the number of dropouts`, color = `Native language`)) +
  stat_summary(fun = mean, 
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

# stat_summary() summarises the y values at each unique x (or the x values at each unique y). We can use it in ggplot2 to visualise confidence intervals, standard errors, means, and medians of a variable. It has many arguments: mapping, data, geom, position, etc.
# An open question: if I put quotation marks around the argument fun, as in fun = "mean", the result does not differ from fun = mean.
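
On the question above: the ggplot2 documentation uses quoted function names in its own stat_summary() examples, so the function appears to accept either a function or the name of one. Base R offers the same convenience through match.fun(), sketched here as an analogy. Whether ggplot2 uses this exact mechanism internally is an assumption.

```r
# match.fun() returns a function whether it is given the function
# itself or its name as a character string
by_object <- match.fun(mean)
by_name   <- match.fun("mean")

identical(by_object, by_name)

# Both therefore give the same result on the same input
x <- c(606, 695, 567)
same_result <- identical(by_object(x), by_name(x))
same_result
```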
ggplot(data = num_data_females, 
       aes(x = `Native language`, y = `the number of dropouts`, color = `Native language`)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

################# END OF HOME ASSIGNMENT ################