R Markdown

Homework assignment #4

Your full name: Regina Habilainen

Date: 17.10.2022

TASK 1: 5 points. Short EDA report

The main aim of the task is to import data into R, perform a brief exploratory analysis, build at least one plot, and export the results by knitting an R Markdown document to an HTML file.

I activated the necessary packages from the library:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rmarkdown)
library(knitr)
library(readxl)
# ggplot2 and tibble are already attached by library(tidyverse) above

1. Go to https://andmed.stat.ee/en/stat and create a data set you want to explore. Import the data into R.

I imported the data from an Excel file:

disabled <- read_excel("C:\\Users\\Regu\\OneDrive\\Työpöytä\\THV02_20221020-133041.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`

For a cleaner dataset I removed unnecessary rows and a column:

disabled <- disabled[-c(1,2, 8:31), -c(1)]

I also renamed columns:

# Five columns remain after the cleaning step: the area name and one column per year
names(disabled) <- c("Area", "2018", "2019", "2020", "2021")

2. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.

Questions that I chose (from: https://medium.com/@gauravtopre9/questions-to-ask-while-eda-1a19f82fbc5d):

  1. What are the observed population, the observation unit and the reference period? The observed population is disabled men and women, the observation unit is a person, and the reference period is 2018-2021.

  2. What do we know about the population / sampling? The data cover disabled people who are able to work in Tallinn and in Ida-Viru, Jõgeva, Pärnu and Tartu counties.

  3. Are there null/NA values? No.

  4. What are the data types of the variables? Do we need to change them? All columns were imported as “character”, so the year columns need to be converted to numeric before they can be analysed as numbers.

  5. What is the mean for each variable?

mean(as.numeric(disabled$`2018`))
## [1] 5870.2
mean(as.numeric(disabled$`2019`))
## [1] 5571
mean(as.numeric(disabled$`2020`))
## [1] 5546.8
mean(as.numeric(disabled$`2021`))
## [1] 6477.8

Means by year: 2018 = 5870.2; 2019 = 5571; 2020 = 5546.8; 2021 = 6477.8.
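The checks in questions 3 to 5 can also be run on all columns at once. The following is a minimal sketch, assuming a data frame shaped like `disabled`; the `toy` data frame, its area labels and its values are made-up placeholders, not the Statistics Estonia figures.

```r
# Placeholder data shaped like `disabled`; the values are invented for illustration.
toy <- data.frame(
  Area   = c("Area A", "Area B", "Area C"),
  `2018` = c("100", "200", "300"),
  `2019` = c("110", "190", "330"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Question 3: count missing values per column.
na_counts <- colSums(is.na(toy))

# Question 4: inspect the class of every column.
col_classes <- sapply(toy, class)

# Question 5: mean of every year column, converting from character on the fly.
year_means <- sapply(toy[-1], function(x) mean(as.numeric(x)))
print(year_means)
```

Applying the same three calls to `disabled` would avoid repeating one `mean()` line per year.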

3. Provide brief descriptive statistical analysis of your data set (like measures of central tendency and dispersion).
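Beyond the means computed above, measures of central tendency and dispersion could be obtained per year along the following lines. This is only a sketch: the `toy` data frame and its values are made-up placeholders, and `describe()` is a hypothetical helper name introduced here for illustration.

```r
# Placeholder data; the values are invented and are not the real counts.
toy <- data.frame(
  Area   = c("Area A", "Area B", "Area C"),
  `2018` = c(100, 200, 300),
  `2021` = c(120, 180, 360),
  check.names = FALSE
)

# Hypothetical helper: central tendency and dispersion of one numeric vector.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    min = min(x), max = max(x))
}

# Apply to every year column; the result is a matrix with one column per year.
stats <- sapply(toy[-1], describe)
print(round(stats, 1))
```

Base R's `summary()` gives a similar overview (minimum, quartiles, mean, maximum) without a custom helper, but it does not report the standard deviation.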

4. Include at least one plot into your report.

If ggplot2 is too complicated for you now, create a plot with R base functions.

I converted the year columns from character to numeric with as.numeric():

# The columns are already character, so as.numeric() alone is enough
disabled$`2018` <- as.numeric(disabled$`2018`)
disabled$`2019` <- as.numeric(disabled$`2019`)
disabled$`2020` <- as.numeric(disabled$`2020`)
disabled$`2021` <- as.numeric(disabled$`2021`)

To make sure the conversion succeeded, I checked the class of each column with the sapply() function:

sapply(disabled, class)
##        Area        2018        2019        2020        2021 
## "character"   "numeric"   "numeric"   "numeric"   "numeric"
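The per-column conversion can also be written as a single step over all year columns. A minimal base R sketch, again on a placeholder data frame (`toy` and its values are assumptions, not the real data):

```r
# Placeholder data with year columns stored as character, as after the import.
toy <- data.frame(
  Area   = c("Area A", "Area B"),
  `2018` = c("100", "200"),
  `2019` = c("110", "190"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except Area holds counts and should become numeric.
year_cols <- setdiff(names(toy), "Area")
toy[year_cols] <- lapply(toy[year_cols], as.numeric)

# Confirm the result the same way as in the report.
classes <- sapply(toy, class)
print(classes)
```

If a cell contained non-numeric text, `as.numeric()` would turn it into `NA` with a warning, so it is worth re-checking for missing values after converting.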

I created a bar plot with ggplot2. The plot shows how the number of disabled people who are able to work differed between the areas in 2018:

# Each area has exactly one row, so geom_col() plots the value directly
ggplot(disabled, aes(x = Area, y = `2018`, fill = Area)) +
  geom_col()

The second plot shows the same comparison for 2021:

# Each area has exactly one row, so geom_col() plots the value directly
ggplot(disabled, aes(x = Area, y = `2021`, fill = Area)) +
  geom_col()
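Two separate plots make the years hard to compare directly. One option, using the base R functions that the assignment also allows, is a grouped bar plot with both years side by side. The sketch below uses a placeholder data frame; `toy`, its area labels and its values are made up for illustration.

```r
# Placeholder data; the values are invented and are not the real counts.
toy <- data.frame(
  Area   = c("Area A", "Area B", "Area C"),
  `2018` = c(100, 200, 300),
  `2021` = c(120, 180, 360),
  check.names = FALSE
)

# barplot() expects one row per series (year) and one column per group (area).
counts <- t(as.matrix(toy[c("2018", "2021")]))
colnames(counts) <- toy$Area

# beside = TRUE draws the years next to each other instead of stacking them.
barplot(counts, beside = TRUE, legend.text = rownames(counts),
        xlab = "Area", ylab = "Number of people")
```

With the real data, the same call on the year columns of `disabled` would show all areas and years in one figure.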

Conclusions: Jõgeva seems to have the fewest disabled people who are able to work. The two plots look very similar, but the overall numbers changed: the mean across the areas fell slightly from 5870.2 in 2018 to 5546.8 in 2020 and then rose to 6477.8 in 2021. One possible explanation for the increase is improving healthcare, although these data alone cannot confirm it. The differences between the areas can largely be explained by differences in their population sizes.

5. Create a pdf or html file with short EDA of your data set. Send a pdf file directly to me, send html code through a gist on GitHub.

END OF HOME ASSIGNMENT