TASK 1: 5 points. Short EDA report
I loaded the necessary packages:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rmarkdown)
library(knitr)
library(readxl)
library(ggplot2)
library(tibble)
I imported the data from Excel:
disabled <- read_excel("C:\\Users\\Regu\\OneDrive\\Työpöytä\\THV02_20221020-133041.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
For a cleaner dataset I removed unnecessary rows and a column:
disabled <- disabled[-c(1,2, 8:31), -c(1)]
I also renamed columns:
names(disabled)[1] <- "Area"
names(disabled)[2] <- "2018"
names(disabled)[3] <- "2019"
names(disabled)[4] <- "2020"
names(disabled)[5] <- "2021"
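The five renaming calls above could also be written as a single assignment. This is a minimal sketch on a made-up stand-in table: the object `toy` and its placeholder column names are assumptions for illustration, not the real data.

```r
# Toy stand-in with placeholder column names, similar to what read_excel() generates
toy <- data.frame(
  `...2` = "Somewhere",
  `...3` = "1",
  `...4` = "2",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assign all column names in one step; the vector length must match ncol(toy)
names(toy) <- c("Area", "2018", "2019")

print(names(toy))
```

Assigning the whole vector at once keeps the names in one place, which makes a mismatch between column count and name count easier to spot.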
Questions that I chose (from: https://medium.com/@gauravtopre9/questions-to-ask-while-eda-1a19f82fbc5d):
What are the observed population, the observation unit and the reference period? The population is disabled men and women, the observation unit is a person, and the reference period is the years 2018-2021.
What do we know about the population / sampling? Disabled people who are able to work in Tallinn, Ida-Viru county, Jõgeva county, Pärnu county, Tartu county.
Are there null/NA values? No.
What are the data types of the variables? Do we need to change them? All columns are read in as “character”, so the year columns need to be converted to numeric.
What is the mean for each variable?
mean(as.numeric(disabled$`2018`))
## [1] 5870.2
mean(as.numeric(disabled$`2019`))
## [1] 5571
mean(as.numeric(disabled$`2020`))
## [1] 5546.8
mean(as.numeric(disabled$`2021`))
## [1] 6477.8
Means for the years: 2018 = 5870.2, 2019 = 5571, 2020 = 5546.8, 2021 = 6477.8
With the as.numeric() function, I converted the year columns from character to numeric:
disabled$`2018` = as.numeric(as.character(disabled$`2018`))
disabled$`2019` = as.numeric(as.character(disabled$`2019`))
disabled$`2020` = as.numeric(as.character(disabled$`2020`))
disabled$`2021` = as.numeric(as.character(disabled$`2021`))
To make sure it was successfully converted, I checked it with the sapply() function.
sapply(disabled, class)
## Area 2018 2019 2020 2021
## "character" "numeric" "numeric" "numeric" "numeric"
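The conversion, the NA check and the per-year means can also be done for all year columns at once. The following is a hedged base R sketch on a made-up table: `toy`, its area labels and its numbers are illustrative assumptions, not the real data.

```r
# Toy stand-in for the cleaned table; the values are made up for illustration
toy <- data.frame(
  Area = c("A", "B", "C"),
  `2018` = c("10", "20", "30"),
  `2021` = c("12", "24", "36"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except Area holds yearly counts stored as text
year_cols <- setdiff(names(toy), "Area")

# Convert all year columns from character to numeric in one step
toy[year_cols] <- lapply(toy[year_cols], as.numeric)

# Count missing values per column; text that fails to parse would show up here as NA
na_counts <- colSums(is.na(toy))

# Mean of each year column
year_means <- colMeans(toy[year_cols])

print(sapply(toy, class))
print(na_counts)
print(year_means)
```

Converting before computing the means avoids repeating as.numeric() inside every mean() call, and the NA count doubles as a check that no value was lost in the conversion.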
I created a bar plot with ggplot2. The plot shows how the number of disabled people who are able to work differed between the areas in 2018:
ggplot(disabled, aes(x=Area, y=`2018`, fill = Area)) +
geom_bar(stat = "summary")
## No summary function supplied, defaulting to `mean_se()`
The second plot shows the same comparison for 2021:
ggplot(disabled, aes(x=Area, y=`2021`, fill = Area)) +
geom_bar(stat = "summary")
## No summary function supplied, defaulting to `mean_se()`
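To compare the two years in one figure instead of two, the table can be reshaped to long format and drawn as grouped bars. This is only a sketch under assumptions: `toy` and its values are made up, and the names `Year` and `Count` are chosen here for illustration.

```r
library(tidyr)
library(ggplot2)

# Toy stand-in for the cleaned, numeric table; the values are made up
toy <- data.frame(
  Area = c("A", "B", "C"),
  `2018` = c(10, 20, 30),
  `2021` = c(12, 24, 36),
  check.names = FALSE
)

# Reshape from one column per year to one row per area-year pair
toy_long <- pivot_longer(toy, cols = -Area, names_to = "Year", values_to = "Count")

# geom_col() draws the values as given; position = "dodge" puts the years side by side
p <- ggplot(toy_long, aes(x = Area, y = Count, fill = Year)) +
  geom_col(position = "dodge")

print(p)
```

Because each area has exactly one value per year, geom_col() is enough here and no summary function is needed, so the `mean_se()` message does not appear.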
Conclusions: Jõgeva seems to have the fewest disabled people who are able to work. The two plots look similar, but the overall numbers were higher in 2021 than in 2018: the mean across the areas fell from 5870.2 in 2018 to 5546.8 in 2020 and then rose to 6477.8 in 2021. One possible explanation for the increase is improving healthcare, although this data alone cannot confirm it. The differences between the areas are probably explained largely by differences in population size.
END OF HOME ASSIGNMENT