Your full name: Isabella König
Date: 19.10.2022/02.11.2022
The main aim of the task is to import data into R, perform a brief exploratory analysis, build at least one plot, and export the results using R Markdown by ‘knitting’ the document into an HTML file.
library(readxl)
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ dplyr 1.0.10
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rmarkdown)
library(knitr)
Importing data3.xlsx which includes data on the registration of marriages by month in Estonia:
marriages_estonia <- read_xlsx("data3.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
Creating a data frame for the imported data:
df_marriages <- data.frame(marriages_estonia)
Yes, there are rows that only contain NA values, and I will remove them.
Removing the rows that only have NA values, together with the first two columns:
new_df_marriages <- df_marriages[-c(1, 2, 9:33), -c(1, 2)]
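The row numbers above are hard-coded for this particular sheet. As a minimal sketch of an alternative, rows that consist entirely of NA values can also be detected programmatically. The data frame `toy` and its values are hypothetical stand-ins for the imported data, not the real sheet.

```r
# Toy data frame standing in for the imported sheet (hypothetical values).
toy <- data.frame(
  a = c(NA, "264", NA, "241"),
  b = c(NA, "304", NA, "284"),
  stringsAsFactors = FALSE
)

# A row counts as empty when every cell in it is NA.
empty_rows <- rowSums(is.na(toy)) == ncol(toy)

# Keep only the rows that contain at least one value.
toy_clean <- toy[!empty_rows, , drop = FALSE]
toy_clean
```

This only removes rows that are completely empty; header or footnote rows that contain text would still need to be dropped by index.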
Currently, the names don’t make sense so I will change them.
Changing the name of the columns:
colnames(new_df_marriages) <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "Sum per year")
Changing the name of the rows:
rownames(new_df_marriages) <- c("2016", "2017", "2018", "2019", "2020", "2021")
This is what the data frame looks like now:
new_df_marriages
## January February March April May June July August September October
## 2016 264 304 311 332 411 603 1184 1014 567 384
## 2017 241 284 340 323 433 601 1242 965 588 373
## 2018 248 262 248 323 415 648 920 1292 458 393
## 2019 276 299 340 348 521 644 887 964 596 395
## 2020 272 548 272 218 277 480 920 1032 499 491
## 2021 251 260 286 279 388 584 1077 920 520 429
## November December Sum per year
## 2016 317 380 6071
## 2017 305 355 6050
## 2018 354 386 5947
## 2019 324 305 5899
## 2020 269 350 5628
## 2021 374 406 5774
Answer: As the data frame above shows, the data contains the number of marriages in Estonia per month for the years 2016 to 2021.
Here is a summary of the data:
summary(new_df_marriages)
## January February March April
## Length:6 Length:6 Length:6 Length:6
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## May June July August
## Length:6 Length:6 Length:6 Length:6
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## September October November December
## Length:6 Length:6 Length:6 Length:6
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Sum per year
## Min. :5628
## 1st Qu.:5805
## Median :5923
## Mean :5895
## 3rd Qu.:6024
## Max. :6071
Answer: The summary shows that the monthly columns are stored as character data. I will convert the marriages per month from character into numeric data and create vectors (these will be needed later).
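As a side note, the conversion can also be applied to all columns of a data frame at once. The sketch below uses a small hypothetical data frame `df` with two character columns; it is not the approach used in this report, only an illustration of the idea.

```r
# Small stand-in with character columns, mimicking the imported data.
df <- data.frame(
  January  = c("264", "241"),
  February = c("304", "284"),
  stringsAsFactors = FALSE
)

# Convert every column to numeric while keeping the data frame structure.
df[] <- lapply(df, as.numeric)

# Check the resulting column types.
sapply(df, class)
```

Assigning into `df[]` keeps the row and column names, whereas assigning to `df` directly would replace the data frame with a plain list.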
Converting the numbers for 2016 into numeric data, calculating the sum and assigning it to marriages_2016:
marriages_2016 <- sum(as.numeric(c(new_df_marriages[1,1:12])))
marriages_2016
## [1] 6071
Converting the numbers for 2017 into numeric data, calculating the sum and assigning it to marriages_2017:
marriages_2017 <- sum(as.numeric(c(new_df_marriages[2,1:12])))
marriages_2017
## [1] 6050
Converting the numbers for 2018 into numeric data, calculating the sum and assigning it to marriages_2018:
marriages_2018 <- sum(as.numeric(c(new_df_marriages[3,1:12])))
marriages_2018
## [1] 5947
Converting the numbers for 2019 into numeric data, calculating the sum and assigning it to marriages_2019:
marriages_2019 <- sum(as.numeric(c(new_df_marriages[4,1:12])))
marriages_2019
## [1] 5899
Converting the numbers for 2020 into numeric data, calculating the sum and assigning it to marriages_2020:
marriages_2020 <- sum(as.numeric(c(new_df_marriages[5,1:12])))
marriages_2020
## [1] 5628
Converting the numbers for 2021 into numeric data, calculating the sum and assigning it to marriages_2021:
marriages_2021 <- sum(as.numeric(c(new_df_marriages[6,1:12])))
marriages_2021
## [1] 5774
Calculating the sum of all marriages from 2016 to 2021 and assigning it to sum_marriages:
sum_marriages <- sum(marriages_2016, marriages_2017, marriages_2018, marriages_2019, marriages_2020, marriages_2021)
sum_marriages
## [1] 35369
Creating a vector with the numbers of marriages per year and assigning it to ‘marriages_per_year’:
marriages_per_year <- c(marriages_2016, marriages_2017, marriages_2018, marriages_2019, marriages_2020, marriages_2021)
marriages_per_year
## [1] 6071 6050 5947 5899 5628 5774
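The six yearly sums above can also be obtained in a single step. The following sketch rebuilds the monthly counts from the printed data frame as a numeric matrix (the names `months` and `counts` are introduced here for illustration) and uses `rowSums()`.

```r
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December")

# Monthly counts copied from the data frame printed above, one row per year.
counts <- matrix(
  c(264, 304, 311, 332, 411, 603, 1184, 1014, 567, 384, 317, 380,
    241, 284, 340, 323, 433, 601, 1242,  965, 588, 373, 305, 355,
    248, 262, 248, 323, 415, 648,  920, 1292, 458, 393, 354, 386,
    276, 299, 340, 348, 521, 644,  887,  964, 596, 395, 324, 305,
    272, 548, 272, 218, 277, 480,  920, 1032, 499, 491, 269, 350,
    251, 260, 286, 279, 388, 584, 1077,  920, 520, 429, 374, 406),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(2016:2021), months)
)

# One total per year, named by year.
marriages_by_year <- rowSums(counts)
marriages_by_year

# Grand total over all six years.
sum(marriages_by_year)
```

The result is a named vector, so a single year can be looked up with `marriages_by_year["2020"]`.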
If we look at the years 2016-2021, the yearly totals are not normally distributed, and they are not skewed either. The distribution looks more like a uniform distribution (each year has a similar number of marriages).
No, there are no outliers (which is already apparent from the barplot but we can also look at it with a scatterplot). The number of marriages for the years 2016 to 2021 ranges from 5628 to 6071.
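Besides looking at the plots, the absence of outliers can be checked numerically. The sketch below applies the common 1.5 × IQR rule to the yearly totals; this rule is one possible convention chosen here for illustration, not the method used in the report.

```r
# Yearly totals taken from the output above, named by year.
marriages_per_year <- setNames(
  c(6071, 6050, 5947, 5899, 5628, 5774),
  2016:2021
)

# First and third quartile and the interquartile range.
q <- quantile(marriages_per_year, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
outliers <- marriages_per_year[marriages_per_year < lower_fence |
                               marriages_per_year > upper_fence]
outliers
```

With only six values this check is a rough guide at best, but it agrees with the visual impression that no year stands out.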
It makes sense to look at the data per year (it could also make sense to look at the individual months).
Calculating the mean of all marriages in 2016 and assigning it to ‘mean_marriages_2016’:
mean_marriages_2016 <- mean(as.numeric(c(new_df_marriages[1,1:12])))
mean_marriages_2016
## [1] 505.9167
It does not make sense to have 505.9167 marriages, therefore I convert the number to an integer (as.integer() drops the decimal part, so the result is 505 rather than the rounded value 506):
mean_marriages_2016 <- as.integer(mean_marriages_2016)
mean_marriages_2016
## [1] 505
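Because `as.integer()` cuts off the decimal part instead of rounding, the two approaches can give different whole numbers. A short comparison, using the 2016 mean from above:

```r
x <- 505.9167

# as.integer() truncates toward zero.
as.integer(x)  # 505

# round() goes to the nearest whole number.
round(x)       # 506
```

Which of the two is appropriate depends on how the mean is meant to be reported; `round()` stays closer to the actual value.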
Calculating the mean of all marriages in 2017 and assigning it to ‘mean_marriages_2017’:
mean_marriages_2017 <- mean(as.numeric(c(new_df_marriages[2,1:12])))
mean_marriages_2017
## [1] 504.1667
It does not make sense to have 504.1667 marriages, therefore I convert the number to an integer:
mean_marriages_2017 <- as.integer(mean_marriages_2017)
mean_marriages_2017
## [1] 504
Calculating the mean of all marriages in 2018 and assigning it to ‘mean_marriages_2018’:
mean_marriages_2018 <- mean(as.numeric(c(new_df_marriages[3,1:12])))
mean_marriages_2018
## [1] 495.5833
It does not make sense to have 495.5833 marriages, therefore I convert the number to an integer:
mean_marriages_2018 <- as.integer(mean_marriages_2018)
mean_marriages_2018
## [1] 495
Calculating the mean of all marriages in 2019 and assigning it to ‘mean_marriages_2019’:
mean_marriages_2019 <- mean(as.numeric(c(new_df_marriages[4,1:12])))
mean_marriages_2019
## [1] 491.5833
It does not make sense to have 491.5833 marriages, therefore I convert the number to integer:
mean_marriages_2019 <- as.integer(mean_marriages_2019)
mean_marriages_2019
## [1] 491
Calculating the mean of all marriages in 2020 and assigning it to ‘mean_marriages_2020’:
mean_marriages_2020 <- mean(as.numeric(c(new_df_marriages[5,1:12])))
mean_marriages_2020
## [1] 469
Calculating the mean of all marriages in 2021 and assigning it to ‘mean_marriages_2021’:
mean_marriages_2021 <- mean(as.numeric(c(new_df_marriages[6,1:12])))
mean_marriages_2021
## [1] 481.1667
It does not make sense to have 481.1667 marriages, therefore I convert the number to integer:
mean_marriages_2021 <- as.integer(mean_marriages_2021)
mean_marriages_2021
## [1] 481
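The monthly means for all six years can also be computed together. This sketch again rebuilds the counts from the printed data frame as a numeric matrix (`months` and `counts` are names introduced for illustration) and uses `rowMeans()`.

```r
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December")

# Monthly counts copied from the data frame printed earlier, one row per year.
counts <- matrix(
  c(264, 304, 311, 332, 411, 603, 1184, 1014, 567, 384, 317, 380,
    241, 284, 340, 323, 433, 601, 1242,  965, 588, 373, 305, 355,
    248, 262, 248, 323, 415, 648,  920, 1292, 458, 393, 354, 386,
    276, 299, 340, 348, 521, 644,  887,  964, 596, 395, 324, 305,
    272, 548, 272, 218, 277, 480,  920, 1032, 499, 491, 269, 350,
    251, 260, 286, 279, 388, 584, 1077,  920, 520, 429, 374, 406),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(2016:2021), months)
)

# Mean number of marriages per month, one value per year.
monthly_means <- rowMeans(counts)
monthly_means

# Rounded to whole marriages (nearest integer rather than truncated).
round(monthly_means)
```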
Calculating the mean of marriages per year from 2016 to 2021 and assigning it to ‘mean_total’:
mean_total <- mean(c(marriages_2016, marriages_2017, marriages_2018, marriages_2019, marriages_2020, marriages_2021))
mean_total
## [1] 5894.833
as.integer(mean_total)
## [1] 5894
On average, about 5895 marriages per year took place in Estonia from 2016 to 2021 (the exact mean is 5894.833, which as.integer() truncates to 5894).
Calculating the variance of all marriages in 2016 and assigning it to ‘var_marriages_2016’:
var_marriages_2016 <- var(as.numeric(c(new_df_marriages[1,1:12])))
var_marriages_2016
## [1] 88550.27
Calculating the standard deviation of all marriages in 2016 and assigning it to ‘sd_marriages_2016’:
sd_marriages_2016 <- sd(as.numeric(c(new_df_marriages[1,1:12])))
sd_marriages_2016
## [1] 297.574
Calculating the variance of all marriages in 2017 and assigning it to ‘var_marriages_2017’:
var_marriages_2017 <- var(as.numeric(c(new_df_marriages[2,1:12])))
var_marriages_2017
## [1] 94078.15
Calculating the standard deviation of all marriages in 2017 and assigning it to ‘sd_marriages_2017’:
sd_marriages_2017 <- sd(as.numeric(c(new_df_marriages[2,1:12])))
sd_marriages_2017
## [1] 306.7216
Calculating the variance of all marriages in 2018 and assigning it to ‘var_marriages_2018’:
var_marriages_2018 <- var(as.numeric(c(new_df_marriages[3,1:12])))
var_marriages_2018
## [1] 99551.36
Calculating the standard deviation of all marriages in 2018 and assigning it to ‘sd_marriages_2018’:
sd_marriages_2018 <- sd(as.numeric(c(new_df_marriages[3,1:12])))
sd_marriages_2018
## [1] 315.5176
Calculating the variance of all marriages in 2019 and assigning it to ‘var_marriages_2019’:
var_marriages_2019 <- var(as.numeric(c(new_df_marriages[4,1:12])))
var_marriages_2019
## [1] 55810.45
Calculating the standard deviation of all marriages in 2019 and assigning it to ‘sd_marriages_2019’:
sd_marriages_2019 <- sd(as.numeric(c(new_df_marriages[4,1:12])))
sd_marriages_2019
## [1] 236.2423
Calculating the variance of all marriages in 2020 and assigning it to ‘var_marriages_2020’:
var_marriages_2020 <- var(as.numeric(c(new_df_marriages[5,1:12])))
var_marriages_2020
## [1] 69069.09
Calculating the standard deviation of all marriages in 2020 and assigning it to ‘sd_marriages_2020’:
sd_marriages_2020 <- sd(as.numeric(c(new_df_marriages[5,1:12])))
sd_marriages_2020
## [1] 262.81
Calculating the variance of all marriages in 2021 and assigning it to ‘var_marriages_2021’:
var_marriages_2021 <- var(as.numeric(c(new_df_marriages[6,1:12])))
var_marriages_2021
## [1] 69914.88
Calculating the standard deviation of all marriages in 2021 and assigning it to ‘sd_marriages_2021’:
sd_marriages_2021 <- sd(as.numeric(c(new_df_marriages[6,1:12])))
sd_marriages_2021
## [1] 264.4142
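The variance and standard deviation of the monthly counts can be computed for all years in one pass with `apply()`. The sketch below rebuilds the counts from the printed data frame (`months` and `counts` are illustrative names) and is meant as a compact alternative, not a replacement for the steps above.

```r
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December")

# Monthly counts copied from the data frame printed earlier, one row per year.
counts <- matrix(
  c(264, 304, 311, 332, 411, 603, 1184, 1014, 567, 384, 317, 380,
    241, 284, 340, 323, 433, 601, 1242,  965, 588, 373, 305, 355,
    248, 262, 248, 323, 415, 648,  920, 1292, 458, 393, 354, 386,
    276, 299, 340, 348, 521, 644,  887,  964, 596, 395, 324, 305,
    272, 548, 272, 218, 277, 480,  920, 1032, 499, 491, 269, 350,
    251, 260, 286, 279, 388, 584, 1077,  920, 520, 429, 374, 406),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(2016:2021), months)
)

# MARGIN = 1 applies the function to each row, i.e. to each year.
var_by_year <- apply(counts, 1, var)
sd_by_year  <- apply(counts, 1, sd)

# Both statistics side by side, rounded for display.
round(cbind(variance = var_by_year, sd = sd_by_year), 2)
```

The standard deviation is the square root of the variance, so the two columns carry the same information on different scales.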
Assigning the marriages that happened in March, April and May to ‘marriages_spring’ and calculating the sum:
marriages_spring <- as.numeric(c(new_df_marriages[, 3], new_df_marriages[, 4], new_df_marriages[, 5]))
sum(marriages_spring)
## [1] 6065
Assigning the marriages that happened in June, July and August to ‘marriages_summer’ and calculating the sum:
marriages_summer <- as.numeric(c(new_df_marriages[, 6], new_df_marriages[, 7], new_df_marriages[, 8]))
sum(marriages_summer)
## [1] 15977
Indeed, far more marriages took place in summer than in spring: 15977 compared to 6065, which is more than double (about 2.6 times as many).
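Selecting the months by name rather than by column position makes the seasonal comparison easier to read and less error-prone. A minimal sketch, with the counts rebuilt from the printed data frame (`months` and `counts` are illustrative names):

```r
months <- c("January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December")

# Monthly counts copied from the data frame printed earlier, one row per year.
counts <- matrix(
  c(264, 304, 311, 332, 411, 603, 1184, 1014, 567, 384, 317, 380,
    241, 284, 340, 323, 433, 601, 1242,  965, 588, 373, 305, 355,
    248, 262, 248, 323, 415, 648,  920, 1292, 458, 393, 354, 386,
    276, 299, 340, 348, 521, 644,  887,  964, 596, 395, 324, 305,
    272, 548, 272, 218, 277, 480,  920, 1032, 499, 491, 269, 350,
    251, 260, 286, 279, 388, 584, 1077,  920, 520, 429, 374, 406),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(2016:2021), months)
)

# Sum over all years for the three months of each season.
spring <- sum(counts[, c("March", "April", "May")])
summer <- sum(counts[, c("June", "July", "August")])

# Totals and the summer-to-spring ratio.
c(spring = spring, summer = summer, ratio = summer / spring)
```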
Comparing marriages in 2020 with those in 2016:
marriages_2020 < marriages_2016
## [1] TRUE
Comparing marriages in 2020 with those in 2017:
marriages_2020 < marriages_2017
## [1] TRUE
Comparing marriages in 2020 with those in 2018:
marriages_2020 < marriages_2018
## [1] TRUE
Comparing marriages in 2020 with those in 2019:
marriages_2020 < marriages_2019
## [1] TRUE
Comparing marriages in 2020 with those in 2021:
marriages_2020 < marriages_2021
## [1] TRUE
Fewer people got married in Estonia in 2020 than in any of the years 2016, 2017, 2018, 2019 and 2021.
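The five separate comparisons can be condensed into one vectorised check. The sketch below stores the yearly totals from above in a named vector (`totals` and `other_years` are names introduced here) and tests 2020 against all other years at once.

```r
# Yearly totals taken from the output above, named by year.
totals <- setNames(c(6071, 6050, 5947, 5899, 5628, 5774), 2016:2021)

# All years except 2020.
other_years <- totals[names(totals) != "2020"]

# TRUE only if 2020 is below every other year.
all(totals[["2020"]] < other_years)

# The year with the fewest marriages.
names(which.min(totals))
```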
To answer this question, it is easiest to look at a barplot with the numbers of marriages per year:
Marriages per year went down from 2016 (6071) to 2020 (5628) and increased a little again in 2021 (5774).
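The plotting code itself is not reproduced in this text. A barplot of the kind described could be drawn roughly as follows with base R; the axis labels and the title are assumptions, and the original plot may have been styled differently.

```r
# Yearly totals taken from the output above.
marriages_per_year <- c(6071, 6050, 5947, 5899, 5628, 5774)

# barplot() draws the bars and returns their x-axis midpoints.
bar_positions <- barplot(
  marriages_per_year,
  names.arg = 2016:2021,
  xlab = "Year",
  ylab = "Number of marriages",
  main = "Marriages per year in Estonia, 2016-2021"
)
```

Since the yearly totals are close together, the bars look very similar in height when the axis starts at zero; the differences are easier to see in the numbers than in the plot.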
See the plots above.
See above.