TASK 1: 5 points. Short EDA report

The main aim of the task is to import data into R, perform a brief exploratory analysis, build at least one plot, and export the results using R Markdown by knitting (‘knit’) an HTML file.

1. Go to https://andmed.stat.ee/en/stat and create a data set you want to explore. Import the data into R.

    library(readxl)

    library(readr)

    library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ dplyr   1.0.10
## ✔ tibble  3.1.8      ✔ stringr 1.4.1 
## ✔ tidyr   1.2.1      ✔ forcats 0.5.2 
## ✔ purrr   0.3.5      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

    library(rmarkdown)

    library(knitr)

Importing data3.xlsx, which contains data on the marriages registered per month in Estonia:

marriages_estonia <- read_xlsx("data3.xlsx")

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`

2. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.

EDA AND DATA PREPARATION

Creating a data frame for the imported data:

df_marriages <- data.frame(marriages_estonia)

Question: Are there any Null / NA values?

Answer: Yes, there are, and I will remove them.

Removing the rows that contain only NA values (rows 1, 2 and 9 to 33) and dropping the first two columns:

    new_df_marriages <- df_marriages[-c(1, 2, 9:33), -c(1, 2)]
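Before dropping rows by position, the NA values could be counted explicitly. The following is only a minimal sketch on a small made-up data frame (the object `toy` and its values are hypothetical and do not reproduce the imported sheet). It uses base R to count NA cells and to keep only rows that hold at least one value:

```r
# Hypothetical toy data frame standing in for an imported sheet with empty rows
toy <- data.frame(
  label = c(NA, "2016", "2017", NA),
  jan   = c(NA, "264", "241", NA),
  feb   = c(NA, "304", "284", NA),
  stringsAsFactors = FALSE
)

# Total number of NA cells and NA count per column
total_na  <- sum(is.na(toy))
na_by_col <- colSums(is.na(toy))

# Positions of rows in which every cell is NA
all_na_rows <- which(rowSums(is.na(toy)) == ncol(toy))

# Keep rows with at least one non-NA value (logical indexing also works
# when no row is completely empty, unlike negative indexing with which())
toy_clean <- toy[rowSums(is.na(toy)) < ncol(toy), ]
```

Selecting rows by a logical condition rather than by hard-coded row numbers keeps the step valid if the exported sheet gains or loses header rows.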

Question: Do the names of columns and rows make sense? Should something be changed about it?

Answer: The automatically generated column and row names are not meaningful, so I will replace them.

Changing the names of the columns:

colnames(new_df_marriages) <- c(month.name, "Sum per year")  # month.name is R's built-in vector of English month names

Changing the names of the rows:

rownames(new_df_marriages) <- as.character(2016:2021)

This is what the data frame looks like now:

new_df_marriages

##      January February March April May June July August September October
## 2016     264      304   311   332 411  603 1184   1014       567     384
## 2017     241      284   340   323 433  601 1242    965       588     373
## 2018     248      262   248   323 415  648  920   1292       458     393
## 2019     276      299   340   348 521  644  887    964       596     395
## 2020     272      548   272   218 277  480  920   1032       499     491
## 2021     251      260   286   279 388  584 1077    920       520     429
##      November December Sum per year
## 2016      317      380         6071
## 2017      305      355         6050
## 2018      354      386         5947
## 2019      324      305         5899
## 2020      269      350         5628
## 2021      374      406         5774

Question: Which data are we even looking at: what is the observation unit and the reference period?

Answer: As the data frame above shows, each value is the number of marriages registered in Estonia in a given month (the observation unit). The reference period covers the years 2016 to 2021, and the last column holds the yearly totals.

Here is a summary of the data:

summary(new_df_marriages)

##    January            February            March              April          
##  Length:6           Length:6           Length:6           Length:6          
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      May                June               July              August         
##  Length:6           Length:6           Length:6           Length:6          
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   September           October            November           December        
##  Length:6           Length:6           Length:6           Length:6          
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Sum per year 
##  Min.   :5628  
##  1st Qu.:5805  
##  Median :5923  
##  Mean   :5895  
##  3rd Qu.:6024  
##  Max.   :6071

Question: What are the data types of the variables? Do we need to change them?

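Answer: The monthly counts were imported as character data (see the summary above), whereas ‘Sum per year’ is already numeric. I will convert the monthly counts to numeric data and create vectors from them, which will be needed later.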

Converting the numbers for 2016 into numeric data, calculating the sum and assigning it to marriages_2016:

marriages_2016 <- sum(as.numeric(c(new_df_marriages[1,1:12])))

marriages_2016

## [1] 6071

Converting the numbers for 2017 into numeric data, calculating the sum and assigning it to marriages_2017:

marriages_2017 <- sum(as.numeric(c(new_df_marriages[2,1:12])))

marriages_2017

## [1] 6050

Converting the numbers for 2018 into numeric data, calculating the sum and assigning it to marriages_2018:

marriages_2018 <- sum(as.numeric(c(new_df_marriages[3,1:12])))

marriages_2018

## [1] 5947

Converting the numbers for 2019 into numeric data, calculating the sum and assigning it to marriages_2019:

marriages_2019 <- sum(as.numeric(c(new_df_marriages[4,1:12])))

marriages_2019

## [1] 5899

Converting the numbers for 2020 into numeric data, calculating the sum and assigning it to marriages_2020:

marriages_2020 <- sum(as.numeric(c(new_df_marriages[5,1:12])))

marriages_2020

## [1] 5628

Converting the numbers for 2021 into numeric data, calculating the sum and assigning it to marriages_2021:

marriages_2021 <- sum(as.numeric(c(new_df_marriages[6,1:12])))

marriages_2021

## [1] 5774

Calculating the sum of all marriages from 2016 to 2021 and assigning it to sum_marriages:

sum_marriages <- sum(marriages_2016, marriages_2017, marriages_2018, marriages_2019, marriages_2020, marriages_2021)

sum_marriages

## [1] 35369

Creating a vector with the numbers of marriages per year and assigning it to ‘marriages_per_year’:

marriages_per_year <- c(marriages_2016, marriages_2017, marriages_2018, marriages_2019, marriages_2020, marriages_2021)

marriages_per_year

## [1] 6071 6050 5947 5899 5628 5774
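The six separate conversions above could also be written as one step. The sketch below is an alternative, not the approach used in this report: it rebuilds the monthly counts from the printed table as character data (the object names `counts_chr`, `counts_num`, `totals_by_year` and `grand_total` are made up for illustration), converts all columns at once and sums each row:

```r
# Monthly counts as printed in the report, stored as character like the import
counts_chr <- data.frame(
  January   = c("264", "241", "248", "276", "272", "251"),
  February  = c("304", "284", "262", "299", "548", "260"),
  March     = c("311", "340", "248", "340", "272", "286"),
  April     = c("332", "323", "323", "348", "218", "279"),
  May       = c("411", "433", "415", "521", "277", "388"),
  June      = c("603", "601", "648", "644", "480", "584"),
  July      = c("1184", "1242", "920", "887", "920", "1077"),
  August    = c("1014", "965", "1292", "964", "1032", "920"),
  September = c("567", "588", "458", "596", "499", "520"),
  October   = c("384", "373", "393", "395", "491", "429"),
  November  = c("317", "305", "354", "324", "269", "374"),
  December  = c("380", "355", "386", "305", "350", "406"),
  row.names = as.character(2016:2021),
  stringsAsFactors = FALSE
)

# Convert every month column to numeric while keeping the data frame shape
counts_num <- counts_chr
counts_num[] <- lapply(counts_chr, as.numeric)

# Yearly totals as a named vector, plus the total over all six years
totals_by_year <- rowSums(counts_num)
grand_total <- sum(totals_by_year)
```

Here `totals_by_year` holds the same six values as ‘marriages_per_year’, with the years as names.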

Question: How is the data distributed? Is it skewed?

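Answer: With only six yearly totals (2016 to 2021), the shape of the distribution cannot be judged reliably. The totals lie close together, and the mean (5895) is only slightly below the median (5923), so there is no pronounced skew. The values look roughly uniform rather than normally distributed: each year has a similar number of marriages.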

Question: Are there any outliers (when it comes to the marriages per year)?

Answer: No, there are no outliers. The yearly totals for 2016 to 2021 range from 5628 to 6071, which is already apparent from the barplot and could also be checked with a scatterplot.
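The outlier answer can also be checked numerically. The sketch below applies the common boxplot rule (values more than 1.5 times the hinge spread beyond the hinges count as outliers). This rule is an assumption added here for illustration; it is not the method the report itself relied on:

```r
# Yearly totals taken from the report
marriages_per_year <- c(6071, 6050, 5947, 5899, 5628, 5774)

# boxplot.stats() returns the five-number summary and any flagged values
stats <- boxplot.stats(marriages_per_year, coef = 1.5)
outliers <- stats$out

# Smallest and largest yearly total
total_range <- range(marriages_per_year)
```

An empty `outliers` vector means that no yearly total is flagged under this rule.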

3. Provide brief descriptive statistical analysis of your data set (like measures of central tendency and dispersion).

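It makes sense to look at the data per year: for each year, the statistics below are calculated over its twelve monthly counts (it could also make sense to look at the individual months across the years).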

CENTRAL TENDENCY (for the individual years):

2016

Calculating the mean of all marriages in 2016 and assigning it to ‘mean_marriages_2016’:

mean_marriages_2016 <- mean(as.numeric(c(new_df_marriages[1,1:12])))

mean_marriages_2016

## [1] 505.9167

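A fractional number of marriages cannot occur, so I convert the mean to an integer. Note that as.integer() truncates the decimals (505.9167 becomes 505) instead of rounding to the nearest whole number (506):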

mean_marriages_2016 <- as.integer(mean_marriages_2016)

mean_marriages_2016

## [1] 505

2017

Calculating the mean of all marriages in 2017 and assigning it to ‘mean_marriages_2017’:

mean_marriages_2017 <- mean(as.numeric(c(new_df_marriages[2,1:12])))

mean_marriages_2017

## [1] 504.1667

A fractional number of marriages cannot occur, so I convert the mean to an integer (as.integer() truncates the decimals):

mean_marriages_2017 <- as.integer(mean_marriages_2017)

mean_marriages_2017

## [1] 504

2018

Calculating the mean of all marriages in 2018 and assigning it to ‘mean_marriages_2018’:

mean_marriages_2018 <- mean(as.numeric(c(new_df_marriages[3,1:12])))

mean_marriages_2018

## [1] 495.5833

A fractional number of marriages cannot occur, so I convert the mean to an integer (as.integer() truncates the decimals, so 495.5833 becomes 495 rather than 496):

mean_marriages_2018 <- as.integer(mean_marriages_2018)

mean_marriages_2018

## [1] 495

2019

Calculating the mean of all marriages in 2019 and assigning it to ‘mean_marriages_2019’:

mean_marriages_2019 <- mean(as.numeric(c(new_df_marriages[4,1:12])))

mean_marriages_2019

## [1] 491.5833

A fractional number of marriages cannot occur, so I convert the mean to an integer (as.integer() truncates the decimals, so 491.5833 becomes 491 rather than 492):

mean_marriages_2019 <- as.integer(mean_marriages_2019)

mean_marriages_2019

## [1] 491

2020

Calculating the mean of all marriages in 2020 and assigning it to ‘mean_marriages_2020’:

mean_marriages_2020 <- mean(as.numeric(c(new_df_marriages[5,1:12])))

mean_marriages_2020

## [1] 469

2021

Calculating the mean of all marriages in 2021 and assigning it to ‘mean_marriages_2021’:

mean_marriages_2021 <- mean(as.numeric(c(new_df_marriages[6,1:12])))

mean_marriages_2021

## [1] 481.1667

A fractional number of marriages cannot occur, so I convert the mean to an integer (as.integer() truncates the decimals):

mean_marriages_2021 <- as.integer(mean_marriages_2021)

mean_marriages_2021

## [1] 481

Question: How many marriages have happened in Estonia on average per year from 2016 to 2021?

Calculating the mean of marriages per year from 2016 to 2021 and assigning it to ‘mean_total’:

mean_total <- mean(c(marriages_2016, marriages_2017, marriages_2018, marriages_2019, marriages_2020, marriages_2021))

mean_total

## [1] 5894.833

as.integer(mean_total)

## [1] 5894

On average, about 5895 marriages per year (5894.83; as.integer() truncates this to 5894) were registered in Estonia from 2016 to 2021.
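The difference between truncating and rounding matters for several of the means above. The following sketch starts from the yearly totals in the report (the object names are made up for illustration) and compares as.integer() with round(). It also computes the median of the yearly totals as a second measure of central tendency:

```r
# Yearly totals taken from the report, named by year
totals <- c("2016" = 6071, "2017" = 6050, "2018" = 5947,
            "2019" = 5899, "2020" = 5628, "2021" = 5774)

# Mean number of marriages per month within each year
monthly_means <- totals / 12

truncated <- as.integer(monthly_means)  # drops the decimals (and the names)
rounded   <- round(monthly_means)       # nearest whole number

# Central tendency of the yearly totals themselves
mean_total   <- mean(totals)
median_total <- median(totals)
```

For 2016, 2018 and 2019 the two conversions differ by one marriage, because the decimal part of the mean is above 0.5.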

DISPERSION (for the individual years)

2016

Calculating the variance of all marriages in 2016 and assigning it to ‘var_marriages_2016’:

var_marriages_2016 <- var(as.numeric(c(new_df_marriages[1,1:12])))

var_marriages_2016

## [1] 88550.27

Calculating the standard deviation of all marriages in 2016 and assigning it to ‘sd_marriages_2016’:

sd_marriages_2016 <- sd(as.numeric(c(new_df_marriages[1,1:12])))

sd_marriages_2016

## [1] 297.574

2017

Calculating the variance of all marriages in 2017 and assigning it to ‘var_marriages_2017’:

var_marriages_2017 <- var(as.numeric(c(new_df_marriages[2,1:12])))

var_marriages_2017

## [1] 94078.15

Calculating the standard deviation of all marriages in 2017 and assigning it to ‘sd_marriages_2017’:

sd_marriages_2017 <- sd(as.numeric(c(new_df_marriages[2,1:12])))

sd_marriages_2017

## [1] 306.7216

2018

Calculating the variance of all marriages in 2018 and assigning it to ‘var_marriages_2018’:

var_marriages_2018 <- var(as.numeric(c(new_df_marriages[3,1:12])))

var_marriages_2018

## [1] 99551.36

Calculating the standard deviation of all marriages in 2018 and assigning it to ‘sd_marriages_2018’:

sd_marriages_2018 <- sd(as.numeric(c(new_df_marriages[3,1:12])))

sd_marriages_2018

## [1] 315.5176

2019

Calculating the variance of all marriages in 2019 and assigning it to ‘var_marriages_2019’:

var_marriages_2019 <- var(as.numeric(c(new_df_marriages[4,1:12])))

var_marriages_2019

## [1] 55810.45

Calculating the standard deviation of all marriages in 2019 and assigning it to ‘sd_marriages_2019’:

sd_marriages_2019 <- sd(as.numeric(c(new_df_marriages[4,1:12])))

sd_marriages_2019

## [1] 236.2423

2020

Calculating the variance of all marriages in 2020 and assigning it to ‘var_marriages_2020’:

var_marriages_2020 <- var(as.numeric(c(new_df_marriages[5,1:12])))

var_marriages_2020

## [1] 69069.09

Calculating the standard deviation of all marriages in 2020 and assigning it to ‘sd_marriages_2020’:

sd_marriages_2020 <- sd(as.numeric(c(new_df_marriages[5,1:12])))

sd_marriages_2020

## [1] 262.81

2021

Calculating the variance of all marriages in 2021 and assigning it to ‘var_marriages_2021’:

var_marriages_2021 <- var(as.numeric(c(new_df_marriages[6,1:12])))

var_marriages_2021

## [1] 69914.88

Calculating the standard deviation of all marriages in 2021 and assigning it to ‘sd_marriages_2021’:

sd_marriages_2021 <- sd(as.numeric(c(new_df_marriages[6,1:12])))

sd_marriages_2021

## [1] 264.4142
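The per-year variance and standard deviation could be obtained in one pass by applying the functions across the rows of a numeric matrix. This is a sketch under the assumption that the monthly counts are stored as in the table above. The matrix `counts` and the result names are hypothetical, and the range is added as a further dispersion measure that the report does not calculate:

```r
# Monthly counts from the report, one row per year
counts <- matrix(
  c(264, 304, 311, 332, 411, 603, 1184, 1014, 567, 384, 317, 380,
    241, 284, 340, 323, 433, 601, 1242,  965, 588, 373, 305, 355,
    248, 262, 248, 323, 415, 648,  920, 1292, 458, 393, 354, 386,
    276, 299, 340, 348, 521, 644,  887,  964, 596, 395, 324, 305,
    272, 548, 272, 218, 277, 480,  920, 1032, 499, 491, 269, 350,
    251, 260, 286, 279, 388, 584, 1077,  920, 520, 429, 374, 406),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(2016:2021), month.name)
)

# MARGIN = 1 applies the function to each row, i.e. to each year
var_by_year   <- apply(counts, 1, var)
sd_by_year    <- apply(counts, 1, sd)
range_by_year <- apply(counts, 1, function(x) max(x) - min(x))
```

Each result is a named vector with one entry per year, so `sd_by_year[["2016"]]` corresponds to ‘sd_marriages_2016’.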

FURTHER ANALYSIS

Question: From 2016 to 2021, did more people get married in summer (June, July, August) than in spring (March, April, May)?

Assigning the marriages that happened in March, April and May to ‘marriages_spring’ and calculating the sum:

marriages_spring <- as.numeric(c(new_df_marriages[, 3], new_df_marriages[, 4], new_df_marriages[, 5]))

sum(marriages_spring)

## [1] 6065

Assigning the marriages that happened in June, July and August to ‘marriages_summer’ and calculating the sum:

marriages_summer <- as.numeric(c(new_df_marriages[, 6], new_df_marriages[, 7], new_df_marriages[, 8]))

sum(marriages_summer)

## [1] 15977

Answer: Yes. From 2016 to 2021, 15977 marriages were registered in summer compared with 6065 in spring, which is about 2.6 times as many.
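Selecting the months by name instead of by column position makes the seasonal comparison less error-prone. The sketch below rebuilds only the March to August counts from the table (the data frame `season_counts` and the other object names are made up for illustration) and also checks whether summer exceeds spring in every single year:

```r
# March to August counts for 2016 to 2021, taken from the table in the report
season_counts <- data.frame(
  March  = c(311, 340, 248, 340, 272, 286),
  April  = c(332, 323, 323, 348, 218, 279),
  May    = c(411, 433, 415, 521, 277, 388),
  June   = c(603, 601, 648, 644, 480, 584),
  July   = c(1184, 1242, 920, 887, 920, 1077),
  August = c(1014, 965, 1292, 964, 1032, 920),
  row.names = as.character(2016:2021)
)

spring_months <- c("March", "April", "May")
summer_months <- c("June", "July", "August")

# Totals over all six years and their ratio
spring_total <- sum(season_counts[, spring_months])
summer_total <- sum(season_counts[, summer_months])
ratio <- summer_total / spring_total

# Year-by-year comparison of the two seasons
summer_beats_spring <-
  rowSums(season_counts[, summer_months]) > rowSums(season_counts[, spring_months])
```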

Question: Did fewer people get married in the year the COVID pandemic started in Europe (2020) than in the other years?

Comparing marriages in 2020 with those in 2016:

marriages_2020 < marriages_2016

## [1] TRUE

Comparing marriages in 2020 with those in 2017:

marriages_2020 < marriages_2017

## [1] TRUE

Comparing marriages in 2020 with those in 2018:

marriages_2020 < marriages_2018

## [1] TRUE

Comparing marriages in 2020 with those in 2019:

marriages_2020 < marriages_2019

## [1] TRUE

Comparing marriages in 2020 with those in 2021:

marriages_2020 < marriages_2021

## [1] TRUE

Answer: Fewer people got married in Estonia in 2020 than in each of the other years (2016, 2017, 2018, 2019 and 2021).
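The five pairwise comparisons could be collapsed into one vectorised comparison. This is a minimal sketch using the yearly totals from the report, with object names chosen for illustration. The percentage change from 2019 to 2020 is an additional figure that the report does not calculate:

```r
# Yearly totals taken from the report, named by year
totals <- c("2016" = 6071, "2017" = 6050, "2018" = 5947,
            "2019" = 5899, "2020" = 5628, "2021" = 5774)

# Compare 2020 with every other year at once
others <- totals[names(totals) != "2020"]
fewer_in_2020 <- totals[["2020"]] < others

# Year with the lowest total
lowest_year <- names(which.min(totals))

# Relative change from 2019 to 2020 in percent
change_vs_2019 <- (totals[["2020"]] - totals[["2019"]]) / totals[["2019"]] * 100
```

If `all(fewer_in_2020)` is TRUE, 2020 had fewer marriages than each of the other years.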

Question: Do we see a steady increase or decrease of marriages from 2016 to 2021?

To answer this question, it is easiest to look at a barplot with the numbers of marriages per year:

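Answer: The number of marriages per year fell every year from 6071 in 2016 to 5628 in 2020 and then rose slightly to 5774 in 2021, so there is a decline until 2020 but no steady trend over the whole period.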
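A barplot of this kind could be drawn with base R as sketched below. The axis labels and title are assumptions chosen for illustration and need not match the plot in the knitted report. The year-over-year differences are added to make the trend explicit:

```r
# Yearly totals taken from the report, named by year
totals <- c("2016" = 6071, "2017" = 6050, "2018" = 5947,
            "2019" = 5899, "2020" = 5628, "2021" = 5774)

# Change from each year to the next; negative values mean a decrease
yearly_change <- diff(totals)

# Base R barplot; the names of the vector become the bar labels
bar_midpoints <- barplot(
  totals,
  xlab = "Year",
  ylab = "Registered marriages",
  main = "Marriages per year in Estonia, 2016 to 2021"
)
```

The first four entries of `yearly_change` are negative and the last one is positive, which matches the pattern described in the answer above.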

4. Include at least one plot into your report. If ggplot2 is too complicated for you now, create a plot with R base functions.

See the plots above.

5. Create a pdf or html file with short EDA of your data set. Send a pdf file directly to me, send html code through a gist on GitHub.

See above.

HW4_Koenig

Isabella Koenig

2022-11-01

Homework assignment #4

END OF HOME ASSIGNMENT