R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

##############################
##  Homework assignment #4  ##
##############################
# Your full name: Na Zhu
# Date: 19/10/2022
####### TASK 1: 5 points. Short EDA report ###############
# The main aim of the task is to import data into R, perform a brief exploratory analysis, build at least one plot,
# and export the results with R Markdown by knitting an HTML file.

# 1. Go to https://andmed.stat.ee/en/stat and create a data set you want to explore. Import the data into R.
library(tidyverse) # load the "tidyverse" package (it must already be installed)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
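library(readxl) # load the "readxl" package (it must already be installed)
library(ggplot2) # already attached by tidyverse above; kept for clarity
Livebirth <- read_excel("~/LiveBirths.xlsx") # import the Excel file as a tibble
Livebirthtable <- data.frame(Livebirth) # convert the tibble to a base data frame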

# 2. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.
# (You can check the slides from Session 5 - Data Exploration (EDA) for ideas or come up with your own questions).

#(1) What are the observed population, the observation unit and the reference period?
# The observed population is all live births in Estonia, across all ethnic nationalities, in each year.
# The observation unit is one live birth, counted per year; the reference period is 2011 to 2021.

#(2) What are the data types of the variables? Do we need to change them?
summary(Livebirthtable) 
##     Years                Boys          Girls     
##  Length:11          Min.   :6736   Min.   :6459  
##  Class :character   1st Qu.:6911   1st Qu.:6612  
##  Mode  :character   Median :7191   Median :6688  
##                     Mean   :7128   Mean   :6737  
##                     3rd Qu.:7278   3rd Qu.:6876  
##                     Max.   :7555   Max.   :7124
# Years is stored as character. Boys and Girls are already numeric, since summary() reports quartiles for them.
# The conversions below therefore change nothing and serve only as a safeguard.
Livebirthtable$Boys <- as.numeric(Livebirthtable$Boys)
Livebirthtable$Girls <- as.numeric(Livebirthtable$Girls)

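#(3) Are there any outliers? Are there any values that look like errors?
# There are no outliers and no values that look like errors: for both variables the minimum and maximum
# lie within 1.5 x IQR of the quartiles shown in the summary above.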

#(4) What is the mean for each variable?
# The mean number of live-born boys is 7128. The mean number of live-born girls is 6737.
#(5) Are there any NULL / NA values?
# No, there are none.

# 3. Provide brief descriptive statistical analysis of your data set (like measures of central tendency and dispersion).
Boys_range <- range(Livebirthtable$Boys) # The number of live-born boys ranges from 6736 to 7555
Girls_range <- range(Livebirthtable$Girls) # The number of live-born girls ranges from 6459 to 7124
# The mean number of live-born boys is 7128. The mean number of live-born girls is 6737.
Boys_sd <- sd(Livebirthtable$Boys) # The standard deviation for boys is 254.93
Girls_sd <- sd(Livebirthtable$Girls) # The standard deviation for girls is 215.09
# The data for boys are more dispersed than those for girls (larger standard deviation and wider range).
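
The answers to questions (3) to (5) and the dispersion comparison above can also be computed rather than read off the summary table. The following is a minimal sketch, not part of the original assignment: `eda_checks` is a hypothetical helper name, and the 1.5 x IQR rule (as applied by `boxplot.stats()`) is just one common outlier convention. The last line only runs if `Livebirthtable` already exists in the session.

```r
# Hypothetical helper (an illustration, not the assignment's method):
# per-column checks for every numeric column of a data frame.
eda_checks <- function(df) {
  # Keep numeric columns only (Years is character and is skipped)
  num <- df[vapply(df, is.numeric, logical(1))]

  out <- data.frame(
    variable   = names(num),
    # Number of NA values per column
    n_missing  = vapply(num, function(x) sum(is.na(x)), integer(1)),
    # Number of values beyond 1.5 * IQR from the hinges
    n_outliers = vapply(num, function(x) length(boxplot.stats(x)$out), integer(1)),
    mean       = vapply(num, mean, numeric(1), na.rm = TRUE),
    sd         = vapply(num, sd, numeric(1), na.rm = TRUE),
    row.names  = NULL
  )

  # Coefficient of variation: dispersion relative to the mean,
  # which makes the two variables directly comparable
  out$cv <- out$sd / out$mean
  out
}

# Apply to the homework data when it is available
if (exists("Livebirthtable")) print(eda_checks(Livebirthtable))
```

A larger `cv` for one column than another indicates more relative spread, which is a slightly fairer comparison than the raw standard deviations when the means differ.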

Including Plots
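
Task 1 asks for at least one plot. A minimal sketch of one possibility is a line chart of live births per year with one line per sex. The helper names `births_long` and `plot_births`, the chart type, and the title and axis labels are assumptions made for illustration. The column names `Years`, `Boys` and `Girls` come from the summary output above. The last line only draws the chart if `Livebirthtable` exists in the session.

```r
# Hypothetical helper: reshape Years/Boys/Girls from wide to long format,
# one row per year and sex, which is the layout ggplot2 works with best.
births_long <- function(df) {
  data.frame(
    Years  = rep(as.character(df$Years), times = 2),
    Sex    = rep(c("Boys", "Girls"), each = nrow(df)),
    Births = c(df$Boys, df$Girls)
  )
}

# Hypothetical helper: line chart of live births per year, one line per sex.
# Years is character (a discrete axis), so group = Sex is needed to connect points.
plot_births <- function(df) {
  ggplot2::ggplot(
    births_long(df),
    ggplot2::aes(x = Years, y = Births, colour = Sex, group = Sex)
  ) +
    ggplot2::geom_line() +
    ggplot2::geom_point() +
    ggplot2::labs(
      title = "Live births in Estonia by sex",
      x = "Year",
      y = "Number of live births"
    )
}

# Draw the chart for the homework data when it is available
if (exists("Livebirthtable")) print(plot_births(Livebirthtable))
```

Calling the functions through `ggplot2::` keeps the sketch independent of which packages happen to be attached. In the report itself, where `library(ggplot2)` has already been run, the prefixes can be dropped.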
