This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
##############################
## Homework assignment #4 ##
##############################
# Your full name: Na Zhu
# Date: 19/10/2022
####### TASK 1: 5 points. Short EDA report ###############
# The main aim of the task is to import data into R, perform a brief exploratory analysis, build at least one plot,
# and export the results using R Markdown by knitting the document to an HTML file.
# 1. Go to https://andmed.stat.ee/en/stat and create a data set you want to explore. Import the data into R.
library(tidyverse) # load the "tidyverse" package (it must already be installed)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
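library(readxl) # load the "readxl" package for reading Excel files
library(ggplot2) # already attached by tidyverse; loaded explicitly for clarity
Livebirth <- read_excel("~/LiveBirths.xlsx") # read the Excel file into a tibble
Livebirthtable <- data.frame(Livebirth) # store the data in a plain data frame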
# 2. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.
# (You can check the slides from Session 5 - Data Exploration (EDA) for ideas or come up with your own questions).
# (1) What are the observed population, the observation unit and the reference period?
# The observed population is all live births in Estonia, across all ethnic nationalities, in each year.
# The observation unit is a live-born baby, counted per year; the reference period is 2011 to 2021.
#(2)What are the data types of the variables? Do we need to change them?
summary(Livebirthtable)
##     Years                Boys          Girls
##  Length:11          Min.   :6736   Min.   :6459
##  Class :character   1st Qu.:6911   1st Qu.:6612
##  Mode  :character   Median :7191   Median :6688
##                     Mean   :7128   Mean   :6737
##                     3rd Qu.:7278   3rd Qu.:6876
##                     Max.   :7555   Max.   :7124
# Only Years is stored as character. Boys and Girls are already numeric (summary() reports quartiles for them),
# so the two conversions below are not strictly needed and are kept only as a safeguard.
Livebirthtable$Boys <- as.numeric(Livebirthtable$Boys)
Livebirthtable$Girls <- as.numeric(Livebirthtable$Girls)
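One way to confirm the column types before converting anything is to query the class of every column. The sketch below uses a small made-up data frame (`toy`, with invented values, not the Statistics Estonia data) that mimics the layout of `Livebirthtable`. The assumption that `Years` holds plain digit strings may not hold for the real file, so the conversion is guarded.

```r
# Toy data with invented values; the column layout mirrors Livebirthtable.
toy <- data.frame(
  Years = c("2011", "2012", "2013"),
  Boys  = c(7500, 7200, 7000),
  Girls = c(7100, 6800, 6600),
  stringsAsFactors = FALSE
)

# Class of each column, returned as a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)

# Convert Years only if every entry consists of digits (an assumption
# about the data that should be checked against the real file).
if (all(grepl("^[0-9]+$", toy$Years))) {
  toy$Years <- as.integer(toy$Years)
}
```

Running the same `sapply(Livebirthtable, class)` call on the imported data would show directly which columns, if any, need conversion.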
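# (3) Are there any outliers? Are there any values that look like errors?
# No. By the 1.5 * IQR rule applied to the summary above, all values for boys lie within roughly 6360 to 7830
# and all values for girls within roughly 6216 to 7272, so there are no outliers, and no values look like errors.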
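The outlier question can also be checked in code rather than by eye. The following is a minimal sketch of the common 1.5 * IQR rule; `iqr_outliers` is a hypothetical helper written for this illustration, and `example_counts` contains invented values with one deliberately extreme entry.

```r
# Return the values lying outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
# iqr_outliers is a hypothetical helper, not part of any package.
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * spread
  upper <- q[[2]] + 1.5 * spread
  x[!is.na(x) & (x < lower | x > upper)]
}

# Invented values: the last one is deliberately extreme.
example_counts <- c(6700, 6900, 7000, 7100, 7200, 7300, 12000)
flagged <- iqr_outliers(example_counts)
print(flagged)
```

On the real data the call would be `iqr_outliers(Livebirthtable$Boys)` and `iqr_outliers(Livebirthtable$Girls)`; an empty result supports the answer that there are no outliers.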
# (4) What is the mean for each variable?
# The mean number of live-born boys is 7128. The mean number of live-born girls is 6737.
# (5) Are there any NULL / NA values?
# No, there are none.
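A missing-value check can be made explicit with `is.na()`. The sketch below counts NA values per column on a made-up data frame (`toy`, invented values with one NA inserted on purpose) so that the output is not trivially zero.

```r
# Toy data with invented values; one NA is inserted deliberately.
toy <- data.frame(
  Years = c("2011", "2012", "2013"),
  Boys  = c(7500, NA, 7000),
  Girls = c(7100, 6800, 6600),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame.
has_missing <- anyNA(toy)
print(has_missing)
```

For the imported data, `colSums(is.na(Livebirthtable))` returning all zeros would confirm the answer above.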
# 3. Provide brief descriptive statistical analysis of your data set (like measures of central tendency and dispersion).
Boys_range <- range(Livebirthtable$Boys) # The number of live-born boys ranges from 6736 to 7555
Girls_range <- range(Livebirthtable$Girls) # The number of live-born girls ranges from 6459 to 7124
# The mean number of live-born boys is 7128. The mean number of live-born girls is 6737.
Boys_sd <- sd(Livebirthtable$Boys) # The standard deviation for boys is 254.93
Girls_sd <- sd(Livebirthtable$Girls) # The standard deviation for girls is 215.09
# The counts for boys are more dispersed than those for girls (larger standard deviation)
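The task also asks for at least one plot. The sketch below shows one possible way to draw yearly counts for boys and girls as two lines with ggplot2. It uses a made-up data frame (`toy`, invented values) in the same layout as `Livebirthtable`; the names `toy_long` and `births_plot` and the axis labels are choices made for this illustration.

```r
library(ggplot2)
library(tidyr)

# Toy data with invented values; the column layout mirrors Livebirthtable.
toy <- data.frame(
  Years = c("2011", "2012", "2013"),
  Boys  = c(7500, 7200, 7000),
  Girls = c(7100, 6800, 6600),
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per year and sex.
toy_long <- pivot_longer(
  toy,
  cols = c(Boys, Girls),
  names_to = "Sex",
  values_to = "Births"
)

# Years is character, so group = Sex is needed to connect points into lines.
births_plot <- ggplot(toy_long, aes(x = Years, y = Births, colour = Sex, group = Sex)) +
  geom_line() +
  geom_point() +
  labs(title = "Live births by sex", x = "Year", y = "Number of live births")

print(births_plot)
```

Replacing `toy` with `Livebirthtable` should give the corresponding plot for the real data, provided the column names match.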