IMPORT THE DATA

y <- read.csv("MYMALES.csv")
y
##   Year Age.group Males.Estonia
## 1 2013     30-34           172
## 2 2014     30-34           197
## 3 2015     30-34           717
## 4 2016     30-34           647
## 5 2017     30-34           714
## 6 2018     30-34           657
## 7 2019     30-34           618
## 8 2020     30-34           458
## 9 2021     30-34           578
y <- y[, -2]  # drop Age.group (column 2): it is 30-34 in every row
y
##   Year Males.Estonia
## 1 2013           172
## 2 2014           197
## 3 2015           717
## 4 2016           647
## 5 2017           714
## 6 2018           657
## 7 2019           618
## 8 2020           458
## 9 2021           578
  1. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.

Do I need to clean my data? Does it have any null values or outliers that might affect my graphical analysis?

The data being analysed is a sample of male immigrants in Estonia in the 30-34 age range, covering the years 2013 to 2021.

The data set consists of two variables: column 1 holds the year and column 2 holds the number of male immigrants in Estonia.

I had to clean my data because the original second column (Age.group) had the same value, 30-34, for every year in the data, so I removed it.
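The cleaning questions above can be checked directly in R. The following is a minimal sketch, assuming the data frame is re-typed from the printed output rather than read from MYMALES.csv; the names `raw`, `na_per_column`, `constant_columns` and `outliers` are illustrative and not part of the original script.

```r
# Data re-typed from the printed output, including the original Age.group column
# (assumption: this matches the contents of MYMALES.csv).
raw <- data.frame(
  Year = 2013:2021,
  Age.group = rep("30-34", 9),
  Males.Estonia = c(172, 197, 717, 647, 714, 657, 618, 458, 578)
)

# Count missing values in each column.
na_per_column <- colSums(is.na(raw))
print(na_per_column)

# A column with only one distinct value carries no information for the analysis.
distinct_per_column <- sapply(raw, function(col) length(unique(col)))
constant_columns <- names(distinct_per_column)[distinct_per_column == 1]
print(constant_columns)

# Values lying more than 1.5 * IQR beyond the hinges (the rule boxplot() uses).
outliers <- boxplot.stats(raw$Males.Estonia)$out
print(outliers)
```

On these nine rows there are no missing values, Age.group is the only constant column, and the boxplot rule flags no outliers, although 172 and 197 are clearly lower than the later years.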

First I need to identify what kind of data this is. The number of immigrants is a whole-number count, so this is discrete numeric data rather than continuous data.

What is the relationship between the variables Year and Males.Estonia?

Do the low values of Year correspond to the low values of Males.Estonia?

Do the large values of Year correspond to the large values of Males.Estonia?

How do the values in the Year column change in comparison to the values in the Males.Estonia column?
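One way to start answering the relationship questions above is a correlation coefficient together with the year-on-year differences. This is only a sketch under the assumption that the data frame is re-typed from the printed output; `r` and `changes` are illustrative names.

```r
# Data re-typed from the printed output (assumption: same values as in the CSV).
y <- data.frame(
  Year = 2013:2021,
  Males.Estonia = c(172, 197, 717, 647, 714, 657, 618, 458, 578)
)

# Pearson correlation between year and the number of male immigrants.
r <- cor(y$Year, y$Males.Estonia)
print(round(r, 2))

# Change in the count from each year to the next.
changes <- data.frame(
  Year = y$Year[-1],
  Change = diff(y$Males.Estonia)
)
print(changes)
```

A positive `r` would mean that later years tend to have higher counts, but with only nine observations it should be read as a rough indication, not a firm result. The `changes` table shows where the series jumps or drops from one year to the next.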

(You can check the slides from Session 5 - Data Exploration (EDA) for ideas or come up with your own questions).

To check the first 6 rows of my data I use the head() function:

head(y)
##   Year Males.Estonia
## 1 2013           172
## 2 2014           197
## 3 2015           717
## 4 2016           647
## 5 2017           714
## 6 2018           657

To check the last 6 rows of my data I use the tail() function:

tail(y)
##   Year Males.Estonia
## 4 2016           647
## 5 2017           714
## 6 2018           657
## 7 2019           618
## 8 2020           458
## 9 2021           578

To check the structure of my data I use the str() function:

str(y)
## 'data.frame':    9 obs. of  2 variables:
##  $ Year         : int  2013 2014 2015 2016 2017 2018 2019 2020 2021
##  $ Males.Estonia: int  172 197 717 647 714 657 618 458 578

This shows that the data frame has 9 observations and 2 variables, both stored as integers.

  2. Provide a brief descriptive statistical analysis of your data set (like measures of central tendency and dispersion).
mean(y[,2])
## [1] 528.6667
hist(y[,2])

median(y[,2])
## [1] 618
boxplot(y[,2])
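The mean and median above cover central tendency; the task also asks for dispersion. A minimal sketch of the usual base R measures follows, assuming the counts are re-typed from the printed output; the variable names are illustrative.

```r
# Counts re-typed from the printed output (assumption: same as y[, 2]).
males <- c(172, 197, 717, 647, 714, 657, 618, 458, 578)

value_range <- range(males)      # smallest and largest value
spread <- diff(value_range)      # largest minus smallest
variance <- var(males)           # sample variance (divides by n - 1)
std_dev <- sd(males)             # sample standard deviation
iqr <- IQR(males)                # third quartile minus first quartile

print(value_range)
print(spread)
print(variance)
print(std_dev)
print(iqr)

# Minimum, quartiles, median, mean and maximum in one call.
print(summary(males))
```

The mean (528.67) is below the median (618), which suggests the distribution is pulled down by the two low values in 2013 and 2014.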

  3. Include at least one plot in your report. If ggplot2 is too complicated for you now, create a plot with R base functions.
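A plot with R base functions could look like the sketch below. It assumes the data frame is re-typed from the printed output; the axis labels, the title and the `mean_line` variable are illustrative choices, not part of the original report.

```r
# Data re-typed from the printed output (assumption: same values as in the CSV).
y <- data.frame(
  Year = 2013:2021,
  Males.Estonia = c(172, 197, 717, 647, 714, 657, 618, 458, 578)
)

# Line plot with points: one point per year, joined in time order.
plot(
  y$Year, y$Males.Estonia,
  type = "b", pch = 19,
  xlab = "Year",
  ylab = "Male immigrants aged 30-34",
  main = "Male immigrants in Estonia aged 30-34, 2013-2021"
)

# Dashed horizontal line at the mean, for reference.
mean_line <- mean(y$Males.Estonia)
abline(h = mean_line, lty = 2)
```

A line plot suits this data better than a histogram because it keeps the time order, so the jump between 2014 and 2015 and the dip in 2020 are visible at a glance.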