R Markdown

I have imported data on monthly disposable income per household member.

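# NOTE: dec = "," does not match the "." decimal mark in the file, so the income columns are read as character
mydata <- read.csv("C:/Users/paul-/Desktop/OST/Ranalytics/HW4.csv", header = TRUE, sep = ",", quote = "\"", dec = ",", stringsAsFactors=FALSE, fill = TRUE)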
mydata
##   Source.of.income X2020.Males X2020.Females
## 1             Wage       620.7         554.6
## 2  Self-employment        10.6           8.3
## 3         Property        16.8          14.3
## 4        Transfers       187.0         225.6
## 5          pension       120.9         162.2
## 6    Child benefit        16.7          16.1
## 7            Other        13.5          12.3
## 8     Non-monetary        18.8          14.8
summary(mydata)
##  Source.of.income   X2020.Males        X2020.Females     
##  Length:8           Length:8           Length:8          
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
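The summary reports every column as character, which suggests the decimal mark was not recognised on import. Here is a minimal sketch of reading the same table with `dec = "."`. It assumes the file looks like the table printed above. The inline `csv_text` and its header names are a hand-typed stand-in for `HW4.csv`, not the actual file contents.

```r
# Hand-typed stand-in for HW4.csv (assumed layout, copied from the printed table)
csv_text <- "Source of income,2020 Males,2020 Females
Wage,620.7,554.6
Self-employment,10.6,8.3
Property,16.8,14.3
Transfers,187.0,225.6
pension,120.9,162.2
Child benefit,16.7,16.1
Other,13.5,12.3
Non-monetary,18.8,14.8"

# dec = "." matches the decimal mark in the data, so the income columns parse as numeric
income <- read.csv(text = csv_text, header = TRUE, sep = ",", dec = ".",
                   stringsAsFactors = FALSE)

str(income)      # the two income columns should now be numeric
summary(income)  # numeric columns get min, quartiles, mean and max
```

With numeric columns, `summary()` reports quartiles and means instead of only length and class.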
  1. Prepare EDA (Exploratory Data Analysis). Write at least 5 questions and provide your answers during the first stage of EDA.

2.1. How many variables are in this data set and what are they?

The data set has three columns: Source.of.income, X2020.Males, and X2020.Females. Conceptually, there are 2 independent variables (IVs), source of income and gender, and one dependent variable (DV), income.

2.2. Whose income has been measured and in how many different categories?

Income has been measured for males and females across 8 income sources, so each source can be compared between the two groups.

2.3. Which year is the data from, and which population does it describe?

The data is from 2020 and describes the Estonian population.

2.4. How does the data seem to be distributed?

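Income is very unevenly distributed across sources. Wage dominates (620.7 for males, 554.6 for females), followed by Transfers and pension, while every other source is below 20.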

2.5. What stands out about the data?

Males have a higher income in most categories; however, females have higher incomes from Transfers and pension.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
tibble <- as_tibble(mydata)
tibble
## # A tibble: 8 × 3
##   Source.of.income X2020.Males X2020.Females
##   <chr>            <chr>       <chr>        
## 1 Wage             620.7       554.6        
## 2 Self-employment  10.6        8.3          
## 3 Property         16.8        14.3         
## 4 Transfers        187.0       225.6        
## 5 pension          120.9       162.2        
## 6 Child benefit    16.7        16.1         
## 7 Other            13.5        12.3         
## 8 Non-monetary     18.8        14.8
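# NOTE: as.integer() drops the decimals (620.7 becomes 620); as.numeric() would keep them
tibble$X2020.Females <- as.integer(tibble$X2020.Females)
tibble$X2020.Males <- as.integer(tibble$X2020.Males)
# NOTE: Source.of.income holds text labels, so this coercion turns every row into NA
tibble$Source.of.income <- as.integer(tibble$Source.of.income)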
## Warning: NAs introduced by coercion
barplot(tibble$X2020.Females)

barplot(tibble$X2020.Males)

str(tibble)
## tibble [8 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Source.of.income: int [1:8] NA NA NA NA NA NA NA NA
##  $ X2020.Males     : int [1:8] 620 10 16 187 120 16 13 18
##  $ X2020.Females   : int [1:8] 554 8 14 225 162 16 12 14
tibble$X2020.Males > tibble$X2020.Females
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE

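Males have higher disposable income in 6 categories, while females are higher in 2 (pension and Transfers). The comparison above shows only 5 TRUE values because as.integer() truncated the Child benefit values 16.7 and 16.1 to 16, which makes them look equal.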
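Here is a sketch of the same comparison using `as.numeric()` rather than `as.integer()`, so that the decimals are kept. The data frame `income` is a hand-typed copy of the table printed earlier. Its column names (`source`, `males`, `females`) are illustrative and do not come from the original file.

```r
# Hand-typed copy of the printed table, stored as character to mirror the import
income <- data.frame(
  source  = c("Wage", "Self-employment", "Property", "Transfers",
              "pension", "Child benefit", "Other", "Non-monetary"),
  males   = c("620.7", "10.6", "16.8", "187.0", "120.9", "16.7", "13.5", "18.8"),
  females = c("554.6", "8.3", "14.3", "225.6", "162.2", "16.1", "12.3", "14.8"),
  stringsAsFactors = FALSE
)

# as.numeric() keeps the decimals; as.integer() would cut 16.7 and 16.1 down to 16
income$males   <- as.numeric(income$males)
income$females <- as.numeric(income$females)

# Element-wise comparison, one result per income source
males_higher <- income$males > income$females

table(males_higher)            # how many sources fall on each side
income$source[!males_higher]   # sources where females have the higher value
```

Leaving the `source` column as character keeps the labels available for indexing and for plot labels.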

Males <- tibble$X2020.Males
Females <- tibble$X2020.Females
range(Males)
## [1]  10 620
range(Females)
## [1]   8 554

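Males have the wider range: 10.6 to 620.7 (a spread of 610.1), against 8.3 to 554.6 (a spread of 546.3) for females. The output above shows these values truncated to integers. For both groups the maximum is Wage and the minimum is Self-employment. Males have the highest single value, while the lowest belongs to females.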
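The range comparison can be made explicit by computing the spread (maximum minus minimum) for each group. This sketch uses the untruncated values typed in from the printed table. The vector names `males` and `females` are illustrative.

```r
# Values typed in from the printed table, in the same row order
males   <- c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8)
females <- c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8)

range(males)     # minimum and maximum for males
range(females)   # minimum and maximum for females

# diff(range(x)) gives max(x) - min(x), i.e. the width of the range
spread_males   <- diff(range(males))
spread_females <- diff(range(females))

c(males = spread_males, females = spread_females)
```

Because both ranges are driven by a single large value (Wage), the spread mostly reflects the wage gap rather than differences across the smaller sources.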

Including Plots

Here is a box plot, although I do not yet know how to add proper titles and axis labels.
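One possible way to add titles in base R is through the `main`, `xlab`, and `ylab` arguments, which `boxplot()` and `barplot()` both accept. This is only a sketch. The data vectors are typed in from the printed table, and the title and label texts are placeholders to adapt.

```r
# Values typed in from the printed table, in the same row order
males   <- c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8)
females <- c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8)

# A named list gives each box its label on the x axis
bp <- boxplot(
  list(Males = males, Females = females),
  main = "Disposable income per household member, 2020",  # plot title (placeholder text)
  xlab = "Group",                                          # x-axis title
  ylab = "Monthly income by source"                        # y-axis title
)
```

For the bar plots, the same three arguments apply, and `names.arg` can supply the income source labels under each bar.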

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.