I imported data on monthly disposable income per household member.
mydata <- read.csv("C:/Users/paul-/Desktop/OST/Ranalytics/HW4.csv", header = TRUE, sep = ",", quote = "\"", dec = ",", stringsAsFactors=FALSE, fill = TRUE)
mydata
## Source.of.income X2020.Males X2020.Females
## 1 Wage 620.7 554.6
## 2 Self-employment 10.6 8.3
## 3 Property 16.8 14.3
## 4 Transfers 187.0 225.6
## 5 pension 120.9 162.2
## 6 Child benefit 16.7 16.1
## 7 Other 13.5 12.3
## 8 Non-monetary 18.8 14.8
summary(mydata)
## Source.of.income X2020.Males X2020.Females
## Length:8 Length:8 Length:8
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
2.1. How many variables are in this data set and what are they?
The data set has three columns: source of income, 2020 males and 2020 females. Conceptually there are 2 independent variables (IVs), type of income and gender, and one dependent variable (DV), income.
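The IV/DV reading becomes easier to see when the table is rearranged so that each row is one combination of income source and gender. The sketch below does this in base R; the names income, income_long, source, gender and income are illustrative choices, and the values are copied from the output above.

```r
# Values copied from the printed data; object names are illustrative.
income <- data.frame(
  Source.of.income = c("Wage", "Self-employment", "Property", "Transfers",
                       "pension", "Child benefit", "Other", "Non-monetary"),
  X2020.Males   = c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8),
  X2020.Females = c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8)
)

# One row per source/gender pair:
# `source` and `gender` are the IVs, `income` is the DV.
income_long <- data.frame(
  source = rep(income$Source.of.income, times = 2),
  gender = rep(c("Male", "Female"), each = nrow(income)),
  income = c(income$X2020.Males, income$X2020.Females)
)

head(income_long)
```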
2.2. Whose income has been measured and in how many different categories?
The income of males and females has been measured across 8 different sources of income, as well as in total. Each source can be compared between the two groups.
2.3. Which year were the data taken from, and from what population?
The data are from 2020 and describe the Estonian population.
2.4. How does the data seem to be distributed?
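Income is not evenly distributed across income types: wages dominate (620.7 for males, 554.6 for females), followed by Transfers and pension, while every other source is below 20.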
2.5. What stands out about the data?
Males have a higher income in most categories; females, however, have higher income from Transfers and pension.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
tibble <- as_tibble(mydata)
tibble
## # A tibble: 8 × 3
## Source.of.income X2020.Males X2020.Females
## <chr> <chr> <chr>
## 1 Wage 620.7 554.6
## 2 Self-employment 10.6 8.3
## 3 Property 16.8 14.3
## 4 Transfers 187.0 225.6
## 5 pension 120.9 162.2
## 6 Child benefit 16.7 16.1
## 7 Other 13.5 12.3
## 8 Non-monetary 18.8 14.8
tibble$X2020.Females <- as.integer(tibble$X2020.Females)
tibble$X2020.Males <- as.integer(tibble$X2020.Males)
tibble$Source.of.income <- as.integer(tibble$Source.of.income)
## Warning: NAs introduced by coercion
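The warning appears because Source.of.income holds text labels, which cannot be turned into integers. In addition, as.integer() discards the decimal part of the income values. A possible alternative, shown here on character vectors copied from the output above (males_chr and females_chr are illustrative names), is to convert only the two income columns with as.numeric() and to leave the labels as text.

```r
# Character values as they appear after the import (copied from the output).
males_chr   <- c("620.7", "10.6", "16.8", "187.0", "120.9", "16.7", "13.5", "18.8")
females_chr <- c("554.6", "8.3", "14.3", "225.6", "162.2", "16.1", "12.3", "14.8")

# as.integer() drops the decimal part, e.g. "620.7" becomes 620.
males_int <- as.integer(males_chr)

# as.numeric() keeps the decimals.
males   <- as.numeric(males_chr)
females <- as.numeric(females_chr)

males_int
males
```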
barplot(tibble$X2020.Females)
barplot(tibble$X2020.Males)
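The two bar plots above have no category labels or titles. As a sketch, barplot() accepts names.arg, main and ylab for this, and a matrix with beside = TRUE puts both groups into one chart. The title and axis texts are illustrative wording, and no currency is given because the source does not state one.

```r
sources <- c("Wage", "Self-employment", "Property", "Transfers",
             "pension", "Child benefit", "Other", "Non-monetary")
males   <- c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8)
females <- c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8)

# Single labelled bar plot; las = 2 rotates the category labels.
barplot(females, names.arg = sources, las = 2, cex.names = 0.7,
        main = "Disposable income by source, females (2020)",
        ylab = "Income per household member per month")

# Grouped bar plot: one row per gender, one column per income source.
income_matrix <- rbind(Males = males, Females = females)
colnames(income_matrix) <- sources
bar_mids <- barplot(income_matrix, beside = TRUE, las = 2, cex.names = 0.7,
                    legend.text = TRUE,
                    main = "Disposable income by source and gender (2020)",
                    ylab = "Income per household member per month")
```

The grouped version makes the male/female comparison within each source visible at a glance, which two separate plots do not.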
str(tibble)
## tibble [8 × 3] (S3: tbl_df/tbl/data.frame)
## $ Source.of.income: int [1:8] NA NA NA NA NA NA NA NA
## $ X2020.Males : int [1:8] 620 10 16 187 120 16 13 18
## $ X2020.Females : int [1:8] 554 8 14 225 162 16 12 14
tibble$X2020.Males > tibble$X2020.Females
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
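Males have higher disposable income in 6 of the 8 categories, while females have higher income in 2 (pension and Transfers). The comparison above also shows FALSE for Child benefit, but only because as.integer() truncated 16.7 and 16.1 to 16; on the original values males are higher there too.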
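The count of categories can be checked on the untruncated values. This sketch uses named vectors (values copied from the printed data, names illustrative) so that the result shows which category each TRUE or FALSE belongs to.

```r
sources <- c("Wage", "Self-employment", "Property", "Transfers",
             "pension", "Child benefit", "Other", "Non-monetary")
males   <- setNames(c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8), sources)
females <- setNames(c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8), sources)

# Element-wise comparison; the names carry over to the result.
males_higher <- males > females
males_higher

sum(males_higher)            # number of categories where males are higher
names(males)[!males_higher]  # categories where females are higher
```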
Males <- tibble$X2020.Males
Females <- tibble$X2020.Females
range(Males)
## [1] 10 620
range(Females)
## [1] 8 554
Men have the larger range across income types (10 to 620, compared with 8 to 554 for women), mainly because of their higher wage income. The lowest single value, however, belongs to women.
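To compare the spreads as single numbers, diff(range(x)) gives the distance between the smallest and largest value. The sketch below applies it to the decimal values copied from the printed data; the vector names are illustrative.

```r
males   <- c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8)
females <- c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8)

range(males)
range(females)

# Width of each range: largest value minus smallest value.
spread_males   <- diff(range(males))
spread_females <- diff(range(females))
c(Males = spread_males, Females = spread_females)
```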
Here is a box plot, although I do not yet know how to add proper titles and axis labels.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
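Titles and axis labels can be passed straight to boxplot() through its main, xlab and ylab arguments, and names sets the label under each box. Because the original plotting code is hidden, this is a sketch rather than the call that produced the plot: the data frame is rebuilt from the printed values and the label texts are illustrative.

```r
income <- data.frame(
  Males   = c(620.7, 10.6, 16.8, 187.0, 120.9, 16.7, 13.5, 18.8),
  Females = c(554.6, 8.3, 14.3, 225.6, 162.2, 16.1, 12.3, 14.8)
)

# main = plot title, xlab / ylab = axis titles, names = label under each box.
bp <- boxplot(income,
              names = c("Males", "Females"),
              main  = "Disposable income across income sources (2020)",
              xlab  = "Gender",
              ylab  = "Income per household member per month")
```

boxplot() also returns the statistics it draws (here stored in bp), so the medians and whisker ends can be inspected with bp$stats.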