Homework Assignment 7

Load the gapminder dataset into an R Markdown session.

  1. Perform a quick EDA and pick two variables you want to explore in more depth (for example, life expectancy and GDP) and a subset of the data set (for instance, only certain continents or countries).

EDA: (1) Are there any null values? No. (2) How many observations? 1,704.
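The two EDA answers above can be checked with a few lines of R. This is only a minimal sketch: the original loading code is not shown, so it assumes the data come from the `gapminder` package, and the names `n_missing` and `n_obs` are introduced here for illustration.

```r
# Assumption: the dataset is the `gapminder` tibble from the gapminder package.
library(gapminder)

n_missing <- sum(is.na(gapminder))  # total number of NA cells in the data
n_obs <- nrow(gapminder)            # number of observations (rows)

print(n_missing)
print(n_obs)
str(gapminder)  # variable names and types, useful for picking two variables
```

If the data were read from a file instead, the same two calls would apply to whatever object holds the imported table.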

  2. Prepare the data set so that it includes only the variables of interest, in a format suitable for analysis (use the dplyr package, and tidyr when necessary).

  3. Explore the two variables and how they are associated with each other (correlation analysis).

print(data)
## # A tibble: 1,704 × 3
## # Groups:   pop [1,704]
##    lifeExp   pop continent
##      <dbl> <int> <fct>    
##  1    46.5 60011 Africa   
##  2    48.9 61325 Africa   
##  3    34.8 63149 Africa   
##  4    51.9 65345 Africa   
##  5    54.4 70787 Africa   
##  6    37.3 71851 Africa   
##  7    56.5 76595 Africa   
##  8    58.6 86796 Africa   
##  9    39.7 89898 Africa   
## 10    60.4 98593 Africa   
## # … with 1,694 more rows
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
ggplot(data, aes(x=pop, y=lifeExp)) +
  geom_point() +
  labs(x="Population", y="Life expectancy", title="Correlation between life expectancy and population") +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

ggplot(data, aes(x=lifeExp)) +
  geom_histogram() +
  labs(x="Life expectancy")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data, aes(x=pop)) +
  geom_histogram() +
  labs(x="Population")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2 Calculate the correlation coefficient and provide your interpretation.

cor.test(data$pop, data$lifeExp, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  data$pop and data$lifeExp
## S = 675689264, p-value = 5.825e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1806119
  4. Hypothesis testing.
    4.1 State the null hypothesis and the alternative hypothesis.
    4.2 Report on the collected data and the sample size.
    4.3 Perform a Pearson correlation test between the two variables.
cor.test(data$pop, data$lifeExp, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  data$pop and data$lifeExp
## t = 2.6854, df = 1702, p-value = 0.007314
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01752303 0.11209600
## sample estimates:
##        cor 
## 0.06495537

4.4 Decide whether to reject or fail to reject your null hypothesis, and report the selected significance level.
4.5 Interpret and report the results.

  5. Create a report in R Markdown with the following sections:

Introduction

‘Gapminder’ is a dataset that records variables such as life expectancy, population, and GDP per capita for countries on different continents over the years 1952-2007. I chose to explore life expectancy and population across continents.

Data cleaning and transformation

Datasets can be modified with the dplyr and tidyr packages. With the select() function I created a new dataset containing only the variables I needed (lifeExp, pop, and continent). I grouped the data by population with the group_by() function; because every population value is unique, this produces 1,704 single-row groups and has no effect on the correlation results. I also sorted population from high to low by using desc() inside the arrange() function.
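The preparation steps described above might look like the following. This is a hedged sketch, not the code used for the report: it assumes the data come from the `gapminder` package, and `prepared` is a hypothetical object name. The group_by(pop) step is left out because the correlation tests operate on the plain columns.

```r
# Assumptions: data come from the gapminder package; `prepared` is a
# hypothetical name chosen for this sketch.
library(gapminder)
library(dplyr)

prepared <- gapminder %>%
  select(lifeExp, pop, continent) %>%  # keep only the variables of interest
  arrange(desc(pop))                   # sort population from high to low

print(prepared)
```

Sorting does not affect a correlation coefficient, so arrange() here only makes the printed table easier to read.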

Correlation analysis

Correlation analysis tests whether two variables are related. Two hypotheses were set for this:

H0: There is no linear relationship between population and life expectancy.

H1: There is a linear relationship between population and life expectancy.

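Scatter plots are often used to visualize correlation; they can be created with the ggplot2 package using the geom_point() function. A smoothed trend line can be added with geom_smooth(), which in this report defaulted to a GAM smoother (method = 'gam') and not a straight regression line.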
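Since the hypotheses concern a linear relationship, a straight least-squares line can be requested explicitly. The sketch below assumes the data come from the `gapminder` package; the object name `p` and the `alpha` transparency setting are illustrative choices, not part of the original report.

```r
# Assumption: data come from the gapminder package; `p` is a hypothetical name.
library(gapminder)
library(ggplot2)

p <- ggplot(gapminder, aes(x = pop, y = lifeExp)) +
  geom_point(alpha = 0.3) +                    # semi-transparent points reduce overplotting
  geom_smooth(method = "lm", formula = y ~ x) + # straight least-squares line with a confidence band
  labs(x = "Population", y = "Life expectancy")

print(p)
```

With method = "lm" the slope of the line corresponds directly to the sign of the Pearson coefficient, which makes the plot easier to relate to the test output.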

Pearson correlation measures linear association, and its significance test is most reliable when the variables are approximately normally distributed, which is not the case for the chosen variables. The Pearson test gave a p-value of 0.007314, which is below the selected significance level of 0.05, so the correlation is statistically significant. The estimated coefficient is 0.06495537 (95 percent confidence interval 0.0175 to 0.1121), which is positive but very weak. The significance reflects the large sample size (1,704 observations) and not a strong relationship.

Spearman’s correlation analysis is more fitting here because it is rank-based and does not assume normality, so it is less affected by skewness. The p-value is 5.825e-14, well below 0.05, meaning the monotonic relationship between the two variables is statistically significant. The coefficient rho = 0.1806119 is positive but weak.

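With these results the null hypothesis can be rejected at the 0.05 significance level: population and life expectancy are positively associated, although the association is weak.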
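The decision step can be made explicit by comparing each p-value with the chosen significance level. The following is a minimal sketch under the assumption that the data come from the `gapminder` package; the names `alpha`, `pearson`, `spearman`, and `results` are introduced only for this example.

```r
# Assumption: data come from the gapminder package; all object names are
# hypothetical and chosen for this sketch.
library(gapminder)

alpha <- 0.05  # significance level used in the report

pearson <- cor.test(gapminder$pop, gapminder$lifeExp, method = "pearson")
spearman <- cor.test(gapminder$pop, gapminder$lifeExp, method = "spearman")

# Collect the coefficient and p-value of each test in one table
results <- data.frame(
  method = c("pearson", "spearman"),
  estimate = unname(c(pearson$estimate, spearman$estimate)),
  p_value = c(pearson$p.value, spearman$p.value)
)
results$reject_h0 <- results$p_value < alpha  # TRUE means H0 is rejected

print(results)
```

Keeping the estimate next to the decision helps avoid reading a small p-value as evidence of a strong correlation.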