Homework Assignment 7

Download the gapminder dataset into an R Markdown session.

Create a report in R Markdown with the following sections:

Introduction (and some EDA)

We are looking at the data set gapminder. In this section, I briefly want to describe the data set and its variables. To look at a description of the data set, I can use library(help = "gapminder"). The data set is an excerpt of the data available on the gapminder.org website. The data set includes information on life expectancy, GDP per capita, and population, every five years, from 1952 to 2007, for 142 countries.

The data set consists of a tibble (1,704 x 6) and includes the variables country, continent, year, lifeExp (life expectancy), pop (population) and gdpPercap (GDP per capita). The variables country and continent are factors, the variables year and pop are integers, and the variables lifeExp and gdpPercap are doubles.
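
A minimal sketch of how the data set could be loaded and its size checked, assuming the gapminder package is installed (install.packages("gapminder") would be needed otherwise):

```r
# Assumption: the gapminder package is installed.
library(gapminder)

dim(gapminder)     # number of rows and number of columns
names(gapminder)   # column names
head(gapminder)    # first six rows
```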

Questions for EDA

Are there any NA values?

There are no NA values in the data set.
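
One way this could be checked, as a sketch using only base R functions on the gapminder tibble:

```r
# Assumption: the gapminder package is installed.
library(gapminder)

anyNA(gapminder)            # TRUE if at least one value is missing
colSums(is.na(gapminder))   # number of missing values per column
```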

Do the names of columns and rows make sense? Do I need to change something about it?

The names of columns and rows make sense, so I do not have to change anything. The data set is also in long format: variables are in columns, observations in rows and values in cells.

What are the data types of the variables? Do we need to change them?

To answer that question, I can look at the structure of the data set:

  • country: Factor with 142 levels
  • continent: Factor with 5 levels
  • year: integer
  • lifeExp: numeric
  • pop: integer
  • gdpPercap: numeric

The data types make sense, so I do not have to change anything there.
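
The structure listed above can be inspected with base R, for example as in this sketch:

```r
# Assumption: the gapminder package is installed.
library(gapminder)

str(gapminder)                               # compact overview of every column
column_classes <- sapply(gapminder, class)   # class of each column as a named vector
column_classes
```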

Visualisation of data

I specifically want to look at the variables lifeExp (life expectancy) and gdpPercap (GDP per capita). Both are measured at the ratio level, so I can potentially look at their correlation later using Pearson’s r. Here is a scatterplot of lifeExp plotted against gdpPercap:
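
A scatterplot like this could be drawn with ggplot2. The sketch below is an assumption about how the plot might be produced, not necessarily the code used for this report; the object name scatter_all, the transparency and the axis labels are my own choices:

```r
# Assumption: the gapminder and ggplot2 packages are installed.
library(gapminder)
library(ggplot2)

# lifeExp on the y-axis, gdpPercap on the x-axis
scatter_all <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.4) +   # semi-transparent points to reduce overplotting
  labs(x = "GDP per capita", y = "Life expectancy (years)")

scatter_all
```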

Are there any outliers?

When plotting lifeExp and gdpPercap against each other, there are some outliers due to very high GDP per capita in some countries. There is also one outlier with a very low lifeExp value.

More plots

Let’s look at boxplots for life expectancy and GDP per capita for the continents:

We can see that life expectancy is typically lowest in Africa and highest in Oceania.

We can see that GDP per capita is also typically lowest in Africa and highest in Oceania.
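
As a sketch, boxplots per continent could be drawn with ggplot2, and the medians (the centre lines of the boxes) can be computed with base R to back up the reading of the plots. The object names box_lifeexp, box_gdp and medians are hypothetical:

```r
# Assumption: the gapminder and ggplot2 packages are installed.
library(gapminder)
library(ggplot2)

# One box per continent for each variable
box_lifeexp <- ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot()
box_gdp <- ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
  geom_boxplot()

box_lifeexp
box_gdp

# Median of both variables per continent
medians <- aggregate(cbind(lifeExp, gdpPercap) ~ continent,
                     data = gapminder, FUN = median)
medians
```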

Description of data transformation

As a next step, I want to prepare a new data set that only includes the variables of interest: life expectancy and GDP per capita, separated by continent. After that, I want to look specifically at Africa. I first create vectors for lifeExp, gdpPercap and continent and then combine them into a tibble:

## # A tibble: 1,704 × 3
##    lifeExp gdpPercap continent
##      <dbl>     <dbl> <fct>    
##  1    28.8      779. Asia     
##  2    30.3      821. Asia     
##  3    32.0      853. Asia     
##  4    34.0      836. Asia     
##  5    36.1      740. Asia     
##  6    38.4      786. Asia     
##  7    39.9      978. Asia     
##  8    40.8      852. Asia     
##  9    41.7      649. Asia     
## 10    41.8      635. Asia     
## # … with 1,694 more rows
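
A tibble like the one above could be built from three vectors roughly as follows. This is a sketch: the name tibble.variables.of.interest is an assumption, modelled on the name of the Africa tibble that appears in the test output further down.

```r
# Assumption: the gapminder and tibble packages are installed.
library(gapminder)
library(tibble)

# Extract the three columns of interest as vectors
lifeExp   <- gapminder$lifeExp
gdpPercap <- gapminder$gdpPercap
continent <- gapminder$continent

# Combine the vectors into a new tibble; the column names are taken
# from the names of the vectors
tibble.variables.of.interest <- tibble(lifeExp, gdpPercap, continent)

tibble.variables.of.interest
summary(tibble.variables.of.interest)
```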

Here is a summary of the tibble:

##     lifeExp        gdpPercap           continent  
##  Min.   :23.60   Min.   :   241.2   Africa  :624  
##  1st Qu.:48.20   1st Qu.:  1202.1   Americas:300  
##  Median :60.71   Median :  3531.8   Asia    :396  
##  Mean   :59.47   Mean   :  7215.3   Europe  :360  
##  3rd Qu.:70.85   3rd Qu.:  9325.5   Oceania : 24  
##  Max.   :82.60   Max.   :113523.1

I want to look at life expectancy and GDP per capita specifically in Africa, so I use the pipe operator and the filter() function, which gives the following table:

## # A tibble: 624 × 3
##    lifeExp gdpPercap continent
##      <dbl>     <dbl> <fct>    
##  1    43.1     2449. Africa   
##  2    45.7     3014. Africa   
##  3    48.3     2551. Africa   
##  4    51.4     3247. Africa   
##  5    54.5     4183. Africa   
##  6    58.0     4910. Africa   
##  7    61.4     5745. Africa   
##  8    65.8     5681. Africa   
##  9    67.7     5023. Africa   
## 10    69.2     4797. Africa   
## # … with 614 more rows
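
The filtering step could look like the sketch below. To keep the example self-contained it starts from gapminder directly and uses select() instead of the vectors; the tibble name matches the one shown in the test output further down.

```r
# Assumption: the gapminder and dplyr packages are installed.
library(gapminder)
library(dplyr)

tibble.variables.of.interest.africa <- gapminder %>%
  select(lifeExp, gdpPercap, continent) %>%   # keep the three variables of interest
  filter(continent == "Africa")               # keep only the rows for Africa

tibble.variables.of.interest.africa
```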

More EDA and data visualisation

How are the data distributed?

  • lifeExp: looks like a normal distribution

  • GDP per capita: not a normal distribution, rather positively skewed
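
Histograms are one way to judge the shape of both distributions by eye. The following is a sketch under the assumption that ggplot2 is used; the object names and the number of bins are my own choices. For a positively skewed variable, the mean is pulled above the median, which can be checked directly:

```r
# Assumption: the gapminder, dplyr and ggplot2 packages are installed.
library(gapminder)
library(dplyr)
library(ggplot2)

africa <- gapminder %>% filter(continent == "Africa")

hist_lifeexp <- ggplot(africa, aes(x = lifeExp)) +
  geom_histogram(bins = 30)
hist_gdp <- ggplot(africa, aes(x = gdpPercap)) +
  geom_histogram(bins = 30)

hist_lifeexp
hist_gdp

# Compare mean and median of GDP per capita
mean(africa$gdpPercap)
median(africa$gdpPercap)
```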

Correlation analysis

Visualisation

Let’s look at a ggplot for the correlation of life expectancy and GDP per capita in Africa:

Checking assumptions for correlation analysis

I first planned to use Pearson’s r as the correlation coefficient, but I already found that GDP per capita is not normally distributed, and Pearson’s r assumes approximately normally distributed data. We can also see in the above plot that there are outliers, which might be a problem as well. But let’s still go through the assumptions for Pearson’s r:

Assumptions Pearson’s r:

  • Level of measurement: Pearson’s r requires continuous data at interval/ratio level of measurement. lifeExp and gdpPercap both have a ratio level of measurement.

  • Linear relationship: For Pearson’s r, the co-variation should be linear, i.e. the variables should have a linear relationship. We can see in the above scatterplot that this is the case.

  • No outliers: For using Pearson’s r, there should not be outliers. There are actually some outliers.

  • Variables should be normally/near-to-normally distributed: Life expectancy in Africa is normally distributed, GDP per capita is not. One option would be to transform the data but I am not yet familiar with that.
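
As an illustration of the transformation option mentioned in the last bullet: taking the logarithm is one common choice for positively skewed variables such as GDP per capita. The sketch below is not part of the analysis in this report; africa_log and log_gdpPercap are hypothetical names, and the base-10 logarithm is just one possible choice.

```r
# Assumption: the gapminder, dplyr and ggplot2 packages are installed.
library(gapminder)
library(dplyr)
library(ggplot2)

africa_log <- gapminder %>%
  filter(continent == "Africa") %>%
  mutate(log_gdpPercap = log10(gdpPercap))   # hypothetical new column

# Histogram of the transformed variable, to compare with the raw one
hist_log_gdp <- ggplot(africa_log, aes(x = log_gdpPercap)) +
  geom_histogram(bins = 30)
hist_log_gdp
```

Whether the transformed variable is close enough to a normal distribution would still have to be judged from the plot.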

Assumptions Spearman’s rho:

Due to the violation of assumptions for Pearson’s r, it could make more sense to use Spearman’s rho as a correlation coefficient in this case. Let’s look at the assumptions:

  • Level of measurement: Spearman’s rho requires an ordinal, interval or ratio scale of measurement for the variables. Both lifeExp and gdpPercap have a ratio level of measurement.

  • The two variables represent paired observations: This assumption is met.

  • Monotonic relationship between the two variables: Assumption is met as we can see in the above scatterplot.

Interpretation of correlation coefficient

First, I let R calculate the correlation coefficient. As explained above, I decided to calculate Spearman’s rho.

## [1] 0.4893888

Nevertheless, I also want to look at Pearson’s r:

## [1] 0.4256076
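
Both coefficients can be obtained with the base R function cor() by switching the method argument. A sketch, with the Africa tibble rebuilt so that the example runs on its own (rho and r are hypothetical object names):

```r
# Assumption: the gapminder and dplyr packages are installed.
library(gapminder)
library(dplyr)

tibble.variables.of.interest.africa <- gapminder %>%
  select(lifeExp, gdpPercap, continent) %>%
  filter(continent == "Africa")

# Spearman's rho (rank-based)
rho <- cor(tibble.variables.of.interest.africa$lifeExp,
           tibble.variables.of.interest.africa$gdpPercap,
           method = "spearman")

# Pearson's r (the default method of cor())
r <- cor(tibble.variables.of.interest.africa$lifeExp,
         tibble.variables.of.interest.africa$gdpPercap,
         method = "pearson")

rho
r
```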

Spearman’s rho and Pearson’s r do not differ much here. I consider Spearman’s rho the more appropriate measure because the assumptions for Pearson’s r are violated.

Both correlation coefficients indicate a moderate positive correlation between life expectancy and GDP per capita in the data included in this data set (the sample), i.e. when one variable is higher, the other tends to be higher too. We can also see the positive correlation in the above graph. This does not yet mean that the correlation is significant, i.e. we do not yet know whether we can draw conclusions about the underlying population.

Hypothesis testing

Null and alternative hypothesis

Null hypothesis: There is no relationship between lifeExp and gdpPercap in Africa.

Alternative hypothesis: There is a relationship between lifeExp and gdpPercap in Africa.

Collected data and sample size

We are looking at the gapminder data set, which includes data from the gapminder.org website. Specifically, we look at the 52 African countries included in the data set. Life expectancy, GDP per capita and population were recorded for these countries at 12 time points (every five years, from 1952 to 2007), which gives 624 observations.
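
The sample size can be verified with a short dplyr summary. This is a sketch; sample_size and its column names are hypothetical:

```r
# Assumption: the gapminder and dplyr packages are installed.
library(gapminder)
library(dplyr)

sample_size <- gapminder %>%
  filter(continent == "Africa") %>%
  summarise(
    countries    = n_distinct(country),   # number of African countries
    years        = n_distinct(year),      # number of time points
    observations = n()                    # number of rows (country-year pairs)
  )

sample_size
```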

Testing for significance

I will test for significance at the 5 % level.

First, I will calculate Spearman’s rho:

## 
##  Spearman's rank correlation rho
## 
## data:  tibble.variables.of.interest.africa$lifeExp and tibble.variables.of.interest.africa$gdpPercap
## S = 20677201, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4893888

And also, just to check, I will calculate Pearson’s r:

## 
##  Pearson's product-moment correlation
## 
## data:  tibble.variables.of.interest.africa$lifeExp and tibble.variables.of.interest.africa$gdpPercap
## t = 11.73, df = 622, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3591151 0.4878012
## sample estimates:
##       cor 
## 0.4256076
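
The two tests above can be run with the base R function cor.test(). In this sketch the Africa tibble is rebuilt so that the example runs on its own; spearman_test and pearson_test are hypothetical object names, and exact = FALSE is my own choice to request the asymptotic p-value, which avoids a warning if the data contain ties.

```r
# Assumption: the gapminder and dplyr packages are installed.
library(gapminder)
library(dplyr)

tibble.variables.of.interest.africa <- gapminder %>%
  select(lifeExp, gdpPercap, continent) %>%
  filter(continent == "Africa")

# Spearman's rank correlation test (two-sided by default)
spearman_test <- cor.test(tibble.variables.of.interest.africa$lifeExp,
                          tibble.variables.of.interest.africa$gdpPercap,
                          method = "spearman", exact = FALSE)

# Pearson's product-moment correlation test (two-sided by default)
pearson_test <- cor.test(tibble.variables.of.interest.africa$lifeExp,
                         tibble.variables.of.interest.africa$gdpPercap,
                         method = "pearson")

spearman_test
pearson_test

# Compare both p-values with the 5 % significance level
spearman_test$p.value < 0.05
pearson_test$p.value < 0.05
```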

We can see that the p-value is really small (p-value < 2.2e-16), both in case of Spearman’s rho and Pearson’s r. It is much lower than 0.05. We can therefore reject the null hypothesis (“There is no relationship between lifeExp and gdpPercap in Africa.”). The alternative hypothesis (“There is a relationship between lifeExp and gdpPercap in Africa.”) is supported.

Reporting the results

The correlation coefficient and significance test indicate a significant (p < 2.2e-16), moderate positive correlation between life expectancy and GDP per capita in Africa. Since the correlation is significant, we can draw conclusions from the sample about the population. In this case, the population is the countries of Africa.

It is important to note that correlation does not mean causation. We cannot say, for example, that higher life expectancy is due to a higher GDP per capita. The correlation only tells us that these variables are related, not whether one causes changes in the other or whether other variables are involved that lead to this correlation.