Load the gapminder dataset into an R Markdown session.
Create a report in R Markdown with the following sections:
Introduction (brief description of the data set and variables)
Description of data transformation
Correlation analysis (steps for visualisation, checking assumptions for correlation analysis, interpretation of the correlation coefficient)
Hypothesis testing: using the Pearson r statistic to conduct hypothesis tests about population correlation values
Reporting the results
We are looking at the data set gapminder. In this
section, I briefly want to describe the data set and its variables. To
look at a description of the data set, I can use
library(help = "gapminder"). The data set is an excerpt of
the data available on the gapminder.org website. The data set includes
information on life expectancy, GDP per capita, and population, every
five years, from 1952 to 2007, for 142 countries.
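A minimal sketch of loading the data and confirming the figures quoted above, assuming the gapminder package is installed from CRAN:

```r
library(gapminder)

# Package-level documentation; only opened in interactive sessions
if (interactive()) library(help = "gapminder")

# Basic facts about the data set
n_rows      <- nrow(gapminder)
n_cols      <- ncol(gapminder)
n_countries <- nlevels(gapminder$country)
year_range  <- range(gapminder$year)

n_rows
n_cols
n_countries
year_range
```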
The data set consists of a tibble (1,704 x 6) and includes the variables country, continent, year, lifeExp (life expectancy), pop (population) and gdpPercap (GDP per capita). The variables country and continent are factors, the variables year and pop are integers, and the variables lifeExp and gdpPercap are doubles.
There are no NA values in the data set.
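One possible way to check for missing values is sketched here; the object names `any_missing` and `na_per_column` are only illustrative:

```r
library(gapminder)

# TRUE if at least one cell in the data set is NA
any_missing <- anyNA(gapminder)

# Number of NA values in each column
na_per_column <- colSums(is.na(gapminder))

any_missing
na_per_column
```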
The names of columns and rows make sense. I do not have to change anything about that. Also, the data set is in long format and variables are in columns, observations in rows and values in cells.
To check whether the data types are appropriate, I can look at the structure of the data set:
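The original code chunk is not shown in this report; a sketch of how the structure could be inspected with base R follows (`column_types` is an illustrative name):

```r
library(gapminder)

# Compact overview of column names, types and first values
str(gapminder)

# Class of each column as a named character vector
column_types <- sapply(gapminder, class)
column_types
```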
The data types make sense, so I do not have to change anything there.
I specifically want to look at the variables lifeExp (life expectancy) and gdpPercap (GDP per capita). Both are measured at the ratio level, so I can later potentially examine their correlation using Pearson’s r. Here is a scatterplot of lifeExp plotted against gdpPercap:
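A scatterplot like the one described could be drawn with ggplot2 roughly as follows. The styling (transparency, axis labels) and the name `scatter_all` are assumptions, not the settings of the original figure:

```r
library(gapminder)
library(ggplot2)

# lifeExp on the y-axis, gdpPercap on the x-axis
scatter_all <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.4) +
  labs(x = "GDP per capita", y = "Life expectancy (years)")

scatter_all
```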
When plotting lifeExp and gdpPercap against each other, there are some outliers due to very high GDP per capita in some countries. There is one outlier for lifeExp (very low).
Let’s look at boxplots for life expectancy and GDP per capita for the continents:
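The two boxplots could be produced along these lines; this is a sketch with ggplot2, and the object names `box_lifeExp` and `box_gdp` are illustrative:

```r
library(gapminder)
library(ggplot2)

# Life expectancy by continent
box_lifeExp <- ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life expectancy (years)")

# GDP per capita by continent
box_gdp <- ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  labs(x = "Continent", y = "GDP per capita")

box_lifeExp
box_gdp
```

Note that the line inside each box is the median, not the mean.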
We can see that life expectancy is on average lowest in Africa and highest in Oceania.
We can see that GDP per capita is on average lowest in Africa and highest in Oceania.
As a next step, I want to prepare a new data set that includes only the variables of interest: life expectancy and GDP per capita, together with continent. After that, I want to look specifically at Africa. I will first create vectors for lifeExp, gdpPercap and continent and then combine them into a tibble:
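A sketch of this step is given below. The tibble name `tibble.variables.of.interest` is an assumption, inferred from the name `tibble.variables.of.interest.africa` that appears in the test output later in the report:

```r
library(gapminder)
library(tibble)

# Extract the three columns as vectors
lifeExp   <- gapminder$lifeExp
gdpPercap <- gapminder$gdpPercap
continent <- gapminder$continent

# Combine them into a tibble; column names are taken from the vector names
tibble.variables.of.interest <- tibble(lifeExp, gdpPercap, continent)
tibble.variables.of.interest
```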
## # A tibble: 1,704 × 3
## lifeExp gdpPercap continent
## <dbl> <dbl> <fct>
## 1 28.8 779. Asia
## 2 30.3 821. Asia
## 3 32.0 853. Asia
## 4 34.0 836. Asia
## 5 36.1 740. Asia
## 6 38.4 786. Asia
## 7 39.9 978. Asia
## 8 40.8 852. Asia
## 9 41.7 649. Asia
## 10 41.8 635. Asia
## # … with 1,694 more rows
Here is a summary of the tibble:
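The summary could be obtained with base R's `summary()`. In this sketch the three columns are selected directly from gapminder, and `vars` and `continent_counts` are illustrative names:

```r
library(gapminder)

# Keep only the three variables of interest
vars <- gapminder[, c("lifeExp", "gdpPercap", "continent")]

# Five-number summary plus mean for numeric columns, counts for the factor
summary(vars)

# Number of observations per continent
continent_counts <- table(vars$continent)
continent_counts
```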
## lifeExp gdpPercap continent
## Min. :23.60 Min. : 241.2 Africa :624
## 1st Qu.:48.20 1st Qu.: 1202.1 Americas:300
## Median :60.71 Median : 3531.8 Asia :396
## Mean :59.47 Mean : 7215.3 Europe :360
## 3rd Qu.:70.85 3rd Qu.: 9325.5 Oceania : 24
## Max. :82.60 Max. :113523.1
I want to look at life expectancy and GDP per capita specifically in
Africa so I use the pipe operator and filter() function for
that and get the following table:
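The filtering step might look like this with dplyr. The object name matches the one shown in the test output further down, while the use of `select()` to pick the columns is an assumption:

```r
library(gapminder)
library(dplyr)

# Keep the variables of interest, then restrict to African countries
tibble.variables.of.interest.africa <- gapminder %>%
  select(lifeExp, gdpPercap, continent) %>%
  filter(continent == "Africa")

tibble.variables.of.interest.africa
```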
## # A tibble: 624 × 3
## lifeExp gdpPercap continent
## <dbl> <dbl> <fct>
## 1 43.1 2449. Africa
## 2 45.7 3014. Africa
## 3 48.3 2551. Africa
## 4 51.4 3247. Africa
## 5 54.5 4183. Africa
## 6 58.0 4910. Africa
## 7 61.4 5745. Africa
## 8 65.8 5681. Africa
## 9 67.7 5023. Africa
## 10 69.2 4797. Africa
## # … with 614 more rows
Let’s look at a scatterplot (made with ggplot2) of life expectancy against GDP per capita in Africa:
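A sketch of such a plot follows. The styling and the name `scatter_africa` are assumptions, since the original plotting code is not shown:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# African observations only
africa <- gapminder %>% filter(continent == "Africa")

scatter_africa <- ggplot(africa, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  labs(x = "GDP per capita", y = "Life expectancy (years)",
       title = "Africa, 1952 to 2007")

scatter_africa
```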
I first planned to use Pearson’s r as the correlation coefficient, but I already found that GDP per capita is not normally distributed, and the significance test for Pearson’s r assumes approximately normally distributed data. The plot above also shows outliers, which might be a problem as well. Let’s nevertheless go through the assumptions for Pearson’s r:
Level of measurement: Pearson’s r requires continuous data at interval/ratio level of measurement. lifeExp and gdpPercap both have a ratio level of measurement.
Linear relationship: For Pearson’s r, the co-variation should be linear, i.e. the variables should have a linear relationship. We can see in the above scatterplot that this is the case.
No outliers: Pearson’s r is sensitive to outliers, so there should be none. The scatterplot shows that there are in fact some.
Variables should be normally/near-to-normally distributed: Life expectancy in Africa is normally distributed, GDP per capita is not. One option would be to transform the data but I am not yet familiar with that.
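The report does not show how normality was assessed. One common approach, sketched here as an assumption rather than as the method actually used, combines histograms and Q-Q plots with the Shapiro-Wilk test from base R:

```r
library(gapminder)

# African observations only
africa <- gapminder[gapminder$continent == "Africa", ]

# Visual checks: histogram and normal Q-Q plot for each variable
hist(africa$lifeExp, main = "Life expectancy (Africa)", xlab = "Years")
hist(africa$gdpPercap, main = "GDP per capita (Africa)", xlab = "GDP per capita")
qqnorm(africa$lifeExp); qqline(africa$lifeExp)
qqnorm(africa$gdpPercap); qqline(africa$gdpPercap)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro_lifeExp <- shapiro.test(africa$lifeExp)
shapiro_gdp     <- shapiro.test(africa$gdpPercap)

shapiro_lifeExp
shapiro_gdp
```

With several hundred observations, the Shapiro-Wilk test can flag even small departures from normality, so the plots are worth reading alongside the p-values.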
Due to the violation of assumptions for Pearson’s r, it could make more sense to use Spearman’s rho as a correlation coefficient in this case. Let’s look at the assumptions:
Level of measurement: Spearman’s rho requires an ordinal, interval or ratio scale of measurement for the variables. Both lifeExp and gdpPercap have a ratio level of measurement.
The two variables represent paired observations: This assumption is met.
Monotonic relationship between the two variables: Assumption is met as we can see in the above scatterplot.
First, I let R calculate the correlation coefficient. As explained above, I decided to calculate Spearman’s rho.
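A sketch of the calculation using base R's `cor()` with `method = "spearman"`. The object name `africa` is illustrative; the report itself uses `tibble.variables.of.interest.africa`:

```r
library(gapminder)

# African observations only
africa <- gapminder[gapminder$continent == "Africa", ]

# Spearman's rank correlation coefficient
rho <- cor(africa$lifeExp, africa$gdpPercap, method = "spearman")
rho
```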
## [1] 0.4893888
Nevertheless, I also want to look at Pearson’s r:
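The same call with `method = "pearson"` (the default of `cor()`) gives Pearson’s r; again, `africa` is only an illustrative name:

```r
library(gapminder)

# African observations only
africa <- gapminder[gapminder$continent == "Africa", ]

# Pearson's product-moment correlation coefficient
r <- cor(africa$lifeExp, africa$gdpPercap, method = "pearson")
r
```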
## [1] 0.4256076
There is not much of a difference between Spearman’s rho and Pearson’s r in this case. I consider Spearman’s rho more appropriate here because of the violated assumptions for Pearson’s r.
Both correlation coefficients indicate a moderate positive correlation between life expectancy and GDP per capita for the data included in this data set (the sample), i.e. when one of the variables goes up or down, the other tends to do the same. We can also see the positive correlation in the above graph. That does not yet mean it is significant, i.e. we do not yet know whether we can draw conclusions about the underlying population.
Null hypothesis: There is no relationship between lifeExp and gdpPercap in Africa.
Alternative hypothesis: There is a relationship between lifeExp and gdpPercap in Africa.
We are looking at the data set gapminder, which includes
data from the gapminder.org website. Specifically, we look at the countries
in Africa (52 countries) included in the data set. Life expectancy, GDP
per capita and population for these countries were recorded at 12 time points
(every five years, from 1952 to 2007).
I will test for significance at the 5 % level.
First, I will calculate Spearman’s rho:
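The test could be run with `cor.test()` from base R, as sketched below. Passing `exact = FALSE` is an assumption added here: it requests the asymptotic p-value directly and avoids the warning R otherwise gives when the data contain ties:

```r
library(gapminder)
library(dplyr)

tibble.variables.of.interest.africa <- gapminder %>%
  select(lifeExp, gdpPercap, continent) %>%
  filter(continent == "Africa")

# Two-sided test of H0: rho = 0
spearman_test <- cor.test(
  tibble.variables.of.interest.africa$lifeExp,
  tibble.variables.of.interest.africa$gdpPercap,
  method = "spearman",
  exact = FALSE
)
spearman_test
```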
##
## Spearman's rank correlation rho
##
## data: tibble.variables.of.interest.africa$lifeExp and tibble.variables.of.interest.africa$gdpPercap
## S = 20677201, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4893888
And also, just to check, I will calculate Pearson’s r:
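For Pearson’s r, the same `cor.test()` call with `method = "pearson"` additionally reports a t statistic, degrees of freedom and a 95 % confidence interval. A sketch:

```r
library(gapminder)
library(dplyr)

tibble.variables.of.interest.africa <- gapminder %>%
  select(lifeExp, gdpPercap, continent) %>%
  filter(continent == "Africa")

# Two-sided test of H0: true correlation = 0, with 95 % confidence interval
pearson_test <- cor.test(
  tibble.variables.of.interest.africa$lifeExp,
  tibble.variables.of.interest.africa$gdpPercap,
  method = "pearson",
  conf.level = 0.95
)
pearson_test
```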
##
## Pearson's product-moment correlation
##
## data: tibble.variables.of.interest.africa$lifeExp and tibble.variables.of.interest.africa$gdpPercap
## t = 11.73, df = 622, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3591151 0.4878012
## sample estimates:
## cor
## 0.4256076
We can see that the p-value is really small (p-value < 2.2e-16), both in case of Spearman’s rho and Pearson’s r. It is much lower than 0.05. We can therefore reject the null hypothesis (“There is no relationship between lifeExp and gdpPercap in Africa.”). The alternative hypothesis (“There is a relationship between lifeExp and gdpPercap in Africa.”) is supported.
The correlation coefficient and significance test indicate that there is a significant (p < 2.2e-16) moderate positive correlation between life expectancy and GDP per capita in Africa. Since the correlation is significant, we can generalise from the sample to the population. In this case, the population is the countries of Africa.
It is important to note that correlation does not mean causation. We cannot say that higher life expectancy is due to a higher GDP per capita, for example. The correlation only tells us that these variables are related, not whether one causes changes in the other or whether other variables are involved that lead to this correlation.