Introduction

We will look at the gapminder dataset. According to the package documentation, the gapminder package provides an excerpt of data from the Gapminder project, which combines country-level data from multiple sources into a single coherent time series.

Let’s take a look at the gapminder tibble:

## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

The tibble contains 1,704 rows and 6 columns: country, continent, year, lifeExp (life expectancy at birth), pop (total population), and gdpPercap (per-capita GDP). This matches the documentation.
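The overview above looks like the output of dplyr's `glimpse()`. The original code is not shown, so the following is only a minimal sketch of how it could be reproduced, assuming the gapminder and dplyr packages are installed:

```r
library(gapminder)  # provides the `gapminder` tibble
library(dplyr)      # provides glimpse()

# Print one line per column: name, type, and the first few values
glimpse(gapminder)

# Row and column counts, for comparison with the documentation
dim(gapminder)
```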

The number of countries is

## [1] 142

The data cover 5 continents:

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

The year column is interesting. Each country has data for 12 years, but the years are not consecutive. Instead, observations are spaced five years apart:

##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

The remaining columns are numeric, with the following summary statistics:

##     lifeExp           pop              gdpPercap       
##  Min.   :23.60   Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:48.20   1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :60.71   Median :7.024e+06   Median :  3531.8  
##  Mean   :59.47   Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:70.85   3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :82.60   Max.   :1.319e+09   Max.   :113523.1
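The counts and summaries in this section could be obtained along the following lines. This is a sketch under the assumption that dplyr is used; the author's actual calls are not shown and may differ:

```r
library(gapminder)
library(dplyr)

# Number of distinct countries
n_distinct(gapminder$country)

# Continent names (continent is a factor, so its levels list them)
levels(gapminder$continent)

# Years in which observations were recorded
sort(unique(gapminder$year))

# Number of observations per country (each country should have the same count)
obs_per_country <- gapminder %>% count(country)
unique(obs_per_country$n)

# Summary statistics for the three numeric measures
summary(select(gapminder, lifeExp, pop, gdpPercap))
```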

Questions for EDA

First, I want to see the overall life expectancy and GDP per capita for each country, and whether they differ between continents.

Second, I’m curious about the relationship between life expectancy and GDP per capita. My assumption is that the higher the GDP per capita, the higher the life expectancy, because I think a stronger economy can improve technology and the healthcare system, which helps people achieve longer lives.

Data cleaning and transformation

The data is already clean: there are no NA values that we need to remove. However, I want to:

  1. Remove the population column, which I’m not interested in.
  2. Remove the year column and aggregate the data per country, using the average lifeExp and gdpPercap.

First, I create a new tibble with the average lifeExp and gdpPercap for each country:

## # A tibble: 142 × 3
##    country     avgLifeExp avgGdpPercap
##    <fct>            <dbl>        <dbl>
##  1 Afghanistan       37.5         803.
##  2 Albania           68.4        3255.
##  3 Algeria           59.0        4426.
##  4 Angola            37.9        3607.
##  5 Argentina         69.1        8956.
##  6 Australia         74.7       19981.
##  7 Austria           73.1       20412.
##  8 Bahrain           65.6       18078.
##  9 Bangladesh        49.8         818.
## 10 Belgium           73.6       19901.
## # … with 132 more rows

Then I create a table of country and continent to join with the previous table. Here is our final tibble:

## # A tibble: 142 × 4
##    country     continent avgLifeExp avgGdpPercap
##    <fct>       <fct>          <dbl>        <dbl>
##  1 Afghanistan Asia            37.5         803.
##  2 Albania     Europe          68.4        3255.
##  3 Algeria     Africa          59.0        4426.
##  4 Angola      Africa          37.9        3607.
##  5 Argentina   Americas        69.1        8956.
##  6 Australia   Oceania         74.7       19981.
##  7 Austria     Europe          73.1       20412.
##  8 Bahrain     Asia            65.6       18078.
##  9 Bangladesh  Asia            49.8         818.
## 10 Belgium     Europe          73.6       19901.
## # … with 132 more rows
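The two steps described above (average per country, then join with a country-to-continent lookup) can be sketched as follows. The column names `avgLifeExp` and `avgGdpPercap` and the name `data` appear in the output shown in this report; `country_avg` and `country_continent` are hypothetical names chosen for illustration:

```r
library(gapminder)
library(dplyr)

# Step 1: average life expectancy and GDP per capita per country, across all years
country_avg <- gapminder %>%
  group_by(country) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap))

# Step 2: lookup table with one row per country and its continent
country_continent <- gapminder %>%
  distinct(country, continent)

# Step 3: join the lookup table with the averages
data <- country_continent %>%
  inner_join(country_avg, by = "country")

data
```

Grouping by both `country` and `continent` in a single `summarise()` call would give the same table without a join; the two-step version is shown only because it follows the description in the text.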

Let’s get an overview of life expectancy and GDP per capita for each country.

It looks like we have one clear outlier, with a far higher GDP per capita than any other country. Here are the three countries with the highest average GDP per capita:

## # A tibble: 3 × 4
##   country     continent avgLifeExp avgGdpPercap
##   <fct>       <fct>          <dbl>        <dbl>
## 1 Kuwait      Asia            68.9       65333.
## 2 Switzerland Europe          75.6       27074.
## 3 Norway      Europe          75.8       26747.

Kuwait has the highest average GDP per capita, about 2.4 times that of Switzerland, the second highest. So I decided to remove it from the data as an outlier.
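A possible way to list the top three countries and drop the outlier is sketched below. It rebuilds the per-country averages so that it runs on its own; `top3` and `data_no_outlier` are hypothetical names, and the author's actual code may differ:

```r
library(gapminder)
library(dplyr)

# Per-country averages (assumed to mirror the final tibble shown earlier)
data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop")

# Three countries with the highest average GDP per capita
top3 <- data %>%
  arrange(desc(avgGdpPercap)) %>%
  head(3)
print(top3)

# Ratio of the highest to the second-highest value
top3$avgGdpPercap[1] / top3$avgGdpPercap[2]

# Remove Kuwait as an outlier
data_no_outlier <- data %>%
  filter(country != "Kuwait")
```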

Correlation analysis

Looking at the scatter plot, the data points from each continent cluster together. Let’s draw box plots of average life expectancy and GDP per capita for each continent.

We can see continent-level characteristics in both average life expectancy and GDP per capita. Africa has relatively low values for both, in contrast to Europe and Oceania, where both are relatively high.

From these observations, there should be some level of correlation between average life expectancy and GDP per capita. I will use Spearman’s rho to analyse the correlation, because the relationship looks exponential rather than linear and Spearman’s rho measures monotonic rather than strictly linear association. The result is:

## [1] 0.8765658

Spearman’s rho indicates a very strong positive correlation between average life expectancy and GDP per capita. Let’s plot the fitted exponential curve on the dataset.
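To illustrate why a rank-based coefficient suits a curved relationship, here is a small sketch in base R. The numbers are made-up toy values, not gapminder data: `y` grows exponentially with `x`, so the relationship is perfectly monotonic but not linear.

```r
# Hypothetical toy data (not from gapminder): y is an exponential function of x
x <- 1:10
y <- exp(0.5 * x)

# Pearson's r measures linear association, so it falls short of 1 here
pearson <- cor(x, y, method = "pearson")

# Spearman's rho only uses the ordering of the values
spearman <- cor(x, y, method = "spearman")

# Equivalent definition: Pearson's r computed on the ranks
rho_by_ranks <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, rho_by_ranks = rho_by_ranks)
```

Because Spearman’s rho depends only on ranks, any strictly increasing transformation of either variable leaves it unchanged, which is why it does not penalise the curvature.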

Hypothesis testing

  • Null hypothesis: There is no relationship between average life expectancy and GDP per capita
  • Alternative hypothesis: There is a relationship between average life expectancy and GDP per capita

Our dataset contains average life expectancy and GDP per capita for 141 countries: the 142 described above, minus Kuwait, which was removed as an outlier.

Both variables are continuous but are not normally distributed, as the histograms show, so Spearman’s rho is more appropriate for testing the correlation than Pearson’s r.

I set my significance level to 5%. The Spearman’s rho and p-value from the correlation test are:

## 
##  Spearman's rank correlation rho
## 
## data:  data$avgLifeExp and data$avgGdpPercap
## S = 57666, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8765658

Spearman’s rho indicates a very strong positive correlation between average life expectancy and GDP per capita. The p-value is below 2.2e-16, far under the 0.05 significance level, which means we can reject the null hypothesis.
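The test output above has the form produced by `cor.test()` from base R's stats package. A minimal sketch of the call follows. It rebuilds the data so that it runs on its own, and it assumes Kuwait was excluded before testing; `alpha` and `result` are hypothetical names:

```r
library(gapminder)
library(dplyr)

# Per-country averages, with the outlier removed (assumption: Kuwait excluded)
data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop") %>%
  filter(country != "Kuwait")

# Two-sided Spearman rank correlation test (H0: rho = 0)
result <- cor.test(data$avgLifeExp, data$avgGdpPercap, method = "spearman")
print(result)

# Decision at the 5% significance level
alpha <- 0.05
result$p.value < alpha  # TRUE means the null hypothesis is rejected
```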

Data interpretation and conclusions

First, each continent has its own characteristics in both average life expectancy and GDP per capita, as we can see from the loose clusters in the scatter plot and the distinct interquartile ranges in the box plots. Africa has relatively low values for both average life expectancy and GDP per capita, in contrast to Europe and Oceania, where both values are relatively high.

Second, Spearman’s rho indicates a strong positive correlation between average life expectancy and GDP per capita, one that is not linear but closer to exponential. The p-value is below the selected significance level, which allows us to infer that there is a relationship between average life expectancy and GDP per capita.