We will look at the gapminder dataset. According to the package documentation, the gapminder package provides an example dataset taken from the Gapminder project, which combines country data from multiple sources into unique, coherent time series.
Let’s take a look at the gapminder tibble:
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
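The code chunk behind this overview is not shown in the report. A minimal sketch that should produce a similar listing, assuming the `gapminder` and `dplyr` packages are installed (the use of `glimpse()` is an assumption based on the output format):

```r
library(gapminder)  # provides the `gapminder` tibble
library(dplyr)      # provides glimpse()

# Print one line per column: name, type, and the first few values
glimpse(gapminder)
```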
The tibble contains 1,704 rows and 6 columns: country, continent, year, lifeExp (life expectancy at birth), pop (total population), and gdpPercap (per-capita GDP), which matches the documentation.
The number of countries is:
## [1] 142
The data cover 5 continents, as follows:
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
The year column is interesting. Each country has data for 12 years, but the years are not consecutive. Instead, the data are recorded every five years:
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
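The three counts above (countries, continents, years) can be checked with a few base R calls. This is a sketch, not necessarily the code used for the report; the variable names are chosen here for illustration.

```r
library(gapminder)

# Number of distinct countries
n_countries <- length(unique(gapminder$country))

# Continent names (continent is a factor, so its levels list them)
continents <- levels(gapminder$continent)

# Distinct years, sorted
years <- sort(unique(gapminder$year))

# Number of rows (one per year) for each country
years_per_country <- table(gapminder$country)

print(n_countries)
print(continents)
print(years)
```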
The remaining columns are numeric, with the following summary statistics:
## lifeExp pop gdpPercap
## Min. :23.60 Min. :6.001e+04 Min. : 241.2
## 1st Qu.:48.20 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :60.71 Median :7.024e+06 Median : 3531.8
## Mean :59.47 Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:70.85 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :82.60 Max. :1.319e+09 Max. :113523.1
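A summary table like the one above can be obtained with base R's `summary()` on the three numeric columns. The column selection below is an assumption about how the report was produced.

```r
library(gapminder)

# Five-number summary plus mean for each numeric column
numeric_summary <- summary(gapminder[, c("lifeExp", "pop", "gdpPercap")])
print(numeric_summary)
```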
First, I want to see the overall life expectancy and GDP per capita for each country, and whether they differ between continents.
Second, I’m curious about the relationship between life expectancy and GDP per capita. My assumption is that the higher the GDP per capita, the higher the life expectancy, because I think a stronger economy can improve technology and the healthcare system, which helps people achieve longer lives.
The data is already clean; there are no NA cells that we need to remove. However, I want to drop the population column, which I’m not interested in, and to collapse year by grouping the data by country with the average lifeExp and gdpPercap.
First, I create a new tibble with the average lifeExp and gdpPercap for each country:
## # A tibble: 142 × 3
## country avgLifeExp avgGdpPercap
## <fct> <dbl> <dbl>
## 1 Afghanistan 37.5 803.
## 2 Albania 68.4 3255.
## 3 Algeria 59.0 4426.
## 4 Angola 37.9 3607.
## 5 Argentina 69.1 8956.
## 6 Australia 74.7 19981.
## 7 Austria 73.1 20412.
## 8 Bahrain 65.6 18078.
## 9 Bangladesh 49.8 818.
## 10 Belgium 73.6 19901.
## # … with 132 more rows
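The aggregation step described here can be sketched with `dplyr`. The column names `avgLifeExp` and `avgGdpPercap` come from the output above; the object name `country_avg` and the NA check are assumptions added for illustration.

```r
library(gapminder)
library(dplyr)

# Confirm there are no missing values before aggregating
n_missing <- sum(is.na(gapminder))

# One row per country: mean life expectancy and mean GDP per capita
# over all years. pop and year are dropped because they are not summarised.
country_avg <- gapminder %>%
  group_by(country) %>%
  summarise(
    avgLifeExp   = mean(lifeExp),
    avgGdpPercap = mean(gdpPercap)
  )

print(country_avg)
```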
Then I create a table of country and continent to join with the previous table. Here is our final tibble:
## # A tibble: 142 × 4
## country continent avgLifeExp avgGdpPercap
## <fct> <fct> <dbl> <dbl>
## 1 Afghanistan Asia 37.5 803.
## 2 Albania Europe 68.4 3255.
## 3 Algeria Africa 59.0 4426.
## 4 Angola Africa 37.9 3607.
## 5 Argentina Americas 69.1 8956.
## 6 Australia Oceania 74.7 19981.
## 7 Austria Europe 73.1 20412.
## 8 Bahrain Asia 65.6 18078.
## 9 Bangladesh Asia 49.8 818.
## 10 Belgium Europe 73.6 19901.
## # … with 132 more rows
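The join can be sketched as follows. The name `data` matches the object name that appears in the correlation test output later in the report; `country_avg` and `country_continent` are names assumed here, and `inner_join()` is one reasonable choice since every country appears in both tables.

```r
library(gapminder)
library(dplyr)

# Per-country averages (same aggregation as described in the text)
country_avg <- gapminder %>%
  group_by(country) %>%
  summarise(avgLifeExp = mean(lifeExp), avgGdpPercap = mean(gdpPercap))

# Lookup table: one row per country with its continent
country_continent <- gapminder %>%
  distinct(country, continent)

# Attach the continent to each country's averages
data <- country_continent %>%
  inner_join(country_avg, by = "country")

print(data)
```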
Let’s take an overview of life expectancy and GDP per capita for each country.
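The plot itself is not reproduced in this text. A scatter plot of this kind could be drawn with `ggplot2` roughly as below; the axis assignment, colouring by continent, and labels are assumptions. For brevity the sketch builds the per-country table in a single grouped step, which gives the same rows as the join described above because each country belongs to exactly one continent.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Per-country averages with continent attached
data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop")

# One point per country, coloured by continent (assumed layout)
p <- ggplot(data, aes(x = avgLifeExp, y = avgGdpPercap, colour = continent)) +
  geom_point() +
  labs(x = "Average life expectancy (years)",
       y = "Average GDP per capita")

print(p)
```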
It looks like we have one clear outlier with a far higher GDP per capita than the rest.
## # A tibble: 3 × 4
## country continent avgLifeExp avgGdpPercap
## <fct> <fct> <dbl> <dbl>
## 1 Kuwait Asia 68.9 65333.
## 2 Switzerland Europe 75.6 27074.
## 3 Norway Europe 75.8 26747.
Kuwait has the highest average GDP per capita, about 2.4 times that of Switzerland, the second highest. So I decided to remove it from the data as an outlier.
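Listing the top three countries and dropping the outlier can be sketched like this. The object name `top3` is an assumption, and the per-country table is rebuilt in one grouped step so the snippet runs on its own.

```r
library(gapminder)
library(dplyr)

data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop")

# Three countries with the highest average GDP per capita
top3 <- data %>%
  arrange(desc(avgGdpPercap)) %>%
  head(3)
print(top3)

# Remove Kuwait, treated as an outlier in the text
data <- data %>%
  filter(country != "Kuwait")
```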
Looking at the scatter plot, data points from each continent are grouped together. Let’s draw box plots of average life expectancy and GDP per capita for each continent.
We can see some continent-level characteristics in both average life expectancy and GDP per capita. Africa is relatively low on both, in contrast to Europe and Oceania, which are relatively high on both.
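The box plots are not reproduced in this text. A sketch of how they could be drawn with `ggplot2`, assuming one plot per variable and Kuwait already removed (plot names and labels are illustrative):

```r
library(gapminder)
library(dplyr)
library(ggplot2)

data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop") %>%
  filter(country != "Kuwait")  # outlier removed, as in the text

# Distribution of average life expectancy within each continent
p_life <- ggplot(data, aes(x = continent, y = avgLifeExp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Average life expectancy (years)")

# Distribution of average GDP per capita within each continent
p_gdp <- ggplot(data, aes(x = continent, y = avgGdpPercap)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Average GDP per capita")

print(p_life)
print(p_gdp)
```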
From these observations, there should be some correlation between average life expectancy and GDP per capita. I will use Spearman’s rho to analyse the correlation because the relationship looks exponential rather than linear. The result is:
## [1] 0.8765658
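Spearman's rho is a rank-based correlation, so it only requires the relationship to be monotonic, not linear. In base R it is available through `cor()` with `method = "spearman"`. A sketch, assuming the coefficient is computed on the per-country averages with Kuwait removed:

```r
library(gapminder)
library(dplyr)

data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop") %>%
  filter(country != "Kuwait")

# Rank correlation between the two per-country averages
rho <- cor(data$avgLifeExp, data$avgGdpPercap, method = "spearman")
print(rho)
```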
Spearman’s rho indicates a very strong positive correlation between average life expectancy and GDP per capita. Let’s plot an exponential fitted curve on the data.
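The report does not show how the curve was fitted. One common way to fit an exponential curve, used here purely as an assumption, is a linear regression on the log scale: fitting `log(avgGdpPercap) ~ avgLifeExp` corresponds to a model of the form GDP = a · exp(b · lifeExp). The axis assignment and object names below are also illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop") %>%
  filter(country != "Kuwait")

# Linear fit on the log scale (assumed method, not confirmed by the report)
fit <- lm(log(avgGdpPercap) ~ avgLifeExp, data = data)

# Evaluate the fitted curve on a grid and transform back from the log scale
curve_df <- data.frame(
  avgLifeExp = seq(min(data$avgLifeExp), max(data$avgLifeExp), length.out = 200)
)
curve_df$avgGdpPercap <- exp(predict(fit, newdata = curve_df))

p <- ggplot(data, aes(x = avgLifeExp, y = avgGdpPercap)) +
  geom_point(aes(colour = continent)) +
  geom_line(data = curve_df) +
  labs(x = "Average life expectancy (years)",
       y = "Average GDP per capita")

print(p)
```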
Our dataset contains average life expectancy and GDP per capita for 141 countries (the 142 described above, minus Kuwait, which was removed as an outlier).
Both variables are continuous but are not normally distributed, as shown by the histograms below, so Spearman’s rho is more appropriate than Pearson’s r for testing the correlation.
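The histograms are not reproduced in this text. A sketch of how the two distributions could be inspected with `ggplot2`; the bin count of 20 is an arbitrary choice made here.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop") %>%
  filter(country != "Kuwait")

# Histogram of average life expectancy across countries
h_life <- ggplot(data, aes(x = avgLifeExp)) +
  geom_histogram(bins = 20) +
  labs(x = "Average life expectancy (years)", y = "Number of countries")

# Histogram of average GDP per capita across countries
h_gdp <- ggplot(data, aes(x = avgGdpPercap)) +
  geom_histogram(bins = 20) +
  labs(x = "Average GDP per capita", y = "Number of countries")

print(h_life)
print(h_gdp)
```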
I set my significance level to 5%. The Spearman’s rho and p-value from the correlation test are:
##
## Spearman's rank correlation rho
##
## data: data$avgLifeExp and data$avgGdpPercap
## S = 57666, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8765658
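The `data:` line in the output suggests the test was run with base R's `cor.test()` on the two columns of `data`. A sketch under that assumption, with the significance level stored in a variable named `alpha` for illustration:

```r
library(gapminder)
library(dplyr)

data <- gapminder %>%
  group_by(country, continent) %>%
  summarise(avgLifeExp = mean(lifeExp),
            avgGdpPercap = mean(gdpPercap),
            .groups = "drop") %>%
  filter(country != "Kuwait")

alpha <- 0.05  # significance level chosen in the text

# Two-sided test of H0: rho = 0 using Spearman's rank correlation
result <- cor.test(data$avgLifeExp, data$avgGdpPercap, method = "spearman")
print(result)

# TRUE means the null hypothesis is rejected at the chosen level
reject_null <- result$p.value < alpha
print(reject_null)
```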
Spearman’s rho indicates a very strong positive correlation between average life expectancy and GDP per capita. The p-value is below 2.2e-16, far lower than 0.05, which means we can reject the null hypothesis that rho equals 0.
First, each continent has its own characteristics in both average life expectancy and GDP per capita, as we can see from the loose grouping in the scatter plot and the distinct interquartile ranges in the box plots. Africa is relatively low on both average life expectancy and GDP per capita, in contrast to Europe and Oceania, which are relatively high on both.
Second, Spearman’s rho indicates a strong positive correlation between average life expectancy and GDP per capita; the relationship is not linear but looks closer to exponential. The p-value is below the selected significance level, allowing us to infer that there is a relationship between average life expectancy and GDP per capita.