Download the gapminder dataset into
R Markdown session.
library(gapminder)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(tibble)
View(gapminder)
I installed new packages as they would play a role in visualising the P-value and r in the plot.
install.packages('devtools', repos = "smin95/smplot.io")
## Warning: unable to access index for repository smin95/smplot.io/src/contrib:
## cannot open URL 'smin95/smplot.io/src/contrib/PACKAGES'
## Warning: package 'devtools' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Warning: unable to access index for repository smin95/smplot.io/bin/macosx/contrib/4.1:
## cannot open URL 'smin95/smplot.io/bin/macosx/contrib/4.1/PACKAGES'
devtools::install_github('smin95/smplot', force = TRUE)
## Downloading GitHub repo smin95/smplot@HEAD
##
##
checking for file ‘/private/var/folders/v5/z0kd8v2n2lxd9dcj7n9wddm80000gn/T/Rtmpm32z4X/remotes510a59f80e8c/smin95-smplot2-f207d79/DESCRIPTION’ ...
✔ checking for file ‘/private/var/folders/v5/z0kd8v2n2lxd9dcj7n9wddm80000gn/T/Rtmpm32z4X/remotes510a59f80e8c/smin95-smplot2-f207d79/DESCRIPTION’
##
─ preparing ‘smplot2’:
##
checking DESCRIPTION meta-information ...
✔ checking DESCRIPTION meta-information
##
─ checking for LF line-endings in source and make files and shell scripts
##
─ checking for empty or unneeded directories
##
Omitted ‘LazyData’ from DESCRIPTION
##
─ building ‘smplot2_0.1.0.tar.gz’
##
##
library(smplot2)
## Updated tutorial for smplot: smin95.github.io/dataviz/
library(cowplot)
dlyr package and
tidyr when necessary).2.1 Explore two variables and how they are associated with each other (correlation analysis). Include the description of assumptions for correlation analysis, concluding with the type of analysis you choose. Create necessary visualisations to support your analysis (a histogram, a scatter plot, etc) and include your interpretation of the graphs.
3.1 Visualize your data using a scatter plot and include the description of assumptions for correlation analysis:
Is the co-variation linear?
Are the data from each of the 2 variables (x, y) follow a normal distribution (visual inspection of the data normality using histograms)?
Questions for EDA
Description of data cleaning and transformation
Description of correlation analysis (steps for visualisation, checking assumption for correlation analysis, performing correlation analysis)
Data interpretation and conclusions
This data set includes 6 variables: two factorial variables (country and continent), two integer variables (year and population), and two numeric variables (life expectancy and GDP). Using the function sapply, I found there is no value unavailable in this data set. This data set contains data of 142 countries in five continents: Africa, Americas (including NA and SA), Asia, Europe, and Oceania, within 12 years from 1952 to 2007, with an interval of 5 years.
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
#using structure to view the structure if the data set.
if_na <- gapminder %>% sapply(anyNA) %>% data.frame()
#sapply is a variatio of apply function. This function treats list, vector, and data frame as an input but give outout in a vector or a matrix. Here, the function anyNA is passed as the argument inside the sapply function.
#anyNA is a function intending to check if an object contains any NA or NA vakues, more quickly than the function any(is.na())
if_na
## .
## country FALSE
## continent FALSE
## year FALSE
## lifeExp FALSE
## pop FALSE
## gdpPercap FALSE
str(if_na)
## 'data.frame': 6 obs. of 1 variable:
## $ .: logi FALSE FALSE FALSE FALSE FALSE FALSE
#there is only one vector of six observations of variables
colnames(if_na) <- c("NA")
#change the name of colnumn
if_na
## NA
## country FALSE
## continent FALSE
## year FALSE
## lifeExp FALSE
## pop FALSE
## gdpPercap FALSE
final_if_na <- as.data.frame(t(if_na))
final_if_na
## country continent year lifeExp pop gdpPercap
## NA FALSE FALSE FALSE FALSE FALSE FALSE
#We could use a function composition as.data.frame(t()) to change columns into rows and change rows into columns
#t() returns the transpose of the data frame or matrix
?unique
## Help on topic 'unique' was found in the following packages:
##
## Package Library
## base /Library/Frameworks/R.framework/Resources/library
## data.table /Library/Frameworks/R.framework/Versions/4.1/Resources/library
##
##
## Using the first match ...
unique(gapminder$continent)
## [1] Asia Europe Africa Americas Oceania
## Levels: Africa Americas Asia Europe Oceania
unique(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
# the unique function returns a vector, data frame or an array with duplicate elements /rows removed. We can use it to check how many levels a vector of factor has and how many unique values a numeric vector has.
# Another way to find what continents this data set includes is to rely on geographical knowledge. The data.frame(gapminder) shows some countries. Since there is no countries in Antarctic, if we find a South American country has a value of America instead of South America, we can conclude this data set only concludes 5 continents.
General assumption: I hypothesis that there be a strong positive linear relationship between the GDP per capital and life expectancy whether in low-income countries (gdpPercap below 2000) or high-income countries (gdpPercap between 9000 and 30000) in 1977 when the resource crisis reached the peak. In order to carry out the hypothesis testing for Pearson’s correlation, this test has to meet at least three critical assumption:
Hence in the first stage, let’s explore the central tendency of the filtered data set. Some qustions emerge:
Q1, What is the mean of life expectancy and GDP per capital in high-income countries and low-income countries in 1977 ? How does the data distribute?
Q2, Is there any outlier in the data set? Does the data form a linear correlation?
Using filter function to filter all the rows of high-income countries that meet out standard. Using select function to keep variables this EDA focuses: country, year, gdpPercap, lifeExp.
highI_Coun <- gapminder %>%
filter(year %in% c(1977), between(gdpPercap, 9000, 30000)) %>%
select(country, year, gdpPercap, lifeExp) %>%
na.omit()
Using geom_point and geom_boxplot, we could visualise the data distribution of GDP per capital and life expectancy in high-income countries in 1977.
ggplot(highI_Coun, aes(gdpPercap, lifeExp)) +
geom_point()
ggplot(highI_Coun, aes(year, gdpPercap)) +
geom_boxplot()
ggplot(highI_Coun, aes(year, lifeExp)) +
geom_boxplot()
The point plot show that basically two variables form a linear
correlation. Since there are outliers in life expectancy, we need to
filter the dataset again.
After filtering the data, we use the histogram to check if the data of GDP per capital and the data of life expectancy follow the normal distribution. I set the bin (interval) value as 5.
highI_Coun_update <- highI_Coun %>%
filter(lifeExp >= 62)
ggplot(highI_Coun_update, aes(gdpPercap)) +
geom_histogram() +
stat_bin(bins = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(highI_Coun_update, aes(lifeExp)) +
geom_histogram() +
stat_bin(bins = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Two histograms display that both GDP per capital and life expectancy are
generally follow the normal distribution despite a certain degree of
right or left-skewness.
Then, we prepare the sample data using the function sample_n. The function sample_n is used for randomly select observations from a data frame. If we are unsatisfied with the sample of data frame, we could repeat performing until we get the most reliable sample. Since there are 42 observations in the data frame highI_Coun_update, I determine the sample volume to be 21.
sample.1977.h <- highI_Coun_update %>%
sample_n(21)
sample.1977.h
## # A tibble: 21 × 4
## country year gdpPercap lifeExp
## <fct> <int> <dbl> <dbl>
## 1 Israel 1977 13307. 73.1
## 2 Puerto Rico 1977 9771. 73.4
## 3 United States 1977 24073. 73.4
## 4 Spain 1977 13237. 74.4
## 5 Ireland 1977 11151. 72.0
## 6 Italy 1977 14256. 73.5
## 7 Norway 1977 23311. 75.4
## 8 Singapore 1977 11210. 70.8
## 9 Hungary 1977 11675. 70.0
## 10 Slovenia 1977 15277. 71.0
## # … with 11 more rows
ggplot(sample.1977.h, aes(gdpPercap, lifeExp)) +
geom_point()
ggplot(sample.1977.h, aes(year, gdpPercap)) +
geom_boxplot()
ggplot(sample.1977.h, aes(gdpPercap)) +
geom_histogram() +
stat_bin(bins = 7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(sample.1977.h, aes(year, lifeExp)) +
geom_boxplot()
ggplot(sample.1977.h, aes(lifeExp)) +
geom_histogram() +
stat_bin(bins = 6)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both box plots and histograms demonstrate that in this sample two
variables basically form the linear correlation without outliers, and
that variables generally follow the normal distribution.
Then, we use the point-range to visualise the mean and standard deviation of this sample data.
ggplot(sample.1977.h, aes(year, lifeExp)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x),
fun.min = function(x) mean(x) - sd(x))
ggplot(sample.1977.h, aes(year, gdpPercap)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x),
fun.min = function(x) mean(x) - sd(x))
The mean of life expectancy in 21 sample high-income countries in 1977
is around 72; the SD is around 2.3. The mean of GDP per capital in 21
sample high-income countries in 1977 is around 15000; the SD is around
4000.
Using filter function to filter all the rows of low-income countries that meet out standard. Using select function to keep variables this EDA focuses: country, year, gdpPercap, lifeExp.
lowI_Coun <- gapminder %>%
filter(year == 1977, gdpPercap < 2000) %>%
select(country, year, gdpPercap, lifeExp) %>%
na.omit()
Using geom_point and geom_boxplot, we could visualise the data distribution of GDP per capital and life expectancy in high-income countries in 1977.
ggplot(lowI_Coun, aes(gdpPercap, lifeExp)) +
geom_point()
ggplot(lowI_Coun, aes(year, gdpPercap)) +
geom_boxplot()
ggplot(lowI_Coun, aes(year, lifeExp)) +
geom_boxplot()
The point plot show that basically two variables form a linear correlation. Since there are outliers in life expectanct, we need to filter the dataset again.
After filtering the data, we use the histogram to check if the data of GDP per capital and the data of life expectancy follow the normal distribution. I set the bin (interval) value as 5.
lowI_Coun_update <- lowI_Coun %>%
filter(lifeExp <65)
ggplot(lowI_Coun_update, aes(gdpPercap)) +
geom_histogram() +
stat_bin(bins = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(lowI_Coun_update, aes(lifeExp)) +
geom_histogram() +
stat_bin(bins = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Two histograms display that both GDP per capital and life expectancy are
generally follow the normal distribution despite a certain degree of
right or left-skewness.
Then, we prepare the sample data using the function sample_n. Since there are 49 observations in the data frame highI_Coun_update, I determine the sample volume to be 24.
sample.1977.l <- lowI_Coun_update %>%
sample_n(24)
ggplot(sample.1977.l, aes(gdpPercap, lifeExp)) +
geom_point()
ggplot(sample.1977.l, aes(year, gdpPercap)) +
geom_boxplot()
ggplot(sample.1977.l, aes(gdpPercap)) +
geom_histogram() +
stat_bin(bins = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(sample.1977.l, aes(year, lifeExp)) +
geom_boxplot()
ggplot(sample.1977.l, aes(lifeExp)) +
geom_histogram() +
stat_bin(bins = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both box plots and histograms demonstrate that in this sample two
variables basically form the linear correlation without outliers, and
that variables generally follow the normal distribution.
Then, we use the point-range to visualise the mean and standard deviation of sample data.
ggplot(sample.1977.l, aes(year, lifeExp)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x),
fun.min = function(x) mean(x) - sd(x))
ggplot(sample.1977.l, aes(year, gdpPercap)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x),
fun.min = function(x) mean(x) - sd(x))
The mean of life expectancy in 21 sample low-income countries in 1977 is
around 48; the SD is around 54.5. The mean of GDP per capital in 24
sample low-income countries in 1977 is around 975; the SD is around
400.
3.2 Calculate correlation coefficient and provide your interpretation.
round(cor(sample.1977.h$gdpPercap, sample.1977.h$lifeExp), 3)
## [1] 0.612
round(cor(sample.1977.l$gdpPercap, sample.1977.l$lifeExp), 3)
## [1] 0.247
The linear correlation coefficient for the sample of high-income countries (size: 21) is around 0.573, suggesting a stong positive correlation between GDP per capital and life expectancy; The linear correlation coefficient for the sample of low-income countries (size: 24) is around 0.099, suggesting an extreme weak positive correlation between GDP per capital and life expectancy.
For the sample of low-income countries (size: 24): The null hypothesis: there is a negative correlation or no correlation between GDP per capital and life expectancy in low-income countries in 1977. The alternative hypothesis: There is a positive correlation between GDP per capital and life expectancy in low-income countries in 1977.
The questions are: At the 5 per cent significance level, do the data provide sufficient evidence to conclude that GDP per capital is positively correlated to life expectancy in high-income countries in 1977?
At the 5 per cent significance level, do the data provide sufficient evidence to conclude that GDP per capital is positively correlated to life expectancy in low-income countries in 1977?
4.3 Perform Pearson correlation test between two variables. 4.4 Decide whether to reject or fail to reject your null hypothesis, report selected significance level. 4.5 Interpret and report the results.
sample_h_r <- cor.test(sample.1977.h$gdpPercap, sample.1977.h$lifeExp, method = "pearson")
sample_h_r
##
## Pearson's product-moment correlation
##
## data: sample.1977.h$gdpPercap and sample.1977.h$lifeExp
## t = 3.3747, df = 19, p-value = 0.003181
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2453240 0.8256671
## sample estimates:
## cor
## 0.6121791
The P-value is 0.00659, far lower than the specify significance level α = 0.05; hence, seeing P < α, we conclude that at 5 percent significance level, the data provide very strong evidence against the null hypothesis in favour of the alternative hypothesis.
sample_l_r <- cor.test(sample.1977.l$gdpPercap, sample.1977.l$lifeExp, method = "pearson")
sample_l_r
##
## Pearson's product-moment correlation
##
## data: sample.1977.l$gdpPercap and sample.1977.l$lifeExp
## t = 1.193, df = 22, p-value = 0.2456
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1742199 0.5911182
## sample estimates:
## cor
## 0.2465008
The P-value is 0.6437, far higher than the specify significance level α = 0.05; hence, seeing P > α, we conclude that at 5 percent significance level, the data provides weak or none evidence against the null hypothesis in favour of the alternative hypothesis.
Since we have the population, we could model the linear regression of population to see if the hypothesis test result is reliable.
ggplot(highI_Coun_update, aes(gdpPercap, lifeExp)) +
geom_point() +
sm_corr_theme() +
sm_statCorr() +
theme(legend.position = "bottom")
## sm_corr_theme is equivalent to sm_hvgrid.
## `geom_smooth()` using formula = 'y ~ x'
ggplot(lowI_Coun_update, aes(gdpPercap, lifeExp)) +
geom_point() +
sm_corr_theme() +
sm_statCorr() +
theme(legend.position = "bottom")
## sm_corr_theme is equivalent to sm_hvgrid.
## `geom_smooth()` using formula = 'y ~ x'
The plots finding can support the validity of hypothesis test
finding.