HW.7

Homework Assignment 7

Download the gapminder dataset into R Markdown session.

Perform quick EDA and pick up two variables you want to explore in more depth (for example, life expectancy and gdp).

library(gapminder)
library(ggplot2) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(tidyverse)

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(readr)
library(tibble)
View(gapminder)

I installed new packages as they would play a role in visualising the P-value and r in the plot.

install.packages('devtools', repos = "smin95/smplot.io")

## Warning: unable to access index for repository smin95/smplot.io/src/contrib:
##   cannot open URL 'smin95/smplot.io/src/contrib/PACKAGES'

## Warning: package 'devtools' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

## Warning: unable to access index for repository smin95/smplot.io/bin/macosx/contrib/4.1:
##   cannot open URL 'smin95/smplot.io/bin/macosx/contrib/4.1/PACKAGES'

devtools::install_github('smin95/smplot', force = TRUE)

## Downloading GitHub repo smin95/smplot@HEAD

## 
##   
   checking for file ‘/private/var/folders/v5/z0kd8v2n2lxd9dcj7n9wddm80000gn/T/Rtmpm32z4X/remotes510a59f80e8c/smin95-smplot2-f207d79/DESCRIPTION’ ...
  
✔  checking for file ‘/private/var/folders/v5/z0kd8v2n2lxd9dcj7n9wddm80000gn/T/Rtmpm32z4X/remotes510a59f80e8c/smin95-smplot2-f207d79/DESCRIPTION’
## 
  
─  preparing ‘smplot2’:
## 
  
   checking DESCRIPTION meta-information ...
  
✔  checking DESCRIPTION meta-information
## 
  
─  checking for LF line-endings in source and make files and shell scripts
## 
  
─  checking for empty or unneeded directories
## 
  
   Omitted ‘LazyData’ from DESCRIPTION
## 
  
─  building ‘smplot2_0.1.0.tar.gz’
## 
  
   
##

library(smplot2)

## Updated tutorial for smplot: smin95.github.io/dataviz/

library(cowplot)

Prepare the data set that includes only variables of your interest in a suitable format for analysis (use dlyr package and tidyr when necessary).

2.1 Explore two variables and how they are associated with each other (correlation analysis). Include the description of assumptions for correlation analysis, concluding with the type of analysis you choose. Create necessary visualisations to support your analysis (a histogram, a scatter plot, etc) and include your interpretation of the graphs.

Explore two variables and how they are associated with each other (correlation analysis).

3.1 Visualize your data using a scatter plot and include the description of assumptions for correlation analysis:

Is the co-variation linear?
Are the data from each of the 2 variables (x, y) follow a normal distribution (visual inspection of the data normality using histograms)?
Questions for EDA
Description of data cleaning and transformation
Description of correlation analysis (steps for visualisation, checking assumption for correlation analysis, performing correlation analysis)
Data interpretation and conclusions

Introduction:

This data set includes 6 variables: two factorial variables (country and continent), two integer variables (year and population), and two numeric variables (life expectancy and GDP). Using the function sapply, I found there is no value unavailable in this data set. This data set contains data of 142 countries in five continents: Africa, Americas (including NA and SA), Asia, Europe, and Oceania, within 12 years from 1952 to 2007, with an interval of 5 years.

str(gapminder)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

#using structure to view the structure if the data set.
if_na <- gapminder %>% sapply(anyNA) %>% data.frame()
#sapply is a variatio of apply function. This function treats list, vector, and data frame as an input but give outout in a vector or a matrix. Here, the function anyNA is passed as the argument inside the sapply function.
#anyNA is a function intending to check if an object contains any NA or NA vakues, more quickly than the function any(is.na())
if_na

##               .
## country   FALSE
## continent FALSE
## year      FALSE
## lifeExp   FALSE
## pop       FALSE
## gdpPercap FALSE

str(if_na)

## 'data.frame':    6 obs. of  1 variable:
##  $ .: logi  FALSE FALSE FALSE FALSE FALSE FALSE

#there is only one vector of six observations of variables
colnames(if_na) <- c("NA")
#change the name of colnumn
if_na

##              NA
## country   FALSE
## continent FALSE
## year      FALSE
## lifeExp   FALSE
## pop       FALSE
## gdpPercap FALSE

final_if_na <- as.data.frame(t(if_na))
final_if_na

##    country continent  year lifeExp   pop gdpPercap
## NA   FALSE     FALSE FALSE   FALSE FALSE     FALSE

#We could use a function composition as.data.frame(t()) to change columns into rows and change rows into columns
#t() returns the transpose of the data frame or matrix
?unique

## Help on topic 'unique' was found in the following packages:
## 
##   Package               Library
##   base                  /Library/Frameworks/R.framework/Resources/library
##   data.table            /Library/Frameworks/R.framework/Versions/4.1/Resources/library
## 
## 
## Using the first match ...

unique(gapminder$continent)

## [1] Asia     Europe   Africa   Americas Oceania 
## Levels: Africa Americas Asia Europe Oceania

unique(gapminder$year)

##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

# the unique function returns a vector, data frame or an array with duplicate elements /rows removed. We can use it to check how many levels a vector of factor has and how many unique values a numeric vector has.
# Another way to find what continents this data set includes is to rely on geographical knowledge. The data.frame(gapminder) shows some countries. Since there is no countries in Antarctic, if we find a South American country has a value of America instead of South America, we can conclude this data set only concludes 5 continents.

My EDA questions:

General assumption: I hypothesis that there be a strong positive linear relationship between the GDP per capital and life expectancy whether in low-income countries (gdpPercap below 2000) or high-income countries (gdpPercap between 9000 and 30000) in 1977 when the resource crisis reached the peak. In order to carry out the hypothesis testing for Pearson’s correlation, this test has to meet at least three critical assumption:

both two variables are continuous and linear correlated;
variables follow or almost follow a normal distribution, and assumption;
no major outliers, are all met. We can start hypothesis testing for Pearson’s correlation.

Hence in the first stage, let’s explore the central tendency of the filtered data set. Some qustions emerge:

Q1, What is the mean of life expectancy and GDP per capital in high-income countries and low-income countries in 1977 ? How does the data distribute?

Q2, Is there any outlier in the data set? Does the data form a linear correlation?

High-income countries:

Using filter function to filter all the rows of high-income countries that meet out standard. Using select function to keep variables this EDA focuses: country, year, gdpPercap, lifeExp.

highI_Coun <- gapminder %>% 
  filter(year %in% c(1977), between(gdpPercap, 9000, 30000)) %>%
  select(country, year, gdpPercap, lifeExp) %>%
  na.omit()

Using geom_point and geom_boxplot, we could visualise the data distribution of GDP per capital and life expectancy in high-income countries in 1977.

ggplot(highI_Coun, aes(gdpPercap, lifeExp)) +
  geom_point()

ggplot(highI_Coun, aes(year, gdpPercap)) +
  geom_boxplot()

ggplot(highI_Coun, aes(year, lifeExp)) +
  geom_boxplot()

The point plot show that basically two variables form a linear correlation. Since there are outliers in life expectancy, we need to filter the dataset again.

After filtering the data, we use the histogram to check if the data of GDP per capital and the data of life expectancy follow the normal distribution. I set the bin (interval) value as 5.

highI_Coun_update <- highI_Coun %>% 
  filter(lifeExp >= 62)

ggplot(highI_Coun_update, aes(gdpPercap)) +
  geom_histogram() +
  stat_bin(bins = 5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(highI_Coun_update, aes(lifeExp)) +
  geom_histogram() +
  stat_bin(bins = 5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Two histograms display that both GDP per capital and life expectancy are generally follow the normal distribution despite a certain degree of right or left-skewness.

Then, we prepare the sample data using the function sample_n. The function sample_n is used for randomly select observations from a data frame. If we are unsatisfied with the sample of data frame, we could repeat performing until we get the most reliable sample. Since there are 42 observations in the data frame highI_Coun_update, I determine the sample volume to be 21.

sample.1977.h <- highI_Coun_update %>% 
  sample_n(21)
sample.1977.h

## # A tibble: 21 × 4
##    country        year gdpPercap lifeExp
##    <fct>         <int>     <dbl>   <dbl>
##  1 Israel         1977    13307.    73.1
##  2 Puerto Rico    1977     9771.    73.4
##  3 United States  1977    24073.    73.4
##  4 Spain          1977    13237.    74.4
##  5 Ireland        1977    11151.    72.0
##  6 Italy          1977    14256.    73.5
##  7 Norway         1977    23311.    75.4
##  8 Singapore      1977    11210.    70.8
##  9 Hungary        1977    11675.    70.0
## 10 Slovenia       1977    15277.    71.0
## # … with 11 more rows

ggplot(sample.1977.h, aes(gdpPercap, lifeExp)) + 
  geom_point()

ggplot(sample.1977.h, aes(year, gdpPercap)) +
  geom_boxplot()

ggplot(sample.1977.h, aes(gdpPercap)) +
  geom_histogram() +
  stat_bin(bins = 7)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(sample.1977.h, aes(year, lifeExp)) +
  geom_boxplot()

ggplot(sample.1977.h, aes(lifeExp)) +
  geom_histogram() +
  stat_bin(bins = 6)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Both box plots and histograms demonstrate that in this sample two variables basically form the linear correlation without outliers, and that variables generally follow the normal distribution.

Then, we use the point-range to visualise the mean and standard deviation of this sample data.

ggplot(sample.1977.h, aes(year, lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

ggplot(sample.1977.h, aes(year, gdpPercap)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

The mean of life expectancy in 21 sample high-income countries in 1977 is around 72; the SD is around 2.3. The mean of GDP per capital in 21 sample high-income countries in 1977 is around 15000; the SD is around 4000.

Low-income countries:

Using filter function to filter all the rows of low-income countries that meet out standard. Using select function to keep variables this EDA focuses: country, year, gdpPercap, lifeExp.

lowI_Coun <- gapminder %>% 
  filter(year == 1977, gdpPercap < 2000) %>%
  select(country, year, gdpPercap, lifeExp) %>%
  na.omit()

Using geom_point and geom_boxplot, we could visualise the data distribution of GDP per capital and life expectancy in high-income countries in 1977.

ggplot(lowI_Coun, aes(gdpPercap, lifeExp)) +
  geom_point()

ggplot(lowI_Coun, aes(year, gdpPercap)) +
  geom_boxplot()

ggplot(lowI_Coun, aes(year, lifeExp)) +
  geom_boxplot()

The point plot show that basically two variables form a linear correlation. Since there are outliers in life expectanct, we need to filter the dataset again.

After filtering the data, we use the histogram to check if the data of GDP per capital and the data of life expectancy follow the normal distribution. I set the bin (interval) value as 5.

lowI_Coun_update <- lowI_Coun %>% 
  filter(lifeExp <65)

ggplot(lowI_Coun_update, aes(gdpPercap)) +
  geom_histogram() +
  stat_bin(bins = 5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(lowI_Coun_update, aes(lifeExp)) +
  geom_histogram() +
  stat_bin(bins = 5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Two histograms display that both GDP per capital and life expectancy are generally follow the normal distribution despite a certain degree of right or left-skewness.

Then, we prepare the sample data using the function sample_n. Since there are 49 observations in the data frame highI_Coun_update, I determine the sample volume to be 24.

sample.1977.l <- lowI_Coun_update %>%
  sample_n(24)

ggplot(sample.1977.l, aes(gdpPercap, lifeExp)) + 
  geom_point()

ggplot(sample.1977.l, aes(year, gdpPercap)) +
  geom_boxplot()

ggplot(sample.1977.l, aes(gdpPercap)) +
  geom_histogram() +
  stat_bin(bins = 5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(sample.1977.l, aes(year, lifeExp)) +
  geom_boxplot()

ggplot(sample.1977.l, aes(lifeExp)) +
  geom_histogram() +
  stat_bin(bins = 5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Both box plots and histograms demonstrate that in this sample two variables basically form the linear correlation without outliers, and that variables generally follow the normal distribution.

Then, we use the point-range to visualise the mean and standard deviation of sample data.

ggplot(sample.1977.l, aes(year, lifeExp)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

ggplot(sample.1977.l, aes(year, gdpPercap)) +
  stat_summary(fun = mean,
               geom = "pointrange",
               fun.max = function(x) mean(x) + sd(x),
               fun.min = function(x) mean(x) - sd(x))

The mean of life expectancy in 21 sample low-income countries in 1977 is around 48; the SD is around 54.5. The mean of GDP per capital in 24 sample low-income countries in 1977 is around 975; the SD is around 400.

3.2 Calculate correlation coefficient and provide your interpretation.

round(cor(sample.1977.h$gdpPercap, sample.1977.h$lifeExp), 3)

## [1] 0.612

round(cor(sample.1977.l$gdpPercap, sample.1977.l$lifeExp), 3)

## [1] 0.247

The linear correlation coefficient for the sample of high-income countries (size: 21) is around 0.573, suggesting a stong positive correlation between GDP per capital and life expectancy; The linear correlation coefficient for the sample of low-income countries (size: 24) is around 0.099, suggesting an extreme weak positive correlation between GDP per capital and life expectancy.

Hypothesis testing. 4.1 the null hypothesis and the alternative hypothesis. 4.2 Report on collected data and sample size. Based on finding above, I decided two hypotheses for the sample of high-income countries (size: 21): The null hypothesis: There is a negative correlation or no correlation between GDP per capital and life expectancy in high-income countries in 1977. The alternative hypothesis: There is a positive correlation between GDP per capital and life expectancy in high-income countries in 1977.

For the sample of low-income countries (size: 24): The null hypothesis: there is a negative correlation or no correlation between GDP per capital and life expectancy in low-income countries in 1977. The alternative hypothesis: There is a positive correlation between GDP per capital and life expectancy in low-income countries in 1977.

The questions are: At the 5 per cent significance level, do the data provide sufficient evidence to conclude that GDP per capital is positively correlated to life expectancy in high-income countries in 1977?

At the 5 per cent significance level, do the data provide sufficient evidence to conclude that GDP per capital is positively correlated to life expectancy in low-income countries in 1977?

4.3 Perform Pearson correlation test between two variables. 4.4 Decide whether to reject or fail to reject your null hypothesis, report selected significance level. 4.5 Interpret and report the results.

sample_h_r <- cor.test(sample.1977.h$gdpPercap, sample.1977.h$lifeExp, method = "pearson")
sample_h_r

## 
##  Pearson's product-moment correlation
## 
## data:  sample.1977.h$gdpPercap and sample.1977.h$lifeExp
## t = 3.3747, df = 19, p-value = 0.003181
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2453240 0.8256671
## sample estimates:
##       cor 
## 0.6121791

The P-value is 0.00659, far lower than the specify significance level α = 0.05; hence, seeing P < α, we conclude that at 5 percent significance level, the data provide very strong evidence against the null hypothesis in favour of the alternative hypothesis.

sample_l_r <- cor.test(sample.1977.l$gdpPercap, sample.1977.l$lifeExp, method = "pearson")
sample_l_r

## 
##  Pearson's product-moment correlation
## 
## data:  sample.1977.l$gdpPercap and sample.1977.l$lifeExp
## t = 1.193, df = 22, p-value = 0.2456
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1742199  0.5911182
## sample estimates:
##       cor 
## 0.2465008

The P-value is 0.6437, far higher than the specify significance level α = 0.05; hence, seeing P > α, we conclude that at 5 percent significance level, the data provides weak or none evidence against the null hypothesis in favour of the alternative hypothesis.

Since we have the population, we could model the linear regression of population to see if the hypothesis test result is reliable.

ggplot(highI_Coun_update, aes(gdpPercap, lifeExp)) +
  geom_point() +
  sm_corr_theme() +
  sm_statCorr() +
  theme(legend.position = "bottom")

## sm_corr_theme is equivalent to sm_hvgrid.

## `geom_smooth()` using formula = 'y ~ x'

ggplot(lowI_Coun_update, aes(gdpPercap, lifeExp)) +
  geom_point() +
  sm_corr_theme() +
  sm_statCorr() +
  theme(legend.position = "bottom")

## sm_corr_theme is equivalent to sm_hvgrid.

## `geom_smooth()` using formula = 'y ~ x'

The plots finding can support the validity of hypothesis test finding.