Introduction

The dataset I will analyze is the Gapminder dataset. It provides data on the population, life expectancy, and GDP per capita of countries around the world from 1952 to 2007. There are six variables in the dataset: country, continent, year, lifeExp, pop, and gdpPercap.

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
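A preview like the one above can be produced roughly as follows. This is a minimal sketch that assumes the data come from the `gapminder` R package (the report does not state how the data were loaded).

```r
# Assumption: the gapminder package is installed, e.g. install.packages("gapminder")
library(gapminder)

# Show the first six rows of the dataset
head(gapminder)

# Number of rows and columns
dim(gapminder)
```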

Questions for EDA

1. What are the observed population, the observation unit, and the reference period?

The observed population is the countries of the world (142 countries appear in the dataset), described by their life expectancy, GDP per capita, and population. The observation unit is a country in a given year, and the reference period is 1952 to 2007, recorded at five-year intervals.

2. What are the data types of the variables? Do we need to change them?

The dataset includes 6 variables. Their data types are shown below:

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

We don’t need to change the data types of the variables.

3. Are there any outliers? Are there any values that look like errors?

Here I will check whether there are outliers in the GDP per capita data. The lower and upper cutoffs for outliers of gdpPercap are -10983 and 21511. The maximum of gdpPercap is 113523.13 and the minimum is 241.17. The maximum lies far above the upper cutoff, so there are high outliers in the GDP per capita data; the minimum is above the lower cutoff, so there are no low outliers.
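The cutoffs are consistent with the common 1.5 × IQR rule applied to the quartiles of gdpPercap. The sketch below redoes that arithmetic from the rounded quartiles printed by `summary()` in this report (1202.1 and 9325.5); the variable names are only illustrative.

```r
# Quartiles of gdpPercap as printed by summary() in this report
q1 <- 1202.1
q3 <- 9325.5

iqr <- q3 - q1                  # interquartile range
lower_cutoff <- q1 - 1.5 * iqr  # values below this are flagged as low outliers
upper_cutoff <- q3 + 1.5 * iqr  # values above this are flagged as high outliers

round(c(lower = lower_cutoff, upper = upper_cutoff))

# Compare the cutoffs with the observed extremes of gdpPercap
gdp_min <- 241.17
gdp_max <- 113523.13
c(low_outliers = gdp_min < lower_cutoff, high_outliers = gdp_max > upper_cutoff)
```

With the full data available, the same cutoffs could be computed with `quantile()` and `IQR()` instead of typing in the quartiles.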

4. What is the mean for each variable?

The mean of lifeExp is 59.47, the mean of pop is 29601212, and the mean of gdpPercap is 7215.33.

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

5. Are there any Null / NA values?

There are no Null / NA values: the total count of missing values in the dataset is 0.

sum(is.na(gapminder))
## [1] 0

Description of data cleaning and transformation

Here I want to focus on Asia. I will select the rows for Asian countries over the whole period from 1952 to 2007 and drop the continent column, which is constant after filtering.

## # A tibble: 6 × 5
##   country      year lifeExp      pop gdpPercap
##   <fct>       <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan  1952    28.8  8425333      779.
## 2 Afghanistan  1957    30.3  9240934      821.
## 3 Afghanistan  1962    32.0 10267083      853.
## 4 Afghanistan  1967    34.0 11537966      836.
## 5 Afghanistan  1972    36.1 13079460      740.
## 6 Afghanistan  1977    38.4 14880372      786.
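One way to build this subset is sketched below. It assumes the `gapminder` and `dplyr` packages; the name `df_cn` is taken from the test output later in this report, but the exact pipeline is my assumption.

```r
library(gapminder)
library(dplyr)

# Keep only Asian countries and drop the now-constant continent column
df_cn <- gapminder %>%
  filter(continent == "Asia") %>%
  select(-continent)

head(df_cn)
```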

Description of correlation analysis

Here I want to explore the relationship between lifeExp (life expectancy) and gdpPercap (GDP per capita). First, I will plot the two variables against each other to see their relationship.

## `geom_smooth()` using formula 'y ~ x'

From the scatter plot, I think the two variables have a moderate positive correlation.
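A scatter plot of this kind can be drawn with `ggplot2`. The sketch below is an assumption about how the plot was made: the choice of axes, the linear smoother (`method = "lm"`), and the labels are illustrative and not taken from the original code.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Asia subset used throughout the analysis
df_cn <- gapminder %>% filter(continent == "Asia")

# Scatter plot with a fitted straight line (assumed smoother)
p <- ggplot(df_cn, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "GDP per capita", y = "Life expectancy (years)")

print(p)
```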

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histograms, the data in both variables do not follow a normal distribution. Next, I will calculate the correlation coefficient of the two variables to quantify the relationship between them.
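The two histograms can be sketched as below. The `stat_bin()` messages above suggest the default of 30 bins was used; here I set explicit bin widths instead, and the values 5 (years) and 5000 (GDP per capita units) are arbitrary assumptions chosen only for illustration.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

df_cn <- gapminder %>% filter(continent == "Asia")

# Histogram of life expectancy, one bar per 5 years (assumed bin width)
h_life <- ggplot(df_cn, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)

# Histogram of GDP per capita, one bar per 5000 units (assumed bin width)
h_gdp <- ggplot(df_cn, aes(x = gdpPercap)) +
  geom_histogram(binwidth = 5000)

print(h_life)
print(h_gdp)
```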

## [1] 0.3820476

The correlation coefficient of the two variables is 0.382, which indicates a moderate positive correlation. This means that higher values of one variable tend to occur together with higher values of the other; on its own it does not show that one variable causes the other.
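The coefficient can be computed with base R's `cor()`. This sketch assumes that `df_cn` is the Asia subset of the `gapminder` package data.

```r
library(gapminder)
library(dplyr)

df_cn <- gapminder %>% filter(continent == "Asia")

# Pearson correlation between life expectancy and GDP per capita
r <- cor(df_cn$lifeExp, df_cn$gdpPercap, method = "pearson")
r
```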

Hypothesis testing

State the null hypothesis and the alternative hypothesis

H0: There is no linear relationship between lifeExp and gdpPercap (the true correlation is equal to 0).

H1: There is a linear relationship between lifeExp and gdpPercap (the true correlation is not equal to 0).

Report on collected data and sample size

The data I use are the observations for Asia. The sample size is 396 observations (33 countries, each observed in 12 years).

Perform Pearson correlation test between two variables

## 
##  Pearson's product-moment correlation
## 
## data:  df_cn$lifeExp and df_cn$gdpPercap
## t = 8.2059, df = 394, p-value = 3.287e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2945926 0.4631563
## sample estimates:
##       cor 
## 0.3820476
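To see where the numbers in this output come from, the sketch below recomputes the t statistic, the two-sided p-value, and the 95% confidence interval from only the reported correlation and the sample size. It uses the standard formulas for the Pearson test (t distribution with n - 2 degrees of freedom, Fisher z-transformation for the interval); the variable names are illustrative.

```r
r <- 0.3820476  # sample correlation reported above
n <- 396        # number of observations in the Asia subset

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 4)
signif(p_value, 4)
round(ci, 4)
```

Up to rounding, these values should agree with the t, p-value, and confidence interval printed by the test.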

Decide whether to reject or fail to reject your null hypothesis, report selected significance level

The p-value of the correlation test between these two variables is 3.287e-15. This is below the selected significance level of 0.05, so I reject the null hypothesis of no correlation.

Interpret and report the results

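We can reject the null hypothesis that there is no relationship between life expectancy and GDP per capita. Therefore, we conclude that, among Asian countries from 1952 to 2007, higher GDP per capita is associated with longer life expectancy (r = 0.38, 95% confidence interval 0.29 to 0.46). The test shows an association only; it does not tell us whether longer lives raise GDP per capita or higher GDP per capita lengthens lives.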