Introduction

This report uses the gapminder data set, which records life expectancy, population, and GDP per capita for countries on five continents, at five-year intervals from 1952 to 2007.

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
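
A preview like the one above can be produced with a few lines of R. This is a minimal sketch, assuming the `gapminder` package is already installed; the report does not show the original code, so the exact calls are an assumption.

```r
# Load the data set (assumes install.packages("gapminder") was run earlier).
library(gapminder)

# Show the first six rows and the column types.
head(gapminder)

# Overall dimensions: rows (country-year observations) and columns.
dim(gapminder)
```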

Exploratory data analysis (EDA)

  1. Check whether the data contain any missing values
## [1] 0

No missing values were found.
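
One way to run this check is to count `NA` cells with base R. The sketch below assumes missing values are encoded as `NA`; the per-column breakdown is an extra step that is not part of the output shown above.

```r
library(gapminder)

# Total number of NA cells across all columns.
sum(is.na(gapminder))

# Per-column counts, useful if the total is ever non-zero.
colSums(is.na(gapminder))
```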

  2. Look at the summary statistics (including the mean) of each variable
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

The mean life expectancy across the data set is 59.47 years, and the mean population is about 2.96 × 10^7 (roughly 29.6 million).
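
The summary table and the two means quoted above can be reproduced as follows. This is a sketch using base R's `summary()` and `mean()`; the original code is not shown in the report, so the exact calls are assumed.

```r
library(gapminder)

# Five-number summary plus mean for numeric columns, counts for factors.
summary(gapminder)

# The two means discussed in the text, computed directly.
mean(gapminder$lifeExp)
mean(gapminder$pop)
```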

Description of data cleaning and transformation

First, I select the continent of Europe and list the countries available:

## # A tibble: 30 × 1
##    country               
##    <fct>                 
##  1 Albania               
##  2 Austria               
##  3 Belgium               
##  4 Bosnia and Herzegovina
##  5 Bulgaria              
##  6 Croatia               
##  7 Czech Republic        
##  8 Denmark               
##  9 Finland               
## 10 France                
## # … with 20 more rows
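
A listing like this can be obtained by filtering on `continent` and keeping the distinct country names. The sketch below assumes `dplyr` is installed; the tibble output above suggests it was used, but that is not confirmed by the report. The name `europe_countries` is hypothetical.

```r
library(gapminder)
library(dplyr)

# Keep European rows, then reduce to one row per country.
# `europe_countries` is an illustrative name, not taken from the report.
europe_countries <- gapminder %>%
  filter(continent == "Europe") %>%
  distinct(country)

europe_countries
```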

Next, I select Sweden as my focus area and keep life expectancy and population for the analysis:

## # A tibble: 6 × 4
##   country  year lifeExp     pop
##   <fct>   <int>   <dbl>   <int>
## 1 Sweden   1952    71.9 7124673
## 2 Sweden   1957    72.5 7363802
## 3 Sweden   1962    73.4 7561588
## 4 Sweden   1967    74.2 7867931
## 5 Sweden   1972    74.7 8122293
## 6 Sweden   1977    75.4 8251648
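
The subset can be built by filtering on `country` and selecting the four columns shown. The object name `Eudata` is taken from the test output later in this report; the `dplyr` calls are an assumption about how it was created.

```r
library(gapminder)
library(dplyr)

# Keep only Sweden's rows and the columns needed for the analysis.
Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

head(Eudata)
```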

Description of correlation analysis

3.1 Visualize the data using a scatter plot and check the assumptions for correlation analysis:
- Is the covariation linear?
- Do the data for each of the two variables (x, y) follow a normal distribution (visual inspection of normality using histograms)?

##         country        year         lifeExp           pop         
##  Sweden     :12   Min.   :1952   Min.   :71.86   Min.   :7124673  
##  Afghanistan: 0   1st Qu.:1966   1st Qu.:73.96   1st Qu.:7791345  
##  Albania    : 0   Median :1980   Median :75.93   Median :8288454  
##  Algeria    : 0   Mean   :1980   Mean   :76.18   Mean   :8220029  
##  Angola     : 0   3rd Qu.:1993   3rd Qu.:78.47   3rd Qu.:8763555  
##  Argentina  : 0   Max.   :2007   Max.   :80.88   Max.   :9031088  
##  (Other)    : 0

The scatter plot suggests a strong positive relationship between life expectancy and population in Sweden.
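
The plots themselves are not reproduced in this text, so here is a minimal sketch of how the scatter plot and the two histograms could be drawn with base graphics. The choice of population on the x-axis, the titles, and the panel layout are all assumptions.

```r
library(gapminder)
library(dplyr)

# Sweden subset, as used throughout the analysis.
Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# Three panels side by side; remember the old settings to restore them.
old_par <- par(mfrow = c(1, 3))

# Scatter plot to judge whether the covariation looks linear.
plot(Eudata$pop, Eudata$lifeExp,
     xlab = "Population", ylab = "Life expectancy (years)",
     main = "Sweden, 1952-2007")

# Histograms for a visual check of normality of each variable.
hist(Eudata$lifeExp, main = "Life expectancy", xlab = "Years")
hist(Eudata$pop, main = "Population", xlab = "People")

par(old_par)
```

With only 12 observations, histograms give a rough impression at best, so any conclusion about normality from them should be treated as tentative.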

3.2 Calculate correlation coefficient and provide your interpretation.

## [1] 0.9799371

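A coefficient between 0.7 and 1 is considered a very strong correlation, so r ≈ 0.98 indicates a very strong positive correlation between life expectancy and population.

Based on the histograms, neither variable appears to follow a normal distribution.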
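
The coefficient reported above is consistent with a Pearson correlation computed by base R's `cor()`. The sketch below rebuilds the Sweden subset so that it runs on its own; the variable `r` is an illustrative name.

```r
library(gapminder)
library(dplyr)

Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# Pearson correlation coefficient (the default method, stated explicitly).
r <- cor(Eudata$lifeExp, Eudata$pop, method = "pearson")
r
```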

Hypothesis testing.

4.1 State the null hypothesis and the alternative hypothesis.

H0: There is no relationship between lifeExp and population in Sweden.

H1: There is a relationship between lifeExp and population in Sweden.

4.2 Report on collected data and sample size.

The data cover Sweden only, and the sample size is 12 observations (one every five years from 1952 to 2007).

4.3 Perform Pearson correlation test between two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  Eudata$lifeExp and Eudata$pop
## t = 15.548, df = 10, p-value = 2.475e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9278426 0.9945284
## sample estimates:
##       cor 
## 0.9799371
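
The output above matches the format printed by `cor.test()` from base R's `stats` package. The following sketch shows the call and how to pull individual values out of the result; `test_result` is a hypothetical name.

```r
library(gapminder)
library(dplyr)

Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# Two-sided Pearson correlation test (H1: true correlation is not 0).
test_result <- cor.test(Eudata$lifeExp, Eudata$pop, method = "pearson")
test_result

# Individual components, handy for reporting.
test_result$estimate   # sample correlation
test_result$parameter  # degrees of freedom (n - 2)
test_result$p.value    # p-value
test_result$conf.int   # 95 percent confidence interval
```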

4.4 Decide whether to reject or fail to reject your null hypothesis, report selected significance level.

2.475*10^-8 < 0.05

## [1] TRUE

The p-value (2.475 × 10^-8) is far below the selected significance level of 0.05, hence the null hypothesis can be rejected.
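
Rather than retyping the p-value by hand, the comparison can read it directly from the test object, which avoids transcription slips in the exponent. This sketch assumes the conventional significance level of 0.05; `alpha` and `p_value` are illustrative names.

```r
library(gapminder)
library(dplyr)

Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# Assumed significance level (the conventional 5 percent).
alpha <- 0.05

# Take the p-value straight from the Pearson test result.
p_value <- cor.test(Eudata$lifeExp, Eudata$pop)$p.value

# TRUE means the null hypothesis is rejected at the chosen level.
p_value < alpha
```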

4.5 Interpret and report the results.

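We can reject the null hypothesis that there is no relationship between life expectancy and population. There is a statistically significant, very strong positive correlation between the two in Sweden (r = 0.98, 95% CI 0.93 to 0.99, p = 2.475 × 10^-8): in years with higher life expectancy, the population was also larger. Correlation alone does not show that longer lives cause a larger population, since both variables rose steadily over the same period.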