Introduction

To begin the exploratory data analysis (EDA), I first load the gapminder data set into the R Markdown session. The gapminder data contains life expectancy, population, and GDP per capita by country and year. The first six rows are shown below.

## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'gapminder' was built under R version 4.1.3
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
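
The code chunk behind this output is not shown. A minimal sketch that should produce it, assuming the data comes from the `gapminder` package and that `dplyr` is loaded as well (both names are taken from the warnings above):

```r
# Assumed setup: package names are taken from the warnings in the output.
library(dplyr)
library(gapminder)

# Preview the first six rows of the gapminder tibble.
first_rows <- head(gapminder)
first_rows
```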

Performing EDA

The data types of the variables are as follows:

## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
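
A structure listing like the one above is what base R's `str()` prints for a tibble. A short sketch, assuming it was called directly on `gapminder`; the `sapply()` line is an extra illustration, not part of the original output:

```r
library(gapminder)

# Compact overview: dimensions, column types, and the first few values.
str(gapminder)

# Alternative view (illustrative): the class of each column.
col_classes <- sapply(gapminder, class)
col_classes
```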

There are no NA/null values in the data set:

## [1] 0
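
One common way to obtain this count is to sum the logical matrix returned by `is.na()`. This is a sketch of that approach, assuming the check was run on the full `gapminder` tibble; the variable name `n_missing` is illustrative:

```r
library(gapminder)

# is.na() flags every missing cell; sum() counts the TRUE values.
n_missing <- sum(is.na(gapminder))
n_missing
```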

What are the means of life expectancy, population, and GDP per capita?

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
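
The table above has the layout of `summary()` output. A sketch, assuming it was called on the whole tibble; the `colMeans()` call is an addition that pulls out only the three means asked about:

```r
library(gapminder)

# Five-number summary plus mean for numeric columns, level counts for factors.
summary(gapminder)

# Only the three means of interest (illustrative extra step).
means <- colMeans(gapminder[, c("lifeExp", "pop", "gdpPercap")])
means
```

The values returned by `colMeans()` should agree with the `Mean` rows of the summary table.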

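Description of data cleaning and transformation

The data is restricted to European countries and the continent column is dropped. The first six rows of the resulting data frame (newdf) are: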

## # A tibble: 6 x 5
##   country  year lifeExp     pop gdpPercap
##   <fct>   <int>   <dbl>   <int>     <dbl>
## 1 Albania  1952    55.2 1282697     1601.
## 2 Albania  1957    59.3 1476505     1942.
## 3 Albania  1962    64.8 1728137     2313.
## 4 Albania  1967    66.2 1984060     2760.
## 5 Albania  1972    67.7 2263554     3313.
## 6 Albania  1977    68.9 2509048     3533.
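
The cleaning code itself is not shown. One way to arrive at a table of this shape, assuming the steps were a filter on Europe (consistent with the 360 European observations reported later) followed by dropping `continent`, and that the result is the `newdf` object named in the test output:

```r
library(dplyr)
library(gapminder)

# Assumed cleaning steps: keep European rows, then drop the continent column.
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

head(newdf)
```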

Correlation between two variables

Next, I analyze two variables, population (pop) and GDP per capita (gdpPercap), and measure the correlation between them.

## `geom_smooth()` using formula 'y ~ x'

The scatter plot shows that the two variables are only weakly correlated.
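
The plot is not reproduced here. A sketch of how such a scatter plot could be drawn with ggplot2, assuming population on the x-axis, GDP per capita on the y-axis, and a linear smoother (`method = "lm"`); the axis assignment and the smoother are assumptions, as is the European subset `newdf`:

```r
library(dplyr)
library(gapminder)
library(ggplot2)

# Assumed subset: European rows only, continent column dropped.
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# Scatter plot with a fitted straight line.
# method = "lm" is an assumption; the source does not name the smoother.
scatter <- ggplot(newdf, aes(x = pop, y = gdpPercap)) +
  geom_point() +
  geom_smooth(method = "lm")
scatter
```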

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the histograms show, neither variable follows a normal distribution.

To check the relationship more precisely, I compute the correlation coefficient:
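
The two `stat_bin()` messages suggest one histogram per variable. A sketch under that assumption; `bins = 30` is set explicitly here, which matches the default reported in the messages and silences them:

```r
library(dplyr)
library(gapminder)
library(ggplot2)

# Assumed subset: European rows only, continent column dropped.
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# One histogram per variable to inspect the shape of each distribution.
hist_pop <- ggplot(newdf, aes(x = pop)) +
  geom_histogram(bins = 30)
hist_gdp <- ggplot(newdf, aes(x = gdpPercap)) +
  geom_histogram(bins = 30)

hist_pop
hist_gdp
```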

## [1] 0.1097244
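
A single coefficient like this is what `cor()` returns, which computes the Pearson correlation by default. A sketch, assuming it was applied to the `pop` and `gdpPercap` columns of the European subset `newdf`:

```r
library(dplyr)
library(gapminder)

# Assumed subset: European rows only, continent column dropped.
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# Pearson correlation coefficient (the default method of cor()).
r <- cor(newdf$pop, newdf$gdpPercap)
r
```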

The correlation coefficient between the two variables is approximately 0.11, which indicates a weak positive correlation.

Hypothesis testing

State the null hypothesis and the alternative hypothesis.

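Null hypothesis: The true correlation between population and GDP per capita is equal to 0.

Alternative hypothesis: The true correlation between population and GDP per capita is not equal to 0.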

Report on collected data and sample size.

The data covers European countries only. The subset contains 360 observations (30 countries with 12 observations each).

Perform Pearson correlation test between two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  newdf$pop and newdf$gdpPercap
## t = 2.0887, df = 358, p-value = 0.03744
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.006435561 0.210696641
## sample estimates:
##       cor 
## 0.1097244
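
The `data:` line in this output indicates a call to `cor.test()` on `newdf$pop` and `newdf$gdpPercap`. A sketch of that call, assuming `newdf` is the European subset; `method = "pearson"` is the default and is written out only for clarity:

```r
library(dplyr)
library(gapminder)

# Assumed subset: European rows only, continent column dropped.
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# Two-sided Pearson correlation test with a 95% confidence interval.
result <- cor.test(newdf$pop, newdf$gdpPercap, method = "pearson")
result

# Individual pieces of the result can be extracted by name.
result$p.value
result$conf.int
```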

Decide whether to reject or fail to reject your null hypothesis, report selected significance level.

The p-value is 0.03744, which is less than the selected significance level of 0.05, so we reject the null hypothesis.
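
The reported test statistic and p-value can be checked by hand from the correlation coefficient and the sample size, using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. A small sketch that uses only the numbers printed above (the variable names are illustrative):

```r
# Values taken from the test output above.
r <- 0.1097244   # sample correlation
n <- 360         # number of observations
alpha <- 0.05    # selected significance level

# t statistic for a Pearson correlation and its two-sided p-value.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Decision rule: reject the null hypothesis when p_value < alpha.
decision <- if (p_value < alpha) {
  "reject the null hypothesis"
} else {
  "fail to reject the null hypothesis"
}

round(t_stat, 4)
round(p_value, 5)
decision
```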

Interpret and report the results.

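We reject the null hypothesis: among European countries there is a statistically significant, though weak, positive linear correlation (r = 0.11) between population and GDP per capita. The 95 percent confidence interval (0.006 to 0.211) has a lower bound close to zero, so the relationship is weak even though it is significant at the 0.05 level.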