Load the gapminder data set into the R Markdown session. The gapminder data contains life expectancy, population, and GDP per capita by country and year. I first import the data set and then perform exploratory data analysis (EDA).
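The loading chunk is not shown in the rendered output, so the following is only a minimal sketch of what it might look like. It assumes the data come from the `gapminder` package (not a downloaded file) and that `dplyr` is attached, as the warnings below suggest.

```r
# Assumption: the data set is the one shipped with the gapminder package
library(dplyr)
library(gapminder)

# Print the first six rows to get a feel for the columns
head(gapminder)
```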
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'gapminder' was built under R version 4.1.3
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
The data types of the variables are as follows:
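A structure listing like the one below is typically produced with `str()`; this is a sketch under that assumption, since the original chunk is not visible.

```r
library(gapminder)

# Compact overview: dimensions, column names, and the type of each column
str(gapminder)
```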
## tibble [1,704 x 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
There are no NA/NULL values in the data set:
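One common way to obtain a single count of missing cells is to sum over `is.na()`. The variable name `n_missing` is only illustrative.

```r
library(gapminder)

# is.na() returns a logical matrix; summing it counts the missing cells
n_missing <- sum(is.na(gapminder))
n_missing
```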
## [1] 0
What are the means (and other summary statistics) of life expectancy, population, and GDP per capita?
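The two outputs that follow look like a `summary()` of the full data and the head of a European subset without the `continent` column. The sketch below reproduces that under those assumptions; the name `newdf` is taken from the test output later in the report, while the exact filtering steps are a guess.

```r
library(dplyr)
library(gapminder)

# Per-column summary: counts for factors, quartiles and mean for numeric columns
summary(gapminder)

# Assumed subset: European rows only, with the now-constant continent column dropped
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

head(newdf)
```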
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
## # A tibble: 6 x 5
## country year lifeExp pop gdpPercap
## <fct> <int> <dbl> <int> <dbl>
## 1 Albania 1952 55.2 1282697 1601.
## 2 Albania 1957 59.3 1476505 1942.
## 3 Albania 1962 64.8 1728137 2313.
## 4 Albania 1967 66.2 1984060 2760.
## 5 Albania 1972 67.7 2263554 3313.
## 6 Albania 1977 68.9 2509048 3533.
Now I will analyse two variables, population and GDP per capita, and look at the correlation between them.
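A scatter plot with a fitted line is one way to inspect the relationship. This sketch assumes the European subset `newdf`, a linear smoother, and population on the x-axis; none of these choices are confirmed by the rendered output.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Assumed subset, rebuilt here so the chunk runs on its own
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# Scatter plot of GDP per capita against population with a linear fit
p_scatter <- ggplot(newdf, aes(x = pop, y = gdpPercap)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Population", y = "GDP per capita")

p_scatter
```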
## `geom_smooth()` using formula 'y ~ x'
From the plot we can see that the two variables are only weakly correlated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, neither variable follows a normal distribution.
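Histograms such as the two referred to above could be drawn as follows. This is a sketch that again assumes the European subset `newdf`; setting `bins` explicitly is a choice made here so that the `stat_bin()` message does not appear.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Assumed subset, rebuilt here so the chunk runs on its own
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# One histogram per variable; 30 bins matches the ggplot2 default
p_pop <- ggplot(newdf, aes(x = pop)) +
  geom_histogram(bins = 30) +
  labs(x = "Population", y = "Count")

p_gdp <- ggplot(newdf, aes(x = gdpPercap)) +
  geom_histogram(bins = 30) +
  labs(x = "GDP per capita", y = "Count")

p_pop
p_gdp
```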
Next, I check numerically whether there is any relationship between the two variables.
## [1] 0.1097244
The correlation coefficient between the two variables is about 0.11, which indicates a weak positive correlation.
Null hypothesis: there is no correlation between population and GDP per capita. Alternative hypothesis: there is a correlation.
The data used for the test are restricted to Europe; this subset has 360 observations.
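The coefficient and the test can be obtained with base R's `cor()` and `cor.test()`. The sketch assumes `newdf` is the European subset; the object names `r` and `test` are illustrative.

```r
library(dplyr)
library(gapminder)

# Assumed subset, rebuilt here so the chunk runs on its own
newdf <- gapminder %>%
  filter(continent == "Europe") %>%
  select(-continent)

# Pearson correlation coefficient between population and GDP per capita
r <- cor(newdf$pop, newdf$gdpPercap)
r

# Two-sided test of H0: the true correlation is 0 (Pearson is the default method)
test <- cor.test(newdf$pop, newdf$gdpPercap)
test
```

The degrees of freedom reported by the test are n - 2, so 360 observations give df = 358.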
##
## Pearson's product-moment correlation
##
## data: newdf$pop and newdf$gdpPercap
## t = 2.0887, df = 358, p-value = 0.03744
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.006435561 0.210696641
## sample estimates:
## cor
## 0.1097244
The p-value is 0.03744, which is less than 0.05, so we can reject the null hypothesis at the 5% significance level.
Population and GDP per capita are therefore related in the European data, although with a coefficient of about 0.11 the linear relationship is weak.