The gapminder data set records life expectancy, population, and GDP per capita for countries on five continents, at five-year intervals from 1952 to 2007.
## Warning: package 'gapminder' was built under R version 4.2.2
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
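The preview above is consistent with loading the package and printing the first six rows. A minimal sketch, assuming the `gapminder` package is installed (the variable name `first_rows` is only illustrative):

```r
# Load the gapminder package, which provides the `gapminder` tibble
library(gapminder)

# head() returns the first six rows by default
first_rows <- head(gapminder)
print(first_rows)
```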
## [1] 0
The check returned 0, so none were found.
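The call behind the `0` above is not shown. One common check that returns a single count is the number of missing values; the sketch below assumes that is what was run (the name `n_missing` is hypothetical):

```r
library(gapminder)

# Assumption: the check counts missing (NA) values across all columns
n_missing <- sum(is.na(gapminder))
print(n_missing)
```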
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
The mean life expectancy across the whole data set is 59.47 years, and the mean population is 2.960 × 10^7 (about 29.6 million).
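The two means quoted here can be read off the summary table or computed directly. A short sketch, assuming the `gapminder` package is installed (`mean_life_exp` and `mean_pop` are illustrative names):

```r
library(gapminder)

# Per-column summary statistics, as in the table above
print(summary(gapminder))

# The two means referred to in the text
mean_life_exp <- mean(gapminder$lifeExp)
mean_pop <- mean(gapminder$pop)
print(c(lifeExp = mean_life_exp, pop = mean_pop))
```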
First, I select the continent of Europe to see which countries are available.
## # A tibble: 30 × 1
## country
## <fct>
## 1 Albania
## 2 Austria
## 3 Belgium
## 4 Bosnia and Herzegovina
## 5 Bulgaria
## 6 Croatia
## 7 Czech Republic
## 8 Denmark
## 9 Finland
## 10 France
## # … with 20 more rows
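A listing like the one above could be produced with `dplyr`. This is a sketch under that assumption, since the original call is not shown (`europe_countries` is a hypothetical name):

```r
library(gapminder)
library(dplyr)

# Keep European rows only, then reduce to one row per country
europe_countries <- gapminder %>%
  filter(continent == "Europe") %>%
  distinct(country)

print(europe_countries)
```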
Next, I select Sweden as the focus and keep life expectancy and population for the analysis.
## # A tibble: 6 × 4
## country year lifeExp pop
## <fct> <int> <dbl> <int>
## 1 Sweden 1952 71.9 7124673
## 2 Sweden 1957 72.5 7363802
## 3 Sweden 1962 73.4 7561588
## 4 Sweden 1967 74.2 7867931
## 5 Sweden 1972 74.7 8122293
## 6 Sweden 1977 75.4 8251648
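The subset above could be built as follows. The data frame name `Eudata` is taken from the test output later in this report; the use of `dplyr` is an assumption:

```r
library(gapminder)
library(dplyr)

# Keep Sweden's rows and the four columns shown in the preview
Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

print(head(Eudata))
```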
3.1 Visualize data using a scatter plot and include the description
of assumptions for correlation analysis:
- Is the co-variation linear?
- Do the data from each of the 2 variables (x, y) follow a normal
distribution (visual inspection of the data normality using
histograms)?
## country year lifeExp pop
## Sweden :12 Min. :1952 Min. :71.86 Min. :7124673
## Afghanistan: 0 1st Qu.:1966 1st Qu.:73.96 1st Qu.:7791345
## Albania : 0 Median :1980 Median :75.93 Median :8288454
## Algeria : 0 Mean :1980 Mean :76.18 Mean :8220029
## Angola : 0 3rd Qu.:1993 3rd Qu.:78.47 3rd Qu.:8763555
## Argentina : 0 Max. :2007 Max. :80.88 Max. :9031088
## (Other) : 0
The scatter plot shows a strong positive correlation between life expectancy and population.
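The plots themselves are not reproduced in this output. A minimal sketch of the scatter plot and the two histograms, assuming base R graphics (the original plotting calls are not shown, and the titles and axis labels here are illustrative):

```r
library(gapminder)
library(dplyr)

Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# Scatter plot to judge whether the co-variation looks linear
plot(Eudata$pop, Eudata$lifeExp,
     xlab = "Population", ylab = "Life expectancy (years)",
     main = "Sweden, 1952 to 2007")

# Histograms for a visual check of normality of each variable
hist(Eudata$lifeExp, main = "Life expectancy", xlab = "Years")
hist(Eudata$pop, main = "Population", xlab = "People")
```

With only 12 observations, histograms give a rough impression at best, so the visual check should be read with caution.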
3.2 Calculate correlation coefficient and provide your interpretation.
## [1] 0.9799371
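The coefficient of 0.98 falls in the 0.7 to 1 range, which is considered a very strong positive correlation.
In the histograms, neither lifeExp nor pop appears to follow a normal distribution.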
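The coefficient reported above is what `cor()` returns for the two Sweden columns. A sketch, assuming `Eudata` is built as a Sweden-only subset of `gapminder`:

```r
library(gapminder)
library(dplyr)

Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# cor() computes the Pearson coefficient by default
r <- cor(Eudata$lifeExp, Eudata$pop)
print(r)
```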
4.1 State the null hypothesis and the alternative hypothesis.
H0: There is no relationship between lifeExp and Population in Sweden
H1: There is a relationship between lifeExp and Population in Sweden
4.2 Report on collected data and sample size.
The data are the gapminder records for Sweden, and the sample size is 12 (one observation every five years from 1952 to 2007).
4.3 Perform Pearson correlation test between two variables.
##
## Pearson's product-moment correlation
##
## data: Eudata$lifeExp and Eudata$pop
## t = 15.548, df = 10, p-value = 2.475e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9278426 0.9945284
## sample estimates:
## cor
## 0.9799371
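The output above has the form produced by `cor.test()`. The sketch below runs the test and then cross-checks the t statistic and p-value by hand with t = r·sqrt(n − 2) / sqrt(1 − r²) on n − 2 degrees of freedom. The names `test_result`, `t_manual`, and `p_manual` are illustrative:

```r
library(gapminder)
library(dplyr)

Eudata <- gapminder %>%
  filter(country == "Sweden") %>%
  select(country, year, lifeExp, pop)

# Two-sided Pearson correlation test
test_result <- cor.test(Eudata$lifeExp, Eudata$pop, method = "pearson")
print(test_result)

# Manual cross-check of the test statistic and p-value
r <- unname(test_result$estimate)
n <- nrow(Eudata)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)
print(c(t = t_manual, p = p_manual))
```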
4.4 Decide whether to reject or fail to reject your null hypothesis, report selected significance level.
2.475*10^-8<0.05
## [1] TRUE
The p-value (2.475 × 10^-8) is far below the selected significance level of 0.05, hence the null hypothesis can be rejected.
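The decision rule can be written out explicitly. In this sketch the p-value is copied from the test output above, and the significance level of 0.05 is an assumption (the conventional choice):

```r
# p-value as reported by the Pearson correlation test
p_value <- 2.475e-08

# Assumed significance level (conventional 5%)
alpha <- 0.05

# Reject H0 when the p-value is below the significance level
reject_null <- p_value < alpha
print(reject_null)
```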
4.5 Interpret and report the results.
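We can reject the null hypothesis that there is no relationship between life expectancy and population. Life expectancy and population in Sweden are very strongly positively correlated (r = 0.98, 95% confidence interval 0.93 to 0.99, p = 2.475 × 10^-8): between 1952 and 2007, both rose together. The correlation alone does not show that longer lives cause the higher population, since both variables increase over time.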