The dataset I will analyze is the Gapminder dataset. It provides data on population, life expectancy and GDP per capita for countries around the world from 1952 to 2007. There are six variables in the dataset: country, continent, year, lifeExp, pop and gdpPercap.
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
1. What are the observed population, the observation unit and the reference period?
The observed population is the countries of the world; the dataset covers 142 of them and records life expectancy, GDP per capita and population for each. The observation unit is a country in a given year (each row is one country-year), and the reference period is 1952 to 2007.
2. What are the data types of the variables? Do we need to change them?
The dataset includes 6 variables. Their data types are shown below:
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
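country and continent are factors, year and pop are integers, and lifeExp and gdpPercap are numeric. These types suit the variables, so we don’t need to change them.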
3. Are there any outliers? Are there any values that look like errors?
Here I will check whether there are outliers in the GDP per capita data. The lower and upper cutoffs for outliers of gdpPercap are -10983 and 21511. The maximum of gdpPercap is 113523.13 and the minimum is 241.17. The maximum is far above the upper cutoff, so there are outliers on the high side of the GDP per capita data; no value falls below the lower cutoff.
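The cutoffs quoted here are consistent with the common 1.5 × IQR rule (Tukey fences). The code that produced them is not shown, so the following is only a minimal sketch under that assumption; it also assumes the `gapminder` package is installed, and the variable names (`q1`, `q3`, `lower`, `upper`, `n_low`, `n_high`) are illustrative.

```r
library(gapminder)  # assumption: the gapminder package is installed

x <- gapminder$gdpPercap

# Quartiles and interquartile range (R's default quantile definition).
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey fences: values outside [lower, upper] are flagged as outliers.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
round(c(lower = lower, upper = upper))

# Count the flagged observations on each side of the fences.
n_low  <- sum(x < lower)
n_high <- sum(x > upper)
c(below = n_low, above = n_high)
```

Because GDP per capita cannot be negative, a negative lower fence cannot flag any value, so only the upper fence matters for this variable.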
4. What is the mean for each variable?
The mean of lifeExp is 59.47, the mean of pop is 29601212, and the mean of gdpPercap is 7215.33. country and continent are factors, so they have no mean.
##         country        continent        year         lifeExp
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
##  Australia  :  12                  Max.   :2007   Max.   :82.60
##  (Other)    :1632
##       pop              gdpPercap
##  Min.   :6.001e+04   Min.   :   241.2
##  1st Qu.:2.794e+06   1st Qu.:  1202.1
##  Median :7.024e+06   Median :  3531.8
##  Mean   :2.960e+07   Mean   :  7215.3
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5
##  Max.   :1.319e+09   Max.   :113523.1
##
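The summary table rounds the means for display. As a rough sketch (assuming the `gapminder` package is installed; `num_cols` and `means` are illustrative names), the unrounded means of the numeric columns could be computed like this:

```r
library(gapminder)  # assumption: the gapminder package is installed

# Identify the numeric columns; the factors country and continent are skipped.
num_cols <- vapply(gapminder, is.numeric, logical(1))

# Mean of each numeric column, returned as a named numeric vector.
means <- vapply(gapminder[num_cols], mean, numeric(1))
round(means, 2)
```

The mean of year is of limited interest on its own, since every country is observed in the same set of years.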
5. Are there any Null / NA values?
There are no Null / NA values: the count of missing values over the whole dataset is 0.
sum(is.na(gapminder))
## [1] 0
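The single total above does not say which column a missing value would come from. A small sketch of a per-column check, assuming the `gapminder` package is installed (`na_by_column` is an illustrative name):

```r
library(gapminder)  # assumption: the gapminder package is installed

# Number of NA values in each column.
na_by_column <- colSums(is.na(gapminder))
na_by_column

# Single TRUE/FALSE answer for the whole dataset.
anyNA(gapminder)
```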
Next, I want to look more closely at Asia. I will select the rows for Asian countries from 1952 to 2007 and drop the continent column, which is now constant.
## # A tibble: 6 × 5
##   country      year lifeExp      pop gdpPercap
##   <fct>       <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan  1952    28.8  8425333      779.
## 2 Afghanistan  1957    30.3  9240934      821.
## 3 Afghanistan  1962    32.0 10267083      853.
## 4 Afghanistan  1967    34.0 11537966      836.
## 5 Afghanistan  1972    36.1 13079460      740.
## 6 Afghanistan  1977    38.4 14880372      786.
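The code behind this selection is not shown. One possible way to build it in base R is sketched below, assuming the `gapminder` package is installed. The name `df_cn` is taken from the correlation test output later in the report; the use of `subset()` and `droplevels()` is an assumption, not necessarily what was originally run.

```r
library(gapminder)  # assumption: the gapminder package is installed

# Keep only the Asian rows and drop the continent column.
df_cn <- subset(gapminder, continent == "Asia", select = -continent)

# Drop unused factor levels so that country lists Asian countries only.
df_cn$country <- droplevels(df_cn$country)

head(df_cn)
nrow(df_cn)             # number of country-year rows
nlevels(df_cn$country)  # number of distinct Asian countries
```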
Here I want to explore the relationship between lifeExp (life expectancy) and gdpPercap (GDP per capita). First, I will visualize the two variables to see their relationship.
[Figure: scatter plot of lifeExp and gdpPercap with a fitted `geom_smooth()` line]
From the scatter plot, I think the two variables have a moderate positive correlation.
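The plotting code is not included in the report. A scatter plot of this kind could be drawn with ggplot2 roughly as follows. This is a sketch only: it assumes the `gapminder` and `ggplot2` packages are installed, that the plot uses the Asian subset, that gdpPercap is on the x-axis, and that the trend line is a linear fit (`method = "lm"`). None of these choices is confirmed by the source.

```r
library(gapminder)  # assumption: the gapminder package is installed
library(ggplot2)    # assumption: the ggplot2 package is installed

# Asian subset (assumed to be the data behind the plot).
df_cn <- subset(gapminder, continent == "Asia")

# Scatter plot with a fitted straight line and its confidence band.
p <- ggplot(df_cn, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "GDP per capita", y = "Life expectancy (years)")

print(p)
```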
[Figure: histograms of lifeExp and gdpPercap, each drawn with the default 30 bins]
From the histograms, neither variable follows a normal distribution. Next, I will calculate the correlation coefficient of the two variables to explore the relationship between them.
## [1] 0.3820476
The correlation coefficient of the two variables is 0.382, which means the two variables have a moderate positive correlation. In other words, higher values of one variable tend to go together with higher values of the other. The coefficient describes association only; it does not show that a change in one variable causes a change in the other.
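To make the coefficient concrete, the sketch below computes Pearson's r twice: once with `cor()` and once directly from its definition (the sum of cross-products of the deviations, divided by the square root of the product of the sums of squared deviations). It assumes the `gapminder` package is installed and that the coefficient is computed on the Asian subset; `x`, `y`, `r_builtin` and `r_manual` are illustrative names.

```r
library(gapminder)  # assumption: the gapminder package is installed

df_cn <- subset(gapminder, continent == "Asia")
x <- df_cn$gdpPercap
y <- df_cn$lifeExp

# Pearson correlation using the built-in function.
r_builtin <- cor(x, y)

# The same quantity from the definition.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

c(builtin = r_builtin, manual = r_manual)
```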
H0: There is no relationship between lifeExp and gdpPercap.
H1: There is a relationship between lifeExp and gdpPercap.
The data I use are the observations for Asia. The sample size is 396 (33 countries × 12 survey years).
##
## Pearson's product-moment correlation
##
## data: df_cn$lifeExp and df_cn$gdpPercap
## t = 8.2059, df = 394, p-value = 3.287e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2945926 0.4631563
## sample estimates:
## cor
## 0.3820476
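The p-value of the correlation test between these two variables is 3.287e-15. The p-value is below the cutoff of 0.05, so I reject the null hypothesis that there is no relationship between life expectancy and GDP per capita.
Therefore, we conclude that life expectancy and GDP per capita are positively correlated in Asia over 1952 to 2007 (r = 0.382, 95% confidence interval 0.295 to 0.463): observations with higher GDP per capita tend to have higher life expectancy. The test establishes an association, not the direction of any causal effect.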
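The test output above can be reproduced and unpacked roughly as follows. This is a sketch assuming the `gapminder` package is installed and that `df_cn` is the Asian subset; `res`, `r`, `n`, `t_manual` and `p_manual` are illustrative names. It also recomputes the t statistic from r and n using t = r·sqrt(n − 2) / sqrt(1 − r²), which is the standard statistic for a Pearson correlation test.

```r
library(gapminder)  # assumption: the gapminder package is installed

df_cn <- subset(gapminder, continent == "Asia")

# Pearson correlation test (the default method of cor.test).
res <- cor.test(df_cn$lifeExp, df_cn$gdpPercap)

# Rebuild the t statistic and two-sided p-value from r and n.
r <- unname(res$estimate)
n <- nrow(df_cn)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(abs(t_manual), df = n - 2, lower.tail = FALSE)

c(t = t_manual, df = n - 2, p = p_manual)
res$conf.int        # 95 percent confidence interval for the correlation
res$p.value < 0.05  # decision at the 0.05 level
```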