Load the gapminder data set into an
R Markdown session.
Perform a quick EDA and pick two variables you want to explore in more depth (for example, life expectancy and GDP per capita) and a subset of the data set (for instance, only certain continents or countries).
Prepare a data set that includes only the variables of
interest, in a format suitable for analysis (use the dplyr
and tidyr packages when necessary).
#install.packages("gapminder")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(gapminder)
head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
View(gapminder) # opens the data viewer; only useful in an interactive session
Summarize the gapminder dataset
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
Find the dimensions of the gapminder dataset
dim(gapminder)
## [1] 1704 6
Create a histogram of values for year
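The chunk that draws this histogram is not shown in the output. A minimal sketch, assuming the intent is one bar per survey year (the object name `year_hist`, the `binwidth`, and the colours are choices made here, not taken from the original):

```r
library(gapminder)
library(ggplot2)

# gapminder has one observation per country every five years,
# so a binwidth of 5 puts each survey year in its own bar
year_hist <- ggplot(gapminder, aes(x = year)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(x = "Year", y = "Number of observations")
print(year_hist)

# The same counts as a table
table(gapminder$year)
```

Every country is observed in each of the 12 survey years, so the bars should all have the same height; the plot mainly confirms that the panel is balanced.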
Select the two variables of interest (plus country) from the dataset
gapminder %>%
  select(country, lifeExp, gdpPercap)
## # A tibble: 1,704 × 3
## country lifeExp gdpPercap
## <fct> <dbl> <dbl>
## 1 Afghanistan 28.8 779.
## 2 Afghanistan 30.3 821.
## 3 Afghanistan 32.0 853.
## 4 Afghanistan 34.0 836.
## 5 Afghanistan 36.1 740.
## 6 Afghanistan 38.4 786.
## 7 Afghanistan 39.9 978.
## 8 Afghanistan 40.8 852.
## 9 Afghanistan 41.7 649.
## 10 Afghanistan 41.8 635.
## # … with 1,694 more rows
Choose Albania from the dataset
Albania <- filter(gapminder, country == "Albania")
Albania
## # A tibble: 12 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1952 55.2 1282697 1601.
## 2 Albania Europe 1957 59.3 1476505 1942.
## 3 Albania Europe 1962 64.8 1728137 2313.
## 4 Albania Europe 1967 66.2 1984060 2760.
## 5 Albania Europe 1972 67.7 2263554 3313.
## 6 Albania Europe 1977 68.9 2509048 3533.
## 7 Albania Europe 1982 70.4 2780097 3631.
## 8 Albania Europe 1987 72 3075321 3739.
## 9 Albania Europe 1992 71.6 3326498 2497.
## 10 Albania Europe 1997 73.0 3428038 3193.
## 11 Albania Europe 2002 75.7 3508512 4604.
## 12 Albania Europe 2007 76.4 3600523 5937.
select(Albania, lifeExp, gdpPercap)
## # A tibble: 12 × 2
## lifeExp gdpPercap
## <dbl> <dbl>
## 1 55.2 1601.
## 2 59.3 1942.
## 3 64.8 2313.
## 4 66.2 2760.
## 5 67.7 3313.
## 6 68.9 3533.
## 7 70.4 3631.
## 8 72 3739.
## 9 71.6 2497.
## 10 73.0 3193.
## 11 75.7 4604.
## 12 76.4 5937.
mychosendata <- select(Albania, lifeExp, gdpPercap)
mychosendata
## # A tibble: 12 × 2
## lifeExp gdpPercap
## <dbl> <dbl>
## 1 55.2 1601.
## 2 59.3 1942.
## 3 64.8 2313.
## 4 66.2 2760.
## 5 67.7 3313.
## 6 68.9 3533.
## 7 70.4 3631.
## 8 72 3739.
## 9 71.6 2497.
## 10 73.0 3193.
## 11 75.7 4604.
## 12 76.4 5937.
Perform EDA on the selected variables
## lifeExp gdpPercap
## Min. :55.23 Min. :1601
## 1st Qu.:65.87 1st Qu.:2451
## Median :69.67 Median :3253
## Mean :68.43 Mean :3255
## 3rd Qu.:72.24 3rd Qu.:3658
## Max. :76.42 Max. :5937
## [1] 6.322911
## [1] 1192.352
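The summary and the two standard deviations above appear without their code. A sketch of calls that would give output of this shape, assuming the values come from the Albania subset stored in `mychosendata`:

```r
library(gapminder)
library(dplyr)

# Rebuild the Albania subset used above
mychosendata <- gapminder %>%
  filter(country == "Albania") %>%
  select(lifeExp, gdpPercap)

# Minimum, quartiles, mean and maximum for both variables
summary(mychosendata)

# Standard deviation of each variable
sd(mychosendata$lifeExp)
sd(mychosendata$gdpPercap)
```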
Create a scatterplot of lifeExp vs. gdpPercap
## `geom_smooth()` using formula = 'y ~ x'
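The plotting code is not shown. A minimal sketch, assuming the plot uses the Albania subset and a linear fit. The message above names a formula but no method, which is consistent with `geom_smooth()` being called with an explicit method such as `"lm"`. The object names `albania` and `life_gdp_plot` are illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

albania <- filter(gapminder, country == "Albania")

# Scatterplot of life expectancy against GDP per capita with a fitted line
life_gdp_plot <- ggplot(albania, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "GDP per capita", y = "Life expectancy (years)")
print(life_gdp_plot)
```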
Association between variables pop and year
##
## Call:
## lm(formula = pop ~ year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43318856 -27548179 -18558743 -9628265 1275164661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -972185807 294031308 -3.306 0.000965 ***
## year 506081 148532 3.407 0.000672 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105800000 on 1702 degrees of freedom
## Multiple R-squared: 0.006775, Adjusted R-squared: 0.006191
## F-statistic: 11.61 on 1 and 1702 DF, p-value: 0.0006716
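Our conclusion is that, since the p-value (0.00067) is less than 0.05, there is a
statistically significant linear relationship between year and pop. The relationship is weak, however: R-squared is 0.0068, so year explains less than 1% of the variation in pop across countries.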
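The `Call:` line in the regression output shows which model was fitted. A sketch of a chunk that would produce it (the object names `pop_model` and `model_summary` are assumptions):

```r
library(gapminder)

# Simple linear regression of population on year, across all countries
pop_model <- lm(pop ~ year, data = gapminder)
model_summary <- summary(pop_model)
model_summary

# Share of the variance in pop that is explained by year
model_summary$r.squared
```

Reading `r.squared` next to the p-value helps separate statistical significance from the strength of the relationship.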
## # A tibble: 12 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1952 55.2 1282697 1601.
## 2 Albania Europe 1957 59.3 1476505 1942.
## 3 Albania Europe 1962 64.8 1728137 2313.
## 4 Albania Europe 1967 66.2 1984060 2760.
## 5 Albania Europe 1972 67.7 2263554 3313.
## 6 Albania Europe 1977 68.9 2509048 3533.
## 7 Albania Europe 1982 70.4 2780097 3631.
## 8 Albania Europe 1987 72 3075321 3739.
## 9 Albania Europe 1992 71.6 3326498 2497.
## 10 Albania Europe 1997 73.0 3428038 3193.
## 11 Albania Europe 2002 75.7 3508512 4604.
## 12 Albania Europe 2007 76.4 3600523 5937.
## [1] 0.2396469 -1.2672823 -0.6405017 1.0407503 1.8646130 2.4822764
Create the other variable
## [1] -1.1011159 -3.2285025 -2.5252555 -0.1810685 0.9898952 2.0770148
Create a data frame with the two variables and calculate Pearson's correlation:
numbers <- data.frame(x, y)
head(numbers)
## x y
## 1 0.2396469 -1.1011159
## 2 -1.2672823 -3.2285025
## 3 -0.6405017 -2.5252555
## 4 1.0407503 -0.1810685
## 5 1.8646130 0.9898952
## 6 2.4822764 2.0770148
corr.1 <- round(cor(x, y), 2)
corr.1
## [1] 0.78
## # A tibble: 1,704 × 13
## country year pop lifeE…¹ lifeE…² lifeE…³ lifeE…⁴ lifeE…⁵ gdpPe…⁶ gdpPe…⁷
## <fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… 1952 8.43e6 28.8 NA NA NA NA 779. NA
## 2 Afghani… 1957 9.24e6 30.3 NA NA NA NA 821. NA
## 3 Afghani… 1962 1.03e7 32.0 NA NA NA NA 853. NA
## 4 Afghani… 1967 1.15e7 34.0 NA NA NA NA 836. NA
## 5 Afghani… 1972 1.31e7 36.1 NA NA NA NA 740. NA
## 6 Afghani… 1977 1.49e7 38.4 NA NA NA NA 786. NA
## 7 Afghani… 1982 1.29e7 39.9 NA NA NA NA 978. NA
## 8 Afghani… 1987 1.39e7 40.8 NA NA NA NA 852. NA
## 9 Afghani… 1992 1.63e7 41.7 NA NA NA NA 649. NA
## 10 Afghani… 1997 2.22e7 41.8 NA NA NA NA 635. NA
## # … with 1,694 more rows, 3 more variables: gdpPercap_Africa <dbl>,
## # gdpPercap_Americas <dbl>, gdpPercap_Oceania <dbl>, and abbreviated variable
## # names ¹lifeExp_Asia, ²lifeExp_Europe, ³lifeExp_Africa, ⁴lifeExp_Americas,
## # ⁵lifeExp_Oceania, ⁶gdpPercap_Asia, ⁷gdpPercap_Europe
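The wide table above is printed without its code. Its column names (`lifeExp_Asia`, `gdpPercap_Europe`, and so on) are consistent with a `tidyr::pivot_wider()` call along these lines. The object name `gap_wide` is illustrative, and the original chunk may have differed.

```r
library(gapminder)
library(tidyr)

# Spread lifeExp and gdpPercap into one column per continent;
# cells for continents a country does not belong to become NA
gap_wide <- pivot_wider(
  gapminder,
  names_from = continent,
  values_from = c(lifeExp, gdpPercap)
)
dim(gap_wide)
```

This layout produces mostly `NA` cells, so the long format that gapminder ships in is usually the more convenient one for plotting and correlation analysis.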
To find the data for Albania
Albania <- filter(gapminder, country == "Albania")
head(Albania)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1952 55.2 1282697 1601.
## 2 Albania Europe 1957 59.3 1476505 1942.
## 3 Albania Europe 1962 64.8 1728137 2313.
## 4 Albania Europe 1967 66.2 1984060 2760.
## 5 Albania Europe 1972 67.7 2263554 3313.
## 6 Albania Europe 1977 68.9 2509048 3533.
Life expectancy in Albania greater than 50
X <- filter(Albania, (lifeExp > 50))
print(X)
## # A tibble: 12 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 1952 55.2 1282697 1601.
## 2 Albania Europe 1957 59.3 1476505 1942.
## 3 Albania Europe 1962 64.8 1728137 2313.
## 4 Albania Europe 1967 66.2 1984060 2760.
## 5 Albania Europe 1972 67.7 2263554 3313.
## 6 Albania Europe 1977 68.9 2509048 3533.
## 7 Albania Europe 1982 70.4 2780097 3631.
## 8 Albania Europe 1987 72 3075321 3739.
## 9 Albania Europe 1992 71.6 3326498 2497.
## 10 Albania Europe 1997 73.0 3428038 3193.
## 11 Albania Europe 2002 75.7 3508512 4604.
## 12 Albania Europe 2007 76.4 3600523 5937.
Create a correlation matrix of lifeExp and gdpPercap (rounded to 2 decimal places)
round(cor(gapminder[c('lifeExp', 'gdpPercap')]), 2)
## lifeExp gdpPercap
## lifeExp 1.00 0.58
## gdpPercap 0.58 1.00
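The assignment asks for the assumptions of correlation analysis to be checked. One possible sketch: Pearson's r assumes a roughly linear relationship, so a rank-based Spearman coefficient and a log transform of `gdpPercap` are computed alongside it for comparison. The choice of these particular checks, and the object names, are assumptions rather than part of the original analysis.

```r
library(gapminder)

# Pearson correlation with a significance test (assumes linearity)
pearson <- cor.test(gapminder$lifeExp, gapminder$gdpPercap, method = "pearson")
pearson$estimate

# Spearman rank correlation: only assumes a monotonic relationship
spearman <- cor(gapminder$lifeExp, gapminder$gdpPercap, method = "spearman")
spearman

# Pearson correlation after log-transforming the right-skewed GDP variable
pearson_log <- cor(gapminder$lifeExp, log10(gapminder$gdpPercap))
pearson_log
```

If the Spearman and log-scale coefficients come out noticeably larger than the raw Pearson value, the relationship is probably monotonic but not linear, and the raw Pearson coefficient understates it.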
Count total missing values in each column
sapply(gapminder, function(x) sum(is.na(x)))
## country continent year lifeExp pop gdpPercap
## 0 0 0 0 0 0
Create a report in R Markdown with the following
sections:
- Introduction (brief description of the data set and variables)
- Questions for EDA
- Description of data cleaning and transformation
- Description of correlation analysis (steps for visualisation, checking
assumption for correlation analysis, performing correlation
analysis)
- Data interpretation and conclusions
The gapminder data set contains data on life expectancy, population, and GDP per capita by country and year.
#install.packages("knitr")
#install.packages("psych")
#describe(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
library(dplyr)
unique(gapminder$country)
## [1] Afghanistan Albania Algeria
## [4] Angola Argentina Australia
## [7] Austria Bahrain Bangladesh
## [10] Belgium Benin Bolivia
## [13] Bosnia and Herzegovina Botswana Brazil
## [16] Bulgaria Burkina Faso Burundi
## [19] Cambodia Cameroon Canada
## [22] Central African Republic Chad Chile
## [25] China Colombia Comoros
## [28] Congo, Dem. Rep. Congo, Rep. Costa Rica
## [31] Cote d'Ivoire Croatia Cuba
## [34] Czech Republic Denmark Djibouti
## [37] Dominican Republic Ecuador Egypt
## [40] El Salvador Equatorial Guinea Eritrea
## [43] Ethiopia Finland France
## [46] Gabon Gambia Germany
## [49] Ghana Greece Guatemala
## [52] Guinea Guinea-Bissau Haiti
## [55] Honduras Hong Kong, China Hungary
## [58] Iceland India Indonesia
## [61] Iran Iraq Ireland
## [64] Israel Italy Jamaica
## [67] Japan Jordan Kenya
## [70] Korea, Dem. Rep. Korea, Rep. Kuwait
## [73] Lebanon Lesotho Liberia
## [76] Libya Madagascar Malawi
## [79] Malaysia Mali Mauritania
## [82] Mauritius Mexico Mongolia
## [85] Montenegro Morocco Mozambique
## [88] Myanmar Namibia Nepal
## [91] Netherlands New Zealand Nicaragua
## [94] Niger Nigeria Norway
## [97] Oman Pakistan Panama
## [100] Paraguay Peru Philippines
## [103] Poland Portugal Puerto Rico
## [106] Reunion Romania Rwanda
## [109] Sao Tome and Principe Saudi Arabia Senegal
## [112] Serbia Sierra Leone Singapore
## [115] Slovak Republic Slovenia Somalia
## [118] South Africa Spain Sri Lanka
## [121] Sudan Swaziland Sweden
## [124] Switzerland Syria Taiwan
## [127] Tanzania Thailand Togo
## [130] Trinidad and Tobago Tunisia Turkey
## [133] Uganda United Kingdom United States
## [136] Uruguay Venezuela Vietnam
## [139] West Bank and Gaza Yemen, Rep. Zambia
## [142] Zimbabwe
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
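Correlation of the variables year and pop
# Extract the two columns from the full data set before correlating them
my_year <- gapminder$year
my_pop <- gapminder$pop
cor(my_year, my_pop)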
## [1] 0.08230808
Exploratory Data Analysis
# gapminder_unfiltered ships with the gapminder package and covers more countries
GDP_2007 <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  select(continent, country, gdpPercap)
# In a histogram the variable is on the x-axis and the y-axis shows counts
ggplot(GDP_2007, aes(x = gdpPercap)) +
  geom_histogram(fill = "cyan", bins = 40) +
  xlab("GDP per Capita") +
  ylab("Number of Countries") +
  ggtitle("Distribution of GDP Per Capita for 2007 for all Countries")