Load the gapminder data set into an R Markdown session.

  1. Perform a quick EDA and pick two variables you want to explore in more depth (for example, life expectancy and GDP per capita) and a subset of the data set (for instance, only certain continents or countries).

  2. Prepare a data set that includes only the variables of interest, in a format suitable for analysis (use the dplyr package, and tidyr when necessary).

#install.packages("gapminder")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(gapminder)
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
View(gapminder)

Summarize the gapminder data set

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

Find the dimensions of the gapminder data set

dim(gapminder)
## [1] 1704    6

Create a histogram of the values of year
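
No code is shown for this step, so the following is only a minimal sketch of one way to draw the histogram. The bin width, colours, labels, and the object name `year_hist` are assumptions, not taken from the original report.

```r
library(gapminder)
library(ggplot2)

# The years in gapminder are 5 years apart (1952, 1957, ..., 2007),
# so a bin width of 5 puts each survey year in its own bar.
year_hist <- ggplot(gapminder, aes(x = year)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(x = "Year",
       y = "Number of observations",
       title = "Observations per survey year")

year_hist
```

Because every country contributes one row per survey year, the bars should all have the same height; a flat histogram here is a sign of a balanced panel rather than of an error.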

Select the two variables of interest (together with country) from the data set

gapminder %>%
  select(country, lifeExp, gdpPercap)
## # A tibble: 1,704 × 3
##    country     lifeExp gdpPercap
##    <fct>         <dbl>     <dbl>
##  1 Afghanistan    28.8      779.
##  2 Afghanistan    30.3      821.
##  3 Afghanistan    32.0      853.
##  4 Afghanistan    34.0      836.
##  5 Afghanistan    36.1      740.
##  6 Afghanistan    38.4      786.
##  7 Afghanistan    39.9      978.
##  8 Afghanistan    40.8      852.
##  9 Afghanistan    41.7      649.
## 10 Afghanistan    41.8      635.
## # … with 1,694 more rows

Choose Albania as the subset of the data set

Albania <- filter(gapminder, country == "Albania")
Albania
## # A tibble: 12 × 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Albania Europe     1952    55.2 1282697     1601.
##  2 Albania Europe     1957    59.3 1476505     1942.
##  3 Albania Europe     1962    64.8 1728137     2313.
##  4 Albania Europe     1967    66.2 1984060     2760.
##  5 Albania Europe     1972    67.7 2263554     3313.
##  6 Albania Europe     1977    68.9 2509048     3533.
##  7 Albania Europe     1982    70.4 2780097     3631.
##  8 Albania Europe     1987    72   3075321     3739.
##  9 Albania Europe     1992    71.6 3326498     2497.
## 10 Albania Europe     1997    73.0 3428038     3193.
## 11 Albania Europe     2002    75.7 3508512     4604.
## 12 Albania Europe     2007    76.4 3600523     5937.
select(Albania, lifeExp, gdpPercap)
## # A tibble: 12 × 2
##    lifeExp gdpPercap
##      <dbl>     <dbl>
##  1    55.2     1601.
##  2    59.3     1942.
##  3    64.8     2313.
##  4    66.2     2760.
##  5    67.7     3313.
##  6    68.9     3533.
##  7    70.4     3631.
##  8    72       3739.
##  9    71.6     2497.
## 10    73.0     3193.
## 11    75.7     4604.
## 12    76.4     5937.
mychosendata <- select(Albania, lifeExp, gdpPercap)
mychosendata
## # A tibble: 12 × 2
##    lifeExp gdpPercap
##      <dbl>     <dbl>
##  1    55.2     1601.
##  2    59.3     1942.
##  3    64.8     2313.
##  4    66.2     2760.
##  5    67.7     3313.
##  6    68.9     3533.
##  7    70.4     3631.
##  8    72       3739.
##  9    71.6     2497.
## 10    73.0     3193.
## 11    75.7     4604.
## 12    76.4     5937.

Perform EDA on the selected variables

##     lifeExp        gdpPercap   
##  Min.   :55.23   Min.   :1601  
##  1st Qu.:65.87   1st Qu.:2451  
##  Median :69.67   Median :3253  
##  Mean   :68.43   Mean   :3255  
##  3rd Qu.:72.24   3rd Qu.:3658  
##  Max.   :76.42   Max.   :5937
## [1] 6.322911
## [1] 1192.352
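
The chunk that produced this output is hidden. A sketch that would give a six-number summary and two standard deviations of this kind is shown below; that the two unlabeled numbers are the standard deviations of `lifeExp` and `gdpPercap` is an assumption, and the names `sd_lifeExp` and `sd_gdpPercap` are hypothetical.

```r
library(gapminder)
library(dplyr)

# Rebuild the two-column Albania subset used in this report.
mychosendata <- gapminder %>%
  filter(country == "Albania") %>%
  select(lifeExp, gdpPercap)

# Five-number summary plus the mean for each variable.
summary(mychosendata)

# Spread of each variable (assumed source of the two single numbers above).
sd_lifeExp   <- sd(mychosendata$lifeExp)
sd_gdpPercap <- sd(mychosendata$gdpPercap)
sd_lifeExp
sd_gdpPercap
```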

Create a scatterplot of lifeExp vs. gdpPercap

## `geom_smooth()` using formula = 'y ~ x'
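
The plotting code is hidden; only the `geom_smooth()` message survives. The sketch below is one plausible reconstruction. Using the Albania subset, a straight-line (`method = "lm"`) smoother, and the object names `albania` and `scatter` are all assumptions.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

albania <- filter(gapminder, country == "Albania")

# lifeExp on the y-axis against gdpPercap on the x-axis, with a fitted line.
# Leaving `formula` unset is what triggers the "using formula = 'y ~ x'" message.
scatter <- ggplot(albania, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "GDP per capita",
       y = "Life expectancy (years)",
       title = "Life expectancy vs. GDP per capita, Albania")

scatter
```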

  1. Explore two variables and how they are associated with each other (correlation analysis). Include the description of assumptions for correlation analysis, concluding with the type of analysis you choose. Create necessary visualisations to support your analysis (a histogram, a scatter plot, etc) and include your interpretation of the graphs.
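
One way to approach the assumption check asked for here is sketched below. It tests each variable for approximate normality with the Shapiro-Wilk test and falls back to Spearman's rank correlation when normality is doubtful. The 0.05 cut-off, the Albania subset, and the names `normal_lifeExp`, `normal_gdpPercap`, `method`, and `result` are illustrative assumptions; with only 12 observations the test has little power, so the histograms and scatterplot should be read alongside it.

```r
library(gapminder)
library(dplyr)

albania <- filter(gapminder, country == "Albania")

# Shapiro-Wilk: a small p-value suggests the variable is not normally distributed.
normal_lifeExp   <- shapiro.test(albania$lifeExp)$p.value > 0.05
normal_gdpPercap <- shapiro.test(albania$gdpPercap)$p.value > 0.05

# Pearson assumes roughly normal variables and a linear relationship;
# Spearman only assumes a monotonic relationship.
method <- if (normal_lifeExp && normal_gdpPercap) "pearson" else "spearman"

result <- cor.test(albania$lifeExp, albania$gdpPercap, method = method)
result
```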

Association between variables pop and year

## 
## Call:
## lm(formula = pop ~ year, data = gapminder)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -43318856  -27548179  -18558743   -9628265 1275164661 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -972185807  294031308  -3.306 0.000965 ***
## year            506081     148532   3.407 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105800000 on 1702 degrees of freedom
## Multiple R-squared:  0.006775,   Adjusted R-squared:  0.006191 
## F-statistic: 11.61 on 1 and 1702 DF,  p-value: 0.0006716

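Because the p-value (0.00067) is less than 0.05, we conclude that there is a statistically significant linear relationship between year and pop. The relationship is weak, however: the R-squared of 0.0068 means that year explains less than 1% of the variation in pop.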
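
The model call is visible in the summary output (`lm(formula = pop ~ year, data = gapminder)`), so the fit can be reproduced as below. The sketch also checks that, for a regression with a single predictor, R-squared equals the squared Pearson correlation; the names `fit`, `r_squared`, and `r` are hypothetical.

```r
library(gapminder)

# Simple linear regression of population on year, as in the summary above.
fit <- lm(pop ~ year, data = gapminder)
summary(fit)

# With one predictor, R-squared is the square of Pearson's r.
r_squared <- summary(fit)$r.squared
r <- cor(gapminder$year, gapminder$pop)
all.equal(r_squared, r^2)
```

This is why a significant p-value and a very small correlation can coexist: with 1,704 rows even a weak linear trend is detectable.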

## # A tibble: 12 × 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Albania Europe     1952    55.2 1282697     1601.
##  2 Albania Europe     1957    59.3 1476505     1942.
##  3 Albania Europe     1962    64.8 1728137     2313.
##  4 Albania Europe     1967    66.2 1984060     2760.
##  5 Albania Europe     1972    67.7 2263554     3313.
##  6 Albania Europe     1977    68.9 2509048     3533.
##  7 Albania Europe     1982    70.4 2780097     3631.
##  8 Albania Europe     1987    72   3075321     3739.
##  9 Albania Europe     1992    71.6 3326498     2497.
## 10 Albania Europe     1997    73.0 3428038     3193.
## 11 Albania Europe     2002    75.7 3508512     4604.
## 12 Albania Europe     2007    76.4 3600523     5937.
## [1]  0.2396469 -1.2672823 -0.6405017  1.0407503  1.8646130  2.4822764

Create the other variable

## [1] -1.1011159 -3.2285025 -2.5252555 -0.1810685  0.9898952  2.0770148

Create a data frame with the two variables and calculate Pearson’s correlation:

numbers <- data.frame(x,y)
head(numbers)
##            x          y
## 1  0.2396469 -1.1011159
## 2 -1.2672823 -3.2285025
## 3 -0.6405017 -2.5252555
## 4  1.0407503 -0.1810685
## 5  1.8646130  0.9898952
## 6  2.4822764  2.0770148
corr.1 <- round(cor(x, y), 2)
corr.1                         
## [1] 0.78
## # A tibble: 1,704 × 13
##    country   year    pop lifeE…¹ lifeE…² lifeE…³ lifeE…⁴ lifeE…⁵ gdpPe…⁶ gdpPe…⁷
##    <fct>    <int>  <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Afghani…  1952 8.43e6    28.8      NA      NA      NA      NA    779.      NA
##  2 Afghani…  1957 9.24e6    30.3      NA      NA      NA      NA    821.      NA
##  3 Afghani…  1962 1.03e7    32.0      NA      NA      NA      NA    853.      NA
##  4 Afghani…  1967 1.15e7    34.0      NA      NA      NA      NA    836.      NA
##  5 Afghani…  1972 1.31e7    36.1      NA      NA      NA      NA    740.      NA
##  6 Afghani…  1977 1.49e7    38.4      NA      NA      NA      NA    786.      NA
##  7 Afghani…  1982 1.29e7    39.9      NA      NA      NA      NA    978.      NA
##  8 Afghani…  1987 1.39e7    40.8      NA      NA      NA      NA    852.      NA
##  9 Afghani…  1992 1.63e7    41.7      NA      NA      NA      NA    649.      NA
## 10 Afghani…  1997 2.22e7    41.8      NA      NA      NA      NA    635.      NA
## # … with 1,694 more rows, 3 more variables: gdpPercap_Africa <dbl>,
## #   gdpPercap_Americas <dbl>, gdpPercap_Oceania <dbl>, and abbreviated variable
## #   names ¹​lifeExp_Asia, ²​lifeExp_Europe, ³​lifeExp_Africa, ⁴​lifeExp_Americas,
## #   ⁵​lifeExp_Oceania, ⁶​gdpPercap_Asia, ⁷​gdpPercap_Europe
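
The code behind this 13-column table is not shown. Its shape and column names are consistent with a `tidyr::pivot_wider()` call like the one sketched below, which spreads `lifeExp` and `gdpPercap` across one column per continent. This is a reconstruction under that assumption, and `gap_wide` is a hypothetical name.

```r
library(gapminder)
library(tidyr)

# One column per (variable, continent) pair; cells are NA wherever the
# country does not belong to that continent.
gap_wide <- pivot_wider(
  gapminder,
  names_from  = continent,
  values_from = c(lifeExp, gdpPercap)
)

dim(gap_wide)
names(gap_wide)
```

The many `NA` values are expected in this layout, which is a reason to keep the long format for the correlation analysis itself.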

To find the data for Albania

Albania <-  filter(gapminder, country == "Albania")
head(Albania)
## # A tibble: 6 × 6
##   country continent  year lifeExp     pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Albania Europe     1952    55.2 1282697     1601.
## 2 Albania Europe     1957    59.3 1476505     1942.
## 3 Albania Europe     1962    64.8 1728137     2313.
## 4 Albania Europe     1967    66.2 1984060     2760.
## 5 Albania Europe     1972    67.7 2263554     3313.
## 6 Albania Europe     1977    68.9 2509048     3533.

Life expectancy in Albania greater than 50

X <- filter(Albania, (lifeExp > 50))
print(X)
## # A tibble: 12 × 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Albania Europe     1952    55.2 1282697     1601.
##  2 Albania Europe     1957    59.3 1476505     1942.
##  3 Albania Europe     1962    64.8 1728137     2313.
##  4 Albania Europe     1967    66.2 1984060     2760.
##  5 Albania Europe     1972    67.7 2263554     3313.
##  6 Albania Europe     1977    68.9 2509048     3533.
##  7 Albania Europe     1982    70.4 2780097     3631.
##  8 Albania Europe     1987    72   3075321     3739.
##  9 Albania Europe     1992    71.6 3326498     2497.
## 10 Albania Europe     1997    73.0 3428038     3193.
## 11 Albania Europe     2002    75.7 3508512     4604.
## 12 Albania Europe     2007    76.4 3600523     5937.

Create a correlation matrix of lifeExp and gdpPercap (rounded to 2 decimal places)

round(cor(gapminder[c('lifeExp', 'gdpPercap')]), 2)
##           lifeExp gdpPercap
## lifeExp      1.00      0.58
## gdpPercap    0.58      1.00
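
The summary of `gdpPercap` earlier in the report shows a mean (7215.3) well above the median (3531.8) and a maximum of 113523.1, which points to a strongly right-skewed variable. In that situation a rank-based coefficient is a reasonable cross-check on Pearson's r. The sketch below computes both on the full data set; the names `pearson` and `spearman` are hypothetical.

```r
library(gapminder)

# Pearson measures linear association; Spearman measures monotonic association
# and is less sensitive to skew and extreme values.
pearson  <- cor(gapminder$lifeExp, gapminder$gdpPercap, method = "pearson")
spearman <- cor(gapminder$lifeExp, gapminder$gdpPercap, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 2)
```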

Count the total missing values in each column

sapply(gapminder, function(x) sum(is.na(x)))
##   country continent      year   lifeExp       pop gdpPercap 
##         0         0         0         0         0         0
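
Since the task asks for dplyr where possible, the same check can be sketched with `summarise()` and `across()`. This is an equivalent alternative rather than the code used in the report, and `na_counts` is a hypothetical name.

```r
library(gapminder)
library(dplyr)

# One row, one column per variable, each holding that variable's NA count.
na_counts <- gapminder %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```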

Create a report in R Markdown with the following sections:
- Introduction (brief description of the data set and variables)
- Questions for EDA
- Description of data cleaning and transformation
- Description of correlation analysis (steps for visualisation, checking assumption for correlation analysis, performing correlation analysis)
- Data interpretation and conclusions

The gapminder data set contains life expectancy, population, and GDP per capita by country and year: 142 countries observed every five years from 1952 to 2007.

#install.packages("knitr")
#install.packages("psych")
#describe(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
library(dplyr)
unique(gapminder$country)
##   [1] Afghanistan              Albania                  Algeria                 
##   [4] Angola                   Argentina                Australia               
##   [7] Austria                  Bahrain                  Bangladesh              
##  [10] Belgium                  Benin                    Bolivia                 
##  [13] Bosnia and Herzegovina   Botswana                 Brazil                  
##  [16] Bulgaria                 Burkina Faso             Burundi                 
##  [19] Cambodia                 Cameroon                 Canada                  
##  [22] Central African Republic Chad                     Chile                   
##  [25] China                    Colombia                 Comoros                 
##  [28] Congo, Dem. Rep.         Congo, Rep.              Costa Rica              
##  [31] Cote d'Ivoire            Croatia                  Cuba                    
##  [34] Czech Republic           Denmark                  Djibouti                
##  [37] Dominican Republic       Ecuador                  Egypt                   
##  [40] El Salvador              Equatorial Guinea        Eritrea                 
##  [43] Ethiopia                 Finland                  France                  
##  [46] Gabon                    Gambia                   Germany                 
##  [49] Ghana                    Greece                   Guatemala               
##  [52] Guinea                   Guinea-Bissau            Haiti                   
##  [55] Honduras                 Hong Kong, China         Hungary                 
##  [58] Iceland                  India                    Indonesia               
##  [61] Iran                     Iraq                     Ireland                 
##  [64] Israel                   Italy                    Jamaica                 
##  [67] Japan                    Jordan                   Kenya                   
##  [70] Korea, Dem. Rep.         Korea, Rep.              Kuwait                  
##  [73] Lebanon                  Lesotho                  Liberia                 
##  [76] Libya                    Madagascar               Malawi                  
##  [79] Malaysia                 Mali                     Mauritania              
##  [82] Mauritius                Mexico                   Mongolia                
##  [85] Montenegro               Morocco                  Mozambique              
##  [88] Myanmar                  Namibia                  Nepal                   
##  [91] Netherlands              New Zealand              Nicaragua               
##  [94] Niger                    Nigeria                  Norway                  
##  [97] Oman                     Pakistan                 Panama                  
## [100] Paraguay                 Peru                     Philippines             
## [103] Poland                   Portugal                 Puerto Rico             
## [106] Reunion                  Romania                  Rwanda                  
## [109] Sao Tome and Principe    Saudi Arabia             Senegal                 
## [112] Serbia                   Sierra Leone             Singapore               
## [115] Slovak Republic          Slovenia                 Somalia                 
## [118] South Africa             Spain                    Sri Lanka               
## [121] Sudan                    Swaziland                Sweden                  
## [124] Switzerland              Syria                    Taiwan                  
## [127] Tanzania                 Thailand                 Togo                    
## [130] Trinidad and Tobago      Tunisia                  Turkey                  
## [133] Uganda                   United Kingdom           United States           
## [136] Uruguay                  Venezuela                Vietnam                 
## [139] West Bank and Gaza       Yemen, Rep.              Zambia                  
## [142] Zimbabwe                
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe

Correlation of the variables year and pop

cor(gapminder$year, gapminder$pop)
## [1] 0.08230808

Exploratory Data Analysis

GDP_2007 <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  select(continent, country, gdpPercap)
# Histogram: GDP per capita on the x-axis, number of countries on the y-axis
ggplot(GDP_2007, aes(x = gdpPercap)) +
  geom_histogram(fill = "cyan", bins = 40) +
  xlab("GDP per capita") +
  ylab("Number of countries") +
  ggtitle("Distribution of GDP per Capita in 2007 for All Countries")