Introduction

Importing the data

(a) head() - view the first 6 rows

(b) summary() - summarize the gapminder dataset

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

(c) tail() - view the last 6 rows

## # A tibble: 6 × 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.

(d) nrow() and ncol() - count the number of rows and columns

## [1] 1704
## [1] 6

(e) str() - display the structure of the dataset (column types and a preview of values)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
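
The inspection steps (a) to (e) could be reproduced with a sketch like the one below. The original import code is not shown, so loading the data from the gapminder R package is an assumption.

```r
# Assumption: the data set is the `gapminder` tibble shipped with the
# gapminder package (the original import step is not shown).
library(gapminder)

head(gapminder)      # (a) first 6 rows
summary(gapminder)   # (b) per-column summary statistics
tail(gapminder)      # (c) last 6 rows
nrow(gapminder)      # (d) number of rows
ncol(gapminder)      # (d) number of columns
str(gapminder)       # (e) structure: column types and a preview of values
```

Note that str() reports structure only; standard deviations are computed with sd().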

Create a histogram of values for year

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
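
The message above is ggplot2 suggesting an explicit bin width. A minimal sketch of one way to follow that suggestion, assuming the gapminder and ggplot2 packages are installed; the bin width of 5 and the `boundary` value are my own choices, based on the years being recorded every 5 years.

```r
library(gapminder)
library(ggplot2)

# One bin per survey year: width 5, with bin edges placed halfway between years.
# `p_year` is a hypothetical name used only for this illustration.
p_year <- ggplot(gapminder, aes(x = year)) +
  geom_histogram(binwidth = 5, boundary = 1949.5)
print(p_year)
```

Setting `binwidth` explicitly should also stop the `stat_bin()` message from appearing.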

2. Prepare a data set that includes only the variables of interest, in a format suitable for analysis (use the dplyr package, and tidyr when necessary).

To select my variables for analysis (year and population)

To filter the data for China

## # A tibble: 12 × 3
##    country  year        pop
##    <fct>   <int>      <int>
##  1 China    1952  556263527
##  2 China    1957  637408000
##  3 China    1962  665770000
##  4 China    1967  754550000
##  5 China    1972  862030000
##  6 China    1977  943455000
##  7 China    1982 1000281000
##  8 China    1987 1084035000
##  9 China    1992 1164970000
## 10 China    1997 1230075000
## 11 China    2002 1280400000
## 12 China    2007 1318683096
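
A table like the one above can be produced with dplyr's filter() and select(). This is a sketch under the assumption that the data come from the gapminder package; the object names `china` and `china_year_pop` are hypothetical.

```r
library(gapminder)
library(dplyr)

china <- gapminder %>%
  filter(country == "China") %>%   # keep only the rows for China
  select(country, year, pop)       # keep only the columns of interest
china

# Dropping the country column leaves just the two analysis variables.
china_year_pop <- china %>% select(year, pop)
china_year_pop
```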

To keep only year and pop for China

## # A tibble: 12 × 2
##     year        pop
##    <int>      <int>
##  1  1952  556263527
##  2  1957  637408000
##  3  1962  665770000
##  4  1967  754550000
##  5  1972  862030000
##  6  1977  943455000
##  7  1982 1000281000
##  8  1987 1084035000
##  9  1992 1164970000
## 10  1997 1230075000
## 11  2002 1280400000
## 12  2007 1318683096

To get the summary for my selected variables

##       year           pop           
##  Min.   :1952   Min.   :5.563e+08  
##  1st Qu.:1966   1st Qu.:7.324e+08  
##  Median :1980   Median :9.719e+08  
##  Mean   :1980   Mean   :9.582e+08  
##  3rd Qu.:1993   3rd Qu.:1.181e+09  
##  Max.   :2007   Max.   :1.319e+09

To get the standard deviation for my selected variables (pop first, then year)

## [1] 264394873
## [1] 18.02776
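
The two values above can be checked with base R's sd(). The sketch below rebuilds the China subset by hand from the table printed earlier, so it does not depend on any package; the data frame name `china` is hypothetical.

```r
# China subset, copied from the year/pop table shown earlier.
china <- data.frame(
  year = seq(1952L, 2007L, by = 5L),
  pop  = c(556263527L, 637408000L, 665770000L, 754550000L,
           862030000L, 943455000L, 1000281000L, 1084035000L,
           1164970000L, 1230075000L, 1280400000L, 1318683096L)
)

sd(china$pop)    # sample standard deviation of population
sd(china$year)   # sample standard deviation of year
```

Because the years are evenly spaced 5 apart, sd(year) equals 5 * sqrt(13), which is about 18.03.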

To get the histogram for my selected variables

## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow
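
The warning most likely arises because pop is stored as a 32-bit integer and sums of two break points near 1.3e9 exceed the largest representable integer. A possible workaround, sketched here with the China values from the table above, is to convert the column to double before calling hist(); the variable names are hypothetical.

```r
pop <- c(556263527L, 637408000L, 665770000L, 754550000L,
         862030000L, 943455000L, 1000281000L, 1084035000L,
         1164970000L, 1230075000L, 1280400000L, 1318683096L)

# R integers are 32-bit; this is the largest value they can hold.
.Machine$integer.max   # 2147483647

# Doubles do not overflow at this magnitude, so convert before plotting.
pop_num <- as.numeric(pop)
h <- hist(pop_num, main = "Population of China", xlab = "Population")
```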

To plot my selected variables with ggplot

## `geom_smooth()` using formula = 'y ~ x'
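
The message above is informational: geom_smooth() reports the formula it chose. A sketch of a comparable plot follows, assuming ggplot2 is installed. The linear method (`method = "lm"`) is an assumption, since the original plotting code is not shown, and `p_china` is a hypothetical name.

```r
library(ggplot2)

# China subset, copied from the year/pop table shown earlier (pop as double).
china <- data.frame(
  year = seq(1952, 2007, by = 5),
  pop  = c(556263527, 637408000, 665770000, 754550000,
           862030000, 943455000, 1000281000, 1084035000,
           1164970000, 1230075000, 1280400000, 1318683096)
)

p_china <- ggplot(china, aes(x = year, y = pop)) +
  geom_point() +
  # Passing the formula explicitly avoids the informational message.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p_china)
```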

3. Explore two variables and how they are associated with each other (correlation analysis).

## # A tibble: 1,704 × 13
##    country     lifeExp gdpPercap year_…¹ year_…² year_…³ year_…⁴ year_…⁵ pop_A…⁶
##    <fct>         <dbl>     <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
##  1 Afghanistan    28.8      779.    1952      NA      NA      NA      NA  8.43e6
##  2 Afghanistan    30.3      821.    1957      NA      NA      NA      NA  9.24e6
##  3 Afghanistan    32.0      853.    1962      NA      NA      NA      NA  1.03e7
##  4 Afghanistan    34.0      836.    1967      NA      NA      NA      NA  1.15e7
##  5 Afghanistan    36.1      740.    1972      NA      NA      NA      NA  1.31e7
##  6 Afghanistan    38.4      786.    1977      NA      NA      NA      NA  1.49e7
##  7 Afghanistan    39.9      978.    1982      NA      NA      NA      NA  1.29e7
##  8 Afghanistan    40.8      852.    1987      NA      NA      NA      NA  1.39e7
##  9 Afghanistan    41.7      649.    1992      NA      NA      NA      NA  1.63e7
## 10 Afghanistan    41.8      635.    1997      NA      NA      NA      NA  2.22e7
## # … with 1,694 more rows, 4 more variables: pop_Europe <int>, pop_Africa <int>,
## #   pop_Americas <int>, pop_Oceania <int>, and abbreviated variable names
## #   ¹​year_Asia, ²​year_Europe, ³​year_Africa, ⁴​year_Americas, ⁵​year_Oceania,
## #   ⁶​pop_Asia
## # A tibble: 6 × 3
##   country  year        pop
##   <fct>   <int>      <int>
## 1 China    1982 1000281000
## 2 China    1987 1084035000
## 3 China    1992 1164970000
## 4 China    1997 1230075000
## 5 China    2002 1280400000
## 6 China    2007 1318683096

Population less than 1000000000 in China

## # A tibble: 6 × 3
##   country  year       pop
##   <fct>   <int>     <int>
## 1 China    1952 556263527
## 2 China    1957 637408000
## 3 China    1962 665770000
## 4 China    1967 754550000
## 5 China    1972 862030000
## 6 China    1977 943455000
##   country year        pop country.1 year.1     pop.1
## 1   China 1982 1000281000     China   1952 556263527
## 2   China 1987 1084035000     China   1957 637408000
## 3   China 1992 1164970000     China   1962 665770000
## 4   China 1997 1230075000     China   1967 754550000
## 5   China 2002 1280400000     China   1972 862030000
## 6   China 2007 1318683096     China   1977 943455000
##   year        pop
## 1 1982 1000281000
## 2 1987 1084035000
## 3 1992 1164970000
## 4 1997 1230075000
## 5 2002 1280400000
## 6 2007 1318683096
##   year        pop
## 1 1982 1000281000
## 2 1987 1084035000
## 3 1992 1164970000
## 4 1997 1230075000
## 5 2002 1280400000
## 6 2007 1318683096
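
The split of the China data at a population of 1000000000 can be sketched in base R as follows. The data are copied from the tables above; the object names and the use of data.frame() to place the two halves side by side are assumptions, since the original code is not shown.

```r
china <- data.frame(
  country = "China",
  year = seq(1952L, 2007L, by = 5L),
  pop  = c(556263527L, 637408000L, 665770000L, 754550000L,
           862030000L, 943455000L, 1000281000L, 1084035000L,
           1164970000L, 1230075000L, 1280400000L, 1318683096L)
)

threshold <- 1000000000

china_high <- china[china$pop >= threshold, ]   # 1982 onwards
china_low  <- china[china$pop <  threshold, ]   # 1952 to 1977

# Combining the halves column-wise; data.frame() makes the repeated
# column names unique (country.1, year.1, pop.1).
side_by_side <- data.frame(china_high, china_low)
side_by_side

# Year and population only, for the high-population half.
china_high[, c("year", "pop")]
```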

Correlation Analysis

## # A tibble: 6 × 3
##   country  year        pop
##   <fct>   <int>      <int>
## 1 China    1982 1000281000
## 2 China    1987 1084035000
## 3 China    1992 1164970000
## 4 China    1997 1230075000
## 5 China    2002 1280400000
## 6 China    2007 1318683096
## [1] -0.73762102  1.27326626 -0.08880451 -0.02558668  0.67754119 -0.95150812

Next, we have to create a second variable:

## [1] -1.00431801  2.03719992 -0.08361005 -0.32942389 -0.82327318 -2.65937973

Let’s create a data frame with these two variables and then use these data to calculate Pearson’s correlation:

##             x           y
## 1 -0.73762102 -1.00431801
## 2  1.27326626  2.03719992
## 3 -0.08880451 -0.08361005
## 4 -0.02558668 -0.32942389
## 5  0.67754119 -0.82327318
## 6 -0.95150812 -2.65937973
## [1] 0.69

Conclusion

The output is 0.69; this is our Pearson correlation coefficient. Since the number is positive, it is a positive correlation, i.e. our two variables move in the same direction: when one variable increases, the other tends to increase as well. A value of 0.69 indicates a moderately strong correlation.
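
To make the calculation concrete, here is a minimal sketch of computing Pearson's correlation for two simulated variables. The seed, the sample size of 50, and the way y is built from x are all assumptions for illustration, so the resulting number will not match the output above.

```r
set.seed(1)              # arbitrary seed, for reproducibility only
x <- rnorm(50)           # first variable
y <- x + rnorm(50)       # second variable: depends on x plus noise

data <- data.frame(x, y)

# Built-in Pearson correlation.
r_builtin <- cor(data$x, data$y, method = "pearson")

# The same quantity by definition: covariance over the product of the
# two standard deviations.
r_manual <- cov(data$x, data$y) / (sd(data$x) * sd(data$y))

round(r_builtin, 2)
all.equal(r_builtin, r_manual)   # TRUE
```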

We can now visualise our variables further. Scatterplots are a quick way to check for correlation between pairs of continuous variables.

Basic scatter plot

## `geom_smooth()` using formula = 'y ~ x'

We can use the cor function to calculate correlations for an entire data frame with several variables. First, print the head of the example data:

## # A tibble: 6 × 3
##   country      year      pop
##   <fct>       <int>    <int>
## 1 Afghanistan  1952  8425333
## 2 Afghanistan  1957  9240934
## 3 Afghanistan  1962 10267083
## 4 Afghanistan  1967 11537966
## 5 Afghanistan  1972 13079460
## 6 Afghanistan  1977 14880372
## [1] 0.08

Next, we can use the cor function to create a correlation matrix for all our variables (Pearson correlation, with the output rounded):
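
A sketch of such a correlation matrix, assuming the data come from the gapminder package. Only the numeric columns are passed to cor(), since the factor columns (country, continent) cannot be correlated directly; `num_vars` and `cor_matrix` are hypothetical names.

```r
library(gapminder)

# Keep the numeric columns only.
num_vars <- gapminder[, c("year", "lifeExp", "pop", "gdpPercap")]

# Pearson correlation for every pair of columns, rounded to two decimals.
cor_matrix <- round(cor(num_vars, method = "pearson"), 2)
cor_matrix
```

The diagonal is always 1, because every variable is perfectly correlated with itself.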

Basic scatter plot

## `geom_smooth()` using formula = 'y ~ x'

Let’s calculate the Pearson correlation coefficient, rounding the number to two decimals:

## [1] 1