Introduction

Importing the data

(a) head() - to view the first 6 rows

(b) summary() - to summarize the gapminder dataset

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

(c) tail() - to see the last 6 rows

## # A tibble: 6 × 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.

(d) nrow() and ncol() - count the number of rows and columns

## [1] 1704

## [1] 6

(e) str() - display the structure of the dataset (column types and a preview of the values)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
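The code chunks behind the outputs above are not shown in this report. A minimal sketch of the inspection calls follows. It assumes a small stand-in data frame, here called `china` (a hypothetical name), built from the China rows printed later in this report, so it runs without the gapminder package; the actual analysis uses the full dataset.

```r
# Hypothetical stand-in for the gapminder data: the 12 China rows shown in this report.
china <- data.frame(
  country = factor(rep("China", 12)),
  year    = seq(1952L, 2007L, by = 5L),
  pop     = c(556263527, 637408000, 665770000, 754550000, 862030000, 943455000,
              1000281000, 1084035000, 1164970000, 1230075000, 1280400000, 1318683096)
)

head(china)             # first 6 rows
tail(china)             # last 6 rows
summary(china)          # per-column summary statistics
n_rows <- nrow(china)   # number of rows
n_cols <- ncol(china)   # number of columns
str(china)              # structure: column types and a preview of the values
```

Note that `str()` reports the structure of an object; the standard deviation is computed with `sd()`.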

#create histogram of values for year

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2. Prepare the data set that includes only variables of your interest in a suitable format for analysis (use `dplyr` package and `tidyr` when necessary).

To select my variables for analysis (year and population)

To filter the data for China

## # A tibble: 12 × 3
##    country  year        pop
##    <fct>   <int>      <int>
##  1 China    1952  556263527
##  2 China    1957  637408000
##  3 China    1962  665770000
##  4 China    1967  754550000
##  5 China    1972  862030000
##  6 China    1977  943455000
##  7 China    1982 1000281000
##  8 China    1987 1084035000
##  9 China    1992 1164970000
## 10 China    1997 1230075000
## 11 China    2002 1280400000
## 12 China    2007 1318683096

#select the year and population columns for China

## # A tibble: 12 × 2
##     year        pop
##    <int>      <int>
##  1  1952  556263527
##  2  1957  637408000
##  3  1962  665770000
##  4  1967  754550000
##  5  1972  862030000
##  6  1977  943455000
##  7  1982 1000281000
##  8  1987 1084035000
##  9  1992 1164970000
## 10  1997 1230075000
## 11  2002 1280400000
## 12  2007 1318683096
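The filtering and selecting steps above can be sketched as follows. This is a base R approximation that needs no extra packages; the data frame `gap` is a hypothetical four-row stand-in assembled from rows printed in this report, not the full gapminder data. The comments note the `dplyr` verbs that the assignment asks for.

```r
# Hypothetical stand-in for gapminder: two China rows and two Zimbabwe rows
# copied from the outputs in this report.
gap <- data.frame(
  country   = c("China", "China", "Zimbabwe", "Zimbabwe"),
  continent = c("Asia", "Asia", "Africa", "Africa"),
  year      = c(2002L, 2007L, 2002L, 2007L),
  pop       = c(1280400000, 1318683096, 11926563, 12311143)
)

# Keep only the China rows and three columns.
# With dplyr, the same step uses filter() followed by select().
china <- subset(gap, country == "China", select = c(country, year, pop))

# Keep only the two variables of interest (year and population).
china_year_pop <- china[, c("year", "pop")]

print(china)
print(china_year_pop)
```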

To get the summary for my selected variables

##       year           pop           
##  Min.   :1952   Min.   :5.563e+08  
##  1st Qu.:1966   1st Qu.:7.324e+08  
##  Median :1980   Median :9.719e+08  
##  Mean   :1980   Mean   :9.582e+08  
##  3rd Qu.:1993   3rd Qu.:1.181e+09  
##  Max.   :2007   Max.   :1.319e+09

To get the standard deviation for my selected variables

## [1] 264394873

## [1] 18.02776
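As a sketch of how the two values above can be obtained, the China `year` and `pop` columns are typed in by hand below (copied from the tibble printed earlier) and passed to `sd()`. The manual formula is included only to show what `sd()` computes: the sample standard deviation, with n - 1 in the denominator.

```r
# China values copied from the output shown earlier in this report.
year <- seq(1952, 2007, by = 5)
pop  <- c(556263527, 637408000, 665770000, 754550000, 862030000, 943455000,
          1000281000, 1084035000, 1164970000, 1230075000, 1280400000, 1318683096)

sd_pop  <- sd(pop)    # standard deviation of population
sd_year <- sd(year)   # standard deviation of year

# The same quantity written out by hand for year.
sd_year_manual <- sqrt(sum((year - mean(year))^2) / (length(year) - 1))

print(sd_pop)
print(sd_year)
```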

To get the histogram for my selected variables

## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow
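The warning above most likely arises because `pop` is stored as an integer column and the China values are close to R's integer limit (`.Machine$integer.max`, which is 2147483647), so adding two large integers overflows to `NA`. A minimal sketch of the problem and one possible workaround, converting to numeric before calling `hist()`, is shown below. The vector names are illustrative, and `plot = FALSE` is used only so that the example runs without opening a graphics device.

```r
# China population stored as integers, as in the gapminder tibble.
pop_int <- c(556263527L, 637408000L, 665770000L, 754550000L, 862030000L, 943455000L,
             1000281000L, 1084035000L, 1164970000L, 1230075000L, 1280400000L, 1318683096L)

# Adding two large integers exceeds the integer range and yields NA (with a warning).
overflowed <- suppressWarnings(pop_int[11] + pop_int[12])

# Doubles have a much larger range, so the same sum is fine after conversion.
pop_num <- as.numeric(pop_int)
safe_sum <- pop_num[11] + pop_num[12]

# Histogram computed on the numeric copy; drop plot = FALSE to draw it.
h <- hist(pop_num, plot = FALSE)
print(h$counts)
```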

To plot my selected variables with ggplot

## `geom_smooth()` using formula = 'y ~ x'

3. Explore two variables and how they are associated with each other (correlation analysis).

## # A tibble: 1,704 × 13
##    country     lifeExp gdpPercap year_…¹ year_…² year_…³ year_…⁴ year_…⁵ pop_A…⁶
##    <fct>         <dbl>     <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
##  1 Afghanistan    28.8      779.    1952      NA      NA      NA      NA  8.43e6
##  2 Afghanistan    30.3      821.    1957      NA      NA      NA      NA  9.24e6
##  3 Afghanistan    32.0      853.    1962      NA      NA      NA      NA  1.03e7
##  4 Afghanistan    34.0      836.    1967      NA      NA      NA      NA  1.15e7
##  5 Afghanistan    36.1      740.    1972      NA      NA      NA      NA  1.31e7
##  6 Afghanistan    38.4      786.    1977      NA      NA      NA      NA  1.49e7
##  7 Afghanistan    39.9      978.    1982      NA      NA      NA      NA  1.29e7
##  8 Afghanistan    40.8      852.    1987      NA      NA      NA      NA  1.39e7
##  9 Afghanistan    41.7      649.    1992      NA      NA      NA      NA  1.63e7
## 10 Afghanistan    41.8      635.    1997      NA      NA      NA      NA  2.22e7
## # … with 1,694 more rows, 4 more variables: pop_Europe <int>, pop_Africa <int>,
## #   pop_Americas <int>, pop_Oceania <int>, and abbreviated variable names
## #   ¹year_Asia, ²year_Europe, ³year_Africa, ⁴year_Americas, ⁵year_Oceania,
## #   ⁶pop_Asia

#population that is at least 1,000,000,000 in China

## # A tibble: 6 × 3
##   country  year        pop
##   <fct>   <int>      <int>
## 1 China    1982 1000281000
## 2 China    1987 1084035000
## 3 China    1992 1164970000
## 4 China    1997 1230075000
## 5 China    2002 1280400000
## 6 China    2007 1318683096

#population that is less than 1,000,000,000 in China

## # A tibble: 6 × 3
##   country  year       pop
##   <fct>   <int>     <int>
## 1 China    1952 556263527
## 2 China    1957 637408000
## 3 China    1962 665770000
## 4 China    1967 754550000
## 5 China    1972 862030000
## 6 China    1977 943455000
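The two China subsets printed above can be reproduced with a simple threshold filter. The sketch below uses base R `subset()` on a hand-typed copy of the China rows; the names `threshold`, `china_high`, and `china_low` are illustrative and do not come from the original code.

```r
# China rows copied from the output shown earlier in this report.
china <- data.frame(
  year = seq(1952L, 2007L, by = 5L),
  pop  = c(556263527, 637408000, 665770000, 754550000, 862030000, 943455000,
           1000281000, 1084035000, 1164970000, 1230075000, 1280400000, 1318683096)
)

threshold <- 1e9  # one billion

# Years in which the population was at least one billion.
china_high <- subset(china, pop >= threshold)

# Years in which the population was below one billion.
china_low <- subset(china, pop < threshold)

print(china_high)
print(china_low)
```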

##   country year        pop country.1 year.1     pop.1
## 1   China 1982 1000281000     China   1952 556263527
## 2   China 1987 1084035000     China   1957 637408000
## 3   China 1992 1164970000     China   1962 665770000
## 4   China 1997 1230075000     China   1967 754550000
## 5   China 2002 1280400000     China   1972 862030000
## 6   China 2007 1318683096     China   1977 943455000

##   year        pop
## 1 1982 1000281000
## 2 1987 1084035000
## 3 1992 1164970000
## 4 1997 1230075000
## 5 2002 1280400000
## 6 2007 1318683096

##   year        pop
## 1 1982 1000281000
## 2 1987 1084035000
## 3 1992 1164970000
## 4 1997 1230075000
## 5 2002 1280400000
## 6 2007 1318683096

Correlation Analysis

## # A tibble: 6 × 3
##   country  year        pop
##   <fct>   <int>      <int>
## 1 China    1982 1000281000
## 2 China    1987 1084035000
## 3 China    1992 1164970000
## 4 China    1997 1230075000
## 5 China    2002 1280400000
## 6 China    2007 1318683096

## [1] -0.73762102  1.27326626 -0.08880451 -0.02558668  0.67754119 -0.95150812

#Next, we have to create a second variable:

## [1] -1.00431801  2.03719992 -0.08361005 -0.32942389 -0.82327318 -2.65937973

#Let’s create a data frame with these two variables and then use these data to calculate Pearson’s correlation:

##             x           y
## 1 -0.73762102 -1.00431801
## 2  1.27326626  2.03719992
## 3 -0.08880451 -0.08361005
## 4 -0.02558668 -0.32942389
## 5  0.67754119 -0.82327318
## 6 -0.95150812 -2.65937973

## [1] 0.69

Conclusion

The output is 0.69; this is our Pearson correlation coefficient. Since the number is positive, it is a positive correlation, i.e. our two variables move in the same direction: when one variable increases, the other tends to increase as well. A value of 0.69 indicates a fairly strong correlation.

We can now further visualise our variables. Scatterplots are a great way to quickly check for correlation between pairs of continuous variables.

Basic scatter plot

## `geom_smooth()` using formula = 'y ~ x'

We can use the cor function to calculate a correlation for an entire data frame with several variables. First, print the head of the example data:

## # A tibble: 6 × 3
##   country      year      pop
##   <fct>       <int>    <int>
## 1 Afghanistan  1952  8425333
## 2 Afghanistan  1957  9240934
## 3 Afghanistan  1962 10267083
## 4 Afghanistan  1967 11537966
## 5 Afghanistan  1972 13079460
## 6 Afghanistan  1977 14880372

## [1] 0.08

Next, we can use the cor function to create a correlation matrix for all our variables, using the Pearson correlation and rounding the output:

Basic scatter plot

## `geom_smooth()` using formula = 'y ~ x'

Let’s calculate the Pearson correlation coefficient, rounding the number to two decimals:

## [1] 1
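A result of 1 here is a consequence of rounding. As a sketch, assuming the value above is the correlation between `year` and `pop` for China (the code is not shown in this report), the coefficient can be computed with `cor()` and checked against the Pearson formula written out by hand. The unrounded value is slightly below 1, because China's population grew almost, but not exactly, linearly over these years.

```r
# China values copied from the output shown earlier in this report.
year <- seq(1952, 2007, by = 5)
pop  <- c(556263527, 637408000, 665770000, 754550000, 862030000, 943455000,
          1000281000, 1084035000, 1164970000, 1230075000, 1280400000, 1318683096)

# Built-in Pearson correlation.
r_builtin <- cor(year, pop, method = "pearson")

# Pearson formula by hand: covariance term divided by the product of the spreads.
dx <- year - mean(year)
dy <- pop - mean(pop)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Rounded to two decimals, as in the report.
r_rounded <- round(r_builtin, 2)

# Correlation matrix for all (numeric) variables of a data frame.
cor_matrix <- round(cor(data.frame(year, pop), method = "pearson"), 2)

print(r_builtin)
print(r_rounded)
print(cor_matrix)
```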

Homework assignment #7

OLORUNMAIYE

2022-29-11
