#(a) head() - to view the first 6 rows
#(b) summary() - to summarize the gapminder dataset
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
#(c) tail() - to see the last 6 rows
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Zimbabwe Africa 1982 60.4 7636524 789.
## 2 Zimbabwe Africa 1987 62.4 9216418 706.
## 3 Zimbabwe Africa 1992 60.4 10704340 693.
## 4 Zimbabwe Africa 1997 46.8 11404948 792.
## 5 Zimbabwe Africa 2002 40.0 11926563 672.
## 6 Zimbabwe Africa 2007 43.5 12311143 470.
## [1] 1704
## [1] 6
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
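The code that produced the outputs above is not shown. A minimal sketch that would give this kind of output, assuming the data are the `gapminder` tibble from the gapminder package, could look like this:

```r
# Assumption: the dataset is the `gapminder` tibble from the gapminder package
library(gapminder)

head(gapminder)     # (a) first 6 rows
summary(gapminder)  # (b) summary statistics for every column
tail(gapminder)     # (c) last 6 rows

n_rows <- nrow(gapminder)  # number of rows (1704 in the output above)
n_cols <- ncol(gapminder)  # number of columns (6 in the output above)
n_rows
n_cols

str(gapminder)      # column types and a preview of the values
```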
#create histogram of values for year
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
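The message above appears when a histogram is drawn without choosing a bin size. A possible sketch with ggplot2 follows; the `binwidth = 5` value is our assumption, chosen because gapminder has one observation every five years:

```r
library(gapminder)
library(ggplot2)

# binwidth = 5 is an assumption; setting it explicitly also silences
# the "Pick better value with `binwidth`" message
p <- ggplot(gapminder, aes(x = year)) +
  geom_histogram(binwidth = 5)
print(p)
```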
dplyr package and tidyr when necessary).
## # A tibble: 12 × 3
## country year pop
## <fct> <int> <int>
## 1 China 1952 556263527
## 2 China 1957 637408000
## 3 China 1962 665770000
## 4 China 1967 754550000
## 5 China 1972 862030000
## 6 China 1977 943455000
## 7 China 1982 1000281000
## 8 China 1987 1084035000
## 9 China 1992 1164970000
## 10 China 1997 1230075000
## 11 China 2002 1280400000
## 12 China 2007 1318683096
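A table like the one above can be obtained with dplyr. This is only a sketch of one way to do it; the object name `china` is hypothetical:

```r
library(gapminder)
library(dplyr)

# Keep only the rows for China, then only the three columns shown above
china <- gapminder %>%
  filter(country == "China") %>%
  select(country, year, pop)
china
```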
#select the year and pop columns for China
## # A tibble: 12 × 2
## year pop
## <int> <int>
## 1 1952 556263527
## 2 1957 637408000
## 3 1962 665770000
## 4 1967 754550000
## 5 1972 862030000
## 6 1977 943455000
## 7 1982 1000281000
## 8 1987 1084035000
## 9 1992 1164970000
## 10 1997 1230075000
## 11 2002 1280400000
## 12 2007 1318683096
## year pop
## Min. :1952 Min. :5.563e+08
## 1st Qu.:1966 1st Qu.:7.324e+08
## Median :1980 Median :9.719e+08
## Mean :1980 Mean :9.582e+08
## 3rd Qu.:1993 3rd Qu.:1.181e+09
## Max. :2007 Max. :1.319e+09
## [1] 264394873
## [1] 18.02776
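The summary table and the two single values above are consistent with calling `summary()` and `sd()` on the China subset: 18.02776 is the standard deviation of the twelve survey years. A sketch, assuming that is what was computed (the object names are hypothetical):

```r
library(gapminder)
library(dplyr)

# Year and population for China only
china_pop <- gapminder %>%
  filter(country == "China") %>%
  select(year, pop)

summary(china_pop)  # min, quartiles, mean and max of each column

# Assumption: the two single values in the output are standard deviations
sd_pop  <- sd(china_pop$pop)
sd_year <- sd(china_pop$year)
sd_pop
sd_year
```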
## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow
## `geom_smooth()` using formula = 'y ~ x'
## # A tibble: 1,704 × 13
## country lifeExp gdpPercap year_…¹ year_…² year_…³ year_…⁴ year_…⁵ pop_A…⁶
## <fct> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
## 1 Afghanistan 28.8 779. 1952 NA NA NA NA 8.43e6
## 2 Afghanistan 30.3 821. 1957 NA NA NA NA 9.24e6
## 3 Afghanistan 32.0 853. 1962 NA NA NA NA 1.03e7
## 4 Afghanistan 34.0 836. 1967 NA NA NA NA 1.15e7
## 5 Afghanistan 36.1 740. 1972 NA NA NA NA 1.31e7
## 6 Afghanistan 38.4 786. 1977 NA NA NA NA 1.49e7
## 7 Afghanistan 39.9 978. 1982 NA NA NA NA 1.29e7
## 8 Afghanistan 40.8 852. 1987 NA NA NA NA 1.39e7
## 9 Afghanistan 41.7 649. 1992 NA NA NA NA 1.63e7
## 10 Afghanistan 41.8 635. 1997 NA NA NA NA 2.22e7
## # … with 1,694 more rows, 4 more variables: pop_Europe <int>, pop_Africa <int>,
## # pop_Americas <int>, pop_Oceania <int>, and abbreviated variable names
## # ¹year_Asia, ²year_Europe, ³year_Africa, ⁴year_Americas, ⁵year_Oceania,
## # ⁶pop_Asia
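The wide table above has one `year_*` and one `pop_*` column per continent, which is the shape produced by `pivot_wider()` from tidyr. A sketch that yields the same dimensions, assuming this is how the table was built:

```r
library(gapminder)
library(tidyr)

# Spread year and pop into one column per continent.
# country, lifeExp and gdpPercap stay as identifying columns, so every
# row has values only in the columns of its own continent and NA elsewhere.
wide <- pivot_wider(
  gapminder,
  names_from  = continent,
  values_from = c(year, pop)
)
wide
```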
## # A tibble: 6 × 3
## country year pop
## <fct> <int> <int>
## 1 China 1982 1000281000
## 2 China 1987 1084035000
## 3 China 1992 1164970000
## 4 China 1997 1230075000
## 5 China 2002 1280400000
## 6 China 2007 1318683096
#years in which the population of China is less than 1,000,000,000
## # A tibble: 6 × 3
## country year pop
## <fct> <int> <int>
## 1 China 1952 556263527
## 2 China 1957 637408000
## 3 China 1962 665770000
## 4 China 1967 754550000
## 5 China 1972 862030000
## 6 China 1977 943455000
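The two tables above split the China data at one billion people. A sketch of the two filters with dplyr follows; the names `china_high` and `china_low` are hypothetical:

```r
library(gapminder)
library(dplyr)

china <- gapminder %>%
  filter(country == "China") %>%
  select(country, year, pop)

# Assumption: the cut-off is one billion. `>=` and `>` give the same result
# here because no value equals exactly 1e9.
china_high <- china %>% filter(pop >= 1e9)  # 1982 to 2007
china_low  <- china %>% filter(pop < 1e9)   # 1952 to 1977
china_high
china_low
```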
## country year pop country.1 year.1 pop.1
## 1 China 1982 1000281000 China 1952 556263527
## 2 China 1987 1084035000 China 1957 637408000
## 3 China 1992 1164970000 China 1962 665770000
## 4 China 1997 1230075000 China 1967 754550000
## 5 China 2002 1280400000 China 1972 862030000
## 6 China 2007 1318683096 China 1977 943455000
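The side-by-side table above, with the repeated names turned into `country.1`, `year.1` and `pop.1`, is what `data.frame()` returns when it is given two tables with the same column names. A sketch, assuming the two filtered subsets were combined this way:

```r
library(gapminder)
library(dplyr)

china <- gapminder %>%
  filter(country == "China") %>%
  select(country, year, pop)
china_high <- china %>% filter(pop >= 1e9)
china_low  <- china %>% filter(pop < 1e9)

# data.frame() makes duplicated column names unique by adding ".1"
side_by_side <- data.frame(china_high, china_low)
side_by_side
```

This only works because both subsets have the same number of rows; placing them side by side does not mean the rows are related to each other.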
## year pop
## 1 1982 1000281000
## 2 1987 1084035000
## 3 1992 1164970000
## 4 1997 1230075000
## 5 2002 1280400000
## 6 2007 1318683096
## year pop
## 1 1982 1000281000
## 2 1987 1084035000
## 3 1992 1164970000
## 4 1997 1230075000
## 5 2002 1280400000
## 6 2007 1318683096
## # A tibble: 6 × 3
## country year pop
## <fct> <int> <int>
## 1 China 1982 1000281000
## 2 China 1987 1084035000
## 3 China 1992 1164970000
## 4 China 1997 1230075000
## 5 China 2002 1280400000
## 6 China 2007 1318683096
## [1] -0.73762102 1.27326626 -0.08880451 -0.02558668 0.67754119 -0.95150812
#Next, we have to create a second variable:
## [1] -1.00431801 2.03719992 -0.08361005 -0.32942389 -0.82327318 -2.65937973
#Let’s create a data frame with these two variables and then use these data to calculate Pearson’s correlation:
## x y
## 1 -0.73762102 -1.00431801
## 2 1.27326626 2.03719992
## 3 -0.08880451 -0.08361005
## 4 -0.02558668 -0.32942389
## 5 0.67754119 -0.82327318
## 6 -0.95150812 -2.65937973
## [1] 0.69
###Conclusion
#The output is 0.69; this is our Pearson correlation coefficient. Since the number is positive, the correlation is positive, i.e. our two variables move in the same direction: when one variable increases, the other tends to increase as well. A value of 0.69 indicates a moderately strong correlation.
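The code that generated the two variables is not shown, so the following is only a sketch under our own assumptions: the seed, the sample size of 100 and the way `y` depends on `x` are all hypothetical. It follows the same steps: create two variables, put them in a data frame, then compute and round Pearson's correlation.

```r
set.seed(1)           # assumption: the original seed is unknown

x <- rnorm(100)       # first variable
y <- x + rnorm(100)   # assumption: second variable = x plus random noise

head(x)
head(y)

# Combine both variables in a data frame
df <- data.frame(x, y)
head(df)

# Pearson correlation, rounded to two decimals
r <- round(cor(df$x, df$y, method = "pearson"), 2)
r
```

Because the data are random, the coefficient changes with the seed, which is why different runs of the same analysis can print slightly different values.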
We can now further visualise our variables. Scatterplots are a quick way to check for correlation between pairs of continuous variables.
## `geom_smooth()` using formula = 'y ~ x'
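A scatterplot with a fitted line can be drawn with ggplot2. This is a sketch on simulated data; the data, the seed and the choice of `method = "lm"` are our assumptions:

```r
library(ggplot2)

set.seed(1)                   # assumption: arbitrary seed
df <- data.frame(x = rnorm(100))
df$y <- df$x + rnorm(100)     # assumption: y is x plus random noise

# Points plus a straight regression line; giving the formula explicitly
# avoids the "using formula = 'y ~ x'" message
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(p)
```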
We can use the cor function to calculate a correlation matrix for an entire data frame with several variables:
# print head of example data
## # A tibble: 6 × 3
## country year pop
## <fct> <int> <int>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## [1] 0.08
Next, we can use the cor function to create a correlation matrix for all our variables:
# Pearson correlation + round the output
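Since `cor()` only accepts numeric columns, the factor column `country` has to be left out first. A sketch on the two numeric columns of the example data, assuming the gapminder dataset:

```r
library(gapminder)

# Keep only the numeric columns of the example data
num_cols <- gapminder[, c("year", "pop")]

# Pearson correlation matrix, rounded to two decimals
cor_mat <- round(cor(num_cols, method = "pearson"), 2)
cor_mat
```

The diagonal of a correlation matrix is always 1, because every variable correlates perfectly with itself.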
## `geom_smooth()` using formula = 'y ~ x'
Let’s calculate the Pearson correlation coefficient:
# rounding the number to two decimals
## [1] 1