Introduction

Aiming to explore the US COVID-19 deaths on National level, we aim to to find relationships between different health Attributes, Variables and Factors of Infected Individuals and explore the two data sets that we have selected for analysis, one is focused on when COVID 19 occurred and the other data set is focused on data gathered till Jan, 2023 having medical conditions in it. With respect to individual medical condition, we want to explore and find trends and relations between factors such COVID-19 death rate, people having pnemonea, people with influenza etc

The data about Covid comes from “The New York Times,” based on reports from state and local health agencies, and was brought from Kaggle and the other source of data is from the “CDA” Central Disease Control in United States for a deeper analysis and perspective. Finding relevant data was quiet the challenge but we were able to find suitable data.

Firstly we will analyze and clean the data which has the infection and death rate, we want to find the most affected and infected states in US over time, and later visualize that data as it is part of our project goals, we visualize the data and do analysis on national level after which we also investigate the relation between different factors of infection individuals in US.

head(covid_data)

## # A tibble: 6 × 6
##   date       county    state      fips  cases deaths
##   <date>     <chr>     <chr>      <chr> <dbl>  <dbl>
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24 Cook      Illinois   17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25 Orange    California 06059     1      0

Importing, Cleaning and Processing Data

We cleaned and process the data set, before visual and analytical insights.

Here is the summary of the dataset.

##       date               county             state               fips          
##  Min.   :2020-01-21   Length:2502832     Length:2502832     Length:2502832    
##  1st Qu.:2020-10-13   Class :character   Class :character   Class :character  
##  Median :2021-04-23   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2021-04-22                                                           
##  3rd Qu.:2021-11-02                                                           
##  Max.   :2022-05-13                                                           
##                                                                               
##      cases             deaths          survivors          fatality    
##  Min.   :      0   Min.   :    0.0   Min.   : -11797   Min.   :0.00   
##  1st Qu.:    382   1st Qu.:    6.0   1st Qu.:    373   1st Qu.:0.01   
##  Median :   1773   Median :   33.0   Median :   1766   Median :0.02   
##  Mean   :  10034   Mean   :  161.6   Mean   :  10052   Mean   : Inf   
##  3rd Qu.:   5884   3rd Qu.:  101.0   3rd Qu.:   5906   3rd Qu.:0.02   
##  Max.   :2908425   Max.   :40267.0   Max.   :2876403   Max.   : Inf   
##                    NA's   :57605     NA's   :57605     NA's   :57605  
##     l_cases           l_deaths        daily_cases         daily_deaths      
##  Min.   :      0   Min.   :    0.0   Min.   :-370251.0   Min.   :-11317.00  
##  1st Qu.:    377   1st Qu.:    6.0   1st Qu.:      0.0   1st Qu.:     0.00  
##  Median :   1764   Median :   33.0   Median :      2.0   Median :     0.00  
##  Mean   :  10001   Mean   :  161.2   Mean   :     32.9   Mean   :     0.41  
##  3rd Qu.:   5860   3rd Qu.:  101.0   3rd Qu.:     13.0   3rd Qu.:     0.00  
##  Max.   :2904483   Max.   :40261.0   Max.   : 193786.0   Max.   :  1554.00  
##                    NA's   :57527                         NA's   :57605

As we can see in the summary above initial details. We added new columns then we grouped the data by county and state. we also added ‘l_cases,’ ‘l_deaths,’ ‘daily_cases,’ and ‘daily_deaths’ varaibles. There are some negative number and NAs, that we need to do some processing on the data.

Cleaning and dealing with NAs in the Data

We check to see which varaibles has NAs, so we can deal with them for better processing and visualization.

## # A tibble: 6 × 12
## # Groups:   county, state [6]
##   date       county     state fips  cases deaths survi…¹ fatal…² l_cases l_dea…³
##   <date>     <chr>      <chr> <chr> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 2020-05-05 Adjuntas   Puer… 72001     3     NA      NA      NA       0       0
## 2 2020-05-05 Aguada     Puer… 72003     7     NA      NA      NA       0       0
## 3 2020-05-05 Aguadilla  Puer… 72005    11     NA      NA      NA       0       0
## 4 2020-05-05 Aguas Bue… Puer… 72007    22     NA      NA      NA       0       0
## 5 2020-05-05 Aibonito   Puer… 72009    13     NA      NA      NA       0       0
## 6 2020-05-05 Anasco     Puer… 72011     5     NA      NA      NA       0       0
## # … with 2 more variables: daily_cases <dbl>, daily_deaths <dbl>, and
## #   abbreviated variable names ¹survivors, ²fatality, ³l_deaths

## [1] "All of the NAs for the deaths variable are in  Puerto Rico"

##         date       county        state         fips        cases       deaths 
##            0            0            0        22886            0            0 
##    survivors     fatality      l_cases     l_deaths  daily_cases daily_deaths 
##            0            0            0            0            0            0

Dealing with the Impossible Jump Between Days

We also deal with jump days as it can affect results.

## # A tibble: 6 × 4
##   date          cases deaths count_counties
##   <date>        <dbl>  <dbl>          <int>
## 1 2021-05-31 33010904 590947           3143
## 2 2021-06-01 33033274 591457           3143
## 3 2021-06-02 33050894 592049           3143
## 4 2021-06-03 33070576 592668           3143
## 5 2021-06-04 33049256 592793           3143
## 6 2021-06-05 33059573 593161           3143

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

## # A tibble: 1 × 3
##   state     first  second
##   <chr>     <dbl>   <dbl>
## 1 Florida 2326842 2289276

## [1] "The number of rows where a county had less than zero cases is 39501"

## `summarise()` has grouped output by 'county'. You can override using the
## `.groups` argument.

## [1] "The number of counties that have this issue is 3021"

## # A tibble: 3 × 3
## # Groups:   county [3]
##   county   state    count
##   <chr>    <chr>    <int>
## 1 Anderson Texas      107
## 2 Jones    Texas      127
## 3 Richmond Virginia   116

## # A tibble: 6 × 11
##   state     cases deaths survi…¹ fatal…² l_cases daily…³ daily…⁴ l_dea…⁵ daily…⁶
##   <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Alaska   2.42e5   1165  241193 0.00481 1289510 -1.05e6       0   18890  -17725
## 2 America… 5.55e2      0     555 0        242358 -2.42e5       0    1165   -1165
## 3 Arkansas 8.27e5  10950  815963 0.0132  1987318 -1.16e6       0   28097  -17147
## 4 Colorado 1.33e6  12103 1320099 0.00908 9044007 -7.71e6       0   87665  -75562
## 5 Connect… 7.31e5  10655  719851 0.0146  1332202 -6.02e5       0   12103   -1448
## 6 Delaware 2.58e5   2807  255419 0.0109   730506 -4.72e5       0   10655   -7848
## # … with 1 more variable: updated_fatality <dbl>, and abbreviated variable
## #   names ¹survivors, ²fatality, ³daily_cases, ⁴daily_cases_lag, ⁵l_deaths,
## #   ⁶daily_deaths

Create New tables

Now that we have dealt with the NAs we create a new table/df.

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

Investigating Spike in Numbers

## # A tibble: 6 × 11
##   date          cases deaths survivors fatality  l_cases daily…¹ daily…² l_dea…³
##   <date>        <dbl>  <dbl>     <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
## 1 2021-09-11 40778439 656561  40121878   0.0161 40704945   73494   84435  655881
## 2 2021-09-12 40817535 656856  40160679   0.0161 40778439   39096   41466  656561
## 3 2021-09-13 41102295 658576  40443719   0.0160 40817535  284760  279766  656856
## 4 2021-09-14 41254944 661207  40593737   0.0160 41102295  152649  160015  658576
## 5 2021-09-15 41424814 663772  40761042   0.0160 41254944  169870  210465  661207
## 6 2021-09-16 41579175 667179  40911996   0.0160 41424814  154361  175123  663772
## # … with 2 more variables: daily_deaths <dbl>, updated_fatality <dbl>, and
## #   abbreviated variable names ¹daily_cases, ²daily_cases_lag, ³l_deaths

Performing an Exploratory Dataset Analysis

Now that we have cleaned the data and have dealt with the NAs, we will do a deep EDA

National level covid diagnoses as can be seen below, The total covid deaths and diagnoses rise with time, but with the passage of time the fatalities decrease.

##             date            cases           deaths        survivors 
##     2.437861e+02     2.550421e+07     3.144461e+05     2.519801e+07 
##         fatality          l_cases      daily_cases  daily_cases_lag 
##     1.388590e-02     2.546109e+07     1.436738e+05     1.440814e+05 
##         l_deaths     daily_deaths updated_fatality 
##     3.143287e+05     1.014225e+03              NaN

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As it can be seen above in the Covid Diagnoses graph we draw a parallel black line to see the trend, and as it can be seen that with the passage of time the trend drops.

## Max lag exceeds data available. Using max lag: 843

## # A tibble: 31 × 2
##      lag CCF_daily_deaths
##    <dbl>            <dbl>
##  1     0            0.477
##  2     1            0.331
##  3     2            0.255
##  4     3            0.279
##  5     4            0.252
##  6     5            0.288
##  7     6            0.357
##  8     7            0.382
##  9     8            0.265
## 10     9            0.186
## # … with 21 more rows

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Same can be said in the case of Covid deaths, which means either with the passage of time the virus became weak, or people had developed an immunity. It can also be said that after winter season with less cases of pnemonea and influenza that cases became less fatel.

Analyzing and Visualizing By State

We do visualization and see the deaths and cases on state level in US.

Aiming to examine outliers and top states by deaths we visualize the data, through a dot graph and bar plot which can be seen above and below.

Here we visualize the fatality rate by state, here we see some difference when compared with deaths, it can be said some states have better medical facilities or it can be said that some states have healthier people.

We visualize the progress of cases and deaths overtime, for each states to obersve if there are some commonalities between them.

We calculate the SD to measure the degree of dispersion of data.

## # A tibble: 6 × 5
##   state          cases_sd deaths_sd daily_cases_sd daily_deaths_sd
##   <chr>             <dbl>     <dbl>          <dbl>           <dbl>
## 1 Alabama         410282.   6506.          410282.           6506.
## 2 Alaska           79242.    398.          332783.           6137.
## 3 American Samoa    2068.      6.78         49986.            223.
## 4 Arizona         639782.   9693.          639045.           9669.
## 5 Arkansas        265237.   3692.          365461.           5746.
## 6 California     2882682.  31485.         2614811.          27783.

Covid-19 Cases Visualization via Map

We visualize the data on a US National Map

Finding Relationships between Medical conditions and variables

Adjusting and cleaning dateset with varaibles of medical conditions, We aim find relation in the CDA gathered US data based on the previous observations.

## # A tibble: 6 × 16
##   Data A…¹ Start…² End D…³ Group  Year Month State Sex   Age G…⁴ COVID…⁵ Total…⁶
##   <chr>    <chr>   <chr>   <chr> <dbl> <dbl> <chr> <chr> <chr>     <dbl>   <dbl>
## 1 01/11/2… 01/01/… 01/07/… By T…    NA    NA Unit… All … All Ag… 1091212  1.01e7
## 2 01/11/2… 01/01/… 01/07/… By T…    NA    NA Unit… All … Under …     410  5.86e4
## 3 01/11/2… 01/01/… 01/07/… By T…    NA    NA Unit… All … 0-17 y…    1433  1.04e5
## 4 01/11/2… 01/01/… 01/07/… By T…    NA    NA Unit… All … 1-4 ye…     224  1.13e4
## 5 01/11/2… 01/01/… 01/07/… By T…    NA    NA Unit… All … 5-14 y…     440  1.75e4
## 6 01/11/2… 01/01/… 01/07/… By T…    NA    NA Unit… All … 15-24 …    2900  1.08e5
## # … with 5 more variables: `Pneumonia Deaths` <dbl>,
## #   `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
## #   `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>, and
## #   abbreviated variable names ¹`Data As Of`, ²`Start Date`, ³`End Date`,
## #   ⁴`Age Group`, ⁵`COVID-19 Deaths`, ⁶`Total Deaths`

## # A tibble: 6 × 13
##   `Data As Of` Start…¹ End D…² Group State Sex   Age G…³ COVID…⁴ Total…⁵ Pneum…⁶
##   <chr>        <chr>   <chr>   <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl>
## 1 01/11/2023   01/01/… 01/07/… By T… Unit… All … All Ag… 1091212  1.01e7 1026050
## 2 01/11/2023   01/01/… 01/07/… By T… Unit… All … Under …     410  5.86e4     797
## 3 01/11/2023   01/01/… 01/07/… By T… Unit… All … 0-17 y…    1433  1.04e5    2228
## 4 01/11/2023   01/01/… 01/07/… By T… Unit… All … 1-4 ye…     224  1.13e4     485
## 5 01/11/2023   01/01/… 01/07/… By T… Unit… All … 5-14 y…     440  1.75e4     631
## 6 01/11/2023   01/01/… 01/07/… By T… Unit… All … 15-24 …    2900  1.08e5    2739
## # … with 3 more variables: `Pneumonia and COVID-19 Deaths` <dbl>,
## #   `Influenza Deaths` <dbl>, `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>,
## #   and abbreviated variable names ¹`Start Date`, ²`End Date`, ³`Age Group`,
## #   ⁴`COVID-19 Deaths`, ⁵`Total Deaths`, ⁶`Pneumonia Deaths`

Cleaning and dealing with NAs in the Data

##                        Data As Of                        Start Date 
##                                 0                                 0 
##                          End Date                             Group 
##                                 0                                 0 
##                             State                               Sex 
##                                 0                                 0 
##                          AgeGroup                     COVID19Deaths 
##                                 0                                87 
##                       TotalDeaths                   PneumoniaDeaths 
##                                 4                                24 
##         PneumoniaAndCOVID19Deaths                   InfluenzaDeaths 
##                               209                              1020 
## PneumoniaInfluenzaOrCOVID19Deaths 
##                                18

## # A tibble: 6 × 13
##   `Data As Of` Start…¹ End D…² Group State Sex   AgeGr…³ COVID…⁴ Total…⁵ Pneum…⁶
##   <chr>        <chr>   <chr>   <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl>
## 1 01/11/2023   01/01/… 01/07/… By Y… Dela… All … All Ag…      NA      22      NA
## 2 01/11/2023   01/01/… 01/07/… By Y… Dist… All … All Ag…       0      NA      NA
## 3 01/11/2023   01/01/… 01/07/… By Y… Hawa… All … All Ag…      NA      65      NA
## 4 01/11/2023   01/01/… 01/07/… By Y… Idaho All … All Ag…      NA      74      NA
## 5 01/11/2023   01/01/… 01/07/… By Y… Indi… All … All Ag…      NA     125      NA
## 6 01/11/2023   01/01/… 01/07/… By Y… Loui… All … All Ag…       0      20      NA
## # … with 3 more variables: PneumoniaAndCOVID19Deaths <dbl>,
## #   InfluenzaDeaths <dbl>, PneumoniaInfluenzaOrCOVID19Deaths <dbl>, and
## #   abbreviated variable names ¹`Start Date`, ²`End Date`, ³AgeGroup,
## #   ⁴COVID19Deaths, ⁵TotalDeaths, ⁶PneumoniaDeaths

## [1] "All of the NAs for the deaths variable are in  c(\"Delaware\", \"District of Columbia\", \"Hawaii\", \"Idaho\", \"Indiana\", \"Louisiana\", \"Nebraska\", \"North Dakota\", \"Rhode Island\", \"Utah\", \"Vermont\", \"Wyoming\")"

##                        Data As Of                        Start Date 
##                                 0                                 0 
##                          End Date                             Group 
##                                 0                                 0 
##                             State                               Sex 
##                                 0                                 0 
##                          AgeGroup                     COVID19Deaths 
##                                 0                                48 
##                       TotalDeaths                   PneumoniaDeaths 
##                                 2                                 0 
##         PneumoniaAndCOVID19Deaths                   InfluenzaDeaths 
##                               111                               826 
## PneumoniaInfluenzaOrCOVID19Deaths 
##                                 2

## [1] "All of the NAs for the deaths variable are in  Alaska"

##                        Data As Of                        Start Date 
##                                 0                                 0 
##                          End Date                             Group 
##                                 0                                 0 
##                             State                               Sex 
##                                 0                                 0 
##                          AgeGroup                     COVID19Deaths 
##                                 0                                 0 
##                       TotalDeaths                   PneumoniaDeaths 
##                                 0                                 0 
##         PneumoniaAndCOVID19Deaths                   InfluenzaDeaths 
##                                93                               812 
## PneumoniaInfluenzaOrCOVID19Deaths 
##                                 0

RQ1:Are covid-19 deaths related with PneumoniaDeaths relation between deaths and medical condition

Corealtion Analysis

Here we want to explore the relatationship of Covid19 deaths and Pnemonia deaths. First, we visualize the two variables to their relationship.

## `geom_smooth()` using formula = 'y ~ x'

From the visualization we can see that the variable have strong positive relation.

From the histograms, The data in both variables do not follow a normal distribution. Since it’s a large sample we can assume normality.

Hypothesis test

State the null hypothesis and the alternative hypothesis

Null hypothesis: the correlation coefficient is not significantly different from 0. There is no significant linear relationship between COVID19Deaths and PneumoniaDeaths in the population.

Alternative hypothesis: the population correlation coefficient is significantly different from 0. There is a significant linear relationship between COVID19Deaths and PneumoniaDeaths in the population.

Check the assumption of the chosen statistical test. Perform the required statistical test.

Here we will calculate correlation coefficient of the two variables in order to explore the relationship between the two variables.

## [1] 0.99

## 
##  Pearson's product-moment correlation
## 
## data:  covid_dataNew1$COVID19Deaths and covid_dataNew1$PneumoniaDeaths
## t = 264.16, df = 1678, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9870100 0.9892611
## sample estimates:
##       cor 
## 0.9881887

Decide whether to reject or fail to reject your null hypothesis, report selected significance level.

The p-value of the correlation test between these 2 variables is 2.2e-16. P value is below cutoff of 0.05, so we reject the null hypothesis of no correlation.

Interpret and report the results.

We can reject the null hypothesis that there is no relationship between COVID19Deaths and PneumoniaDeaths, Therefore, we conclude that people have existing medical factor such as pnemonea will have higher chance of fatality. Pearson correlation test revealed that the amount of people having medical condition and dcrease covid 19 life expectancy in in US and its positively correlated, This correlation is strong and positive, meaning if number of people with medical conditions increase covid deaths will also increase.

RQ2:Are covid-19 deaths has linear relationship wiith PneumoniaInfluenzaOrCOVID19Deaths

linear regression

Now we perform linear regression to find the relations of Covid-19 deaths and Pnemoneainfluzenza based deaths.

Creating a Scatterplot to check for Linear Relationship

First, we create a scatterplot to check for linear relationship of the two varialbes.

## `geom_smooth()` using formula = 'y ~ x'

Calculating correlation

Here we calculate the correlation of the two varialbes. they have stong positive relationship.

## [1] 0.9958919

Creating a boxplot to check for Outliers

Here we create two boxplot to check for outliers of the two variables. we can see both the two variables have a few outliers.

Fit simple Linear Regression Model

We apply the linear regression Model to the two varialbes.

## 
## Call:
## lm(formula = PneumoniaInfluenzaOrCOVID19Deaths ~ COVID19Deaths, 
##     data = covid_dataNew1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8120.5  -102.2     2.8   125.8  7953.6 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   75.053904  20.074986   3.739 0.000191 ***
## COVID19Deaths  1.418100   0.003148 450.528  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 788.6 on 1678 degrees of freedom
## Multiple R-squared:  0.9918, Adjusted R-squared:  0.9918 
## F-statistic: 2.03e+05 on 1 and 1678 DF,  p-value: < 2.2e-16

From the results we observe that Covid 19 deaths have linear relation with, medical variables such as Inflenze and pnemonea deaths. There is clear pattern, which indicates that the assumption relation is met and true. In other words, the coefficients of the regression model should be trustworthy and we don’t need to perform a transformation on the data.

Conclusion

We explored the data on national level from US and visualized and presented key details, after preforming Processing and Cleaing, we were able to find relations and trend setters in the dataset, We performed visualization and analysis on the data, we presented and found out the top states with respect to fatalites and infected cases over time, after that we processed the data and aimed to find relation between Covid deaths and people with existing medical condition, such that can we found through hypothesis testing and linear regression that there is strong relaton between number of covid deaths on national and state level and of deaths of people with existing medical conditions. We conclude that there is strong corelation between covid deaths and exsiting medical conditon such as pnemonea and influenze influce the number of deaths over time.

Final project

Waleed Khalid, Na Zhu

2022-12-20

Introduction

Importing, Cleaning and Processing Data

Cleaning and dealing with NAs in the Data

Dealing with the Impossible Jump Between Days

Create New tables

Investigating Spike in Numbers

Performing an Exploratory Dataset Analysis

Analyzing and Visualizing By State

Covid-19 Cases Visualization via Map

Finding Relationships between Medical conditions and variables

Cleaning and dealing with NAs in the Data

RQ2:Are covid-19 deaths has linear relationship wiith PneumoniaInfluenzaOrCOVID19Deaths

linear regression

Creating a Scatterplot to check for Linear Relationship

Calculating correlation

Creating a boxplot to check for Outliers

Fit simple Linear Regression Model

Conclusion