Aiming to explore the US COVID-19 deaths on National level, we aim to to find relationships between different health Attributes, Variables and Factors of Infected Individuals and explore the two data sets that we have selected for analysis, one is focused on when COVID 19 occurred and the other data set is focused on data gathered till Jan, 2023 having medical conditions in it. With respect to individual medical condition, we want to explore and find trends and relations between factors such COVID-19 death rate, people having pnemonea, people with influenza etc
The data about Covid comes from “The New York Times,” based on reports from state and local health agencies, and was brought from Kaggle and the other source of data is from the “CDA” Central Disease Control in United States for a deeper analysis and perspective. Finding relevant data was quiet the challenge but we were able to find suitable data.
Firstly we will analyze and clean the data which has the infection and death rate, we want to find the most affected and infected states in US over time, and later visualize that data as it is part of our project goals, we visualize the data and do analysis on national level after which we also investigate the relation between different factors of infection individuals in US.
head(covid_data)
## # A tibble: 6 × 6
## date county state fips cases deaths
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2020-01-21 Snohomish Washington 53061 1 0
## 2 2020-01-22 Snohomish Washington 53061 1 0
## 3 2020-01-23 Snohomish Washington 53061 1 0
## 4 2020-01-24 Cook Illinois 17031 1 0
## 5 2020-01-24 Snohomish Washington 53061 1 0
## 6 2020-01-25 Orange California 06059 1 0
We cleaned and process the data set, before visual and analytical insights.
Here is the summary of the dataset.
## date county state fips
## Min. :2020-01-21 Length:2502832 Length:2502832 Length:2502832
## 1st Qu.:2020-10-13 Class :character Class :character Class :character
## Median :2021-04-23 Mode :character Mode :character Mode :character
## Mean :2021-04-22
## 3rd Qu.:2021-11-02
## Max. :2022-05-13
##
## cases deaths survivors fatality
## Min. : 0 Min. : 0.0 Min. : -11797 Min. :0.00
## 1st Qu.: 382 1st Qu.: 6.0 1st Qu.: 373 1st Qu.:0.01
## Median : 1773 Median : 33.0 Median : 1766 Median :0.02
## Mean : 10034 Mean : 161.6 Mean : 10052 Mean : Inf
## 3rd Qu.: 5884 3rd Qu.: 101.0 3rd Qu.: 5906 3rd Qu.:0.02
## Max. :2908425 Max. :40267.0 Max. :2876403 Max. : Inf
## NA's :57605 NA's :57605 NA's :57605
## l_cases l_deaths daily_cases daily_deaths
## Min. : 0 Min. : 0.0 Min. :-370251.0 Min. :-11317.00
## 1st Qu.: 377 1st Qu.: 6.0 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 1764 Median : 33.0 Median : 2.0 Median : 0.00
## Mean : 10001 Mean : 161.2 Mean : 32.9 Mean : 0.41
## 3rd Qu.: 5860 3rd Qu.: 101.0 3rd Qu.: 13.0 3rd Qu.: 0.00
## Max. :2904483 Max. :40261.0 Max. : 193786.0 Max. : 1554.00
## NA's :57527 NA's :57605
As we can see in the summary above initial details. We added new columns then we grouped the data by county and state. we also added ‘l_cases,’ ‘l_deaths,’ ‘daily_cases,’ and ‘daily_deaths’ varaibles. There are some negative number and NAs, that we need to do some processing on the data.
We check to see which varaibles has NAs, so we can deal with them for better processing and visualization.
## # A tibble: 6 × 12
## # Groups: county, state [6]
## date county state fips cases deaths survi…¹ fatal…² l_cases l_dea…³
## <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020-05-05 Adjuntas Puer… 72001 3 NA NA NA 0 0
## 2 2020-05-05 Aguada Puer… 72003 7 NA NA NA 0 0
## 3 2020-05-05 Aguadilla Puer… 72005 11 NA NA NA 0 0
## 4 2020-05-05 Aguas Bue… Puer… 72007 22 NA NA NA 0 0
## 5 2020-05-05 Aibonito Puer… 72009 13 NA NA NA 0 0
## 6 2020-05-05 Anasco Puer… 72011 5 NA NA NA 0 0
## # … with 2 more variables: daily_cases <dbl>, daily_deaths <dbl>, and
## # abbreviated variable names ¹survivors, ²fatality, ³l_deaths
## [1] "All of the NAs for the deaths variable are in Puerto Rico"
## date county state fips cases deaths
## 0 0 0 22886 0 0
## survivors fatality l_cases l_deaths daily_cases daily_deaths
## 0 0 0 0 0 0
We also deal with jump days as it can affect results.
## # A tibble: 6 × 4
## date cases deaths count_counties
## <date> <dbl> <dbl> <int>
## 1 2021-05-31 33010904 590947 3143
## 2 2021-06-01 33033274 591457 3143
## 3 2021-06-02 33050894 592049 3143
## 4 2021-06-03 33070576 592668 3143
## 5 2021-06-04 33049256 592793 3143
## 6 2021-06-05 33059573 593161 3143
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
## # A tibble: 1 × 3
## state first second
## <chr> <dbl> <dbl>
## 1 Florida 2326842 2289276
## [1] "The number of rows where a county had less than zero cases is 39501"
## `summarise()` has grouped output by 'county'. You can override using the
## `.groups` argument.
## [1] "The number of counties that have this issue is 3021"
## # A tibble: 3 × 3
## # Groups: county [3]
## county state count
## <chr> <chr> <int>
## 1 Anderson Texas 107
## 2 Jones Texas 127
## 3 Richmond Virginia 116
## # A tibble: 6 × 11
## state cases deaths survi…¹ fatal…² l_cases daily…³ daily…⁴ l_dea…⁵ daily…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska 2.42e5 1165 241193 0.00481 1289510 -1.05e6 0 18890 -17725
## 2 America… 5.55e2 0 555 0 242358 -2.42e5 0 1165 -1165
## 3 Arkansas 8.27e5 10950 815963 0.0132 1987318 -1.16e6 0 28097 -17147
## 4 Colorado 1.33e6 12103 1320099 0.00908 9044007 -7.71e6 0 87665 -75562
## 5 Connect… 7.31e5 10655 719851 0.0146 1332202 -6.02e5 0 12103 -1448
## 6 Delaware 2.58e5 2807 255419 0.0109 730506 -4.72e5 0 10655 -7848
## # … with 1 more variable: updated_fatality <dbl>, and abbreviated variable
## # names ¹survivors, ²fatality, ³daily_cases, ⁴daily_cases_lag, ⁵l_deaths,
## # ⁶daily_deaths
Now that we have dealt with the NAs we create a new table/df.
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 11
## date cases deaths survivors fatality l_cases daily…¹ daily…² l_dea…³
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-09-11 40778439 656561 40121878 0.0161 40704945 73494 84435 655881
## 2 2021-09-12 40817535 656856 40160679 0.0161 40778439 39096 41466 656561
## 3 2021-09-13 41102295 658576 40443719 0.0160 40817535 284760 279766 656856
## 4 2021-09-14 41254944 661207 40593737 0.0160 41102295 152649 160015 658576
## 5 2021-09-15 41424814 663772 40761042 0.0160 41254944 169870 210465 661207
## 6 2021-09-16 41579175 667179 40911996 0.0160 41424814 154361 175123 663772
## # … with 2 more variables: daily_deaths <dbl>, updated_fatality <dbl>, and
## # abbreviated variable names ¹daily_cases, ²daily_cases_lag, ³l_deaths
Now that we have cleaned the data and have dealt with the NAs, we will do a deep EDA
National level covid diagnoses as can be seen below, The total covid deaths and diagnoses rise with time, but with the passage of time the fatalities decrease.
## date cases deaths survivors
## 2.437861e+02 2.550421e+07 3.144461e+05 2.519801e+07
## fatality l_cases daily_cases daily_cases_lag
## 1.388590e-02 2.546109e+07 1.436738e+05 1.440814e+05
## l_deaths daily_deaths updated_fatality
## 3.143287e+05 1.014225e+03 NaN
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As it can be seen above in the Covid Diagnoses graph we draw a parallel black line to see the trend, and as it can be seen that with the passage of time the trend drops.
## Max lag exceeds data available. Using max lag: 843
## # A tibble: 31 × 2
## lag CCF_daily_deaths
## <dbl> <dbl>
## 1 0 0.477
## 2 1 0.331
## 3 2 0.255
## 4 3 0.279
## 5 4 0.252
## 6 5 0.288
## 7 6 0.357
## 8 7 0.382
## 9 8 0.265
## 10 9 0.186
## # … with 21 more rows
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Same can be said in the case of Covid deaths, which means either with the passage of time the virus became weak, or people had developed an immunity. It can also be said that after winter season with less cases of pnemonea and influenza that cases became less fatel.
We do visualization and see the deaths and cases on state level in US.
Aiming to examine outliers and top states by deaths we visualize the data, through a dot graph and bar plot which can be seen above and below.
Here we visualize the fatality rate by state, here we see some difference when compared with deaths, it can be said some states have better medical facilities or it can be said that some states have healthier people.
We visualize the progress of cases and deaths overtime, for each states to obersve if there are some commonalities between them.
We calculate the SD to measure the degree of dispersion of data.
## # A tibble: 6 × 5
## state cases_sd deaths_sd daily_cases_sd daily_deaths_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 410282. 6506. 410282. 6506.
## 2 Alaska 79242. 398. 332783. 6137.
## 3 American Samoa 2068. 6.78 49986. 223.
## 4 Arizona 639782. 9693. 639045. 9669.
## 5 Arkansas 265237. 3692. 365461. 5746.
## 6 California 2882682. 31485. 2614811. 27783.
We visualize the data on a US National Map
Adjusting and cleaning dateset with varaibles of medical conditions, We aim find relation in the CDA gathered US data based on the previous observations.
## # A tibble: 6 × 16
## Data A…¹ Start…² End D…³ Group Year Month State Sex Age G…⁴ COVID…⁵ Total…⁶
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 01/11/2… 01/01/… 01/07/… By T… NA NA Unit… All … All Ag… 1091212 1.01e7
## 2 01/11/2… 01/01/… 01/07/… By T… NA NA Unit… All … Under … 410 5.86e4
## 3 01/11/2… 01/01/… 01/07/… By T… NA NA Unit… All … 0-17 y… 1433 1.04e5
## 4 01/11/2… 01/01/… 01/07/… By T… NA NA Unit… All … 1-4 ye… 224 1.13e4
## 5 01/11/2… 01/01/… 01/07/… By T… NA NA Unit… All … 5-14 y… 440 1.75e4
## 6 01/11/2… 01/01/… 01/07/… By T… NA NA Unit… All … 15-24 … 2900 1.08e5
## # … with 5 more variables: `Pneumonia Deaths` <dbl>,
## # `Pneumonia and COVID-19 Deaths` <dbl>, `Influenza Deaths` <dbl>,
## # `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>, Footnote <chr>, and
## # abbreviated variable names ¹`Data As Of`, ²`Start Date`, ³`End Date`,
## # ⁴`Age Group`, ⁵`COVID-19 Deaths`, ⁶`Total Deaths`
## # A tibble: 6 × 13
## `Data As Of` Start…¹ End D…² Group State Sex Age G…³ COVID…⁴ Total…⁵ Pneum…⁶
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 01/11/2023 01/01/… 01/07/… By T… Unit… All … All Ag… 1091212 1.01e7 1026050
## 2 01/11/2023 01/01/… 01/07/… By T… Unit… All … Under … 410 5.86e4 797
## 3 01/11/2023 01/01/… 01/07/… By T… Unit… All … 0-17 y… 1433 1.04e5 2228
## 4 01/11/2023 01/01/… 01/07/… By T… Unit… All … 1-4 ye… 224 1.13e4 485
## 5 01/11/2023 01/01/… 01/07/… By T… Unit… All … 5-14 y… 440 1.75e4 631
## 6 01/11/2023 01/01/… 01/07/… By T… Unit… All … 15-24 … 2900 1.08e5 2739
## # … with 3 more variables: `Pneumonia and COVID-19 Deaths` <dbl>,
## # `Influenza Deaths` <dbl>, `Pneumonia, Influenza, or COVID-19 Deaths` <dbl>,
## # and abbreviated variable names ¹`Start Date`, ²`End Date`, ³`Age Group`,
## # ⁴`COVID-19 Deaths`, ⁵`Total Deaths`, ⁶`Pneumonia Deaths`
## Data As Of Start Date
## 0 0
## End Date Group
## 0 0
## State Sex
## 0 0
## AgeGroup COVID19Deaths
## 0 87
## TotalDeaths PneumoniaDeaths
## 4 24
## PneumoniaAndCOVID19Deaths InfluenzaDeaths
## 209 1020
## PneumoniaInfluenzaOrCOVID19Deaths
## 18
## # A tibble: 6 × 13
## `Data As Of` Start…¹ End D…² Group State Sex AgeGr…³ COVID…⁴ Total…⁵ Pneum…⁶
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 01/11/2023 01/01/… 01/07/… By Y… Dela… All … All Ag… NA 22 NA
## 2 01/11/2023 01/01/… 01/07/… By Y… Dist… All … All Ag… 0 NA NA
## 3 01/11/2023 01/01/… 01/07/… By Y… Hawa… All … All Ag… NA 65 NA
## 4 01/11/2023 01/01/… 01/07/… By Y… Idaho All … All Ag… NA 74 NA
## 5 01/11/2023 01/01/… 01/07/… By Y… Indi… All … All Ag… NA 125 NA
## 6 01/11/2023 01/01/… 01/07/… By Y… Loui… All … All Ag… 0 20 NA
## # … with 3 more variables: PneumoniaAndCOVID19Deaths <dbl>,
## # InfluenzaDeaths <dbl>, PneumoniaInfluenzaOrCOVID19Deaths <dbl>, and
## # abbreviated variable names ¹`Start Date`, ²`End Date`, ³AgeGroup,
## # ⁴COVID19Deaths, ⁵TotalDeaths, ⁶PneumoniaDeaths
## [1] "All of the NAs for the deaths variable are in c(\"Delaware\", \"District of Columbia\", \"Hawaii\", \"Idaho\", \"Indiana\", \"Louisiana\", \"Nebraska\", \"North Dakota\", \"Rhode Island\", \"Utah\", \"Vermont\", \"Wyoming\")"
## Data As Of Start Date
## 0 0
## End Date Group
## 0 0
## State Sex
## 0 0
## AgeGroup COVID19Deaths
## 0 48
## TotalDeaths PneumoniaDeaths
## 2 0
## PneumoniaAndCOVID19Deaths InfluenzaDeaths
## 111 826
## PneumoniaInfluenzaOrCOVID19Deaths
## 2
## [1] "All of the NAs for the deaths variable are in Alaska"
## Data As Of Start Date
## 0 0
## End Date Group
## 0 0
## State Sex
## 0 0
## AgeGroup COVID19Deaths
## 0 0
## TotalDeaths PneumoniaDeaths
## 0 0
## PneumoniaAndCOVID19Deaths InfluenzaDeaths
## 93 812
## PneumoniaInfluenzaOrCOVID19Deaths
## 0
Now we perform linear regression to find the relations of Covid-19 deaths and Pnemoneainfluzenza based deaths.
First, we create a scatterplot to check for linear relationship of the two varialbes.
## `geom_smooth()` using formula = 'y ~ x'
Here we calculate the correlation of the two varialbes. they have stong positive relationship.
## [1] 0.9958919
Here we create two boxplot to check for outliers of the two
variables. we can see both the two variables have a few outliers.
We apply the linear regression Model to the two varialbes.
##
## Call:
## lm(formula = PneumoniaInfluenzaOrCOVID19Deaths ~ COVID19Deaths,
## data = covid_dataNew1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8120.5 -102.2 2.8 125.8 7953.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.053904 20.074986 3.739 0.000191 ***
## COVID19Deaths 1.418100 0.003148 450.528 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 788.6 on 1678 degrees of freedom
## Multiple R-squared: 0.9918, Adjusted R-squared: 0.9918
## F-statistic: 2.03e+05 on 1 and 1678 DF, p-value: < 2.2e-16
From the results we observe that Covid 19 deaths have linear relation with, medical variables such as Inflenze and pnemonea deaths. There is clear pattern, which indicates that the assumption relation is met and true. In other words, the coefficients of the regression model should be trustworthy and we don’t need to perform a transformation on the data.
We explored the data on national level from US and visualized and presented key details, after preforming Processing and Cleaing, we were able to find relations and trend setters in the dataset, We performed visualization and analysis on the data, we presented and found out the top states with respect to fatalites and infected cases over time, after that we processed the data and aimed to find relation between Covid deaths and people with existing medical condition, such that can we found through hypothesis testing and linear regression that there is strong relaton between number of covid deaths on national and state level and of deaths of people with existing medical conditions. We conclude that there is strong corelation between covid deaths and exsiting medical conditon such as pnemonea and influenze influce the number of deaths over time.