Data breaches in the US healthcare industry have been a persistent problem over the past decade. Between 2009 and 2019, numerous healthcare organisations experienced data breaches, resulting in the unauthorised access or disclosure of sensitive patient information. These breaches have had serious consequences for both patients and healthcare organisations, including financial losses, damage to reputation, and loss of trust.
A t-test was used to analyse the dataset of US healthcare data breaches between 2009 and 2019. A t-test compares the mean values of a variable in two groups of data to determine whether the difference between the groups is statistically significant.
This project used the t-test to gain insight into the characteristics and consequences of healthcare data breaches, the relationships between variables, and the correlation between features.
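As a minimal sketch of the two-group comparison described above, base R's `t.test()` can compare the mean number of individuals affected between two groups. The values below are invented for illustration and are not taken from the dataset. Grouping by `Business.Associate.Present` and the data frame name `breaches` are assumptions, not necessarily the comparison made in this project.

```r
# Illustrative values only; not taken from the breach dataset.
breaches <- data.frame(
  Business.Associate.Present = rep(c("Yes", "No"), each = 4),
  Individuals.Affected = c(1200, 800, 1500, 1000, 600, 900, 700, 500)
)

# Welch two-sample t-test (R's default, which does not assume equal variances).
result <- t.test(Individuals.Affected ~ Business.Associate.Present, data = breaches)

result$estimate  # mean of Individuals.Affected in each group
result$p.value   # a small p-value suggests the two group means differ
```

The formula interface keeps the grouping variable and the measured variable in one data frame, which matches how the dataset is laid out.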
In this report, we explore the dataset titled ‘Biggest healthcare data breaches 2009-2019’ from the Privacy Affairs website, which we have renamed ‘data’ for this analysis.
Additional documentation can be found here:
<https://data.world/zendoll27/biggest-healthcare-data-breaches-2009-2019>
The dataset includes 9 variables and 2,641 observations:
- ‘Name.of.Covered.Entity’
- ‘State’
- ‘Covered.Entity.Type’
- ‘Individuals.Affected’
- ‘Breach.Submission.Date’
- ‘Type.of.Breach’
- ‘Location.of.Breached.Information’
- ‘Business.Associate.Present’
- ‘Web.Description’
First, let’s load all the packages needed to explore and analyse the dataset, which was cleaned in Excel before being imported into R.
We import the data into R.
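The import step might look like the sketch below, which uses base R only. The file path, the CSV export, the two sample rows and the month/day/year date format are all assumptions made so that the example runs on its own; the real file would replace the temporary one.

```r
# Assumption: the cleaned Excel sheet was exported as CSV. Two made-up sample
# rows are written to a temporary file so this sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name.of.Covered.Entity,State,Covered.Entity.Type,Individuals.Affected,Breach.Submission.Date",
  "Example Clinic,TX,Healthcare Provider,1000,10/21/2009",
  "Example Plan,CA,Health Plan,501,01/05/2019"
), csv_path)

data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Assumption: dates are stored as month/day/year text.
data$Breach.Submission.Date <- as.Date(data$Breach.Submission.Date, format = "%m/%d/%Y")

# Derive the breach year, which is used later in the correlation test.
year <- as.integer(format(data$Breach.Submission.Date, "%Y"))

str(data)
```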
We then compute descriptive statistics for the dataset.
Calculate the mean of one category inside Individuals.Affected
Calculate the mean of two categories
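The two mean calculations can be sketched as follows. This assumes the "categories" are levels of `Covered.Entity.Type`; the category labels and numbers are illustrative, not values from the dataset.

```r
# Illustrative data; category labels and counts are assumptions.
data <- data.frame(
  Covered.Entity.Type = c("Health Plan", "Healthcare Provider", "Health Plan",
                          "Business Associate", "Healthcare Provider"),
  Individuals.Affected = c(1000, 500, 3000, 2000, 1500)
)

# Mean of Individuals.Affected for a single category.
mean_plan <- mean(
  data$Individuals.Affected[data$Covered.Entity.Type == "Health Plan"],
  na.rm = TRUE
)

# Means for two categories at once, one value per category.
two <- subset(data, Covered.Entity.Type %in% c("Health Plan", "Healthcare Provider"))
means_two <- tapply(two$Individuals.Affected, two$Covered.Entity.Type, mean, na.rm = TRUE)

mean_plan
means_two
```

Passing `na.rm = TRUE` matters here because the plot warnings later in this report show that the real data contains a missing value.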
Calculate the median
Calculate the column sums
Calculate the maximum and minimum values
Calculate the interquartile range
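The median, column sums, extremes and interquartile range listed above could be computed with base R as in this sketch. The numbers are illustrative, and selecting only the numeric columns before calling `colSums()` is an assumption about how the sums were taken.

```r
# Illustrative data only.
data <- data.frame(
  State = c("TX", "CA", "NY", "TX", "FL"),
  Individuals.Affected = c(500, 1000, 1500, 2000, 10000)
)

affected <- data$Individuals.Affected

med <- median(affected, na.rm = TRUE)

# colSums() only accepts numeric columns, so select them first.
sums <- colSums(data[sapply(data, is.numeric)], na.rm = TRUE)

largest  <- max(affected, na.rm = TRUE)
smallest <- min(affected, na.rm = TRUE)

# IQR() is the distance between the 75th and 25th percentiles.
iqr <- IQR(affected, na.rm = TRUE)

c(median = med, max = largest, min = smallest, IQR = iqr)
sums
```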
Calculate the mode by creating a mode function
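Base R has no built-in function for the statistical mode, which is why a helper is needed. One common way to write it is sketched below; the name `get_mode` is hypothetical and the original helper may differ. When several values tie, this version returns the one that appears first.

```r
# Hypothetical helper: returns the most frequent value in x, ignoring NAs.
get_mode <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  # Count how often each unique value occurs and keep the most frequent one.
  values[which.max(tabulate(match(x, values)))]
}

state_mode  <- get_mode(c("TX", "CA", "TX", "NY", "CA", "TX"))
number_mode <- get_mode(c(500, 1000, 500, NA))

state_mode
number_mode
```

The same function works for character columns such as `State` and numeric columns such as `Individuals.Affected`.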
##
## Pearson's product-moment correlation
##
## data: data$Individuals.Affected and year
## t = 0.074367, df = 2638, p-value = 0.9407
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03670303 0.03959464
## sample estimates:
## cor
## 0.001447914
Correlation between year and individuals affected
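The Pearson test output shown above has the form produced by `cor.test()`. A minimal sketch of such a call is given below, assuming `year` is derived from `Breach.Submission.Date`; the five rows are illustrative and will not reproduce the reported figures.

```r
# Illustrative data only; the real dataset has 2,641 rows.
data <- data.frame(
  Breach.Submission.Date = as.Date(c("2009-10-21", "2012-03-02", "2015-07-15",
                                     "2017-11-30", "2019-01-05")),
  Individuals.Affected = c(1000, 501, 78800, 3200, 650)
)

# Assumption: year is extracted from the submission date.
year <- as.numeric(format(data$Breach.Submission.Date, "%Y"))

# Pearson's product-moment correlation with a two-sided test.
result <- cor.test(data$Individuals.Affected, year, method = "pearson")

result$estimate   # correlation coefficient
result$p.value    # p-value for the null hypothesis of zero correlation
result$conf.int   # 95 percent confidence interval
```

`cor.test()` drops incomplete pairs, so the degrees of freedom equal the number of complete observations minus 2.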
Which state has the highest number of breaches, and which has the lowest?
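One way to answer this is to count the rows per state, as sketched below. It assumes "occurrence" means the number of reported breaches rather than the number of individuals affected, and the state values are illustrative.

```r
# Illustrative data only.
data <- data.frame(State = c("TX", "CA", "TX", "NY", "CA", "TX", "WY"))

# Number of breaches per state, largest first.
state_counts <- sort(table(data$State), decreasing = TRUE)

# More than one state can share the highest or lowest count.
counts  <- as.vector(state_counts)
highest <- names(state_counts)[counts == max(counts)]
lowest  <- names(state_counts)[counts == min(counts)]

state_counts
highest
lowest
```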
[Figures not preserved: ggplot2 plots, including a box plot using Individuals.Affected and Covered.Entity.Type, a point plot and stacked plots. ggplot2 removed 1 row with a missing or non-finite value from each plot.]
## [1] 432 7
## [1] 482 7
[Figures not preserved: further ggplot2 point, count and box plots, each drawn after removing 1 row with a missing or non-finite value.]
Pearson’s correlation test showed that the number of individuals affected and the year in which the incident happened are not significantly correlated (correlation = 0.001447914, t = 0.074367, df = 2638, p-value = 0.9407).
The test’s alternative hypothesis is that the true correlation is not equal to 0. Because the 95 percent confidence interval (-0.03670303 to 0.03959464) contains 0 and the p-value is far above 0.05, we fail to reject the null hypothesis of no correlation.
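The figures in the test output are linked by standard formulas, so they can be re-derived from the correlation estimate and the degrees of freedom alone. The sketch below uses only the values reported above; the variable names are illustrative.

```r
# Values copied from the Pearson test output in this report.
r  <- 0.001447914
df <- 2638
n  <- df + 2          # number of complete observations used by the test

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# 95 percent confidence interval via the Fisher z-transformation.
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(atanh(r) + c(-1, 1) * half_width)

t_stat    # approximately 0.0744
p_value   # approximately 0.9407
ci        # approximately -0.0367 and 0.0396
```

The degrees of freedom imply 2,640 complete observations, one fewer than the 2,641 rows in the dataset, which is consistent with one row having a missing value.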