US Healthcare Data Breaches Between 2009 and 2019

Abstract

Data breaches in the US healthcare industry have been a persistent problem over the past decade. Between 2009 and 2019, numerous healthcare organisations experienced data breaches, resulting in the unauthorised access or disclosure of sensitive patient information. These breaches have had serious consequences for both patients and healthcare organisations, including financial losses, damage to reputation, and loss of trust.

A t-test was used to analyse the dataset of US healthcare data breaches between 2009 and 2019. This statistical technique compares the mean value of a variable in two groups of data to determine whether the groups differ significantly.

This project used t-test analysis to gain insight into the characteristics and consequences of healthcare data breaches, the relationships between variables, and the correlation between features.

Introduction

In this report, we explore the dataset titled ‘Biggest healthcare data breaches 2009-2019’ from the Privacy Affairs website, which we load into R under the name data for this analysis.

Additional documentation can be found here:
<https://data.world/zendoll27/biggest-healthcare-data-breaches-2009-2019>

The dataset includes 9 variables and 2,641 observations:
- ‘Name.of.Covered.Entity’
- ‘State’
- ‘Covered.Entity.Type’
- ‘Individuals.Affected’
- ‘Breach.Submission.Date’
- ‘Type.of.Breach’
- ‘Location.of.Breached.Information’
- ‘Business.Associate.Present’
- ‘Web.Description’

First, we load all the packages needed to explore and analyse the dataset, which was cleaned in Excel before being imported into R.

We import the data into R.
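
The import call itself is not shown in this report. As a minimal sketch, assuming the cleaned spreadsheet was exported to CSV, the step might look like the following. The temporary file and its two rows are made-up stand-ins, and only three of the nine columns are included to keep the example short.

```r
# Hypothetical stand-in for the cleaned export; the real file is not shown.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name of Covered Entity,State,Individuals Affected",
  "Example Clinic,TX,1200",
  "Example Health Plan,CA,560"
), csv_path)

# check.names = TRUE (the default) replaces spaces in headers with dots,
# which would yield names such as Individuals.Affected.
data <- read.csv(csv_path, check.names = TRUE, stringsAsFactors = FALSE)

# Inspect column names and types after import.
str(data)
```

Calling str() straight after the import is a quick way to confirm that Individuals.Affected was read as a number and not as text.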

Performing EDA

We compute descriptive statistics for the dataset.

Calculate the mean of Individuals.Affected for one category

Calculate the mean of two categories
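
The two mean calculations could be sketched as below. We assume here that a "category" means a value of Covered.Entity.Type; the report does not say which grouping column was used, and the rows are made up for illustration.

```r
# Made-up rows for illustration only; not taken from the dataset.
data <- data.frame(
  Covered.Entity.Type = c("Healthcare Provider", "Health Plan",
                          "Healthcare Provider", "Business Associate"),
  Individuals.Affected = c(1000, 500, 3000, NA),
  stringsAsFactors = FALSE
)

# Mean for a single category, skipping missing values.
is_provider <- data$Covered.Entity.Type == "Healthcare Provider"
provider_mean <- mean(data$Individuals.Affected[is_provider], na.rm = TRUE)

# Means for two categories at once, one value per category.
two_types <- c("Healthcare Provider", "Health Plan")
in_two <- data$Covered.Entity.Type %in% two_types
two_means <- tapply(data$Individuals.Affected[in_two],
                    data$Covered.Entity.Type[in_two],
                    mean, na.rm = TRUE)

print(provider_mean)
print(two_means)
```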

Calculate the median
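
A minimal sketch of the median calculation, using made-up values. The plotting warnings later in this report indicate that one row has a missing value, so na.rm = TRUE is assumed to be needed.

```r
# Made-up values standing in for data$Individuals.Affected.
affected <- c(1000, 500, 3000, NA, 750)

# Without na.rm = TRUE the result would be NA because of the missing entry.
median_affected <- median(affected, na.rm = TRUE)
print(median_affected)
```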

Calculate the column sums
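
Column sums only make sense for numeric columns, so one possible approach is to select those first. The data frame below is made up, and the year column is an assumed helper variable.

```r
# Made-up rows for illustration only.
data <- data.frame(
  State = c("TX", "CA", "NY"),
  Individuals.Affected = c(1000, NA, 3000),
  year = c(2010, 2015, 2019),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then sum each one, ignoring NAs.
numeric_cols <- sapply(data, is.numeric)
col_totals <- colSums(data[, numeric_cols, drop = FALSE], na.rm = TRUE)
print(col_totals)
```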

Calculate the maximum and minimum values
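
A short sketch of this step with made-up values. range() is shown as an alternative that returns both extremes in one call.

```r
# Made-up values standing in for data$Individuals.Affected.
affected <- c(1000, 500, 3000, NA, 750)

max_affected <- max(affected, na.rm = TRUE)
min_affected <- min(affected, na.rm = TRUE)

# range() returns c(minimum, maximum) in a single call.
affected_range <- range(affected, na.rm = TRUE)
print(affected_range)
```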

Calculate the interquartile range
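
The interquartile range is the distance between the 75th and 25th percentiles. A minimal sketch with made-up values, using R's default quantile method:

```r
# Made-up values standing in for data$Individuals.Affected.
affected <- c(1000, 500, 3000, NA, 750)

# IQR() is the 75th percentile minus the 25th percentile.
iqr_affected <- IQR(affected, na.rm = TRUE)

# The same value computed explicitly from the two quartiles.
quartiles <- quantile(affected, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_manual <- unname(quartiles[2] - quartiles[1])
print(iqr_affected)
```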

Calculate the mode by creating a mode function
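
Base R has no built-in function for the statistical mode, which is why a helper is needed. The function used in the analysis is not shown; the following is one possible version, with get_mode as a hypothetical name. It returns every value tied for the highest count.

```r
# Hypothetical helper: returns the most frequent value(s) in x.
get_mode <- function(x) {
  x <- x[!is.na(x)]                 # ignore missing values
  if (length(x) == 0) return(x)     # nothing to count
  values <- unique(x)               # distinct values, in order of appearance
  counts <- tabulate(match(x, values))
  values[counts == max(counts)]     # keep all values tied for the top count
}

# Made-up examples.
state_mode <- get_mode(c("TX", "CA", "TX", "NY", NA))
tied_mode  <- get_mode(c(1, 2, 2, 3, 3))
print(state_mode)
print(tied_mode)
```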

Correlation Test

## 
##  Pearson's product-moment correlation
## 
## data:  data$Individuals.Affected and year
## t = 0.074367, df = 2638, p-value = 0.9407
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03670303  0.03959464
## sample estimates:
##         cor 
## 0.001447914
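
The output above names its inputs as data$Individuals.Affected and year, so year appears to be a separate vector. A sketch of how such a test could be set up follows. We assume year was extracted from Breach.Submission.Date and that dates are stored as month/day/year; neither is confirmed by the report, and the rows are made up.

```r
# Made-up rows for illustration only.
data <- data.frame(
  Breach.Submission.Date = c("01/15/2010", "06/30/2012", "11/02/2015",
                             "03/21/2017", "09/09/2019"),
  Individuals.Affected = c(1200, 560, 48000, 900, 15000),
  stringsAsFactors = FALSE
)

# Assumed date format; adjust to match the cleaned file.
dates <- as.Date(data$Breach.Submission.Date, format = "%m/%d/%Y")
year <- as.numeric(format(dates, "%Y"))

# Pearson's product-moment correlation test (two-sided by default).
result <- cor.test(data$Individuals.Affected, year, method = "pearson")
print(result)
```

The elements result$estimate, result$p.value, and result$conf.int hold the correlation, p-value, and confidence interval shown in the printed output.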

Research questions

Is there a correlation between the year and the number of individuals affected?
Which state has the highest number of breaches, and which has the lowest?
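
The state question can be answered with a frequency table. A minimal sketch with made-up state codes; ties are returned together.

```r
# Made-up values standing in for data$State.
state <- c("TX", "CA", "TX", "NY", "CA", "TX", NA)

# table() drops NA by default; sort so the most frequent state comes first.
state_counts <- sort(table(state), decreasing = TRUE)

most_breaches  <- names(state_counts)[state_counts == max(state_counts)]
least_breaches <- names(state_counts)[state_counts == min(state_counts)]
print(state_counts)
```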

Hypothesis Testing

[Plots omitted. The ggplot2 warnings from this step note that `data$Individuals.Affected` and `data$Covered.Entity.Type` were referenced directly in the plot aesthetics, and that one row with a missing value was dropped from the boxplot, the point plot, and the stacked plots.]

## [1] 432   7

## [1] 482   7
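
The two dimension outputs above suggest that two subsets of the data were compared, but the report does not show which groups they were. As a sketch only, assuming the groups are two values of Covered.Entity.Type and using made-up numbers, a two-sample t-test could be run as follows.

```r
# Made-up rows; the real groups (432 and 482 rows) are not identified in the report.
data <- data.frame(
  Covered.Entity.Type = rep(c("Health Plan", "Business Associate"), each = 5),
  Individuals.Affected = c(520, 610, 1500, 980, 700,
                           2300, 640, 4100, 1250, 890),
  stringsAsFactors = FALSE
)

# Split into the two groups and check their dimensions.
group_a <- subset(data, Covered.Entity.Type == "Health Plan")
group_b <- subset(data, Covered.Entity.Type == "Business Associate")
print(dim(group_a))
print(dim(group_b))

# t.test() defaults to Welch's test, which does not assume equal variances.
result <- t.test(group_a$Individuals.Affected, group_b$Individuals.Affected)
print(result)
```

A p-value below the chosen significance level (commonly 0.05) would indicate that the two group means differ significantly.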

[Plots omitted. One row with a missing value was again dropped from a point plot, a count plot, and a boxplot.]

Conclusion

Pearson’s correlation test showed no significant correlation between the number of individuals affected by an incident and the year in which the incident occurred (correlation = 0.001447914, p-value = 0.9407).

The test was run against the two-sided alternative hypothesis that the true correlation is not equal to 0. The 95 percent confidence interval, -0.03670303 to 0.03959464, contains 0, so we fail to reject the null hypothesis of no correlation.
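
The reported figures can be re-derived from the correlation estimate and the degrees of freedom alone, which is a useful consistency check. The sketch below uses only values printed in the test output. The sample size n = df + 2 follows from the Pearson test, and the interval uses the Fisher z-transformation, the method cor.test applies for Pearson's correlation.

```r
# Values taken from the correlation test output in this report.
r  <- 0.001447914
df <- 2638
n  <- df + 2            # the Pearson test has n - 2 degrees of freedom

# Test statistic: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# 95 percent confidence interval via the Fisher z-transformation.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

print(t_stat)
print(p_value)
print(ci)
```

Because the interval spans 0 and the p-value is far above 0.05, both calculations point to the same conclusion.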