Data Analytics in R-language (IFI7360.DT) Faculty of Digital Technologies, Tallinn University Tallinn, Estonia.
Waste generation is a topic that is pertinent to 21st-century global challenges. Broadly, the waste that is generated can be classified as either hazardous or non-hazardous. Hazardous waste can pose serious threats to humanity and the environment and should therefore be monitored, but non-hazardous waste should also be kept in check to curtail its excesses and maintain the sustainability of the ecosystem. This project focuses on Estonia, where the various economic activities that generate waste are recorded.
Keywords: waste, generation, management, hazardous, non-hazardous.
As the population continues to grow and industrialization and economic activity increase, there is growing pressure from the waste that is being generated. There is a need to properly analyse these economic activities and the kinds of waste they release into the environment, and to prioritise the principles needed for global environmental sustainability and safety. For the purpose of this project, we narrow the analysis down to the manufacture of coke and refined petroleum products and test the kinds of waste generated by this activity. In addition, we test whether there is a correlation between two economic activities: the manufacture of coke and refined petroleum products and the manufacture of wood and wood products.
The dataset used for this project covers a 12-year span (2008 to 2020) and records the kind of waste generated (hazardous or non-hazardous) and the various economic activities that generate it. The data is open source and easy to access. The dataset is not large and was obtained from KK068: WASTE GENERATION by Year, Kind of waste and Economic activity (EMTAK 2008). Statistical database.
To carry out the two-sample test, we coded non-hazardous waste as 1 and hazardous waste as 2 in the column “Kind of waste”. Kind of waste is therefore the binary grouping (independent) variable, and the quantity of waste from the manufacture of coke and refined petroleum products is the continuous dependent variable.
We performed exploratory data analysis on the data (waste generation by year, kind of waste and economic activity, EMTAK 2008) to understand and visualize what kind of data we are working with, using functions such as summary, head and tail together with histograms, the standard deviation and the variance. The data was extracted from the Estonian statistics website and comprises 14 observations and 12 variables: the year, the kind of waste (a nominal categorical variable with the categories hazardous and non-hazardous) and 10 economic activities holding continuous data. It covers the years 2008 to 2020, the kinds of waste generated and the quantities generated over this period.
For the purpose of this project we concentrate on one economic activity, the manufacture of coke and refined petroleum products, which generates both hazardous and non-hazardous waste, and on the years in which this waste was generated.
We carry out a two-sample t-test to determine whether there is a significant difference between the mean quantities of the two kinds of waste generated during the manufacture of coke and refined petroleum products in the years 2008 to 2020.
The derived dataset consists of two variables: kind of waste, a categorical variable with the two categories non-hazardous and hazardous, and the economic activity manufacture of coke and refined petroleum products, which holds quantitative continuous data.
RESEARCH QUESTION 1: Is there a correlation between the amount of waste generated in the manufacture of coke and refined petroleum products and the amount of waste generated in the manufacture of wood and wood products?
RESEARCH QUESTION 2: Is there a difference between the means of the two samples, by kind of waste (hazardous and non-hazardous), in the manufacture of coke and refined petroleum products within the years 2008 to 2020?
To test for a relationship between two economic activities, we used the manufacture of coke and refined petroleum products and the manufacture of wood and wood products. The test gave a correlation coefficient of -0.6123879 with 12 degrees of freedom and a p-value of 0.01991. This indicates a moderately strong negative linear correlation between the two economic activities.
## y m
## 1 78837 1058691
## 2 1971916 267
## 3 63742 711409
## 4 2402240 326
## 5 182123 713300
## 6 2799525 1065
## [1] -0.61
## [1] -0.6123879
##
## Pearson's product-moment correlation
##
## data: dirt.data$`Manufacture of coke and refined petroleum products` and dirt.data$`Manufacture of wood and wood products`
## t = -2.6834, df = 12, p-value = 0.01991
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8626687 -0.1211833
## sample estimates:
## cor
## -0.6123879
Null hypothesis: the population correlation coefficient is equal to 0. There is no linear relationship between manufacture of wood and wood products and manufacture of coke and refined petroleum products.
Alternative hypothesis: the population correlation coefficient is not equal to 0. There is a linear relationship between manufacture of wood and wood products and manufacture of coke and refined petroleum products.
Conclusion: because the p-value (0.01991) is below the 0.05 significance level, we reject the null hypothesis in favour of the alternative hypothesis that there is a significant negative linear correlation (r = -0.61) between the two activities. It should be noted that this does not mean that the manufacture of wood and wood products depends on the manufacture of coke and refined petroleum products, or vice versa.
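As a quick consistency check on the reported output, the t statistic and p-value of the Pearson test can be recovered from the correlation coefficient and the degrees of freedom alone. The sketch below uses only the numbers printed by the test; the variable names are illustrative.

```r
# Values reported by the correlation test in this report
r  <- -0.6123879   # sample Pearson correlation coefficient
df <- 12           # degrees of freedom (n - 2 with n = 14 observations)

# t statistic for the null hypothesis that the population correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

# Proportion of variance the two series share linearly
r_squared <- r^2

print(round(c(t = t_stat, p = p_value, r_squared = r_squared), 4))
```

The squared coefficient (about 0.375) is one way to judge the strength of the relationship alongside the p-value.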
“y” represents the data for cokepetrol and “z” represents the binary values for the “wastetype”
We create a new tibble called Coke, in which kind of waste and manufacture of coke and refined petroleum products are represented as “wastetype” and “cokepetrol” respectively.
Renaming the columns: Coke <- Coke %>% rename(wastetype = `Kind of waste`, cokepetrol = `Manufacture of coke and refined petroleum products`)
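For readers without the tidyverse installed, the renaming step can be sketched in base R. This is an illustrative equivalent, not the code used in the report: the data frame is rebuilt by hand from the 14 rows printed in this document, and a plain data frame stands in for the tibble.

```r
# Values transcribed from the tibble printed in this report
# (1 = non-hazardous, 2 = hazardous)
Coke <- data.frame(
  `Kind of waste` = rep(c(1, 2), times = 7),
  `Manufacture of coke and refined petroleum products` = c(
    78837, 1971916, 63742, 2402240, 182123, 2799525, 107206,
    3308789, 4605504, 3511779, 248912, 3275992, 2190379, 1295697
  ),
  check.names = FALSE  # keep the original column names, spaces included
)

# Base-R equivalent of rename(): replace each long name with a short one
names(Coke)[names(Coke) == "Kind of waste"] <- "wastetype"
names(Coke)[names(Coke) ==
  "Manufacture of coke and refined petroleum products"] <- "cokepetrol"

print(head(Coke))
```

Column names that contain spaces must be wrapped in backticks wherever they are used as R identifiers, which is why the short names are more convenient for the rest of the analysis.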
## # A tibble: 14 × 2
## wastetype cokepetrol
## <dbl> <dbl>
## 1 1 78837
## 2 2 1971916
## 3 1 63742
## 4 2 2402240
## 5 1 182123
## 6 2 2799525
## 7 1 107206
## 8 2 3308789
## 9 1 4605504
## 10 2 3511779
## 11 1 248912
## 12 2 3275992
## 13 1 2190379
## 14 2 1295697
We replace the numeric codes in “wastetype” with descriptive labels and view the structure and content of the new tibble. The tibble has 2 variables, “wastetype” and “cokepetrol”, and 14 observations with values for non-hazardous and hazardous waste.
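A minimal base-R sketch of this recoding and inspection step follows. It assumes the coding described earlier (1 = non-hazardous, 2 = hazardous) and rebuilds the data from the values printed in this report; a plain data frame is used in place of the tibble.

```r
# Data as printed in this report, with numeric waste-type codes
Coke <- data.frame(
  wastetype  = rep(c(1, 2), times = 7),
  cokepetrol = c(
    78837, 1971916, 63742, 2402240, 182123, 2799525, 107206,
    3308789, 4605504, 3511779, 248912, 3275992, 2190379, 1295697
  )
)

# Replace the numeric codes with readable labels
Coke$wastetype <- ifelse(Coke$wastetype == 1, "Nonhazardous", "Hazardous")

print(summary(Coke))        # five-number summary and mean of cokepetrol
print(dim(Coke))            # number of rows and columns
str(Coke)                   # column types and first values
print(sd(Coke$cokepetrol))  # overall standard deviation
```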
## wastetype cokepetrol
## Length:14 Min. : 63742
## Class :character 1st Qu.: 198820
## Mode :character Median :2081148
## Mean :1860189
## 3rd Qu.:3156875
## Max. :4605504
## [1] 14 2
## tibble [14 × 2] (S3: tbl_df/tbl/data.frame)
## $ wastetype : chr [1:14] "Nonhazardous" "Hazardous" "Nonhazardous" "Hazardous" ...
## $ cokepetrol: num [1:14] 78837 1971916 63742 2402240 182123 ...
## [1] 1540961
## # A tibble: 14 × 2
## wastetype cokepetrol
## <chr> <dbl>
## 1 Nonhazardous 78837
## 2 Hazardous 1971916
## 3 Nonhazardous 63742
## 4 Hazardous 2402240
## 5 Nonhazardous 182123
## 6 Hazardous 2799525
## 7 Nonhazardous 107206
## 8 Hazardous 3308789
## 9 Nonhazardous 4605504
## 10 Hazardous 3511779
## 11 Nonhazardous 248912
## 12 Hazardous 3275992
## 13 Nonhazardous 2190379
## 14 Hazardous 1295697
The standard deviation for manufacture of coke and refined petroleum products is 1,540,961.
To view the distribution of our newly named variable “cokepetrol”, we create a histogram. From the histogram of “cokepetrol” we can see that the data is spread fairly evenly across its range, with a couple of higher values.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
At the same time, both the Q-Q plot and the boxplot for the “cokepetrol” variable show data points far out at both ends, meaning that some values are very high and others quite low. The highest value is 4,605,504 and the lowest is 63,742, a difference of 4,541,762; both values were taken from the summary of Coke.
To create a boxplot for “cokepetrol”
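The numbers behind these plots can be inspected without drawing anything, which is a hedged, base-R alternative to the plotting calls used in the report. The bin width of 1,000,000 below is an assumption chosen for illustration (the report's histogram used the default 30 bins), and the values are transcribed from the tibble printed earlier.

```r
cokepetrol <- c(
  78837, 1971916, 63742, 2402240, 182123, 2799525, 107206,
  3308789, 4605504, 3511779, 248912, 3275992, 2190379, 1295697
)

# Histogram counts with an explicit bin width of 1,000,000
# (plot = FALSE returns the bins without drawing; drop it to see the plot)
h <- hist(cokepetrol, breaks = seq(0, 5e6, by = 1e6), plot = FALSE)
print(h$counts)

# Statistics behind the boxplot: whisker ends, hinges and median,
# plus any points lying beyond the default 1.5 x IQR whiskers
b <- boxplot.stats(cokepetrol)
print(b$stats)
print(b$out)

# Distance between the largest and smallest value, as quoted in the text
value_range <- diff(range(cokepetrol))
print(value_range)

# Theoretical and sample quantiles for a normal Q-Q plot
# (plot.it = FALSE skips drawing)
qq <- qqnorm(cokepetrol, plot.it = FALSE)
```

Setting the breaks explicitly also avoids the default-bins message that accompanies the histogram output in this report.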
The table below presents the means and standard deviations of cokepetrol for both groups, Nonhazardous and Hazardous. The mean and standard deviation for hazardous waste are 2,652,277 and 810,984.3 respectively, and those for non-hazardous waste are 1,068,100 and 1,738,747.4 respectively.
## # A tibble: 2 × 4
## wastetype N Mean SD
## <chr> <int> <dbl> <dbl>
## 1 Hazardous 7 2652277. 810984.
## 2 Nonhazardous 7 1068100. 1738747.
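A table like the one above can be sketched in base R by splitting the values by group. This is an illustrative equivalent rather than the report's own code; the data is transcribed from the tibble printed earlier and `group_stats` is a name chosen here.

```r
Coke <- data.frame(
  wastetype  = rep(c("Nonhazardous", "Hazardous"), times = 7),
  cokepetrol = c(
    78837, 1971916, 63742, 2402240, 182123, 2799525, 107206,
    3308789, 4605504, 3511779, 248912, 3275992, 2190379, 1295697
  )
)

# Split the quantities by waste type (groups come out in alphabetical order)
groups <- split(Coke$cokepetrol, Coke$wastetype)

# One row per group: size, mean and standard deviation
group_stats <- data.frame(
  wastetype = names(groups),
  N         = sapply(groups, length),
  Mean      = sapply(groups, mean),
  SD        = sapply(groups, sd),
  row.names = NULL
)

print(group_stats)
```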
Checking the means
The mean for non-hazardous waste is 1,068,100, while the mean for hazardous waste is 2,652,277, which is noticeably higher than that of the non-hazardous waste produced by the manufacture of coke and refined petroleum products. Whether this difference is statistically significant is tested below.
To run the independent t-test we first check its assumptions:
- Assumption 1: are the two samples independent? Yes, since the samples from Nonhazardous and Hazardous are not related.
- Assumption 2: does the data from each of the two groups follow a normal distribution? To check this assumption, we produce histograms of the dependent variable for each level of the independent variable. Both histograms are approximately normally distributed, so we treat the assumption as met.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##
## F test to compare two variances
##
## data: cokepetrol by wastetype
## F = 0.21755, num df = 6, denom df = 6, p-value = 0.08569
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.03738067 1.26606704
## sample estimates:
## ratio of variances
## 0.2175464
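The F-test output above can be reproduced with base R's `var.test()`, and the statistic can be checked by hand as the ratio of the two group variances. The sketch below rebuilds the data from the values printed in this report; `f_res` and `f_manual` are illustrative names.

```r
Coke <- data.frame(
  wastetype  = rep(c("Nonhazardous", "Hazardous"), times = 7),
  cokepetrol = c(
    78837, 1971916, 63742, 2402240, 182123, 2799525, 107206,
    3308789, 4605504, 3511779, 248912, 3275992, 2190379, 1295697
  )
)

# F-test for equality of the two group variances
f_res <- var.test(cokepetrol ~ wastetype, data = Coke)
print(f_res)

# The same statistic by hand: variance of the first group (alphabetically,
# Hazardous) divided by the variance of the second (Nonhazardous)
v <- tapply(Coke$cokepetrol, Coke$wastetype, var)
f_manual <- v[["Hazardous"]] / v[["Nonhazardous"]]
print(f_manual)
```

A p-value above the chosen significance level means the test does not detect a difference between the variances; with only 7 observations per group, such a test has limited power.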
The p-value of the F-test is p = 0.08569, which is greater than the significance level alpha = 0.05. We therefore fail to reject the hypothesis that the two variances are equal: although the sample standard deviations differ noticeably (810,984.3 for hazardous and 1,738,747.4 for non-hazardous waste), the difference is not statistically significant, so the assumption of equal variances is treated as met. We therefore use the classic t-test, which assumes equality of the two variances, to analyse the two samples:
Null hypothesis: the means for the two samples are equal.
Alternative hypothesis: the means for the two samples are not equal.
##
## Two Sample t-test
##
## data: cokepetrol by wastetype
## t = 2.1846, df = 12, p-value = 0.04948
## alternative hypothesis: true difference in means between group Hazardous and group Nonhazardous is not equal to 0
## 95 percent confidence interval:
## 4201.649 3164151.208
## sample estimates:
## mean in group Hazardous mean in group Nonhazardous
## 2652277 1068100
In the result above:
- t is the t-test statistic value (t = 2.1846),
- df is the degrees of freedom (df = 12),
- p-value is the p-value of the t-test (p-value = 0.04948),
- conf.int is the 95% confidence interval for the difference between the two group means (conf.int = [4201.649, 3164151.208]),
- sample estimates are the mean values of the two groups (mean = 2652277 for hazardous, 1068100 for non-hazardous).
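The quantities listed above can be reproduced with base R's `t.test()`. The sketch below assumes the data as printed in this report and sets `var.equal = TRUE` to request the classic pooled-variance test; the Welch variant, which is R's default and does not assume equal variances, is included only for comparison.

```r
Coke <- data.frame(
  wastetype  = rep(c("Nonhazardous", "Hazardous"), times = 7),
  cokepetrol = c(
    78837, 1971916, 63742, 2402240, 182123, 2799525, 107206,
    3308789, 4605504, 3511779, 248912, 3275992, 2190379, 1295697
  )
)

# Classic two-sample t-test with pooled variance
t_pooled <- t.test(cokepetrol ~ wastetype, data = Coke, var.equal = TRUE)
print(t_pooled)

# Individual components referred to in the list above
print(t_pooled$statistic)  # t
print(t_pooled$parameter)  # df
print(t_pooled$p.value)    # p-value
print(t_pooled$conf.int)   # 95% CI for the difference in means
print(t_pooled$estimate)   # group means

# Welch's t-test (no equal-variance assumption), for comparison
t_welch <- t.test(cokepetrol ~ wastetype, data = Coke)
print(t_welch)
```

Because the two groups have the same size, the Welch test gives the same t statistic as the pooled test but uses fewer degrees of freedom, so comparing the two p-values is a simple check on how sensitive the conclusion is to the equal-variance assumption.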
The p-value of the test is 0.04948, which is only slightly below the significance level alpha = 0.05. We can conclude that the average amount of hazardous waste produced by the manufacture of coke and refined petroleum products is statistically significantly different from the average amount of non-hazardous waste produced, although the result is borderline.
Hence, we reject the null hypothesis that the means of the two samples are equal, in favour of the alternative hypothesis that they are not. The practical significance of this effect can be seen in the real world in how the hazardous waste produced by the manufacture of coke and refined petroleum products can have an adverse effect on the ecosystem in general.