Derivation of Waste Management Profiles (WIP)

Disclaimer

Most recent code can be found below. .R files might be outdated.

Todos:

references: ~~bartlett test~~, ~~kmo~~, pca, fa, ~~hopkins~~
clustering: ~~diana~~, ~~k-means~~, ~~daisy~~
profiling
recommendations

Context

Although only about 15% of all waste within the EU is generated as municipal waste¹, the absolute figures pose a major problem for municipalities, waste management companies and the environment. 225.7 million tonnes of municipal waste were collected in the EU in 2020, of which only 68 million tonnes were directly recycled, with the remainder going into long-term landfill or being incinerated for energy generation. In view of the climate-damaging landfill gases produced during storage or CO2 emissions during incineration, combined with the problem of the large amount of space required, the EU's goal is to constantly optimise its waste management. This is intended to promote the production of less waste, a stronger circular economy and the economic efficiency of waste management. In the context of this optimisation, we want to work out a status quo of municipal waste management in Italian municipalities, on which subsequent optimisation projects can build. For this purpose, we base our work on a data set on the waste management of a total of 4341 Italian municipalities. With the help of these data, we are to draw up profiles of the municipalities, which we can cluster them with regard to their descriptive characteristics, in particular the key figures of waste management, but also geographical and economic factors.

(click to show/hide the characteristics of the data set)

ID	Column name	Explanation
1	ID	Unique identification number of the municipality.
2	Region	Region in which the municipality is located.
3	Provinz	Province in which the municipality is located.
4	Gemeinde	Name of the municipality.
5	Flaeche	Area of the municipality in km².
6	Bevoelkerung	Population of the municipality.
7	Bevoelkerungsdichte	Population of the municipality per km².
8	Strassen	Total length of all roads in the municipality in km.
9	Invelgemeinde	Indicator whether municipality is on an island (1 yes, 0 no).
10	Kuestengemeinde	Indicator whether municipality is on the coast (1 yes, 0 no).
11	Urbanisierungsgrad	Degree of municipality urbanisation (1 low, 3 high).
12	Geologischer_Indikator	Geological location of the municipality (1 south, 2 central, 3 north).
13	Abfaelle_gesamt	Total amount of municipal waste in kilotonnes (kt).
14	Abfaelle_sortiert	Municipal waste sorted in kilotonnes (kt).
15	Abfaelle_unsortiert	Unsorted municipal waste in kilotonnes (kt).
16	Sortierungsgrad	Percentage of sorted municipal waste in total.
17	Sort_Bio	Proportion of organic waste among the sorted.
18	Sort_Papier	Proportion of paper waste sorted.
19	Sort_Glas	Percentage of glass waste sorted.
20	Sort_Holz	Proportion of wood waste among those sorted.
21	Sort_Metall	Proportion of metal waste among the sorted.
22	Sort_Plastik	Proportion of plastic waste among the sorted.
23	Sort_Elektrik	Proportion of electrical waste among those sorted.
24	Sort__Textil	Proportion of textile waste among those sorted.
25	Sort_Rest	Proportion of residual waste among those sorted.
26	Verwendung_Energie	Proportion of waste that is used for energy recovery.
27	Verwendung_Deponie	Proportion of waste that is landfilled.
28	Verwendung_Recycling	Proportion of waste that is recycled.
29	Verwendung_Unbekannt	Proportion of waste whose further use is unknown.
30	Steuern_gewerblich	Per capita Tax revenue from commercial sources.
31	Steuern_privat	Per capita tax revenue from private sources.
32	Kosten_Basis	Per capita basic costs of waste management.
33	Kosten_Sortierung	Per capita sorting costs of waste management companies.
34	Kosten_sonstiges	Per capita other costs of waste management.
35	Gebuehrenregelung	Waste fee regulation of the municipality.
36	Region_PAYT	Indicator whether region offers PAYT (pay as you throw).

Exploratory Data Analysis

Get an overview of the data set.

wm_df <- load2("data/waste_management.RData")
skimr::skim(wm_df)
#wm_df %>% complete.cases() %>% sum()

(click to show/hide console output)

── Data Summary ────────────────────────
                           Values
Name                       wm_df 
Number of rows             4341  
Number of columns          36    
_______________________          
Column type frequency:           
  character                5     
  numeric                  31    
________________________         
Group variables            None  
── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate min max empty n_unique whitespace
1 Region                    0         1       5  21     0       20          0
2 Provinz                   0         1       4  21     0      102          0
3 Gemeinde                  6         0.999   2  60     0     4333          0
4 Gebuehrenregelung         0         1       4   8     0        2          0
5 Region_PAYT               0         1       2   4     0        2          0

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────
   skim_variable          n_missing complete_rate        mean         sd      p0      p25      p50      p75      p100 hist 
 1 ID                             0         1     47470.      30090.     1272    18135    42015    70049     111107   ▇▃▅▃▃
 2 Flaeche                        6         0.999    41.0        56.8       0.12    10.8     22.7     47.5     1287.  ▇▁▁▁▁
 3 Bevoelkerung                   0         1     10204.      53426.       34     1579     3535     8199    2617175   ▇▁▁▁▁
 4 Bevoelkerungsdichte            6         0.999   405.        771.        2.48    62.6    151.     399.     12123.  ▇▁▁▁▁
 5 Strassen                     443         0.898   102.        310.        1       25       51      105      14970   ▇▁▁▁▁
 6 Inselgemeinde                  6         0.999     0.00507     0.0711    0        0        0        0          1   ▇▁▁▁▁
 7 Kuestengemeinde                6         0.999     0.168       0.374     0        0        0        0          1   ▇▁▁▁▂
 8 Urbanisierungsgrad             6         0.999     2.49        0.595     1        2        3        3          3   ▁▁▆▁▇
 9 Geologischer_Indikator       285         0.934     2.29        0.888     1        1        3        3          3   ▃▁▂▁▇
10 Abfaelle_gesamt                0         1         5.31       32.5       0.02     0.61     1.52     3.95    1692.  ▇▁▁▁▁
11 Abfaelle_sortiert              0         1         3.25       15.6       0        0.37     1.04     2.73     765.  ▇▁▁▁▁
12 Abfaelle_unsortiert            0         1         2.04       17.6       0.01     0.18     0.41     1.06     927.  ▇▁▁▁▁
13 Sortierungsgrad                0         1        60.1        19.8       0       44.3     64.3     76.5       97.5 ▁▃▅▇▅
14 Sort_Bio                     511         0.882    22.3        12.7       0.01    11.1     25.0     31.8       61.6 ▅▅▇▂▁
15 Sort_Papier                   25         0.994    11.0         3.88      0        8.66    10.9     13.1       45.3 ▃▇▁▁▁
16 Sort_Glas                     33         0.992     9.41        3.71      0        7.15     9.1     11.3       39.8 ▅▇▁▁▁
17 Sort_Holz                   1095         0.748     4.11        2.72      0        2.08     4.02     5.71      25.1 ▇▃▁▁▁
18 Sort_Metall                  246         0.943     1.76        1.35      0        0.88     1.54     2.35      20.7 ▇▁▁▁▁
19 Sort_Plastik                  39         0.991     6.11        3.26      0        4.13     5.79     7.55      31.6 ▇▆▁▁▁
20 Sort_Elektrik                314         0.928     1.23        0.821     0        0.78     1.18     1.57      18.0 ▇▁▁▁▁
21 Sort_Textil                 1013         0.767     0.757       0.688     0        0.35     0.63     0.99      10.6 ▇▁▁▁▁
22 Sort_Rest                    136         0.969     7.94        5.15      0.03     3.96     7.13    11.1       37.2 ▇▆▂▁▁
23 Verwendung_Energie             0         1        20.3        15.7       0        5.63    18.5     38.5       55.1 ▇▃▂▇▁
24 Verwendung_Deponie             0         1        17.9        19.5       0        4.55    11.3     31.5       76.7 ▇▁▂▁▁
25 Verwendung_Recycling           0         1        41.3        12.8       2.35    32.7     43.2     51.6       76.7 ▁▅▇▇▁
26 Verwendung_Unbekannt           0         1        20.5        17.8       0        5.21    17.8     31.0       97.4 ▇▅▂▁▁
27 Steuern_gewerblich           386         0.911 16565.      14132.     4180.    9080.   12465.   19413.    377492.  ▇▁▁▁▁
28 Steuern_privat               285         0.934 13211.       3649.     2606.   10162.   13668.   15759.     35770.  ▂▇▃▁▁
29 Kosten_Basis                   0         1       154.         76.1      25.7    108.     137.     179.       977.  ▇▁▁▁▁
30 Kosten_Sortierung             67         0.985    52.7        33.1       3.39    31.3     48.9     66.4      582.  ▇▁▁▁▁
31 Kosten_sonstiges              52         0.988    54.2        43.2       4.27    27.3     41.7     66.5      670.  ▇▁▁▁▁

There are quite a lot of missing values. Omitting all of them would mean a loss of more than 50 % of the data. Create some plots to enhance the understanding of the data set:

wm_df %>% na.omit %>% 
  ggplot(aes(x=Region, y=Abfaelle_gesamt)) +
  ggtitle("Waste by Region") +
  geom_boxplot(aes(fill = Region), outlier.shape = 2,
               outlier.colour = "black",
               outlier.alpha = .5) +
  theme(aspect.ratio = 0.5) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5))

A lot of outliers in total amounts of waste per community. I am going to take care of them later. Til then they will be hidden, so other values appear less compressed.

wm_df %>% na.omit %>% 
  ggplot(aes(x=Region, y=Abfaelle_gesamt)) +
  ggtitle("Waste by Region") +
  geom_boxplot(aes(fill = Region), outlier.shape = NA) +
  theme(aspect.ratio = 0.5) +
  coord_flip() +
  coord_fixed(ylim = c(0, 29)) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5))

wm_df %>% na.omit %>% 
  mutate(Geologischer_Indikator = ifelse(Geologischer_Indikator == 1, "South", ifelse(Geologischer_Indikator == 2, "Middle", "North"))) %>% 
  ggplot(aes(x=Geologischer_Indikator, y=Abfaelle_gesamt)) +
  ggtitle("Waste by geological location") +
  geom_boxplot(aes(fill = Geologischer_Indikator), outlier.shape = NA) +
  coord_cartesian(ylim = quantile(wm_df$Abfaelle_gesamt, c(0, 0.97))) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5))

wm_df %>%
  na.omit %>%
  mutate(Urbanisierungsgrad =
           ifelse(Urbanisierungsgrad == 1, "low",
                  ifelse(Urbanisierungsgrad == 2, "mid",
                         "high")
  )) %>% 
  ggplot(aes(y=Abfaelle_gesamt)) +
  ggtitle("Waste by population & urbanisation degree") +
  geom_point(aes(x=Bevoelkerung, colour = Urbanisierungsgrad)) +
  theme(plot.title = element_text(hjust = 0.5))

wm_df %>% na.omit %>% 
  mutate(Urbanisierungsgrad = ifelse(Urbanisierungsgrad == 1, "Urbanization low",
                                     ifelse(Urbanisierungsgrad == 3, "Urbanization high",
                                            "Urbanization mid"))) %>% 
  ggplot(aes(x=Urbanisierungsgrad, y=Abfaelle_gesamt)) +
  ggtitle("Mean waste by urbanisation degree") +
  stat_summary(aes(group = Urbanisierungsgrad, fill = Urbanisierungsgrad), fun = mean, geom = "bar") +
  labs(y = "mean(Abfaelle_gesamt)") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5))

Taking a look at correlations

wm_df %>% 
  select_if(is.numeric) %>% 
  na.omit %>% 
  cor(use = "pairwise.complete.obs") %>% 
  ggcorrplot::ggcorrplot(lab = F,
                         colors = c("red", "white", "blue"),
                         tl.cex = 10)

Inspecting missing values within the waste sorting related columns:

sapply(wm_df[,16:25],
       function(y) sum(length(which(is.na(y))))) %>%
  tibble(Column = names(.),
         NA_Count = .)

(click to show/hide console output)

# A tibble: 10 × 2
   Column          NA_Count
   <chr>              <int>
 1 Sortierungsgrad        0
 2 Sort_Bio             511
 3 Sort_Papier           25
 4 Sort_Glas             33
 5 Sort_Holz           1095
 6 Sort_Metall          246
 7 Sort_Plastik          39
 8 Sort_Elektrik        314
 9 Sort_Textil         1013
10 Sort_Rest            136

Sort_NAs <- c()
for (i in 17:25) {
  nas_t <- which(is.na(wm_df[i]))
  Sort_NAs <- c(Sort_NAs, nas_t)
}

Sort_NAs_table <- Sort_NAs %>% table %>% sort(decreasing = T)
Sort_NAs_table[1:5] %>% names %>% as.numeric -> t

wm_df[t,16:25]

(click to show/hide console output)

     Sortierungsgrad Sort_Bio Sort_Papier Sort_Glas Sort_Holz Sort_Metall Sort_Plastik Sort_Elektrik Sort_Textil Sort_Rest
429             0.72       NA        0.72        NA        NA          NA           NA            NA          NA        NA
3572            0.70       NA          NA        NA        NA          NA           NA           0.7          NA        NA
4175            0.00       NA          NA        NA        NA          NA           NA           0.0          NA        NA
3777           59.32       NA          NA     39.84        NA          NA           NA            NA          NA     19.48
4017            1.87       NA          NA        NA        NA          NA         0.62            NA          NA      1.25

Looks like the sum of the values present in these waste sorting columns equals the value in Sortierungsgrad. Imputing any values here would completely destroy the logic behind these columns. I would argue that dropping all these values is a viable option. However, one could also say that replacing these NAs with zeros instead might work out as well. The latter option is the one I chose.

Dimension Reduction

Within the data set there are dimensions that hold no value when it comes to any analyses. ID is a unique identifier, Gemeinde the name of a community, Strassen contains more than 10 % missing values, Region and Provinz contain too many unique values that would complicate the process a lot. Also the importance of the information they hold is questionable.

cols_to_exclude <- c("ID",
                     "Gemeinde",
                     "Strassen",
                     "Region",
                     "Provinz")

A vector containing the names of the columns mentioned is created. A recipe from the tidyverse is used to remove these dimensions, replace missing values via bag imputation and replace outliers in numeric dimension with the IQR limits of their individual column and replace the remaining nominal columns with dummy variables.

Note: step_impute_constant() and step_outliers_iqr_to_limits() are functions from the steffanossaR package found at https://github.com/steffanossa/steffanossaR.

recipe_prep <- recipe(~., data = wm_df) %>% 
  step_rm(all_of(cols_to_exclude)) %>% 
  step_impute_constant(contains("Sort_"), constant = 0) %>% 
  step_impute_bag(all_numeric()) %>% 
  step_outliers_iqr_to_limits(all_numeric(), -ends_with("gemeinde") %>% 
  step_other(all_nominal(), threshold = .001, other = NA) %>% 
  step_naomit(everything(), skip = T) %>% 
  step_dummy(all_nominal()) %>% 
  step_zv(everything())

#| include: false
set.seed(187)
wm_df_prepped <-
  recipe_prep %>%
  prep %>%
  bake(new_data = NULL)

Principal Components Analysis (PCA)

A PCA is used to reduce the number of variables by finding principal components of the data, which are new and uncorrelated variables that can explain the most variance in the original data.

First, the Bartlett Test is done. The Bartlett test verifies the null hypothesis that the correlation matrix is equal to the identity matrix, meaning that the variables of a given data set are uncorrelated. If the resulting value is below .05, the null hypothesis is rejected and it is concluded, that the variables are correlated².

psych::cortest.bartlett(cor(wm_df_prepped), n = 100)$p.value

(click to show/hide console output)

[1] 3.075839e-286

The p-value is way below 0.05, there is correlation between the dimensions of the data set.

Next, the Kaiser-Mayer-Olkin Criterion (KMO) is looked at. The KMO measures the adequacy of a dataset for factor analysis. It ranges from 0 to 1, where a higher value indicates higher suitability. A value above .6 is generally considered to be the threshold. However, some sources also consider .5 to be acceptable³.

psych::KMO(wm_df_prepped)$MSA

(click to show/hide console output)

[1] 0.5681499

~0.57 is not very good but I will consider this acceptable. Now the PCA can be executed.

wm_df_pca <- 
  wm_df_prepped %>% 
  prcomp(scale. = T,
         center = T)
wm_df_pca %>%
  summary()

(click to show/hide console output)

Importance of components:
                          PC1      PC2     PC3     PC4     PC5     PC6     PC7     PC8
Standard deviation     2.5701   2.3601  1.5934 1.35156 1.25891 1.15567 1.03789 1.02265
Proportion of Variance 0.2131   0.1797  0.0819 0.05893 0.05112 0.04308 0.03475 0.03374
Cumulative Proportion  0.2131   0.3928  0.4747 0.53359 0.58471 0.62780 0.66255 0.69628
                           PC9    PC10    PC11    PC12    PC13    PC14    PC15    PC16
Standard deviation     0.96964 0.95687 0.91417 0.88252 0.85881 0.80217 0.79484 0.76324
Proportion of Variance 0.03033 0.02954 0.02696 0.02512 0.02379 0.02076 0.02038 0.01879
Cumulative Proportion  0.72661 0.75615 0.78310 0.80823 0.83202 0.85278 0.87316 0.89195
                          PC17    PC18    PC19    PC20    PC21    PC22    PC23    PC24
Standard deviation     0.74113 0.69080 0.64876 0.60780 0.55877  0.5454 0.47311 0.45339
Proportion of Variance 0.01772 0.01539 0.01358 0.01192 0.01007  0.0096 0.00722 0.00663
Cumulative Proportion  0.90967 0.92506 0.93864 0.95055 0.96063  0.9702 0.97744 0.98407
                          PC25    PC26    PC27    PC28    PC29    PC30    PC31
Standard deviation     0.43197 0.38751 0.32419 0.17301 0.09742 0.08771 0.06887
Proportion of Variance 0.00602 0.00484 0.00339 0.00097 0.00031 0.00025 0.00015
Cumulative Proportion  0.99009 0.99494 0.99833 0.99929 0.99960 0.99985 1.00000

Taking a look at the first 15 principal components (PC) and the percentage of variance they explain.

wm_df_pca %>%
factoextra::fviz_eig(ncp = 15,
                     addlabels = T)

The elbow method would suggest using three PCs. The cumulative variance of 49.942 % when taking only three is too little to result in useful outcomes.

Another method of evaluating the number of PCs to keep is the Kaiser Criterion which states that factors with an eigenvalue above 1 are considered important and should be retained. An eigenvalue above 1 means its factor explains more variance than a single variable would.

factoextra::get_eig(wm_df_pca) %>%
  filter(eigenvalue > 1)

(click to show/hide console output)

      eigenvalue variance.percent cumulative.variance.percent
Dim.1	  6.604434	      21.304627                    21.30463
Dim.2   5.571090      	17.971259                    39.27589
Dim.3	  2.533845	       8.173692	                   47.44958
Dim.4 	1.824394      	 5.885141	                   53.33472
Dim.5	  1.586462	       5.117619	                   58.45234
Dim.6	  1.339441	       4.320776	                   62.77311
Dim.7	  1.076469	       3.472482            	       66.24560
Dim.8	  1.043998	       3.367737                    69.61333

8 factors possess eigenvalues above 1 with a cumulative variance of ~69.61 %.

get_eig(wm_df_pca)[1:15,] %>% 
  ggplot(aes(y=eigenvalue, x=1:15)) +
  ggtitle("Eigenvalue by Component") +
  labs(x = "Component",
       y = "Eigenvalue") +
  geom_line() +
  geom_point(col = "blue") +
  scale_x_continuous(breaks = 1:15) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme(plot.title = element_text(hjust = 0.5))

A third approach is Horn's Method. Here random data sets with equal size (columns and rows) as the original data set are generated and then a factor analysis is performed on each of them. The retained number of factors are then compared. The idea is that if the number of factors kept in the original data set is similar to the number of factors kept in the random sets, the factors of the original data set are considered not meaningful. If the number of factors of the original data set is larger than the number of factors in the random sets, the factors in the original data set are considered meaningful⁴.

set.seed(187)
wm_df_prepped %>%
     paran::paran()

(click to show/hide console output)

Results of Horn's Parallel Analysis for component retention
930 iterations, using the mean estimate

-------------------------------------------------- 
Component   Adjusted    Unadjusted    Estimated 
            Eigenvalue  Eigenvalue    Bias 
-------------------------------------------------- 
1           6.446948    6.604434      0.157485
2           5.432912    5.571090      0.138178
3           2.410785    2.533844      0.123058
4           1.713787    1.824393      0.110606
5           1.487574    1.586461      0.098887
6           1.251387    1.339440      0.088052
--------------------------------------------------

Adjusted eigenvalues > 1 indicate dimensions to retain.
(6 components retained)

Horn's Method suggests a number of 6 PCs to keep. I chose to keep 8 with approximately 70 % cumulative variance. Next, we take a look at the contributions of the original variables to each new PC.

n_PCs <- 8
for (i in 1:n_PCs) {
  factoextra::fviz_contrib(wm_df_pca, "var", axes = i)
  %>% print
}

The psych package comes with a function that can illustrate the contribution of each original variable to the PCs in one plot.

wm_df_prepped %>%
  psych::principal(nfactors = n_PCs) %>%
  psych::fa.diagram()

(click to show/hide)

A new data set is created based on the new dimensions.

wm_df_transformed_pca <- as.data.frame(-wm_df_pca$x[,1:n_PCs])

Factor Analysis (FA)

Like the PCA a factor analysis can be used to reduce the dimensions of a data set. However, while the PCA creates uncorrelated variables the FA identifies underlying latent factors that explain the relationships among the original variables in the data set. The factors the FA puts out might be correlated, so a rotation can be used in make these factors as uncorrelated as possible.

First, a vector containing all rotation methods is created. Then we iterate over each of them using a for-loop.

rot_meth <- c("varimax",
              "quartimax",
              "equamax",
              "oblimin",
              "promax")
max_cumvar <- 0
for (rm in rot_meth) {
  cat("Factor Analysis results. Rotation method: ", rm, "\n")
  res <- factanal(wm_df_prepped,
                  factors = n_PCs,
                  rotation = rm,
                  lower = 0.1) %>% 
    steffanossaR::ex_factanal()
  if (res[3,n_PCs] > max_cumvar & res[3,n_PCs] <= 1) {max_cumvar <- res[3,n_PCs]}
  print(res)
  cat("\n")
}

(click to show/hide console output)

Factor Analysis results. Rotation method:  varimax 
                 Factor1   Factor2    Factor3    Factor4    Factor5    Factor6    Factor7    Factor8
SS loadings    5.2625797 4.2328943 2.42322412 1.72548638 1.47297756 1.38332617 1.17507304 1.16537813
Proportion Var 0.1697606 0.1365450 0.07816852 0.05566085 0.04751541 0.04462342 0.03790558 0.03759284
Cumulative Var 0.1697606 0.3063056 0.38447413 0.44013498 0.48765039 0.53227381 0.57017939 0.60777224

Factor Analysis results. Rotation method:  quartimax 
                 Factor1   Factor2   Factor3    Factor4    Factor5    Factor6    Factor7    Factor8
SS loadings    5.2353602 4.2656853 3.2294537 1.63627033 1.33552917 1.32646255 1.07282139 0.73935672
Proportion Var 0.1688826 0.1376028 0.1041759 0.05278291 0.04308159 0.04278911 0.03460714 0.02385022
Cumulative Var 0.1688826 0.3064853 0.4106613 0.46344418 0.50652577 0.54931488 0.58392202 0.60777224

Factor Analysis results. Rotation method:  equamax 
                 Factor1    Factor2    Factor3    Factor4    Factor5    Factor6    Factor7    Factor8
SS loadings    4.3643577 2.96040826 2.56451685 1.98772753 1.90141196 1.90019869 1.72378435 1.43853404
Proportion Var 0.1407857 0.09549704 0.08272635 0.06412024 0.06133587 0.06129673 0.05560595 0.04640432
Cumulative Var 0.1407857 0.23628277 0.31900912 0.38312937 0.44446524 0.50576197 0.56136791 0.60777224

Factor Analysis results. Rotation method:  oblimin 
                 Factor1   Factor2    Factor3    Factor4  Factor5   Factor6    Factor7   Factor8
SS loadings    4.4381552 3.1541391 1.91135844 1.73505742 1.480126 1.4767935 1.35825449 1.3417234
Proportion Var 0.1431663 0.1017464 0.06165672 0.05596959 0.047746 0.0476385 0.04381466 0.0432814
Cumulative Var 0.1431663 0.2449127 0.30656944 0.36253904 0.410285 0.4579235 0.50173820 0.5450196

Factor Analysis results. Rotation method:  promax 
                Factor1   Factor2   Factor3   Factor4    Factor5    Factor6    Factor7    Factor8
SS loadings    5.009074 3.6273221 3.6242156 1.7168483 1.49714828 1.44777655 1.44589770 0.91053167
Proportion Var 0.161583 0.1170104 0.1169102 0.0553822 0.04829511 0.04670247 0.04664186 0.02937199
Cumulative Var 0.161583 0.2785934 0.3955036 0.4508858 0.49918090 0.54588337 0.59252523 0.62189722

62.22 % (promax rotation) is the maximum amount of cumulative variance with 8 factors. Approximately 10 % less than what the PCA yielded. Additionally, the loading are ambiguous. PCA will be used.

Cluster Analysis

To assess the clustering tendency of a data set the Hopkins Statistic can be used. It measures the probability that a given data set was generated by a uniform data distribution. The higher the resulting value the better the clustering tendency. Values range from 0 to 1⁵.

wm_df_transformed_pca %>%
  get_clust_tendency(n = nrow(wm_df_transformed_pca) - 1, graph = F)

(click to show/hide console output)

$hopkins_stat
[1] 0.8309369

$plot
NULL

~ 0.8309 is quite good.

Hierarchical Clustering: Agglomerative Methods

Hierarchical Clustering can be done with different approaches. Agglomerative methods start with every observation being considered a single-elementcluster. Then the two most similar clusters are combined into a new cluster. This is done until there is one cluster containing everything. First, a distance matrix and then clusters are created. For both steps there are multiple methods of creation. Every combination is looped over and the results are saved into .pdf files which can be found in doc/.

dist_meth <- c("euclidean",
               "maximum",
               "manhattan",
               "canberra",
               "minkowski")

daisy_meth <- c("euclidean",
                "manhattan",
                "gower")

hclust_meth <- c("single",
                 "complete",
                 "average",
                 "mcquitty",
                 "median",
                 "centroid",
                 "ward")

pdf("doc/dendro_dist_hclust.pdf", width = 16, height = 9)
for (dm in dist_meth) {
  for (cm in hclust_meth) {
    hdist <- dist(scale(wm_df_transformed_pca),
                  dm)
    hcl <- flashClust::flashClust(hdist, cm)
    plot(hcl,
         main = paste0("dist method: ", dm,
                       "\nclust method: ", cm),
         sub = NA,
         labels = FALSE)
  }
}
dev.off()

pdf("doc/dendro_daisy.pdf", width = 16, height = 9)
for (dm in daisy_meth) {
  for (cm in hclust_meth) {
    daisydist <- daisy(wm_df_transformed_pca,
                       metric = dm,
                       stand = TRUE)
    flashClust::flashClust(daisydist, cm) %>% 
      plot(main = paste0("dist method: ", dm,
                         "\nclust method: ", cm),
           sub = NA,
           labels = FALSE)
  }
}
dev.off()

The combination of canberra and ward.D2 looks most promising.

hclust_a <- 
  dist(scale(wm_df_transformed_pca), 
       method = "canberra") %>% 
  flashClust::flashClust(method = "ward")
 
hclust_a %>% plot(main = paste0("dist method: canberra",
                                "\nclust method: ward"),
                  sub = NA,
                  labels = FALSE)

A number of 2 to 7 clusters seem to be most viable. The results of clustering into 2 to 7 clusters will be saved as .csv file and later be used for profiling. The results can be found in doc/profiling_hclust_a.csv

(click to show/hide code)

for (i in 2:7) {
  wm_df_prepped_clust_a <-
    wm_df_prepped %>%
    mutate(Cluster_a = cutree(hclust_a, k = i))
  
  profiling_df <-
    wm_df_prepped_clust_a %>% 
    group_by(Cluster_a) %>% 
    mutate(n_island_municipalities = sum(Inselgemeinde),
           Inselgemeinde = NULL,
           n_coastal_municipalities = sum(Kuestengemeinde),
           Kuestengemeinde = NULL) %>% 
    summarize_all(mean)
  profiling_df$n_Cluster <- i
  profiling_df$Size <- table(wm_df_prepped_clust_a$Cluster_a)
  
  if (i == 2) {
    profiling_df_final <- profiling_df
    # write to csv
    write.table(profiling_df_final %>%
                  filter(n_Cluster == i) %>% 
                  mutate(round(., 2)),
                file = "profiling_hclust_a.csv",
                sep = ";",
                dec = ",",
                row.names = F)
  } else {
    profiling_df_final <-
      profiling_df_final %>% 
      rows_append(profiling_df)
    
    # append rows with leading blank line to csv
    write.table("",
                file = "profiling_hclust_a.csv",
                sep = ";",
                append = T,
                col.names = F,
                dec = ",",
                row.names = F)
    write.table(profiling_df_final %>%
                  filter(n_Cluster == i) %>% 
                  mutate(round(., 2)),
                file = "profiling_hclust_a.csv",
                sep = ";",
                append = T,
                col.names = F,
                dec = ",",
                row.names = F)
  }
}

Hierarchical Clustering: Divisive Methods

Unlike with the agglomerative approach, the divisive method begins with a single cluster containing every observation and with each step existing clusters are divided into smaller ones until there are as many clusters as observations. Diana (DIvisive ANAlysis) comes with two methods of calculating the dissimilarities between observations.

diana_eucl <- diana(wm_df_transformed_pca,
                    metric = "euclidean",
                    stand = TRUE)
diana_man <- diana(wm_df_transformed_pca,
                   metric = "manhatten",
                   stand = TRUE)
diana_eucl %>% 
  pltree(main = "Diana\nEuclidean",
         labels = FALSE,
         sub = NA,
         xlab = NA)

diana_man %>% 
  pltree(main = "Diana\nEuclidean",
         labels = FALSE,
         sub = NA,
         xlab = NA)

...

Clustering: Partitioning Methods

k-means needs a fixed amount of clusters to operate. we can get suggestions by using different approaches of calculating the resulting quality. The target is to produce clusters that are homogenous within themselves while being as differentiable as possible from other clusters. WithinSS (sum of squares within a cluster):

fviz_nbclust(wm_df_transformed_pca, kmeans, method = "wss")

Silhouette

fviz_nbclust(wm_df_transformed_pca, kmeans, method = "silhouette")

Gap statistic

fviz_nbclust(wm_df_transformed_pca, kmeans, method = "gap_stat")

These plot suggest 3, 4 or 7. Not too helpful, in my opinion, since we can just loop over a range that is similar to the range we used with Agglomerative Clustering anyway without any hassle. The results can be found in in doc/profiling_kmeans.csv.

(click to show/hide code)

set.seed(187)
for (i in 2:7) {
  kmeans_t <- kmeans(scale(wm_df_transformed_pca),
                   centers = i,
                   nstart = 10,
                   iter.max = 10
                   )
  
  wm_df_prepped_clust_kmeans <-
    wm_df_prepped %>% 
    mutate(Cluster_k = kmeans_t$cluster)
  
  profiling_df_kmeans <-
    wm_df_prepped_clust_kmeans %>% 
    group_by(Cluster_k) %>% 
    mutate(n_island_municipalities = sum(Inselgemeinde),
           Inselgemeinde = NULL,
           n_coastal_municipalities = sum(Kuestengemeinde),
           Kuestengemeinde = NULL) %>% 
    summarise_all(mean)
  
  profiling_df_kmeans$n_Cluster <- i
  profiling_df_kmeans$Size <- table(wm_df_prepped_clust_kmeans$Cluster_k)
  
  if (i == 2) {
    profiling_df_kmeans_final <- profiling_df_kmeans
    # write to csv
    write.table(profiling_df_final %>%
                  filter(n_Cluster == i) %>% 
                  mutate(round(., 2)),
                file = "profiling_kmeans.csv",
                sep = ";",
                dec = ",",
                row.names = F)
  } else {
    profiling_df_kmeans_final <-
      profiling_df_kmeans_final %>% 
      rows_append(profiling_df_kmeans)
    # write to csv
    ## blank line
    write.table("",
                file = "profiling_kmeans.csv",
                sep = ";",
                append = T,
                col.names = F,
                dec = ",",
                row.names = F)
    ## data
    write.table(profiling_df_final %>%
                  filter(n_Cluster == i) %>% 
                  mutate(round(., 2)),
                file = "profiling_kmeans.csv",
                sep = ";",
                append = T,
                col.names = F,
                dec = ",",
                row.names = F)
  }
}

Profiling

Now the .csv files created earlier come into play. Microsoft Excel can be used to comfortably compare clusters for profiling and help determining the number of clusters to choose. Using a .vba script, the differences between the clusters can be emphasised easily.

(click to show/hide .vba code)

Sub Highlighting()
    
    Dim data_end_col As Integer
    Dim cluster_group_col As Long
    Dim cluster_count As Integer
    Dim i As Integer
    Dim j As Integer
    Dim group_end_row As Integer
    
    ' Dim chr As String
    ' get col of cluster groups
    ' chr = InputBox("ColName of cluster groupings", "Text Input", "n_Cluster")
    ' cluster_group_col = Application.WorksheetFunction.Match(chr, Range(Cells(1, 1), Cells(1, data_end_col)), 0)
    
    i = 2
    j = 2
    data_end_col = Cells(1, Columns.Count).End(xlToLeft).Column
    cluster_group_col = Application.WorksheetFunction.Match("n_Cluster", Range(Cells(1, 1), Cells(1, data_end_col)), 0)
    
    ' loop through the number of clusters
    Do While i <= ActiveSheet.UsedRange.Rows.Count
        cluster_count = Cells(i, cluster_group_col)
        group_end_row = i + cluster_count - 1
        
        ' loop through dimensions
        For j = 2 To data_end_col - 2
            Range(Cells(i, j), Cells(group_end_row, j)).Select
            
            ' add highlighting
            Selection.FormatConditions.AddColorScale ColorScaleType:=3
            Selection.FormatConditions(Selection.FormatConditions.Count).SetFirstPriority
            Selection.FormatConditions(1).ColorScaleCriteria(1).Type = _
                xlConditionValueLowestValue
            With Selection.FormatConditions(1).ColorScaleCriteria(1).FormatColor
                .Color = 13011546
                .TintAndShade = 0
            End With
            Selection.FormatConditions(1).ColorScaleCriteria(2).Type = _
                xlConditionValuePercentile
            Selection.FormatConditions(1).ColorScaleCriteria(2).Value = 50
            With Selection.FormatConditions(1).ColorScaleCriteria(2).FormatColor
                .Color = 16776444
                .TintAndShade = 0
            End With
            Selection.FormatConditions(1).ColorScaleCriteria(3).Type = _
                xlConditionValueHighestValue
            With Selection.FormatConditions(1).ColorScaleCriteria(3).FormatColor
                .Color = 7039480
                .TintAndShade = 0
            End With
            
        Next j
        i = group_end_row + 2
        
    Loop
End Sub

Agglomerative

k-means

Unsurprisingly both approaches lead to equal results. From here, it is possible to point out the differences between clusters. I went with 4 clusters.

Cluster #1: consists of municipalities with a large area and a high population and therefore - due to their correlation - high amounts of waste, while their urbanisation degree is the lowest among all clusters. Approximately third of these communities are coastal communities and there is no tendency of their location being rather north or south. This is true for using any number between 3 and 7 for clustering.
Cluster #2: contains the smallest communities (by average), the population per area, however, is very high; being only surpassed by cluster #1. These municipalities are located in the north of Italy and their urbanisation degree is rather high, as is their degree of sorting, the percentage of waste that is for recycling and to produce energy and the per capita tax revenue from private sources. The rate of municipalities that offer PAYT is the lowest among all clusters while the rate of standard fee regulation sits at 97 %. Also, this cluster contains the least amount of coastal communities, and everything cost related is below average.
Cluster #3: includes municipalities that are big by area but small by population. The percentage of sorted waste is the lowest among all clusters, as is the percentage of waste used for energy production and. The percentage of waste with unknown use, too is very low
Cluster #4: the highest rate of waste sorting, the highest rate of waste that is used for recycling and the highest per capita tax revenue from pirate sources. The amount of municipalities that offer PAYT is very low as is the amount of waste stored in landfills. Most communities are located in the north nd their urbanisation degree is high.

In order to visualise the geographical distribution of the clusters, the latitudes and longitudes are necessary. A dump of Italian municipalities and their geographical positions is used⁶. Using a left join, latitudes and longitudes of the relevant communities are extracted.

(Note: Municipality names are not unique and the database contains duplicates. Some errors are produced.)

(click to show/hide code)

library(curl)
zip_path <- "https://download.geonames.org/export/dump/IT.zip"
tmp <- tempfile()
curl_download(zip_path, tmp)
unzip(tmp, "IT.txt")

italy_df <- read.table("IT.txt", sep = "\t", header = F, fill = T, quote = '\"')
italy_df <- italy_df %>% select(Name = V2, Lat = V5, Long = V6)

export <- wm_df_kmeans %>% left_join(italy_df, by = c("Gemeinde" = "Name"))
export[!duplicated(export$Gemeinde),] %>%
  write.csv("export.csv")

The exported .csv can then be used to mark the municipalities on a map. Here, Google Maps was used. Note: Due to github Markdown limitations, a click on the map will open a link within this tab. Use CTRL + Click to open in a new tab instead.

Municipal waste is all waste collected and treated by or for municipalities. It includes waste from households including bulky waste, similar waste from trade and commerce, office buildings, institutions and small businesses, as well as yard and garden waste, street sweepings and the contents of waste containers. yard and garden waste, street sweepings and the contents of waste containers. The definition includes waste from municipal sewage networks and their treatment as well as waste from construction and demolition work. ↩
Reference: https://www.itl.nist.gov/div898/handbook/eda/section3/eda357.htm ↩
Reference: https://www.empirical-methods.hslu.ch/entscheidbaum/interdependenzanalyse/reduktion-der-variablen/faktoranalyse/ ↩
Reference: doi:10.1007/bf02289447 ↩
Reference: https://www.datanovia.com/en/lessons/assessing-clustering-tendency/ ↩
Source: http://www.geonames.org/export/ ↩

steffanossa / derivation_of_waste_mgmt_profiles_for_italian_municipalities Goto Github PK

derivation_of_waste_mgmt_profiles_for_italian_municipalities's Introduction

Derivation of Waste Management Profiles (WIP)

Disclaimer

Todos:

Context

Exploratory Data Analysis

Dimension Reduction

Principal Components Analysis (PCA)

Factor Analysis (FA)

Cluster Analysis

Hierarchical Clustering: Agglomerative Methods

Hierarchical Clustering: Divisive Methods

Clustering: Partitioning Methods

Profiling

derivation_of_waste_mgmt_profiles_for_italian_municipalities's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

steffanossa / derivation_of_waste_mgmt_profiles_for_italian_municipalities Goto Github PK

derivation_of_waste_mgmt_profiles_for_italian_municipalities's Introduction

Derivation of Waste Management Profiles (WIP)

Disclaimer

Todos:

Context

Exploratory Data Analysis

Dimension Reduction

Principal Components Analysis (PCA)

Factor Analysis (FA)

Cluster Analysis

Hierarchical Clustering: Agglomerative Methods

Hierarchical Clustering: Divisive Methods

Clustering: Partitioning Methods

Profiling

Footnotes

derivation_of_waste_mgmt_profiles_for_italian_municipalities's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org