Sparse Robust PCA with the ssMRCD Estimator for Multi-Source Data

This vignette is based on the real world example described in Puchhammer, Wilms and Filzmoser (2024). However, only a subset of the data is analysed to reduce the necessary runtime due to the large number of groups in the original data analysis. All functions are included in the package ssMRCD.

library(ssMRCD)
library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#> 
#>     filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(robustbase)
library(tidyr)
library(ggridges)

Data Preparation

Weather data is made available by GeoSphere Austria (2024) (https://data.hub.geosphere.at) and is pre-cleaned and saved in the data frame object weatherHoheWarte. We consider a set of weather variables for the weather station Hohe Warte in Vienna, Austria over the years 2000-2023. The decision of variables and time lines are based on data quality, i.e. missing data. Additional information can be found on the helping page.

? weatherHoheWarte
data("weatherHoheWarte")
head(weatherHoheWarte)
#> # A tibble: 6 × 18
#>   time                cloud_cover global_radiation vapor_pressure max_wind_speed
#>   <dttm>                    <dbl>            <dbl>          <dbl>          <dbl>
#> 1 1960-01-01 00:00:00          93              288            7.1            7.7
#> 2 1960-01-02 00:00:00         100              195            7.6            6.3
#> 3 1960-01-03 00:00:00         100               57            7.5            7.7
#> 4 1960-01-04 00:00:00          97               68            8.1            9.4
#> 5 1960-01-05 00:00:00          57              161            7.3            9.1
#> 6 1960-01-06 00:00:00         100              163            6.6           19.4
#> # ℹ 13 more variables: air_pressure <dbl>, relative_humidity <dbl>,
#> #   precipitation <dbl>, sight <dbl>, sunshine_duration <dbl>,
#> #   temperature_max <dbl>, temperature_min <dbl>, temperature_mean <dbl>,
#> #   wind_velocity <dbl>, year <dbl>, month <dbl>, day <dbl>, season <dbl>

Only data from 2000 onward is used.

varnames_short= c("cl", "rad", "vp", "wmax", "ap", "hum", "prec", "sight", "sun", 
               "tmax", "tmin", "t", "w")

# filter data
weather =  weatherHoheWarte %>% 
  dplyr::filter(year > 2000)

# select variables
X_data = weather %>% 
  select(cloud_cover:wind_velocity)  %>%
  as.matrix
colnames(X_data) = varnames_short

# get number of variables
p = dim(X_data)[2]

To facilitate a stable algorithm, the data is scaled robustly. High variance complicates the strategy to choose during the algorithm since it is based on the average overall variance.

X_data = scale(X_data, center = robustbase::colMedians(X_data), scale = apply(X = X_data, MARGIN = 2, FUN = mad))

Calculating the ssMRCD Estimator for Yearly Groups

Since each year is supposed to be similar but can also contain certain irregular weather phenomena, the ssMRCD estimator (Puchhammer and Filzmoser, 2024) is applicable to take advantage of the additional information of prior and following years. For the ssMRCD estimator we use the function ssMRCD. The neighborhoods are years, thus acquiring comparability by including one weather cycle per group. Weights are based on the temporal proximity and linear decline of influence, which lasts a maximum of 5 years and calculated with the function time_weights.

N_groups = as.numeric(as.factor(weather$year))
N = max(N_groups)
W = time_weights(max(N_groups), c(5:1))

The optimal amount of smoothing is tuned with the residuals-based criteria described in Puchhammer, Wilms and Filzmoser (2024). Since the optimization of values takes some time (around 10 min), the optimal parameter selection is hard-coded in the second code chunk. To see the optimal smoothing criterion plotted over lambda run the first code chunk below.

set.seed(1234)

lambda = seq(0, 1, by = 0.05)
opt_lambda = ssMRCD(X = X_data,
                    groups = N_groups,
                    weights = W,
                    lambda = lambda,
                    tuning = list(method = "residuals", plot = TRUE),
                    alpha = 0.75)
lambda_grid = seq(0, 1, by = 0.05)
residual_values = c(3.199500,3.188801,3.182297,3.175784,3.171462,3.167695,3.165151,3.162758,
                    3.161896,3.161929,3.162562, 3.163831,3.165651,3.168228,3.171867,3.175790,
                    3.180868,3.186437,3.192634,3.200292,3.208845)
lambda_opt = 0.4

ggplot() + 
  geom_line(aes(x = lambda_grid,
                y = residual_values)) +
  geom_point(aes(x = lambda_grid,
                y = residual_values)) +
  geom_point(aes(x = lambda_opt,
                y = residual_values[lambda_grid == lambda_opt]),
             col = "red") +
  labs(x = expression(lambda),
       y = "Residual Value",
       title = "Optimal Smoothing Based on Residuals") +
  theme_minimal()

Here, the ssMRCD with the optimal lambda is calculated.

# get optimal smoothing parameter and estimator
set.seed(123)
ssMRCD_weather = ssMRCD(X = X_data,
                        groups = weather$year,
                        weights = W,
                        lambda = 0.4,
                        alpha = 0.75)

# collect covariance matrices
COVS = ssMRCD_weather$MRCDcov

plot(ssMRCD_weather, type = c("ellipses", "convergence"), variables = c("hum", "sun"))
#> $plot_convergence

#> 
#> $plot_ellipses

#> 
#> $plot_geoellipses
#> NULL

By plotting the mean estimates of the ssMRCD estimator we see clear temporal trends in the data and the different variables, .

# collect means
mu_timeseries = do.call(cbind, ssMRCD_weather$MRCDmu)
colnames(mu_timeseries) = 2001:2023
rownames(mu_timeseries) = colnames(X_data)

# plot means dependent on year
ggplot(mu_timeseries %>%
         data.frame %>%
         mutate(Var1 = rownames(.)) %>%
         tidyr::pivot_longer(cols = X2001:X2023) %>% 
         group_by(Var1) %>% 
         mutate(value = (value-min(value))/(max(value) - min(value)),
                name = as.numeric(gsub("X", "", name)))) +
  geom_line(aes(x = name, y = value, group = Var1)) + 
  geom_smooth(aes(x = name, y = value, group = Var1), se = F) + 
  facet_wrap(vars(Var1)) + 
  theme_minimal() +
  labs(x = "", y = "")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Robust Sparse PCA

The covariances from the ssMRCD-estimator can be used as plug-in for the sparse robust multi-source PCA.

Hyper-Parameter Tuning

To get the optimal distribution of sparsity \(\gamma\) we use the AUC for the path of explained variance and sparsity of entries and groups. The optimal amount of sparsity \(\eta\) is based on the trade-off product optimization (TPO) criterion. The parameter tuning is included in the function when and/or are numeric vectors. Here, only is tuned.

set.seed(12345)
pca = msPCA(eta = seq(0, 3, 0.25), gamma = 0.5, COVS = COVS, k = p)
#> Tuning |===                                               | 1 parameter combinations of 13Tuning |=======                                           | 2 parameter combinations of 13Tuning |===========                                       | 3 parameter combinations of 13Tuning |===============                                   | 4 parameter combinations of 13Tuning |===================                               | 5 parameter combinations of 13Tuning |=======================                           | 6 parameter combinations of 13Tuning |==========================                        | 7 parameter combinations of 13Tuning |==============================                    | 8 parameter combinations of 13Tuning |==================================                | 9 parameter combinations of 13Tuning |======================================            | 10 parameter combinations of 13Tuning |==========================================        | 11 parameter combinations of 13Tuning |==============================================    | 12 parameter combinations of 13Tuning |==================================================| 13 parameter combinations of 13
pca$plot_tpo

The optimal parameter values is \(\eta=\) 1.

summary(pca, k = 3)
#> 
#> Sparse Multi-Source PCA Summary
#> ==============================================
#> Number of groups (N):               23 
#> Number of variables (p):            13 
#> Number of components (k):           13 
#> Sparsity parameters (gamma, eta):   0.5 , 1 
#> 
#> Component-wise Summary:
#> 
#> PC1:
#> Group    | Explained Var (%) | Non-zero Loadings
#> -----------------------------------------------
#> 2001     |   51.05%          |  10 / 13
#> 2002     |   56.84%          |  10 / 13
#> 2003     |   40.68%          |  13 / 13
#> 2004     |   53.00%          |   8 / 13
#> 2005     |   52.43%          |  10 / 13
#> 2006     |   44.01%          |  12 / 13
#> 2007     |   43.87%          |  10 / 13
#> 2008     |   41.83%          |   9 / 13
#> 2009     |   47.69%          |   9 / 13
#> 2010     |   45.79%          |  10 / 13
#> 2011     |   33.91%          |  13 / 13
#> 2012     |   44.37%          |  11 / 13
#> 2013     |   36.62%          |  13 / 13
#> 2014     |   38.06%          |  13 / 13
#> 2015     |   31.58%          |  13 / 13
#> 2016     |   43.72%          |  13 / 13
#> 2017     |   39.56%          |  13 / 13
#> 2018     |   43.75%          |  12 / 13
#> 2019     |   36.66%          |  12 / 13
#> 2020     |   30.45%          |  13 / 13
#> 2021     |   39.32%          |  13 / 13
#> 2022     |   35.62%          |  12 / 13
#> 2023     |   38.86%          |  13 / 13
#> 
#> PC2:
#> Group    | Explained Var (%) | Non-zero Loadings
#> -----------------------------------------------
#> 2001     |   21.88%          |  13 / 13
#> 2002     |   19.63%          |  12 / 13
#> 2003     |   29.19%          |  12 / 13
#> 2004     |   20.36%          |  12 / 13
#> 2005     |   20.47%          |  12 / 13
#> 2006     |   26.38%          |  12 / 13
#> 2007     |   26.27%          |  13 / 13
#> 2008     |   26.35%          |  12 / 13
#> 2009     |   25.37%          |  10 / 13
#> 2010     |   25.79%          |  10 / 13
#> 2011     |   32.70%          |  11 / 13
#> 2012     |   25.82%          |  11 / 13
#> 2013     |   31.09%          |  11 / 13
#> 2014     |   30.30%          |  11 / 13
#> 2015     |   33.98%          |  12 / 13
#> 2016     |   26.33%          |  12 / 13
#> 2017     |   30.02%          |  12 / 13
#> 2018     |   28.08%          |  11 / 13
#> 2019     |   28.22%          |  11 / 13
#> 2020     |   31.04%          |  13 / 13
#> 2021     |   26.74%          |  12 / 13
#> 2022     |   27.45%          |  11 / 13
#> 2023     |   25.38%          |  12 / 13
#> 
#> PC3:
#> Group    | Explained Var (%) | Non-zero Loadings
#> -----------------------------------------------
#> 2001     |   10.59%          |  11 / 13
#> 2002     |    9.12%          |  11 / 13
#> 2003     |   11.17%          |  13 / 13
#> 2004     |   10.04%          |  11 / 13
#> 2005     |   10.03%          |  11 / 13
#> 2006     |   10.45%          |  11 / 13
#> 2007     |   11.71%          |  11 / 13
#> 2008     |   11.44%          |  11 / 13
#> 2009     |    9.57%          |  13 / 13
#> 2010     |    9.93%          |  13 / 13
#> 2011     |   11.37%          |  13 / 13
#> 2012     |    9.77%          |  13 / 13
#> 2013     |   10.91%          |  13 / 13
#> 2014     |   10.31%          |  13 / 13
#> 2015     |   11.17%          |  13 / 13
#> 2016     |   10.83%          |  13 / 13
#> 2017     |   11.59%          |  13 / 13
#> 2018     |    9.59%          |  13 / 13
#> 2019     |   12.31%          |  13 / 13
#> 2020     |   13.32%          |  13 / 13
#> 2021     |   12.07%          |  13 / 13
#> 2022     |   12.08%          |  13 / 13
#> 2023     |   12.67%          |  13 / 13
#> 
#> Convergence:
#>   Converged:  Yes Yes Yes Yes Yes Yes Yes No Yes No Yes Yes Yes 
#>   Steps:      32 41 105 82 72 79 135 200 199 200 33 1 1 
#> 
#> Use `plot()` to visualize components.

Sparse Robust PCA - Number of PCs

When deciding how many PCs are enough to describe the data well, a scree-plot inspired box-plot scree-plot can visualize the explained variance well for multiple groups. Note, that we adjust the amount of sparsity for further components based on the remaining variance possible to explain.

# plot scree plot
screeplot(x = pca, 
                 text = TRUE,
                 size = 3, 
                 gnames = 2001:2023,
                 cutoff = 0.8)
#> [[1]]

#> 
#> [[2]]

We select the first three PCs as sufficient, since 80% of overall variance is explained by them.

# optimal number of components
optimal_k = 3

Diagnostic Plots

When it comes to plotting, we also need to align the loadings for a clear picture. Multiple approaches are possible and implemented in the function align with different type settings.

?align
#> Hilfe zum Thema 'align' in den folgenden Paketen gefunden:
#> 
#>   Package               Library
#>   pillar                C:/Users/puchhammer/AppData/Local/R/win-library/4.5
#>   ssMRCD                C:/Users/puchhammer/AppData/Local/Temp/RtmpYT5kkg/Rinst2ccc58c746ac
#> 
#> 
#> Nutze ersten passenden ...
pca$PC[, 1:optimal_k, ] = align(PC = pca$PC[, 1:optimal_k, ],type = "mean")

Loadings

Loadings can be plotted via the function plot with type = "loadings". Keep in mind, that each PC for each source is unique only up to a factor -1. While in theory this is not remarkable, for comparing results in a sensible manner, the PCs should be oriented to reflect similarities specifically. The function align_PC provides various approach to orient them in a similar direction and for applications different type arguments can be explored.

# plot aligned loadings
plot(x = pca, 
     type = "loadings",
     gnames = 2001:2023, 
     k = 1) 
#> $loadings
#> list()
#> 
#> $<NA>


plot(x = pca, 
     type = "loadings",
     gnames = 2001:2023, 
     k = 2) 
#> $loadings
#> list()
#> 
#> $<NA>


plot(x = pca, 
     type = "loadings", 
     gnames = 2001:2023, 
     k = 3) 
#> $loadings
#> list()
#> 
#> $<NA>

Also something like a biplot can be visualized to see the differences in the loadings.

biplot(pca)

Scores, Score- and Orthogonal-Distances

To improve interpretation of the components, we calculate the scores as well as the score distance and the orthogonal distance with the function scores. You can either use the ssMRCD object, then the data from the covariance estimation is used and centered accordingly, or you can specify the input manually.

sc = scores(X = X_data, groups = weather$year, PC = pca$PC, mu = ssMRCD_weather$MRCDmu, Sigma = ssMRCD_weather$MRCDcov)
sc2 = scores(PC = pca$PC, ssMRCD = ssMRCD_weather)

Possible visualizations include univariate densities for each group via histograms of the scores.

plot(x = pca, ssMRCD = ssMRCD_weather, type = "scores")
#> $scores
#> list()
#> 
#> $<NA>
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The score distance and orthogonal distance are calculated via the scores function. Some densities can be shown for each group similar to Puchhammer, Wilms and Filzmoser (2024).

# distances
score_sd = sc$SD
score_od = sc$OD

dat = data.frame(time = weather$time,
                 year = weather$year,
                 OD = score_od,
                 SD = score_sd,
                 groups = weather$year)
dat_melt = dat %>% tidyr::pivot_longer(cols = c("OD", "SD"))

ggplot(dat_melt %>% 
         dplyr::filter(name %in% c( "SD", "OD"))) +
  ggridges::geom_density_ridges_gradient(aes(x = value, 
                                   y = groups, 
                                   fill = as.factor(groups), 
                                   group = groups),
                               scale = 10,
                               rel_min_height = 0.01, 
                               gradient_lwd = 0.5) +
  scale_x_continuous(expand = c(0.03, 0)) +
  scale_y_discrete(expand = c(0.0, 0, 0.1, 0)) +
  scale_fill_discrete(name = "") +
  ggridges::theme_ridges(font_size = 13, grid = TRUE) + 
  theme(axis.title.y = element_blank(),
        legend.position = "right",
        panel.grid.major.y = element_line()
        ) + 
  labs(x = "") +
  facet_grid(cols = vars(name), scales = "free") 
#> Picking joint bandwidth of 1.03e-06
#> Picking joint bandwidth of 1.8

Also, a distance-distance plot can be plotted.

plot(pca, type = "score_distances", ssMRCD = ssMRCD_weather, k = 3)
#> $score_distances
#> list()
#> 
#> $<NA>

For more fun, the two distances can be visualized via Gif in a bivariate way for each group separately.

# use function saveGIF
library(animation)
param = 2001:2023

saveGIF(interval = 0.1, 
        movie.name = "PCA_scoreDist_bivariate.gif",
        expr = {
          for (i in param ){
           g = ggplot(dat %>% dplyr::filter (year %in% seq(i-5, i+5, 1))) +
                 geom_density_2d_filled(aes(x = SD, 
                     y = OD),
                     alpha = 1) +
                  scale_x_sqrt(limits = c(0,40)) + 
                  ylim(c(0, 6.2)) + 
                  labs(title = i)
  
            print(g)
          }}
        )

References

GeoSphere Austria (2022): https://data.hub.geosphere.at.

Puchhammer, P., & Filzmoser, P. (2024). Spatially smoothed robust covariance estimation for local outlier detection. Journal of Computational and Graphical Statistics, 33(3), 928-940.https://doi.org/10.1080/10618600.2023.2277875.

Puchhammer, P., Wilms, I., & Filzmoser, P. (2024). Sparse outlier-robust PCA for multi-source data. arXiv preprint arXiv:2407.16299. https://doi.org/10.48550/arXiv.2407.16299.