This vignette is based on the real world example described in
Puchhammer, Wilms and Filzmoser (2024). However, only a subset of the
data is analysed to reduce the necessary runtime due to the large number
of groups in the original data analysis. All functions are included in
the package ssMRCD
.
library(ssMRCD)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> Die folgenden Objekte sind maskiert von 'package:stats':
#>
#> filter, lag
#> Die folgenden Objekte sind maskiert von 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(robustbase)
library(tidyr)
library(ggridges)
Weather data is made available by GeoSphere Austria (2024) (https://data.hub.geosphere.at) and is pre-cleaned and
saved in the data frame object weatherHoheWarte
. We
consider a set of weather variables for the weather station Hohe Warte
in Vienna, Austria over the years 2000-2023. The decision of variables
and time lines are based on data quality, i.e. missing data. Additional
information can be found on the helping page.
? weatherHoheWarte
data("weatherHoheWarte")
head(weatherHoheWarte)
#> # A tibble: 6 × 18
#> time cloud_cover global_radiation vapor_pressure max_wind_speed
#> <dttm> <dbl> <dbl> <dbl> <dbl>
#> 1 1960-01-01 00:00:00 93 288 7.1 7.7
#> 2 1960-01-02 00:00:00 100 195 7.6 6.3
#> 3 1960-01-03 00:00:00 100 57 7.5 7.7
#> 4 1960-01-04 00:00:00 97 68 8.1 9.4
#> 5 1960-01-05 00:00:00 57 161 7.3 9.1
#> 6 1960-01-06 00:00:00 100 163 6.6 19.4
#> # ℹ 13 more variables: air_pressure <dbl>, relative_humidity <dbl>,
#> # precipitation <dbl>, sight <dbl>, sunshine_duration <dbl>,
#> # temperature_max <dbl>, temperature_min <dbl>, temperature_mean <dbl>,
#> # wind_velocity <dbl>, year <dbl>, month <dbl>, day <dbl>, season <dbl>
Only data from 2000 onward is used.
= c("cl", "rad", "vp", "wmax", "ap", "hum", "prec", "sight", "sun",
varnames_short"tmax", "tmin", "t", "w")
# filter data
= weatherHoheWarte %>%
weather ::filter(year > 2000)
dplyr
# select variables
= weather %>%
X_data select(cloud_cover:wind_velocity) %>%
as.matrixcolnames(X_data) = varnames_short
# get number of variables
= dim(X_data)[2] p
To facilitate a stable algorithm, the data is scaled robustly. High variance complicates the strategy to choose during the algorithm since it is based on the average overall variance.
= scale(X_data, center = robustbase::colMedians(X_data), scale = apply(X = X_data, MARGIN = 2, FUN = mad)) X_data
Since each year is supposed to be similar but can also contain
certain irregular weather phenomena, the ssMRCD estimator (Puchhammer
and Filzmoser, 2024) is applicable to take advantage of the additional
information of prior and following years. For the ssMRCD estimator we
use the function ssMRCD
. The neighborhoods are years, thus
acquiring comparability by including one weather cycle per group.
Weights are based on the temporal proximity and linear decline of
influence, which lasts a maximum of 5 years and calculated with the
function time_weights
.
= as.numeric(as.factor(weather$year))
N_groups = max(N_groups)
N = time_weights(max(N_groups), c(5:1)) W
The optimal amount of smoothing is tuned with the residuals-based criteria described in Puchhammer, Wilms and Filzmoser (2024). Since the optimization of values takes some time (around 10 min), the optimal parameter selection is hard-coded in the second code chunk. To see the optimal smoothing criterion plotted over lambda run the first code chunk below.
set.seed(1234)
= seq(0, 1, by = 0.05)
lambda = ssMRCD(X = X_data,
opt_lambda groups = N_groups,
weights = W,
lambda = lambda,
tuning = list(method = "residuals", plot = TRUE),
alpha = 0.75)
= seq(0, 1, by = 0.05)
lambda_grid = c(3.199500,3.188801,3.182297,3.175784,3.171462,3.167695,3.165151,3.162758,
residual_values 3.161896,3.161929,3.162562, 3.163831,3.165651,3.168228,3.171867,3.175790,
3.180868,3.186437,3.192634,3.200292,3.208845)
= 0.4
lambda_opt
ggplot() +
geom_line(aes(x = lambda_grid,
y = residual_values)) +
geom_point(aes(x = lambda_grid,
y = residual_values)) +
geom_point(aes(x = lambda_opt,
y = residual_values[lambda_grid == lambda_opt]),
col = "red") +
labs(x = expression(lambda),
y = "Residual Value",
title = "Optimal Smoothing Based on Residuals") +
theme_minimal()
Here, the ssMRCD with the optimal lambda is calculated.
# get optimal smoothing parameter and estimator
set.seed(123)
= ssMRCD(X = X_data,
ssMRCD_weather groups = weather$year,
weights = W,
lambda = 0.4,
alpha = 0.75)
# collect covariance matrices
= ssMRCD_weather$MRCDcov
COVS
plot(ssMRCD_weather, type = c("ellipses", "convergence"), variables = c("hum", "sun"))
#> $plot_convergence
#>
#> $plot_ellipses
#>
#> $plot_geoellipses
#> NULL
By plotting the mean estimates of the ssMRCD estimator we see clear temporal trends in the data and the different variables, .
# collect means
= do.call(cbind, ssMRCD_weather$MRCDmu)
mu_timeseries colnames(mu_timeseries) = 2001:2023
rownames(mu_timeseries) = colnames(X_data)
# plot means dependent on year
ggplot(mu_timeseries %>%
%>%
data.frame mutate(Var1 = rownames(.)) %>%
::pivot_longer(cols = X2001:X2023) %>%
tidyrgroup_by(Var1) %>%
mutate(value = (value-min(value))/(max(value) - min(value)),
name = as.numeric(gsub("X", "", name)))) +
geom_line(aes(x = name, y = value, group = Var1)) +
geom_smooth(aes(x = name, y = value, group = Var1), se = F) +
facet_wrap(vars(Var1)) +
theme_minimal() +
labs(x = "", y = "")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The covariances from the ssMRCD-estimator can be used as plug-in for the sparse robust multi-source PCA.
To get the optimal distribution of sparsity \(\gamma\) we use the AUC for the path of explained variance and sparsity of entries and groups. The optimal amount of sparsity \(\eta\) is based on the trade-off product optimization (TPO) criterion. The parameter tuning is included in the function when and/or are numeric vectors. Here, only is tuned.
set.seed(12345)
= msPCA(eta = seq(0, 3, 0.25), gamma = 0.5, COVS = COVS, k = p)
pca #> Tuning |=== | 1 parameter combinations of 13Tuning |======= | 2 parameter combinations of 13Tuning |=========== | 3 parameter combinations of 13Tuning |=============== | 4 parameter combinations of 13Tuning |=================== | 5 parameter combinations of 13Tuning |======================= | 6 parameter combinations of 13Tuning |========================== | 7 parameter combinations of 13Tuning |============================== | 8 parameter combinations of 13Tuning |================================== | 9 parameter combinations of 13Tuning |====================================== | 10 parameter combinations of 13Tuning |========================================== | 11 parameter combinations of 13Tuning |============================================== | 12 parameter combinations of 13Tuning |==================================================| 13 parameter combinations of 13
$plot_tpo pca
The optimal parameter values is \(\eta=\) 1.
summary(pca, k = 3)
#>
#> Sparse Multi-Source PCA Summary
#> ==============================================
#> Number of groups (N): 23
#> Number of variables (p): 13
#> Number of components (k): 13
#> Sparsity parameters (gamma, eta): 0.5 , 1
#>
#> Component-wise Summary:
#>
#> PC1:
#> Group | Explained Var (%) | Non-zero Loadings
#> -----------------------------------------------
#> 2001 | 51.05% | 10 / 13
#> 2002 | 56.84% | 10 / 13
#> 2003 | 40.68% | 13 / 13
#> 2004 | 53.00% | 8 / 13
#> 2005 | 52.43% | 10 / 13
#> 2006 | 44.01% | 12 / 13
#> 2007 | 43.87% | 10 / 13
#> 2008 | 41.83% | 9 / 13
#> 2009 | 47.69% | 9 / 13
#> 2010 | 45.79% | 10 / 13
#> 2011 | 33.91% | 13 / 13
#> 2012 | 44.37% | 11 / 13
#> 2013 | 36.62% | 13 / 13
#> 2014 | 38.06% | 13 / 13
#> 2015 | 31.58% | 13 / 13
#> 2016 | 43.72% | 13 / 13
#> 2017 | 39.56% | 13 / 13
#> 2018 | 43.75% | 12 / 13
#> 2019 | 36.66% | 12 / 13
#> 2020 | 30.45% | 13 / 13
#> 2021 | 39.32% | 13 / 13
#> 2022 | 35.62% | 12 / 13
#> 2023 | 38.86% | 13 / 13
#>
#> PC2:
#> Group | Explained Var (%) | Non-zero Loadings
#> -----------------------------------------------
#> 2001 | 21.88% | 13 / 13
#> 2002 | 19.63% | 12 / 13
#> 2003 | 29.19% | 12 / 13
#> 2004 | 20.36% | 12 / 13
#> 2005 | 20.47% | 12 / 13
#> 2006 | 26.38% | 12 / 13
#> 2007 | 26.27% | 13 / 13
#> 2008 | 26.35% | 12 / 13
#> 2009 | 25.37% | 10 / 13
#> 2010 | 25.79% | 10 / 13
#> 2011 | 32.70% | 11 / 13
#> 2012 | 25.82% | 11 / 13
#> 2013 | 31.09% | 11 / 13
#> 2014 | 30.30% | 11 / 13
#> 2015 | 33.98% | 12 / 13
#> 2016 | 26.33% | 12 / 13
#> 2017 | 30.02% | 12 / 13
#> 2018 | 28.08% | 11 / 13
#> 2019 | 28.22% | 11 / 13
#> 2020 | 31.04% | 13 / 13
#> 2021 | 26.74% | 12 / 13
#> 2022 | 27.45% | 11 / 13
#> 2023 | 25.38% | 12 / 13
#>
#> PC3:
#> Group | Explained Var (%) | Non-zero Loadings
#> -----------------------------------------------
#> 2001 | 10.59% | 11 / 13
#> 2002 | 9.12% | 11 / 13
#> 2003 | 11.17% | 13 / 13
#> 2004 | 10.04% | 11 / 13
#> 2005 | 10.03% | 11 / 13
#> 2006 | 10.45% | 11 / 13
#> 2007 | 11.71% | 11 / 13
#> 2008 | 11.44% | 11 / 13
#> 2009 | 9.57% | 13 / 13
#> 2010 | 9.93% | 13 / 13
#> 2011 | 11.37% | 13 / 13
#> 2012 | 9.77% | 13 / 13
#> 2013 | 10.91% | 13 / 13
#> 2014 | 10.31% | 13 / 13
#> 2015 | 11.17% | 13 / 13
#> 2016 | 10.83% | 13 / 13
#> 2017 | 11.59% | 13 / 13
#> 2018 | 9.59% | 13 / 13
#> 2019 | 12.31% | 13 / 13
#> 2020 | 13.32% | 13 / 13
#> 2021 | 12.07% | 13 / 13
#> 2022 | 12.08% | 13 / 13
#> 2023 | 12.67% | 13 / 13
#>
#> Convergence:
#> Converged: Yes Yes Yes Yes Yes Yes Yes No Yes No Yes Yes Yes
#> Steps: 32 41 105 82 72 79 135 200 199 200 33 1 1
#>
#> Use `plot()` to visualize components.
When deciding how many PCs are enough to describe the data well, a scree-plot inspired box-plot scree-plot can visualize the explained variance well for multiple groups. Note, that we adjust the amount of sparsity for further components based on the remaining variance possible to explain.
# plot scree plot
screeplot(x = pca,
text = TRUE,
size = 3,
gnames = 2001:2023,
cutoff = 0.8)
#> [[1]]
#>
#> [[2]]
We select the first three PCs as sufficient, since 80% of overall variance is explained by them.
# optimal number of components
= 3 optimal_k
When it comes to plotting, we also need to align the loadings for a
clear picture. Multiple approaches are possible and implemented in the
function align
with different type
settings.
?align#> Hilfe zum Thema 'align' in den folgenden Paketen gefunden:
#>
#> Package Library
#> pillar C:/Users/puchhammer/AppData/Local/R/win-library/4.5
#> ssMRCD C:/Users/puchhammer/AppData/Local/Temp/RtmpYT5kkg/Rinst2ccc58c746ac
#>
#>
#> Nutze ersten passenden ...
$PC[, 1:optimal_k, ] = align(PC = pca$PC[, 1:optimal_k, ],type = "mean") pca
Loadings can be plotted via the function plot
with
type = "loadings"
. Keep in mind, that each PC for each
source is unique only up to a factor -1. While in theory this is not
remarkable, for comparing results in a sensible manner, the PCs should
be oriented to reflect similarities specifically. The function
align_PC
provides various approach to orient them in a
similar direction and for applications different type
arguments can be explored.
# plot aligned loadings
plot(x = pca,
type = "loadings",
gnames = 2001:2023,
k = 1)
#> $loadings
#> list()
#>
#> $<NA>
plot(x = pca,
type = "loadings",
gnames = 2001:2023,
k = 2)
#> $loadings
#> list()
#>
#> $<NA>
plot(x = pca,
type = "loadings",
gnames = 2001:2023,
k = 3)
#> $loadings
#> list()
#>
#> $<NA>
Also something like a biplot can be visualized to see the differences in the loadings.
biplot(pca)
To improve interpretation of the components, we calculate the scores
as well as the score distance and the orthogonal distance with the
function scores
. You can either use the ssMRCD object, then
the data from the covariance estimation is used and centered
accordingly, or you can specify the input manually.
= scores(X = X_data, groups = weather$year, PC = pca$PC, mu = ssMRCD_weather$MRCDmu, Sigma = ssMRCD_weather$MRCDcov)
sc = scores(PC = pca$PC, ssMRCD = ssMRCD_weather) sc2
Possible visualizations include univariate densities for each group via histograms of the scores.
plot(x = pca, ssMRCD = ssMRCD_weather, type = "scores")
#> $scores
#> list()
#>
#> $<NA>
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The score distance and orthogonal distance are calculated via the
scores
function. Some densities can be shown for each group
similar to Puchhammer, Wilms and Filzmoser (2024).
# distances
= sc$SD
score_sd = sc$OD
score_od
= data.frame(time = weather$time,
dat year = weather$year,
OD = score_od,
SD = score_sd,
groups = weather$year)
= dat %>% tidyr::pivot_longer(cols = c("OD", "SD"))
dat_melt
ggplot(dat_melt %>%
::filter(name %in% c( "SD", "OD"))) +
dplyr::geom_density_ridges_gradient(aes(x = value,
ggridgesy = groups,
fill = as.factor(groups),
group = groups),
scale = 10,
rel_min_height = 0.01,
gradient_lwd = 0.5) +
scale_x_continuous(expand = c(0.03, 0)) +
scale_y_discrete(expand = c(0.0, 0, 0.1, 0)) +
scale_fill_discrete(name = "") +
::theme_ridges(font_size = 13, grid = TRUE) +
ggridgestheme(axis.title.y = element_blank(),
legend.position = "right",
panel.grid.major.y = element_line()
+
) labs(x = "") +
facet_grid(cols = vars(name), scales = "free")
#> Picking joint bandwidth of 1.03e-06
#> Picking joint bandwidth of 1.8
Also, a distance-distance plot can be plotted.
plot(pca, type = "score_distances", ssMRCD = ssMRCD_weather, k = 3)
#> $score_distances
#> list()
#>
#> $<NA>
For more fun, the two distances can be visualized via Gif in a bivariate way for each group separately.
# use function saveGIF
library(animation)
= 2001:2023
param
saveGIF(interval = 0.1,
movie.name = "PCA_scoreDist_bivariate.gif",
expr = {
for (i in param ){
= ggplot(dat %>% dplyr::filter (year %in% seq(i-5, i+5, 1))) +
g geom_density_2d_filled(aes(x = SD,
y = OD),
alpha = 1) +
scale_x_sqrt(limits = c(0,40)) +
ylim(c(0, 6.2)) +
labs(title = i)
print(g)
}} )
GeoSphere Austria (2022): https://data.hub.geosphere.at.
Puchhammer, P., & Filzmoser, P. (2024). Spatially smoothed robust covariance estimation for local outlier detection. Journal of Computational and Graphical Statistics, 33(3), 928-940.https://doi.org/10.1080/10618600.2023.2277875.
Puchhammer, P., Wilms, I., & Filzmoser, P. (2024). Sparse outlier-robust PCA for multi-source data. arXiv preprint arXiv:2407.16299. https://doi.org/10.48550/arXiv.2407.16299.