| Type: | Package |
| Title: | Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances |
| Version: | 0.1.2 |
| Depends: | R (≥ 4.0.0) |
| Maintainer: | Marcus Mayrhofer <marcus.mayrhofer@tuwien.ac.at> |
| Description: | Based on Shapley values to explain multivariate outlyingness and to detect and impute cellwise outliers. Includes implementations of methods described in Mayrhofer and Filzmoser (2023) <doi:10.1016/j.ecosta.2023.04.003>. |
| License: | GPL-3 |
| Imports: | dplyr, Rdpack, stats, tibble, tidyr, robustbase, forcats, egg, ggplot2, gridExtra, RColorBrewer, magrittr |
| Suggests: | grDevices, cellWise, robustHD, knitr, MASS, rmarkdown |
| RdMacros: | Rdpack |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| VignetteBuilder: | knitr |
| Repository: | CRAN |
| NeedsCompilation: | no |
| Packaged: | 2024-10-17 11:38:48 UTC; Marcus Mayrhofer |
| Author: | Marcus Mayrhofer [aut, cre], Peter Filzmoser [aut] |
| Date/Publication: | 2024-10-17 12:00:34 UTC |
ShapleyOutlier: Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances
Description
Based on Shapley values to explain multivariate outlyingness and to detect and impute cellwise outliers. Includes implementations of methods described in Mayrhofer and Filzmoser (2023) doi:10.1016/j.ecosta.2023.04.003.
Author(s)
Maintainer: Marcus Mayrhofer <marcus.mayrhofer@tuwien.ac.at>
Authors:
Peter Filzmoser
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling 'rhs(lhs)'.
Detecting cellwise outliers using Shapley values based on local outlyingness.
Description
The MOE function indicates outlying cells for a numeric data vector x with p entries
or a numeric data matrix x with n \times p entries,
for a given center mu and covariance matrix Sigma, using the Shapley value.
It is a more sophisticated alternative to the SCD algorithm:
it uses the information of the regular cells to derive an alternative reference point (Mayrhofer and Filzmoser 2023).
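One way to picture a reference point built from the regular cells is the standard Gaussian conditional mean, which predicts flagged cells from the remaining ones. The following base R lines are a minimal sketch under that assumption; the helper name predict_flagged_cells and the flagged index set are hypothetical, and the sketch is not claimed to reproduce what MOE computes internally.

```r
# Hypothetical helper (not part of the package): predict the flagged cells
# from the regular cells with the Gaussian conditional mean
#   mu_S + Sigma_SR %*% solve(Sigma_RR) %*% (x_R - mu_R).
predict_flagged_cells <- function(x, mu, Sigma, flagged) {
  regular <- setdiff(seq_along(x), flagged)
  if (length(flagged) == 0) return(x)   # nothing to predict
  if (length(regular) == 0) return(mu)  # no regular cells: fall back to mu
  x_hat <- x
  x_hat[flagged] <- mu[flagged] + as.vector(
    Sigma[flagged, regular, drop = FALSE] %*%
      solve(Sigma[regular, regular, drop = FALSE], x[regular] - mu[regular])
  )
  x_hat
}

p <- 5
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
x <- c(0, 1, 2, 2.3, 2.5)

# Assume, for illustration only, that cells 1 and 2 were flagged.
x_hat <- predict_flagged_cells(x, mu, Sigma, flagged = c(1, 2))
x_hat
```

The regular cells are returned unchanged; only the flagged cells are replaced by their prediction.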
Usage
MOE(
x,
mu,
Sigma,
Sigma_inv = NULL,
step_size = 0.1,
min_deviation = 0,
max_step = NULL,
local = TRUE,
max_iter = 1000,
q = 0.99,
check_outlyingness = FALSE,
check = TRUE,
cells = NULL,
method = "cellMCD"
)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
local |
Logical. If TRUE (default), the non-central Chi-Squared distribution is used to determine the cutoff value based on |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
check_outlyingness |
Logical, FALSE by default according to the Usage above. If TRUE, the outlyingness is rechecked after applying |
check |
Logical. If |
cells |
Either |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
Value
A list of class shapley_algorithm (new_shapley_algorithm) containing the following:
x |
A |
phi |
A |
mu_tilde |
A |
x_original |
A |
non_centrality |
The non-centrality parameters for the Chi-Squared distribution |
x_history |
A list with |
phi_history |
A list with |
mu_tilde_history |
A list with |
S_history |
A list with |
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
# Single observation
p <- 5
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
Sigma_inv <- solve(Sigma)
x <- c(0, 1, 2, 2.3, 2.5)
MOE_x <- MOE(x = x, mu = mu, Sigma = Sigma)
plot(MOE_x)
# Multiple observations, with 100 randomly chosen cells replaced by -5 or 5
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
X <- mvrnorm(n, mu, Sigma)
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5, 5), 50)
MOE_X <- MOE(X, mu, Sigma)
plot(MOE_X, subset = 20)
Detecting cellwise outliers using Shapley values.
Description
The SCD function indicates outlying cells for a numeric data vector x with p entries
or a numeric data matrix x with n \times p entries,
for a given center mu and covariance matrix Sigma, using the Shapley value (Mayrhofer and Filzmoser 2023).
Usage
SCD(
x,
mu,
Sigma,
Sigma_inv = NULL,
step_size = 0.1,
min_deviation = 0,
max_step = NULL,
max_iter = 1000,
q = 0.99,
method = "cellMCD",
check = TRUE,
cells = NULL
)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
Value
A list of class shapley_algorithm (new_shapley_algorithm) containing the following:
x |
A |
phi |
A |
x_original |
A |
x_history |
The path of how the original data vector was modified. |
phi_history |
The Shapley values corresponding to |
S_history |
The indices of the outlying cells in each iteration. |
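The interplay of step_size, max_iter, q and the history entries can be pictured with a deliberately simplified loop. The sketch below is an illustration only and not the SCD algorithm of the package: it assumes the closed form (x_j - mu_j) * [Sigma^{-1} (x - mu)]_j for the Shapley value of cell j, and in every iteration it shrinks the single cell with the largest contribution toward mu. The function name shrink_largest_cell is hypothetical.

```r
# Simplified illustration only; NOT the package's SCD implementation.
# Assumption: the Shapley value of cell j is (x_j - mu_j) * [Sigma^{-1} (x - mu)]_j.
shrink_largest_cell <- function(x, mu, Sigma, step_size = 0.1,
                                max_iter = 1000, q = 0.99) {
  Sigma_inv <- solve(Sigma)
  cutoff <- qchisq(q, df = length(x))  # central chi-square quantile
  shapley_values <- function(z) {
    d <- z - mu
    as.vector(d * (Sigma_inv %*% d))
  }
  x_original <- x
  x_history <- list(x)   # path of the modified vector, starting at the input
  S_history <- list()    # index of the modified cell in each iteration
  iter <- 0
  phi <- shapley_values(x)
  # sum(phi) equals the squared Mahalanobis distance of the current x
  while (sum(phi) > cutoff && iter < max_iter) {
    iter <- iter + 1
    j <- which.max(phi)                        # cell with the largest contribution
    x[j] <- x[j] - step_size * (x[j] - mu[j])  # move that cell toward the center
    phi <- shapley_values(x)
    x_history[[iter + 1]] <- x
    S_history[[iter]] <- j
  }
  list(x = x, phi = phi, x_original = x_original,
       x_history = x_history, S_history = S_history, cutoff = cutoff)
}

p <- 5
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
res <- shrink_largest_cell(c(0, 1, 2, 2.3, 2.5), mu, Sigma)
length(res$S_history)  # number of iterations that were needed
```

The loop stops as soon as the squared Mahalanobis distance of the modified vector no longer exceeds the chi-square quantile, or when max_iter is reached.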
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
SCD_x <- SCD(x = x, mu = mu, Sigma = Sigma)
plot(SCD_x)
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
SCD_X <- SCD(X, mu, Sigma)
plot(SCD_X, subset = 20)
Weather data from Vienna
Description
Monthly data from the Hohe Warte weather station in Vienna, recorded since April 1872 (Stadt Wien 2022).
Usage
WeatherVienna
Format
A data frame with 1,804 rows and 25 columns:
year |
Year |
month |
Month |
t |
Daily mean air temperature in °C: (t7 mean + t19 mean + tmax mean + tmin mean)/4; before 1971: (t7 mean + t14 mean + 2 x t21 mean)/4 |
t_max |
Absolute maximum air temperature in °C |
t_min |
Absolute minimum air temperature in °C |
avg_t_max |
Mean daily maximum air temperature in °C |
avg_t_min |
Mean daily minimum air temperature in °C |
num_frost |
Number of frost days (days with a temperature minimum tmin < 0.0 °C) |
num_ice |
Number of ice days (days with a temperature maximum tmax < 0.0 °C) |
num_summer |
Number of summer days (days with a temperature maximum tmax >= 25.0 °C) |
num_heat |
Number of hot days (days with a temperature maximum tmax >= 30.0 °C) |
p |
Daily mean air pressure in hPa (mean of all measurements at 7 a.m., 2 p.m., 7 p.m. CET; before 1971 9 p.m. instead of 7 p.m.) |
p_max |
Maximum air pressure in hPa (maximum of all measurements at 7 a.m., 2 p.m., 7 p.m. CET; before 1971 9 p.m. instead of 7 p.m.) |
p_min |
Minimum air pressure in hPa (minimum of all measurements at 7 a.m., 2 p.m., 7 p.m. CET; before 1971 9 p.m. instead of 7 p.m.) |
sun_h |
Monthly total sunshine duration in hours |
num_clear |
Number of clear days (daily mean cloudiness < 20/100) |
num_cloud |
Number of cloudy days (daily mean cloudiness > 80/100) |
rel_hum |
Daily mean relative humidity in percent: (2 x RH7 mean + RH14 mean + RH19 mean)/4; before 1971 9 p.m. instead of 7 p.m. |
rel_hum_max |
Relative humidity maximum in percent |
rel_hum_min |
Relative humidity minimum in percent |
wind_v |
Monthly average wind speed in km/h |
num_wind_v60 |
Number of days with wind peaks >= 60 km/h |
wind_v_max |
Maximum wind speed in km/h |
precp_sum |
Monthly total precipitation in mm |
num_precp_01 |
Number of days with precipitation >= 0.1 mm |
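The two definitions of the daily mean temperature t can be made concrete with a short calculation. The readings below are hypothetical values chosen for illustration, and the division by 4 in the pre-1971 formula is an assumption made here so that both variants are averages of four terms.

```r
# Hypothetical readings in degrees Celsius, for illustration only.
t7 <- 12.0; t14 <- 19.0; t19 <- 16.0; t21 <- 14.0
t_max <- 20.0; t_min <- 9.0

# Current definition: mean of the 7 a.m. and 7 p.m. readings and the daily extremes.
t_mean_current <- (t7 + t19 + t_max + t_min) / 4

# Definition before 1971 (assumed divisor 4): the 9 p.m. reading is counted twice.
t_mean_before_1971 <- (t7 + t14 + 2 * t21) / 4

c(current = t_mean_current, before_1971 = t_mean_before_1971)
```

For the same day the two definitions generally give different values, which is worth keeping in mind when comparing t across the 1971 boundary.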
Source
The data were downloaded from https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien in September 2022.
References
Stadt Wien (2022). “Monthly data from the weather station Hohe Warte since April 1872 - Vienna.” https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien.
Examples
data("WeatherVienna")
summary(WeatherVienna)
Class constructor for class shapley.
Description
This function creates an object of class shapley that is returned by the shapley function.
Usage
new_shapley(phi = numeric(), mu_tilde = NULL, non_centrality = NULL)
Arguments
phi |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
Value
Named list of class shapley, containing the input parameters.
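The pattern of a constructor that returns a named list carrying a class attribute can be sketched in base R as follows. The function sketch_new_shapley and the class name shapley_sketch are hypothetical stand-ins chosen to avoid clashing with the package; this is not the package's source code.

```r
# Hypothetical stand-in for an S3 constructor; not the package's source code.
sketch_new_shapley <- function(phi = numeric(), mu_tilde = NULL,
                               non_centrality = NULL) {
  stopifnot(is.numeric(phi))
  structure(
    list(phi = phi, mu_tilde = mu_tilde, non_centrality = non_centrality),
    class = "shapley_sketch"
  )
}

obj <- sketch_new_shapley(phi = c(0.5, 1.2, 3.1))
class(obj)
names(obj)
```

Optional entries that are not supplied stay in the list as NULL, so downstream methods can test for them with is.null().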
Class constructor for class shapley_algorithm.
Description
This function creates an object of class shapley_algorithm that is returned
by the SCD and MOE functions.
Usage
new_shapley_algorithm(
x = numeric(),
phi = numeric(),
x_original = numeric(),
mu_tilde = NULL,
non_centrality = NULL,
x_history = NULL,
phi_history = NULL,
mu_tilde_history = NULL,
S_history = NULL
)
Arguments
x |
A |
phi |
A |
x_original |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
x_history |
Optional. A list with |
phi_history |
Optional. A list with |
mu_tilde_history |
Optional. A list with |
S_history |
Optional. A list with |
Value
Named list of class shapley_algorithm, containing the input parameters.
Class constructor for class shapley_interaction.
Description
This function creates an object of class shapley_interaction that is returned
by the shapley_interaction function.
Usage
new_shapley_interaction(PHI = numeric())
Arguments
PHI |
A |
Value
Matrix of class shapley_interaction, containing the input matrix PHI.
Barplot of Shapley values
Description
Barplot of Shapley values
Usage
## S3 method for class 'shapley'
plot(
x,
subset = NULL,
chi2.q = 0.99,
abbrev.var = 3,
abbrev.obs = 10,
sort.var = FALSE,
sort.obs = FALSE,
plot_md = TRUE,
md_squared = TRUE,
rotate_x = TRUE,
...
)
Arguments
x |
A list of class |
subset |
Either an integer, |
chi2.q |
Quantile, only used if |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
plot_md |
Logical. If |
md_squared |
Logical. If |
rotate_x |
Logical. If |
... |
Optional arguments passed to methods. |
Value
Returns a barplot that displays the Shapley values (shapley) for each observation and, optionally (plot_md = TRUE),
includes the squared Mahalanobis distance (black bar) and the corresponding (non-)central chi-square quantile (dotted line).
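The two reference quantities of the plot, the squared Mahalanobis distance and a chi-square quantile, can be computed directly with the stats package. This is a minimal sketch; the non-centrality value ncp = 2 is an arbitrary illustrative choice and not a value produced by the package.

```r
p <- 5
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
x <- c(0, 1, 2, 2.3, 2.5)

# Squared Mahalanobis distance of x with respect to mu and Sigma.
md2 <- mahalanobis(x, center = mu, cov = Sigma)

# Central chi-square quantile with p degrees of freedom (chi2.q = 0.99).
cutoff_central <- qchisq(0.99, df = p)

# Non-central variant; ncp = 2 is an arbitrary value for illustration.
cutoff_noncentral <- qchisq(0.99, df = p, ncp = 2)

md2 > cutoff_central  # TRUE means the observation exceeds the cutoff
```

For a fixed probability, the non-central quantile is larger than the central one, so the corresponding cutoff is more permissive.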
Examples
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X_clean <- X
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
call_shapley <- shapley(X, mu, Sigma)
plot(call_shapley, subset = 1:20)
plot(call_shapley, subset = 5, rotate_x = FALSE)
plot(call_shapley, subset = 5, md_squared = FALSE, rotate_x = FALSE)
Barplot and tileplot of Shapley values.
Description
Barplot and tileplot of Shapley values.
Usage
## S3 method for class 'shapley_algorithm'
plot(
x,
type = "both",
subset = NULL,
abbrev.var = FALSE,
abbrev.obs = FALSE,
sort.var = FALSE,
sort.obs = FALSE,
n_digits = 2,
rotate_x = TRUE,
continuous_rowname = FALSE,
...
)
Arguments
x |
A list of class |
type |
Either |
subset |
Either an integer, |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
n_digits |
Integer. If |
rotate_x |
Logical. If |
continuous_rowname |
Logical. If |
... |
Arguments passed on to |
Value
Returns plots for a list of class shapley_algorithm.
If type is "bar", a barplot is generated. It displays the Shapley values (shapley)
for each observation and optionally (plot_md = TRUE) includes the squared Mahalanobis distance (black bar)
and the corresponding (non-)central chi-square quantile (dotted line).
If type is "cell", a tileplot is generated. It displays each cell of the dataset and shows the original value from the observations;
the color coding indicates whether those values were higher (red) or lower (blue) than the imputed values,
and the color intensity is based on the magnitude of the Shapley value.
If type is "both", both the barplot and the tileplot are generated.
Examples
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
MOE_X <- MOE(X, mu, Sigma)
plot(MOE_X, subset = 20, n_digits = 0)
Plot of Shapley interaction indices
Description
Plot of Shapley interaction indices
Usage
## S3 method for class 'shapley_interaction'
plot(
x,
abbrev = 4,
title = "Shapley Interaction",
legend = TRUE,
text_size = 22,
...
)
Arguments
x |
A |
abbrev |
Integer. If |
title |
Character. Title of the plot. |
legend |
Logical. If TRUE (default), a legend is plotted. |
text_size |
Integer. Size of the text in the plot |
... |
Optional arguments passed to methods. |
Value
Returns a figure consisting of two panels. The upper panel shows the Shapley values, and the lower panel the Shapley interaction indices.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
PHI <- shapley_interaction(x, mu, Sigma)
plot(PHI)
Print function for class shapley.
Description
Print function for class shapley.
Usage
## S3 method for class 'shapley'
print(x, ...)
Arguments
x |
List of class |
... |
Optional arguments passed to methods. |
Value
Prints the list entries of x that are not NULL.
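Printing only the non-NULL entries of a classed list can be sketched with Filter() in base R. The class name shapley_sketch and the method below are hypothetical and only illustrate the behavior described above; they are not the package's source code.

```r
# Hypothetical print method for a toy class; not the package's source code.
print.shapley_sketch <- function(x, ...) {
  # Drop the class, then keep only the entries that are not NULL.
  entries <- Filter(Negate(is.null), unclass(x))
  for (name in names(entries)) {
    cat(name, ":\n", sep = "")
    print(entries[[name]], ...)
  }
  invisible(x)
}

obj <- structure(list(phi = c(0.5, 1.2), mu_tilde = NULL),
                 class = "shapley_sketch")
print(obj)
```

Returning the object invisibly follows the usual convention for print methods, so the call can still be used inside a pipeline.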
Print function for class shapley_algorithm.
Description
Print function for class shapley_algorithm.
Usage
## S3 method for class 'shapley_algorithm'
print(x, ...)
Arguments
x |
List of class |
... |
Optional arguments passed to methods. |
Value
Prints the imputed data and the Shapley values.
Print function for class shapley_interaction.
Description
Print function for class shapley_interaction.
Usage
## S3 method for class 'shapley_interaction'
print(x, ...)
Arguments
x |
Matrix of class |
... |
Optional arguments passed to methods. |
Value
Prints the Shapley interaction indices.
Decomposition of squared Mahalanobis distance using Shapley values.
Description
The shapley function computes a p-dimensional vector containing the decomposition of the
squared Mahalanobis distance of x (with respect to mu and Sigma)
into outlyingness contributions of the individual variables (Mayrhofer and Filzmoser 2023).
The j-th coordinate of this vector is the
average marginal contribution of the j-th variable to the squared Mahalanobis distance of
the individual observation x.
If cells is provided, the Shapley values of x are computed with respect to a local reference point
that is based on a cellwise prediction of each coordinate, using the information of the regular cells of x; see Mayrhofer and Filzmoser (2023).
If x is an n \times p matrix, an n \times p matrix is returned, containing the decomposition for each row.
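The decomposition can be checked numerically in base R. The sketch below assumes the closed form phi_j = (x_j - mu_j) * [Sigma^{-1} (x - mu)]_j for the global reference point mu; the function name shapley_sketch is hypothetical, and the local variant with cells is not covered.

```r
# Hypothetical helper, assuming the closed form
#   phi_j = (x_j - mu_j) * [Sigma^{-1} (x - mu)]_j.
shapley_sketch <- function(x, mu, Sigma) {
  d <- x - mu
  as.vector(d * solve(Sigma, d))
}

p <- 5
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
x <- c(0, 1, 2, 2.3, 2.5)

phi <- shapley_sketch(x, mu, Sigma)
phi

# The contributions add up to the squared Mahalanobis distance.
all.equal(sum(phi), mahalanobis(x, mu, Sigma))
```

Individual contributions can be negative: a coordinate that is smaller than its neighbours under strong positive correlation lowers the overall distance.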
Usage
shapley(
x,
mu = NULL,
Sigma = NULL,
inverted = FALSE,
method = "cellMCD",
check = TRUE,
cells = NULL
)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
Value
phi |
A |
mu_tilde |
A |
non_centrality |
The non-centrality parameters for the Chi-Squared distribution, given by |
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
## Without outlying cells as input in the 'cells' argument
# Single observation
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
shapley(x, mu, Sigma)
phi <- shapley(x, mu, Sigma_inv, inverted = TRUE)
plot(phi)
# Multiple observations
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X_clean <- X
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
call_shapley <- shapley(X, mu, Sigma)
plot(call_shapley, subset = 20)
## Giving outlying cells as input in the 'cells' argument
# Single observation
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
call_shapley <- shapley(x, mu, Sigma_inv, inverted = TRUE,
method = "cellMCD", check = TRUE, cells = c(1,1,0,0,0))
plot(call_shapley)
# Multiple observations
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X_clean <- X
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
call_shapley <- shapley(X, mu, Sigma, cells = (X_clean - X)!=0)
plot(call_shapley, subset = 20)
Decomposition of squared Mahalanobis distance using Shapley interaction indices.
Description
The shapley_interaction function computes a p \times p matrix
containing pairwise outlyingness scores based on Shapley interaction indices.
It decomposes the squared Mahalanobis distance of x (with respect to mu and Sigma)
into outlyingness contributions of pairs of variables (Mayrhofer and Filzmoser 2023).
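A pairwise view of the squared Mahalanobis distance can be sketched with the matrix D Sigma^{-1} D, where D is the diagonal matrix of the centered coordinates. Its entries sum to the squared distance and its row sums give per-variable contributions. The scaling and diagonal convention used by shapley_interaction may differ, so this should be read as an illustration of the idea rather than a reimplementation.

```r
p <- 5
mu <- rep(0, p)
Sigma <- matrix(0.9, p, p); diag(Sigma) <- 1
x <- c(0, 1, 2, 2.3, 2.5)

d <- x - mu
# Entry (j, k) is (x_j - mu_j) * [Sigma^{-1}]_{jk} * (x_k - mu_k).
pairwise <- diag(d) %*% solve(Sigma) %*% diag(d)

# All entries together recover the squared Mahalanobis distance.
all.equal(sum(pairwise), mahalanobis(x, mu, Sigma))

# Row sums give one contribution per variable.
rowSums(pairwise)
```

Off-diagonal entries show which pairs of variables jointly raise or lower the distance.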
Usage
shapley_interaction(x, mu, Sigma, inverted = FALSE)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
Value
A p \times p matrix containing the decomposition of the squared Mahalanobis distance of x
into outlyingness scores for pairs of variables with respect to mu and Sigma.
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
shapley_interaction(x, mu, Sigma)
PHI <- shapley_interaction(x, mu, Sigma_inv, inverted = TRUE)
plot(PHI)