Tensor Gaussian graphical models (GGMs), interpreting conditional independence structures within tensor data, have important applications in numerous areas. Yet, the available tensor data in one single study is often limited due to high acquisition costs. Although relevant studies can provide additional data, it remains an open question how to pool such heterogeneous data. In this paper, we propose a transfer learning framework for tensor GGMs, which takes full advantage of informative auxiliary domains even when non-informative auxiliary domains are present, benefiting from the carefully designed data-adaptive weights. Our theoretical analysis shows substantial improvement of estimation errors and variable selection consistency on the target domain under much relaxed conditions, by leveraging information from auxiliary domains.
Suppose that besides observations \(\{ \boldsymbol{\mathcal{X}}_i\}_{i=1}^n\) from the target domain, observations \(\{ \boldsymbol{\mathcal{X}}_i^{(k)}\}_{i=1}^{n_k}\); \(k \in [K]\) from some auxiliary domains are also available. For example, in the ADHD brain functional network dataset, \(\{ \boldsymbol{\mathcal{X}}_i\}_{i=1}^n\) are the dynamic activation levels of many brain regions of interests collected from some fMRI scans at one neuroscience institute, and \(\{ \boldsymbol{\mathcal{X}}_i^{(k)}\}_{i=1}^{n_k}\) are collected from \(K=6\) other neuroscience institutes for better data analysis in the target institute. That is, $ _i$’s are independently generated from \(\mathrm{TN}(\boldsymbol{0}; \boldsymbol{\Sigma}_1, \cdots, \boldsymbol{\Sigma}_M)\) and \(\boldsymbol{\mathcal{X}}_i^{(k)}\)’s are independently generated from $ (; _1^{(k)}, , _M^{(k)})$ with \(\Sigma_m \in \mathbb{R}^{p_m \times p_m}\) and \(\boldsymbol{\Sigma}_m^{(k)} \in \mathbb{R}^{p_m \times p_m}\). Particularly, we are interested in estimating the precision matrix \(\boldsymbol{\Omega}_m = (\boldsymbol{\Sigma}_m)^{-1}\) in the target domain for \(m\in [M]\) via transfer learning on the tensor GGMs.
Define the divergence matrix as \(\boldsymbol{\Delta}_{m}^{(k)} = \boldsymbol{\Omega}_m \boldsymbol{\Sigma}_m^{(k)} - \boldsymbol{I}_{p_m}\), where \(\boldsymbol{I}_{p_m}\) is the \(p_m\)-dimensional identity matrix. Clearly, it gets closer to \(\boldsymbol{0}\) when \(\boldsymbol{\Sigma}_m^{(k)}\) gets closer to \(\boldsymbol{\Sigma}_m\), and thus it provides a natural measure of the similarity between \(\boldsymbol{\Sigma}_m^{(k)}\) and \(\boldsymbol{\Sigma}_m\). To leverage information of all auxiliary domains, we consider the weighted average of the covariance and divergence matrices as follows, \[\begin{equation}\nonumber \begin{aligned} & \boldsymbol{\Sigma}_m^{\mathcal{A}} = \sum_{k=1}^{K} \alpha_k \boldsymbol{\Sigma}_m^{(k)} \text{ and } \boldsymbol{\Delta}_{m} = \sum_{k=1}^{K} \alpha_k \boldsymbol{\Delta}_m^{(k)}, \text{ with } \sum_{k=1}^{K} \alpha_k = 1, \end{aligned} \end{equation}\] where the choice of weights \(\{\alpha_k\}_{k=1}^{K}\) shall depend on the contribution of each auxiliary domain and will be discussed in details in Section \(\ref{weights}\). Also, it holds true that \(\boldsymbol{\Omega}_m \boldsymbol{\Sigma}_m^{\mathcal{A}} - \boldsymbol{\Delta}_{m} - \boldsymbol{I}_{p_m} = \boldsymbol{0}\).
For each mode, a multi-step method can be proposed to realize the transfer learning of tensor graphical models.
Step 1. Initialization. Estimate \(\{ \widehat{\boldsymbol{\Omega}}^{(0)}_m \}_{m=1}^{M}\) based on target samples ${ i }{i=1}^{n} $, and \(\{ \widehat{\boldsymbol{\Omega}}_{m}^{(k)} \}_{m=1}^{M}\) based on auxiliary samples \(\{ \boldsymbol{\mathcal{X}}_i^{(k)}\}_{i=1}^{n_k}\), for \(k \in [K]\), using the separable estimation approach ``Tlasso”. Then, define \[\begin{equation}\nonumber \begin{aligned} & \widehat{\boldsymbol{\Sigma}}_m^{\mathcal{A}} = \sum_{k=1}^{K} \alpha_k \widehat{\boldsymbol{\Sigma}}_m^{(k)}, \ \ \text{ where } \widehat{\boldsymbol{\Sigma}}_m^{(k)} = \frac{p_m}{n_k p} \sum_{i=1}^{n_k}\widehat{\boldsymbol{V}}_{i,m}^{(k)} \widehat{\boldsymbol{V}}_{i,m}^{(k) \top}, \\ & \widehat{\boldsymbol{V}}_{i,m}^{(k)} = [ \boldsymbol{\mathcal{X}}_{i}^{(k)}]_{(m)} \left[ ( \widehat{\boldsymbol{\Omega}}_M^{(k)} )^{1/2} \otimes \cdots \otimes ( \widehat{\boldsymbol{\Omega}}_{m+1}^{(k)} )^{1/2} \otimes (\widehat{\boldsymbol{\Omega}}_{m-1}^{(k)})^{1/2} \otimes \cdots \otimes (\widehat{\boldsymbol{\Omega}}_{1}^{(k)})^{1/2} \right]. \end{aligned} \end{equation}\]
Step 2. For each \(m \in [M]\), perform the following two estimation steps separately.
(a). Estimate the divergence matrix of mode-\(m\), \[\begin{equation} \widehat{\boldsymbol{\Delta}}_{m} = \arg \min \mathcal{Q}_{1} (\boldsymbol{\Delta}_m ), \end{equation}\] where \(\mathcal{Q}_{1} (\boldsymbol{\Delta}_{m} ) = \frac{1}{2} \operatorname{tr} \{ \boldsymbol{\Delta}_{m}^{\top} \boldsymbol{\Delta}_{m} \}-\operatorname{tr} \left\{ ( \widehat{\boldsymbol{\Omega}}_m^{(0)} \widehat{\boldsymbol{\Sigma}}_m^{\mathcal{A}} - \boldsymbol{I}_{p_m} ) ^{\top} \boldsymbol{\Delta}_{m} \right\}+ \lambda_{1m} \| \boldsymbol{\Delta}_{m} \|_1\).
(b). Estimate the precision matrix of mode-$m$,\[\begin{equation} \widehat{\boldsymbol{\Omega}}_{m} = \arg \min \mathcal{Q}_{2} ( \boldsymbol{\Omega}_{m} ), \end{equation}\] where \(\mathcal{Q}_{2} ( \boldsymbol{\Omega}_{m}) = \frac{1}{2} \operatorname{tr} \{ \boldsymbol{\Omega}_{m}^{\top} \widehat{\boldsymbol{\Sigma}}_m^{\mathcal{A}} \boldsymbol{\Omega}_{m} \} -\operatorname{tr} \{ ( \widehat{\boldsymbol{\Delta}}_{m}^{\top}+ \boldsymbol{I}_{p_m} ) \boldsymbol{\Omega}_{m} \} + \lambda_{2m} \| \boldsymbol{\Omega}_{m} \|_{1, \mathrm{off}}\).
Moreover, the similarity between the target and auxiliary domains may be weak in some scenarios, so that the learning performance in the target domain may be deteriorated due to information transfer, which is so-called ``negative transfer”. One practical solution is to further perform a model selection step following, which guarantees that transfer learning is no less effective than using only the target domain. To this end, the data from the target domain can be randomly split into two folds \(\mathcal{N}\) and \(\mathcal{N}^C\), satisfying \(\mathcal{N} \cup \mathcal{N}^C = \{ 1, \cdots, n \}\) and \(\text{card}(\mathcal{N}) = cn\), for some constant \(0 < c <1\). The value of \(c\) is not sensitive, and we set \(c=0.6\) in all numerical experiments. The subjects in \(\mathcal{N}\) are used to construct the initialization of the separable transfer estimation in Step 1. The selection step is performed based on subjects in \(\mathcal{N}^C\). Specifically, based on \(\{ \widetilde{\boldsymbol{\Omega}}_{m}^{(0)} \}_{m=1}^{M}\) estimated using subjects in \(\mathcal{N}^C\), for \(j = 1, \cdots, p_m\), define \(\widetilde{\boldsymbol{\Sigma}}_m = \frac{p_m}{ (1-c)n p} \sum_{i \in \mathcal{N}^C} \widetilde{\boldsymbol{V}}_{i,m} \widetilde{\boldsymbol{V}}_{i,m}^{\top}\), \(\widetilde{\boldsymbol{V}}_{i,m} = [\boldsymbol{\mathcal{X}}_{i}]_{(m) } \left[ ( \widetilde{\boldsymbol{\Omega}}_M^{(0)} )^{1/2} \otimes \cdots \otimes ( \widetilde{\boldsymbol{\Omega}}_{m+1}^{(0)} )^{1/2} \otimes (\widetilde{\boldsymbol{\Omega}}_{m-1}^{(0)} )^{1/2} \otimes \cdots \otimes (\widetilde{\boldsymbol{\Omega}}_{1}^{(0)} )^{1/2} \right]\), and \[\begin{equation}\nonumber \begin{aligned} & \widehat{w}_{m,j} = \underset{ w \in \{ (0,1)^{\top}, (1,0)^{\top} \} }{\arg\min} \| \widetilde{\boldsymbol{\Sigma}}_m ( \widehat{\boldsymbol{\Omega}}^{(0)}_{m(j)}, \widehat{\boldsymbol{\Omega}}_{m(j)} ) w - \boldsymbol{I}_{p_m(j)} \|_2^2, \end{aligned} \end{equation}\] where \(\widehat{\boldsymbol{\Omega}}^{(0)}_{m(j)}\), \(\widehat{\boldsymbol{\Omega}}_{m(j)}\), and \(\boldsymbol{I}_{p_m(j)}\) are the \(j\)-th columns of \(\widehat{\boldsymbol{\Omega}}^{(0)}_{m}\), \(\widehat{\boldsymbol{\Omega}}_{m}\), and \(\boldsymbol{I}_{p_m}\), respectively. Then the final estimate becomes %\(\widehat{\boldsymbol{\Omega}}_{m(j)} = (\widehat{\boldsymbol{\Omega}}^{(0)}_{m(j)}, \widehat{\boldsymbol{\Omega}}_{m(j)}) \widehat{w}_{m,j}\). \[\begin{equation}\label{omega_obj} \widehat{\boldsymbol{\Omega}}_{m(j)}^{(f)} = (\widehat{\boldsymbol{\Omega}}^{(0)}_{m(j)}, \widehat{\boldsymbol{\Omega}}_{m(j)}) \widehat{w}_{m(j)}. \end{equation}\] The selection step realizes a model selection between the \(\widehat{\boldsymbol{\Omega}}^{(0)}_{m(j)}\) and \(\widehat{\boldsymbol{\Omega}}_{m(j)}\), which yields satisfactory theoretical and numerical performance. Note that \(\widehat{\boldsymbol{\Omega}}_{m}^{(f)}\) is not symmetric in general, and \((\widehat{\boldsymbol{\Omega}}_{m}^{(f)} + [\widehat{\boldsymbol{\Omega}}_{m}^{(f)}]^{\top}) / 2\) can be used as a symmetric estimate. Furthermore, it can be theoretically guaranteed that the final estimate is positive definite.
For the weights on auxiliary domains, a natural choice of is to set \[\begin{equation} \widehat{\boldsymbol{\Sigma}}_m^{\mathcal{A}} = \sum_{k=1}^{K} \alpha_k \widehat{\boldsymbol{\Sigma}}_m^{(k)}, \ \mbox{with} \ \alpha_k = n_k / N \ \mbox{and} \ N = \sum_{k=1}^{K} n_k, \end{equation}\] following from the fact that the auxiliary domain with larger sample size shall be more important. Yet, it does not take into account the similarities between the target and auxiliary domains. If there are some large non-informative auxiliary domains, although the final model selection step can guarantee that transfer learning is no less effective than using the target domain only, it may also offset the potential improvement benefiting from the informative auxiliary domains with positive impact.
To address this challenge, we further design some data-adaptive weights for auxiliary covariance matrices, in which weights are constructed combining both sample sizes and the estimated differences between the target and auxiliary domains. Particularly, we set \[\begin{equation} \widehat{\boldsymbol{\Sigma}}_m^{\mathcal{A}} = \sum_{k=1}^{K} \alpha_k \widehat{\boldsymbol{\Sigma}}_m^{(k)}, \text{ with } \alpha_k = \frac{n_k / \widehat{h}_k}{\sum_{k=1}^{K} (n_k / \widehat{h}_k)}, \end{equation}\] where \(\widehat{h}_k = \max_{m \in [M]} \| \widehat{\boldsymbol{\Delta}}_{m}^{(k)} \|_{1, \infty}\) and \(\widehat{\boldsymbol{\Delta}}_{m}^{(k)} = \widehat{\boldsymbol{\Omega}}_m^{(0)} \widehat{\boldsymbol{\Sigma}}_m^{(k)} - \boldsymbol{I}_{p_m}\). Clearly, for auxiliary domains with similar sample size, the weight for the one with smaller difference from the target domain is larger. Here we note that the type of norm for measuring similarity is not critical, and the specified \(L_1\)-norm is only for keeping with the form on theoretical analysis and may be replaced by other norms with slight modification. It is also interesting to note that even with such data-adaptive weights, the model selection step is still necessary to safeguard the extreme case where all the auxiliary domains are non-informative.
# example.data
library(TransTGGM)
library(Tlasso)
data(example.data)
t.data = example.data$t.data
A.data = example.data$A.data
t.Omega.true.list = example.data$t.Omega.true.list
normalize = T
K = length(A.data)
p.vec = dim(t.data)
M = length(p.vec) - 1
n = p.vec[M+1]
p.vec = p.vec[1:M]
tla.lambda = 20*sqrt( p.vec*log(p.vec) / ( n * prod(p.vec) ))
A.lambda = list()
for (k in 1:K) {
  A.lambda[[k]] = 20*sqrt( log(p.vec) / ( dim(A.data[[k]])[M+1] * prod(p.vec) ))
}
# the proposed method
res.final = tensor.GGM.trans(t.data, A.data, A.lambda, normalize = normalize)
# Tlasso
Tlasso.Omega.list = Tlasso.fit(t.data, lambda.vec = tla.lambda, norm.type = 1+as.numeric(normalize))
# summary
i.Omega = as.data.frame(t(unlist(est.analysis(res.final$Omega.list, t.Omega.true.list))))
i.Omega.diff = as.data.frame(t(unlist(est.analysis(res.final$Omega.list.diff, t.Omega.true.list))))
i.Tlasso = as.data.frame(t(unlist(est.analysis(Tlasso.Omega.list, t.Omega.true.list))))
i.Omega.diff     # proposed.v
i.Omega          # proposed
i.Tlasso         # Tlasso