Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data

Yuan Huang, Jin Liu, Huangdi Yi, Ben Chang Shia, Shuangge Ma

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis uses information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets – or equivalently, similarity of model sparsity structures – across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right-censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance.
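The two ideas in the abstract, boosting that stays sparse and comparing sparsity structures across datasets, can be illustrated with a minimal sketch. This is not the authors' method: it uses plain componentwise L2 boosting on an uncensored response, applies BIC only to choose the stopping iteration, counts selected markers as a crude degrees-of-freedom proxy, and omits the HDBIC criterion, the accelerated failure time model, and the similarity-promoting penalty. The names `componentwise_boost` and `jaccard`, and the use of Jaccard overlap as a similarity measure, are assumptions made for illustration.

```python
import numpy as np


def componentwise_boost(X, y, steps=200, nu=0.1):
    """Componentwise L2 boosting; returns the coefficients at the BIC-best step."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # sum of squares of each marker
    best_bic, best_beta = np.inf, beta.copy()
    for _ in range(steps):
        resid = y - X @ beta
        coef = X.T @ resid / col_ss        # least-squares fit of each single marker
        gain = coef ** 2 * col_ss          # RSS reduction offered by each marker
        j = int(np.argmax(gain))           # weak learner with the largest gain
        beta[j] += nu * coef[j]            # shrunken update of one coefficient
        rss = float(((y - X @ beta) ** 2).sum())
        df = np.count_nonzero(beta)        # assumption: df = number of markers used
        bic = n * np.log(rss / n) + df * np.log(n)
        if bic < best_bic:
            best_bic, best_beta = bic, beta.copy()
    return best_beta


def jaccard(a, b):
    """Overlap of two selected-marker sets (illustrative similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    supports = []
    # Two synthetic datasets with overlapping but not identical true markers.
    for true_markers in ([0, 1, 2], [1, 2, 3]):
        X = rng.standard_normal((100, 20))
        y = X[:, true_markers].sum(axis=1) * 2.0 + rng.standard_normal(100)
        beta = componentwise_boost(X, y)
        supports.append(set(np.flatnonzero(beta).tolist()))
    print("selected markers per dataset:", supports)
    print("sparsity-structure overlap:", jaccard(*supports))
```

Each dataset is fitted separately here, so nothing pulls the two selected sets toward each other; the abstract's penalty is the part that would do that.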

Original language: English
Pages (from-to): 509-559
Number of pages: 51
Journal: Statistics in Medicine
Volume: 36
Issue number: 3
DOI: 10.1002/sim.7138
Publication status: Published - Feb 10 2017

Fingerprint

Sparsity
Cancer
Neoplasms
Boosting
Model
Penalization Method
Accelerated Failure Time Model
Censored Survival Data
Similarity
Datasets
Right-censored Data
Prognosis
Small Sample Size
Performance Prediction
Profiling
Breast Cancer
Overlapping
Penalty
Numerical Study

Keywords

  • heterogeneity structure
  • integrative analysis
  • marker identification
  • model sparsity structure
  • sparse boosting

ASJC Scopus subject areas

  • Epidemiology
  • Statistics and Probability

Cite this

Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data. / Huang, Yuan; Liu, Jin; Yi, Huangdi; Shia, Ben Chang; Ma, Shuangge.

In: Statistics in Medicine, Vol. 36, No. 3, 10.02.2017, p. 509-559.

Huang, Yuan ; Liu, Jin ; Yi, Huangdi ; Shia, Ben Chang ; Ma, Shuangge. / Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data. In: Statistics in Medicine. 2017 ; Vol. 36, No. 3. pp. 509-559.
@article{30ade1b54bd74489851d8d3ed3803a64,
title = "Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data",
abstract = "In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis uses information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets – or equivalently, similarity of model sparsity structures – across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right-censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance.",
keywords = "heterogeneity structure, integrative analysis, marker identification, model sparsity structure, sparse boosting",
author = "Yuan Huang and Jin Liu and Huangdi Yi and Shia, {Ben Chang} and Shuangge Ma",
year = "2017",
month = "2",
day = "10",
doi = "10.1002/sim.7138",
language = "English",
volume = "36",
pages = "509--559",
journal = "Statistics in Medicine",
issn = "0277-6715",
publisher = "John Wiley and Sons Ltd",
number = "3",

}

TY - JOUR

T1 - Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data

AU - Huang, Yuan

AU - Liu, Jin

AU - Yi, Huangdi

AU - Shia, Ben Chang

AU - Ma, Shuangge

PY - 2017/2/10

Y1 - 2017/2/10

AB - In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis uses information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets – or equivalently, similarity of model sparsity structures – across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right-censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance.

KW - heterogeneity structure

KW - integrative analysis

KW - marker identification

KW - model sparsity structure

KW - sparse boosting

UR - http://www.scopus.com/inward/record.url?scp=84988909504&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84988909504&partnerID=8YFLogxK

U2 - 10.1002/sim.7138

DO - 10.1002/sim.7138

M3 - Article

C2 - 27667129

AN - SCOPUS:84988909504

VL - 36

SP - 509

EP - 559

JO - Statistics in Medicine

JF - Statistics in Medicine

SN - 0277-6715

IS - 3

ER -