Multiclass prediction with partial least square regression for gene expression data: Applications in breast cancer intrinsic taxonomy

Chi Cheng Huang, Shih Hsin Tu, Ching Shui Huang, Heng Hui Lien, Liang Chuan Lai, Eric Y. Chuang

Research output: Contribution to journal, Article

6 Citations (Scopus)

Abstract

Multiclass prediction remains an obstacle in the analysis of high-throughput data such as microarray gene expression profiles. Despite recent advances in machine learning and bioinformatics, most classification tools have been limited to binary responses. Our aim was to apply partial least squares (PLS) regression to the breast cancer intrinsic taxonomy, in which five distinct molecular subtypes have been identified. The PAM50 signature genes were used as predictor variables in the PLS analysis, and the latent gene component scores were used in a binary logistic regression for each molecular subtype. The 139 prototypical arrays used for PAM50 development served as the training dataset, and three independent microarray studies of Han Chinese origin were used for independent validation (n = 535). Agreement between PAM50 centroid-based single sample prediction (SSP) and PLS regression was excellent within the training samples (weighted Kappa: 0.988) but deteriorated substantially in the independent samples, which could be attributed to the much larger number of samples left unclassified by PLS regression. When these unclassified samples were removed, agreement between PAM50 SSP and PLS regression improved markedly (weighted Kappa: 0.829, as opposed to 0.541 when unclassified samples were included). Our study confirmed the feasibility of PLS regression for multiclass prediction, and distinct clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.
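
The weighted Kappa comparison described in the abstract can be illustrated with a minimal sketch. It assumes linear disagreement weights and a hypothetical ordering of the subtype labels; the abstract states neither the weighting scheme nor the ordering. The function name, the label strings, and the eight toy samples are all made up for illustration and are not data or results from the study.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories):
    """Linear-weighted Cohen's kappa for two equal-length label sequences.

    `categories` fixes the category order used for the disagreement weights
    (an assumption of this sketch, not something the abstract specifies).
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("label sequences must have equal length")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    row = Counter(index[a] for a in rater_a)
    col = Counter(index[b] for b in rater_b)
    num = 0.0  # observed weighted disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear weight: 0 on the diagonal
            num += w * observed[(i, j)] / n
            den += w * (row[i] / n) * (col[j] / n)
    return 1.0 - num / den


# Hypothetical label order; "Unclassified" is placed last by assumption.
order = ["LumA", "LumB", "HER2", "Basal", "Normal", "Unclassified"]

# Toy calls from two classifiers (illustrative only).
ssp = ["LumA", "LumA", "LumB", "HER2", "Basal", "Normal", "LumB", "Basal"]
pls = ["LumA", "Unclassified", "LumB", "HER2", "Basal", "Unclassified", "LumB", "Basal"]

# Agreement with every sample included.
kappa_all = weighted_kappa(ssp, pls, order)

# Agreement after dropping samples the second classifier left unclassified.
kept = [(a, b) for a, b in zip(ssp, pls) if b != "Unclassified"]
kappa_classified = weighted_kappa(
    [a for a, _ in kept], [b for _, b in kept], order[:-1]
)
```

In this toy data the two classifiers disagree only on the unclassified samples, so dropping those samples raises the statistic, which mirrors the direction of the change reported in the abstract but not its magnitude.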

Original language: English
Article number: 248648
Journal: BioMed Research International
Volume: 2013
DOI: 10.1155/2013/248648
Publication status: Published - 2013

Fingerprint

Taxonomies
Least-Squares Analysis
Gene expression
Breast Neoplasms
Gene Expression
Microarrays
Genes
Bioinformatics
Gene Components
Learning systems
Logistics
Throughput
Feasibility Studies
Computational Biology
Transcriptome
Logistic Models

ASJC Scopus subject areas

  • Biochemistry, Genetics and Molecular Biology(all)
  • Immunology and Microbiology(all)

Cite this

Multiclass prediction with partial least square regression for gene expression data: Applications in breast cancer intrinsic taxonomy. / Huang, Chi Cheng; Tu, Shih Hsin; Huang, Ching Shui; Lien, Heng Hui; Lai, Liang Chuan; Chuang, Eric Y.

In: BioMed Research International, Vol. 2013, 248648, 2013.

@article{76b83a5eec62418597b334e60badeabc,
title = "Multiclass prediction with partial least square regression for gene expression data: Applications in breast cancer intrinsic taxonomy",
abstract = "Multiclass prediction remains an obstacle for high-throughput data analysis such as microarray gene expression profiles. Despite recent advancements in machine learning and bioinformatics, most classification tools were limited to the applications of binary responses. Our aim was to apply partial least square (PLS) regression for breast cancer intrinsic taxonomy, of which five distinct molecular subtypes were identified. The PAM50 signature genes were used as predictive variables in PLS analysis, and the latent gene component scores were used in binary logistic regression for each molecular subtype. The 139 prototypical arrays for PAM50 development were used as training dataset, and three independent microarray studies with Han Chinese origin were used for independent validation (n = 535). The agreement between PAM50 centroid-based single sample prediction (SSP) and PLS-regression was excellent (weighted Kappa: 0.988) within the training samples, but deteriorated substantially in independent samples, which could attribute to much more unclassified samples by PLS-regression. If these unclassified samples were removed, the agreement between PAM50 SSP and PLS-regression improved enormously (weighted Kappa: 0.829 as opposed to 0.541 when unclassified samples were analyzed). Our study ascertained the feasibility of PLS-regression in multi-class prediction, and distinct clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.",
author = "Huang, {Chi Cheng} and Tu, {Shih Hsin} and Huang, {Ching Shui} and Lien, {Heng Hui} and Lai, {Liang Chuan} and Chuang, {Eric Y.}",
year = "2013",
doi = "10.1155/2013/248648",
language = "English",
volume = "2013",
journal = "BioMed Research International",
issn = "2314-6133",
publisher = "Hindawi Publishing Corporation",
}

TY - JOUR

T1 - Multiclass prediction with partial least square regression for gene expression data

T2 - Applications in breast cancer intrinsic taxonomy

AU - Huang, Chi Cheng

AU - Tu, Shih Hsin

AU - Huang, Ching Shui

AU - Lien, Heng Hui

AU - Lai, Liang Chuan

AU - Chuang, Eric Y.

PY - 2013

Y1 - 2013

AB - Multiclass prediction remains an obstacle for high-throughput data analysis such as microarray gene expression profiles. Despite recent advancements in machine learning and bioinformatics, most classification tools were limited to the applications of binary responses. Our aim was to apply partial least square (PLS) regression for breast cancer intrinsic taxonomy, of which five distinct molecular subtypes were identified. The PAM50 signature genes were used as predictive variables in PLS analysis, and the latent gene component scores were used in binary logistic regression for each molecular subtype. The 139 prototypical arrays for PAM50 development were used as training dataset, and three independent microarray studies with Han Chinese origin were used for independent validation (n = 535). The agreement between PAM50 centroid-based single sample prediction (SSP) and PLS-regression was excellent (weighted Kappa: 0.988) within the training samples, but deteriorated substantially in independent samples, which could attribute to much more unclassified samples by PLS-regression. If these unclassified samples were removed, the agreement between PAM50 SSP and PLS-regression improved enormously (weighted Kappa: 0.829 as opposed to 0.541 when unclassified samples were analyzed). Our study ascertained the feasibility of PLS-regression in multi-class prediction, and distinct clinical presentations and prognostic discrepancies were observed across breast cancer molecular subtypes.

UR - http://www.scopus.com/inward/record.url?scp=84896066046&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84896066046&partnerID=8YFLogxK

U2 - 10.1155/2013/248648

DO - 10.1155/2013/248648

M3 - Article

C2 - 24490149

AN - SCOPUS:84896066046

VL - 2013

JO - BioMed Research International

JF - BioMed Research International

SN - 2314-6133

M1 - 248648

ER -