Identify Breast Cancer Subtypes by Gene Expression Profiles

Grace S. Shieh, Chyi-Huey Bai, Chih Lee

Research output: Contribution to journal › Article

Abstract

Support vector machines (SVMs) with linear, polynomial and
radial kernels were applied to classify subtypes of breast cancer by gene
expression profiles of tissue samples. Using the top 500 genes ranked by
the ratio of between-group to within-group sum of squares, SVMs with a linear
kernel had an average accuracy rate of about 97% when applied to a balanced
dataset; this accuracy rate was significantly higher than that obtained on the
original data. After imputation, the smallest subsample of the balanced dataset
was comparable to the other subsamples (containing more than 10 samples).
In biomedical sciences, it is of interest to identify genes that can be used to
classify subtypes of breast cancer well. Using SVMs, we identified 500 genes
and looked up the functions of 297 of them in databases. About 65% of these
297 genes were known to be related to breast cancer, which confirms the
consistency of our results with existing biomedical knowledge. The remaining
203 genes may be investigated further to see whether they are involved in
breast cancer; any novel findings will be important.
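
The gene-ranking step mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so this assumes the common definition of the ratio: for each gene, the between-group sum of squares (class means around the overall mean, weighted by class size) divided by the within-group sum of squares (samples around their own class mean). The function names `bss_wss_ratios` and `top_genes` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def bss_wss_ratios(X, y):
    """Return the BSS/WSS ratio for each gene (column) of X.

    X: list of samples, each a list of expression values (one per gene).
    y: list of class labels, one per sample.
    """
    n_genes = len(X[0])
    # Group sample indices by class label.
    groups = defaultdict(list)
    for i, label in enumerate(y):
        groups[label].append(i)

    ratios = []
    for j in range(n_genes):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        bss = 0.0
        wss = 0.0
        for idx in groups.values():
            values = [column[i] for i in idx]
            class_mean = sum(values) / len(values)
            # Between-group: class mean vs. overall mean, weighted by class size.
            bss += len(values) * (class_mean - overall_mean) ** 2
            # Within-group: each sample vs. its own class mean.
            wss += sum((v - class_mean) ** 2 for v in values)
        if wss == 0.0:
            # No within-group spread: infinite ratio if classes differ,
            # zero if the gene is constant (assumption made for this sketch).
            ratios.append(float("inf") if bss > 0.0 else 0.0)
        else:
            ratios.append(bss / wss)
    return ratios


def top_genes(X, y, k):
    """Return the column indices of the k genes with the largest ratio."""
    ratios = bss_wss_ratios(X, y)
    order = sorted(range(len(ratios)), key=lambda j: ratios[j], reverse=True)
    return order[:k]


# Toy data, made up for illustration: 4 samples, 3 genes, 2 subtypes.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 8.0],
    [3.0, 5.0, 3.0],
    [3.2, 5.1, 7.0],
]
y = ["A", "A", "B", "B"]
selected = top_genes(X, y, k=2)
```

In the toy data only the first gene separates the two subtypes, so it is ranked first. In a setting like the one described, the selected columns would then be passed to a linear-kernel SVM; one possible choice, not confirmed by the source, is scikit-learn's `SVC(kernel="linear")`.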
Original language: English
Pages (from-to): 165-175
Journal: Journal of Data Science
Publication status: Published - 2004
Externally published: Yes

Cite this

Identify Breast Cancer Subtypes by Gene Expression Profiles. / Shieh, Grace S.; Bai, Chyi-Huey; Lee, Chih.

In: Journal of Data Science, 2004, p. 165-175.

@article{c36a4d3547394baea094ed6bde65c953,
title = "Identify Breast Cancer Subtypes by Gene Expression Profiles",
abstract = "Support vector machines (SVMs) with linear, polynomial and radial kernels were applied to classify subtypes of breast cancer by gene expression profiles of tissue samples. Using the top 500 genes ranked by the ratio of between-group to within-group sum of squares, SVMs with a linear kernel had an average accuracy rate of about 97{\%} when applied to a balanced dataset; this accuracy rate was significantly higher than that obtained on the original data. After imputation, the smallest subsample of the balanced dataset was comparable to the other subsamples (containing more than 10 samples). In biomedical sciences, it is of interest to identify genes that can be used to classify subtypes of breast cancer well. Using SVMs, we identified 500 genes and looked up the functions of 297 of them in databases. About 65{\%} of these 297 genes were known to be related to breast cancer, which confirms the consistency of our results with existing biomedical knowledge. The remaining 203 genes may be investigated further to see whether they are involved in breast cancer; any novel findings will be important.",
keywords = "Classification, microarray gene expression data, support vector machines, tumor",
author = "Shieh, {Grace S.} and Chyi-Huey Bai and Chih Lee",
year = "2004",
language = "English",
pages = "165--175",
journal = "Journal of Data Science",
issn = "1680-743X",

}

TY - JOUR

T1 - Identify Breast Cancer Subtypes by Gene Expression Profiles

AU - Shieh, Grace S.

AU - Bai, Chyi-Huey

AU - Lee, Chih

PY - 2004

Y1 - 2004

N2 - Support vector machines (SVMs) with linear, polynomial and radial kernels were applied to classify subtypes of breast cancer by gene expression profiles of tissue samples. Using the top 500 genes ranked by the ratio of between-group to within-group sum of squares, SVMs with a linear kernel had an average accuracy rate of about 97% when applied to a balanced dataset; this accuracy rate was significantly higher than that obtained on the original data. After imputation, the smallest subsample of the balanced dataset was comparable to the other subsamples (containing more than 10 samples). In biomedical sciences, it is of interest to identify genes that can be used to classify subtypes of breast cancer well. Using SVMs, we identified 500 genes and looked up the functions of 297 of them in databases. About 65% of these 297 genes were known to be related to breast cancer, which confirms the consistency of our results with existing biomedical knowledge. The remaining 203 genes may be investigated further to see whether they are involved in breast cancer; any novel findings will be important.

AB - Support vector machines (SVMs) with linear, polynomial and radial kernels were applied to classify subtypes of breast cancer by gene expression profiles of tissue samples. Using the top 500 genes ranked by the ratio of between-group to within-group sum of squares, SVMs with a linear kernel had an average accuracy rate of about 97% when applied to a balanced dataset; this accuracy rate was significantly higher than that obtained on the original data. After imputation, the smallest subsample of the balanced dataset was comparable to the other subsamples (containing more than 10 samples). In biomedical sciences, it is of interest to identify genes that can be used to classify subtypes of breast cancer well. Using SVMs, we identified 500 genes and looked up the functions of 297 of them in databases. About 65% of these 297 genes were known to be related to breast cancer, which confirms the consistency of our results with existing biomedical knowledge. The remaining 203 genes may be investigated further to see whether they are involved in breast cancer; any novel findings will be important.

KW - Classification

KW - microarray gene expression data

KW - support vector machines

KW - tumor

M3 - Article

SP - 165

EP - 175

JO - Journal of Data Science

JF - Journal of Data Science

SN - 1680-743X

ER -