A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering

Tsau Young Lin, I. Jen Chiang

Research output: Contribution to journal › Article

21 Citations (Scopus)

Abstract

This paper presents a novel approach to document clustering based on geometric structure from combinatorial topology. Given a set of documents, the associations among frequently co-occurring terms naturally form a simplicial complex. Our general thesis is that each connected component of this simplicial complex represents a concept in the collection. Based on these concepts, documents can be clustered into meaningful classes. In this paper, however, we address a softer notion: instead of connected components, we use the maximal simplexes of highest dimension as representatives of the connected components; the concepts so defined are called maximal primitive concepts. Experiments with three different data sets from Web pages and medical literature show that the proposed unsupervised clustering approach performs significantly better than traditional clustering algorithms such as k-means, AutoClass, and Hierarchical Clustering (HAG). This abstract geometric model seems to have captured the latent semantic structure of the documents.
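As a rough illustration of the idea in the abstract, the sketch below treats every set of terms that co-occurs in at least `min_support` documents as a simplex and then picks out the maximal ones. It is not the authors' implementation: the function names, the `min_support` threshold, the level-wise search, and the toy documents are all assumptions made for this example, and the abstract does not say how the associations are actually mined. Frequent term sets are closed under taking subsets, which is what makes the collection a simplicial complex.

```python
from itertools import combinations


def frequent_simplexes(docs, min_support):
    """Return every term set that co-occurs in at least min_support documents.

    Hypothetical helper: a simple level-wise search, assumed here only
    for illustration.
    """
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))

    def support(terms):
        # Number of documents that contain all of the given terms.
        return sum(1 for d in doc_sets if terms <= d)

    # 0-simplexes: single terms that are frequent on their own.
    level = [frozenset([t]) for t in vocab if support({t}) >= min_support]
    simplexes = list(level)
    while level:
        # Join pairs of k-term sets that differ in exactly one term.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
        simplexes.extend(level)
    return simplexes


def maximal_simplexes(simplexes):
    """Keep only simplexes that are not a proper face of another simplex."""
    return [s for s in simplexes if not any(s < t for t in simplexes)]


def highest_dimensional(simplexes):
    """Maximal simplexes with the most terms (dimension = terms - 1)."""
    maximal = maximal_simplexes(simplexes)
    if not maximal:
        return []
    top = max(len(s) for s in maximal)
    return [s for s in maximal if len(s) == top]


# Toy corpus (made up for this example): each document is a list of terms.
docs = [
    ["wall", "street", "stock"],
    ["wall", "street", "stock", "market"],
    ["wall", "street", "stock"],
    ["white", "house", "president"],
    ["white", "house"],
]
complex_ = frequent_simplexes(docs, min_support=2)
for simplex in maximal_simplexes(complex_):
    print(sorted(simplex))
```

On this toy corpus the maximal simplexes are the term sets {wall, street, stock} and {white, house}, each of which could stand for one candidate concept; documents would then be assigned to the concept whose terms they contain. Note that `highest_dimensional` takes the maximum over the whole complex, which is a simplification of choosing representatives per connected component.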

Original language: English
Pages (from-to): 55-80
Number of pages: 26
Journal: International Journal of Approximate Reasoning
Volume: 40
Issue number: 1-2
DOI: 10.1016/j.ijar.2004.11.005
Publication status: Published - Jul 2005

Fingerprint

Document Clustering
Simplicial Complex
Hypergraph
Clustering algorithms
Websites
Semantics
Connected Components
Topology
Experiments
Unsupervised Clustering
Geometric Model
K-means Clustering
Hierarchical Clustering
Geometric Structure
Higher Dimensions
Clustering Algorithm
Attack
Concepts
Term

Keywords

  • Association rules
  • Document clustering
  • Hierarchical clustering
  • Simplicial complex
  • Topology

ASJC Scopus subject areas

  • Statistics and Probability
  • Electrical and Electronic Engineering
  • Statistics, Probability and Uncertainty
  • Information Systems and Management
  • Information Systems
  • Computer Science Applications
  • Artificial Intelligence

Cite this

A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering. / Lin, Tsau Young; Chiang, I. Jen.

In: International Journal of Approximate Reasoning, Vol. 40, No. 1-2, 07.2005, p. 55-80.


@article{ddf648c0321542129499a588a6a37107,
title = "A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering",
abstract = "This paper presents a novel approach to document clustering based on geometric structure from combinatorial topology. Given a set of documents, the associations among frequently co-occurring terms naturally form a simplicial complex. Our general thesis is that each connected component of this simplicial complex represents a concept in the collection. Based on these concepts, documents can be clustered into meaningful classes. In this paper, however, we address a softer notion: instead of connected components, we use the maximal simplexes of highest dimension as representatives of the connected components; the concepts so defined are called maximal primitive concepts. Experiments with three different data sets from Web pages and medical literature show that the proposed unsupervised clustering approach performs significantly better than traditional clustering algorithms such as k-means, AutoClass, and Hierarchical Clustering (HAG). This abstract geometric model seems to have captured the latent semantic structure of the documents.",
keywords = "Association rules, Document clustering, Hierarchical clustering, Simplicial complex, Topology",
author = "Lin, {Tsau Young} and Chiang, {I. Jen}",
year = "2005",
month = "7",
doi = "10.1016/j.ijar.2004.11.005",
language = "English",
volume = "40",
pages = "55--80",
journal = "International Journal of Approximate Reasoning",
issn = "0888-613X",
publisher = "Elsevier Inc.",
number = "1-2",

}

TY - JOUR

T1 - A simplicial complex, a hypergraph, structure in the latent semantic space of document clustering

AU - Lin, Tsau Young

AU - Chiang, I. Jen

PY - 2005/7

Y1 - 2005/7

N2 - This paper presents a novel approach to document clustering based on geometric structure from combinatorial topology. Given a set of documents, the associations among frequently co-occurring terms naturally form a simplicial complex. Our general thesis is that each connected component of this simplicial complex represents a concept in the collection. Based on these concepts, documents can be clustered into meaningful classes. In this paper, however, we address a softer notion: instead of connected components, we use the maximal simplexes of highest dimension as representatives of the connected components; the concepts so defined are called maximal primitive concepts. Experiments with three different data sets from Web pages and medical literature show that the proposed unsupervised clustering approach performs significantly better than traditional clustering algorithms such as k-means, AutoClass, and Hierarchical Clustering (HAG). This abstract geometric model seems to have captured the latent semantic structure of the documents.

AB - This paper presents a novel approach to document clustering based on geometric structure from combinatorial topology. Given a set of documents, the associations among frequently co-occurring terms naturally form a simplicial complex. Our general thesis is that each connected component of this simplicial complex represents a concept in the collection. Based on these concepts, documents can be clustered into meaningful classes. In this paper, however, we address a softer notion: instead of connected components, we use the maximal simplexes of highest dimension as representatives of the connected components; the concepts so defined are called maximal primitive concepts. Experiments with three different data sets from Web pages and medical literature show that the proposed unsupervised clustering approach performs significantly better than traditional clustering algorithms such as k-means, AutoClass, and Hierarchical Clustering (HAG). This abstract geometric model seems to have captured the latent semantic structure of the documents.

KW - Association rules

KW - Document clustering

KW - Hierarchical clustering

KW - Simplicial complex

KW - Topology

UR - http://www.scopus.com/inward/record.url?scp=19044395829&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=19044395829&partnerID=8YFLogxK

U2 - 10.1016/j.ijar.2004.11.005

DO - 10.1016/j.ijar.2004.11.005

M3 - Article

AN - SCOPUS:19044395829

VL - 40

SP - 55

EP - 80

JO - International Journal of Approximate Reasoning

JF - International Journal of Approximate Reasoning

SN - 0888-613X

IS - 1-2

ER -