Document Clustering using Semantic Cliques Aggregation

Ajit Kumar, I-Jen Chiang

Research output: Contribution to journal (Article)

Abstract

Search engines are indispensable tools for finding information among massive collections of web pages and documents. A good search engine needs to retrieve information that is not only returned quickly but also relevant to the user's query. Most search engines respond to queries quickly; however, they offer little guarantee of precision, even for highly detailed queries. In such cases, clustering documents by subject and content might improve search results. This paper presents a novel method of document clustering that uses semantic cliques. First, we extracted features from the documents. Next, we defined the associations between frequently co-occurring terms, which we called semantic cliques. Each connected component in the semantic clique represented a theme. The documents were clustered by theme, for which we designed an aggregation algorithm. We evaluated the effectiveness of the aggregation algorithm on four kinds of datasets. The results showed that the semantic-clique-based document clustering algorithm performed significantly better than traditional clustering algorithms such as Principal Direction Divisive Partitioning (PDDP), k-means, Auto-Class, and Hierarchical Agglomerative Clustering (HAC). We found that Semantic Clique Aggregation is a potential model for representing association rules in text and could be highly useful for automatic document clustering.
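The pipeline outlined in the abstract (feature extraction, associations between frequently co-occurring terms, themes as connected components, documents grouped by theme) can be illustrated with a small Python sketch. This is not the authors' aggregation algorithm, which the abstract does not specify. It assumes whitespace tokenization as the feature step, a hypothetical `min_support` threshold on document-level pair counts to decide which terms are "frequently co-occurring", a union-find structure to find connected components, and a simple largest-overlap rule for assigning documents to themes. The function name `cluster_by_themes` is likewise invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_themes(docs, min_support=2):
    """Group documents by themes built from frequently co-occurring terms.

    Illustrative only: min_support and the assignment rule are assumptions,
    not details taken from the paper.
    """
    # Feature extraction (simplified): each document becomes a set of tokens.
    term_sets = [set(doc.lower().split()) for doc in docs]

    # Count the number of documents in which each pair of terms co-occurs.
    pair_counts = defaultdict(int)
    for terms in term_sets:
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1

    # Union-find over terms; frequent pairs become edges of the term graph.
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for (a, b), count in pair_counts.items():
        if count >= min_support:
            parent[find(a)] = find(b)

    # Each connected component of the term graph is treated as one theme.
    themes = defaultdict(set)
    for term in list(parent):
        themes[find(term)].add(term)

    # Assign each document to the theme that shares the most terms with it.
    clusters = defaultdict(list)
    for index, terms in enumerate(term_sets):
        best = max(themes.values(), key=lambda t: len(t & terms), default=None)
        if best is None or not (best & terms):
            clusters[None].append(index)  # no theme matched this document
        else:
            clusters[frozenset(best)].append(index)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "apple banana fruit",
        "apple banana juice",
        "python code bug",
        "python code test",
    ]
    for theme, members in cluster_by_themes(sample).items():
        print(sorted(theme) if theme else theme, members)
```

In this toy input, only the pairs (apple, banana) and (code, python) reach the support threshold, so two themes emerge and each document joins the theme it overlaps with. Documents that share no term with any theme are collected under the key `None`.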
Original language: English
Pages (from-to): 28-40
Journal: Journal of Computer and Communications
Volume: 3
Issue number: 12
DOI: 10.4236/jcc.2015.312004
Publication status: Published - 2015

Fingerprint

Agglomeration
Semantics
Search engines
Clustering algorithms
Association rules
Websites

Keywords

  • Theme
  • Aggregation
  • Association
  • Semantic Clique
  • Document Clustering

Cite this

Document Clustering using Semantic Cliques Aggregation. / Kumar, Ajit; Chiang, I-Jen.

In: Journal of Computer and Communications, Vol. 3, No. 12, 2015, p. 28-40.

@article{0dd77f996c094a2b939c670d7eaa2917,
title = "Document Clustering using Semantic Cliques Aggregation",
keywords = "Theme, Aggregation, Association, Semantic Clique, Document Clustering",
author = "Ajit Kumar and I-Jen Chiang",
year = "2015",
doi = "10.4236/jcc.2015.312004",
language = "English",
volume = "3",
pages = "28--40",
journal = "Journal of Computer and Communications",
number = "12",

}
