The BankSearch web document dataset: investigating unsupervised clustering and category similarity

Mark P. Sinka; David W. Corne

Abstract

Targeting useful and relevant information on the internet is a highly complicated research area, which is served in part by research into document clustering. A foundational aspect of such research (proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed. We argue that, so far in this broad area of research, as many datasets have been used as research papers written, thus preventing confident reasoning about the relative performance of different techniques used in different publications. We describe a solution to this problem with the compilation of the BankSearch dataset, a proposed standard dataset suitable for a wide range of web-intelligence related research activities. At the time of writing, this dataset has already become a popular download in the Statlib archive, and is in use for benchmarking of a variety of document processing and web search techniques. Herein we also use the dataset in experiments to investigate certain issues in unsupervised web document clustering. Our main interest is how unsupervised clustering performance varies with the relative 'distance' between the categories inherent in the data, and how this is affected by the use of stemming and stoplists. These issues relate to, among other things, the design of useful search engines. We use simple k-means clustering, and find, unsurprisingly, that performance improves as categories become more distant. However, we also find that very close categories can be distinguished with fair accuracy, and there are interesting results concerning the use of stemming. Stop-word removal is confirmed as universally helpful, but stemming is not always to be recommended on 'distant' categories.
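
The setup the abstract describes (stop-word removal followed by simple k-means over document vectors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy corpus, the tiny stoplist, the unit-length term-frequency vectors, the farthest-first initialisation, and the purity score are not taken from the paper, and stemming is left out because the standard library has no stemmer. The result on six toy sentences says nothing about the paper's findings.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus with two "distant" categories (not BankSearch data).
DOCS = [
    ("banking", "the bank offers a savings account and a loan"),
    ("banking", "a mortgage loan from the bank with low interest"),
    ("banking", "open a savings account at the bank for interest"),
    ("sport", "the team won the football match and the league"),
    ("sport", "a football player scored in the match"),
    ("sport", "the league match was won by the home team"),
]

# Tiny illustrative stoplist (an assumption, not the stoplist used in the paper).
STOPLIST = {"the", "a", "and", "from", "with", "at", "for", "in", "was", "by"}


def tokenize(text, stoplist=None):
    """Lower-case, split into alphabetic tokens, optionally drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if stoplist:
        tokens = [t for t in tokens if t not in stoplist]
    return tokens


def vectorize(token_lists):
    """Build unit-length term-frequency vectors over the shared vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vec = [float(counts[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(far)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def purity(assignment, labels):
    """Fraction of documents that share their cluster's majority label."""
    total = 0
    for j in set(assignment):
        cluster_labels = [l for a, l in zip(assignment, labels) if a == j]
        total += Counter(cluster_labels).most_common(1)[0][1]
    return total / len(labels)


labels = [label for label, _ in DOCS]
vectors = vectorize([tokenize(text, STOPLIST) for _, text in DOCS])
assignment = kmeans(vectors, k=2)
print("cluster assignment:", assignment)
print("purity:", purity(assignment, labels))
```

Passing `stoplist=None` to `tokenize` gives the no-stoplist condition for comparison. A stemming step, if one were added, would sit between tokenisation and vectorisation.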

Bibliographic record

Source: Journal of Network and Computer Applications, 2005, Issue 2, pp. 129-146 (18 pages)
Authors: Mark P. Sinka; David W. Corne
Affiliation: Department of Computer Science, University of Reading, P.O. Box 225, Whiteknights, Reading RG6 6AY, UK
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Classification (Chinese Library Classification): Computing technology, computer technology
Keywords: benchmark dataset; clustering; text classification; unsupervised learning; stemming; stoplists
