Journal of Network and Computer Applications

The BankSearch web document dataset: investigating unsupervised clustering and category similarity

Abstract

Targeting useful and relevant information on the internet is a highly complicated research area, which is served in part by research into document clustering. A foundational aspect of such research (proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed. We argue that, so far in this broad area of research, as many datasets have been used as research papers written, thus preventing confident reasoning about the relative performance of different techniques used in different publications. We describe a solution to this problem with the compilation of the BankSearch dataset, a proposed standard dataset suitable for a wide range of web-intelligence related research activities. At the time of writing, this dataset has already become a popular download in the Statlib archive, and is in use for benchmarking of a variety of document processing and web search techniques. Herein we also use the dataset in experiments to investigate certain issues in unsupervised web document clustering. Our main interest is how unsupervised clustering performance varies with the relative 'distance' between the categories inherent in the data, and how this is affected by the use of stemming and stoplists. These issues relate to, among other things, the design of useful search engines. We use simple k-means clustering, and find, unsurprisingly, that performance improves as categories become more distant. However, we also find that very close categories can be distinguished with fair accuracy, and there are interesting results concerning the use of stemming. Stop-word removal is confirmed as universally helpful, but stemming is not always to be recommended on 'distant' categories.
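
The kind of experiment the abstract describes (bag-of-words vectors, an optional stoplist and stemmer, then simple k-means) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: the six toy documents, the tiny `STOPLIST`, the crude plural-stripping `naive_stem` (a stand-in for a real stemmer), the unit-length term-frequency vectors, and the deterministic farthest-point initialisation of the centroids.

```python
import math
from collections import Counter

# Hypothetical toy stoplist; a real experiment would use a much larger one.
STOPLIST = {"the", "a", "and", "at", "from", "in", "of", "to"}

def naive_stem(word):
    # Crude stand-in for a real stemmer: strip a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def tokenize(text, stem=False):
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOPLIST]
    return [naive_stem(w) for w in words] if stem else words

def vectorize(docs, stem=False):
    # Unit-length term-frequency vectors over the shared vocabulary.
    token_lists = [tokenize(d, stem) for d in docs]
    vocab = sorted({w for toks in token_lists for w in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    # Assumed initialisation: first vector, then repeatedly the vector
    # farthest from the centroids chosen so far (keeps the sketch deterministic).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two hypothetical, clearly 'distant' categories: banking and sport.
DOCS = [
    "the bank offers a loan and a mortgage",
    "mortgage rates at the bank",
    "a loan from the bank",
    "the team won the football match",
    "football team in the league",
    "the league match",
]

labels = kmeans(vectorize(DOCS), k=2)
print(labels)
```

On this toy input the two categories share no vocabulary once stop words are removed, so the clusters separate cleanly. Passing `stem=True` to `vectorize` gives a rough way to compare runs with and without stemming, which is the kind of comparison the abstract reports on.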