
Document Clustering with Committees



Abstract

Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher-quality clusters in document clustering tasks than several well-known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
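
The final step described in the abstract, assigning each element to its most similar committee, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how committees are discovered or which similarity measure is used, so the sketch assumes the committees are already given, represents each one by the centroid of its members, and uses cosine similarity. The function names (`cosine`, `centroid`, `assign_to_committees`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_to_committees(docs, committees):
    """Map every document to the label of its most similar committee.

    docs:       dict of document name -> feature vector
    committees: dict of committee label -> list of member document names
                (assumed to be discovered beforehand; not shown here)
    """
    # Assumption: a committee is summarized by the centroid of its members.
    centroids = {
        label: centroid([docs[name] for name in members])
        for label, members in committees.items()
    }
    return {
        name: max(centroids, key=lambda label: cosine(vec, centroids[label]))
        for name, vec in docs.items()
    }


# Toy data: two tight, well-separated committees plus two uncommitted documents.
docs = {
    "d1": [1.0, 0.0, 0.1],
    "d2": [0.9, 0.1, 0.0],
    "d3": [0.0, 1.0, 0.1],
    "d4": [0.1, 0.9, 0.0],
    "d5": [0.7, 0.2, 0.1],
    "d6": [0.2, 0.8, 0.3],
}
committees = {"A": ["d1", "d2"], "B": ["d3", "d4"]}
assignment = assign_to_committees(docs, committees)
```

Here the committees cover only four of the six documents, matching the abstract's remark that the union of the committees is a subset of all elements; `d5` and `d6` are placed by similarity to the committee centroids alone.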
