Venue: 9th International Conference on Language Resources and Evaluation

Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing


Abstract

Research in Natural Language Processing often relies on large collections of manually annotated documents. However, there is currently no reliable genre-annotated corpus of web pages for use in Automatic Genre Identification (AGI). In AGI, documents are classified by genre rather than by topic or subject. The major shortcoming of available web genre collections is their relatively low inter-coder agreement. The reliability of annotated data is essential to the reliability of research results built on it. In this paper, we present the first reliably annotated web genre corpus. We developed precise and consistent annotation guidelines built on well-defined and well-recognized categories. To annotate the corpus, we used crowd-sourcing, which is a novel approach in genre annotation. We computed chance-corrected inter-annotator agreement both overall and for each individual category. The results show that the corpus has been annotated reliably.
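
The abstract does not name the chance-corrected coefficient that was computed. Purely as an illustration, the sketch below assumes Fleiss' kappa, a common choice when every document is labeled by the same number of annotators, and reports both the overall value and a per-category value. The function name `fleiss_kappa`, the genre labels, and the toy ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Return (overall_kappa, per_category_kappa) for Fleiss' kappa.

    ratings: list of per-document label lists; every document must carry
             the same number of labels, all drawn from `categories`.
    Assumption: Fleiss' kappa stands in for the unnamed coefficient.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(r) != n_raters for r in ratings):
        raise ValueError("every document needs the same number of labels")
    if any(label not in categories for r in ratings for label in r):
        raise ValueError("found a label that is not in `categories`")

    counts = [Counter(r) for r in ratings]
    pairs = n_raters * (n_raters - 1)

    # Share of all assignments that went to each category.
    share = {c: sum(cnt[c] for cnt in counts) / (n_items * n_raters)
             for c in categories}

    # Mean observed agreement: fraction of annotator pairs that agree.
    observed = sum(
        (sum(cnt[c] ** 2 for c in categories) - n_raters) / pairs
        for cnt in counts
    ) / n_items

    # Agreement expected by chance from the category shares.
    expected = sum(s * s for s in share.values())
    overall = (observed - expected) / (1 - expected)

    # Per-category kappa: category c versus all other categories.
    per_category = {}
    for c in categories:
        disagreeing = sum(cnt[c] * (n_raters - cnt[c]) for cnt in counts)
        denom = n_items * pairs * share[c] * (1 - share[c])
        per_category[c] = 1 - disagreeing / denom if denom else float("nan")
    return overall, per_category


# Hypothetical toy data: 4 web pages, 3 crowd annotators each.
toy_categories = ["news", "blog", "shop"]
toy_ratings = [
    ["news", "news", "blog"],
    ["blog", "blog", "blog"],
    ["news", "news", "news"],
    ["shop", "shop", "news"],
]
toy_overall, toy_per_category = fleiss_kappa(toy_ratings, toy_categories)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Reporting the per-category values alongside the overall one shows which genre labels annotators find hard to tell apart.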
