Foreign-language conference: Proof of Designed Reliability

Bootstrapping Semantic Annotation for Content-Rich HTML Documents

Abstract

An enormous amount of semantic data is still encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based, content-rich documents that contain many different semantic concepts per document. Starting with a small seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality; from this it learns a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
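The abstract does not say which statistical model or features the authors use. Purely to illustrate the general bootstrapping idea (train on a small hand-labeled seed, label the unlabeled nodes the model is confident about, retrain, repeat), here is a minimal sketch. Everything in it is an assumption made for illustration: the Naive Bayes classifier, the `tag`, `font`, and `region` features standing in for presentation style and spatial locality, the 0.8 confidence threshold, and the names `features`, `NaiveBayes`, and `bootstrap`.

```python
import math
from collections import Counter, defaultdict


def features(node):
    # Hypothetical features: two presentation-style cues and one coarse
    # spatial cue (the page region the node sits in).
    return [f"tag={node['tag']}", f"font={node['font']}", f"region={node['region']}"]


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (an assumed model)."""

    def fit(self, labeled):
        # Reset all counts so the model can be retrained every round.
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for node, label in labeled:
            self.label_counts[label] += 1
            for f in features(node):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, node):
        """Return (most likely label, its posterior probability)."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in features(node):
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way to get a posterior.
        top = max(log_scores.values())
        weights = {l: math.exp(s - top) for l, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def bootstrap(seed, unlabeled, threshold=0.8, max_rounds=5):
    """Self-training loop: absorb confidently labeled nodes, then retrain."""
    labeled, pool = list(seed), list(unlabeled)
    model = NaiveBayes()
    for _ in range(max_rounds):
        model.fit(labeled)
        confident, rest = [], []
        for node in pool:
            label, prob = model.predict(node)
            if prob >= threshold:
                confident.append((node, label))
            else:
                rest.append(node)
        if not confident:
            break  # nothing new was learned this round
        labeled.extend(confident)
        pool = rest
    model.fit(labeled)  # final retrain on everything labeled so far
    return model, labeled, pool


# Toy data, invented for this example.
seed = [
    ({"tag": "h2", "font": "bold", "region": "top"}, "headline"),
    ({"tag": "p", "font": "normal", "region": "main"}, "body"),
]
unlabeled = [
    {"tag": "h2", "font": "bold", "region": "top"},     # matches the headline seed
    {"tag": "h2", "font": "normal", "region": "main"},  # mixed cues, ambiguous
]
model, labeled, remaining = bootstrap(seed, unlabeled)
print(len(labeled), "labeled;", len(remaining), "left unlabeled")
```

With this toy data, the node that matches the headline seed is absorbed in the first round. The node with mixed cues stays in the unlabeled pool because its posterior never reaches the threshold. A stricter threshold trades coverage for fewer wrong labels being fed back into training.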
