Conference: International Conference on Web Information Systems Engineering

Extracting Records and Posts from Forum Pages with Limited Supervision

Abstract

Internet forums are rich sources of human-generated content. Many applications, such as opinion mining and question answering, can benefit greatly from mining and exploring this content. An important step towards making user content from forums more easily accessible is to extract it from forum pages. We propose REPEX (REcord and Post Extractor), a two-step solution that uses limited supervision to achieve this goal. Given a forum page, REPEX first extracts the data records that contain human-generated content and then extracts the user content from these records. The record extraction assumes that (1) a record is composed of an automatically generated part, which we call the record template, and a human-generated part; and (2) the structure of record templates is usually consistent across records. Based on these assumptions, the record extractor first locates the subtree that contains all records in the forum page, using an information-theoretic measure, and then identifies the template of the records in this subtree, modelling this as an outlier detection problem. Finally, starting from the templates, REPEX determines the boundaries of the records. For the post extraction, REPEX applies an information extraction approach that performs this task by identifying the posts' string boundaries.
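The subtree-location step can be illustrated with a small sketch. The abstract does not name the information-theoretic measure REPEX uses, so the code below is only an assumption: it scores each node by the Shannon entropy of its children's structural signatures and picks the node whose children are most alike, following assumption (2). The names `shannon_entropy`, `signature`, `find_record_container`, and the sample `PAGE` are hypothetical, and the page is assumed to be well-formed XML so that the standard library parser can read it.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def signature(element):
    """Structure only: tags of the element and its descendants, in document order."""
    return tuple(node.tag for node in element.iter())


def find_record_container(root, min_children=2):
    """Return the element whose children have the lowest signature entropy.

    Assumption (not taken from the paper): records share one structure, so
    the node holding them has near-zero entropy over child signatures.
    Ties are broken in favour of the node with more children.
    """
    best, best_key = None, None
    for element in root.iter():
        children = list(element)
        if len(children) < min_children:
            continue
        key = (shannon_entropy(signature(c) for c in children), -len(children))
        if best_key is None or key < best_key:
            best, best_key = element, key
    return best


# Hypothetical, well-formed sample page with three structurally identical posts.
PAGE = """
<html><body>
  <div id="header"><h1>Forum</h1></div>
  <div id="posts">
    <div class="post"><span>alice</span><p>First message</p></div>
    <div class="post"><span>bob</span><p>A reply</p></div>
    <div class="post"><span>carol</span><p>Another reply</p></div>
  </div>
  <div id="footer"><a href="#">Contact</a></div>
</body></html>
"""

container = find_record_container(ET.fromstring(PAGE.strip()))
print(container.get("id"))
```

On this sample the three posts share one signature, so the `posts` node has entropy zero and is selected. Real forum pages are rarely well-formed XML and often mix advertisements or separators among the records, which is presumably why the abstract treats template identification as a separate outlier detection step.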
