Conference on Empirical Methods in Natural Language Processing

Identifying Products in Online Cybercrime Marketplaces: A Dataset for Fine-grained Domain Adaptation

Abstract

One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own "fine-grained domain" in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.
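The abstract does not say how accuracy is scored. One plausible way to quantify the drop when a model is applied to a new forum is to compute micro-averaged token-level precision, recall, and F1 separately for each train/test forum pair. The following is a minimal sketch under that assumption; the function name `token_prf`, the set-of-token-indices representation, and the toy annotations are all hypothetical and are not taken from the paper or its dataset.

```python
from typing import List, Set, Tuple


def token_prf(gold: List[Set[int]], pred: List[Set[int]]) -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall, and F1.

    gold[i] and pred[i] hold the token indices marked as product
    mentions in post i (a hypothetical representation).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same posts")
    # Count tokens that both the annotation and the model mark as products.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up annotations for two posts (not real data).
gold = [{3, 4}, {1}]
pred = [{3}, {1, 5}]
p, r, f = token_prf(gold, pred)
```

Running the same function on predictions for a held-out forum and for the training forum would give a pair of scores whose gap reflects the cross-domain effect the abstract describes.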
