Conference: International Conference on Information Reuse and Integration for Data Science

Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses

Abstract

There is increasing interest in automatically identifying advertisements related to sex trafficking on online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small, high-precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general-purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews, which are converted into a bag-of-words model with term frequency-inverse document frequency (TF-IDF) weighting. The resulting document-term matrix is used to train a classifier, which then identifies suspicious activity in the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances in a large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different numbers of text features. The method achieved an F1-score of 0.77 with a random forest classifier. The number of text features could also be reduced from 1,473 to 447 for a compact classifier with only a small drop in accuracy.
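The label-integration and TF-IDF steps described in the abstract can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration: the business ids, the toy reviews, the helper names (`label_reviews`, `tokenize`, `tfidf_matrix`), the exact-id matching rule, and the unsmoothed `log(N / df)` weighting. The abstract does not specify how the two datasets are matched or which TF-IDF variant and pre-processing steps were used.

```python
import math
from collections import Counter


def label_reviews(reviews, known_ids):
    """Label a review 1 if its business id is in the small known set, else 0.

    Matching on an exact shared id is an assumption for this sketch.
    """
    return [(text, 1 if biz_id in known_ids else 0) for biz_id, text in reviews]


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_matrix(docs):
    """Build a document-term matrix weighted by tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        matrix.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, matrix


# Hypothetical toy data: (business id, review text) pairs from the large set.
reviews = [
    ("b1", "late hours cash only"),
    ("b2", "great pizza friendly staff"),
    ("b1", "cash only back door"),
    ("b3", "friendly staff great coffee"),
]
known_ids = {"b1"}  # hypothetical small, high-precision set of flagged businesses

labeled = label_reviews(reviews, known_ids)
labels = [label for _, label in labeled]
vocab, X = tfidf_matrix([text for text, _ in labeled])
```

In this sketch, `X` and `labels` play the role of the document-term matrix and training labels; a classifier such as a random forest would be fitted on them and then applied to the remaining, unlabeled reviews.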