Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses


Abstract

There is increasing interest in automatically identifying advertisements related to sex trafficking on online review sites. The main challenge is to identify the changing patterns in text reviews that are used to indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small, high-precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general-purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews, which are converted into a bag-of-words model with term frequency-inverse document frequency (TF-IDF) weighting. The resulting document-term matrix is used to train a classifier, which is then used to identify suspicious activity in the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances from a large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different numbers of text features. The method achieved an F1-score of 0.77 with a random forest classifier. The number of text features could also be reduced from 1,473 to 447 for a compact classifier with only a small drop in accuracy.
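The labeling and weighting steps described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the toy reviews, the business IDs, the helper names (`tokenize`, `label_reviews`, `tfidf_matrix`), the ID-based matching rule, and the plain `log(N / df)` form of the inverse document frequency are all assumptions, since the abstract does not specify them.

```python
import math
import re
from collections import Counter

# Hypothetical toy data: (business_id, review_text). Not taken from the paper.
reviews = [
    ("b1", "Great massage, very relaxing private room"),
    ("b2", "Friendly staff and a relaxing deep tissue massage"),
    ("b3", "The pizza was great and the staff friendly"),
]
# Hypothetical IDs of businesses that also appear in the small high-precision list.
known_ids = {"b1"}


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for fuller pre-processing)."""
    return re.findall(r"[a-z]+", text.lower())


def label_reviews(reviews, known_ids):
    """Label a review 1 if its business is in the known set, otherwise 0."""
    return [1 if business_id in known_ids else 0 for business_id, _ in reviews]


def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed variant: idf = log(N / df); libraries often use smoothed forms.
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        matrix.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, matrix


labels = label_reviews(reviews, known_ids)
vocab, X = tfidf_matrix([text for _, text in reviews])
```

Here `X` plays the role of the document-term matrix and `labels` the training targets; in practice both would be passed to a classifier such as a random forest, and the vocabulary size corresponds to the number of text features.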

Bibliographic details

Source
International Conference on Information Reuse and Integration for Data Science | 2020 | pp. 259-264 | 6 pages
Authors
Maria Diaz; Anand Panangadan
Original format
PDF
Keywords
Business; Machine learning; Law enforcement; Natural language processing; Training; Training data; Urban areas
