Text Filtering for Harmful Document Classification Using Three-Word Co-Occurrence and Large-Scale Data Processing

TAKANOBU OTSUKA; DEYUE DENG; TAKAYUKI ITO

Abstract

Young people are increasingly using the Internet, which raises the problem that the material they encounter may adversely affect them. We therefore propose a method for automatically classifying harmful sentences. Research on information filtering has improved filter performance by introducing co-occurrence information. Extending the two-word co-occurrence information commonly studied in such work, we created training data using three-word co-occurrence information. Compared with two-word co-occurrence, however, the larger amount of training data makes processing time a problem. In addition, we found that noise caused by the increase in co-occurrences exceeds the capacity of double-precision floating-point calculations. We improved processing speed by implementing a three-word co-occurrence text filtering system that uses a Bayesian filter and parallelizes a fast MyISAM database. Furthermore, by using BigDecimal to remove the noise caused by the increase in the number of co-occurrences, we achieved a high F-value.
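The two costs named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define co-occurrence precisely, so the sketch assumes a co-occurrence is any unordered set of three distinct words in one sentence. It also reads the floating-point "noise" as underflow when many small probabilities are multiplied, which is only one plausible interpretation. Python's `decimal.Decimal` stands in for Java's BigDecimal, and all function names and sample values are hypothetical.

```python
from decimal import Decimal, getcontext
from itertools import combinations
from math import comb


def three_word_cooccurrences(tokens):
    """Return every unordered set of three distinct words in one sentence.

    Assumption: the abstract does not say how co-occurrence is defined.
    """
    vocabulary = sorted(set(tokens))
    return list(combinations(vocabulary, 3))


def product_float(probabilities):
    """Multiply probabilities with double-precision floats."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def product_decimal(probabilities, digits=50):
    """Multiply the same probabilities with arbitrary-precision decimals."""
    getcontext().prec = digits
    result = Decimal(1)
    for p in probabilities:
        result *= Decimal(str(p))
    return result


# Feature growth: a sentence with 20 distinct words yields 190 pairs
# but 1140 triples, so the training data grows quickly.
pairs_for_20_words = comb(20, 2)
triples_for_20_words = comb(20, 3)

# Triples for a toy four-word sentence (placeholder tokens).
triples = three_word_cooccurrences(["a", "b", "c", "d"])

# One hundred hypothetical feature likelihoods of 1e-5 each.
likelihoods = [1e-5] * 100
float_score = product_float(likelihoods)      # underflows to 0.0
decimal_score = product_decimal(likelihoods)  # remains 1E-500
```

With floats, the product collapses to zero and so cannot be compared between classes, while the decimal product keeps a usable magnitude. Summing log-probabilities is a common alternative to arbitrary-precision arithmetic, although the abstract does not report using it.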

Bibliographic record

Source
Electronics and Communications in Japan | 2015, No. 10 | pp. 31-40 | 10 pages
Authors
TAKANOBU OTSUKA; DEYUE DENG; TAKAYUKI ITO
Affiliations

Graduate School of Information Science, Nagoya Institute of Technology Gokiso, Japan;

Techno-Business Administration Program, Nagoya Institute of Technology Gokiso, Japan;

Techno-Business Administration Program, Nagoya Institute of Technology Gokiso, Japan;

Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language: English
Keywords
text filtering; three-word co-occurrence; large-scale data processing;

Similar literature

1. Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network [J]. Akhter Muhammad Pervez, Jiangbin Zheng, Naqvi Irfan Raza, Quality Control, Transactions. 2020
2. Text Document Pre-Processing Using the Bayes Formula for Classification Based on the Vector Space Model [J]. Computer and Information Science. 2009, No. 4
3. Mining coherent topics in documents using word embeddings and large-scale text data [J]. Yao Liang, Zhang Yin, Chen Qinfei, Engineering Applications of Artificial Intelligence. 2017, No. sepa
4. An Efficient Filtered Classifier for Classification of Unseen Test Data in Text Documents [C]. G. Naga Chandrika, E. Srinivasa Reddy. IEEE International Conference on Computational Intelligence and Computing Research. 2017
5. A digital filter model for data mining of text documents [D]. Goldman, Jeffrey Alan. 1998
6. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents [O]. Deepak Agnihotri, Kesari Verma, Priyanka Tripathi -1
7. 2-way Text Classification for Harmful Web Documents [O]. Youngsoo Kim, Taekyong Nam, Dongho Won. 2008
8. Natural Language Text Classification and Filtering with Trigrams and Evolutionary Nearest Neighbour Classifiers. Software Engineering (SEN). [R]. Langdon, W. B. 2000
