Journal: Electronics and Communications in Japan

Text Filtering for Harmful Document Classification Using Three-Word Co-Occurrence and Large-Scale Data Processing


Abstract

Young people are increasingly using the Internet, which raises the problem that the material they encounter may adversely affect them. We therefore propose a method for automatically classifying harmful sentences. Research on information filtering has improved filter performance by introducing co-occurrence information. Extending the two-word co-occurrence information commonly studied in such work, we created training data using three-word co-occurrence information. Compared with two-word co-occurrence, however, processing time becomes a problem because the amount of training data increases. In addition, we found that noise caused by the increase in co-occurrences exceeds the capacity of double-precision floating-point calculations. We improved processing speed by implementing a three-word co-occurrence text filtering system built on a Bayesian filter and parallelized over a fast MyISAM database. Furthermore, by using BigDecimal to remove the noise caused by the increase in the number of co-occurrences, we achieved a high F-value.
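The precision problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Laplace smoothing, the equal class priors, the toy training data, and the use of Python's `decimal.Decimal` as a stand-in for Java's `BigDecimal` are all assumptions made for illustration. The sketch builds unordered three-word co-occurrences per sentence, then multiplies per-triple likelihoods as a naive Bayesian filter would.

```python
from decimal import Decimal, getcontext
from itertools import combinations

# Assumed working precision for the arbitrary-precision run (not from the paper).
getcontext().prec = 50


def three_word_cooccurrences(words):
    """Return every unordered combination of three distinct words in a sentence."""
    return set(combinations(sorted(set(words)), 3))


def train(labeled_sentences):
    """Count how often each three-word co-occurrence appears in each class."""
    counts = {"harmful": {}, "harmless": {}}
    totals = {"harmful": 0, "harmless": 0}
    for words, label in labeled_sentences:
        for triple in three_word_cooccurrences(words):
            counts[label][triple] = counts[label].get(triple, 0) + 1
            totals[label] += 1
    return counts, totals


def likelihood(triples, label, counts, totals, number_type):
    """Multiply Laplace-smoothed triple probabilities using the given number type."""
    vocabulary = len(set(counts["harmful"]) | set(counts["harmless"]))
    result = number_type(1)
    for triple in triples:
        numerator = number_type(counts[label].get(triple, 0) + 1)
        denominator = number_type(totals[label] + vocabulary)
        result *= numerator / denominator
    return result


# Hypothetical toy data: one 30-word sentence per class yields 4060 triples each.
harmful = [f"bad{i}" for i in range(30)]
harmless = [f"ok{i}" for i in range(30)]
counts, totals = train([(harmful, "harmful"), (harmless, "harmless")])
triples = three_word_cooccurrences(harmful)

for number_type in (float, Decimal):
    scores = {
        label: likelihood(triples, label, counts, totals, number_type)
        for label in counts
    }
    # Equal priors are assumed, so the larger likelihood decides the class.
    print(number_type.__name__, scores["harmful"] > scores["harmless"])
```

With `float`, both products underflow to 0.0 and the two classes become indistinguishable. With `Decimal`, the products stay nonzero and the harmful class scores higher. Summing log-probabilities is another common way to avoid this underflow; the abstract states only that BigDecimal was used.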
