The ineffectiveness of within-document term frequency in text classification

Abstract

For the purposes of classification it is common to represent a document as a bag of words. Such a representation consists of the individual terms making up the document together with the number of times each term appears in the document. All classification methods make use of the terms. It is common to also make use of the local term frequencies at the price of some added complication in the model. Examples are the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and the exponential-family approximation of the DCM (EDCM), as well as support vector machines (SVM). Although it is usually claimed that incorporating local word frequency in a document improves text classification performance, we here test whether such claims are true or not. In this paper we show experimentally that simplified forms of the MM, EDCM, and SVM models which ignore the frequency of each word in a document perform at about the same level as MM, DCM, EDCM and SVM models which incorporate local term frequency. We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) which is able to make use of local term frequency and show again that it offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word essentially add no useful information to a classifier.
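The two document representations contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' code or preprocessing: the function names `bag_of_words` and `binarize`, the sample sentence, and the lowercase-and-split tokenization are all assumptions made for illustration only.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map each term to its number of occurrences (local term frequency)."""
    # Assumption: lowercasing and whitespace splitting stand in for
    # whatever tokenization a real pipeline would use.
    return Counter(text.lower().split())


def binarize(counts: Counter) -> dict:
    """Keep only term presence, discarding within-document frequency."""
    return {term: 1 for term in counts}


# Hypothetical example document.
doc = "the cell divides and the cell grows"
tf = bag_of_words(doc)   # frequency-aware representation
binary = binarize(tf)    # presence-only representation

print(tf["cell"], binary["cell"])  # prints "2 1"
```

Both representations contain the same set of terms; they differ only in whether repeated occurrences of a term are recorded. The abstract's claim is that classifiers trained on the second form perform at about the same level as those trained on the first.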

Bibliographic details

Journal: Springer Open Choice
Authors: W. John Wilbur; Won Kim
Year, volume (issue): 2009, 12 (5)
Pages: 509–525
Total pages: 17
Original format: PDF
Chinese Library Classification: Surgery
Keywords: Within-document frequency; Bag-of-words; Word burstiness

Similar literature

1. The ineffectiveness of within-document term frequency in text classification [J]. W. John Wilbur, Won Kim. Information Retrieval. 2009, no. 5
2. Optimal Feature Subset Selection Based on Combining Document Frequency and Term Frequency for Text Classification [J]. Thirumoorthy Karpagalingam, Muneeswaran Karuppaiah. Computing and Informatics. 2020, no. 5
3. Optimal Feature Subset Selection Based on Combining Document Frequency and Term Frequency for Text Classification [J]. Karpagalingam Thirumoorthy, Karuppaiah Muneeswaran. Computing and Informatics. 2020, no. 5
4. Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification [C]. Yoon Kim, Owen Zhang. Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2014
5. Long Term Evolution – Orthogonal Frequency Division Multiplexing Time and Frequency Synchronization Techniques [D]. Ho, Ky-Bao Huu. 2012
6. An Improved Double Channel Long Short-Term Memory Model for Medical Text Classification [O]. Shengbin Liang, Xinan Chen, Jixin Ma. 2021
7. The ineffectiveness of within-document term frequency in text classification [O]. 2009
