Conference: 9th International Conference on Language Resources and Evaluation

On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter

Abstract

Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure for reducing the noise of textual data is to remove stopwords, using either pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
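
The singleton-removal idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer and the four toy tweets are all hypothetical. It also assumes that data sparsity is measured as the fraction of zero entries in the tweet-term matrix, which the abstract does not spell out.

```python
from collections import Counter


def tokenize(tweet):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return tweet.lower().split()


def singleton_stopwords(tokenized):
    # Terms whose total frequency across the whole corpus is exactly one.
    counts = Counter(tok for doc in tokenized for tok in doc)
    return {term for term, count in counts.items() if count == 1}


def remove_stopwords(tokenized, stopwords):
    # Drop every token that is in the stopword set.
    return [[tok for tok in doc if tok not in stopwords] for doc in tokenized]


def vocabulary(tokenized):
    # The feature space: the set of distinct terms in the corpus.
    return {tok for doc in tokenized for tok in doc}


def sparsity(tokenized):
    # Assumed definition: fraction of zero cells in the tweet-term matrix.
    vocab = vocabulary(tokenized)
    if not tokenized or not vocab:
        return 0.0
    nonzero = sum(len(set(doc)) for doc in tokenized)
    return 1.0 - nonzero / (len(tokenized) * len(vocab))


# Hypothetical toy corpus, not taken from any of the paper's datasets.
tweets = [
    "i love this phone",
    "i hate this phone",
    "love this so much",
    "hate waiting so much",
]

tokenized = [tokenize(t) for t in tweets]
stopwords = singleton_stopwords(tokenized)
filtered = remove_stopwords(tokenized, stopwords)

print("singleton terms:", sorted(stopwords))
print("feature space:", len(vocabulary(tokenized)), "->", len(vocabulary(filtered)))
print("sparsity:", sparsity(tokenized), "->", sparsity(filtered))
```

On this toy corpus the only singleton is one term, so the effect is small. On real tweet collections, where a large share of the vocabulary occurs once, the reduction in feature-space size would be expected to be much larger. The sketch says nothing about classification performance, which the paper evaluates separately.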
