Knowledge and Information Systems

Sentiment analysis on big sparse data streams with limited labels



Abstract

Sentiment analysis is an important task for gaining insights into the huge amounts of opinionated text generated daily in social media like Twitter. Despite the abundance of such data, standard supervised learning methods cannot be applied directly, due to the lack of labels and the impracticality of (human) labeling at this scale. In this work, we leverage distant supervision and semi-supervised learning to annotate a big stream of tweets from 2015, which consists of 228 million tweets without retweets (275 million with retweets). We present the insights from our annotation process regarding the effect of different semi-supervised learning approaches, namely Self-Learning, Co-Training and Expectation–Maximization. Moreover, we propose two annotation modes: the batch mode, where all labeled and unlabeled data are available to the algorithms from the beginning, and a lightweight streaming mode that processes the data in batches based on their arrival time in the stream. Our experiments show that stream processing with a sliding window of three months achieves results comparable to batch processing while being more efficient. Finally, our dataset is imbalanced toward the positive sentiment class, and the semi-supervised learning methods aggravate this imbalance; to tackle it, we employ data augmentation in the semi-supervised learning process in order to equalize the class distribution. Our results show that semi-supervised learning coupled with data augmentation significantly outperforms the default semi-supervised annotation process. We make the so-called TSentiment15 sentiment-annotated dataset available to the community to be used for evaluation purposes and for developing new methods.
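To make the annotation pipeline concrete, below is a minimal sketch of the kind of self-training (Self-Learning) loop in streaming mode that the abstract describes: a classifier is seeded with distantly supervised labels, confident predictions on each incoming batch are accepted as pseudo-labels, and only a sliding window of recent batches is retained for retraining. This is not the authors' code; the names `seed_texts`, `seed_labels`, `tweet_batches`, the confidence threshold, the monthly batching, and the choice of hashing features with an SGD classifier are all illustrative assumptions.

```python
"""Illustrative sketch of streaming self-training over tweet batches."""
from collections import deque

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hashing keeps the feature space fixed, which suits a big, sparse stream.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
clf = SGDClassifier(loss="log_loss", random_state=0)

CONFIDENCE = 0.9   # assumed threshold for accepting pseudo-labels
WINDOW = 3         # sliding window of three (monthly) batches, as in the paper


def self_train_stream(seed_texts, seed_labels, tweet_batches):
    """Annotate time-ordered batches; retrain only on the last WINDOW
    batches of pseudo-labeled tweets (streaming mode)."""
    window = deque(maxlen=WINDOW)  # recent pseudo-labeled batches
    clf.partial_fit(vectorizer.transform(seed_texts), seed_labels,
                    classes=[0, 1])  # 0 = negative, 1 = positive
    annotations = []
    for batch in tweet_batches:      # batches ordered by arrival time
        X = vectorizer.transform(batch)
        proba = clf.predict_proba(X)
        preds = clf.classes_[proba.argmax(axis=1)]
        conf = proba.max(axis=1)
        keep = conf >= CONFIDENCE    # accept only confident predictions
        window.append(([t for t, k in zip(batch, keep) if k], preds[keep]))
        # Incrementally retrain on the pseudo-labels inside the window.
        for texts, labels in window:
            if len(texts):
                clf.partial_fit(vectorizer.transform(texts), labels)
        annotations.append(list(zip(batch, preds, conf)))
    return annotations
```

In the same spirit, the data augmentation step mentioned in the abstract could be approximated here by oversampling the minority-class pseudo-labels inside each window before retraining, so that the class distribution fed to `partial_fit` is roughly balanced; the paper's exact augmentation procedure is not specified in the abstract.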

