Journal of Data Analysis and Information Processing

Dimensionality Reduction of Distributed Vector Word Representations and Emoticon Stemming for Sentiment Analysis



Abstract

Social media platforms such as Twitter and the Internet Movie Database (IMDb) contain a vast amount of data with applications in predictive sentiment analysis for movie sales, stock market fluctuations, brand opinion, and current events. Using a dataset taken from IMDb by Stanford, we identify some of the most significant phrases for identifying sentiment in a wide variety of movie reviews. Data from Twitter are especially attractive because of the real-time access offered by Twitter’s streaming API. Effectively analyzing these data in a streaming fashion requires efficient models, which may be improved by reducing the dimensionality of input vectors. One way this has been done in the past is by using emoticons; we propose a method for further reducing these features by identifying common structure in emoticons with similar sentiment. We also examine the gender distribution of emoticon usage and find that preferences for certain emoticons differ disproportionately between males and females. Despite the roughly equal gender distribution on Twitter, emoticon usage is predominantly female. Furthermore, we find that distributed vector representations, such as those produced by Word2Vec, may be reduced through feature selection. This analysis was done on a manually labeled sample of 1000 tweets from a new dataset, the Large Emoticon Corpus, which consisted of about 8.5 million tweets containing emoticons and was collected over a five-day period in May 2015. Additionally, using the common structure of similar emoticons, we are able to characterize positive and negative emoticons using two regular expressions which account for over 90% of emoticon usage in the Large Emoticon Corpus.
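
The emoticon stemming idea in the abstract, collapsing many emoticon variants into a positive class and a negative class with two regular expressions, might look roughly like the sketch below. The abstract does not give the paper's actual expressions, so the two patterns, the placeholder tokens EMO_POS and EMO_NEG, and the function name stem_emoticons are all assumptions made for illustration. Each pattern assumes a Western-style emoticon built from eyes, an optional nose, and one or more mouth characters.

```python
import re

# Hypothetical patterns, NOT the two regular expressions from the paper.
# Structure assumed: eyes, optional nose, one or more mouth characters.
POSITIVE = re.compile(r"[:;=][-^']?[)\]DpP]+")    # e.g. :)  ;-)  =D  :)))
NEGATIVE = re.compile(r"[:;=][-^']?[(\[/\\|]+")   # e.g. :(  :-(  =/  :'(


def stem_emoticons(text: str) -> str:
    """Replace each whitespace-delimited emoticon with a class placeholder.

    Only whole tokens are matched, so a URL such as http://example.com
    is left alone even though it contains the characters ':/'.
    """
    stemmed = []
    for token in text.split():
        if POSITIVE.fullmatch(token):
            stemmed.append("EMO_POS")   # hypothetical placeholder token
        elif NEGATIVE.fullmatch(token):
            stemmed.append("EMO_NEG")   # hypothetical placeholder token
        else:
            stemmed.append(token)
    return " ".join(stemmed)


if __name__ == "__main__":
    print(stem_emoticons("great movie :-) :D"))
    print(stem_emoticons("so boring :( =/"))
```

Under these assumptions, every matching variant maps to one of two features, which is one way the number of emoticon features fed to a sentiment model could shrink. A real implementation would need patterns tuned against the corpus to reach the coverage the abstract reports.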
