International Conference on Software Analysis, Evolution, and Reengineering

NIRMAL: Automatic identification of software relevant tweets leveraging language model

Abstract

Twitter is one of the most widely used social media platforms today. It enables users to share and view short 140-character messages called “tweets”. About 284 million active users generate close to 500 million tweets per day. User-generated content produced this quickly and at this scale leads to information overload. Users who are interested in a particular domain have limited means to filter out irrelevant tweets and tend to get lost in the volume of data they encounter. A recent study by Singer et al. found that software developers use Twitter to stay aware of industry trends, to learn from others, and to network with other developers. However, Singer et al. also reported that developers often find that Twitter streams contain too much noise, which is a barrier to the adoption of Twitter. In this paper, to help developers cope with noise, we propose a novel approach named NIRMAL, which automatically identifies software-relevant tweets in a collection or stream of tweets. Our approach is based on language modeling, which learns a statistical model from a training corpus (i.e., a set of documents). We use a subset of posts from StackOverflow, a programming question-and-answer site, as the training corpus for the language model. A corpus of tweets was then used to test the effectiveness of the trained language model. The tweets were sorted by the rank the model assigned to each tweet. The top 200 tweets were then manually analyzed to verify whether they were software related, and an accuracy score was calculated. The results show that decent accuracy scores can be achieved by several variants of NIRMAL, which indicates that NIRMAL can effectively identify software-related tweets in a large corpus of tweets.
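The pipeline described in the abstract (train a language model on software text, rank tweets with it, then measure accuracy over the top-ranked tweets) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration, not the paper's configuration: the unigram model, the add-one smoothing, the use of perplexity as the ranking score, the tokenizer, and the names `UnigramModel`, `rank_tweets`, and `accuracy_at_k` are all hypothetical. The abstract does not state which model order, smoothing method, or scoring function NIRMAL uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word-like tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


class UnigramModel:
    """Unigram language model with add-one smoothing (illustrative assumption)."""

    def __init__(self, documents):
        # Count every token in the training corpus (e.g. StackOverflow posts).
        self.counts = Counter(tok for doc in documents for tok in tokenize(doc))
        self.total = sum(self.counts.values())
        # Reserve one extra vocabulary slot for tokens never seen in training.
        self.vocab_size = len(self.counts) + 1

    def log_prob(self, token):
        # Add-one smoothing keeps unseen tokens from getting zero probability.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text):
        tokens = tokenize(text)
        if not tokens:
            return float("inf")  # empty text is treated as least relevant
        avg_log_prob = sum(self.log_prob(t) for t in tokens) / len(tokens)
        return math.exp(-avg_log_prob)


def rank_tweets(model, tweets):
    """Sort tweets so the ones the model finds most predictable come first."""
    return sorted(tweets, key=model.perplexity)


def accuracy_at_k(ranked_tweets, is_software_related, k):
    """Fraction of the top-k tweets that were manually labeled as relevant."""
    top = ranked_tweets[:k]
    return sum(1 for tweet in top if is_software_related[tweet]) / len(top)
```

Under these assumptions, a lower perplexity means the tweet resembles the training corpus more closely, so sorting in ascending order puts likely software-related tweets first. Calling `accuracy_at_k` with `k=200` and a dictionary of manual labels mirrors the evaluation step the abstract describes.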
