NIRMAL: Automatic identification of software relevant tweets leveraging language model

Abstract

Twitter is one of the most widely used social media platforms today. It enables users to share and view short messages of up to 140 characters called “tweets”. About 284 million active users generate close to 500 million tweets per day. Such rapid, large-scale generation of user-generated content leads to information overload. Users who are interested in information related to a particular domain have limited means to filter out irrelevant tweets and tend to get lost in the huge amount of data they encounter. A recent study by Singer et al. found that software developers use Twitter to stay aware of industry trends, to learn from others, and to network with other developers. However, Singer et al. also reported that developers often find Twitter streams to contain too much noise, which is a barrier to the adoption of Twitter. In this paper, to help developers cope with noise, we propose a novel approach named NIRMAL, which automatically identifies software-relevant tweets from a collection or stream of tweets. Our approach is based on language modeling, which learns a statistical model from a training corpus (i.e., a set of documents). We use a subset of posts from StackOverflow, a programming question-and-answer site, as the training corpus to learn a language model. A corpus of tweets was then used to test the effectiveness of the trained language model. The tweets were sorted by the rank the model assigned to each individual tweet. The top 200 tweets were then manually analyzed to verify whether they were software-related, and an accuracy score was calculated. The results show that decent accuracy scores can be achieved by various variants of NIRMAL, which indicates that NIRMAL can effectively identify software-related tweets from a huge corpus of tweets.
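
The pipeline described in the abstract (train a language model on a software corpus, score each tweet, sort, then measure accuracy over the top-ranked tweets) can be illustrated with a minimal sketch. The abstract does not state the n-gram order, the smoothing method, or the scoring function, so the bigram model, add-one smoothing, and per-token cross-entropy score used here are assumptions, as are all names (`BigramModel`, `rank_tweets`, `accuracy_at_k`). This is not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into tokens, adding boundary markers."""
    return ["<s>"] + re.findall(r"[a-z0-9#+.@_']+", text.lower()) + ["</s>"]


class BigramModel:
    """Bigram language model with add-one smoothing (an assumed, simple choice)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for doc in corpus:
            tokens = tokenize(doc)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # Reserve one extra vocabulary slot for words never seen in training.
        self.vocab_size = len(self.unigrams) + 1

    def cross_entropy(self, text):
        """Average negative log2 probability per bigram; lower means a better fit."""
        tokens = tokenize(text)
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            log_prob += math.log2(numerator / denominator)
        return -log_prob / (len(tokens) - 1)


def rank_tweets(model, tweets):
    """Sort tweets so that those most similar to the training corpus come first."""
    return sorted(tweets, key=model.cross_entropy)


def accuracy_at_k(ranked, is_relevant, k):
    """Fraction of the top-k ranked tweets judged relevant (manual labels in the paper)."""
    top = ranked[:k]
    return sum(1 for tweet in top if is_relevant(tweet)) / len(top)
```

In use, `BigramModel` would be trained on the StackOverflow posts, `rank_tweets` applied to the tweet corpus, and `accuracy_at_k` called with `k=200` and a function backed by the manual labels.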

Bibliographic details

Source: International Conference on Software Analysis, Evolution, and Reengineering | 2015 | 10 pages
Authors: Sharma Abhishek; Yuan Tian; Lo David
Original format: PDF
CLC classification: Computing technology, computer technology
