
A Case Study on Determining the Big Data Veracity: A Method to Compute the Relevance of Twitter Data



Abstract

Twitter data (tweets) has all the attributes of Big Data. It has also become a source of information where people post their real-time experiences and their opinions on various day-to-day issues. Twitter data mining is therefore used for knowledge extraction and prediction in many domains. As its popularity and size grow, the veracity of the extracted knowledge becomes a concern. Veracity is one of the Vs of Big Data; data integrity, authenticity, trusted origin, and trustworthiness are some of the aspects it covers. This thesis deals with the veracity aspect of Big Data, in particular veracity in Twitter data, from the vantage point of truthfulness. In this research, we compare existing Big Data veracity models with a newly proposed measure. The proposed veracity measure is entropy, and it is compared with two other models: the Objectivity, Truthfulness, and Credibility (OTC) model, and the Diffusion, Geographic, and Spam indices (DGS) model of veracity. Our approach is to define topics on the set of tweets related to a domain and compute the veracity measures of those topics. The proposed model is based on the bag-of-words model for topic definition. Further inferences are drawn from the values of the measures.

For our analysis, we selected three domains: flu, food poisoning, and politics. The topics for the flu and food poisoning data are based on anchor words taken from the CDC website; anchor words for the politics topics are taken from the "ontheissues.org" website. The entropy, OTC model, and DGS model are computed for each topic. Our analysis shows no correlation between the entropy, OTC, and DGS measures when compared as time series. The computed values of the models can position the topics on a veracity spectrum.
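The abstract describes the entropy measure only at a high level: topics are defined by anchor words under a bag-of-words model, and entropy is computed per topic. As a rough illustrative sketch (not the thesis's actual implementation), one plausible reading is the Shannon entropy of the word-frequency distribution over tweets that match a topic's anchor words; the function name, tokenization, and sample tweets below are all assumptions.

```python
from collections import Counter
from math import log2

def topic_entropy(tweets, anchor_words):
    """Shannon entropy of the word distribution over tweets matching a topic.

    A tweet is assigned to the topic if it contains any anchor word
    (simple bag-of-words membership); entropy is then computed over the
    frequencies of all words appearing in the matching tweets.
    """
    anchors = {w.lower() for w in anchor_words}
    counts = Counter()
    for tweet in tweets:
        words = tweet.lower().split()
        if anchors & set(words):          # tweet belongs to the topic
            counts.update(words)
    total = sum(counts.values())
    if total == 0:                        # no tweet matched the topic
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical usage: a "flu" topic anchored on the word "flu"
sample = [
    "flu season is here get your flu shot",
    "feeling sick with the flu today",
    "great weather for a picnic",
]
print(topic_entropy(sample, ["flu"]))
```

A sharply peaked word distribution (e.g. repetitive or spam-like tweets) yields low entropy, while diverse vocabulary yields high entropy, which is one way such a value could place a topic on a veracity spectrum.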

Record

  • Author

    Paryani, Jyotsna.

  • Author affiliation

    Oklahoma State University.

  • Degree grantor: Oklahoma State University.
  • Subject: Computer science.
  • Degree: M.S.
  • Year: 2017
  • Pages: 63 p.
  • Total pages: 63
  • Format: PDF
  • Language: English

