
A model for assessing and improving quality of textual data from product-related discussion forums.



Abstract

New sources of customer data, such as web-based product reviews, online discussions, and social networking interactions, are becoming extremely popular among companies that seek feedback from their clients. However, the quality of these data is often questionable due to the lack of control over their production. User-generated data are often inaccurate and inconsistent, which complicates their analysis and leads to unreliable conclusions. In recent years, numerous studies have been conducted to better understand and, consequently, manage the quality of the data. Yet much remains to be discovered, particularly in the area of unstructured textual data quality.

The goal of this study was to develop a model for assessing and improving the quality of unstructured textual data from product-related discussion forums. To achieve this, an extensive review of the literature was performed to identify a set of quality attributes and metrics that are suitable for web-based textual data. Then, the accuracy of detecting low-quality discussion threads was measured and compared among several classification algorithms that were trained using three types of predictors: quality metrics, words from the discussion thread titles, and words from the discussion threads themselves. The best-performing algorithm was identified and selected for the quality assessment and improvement model. Finally, low-quality discussion threads from another random sample were detected and removed using the developed model, and the effect of this removal was measured using another text classification algorithm as a way of validating the model.

As a result, it was found that the quality of web-based discussion threads is most accurately detected by the Support Vector Machine algorithm trained on the textual content of the threads rather than on their titles or quality metrics. Moreover, the study confirmed that the effectiveness of analyzing unstructured discussion data improves significantly after low-quality threads are detected and removed using the developed model.
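The detect-and-remove step described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the abstract does not specify features, kernel, or tooling, so the sketch assumes bag-of-words counts over the thread text and a bias-free linear SVM fitted by subgradient descent on the regularized hinge loss. The function names, the hyperparameters, and the four toy threads are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude assumption)."""
    return text.lower().split()

def build_vocab(texts):
    """Map each distinct training token to a feature index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Sparse bag-of-words counts; tokens outside the vocabulary are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}

def score(w, x):
    """Signed decision value w . x for a sparse feature vector x."""
    return sum(w[i] * v for i, v in x.items())

def train_linear_svm(X, y, dim, epochs=20, eta=0.1, lam=0.01):
    """Bias-free linear SVM: subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, label in zip(X, y):
            violated = label * score(w, x) < 1.0      # inside the margin?
            w = [(1.0 - eta * lam) * wi for wi in w]  # regularization shrinkage
            if violated:
                for i, v in x.items():
                    w[i] += eta * label * v           # hinge-loss subgradient step
    return w

def filter_threads(threads, w, vocab):
    """Keep only threads the model scores as high quality (score > 0)."""
    return [t for t in threads if score(w, vectorize(t, vocab)) > 0.0]

# Hypothetical toy data: +1 = high-quality thread, -1 = low-quality thread.
TRAIN = [
    ("battery drains quickly after firmware update", +1),
    ("replaced the charger and charging works again", +1),
    ("cheap pills click here now", -1),
    ("lol bump", -1),
]

texts = [text for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = build_vocab(texts)
X = [vectorize(text, vocab) for text in texts]
w = train_linear_svm(X, labels, dim=len(vocab))

new_threads = [
    "firmware update fixed battery life",
    "click here cheap watches",
]
kept = filter_threads(new_threads, w, vocab)
print(kept)
```

In practice a study like this would more plausibly rely on an established SVM library and a held-out sample for measuring accuracy. The hand-rolled trainer is used here only to keep the sketch self-contained.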

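Bibliographic record

  • Author

    Ilikchyan, Armen

  • Author affiliation

    Indiana State University

  • Degree-granting institution: Indiana State University
  • Subject: Management; Web studies; Industrial engineering
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 142 p.
  • Total pages: 142
  • Original format: PDF
  • Language of text: English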
