
Large Scale Personality Classification of Bloggers


Abstract

Personality is a fundamental component of an individual's affective behavior. Previous work on personality classification has emerged from disparate sources: the variety of algorithms and feature-selection methods applied across spoken and written data has made comparison difficult. Here, we use a large corpus of blogs to compare feature-selection choices for classification; we also use these results to identify characteristic language information relating to personality. Using Support Vector Machines, the best accuracies range from 84.36% (openness to experience) down to 70.51% (neuroticism). The best-performing feature set combined: (1) stemmed bigrams; (2) no exclusion of stopwords (i.e. common words); and (3) Boolean feature values, recording the presence or absence of a feature rather than its rate of use. We take these findings to suggest that both the structure of the text and the presence of common words are important. We also note that a common dictionary of words used for content analysis (LIWC) performs less well in this classification task, which we propose is due to its conceptual breadth. To get a better sense of how personality is expressed in the blogs, we explore the best-performing features and discuss how these can provide a deeper understanding of personality language behavior online.
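The three feature choices named in the abstract (stemmed bigrams, stopwords retained, Boolean presence rather than rate of use) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names are hypothetical, the tokenizer is a simple regular expression, and `crude_stem` is a deliberately rough stand-in for a real stemmer such as Porter's. The resulting vectors are the kind of input one might pass to a Support Vector Machine.

```python
import re
from collections import Counter


def crude_stem(token):
    """Rough suffix stripper (assumption: stands in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase and split into word tokens; stopwords are NOT removed."""
    return re.findall(r"[a-z']+", text.lower())


def stemmed_bigrams(text):
    """Return the list of adjacent pairs of stemmed tokens."""
    stems = [crude_stem(t) for t in tokenize(text)]
    return list(zip(stems, stems[1:]))


def boolean_features(text, vocabulary):
    """1 if the bigram occurs at all in the text, else 0."""
    present = set(stemmed_bigrams(text))
    return [1 if bigram in present else 0 for bigram in vocabulary]


def frequency_features(text, vocabulary):
    """Rate of use: bigram count divided by the number of bigrams in the text."""
    bigrams = stemmed_bigrams(text)
    counts = Counter(bigrams)
    total = len(bigrams) or 1  # avoid division by zero on very short texts
    return [counts[bigram] / total for bigram in vocabulary]


# Hypothetical two-post corpus, for illustration only.
posts = ["I was walking and I was talking", "She walks to the park"]
vocabulary = sorted({b for post in posts for b in stemmed_bigrams(post)})

boolean_vectors = [boolean_features(post, vocabulary) for post in posts]
frequency_vectors = [frequency_features(post, vocabulary) for post in posts]
```

In the first post the bigram `("i", "was")` occurs twice, so its Boolean value is 1 while its frequency value is 2/6; the Boolean encoding discards how often a feature is used, which is the contrast the abstract draws.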
