Journal: Expert Systems with Applications

A language-independent authorship attribution approach for author identification of text documents


Abstract

In the Authorship Attribution (AA) task, the most likely author of a textual document, such as a book, paper, news article, text message, or post, is identified using statistical and computational methods. This paper presents a new computational approach for identifying the most likely author of text documents. The proposed solution emphasizes lazy profile-based classification and, building on the Term Frequency-Inverse Document Frequency (TF_IDF) scheme, introduces a new measure for identifying the important terms of documents. The importance of the terms is then used to calculate the similarity between an anonymous document and known documents. The proposed solution works with raw text documents and does not require any NLP tools for preprocessing, which makes it language-independent. Its efficiency has been evaluated through several experiments on two datasets, one English and one Persian, each of which contains six corpora with different numbers of authors. The results show that the proposed solution outperforms state-of-the-art stylometric features, employed by seven well-known classifiers, obtaining 0.902 accuracy on the English dataset and 0.931 accuracy on the Persian dataset. In addition, supplementary experiments evaluate the effect of document length on the accuracy of the proposed solution, examine the computation time of the proposed solution and the competing classifiers, and identify the most effective stylometric features and classifiers.
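
The abstract does not define the paper's new term-importance measure, so the sketch below is only an illustration of the general idea it describes: profile-based, lazy attribution on raw text using TF-IDF weights. It assumes standard smoothed TF-IDF and cosine similarity as stand-ins for the paper's own measure, and the function names (`tokenize`, `build_profiles`, `attribute`) are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace splitting only: no stemmer, tagger, or other
    # language-specific NLP tool (an assumption of this sketch).
    return text.lower().split()


def build_profiles(docs_by_author):
    # Profile-based: all known documents of an author are merged
    # into a single bag of terms.
    return {
        author: Counter(t for doc in docs for t in tokenize(doc))
        for author, docs in docs_by_author.items()
    }


def tfidf(counts, doc_freq, n_profiles):
    # Standard smoothed TF-IDF, used here as a stand-in for the
    # paper's measure, which the abstract does not specify.
    total = sum(counts.values())
    return {
        term: (count / total)
        * (math.log((1 + n_profiles) / (1 + doc_freq.get(term, 0))) + 1)
        for term, count in counts.items()
    }


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute(unknown_text, docs_by_author):
    # Lazy: nothing is trained ahead of time; profiles and weights
    # are computed when an anonymous document arrives.
    profiles = build_profiles(docs_by_author)
    n_profiles = len(profiles)
    # Document frequency = number of author profiles containing the term.
    doc_freq = Counter(t for counts in profiles.values() for t in counts)
    query = tfidf(Counter(tokenize(unknown_text)), doc_freq, n_profiles)
    scores = {
        author: cosine(query, tfidf(counts, doc_freq, n_profiles))
        for author, counts in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    known = {
        "alice": ["the sea the ship the storm", "ship and sea and sail"],
        "bob": ["market price stock trade", "stock market bank price"],
    }
    best, scores = attribute("a storm hit the ship at sea", known)
    print(best, scores)
```

Because the only preprocessing is lowercasing and whitespace splitting, the same code runs unchanged on any whitespace-delimited language, which is the property the abstract calls language independence. Scripts without whitespace word boundaries would need a different `tokenize`, for example character n-grams.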
