A language-independent authorship attribution approach for author identification of text documents

Ramezani Reza

Abstract

In the Authorship Attribution (AA) task, the most likely author of a textual document, such as a book, paper, news article, text message, or post, is identified using statistical and computational methods. In this paper, a new computational approach is presented for identifying the most likely author of text documents. The proposed solution emphasizes lazy profile-based classification and, by using the Term Frequency-Inverse Document Frequency (TF-IDF) scheme, introduces a new measure for identifying the important terms of documents. The importance of the terms is then used to calculate the similarity between an anonymous document and known documents. The proposed solution works with raw text documents and does not require any NLP tools for preprocessing, which makes it language-independent. The efficiency of the proposed solution has been evaluated by conducting several experiments on two datasets, one English and one Persian, each of which contains six corpora with different numbers of authors. The obtained results demonstrate that the proposed solution outperforms state-of-the-art stylometric features, employed by seven well-known classifiers, by obtaining 0.902 accuracy for the English dataset and 0.931 accuracy for the Persian dataset. In addition, supplementary experiments have been conducted to evaluate the effects of document length on the accuracy of the proposed solution, to examine the computation time of the proposed solution and the competing classifiers, and to identify the most effective stylometric features and classifiers.
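
The abstract does not spell out the new term-importance measure, so the following is only a minimal sketch of the general idea it describes: lazy, profile-based attribution over raw text. It substitutes ordinary smoothed TF-IDF weights and cosine similarity for the paper's own measure, and every name in it (tokenize, build_profiles, idf_weights, tfidf, cosine, attribute) is hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace splitting only: no stemmer, tagger, or stop-word list,
    # so the same code runs unchanged on English or Persian input.
    return text.lower().split()


def build_profiles(docs_by_author):
    # Profile-based: merge each author's known documents into one bag of terms.
    return {
        author: Counter(tok for doc in docs for tok in tokenize(doc))
        for author, docs in docs_by_author.items()
    }


def idf_weights(profiles):
    # Smoothed IDF computed over author profiles.
    # Assumption: this stands in for the paper's importance measure.
    n = len(profiles)
    df = Counter(term for counts in profiles.values() for term in counts)
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf(counts, idf):
    total = sum(counts.values())
    # Terms that appear in no known profile carry no weight.
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute(anonymous_text, docs_by_author):
    # Lazy: no model is trained in advance; all work happens at query time.
    profiles = build_profiles(docs_by_author)
    idf = idf_weights(profiles)
    query = tfidf(Counter(tokenize(anonymous_text)), idf)
    scores = {a: cosine(query, tfidf(c, idf)) for a, c in profiles.items()}
    return max(scores, key=scores.get), scores
```

Calling attribute with an anonymous text and a dictionary that maps each candidate author to a list of known documents returns the best-scoring author together with the per-author similarity scores. Because tokenization is a plain whitespace split, no language-specific resource is involved, which is the property the abstract calls language independence.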

Bibliographic record

Source: Expert Systems with Applications, 2021, Issue 10, pp. 115139.1-115139.15 (15 pages)
Author: Ramezani Reza
Affiliation: Univ Isfahan, Fac Comp Engn, Dept Software Engn, Esfahan, Iran
Original format: PDF
Language: English
Keywords: Authorship attribution; Author identification; Text similarity; Term frequency; Inverse document frequency; NLP
