Journal: EPJ Data Science

Exploiting citation networks for large-scale author name disambiguation


Abstract

We present a novel algorithm and validation method for disambiguating author names in very large bibliographic data sets and apply it to the full Web of Science (WoS) citation index. Our algorithm relies only upon the author and citation graphs available for the whole period covered by the WoS. A pairwise publication similarity metric, based on common co-authors, self-citations, shared references and citations, is established to perform a two-step agglomerative clustering that first connects individual papers and then merges similar clusters. This parameterized model is optimized using an h-index-based recall measure, which favors the correct assignment of well-cited publications, and a name-initials-based precision measure that uses WoS metadata and cross-referenced Google Scholar profiles. Despite the use of limited metadata, we reach a recall of 87% and a precision of 88%, with a preference for researchers with high h-index values. The 47 million articles of WoS can be disambiguated on a single machine in less than a day. We develop an h-index distribution model, confirm that its prediction is in excellent agreement with the empirical data, and gain insight into the utility of the h-index in real academic ranking scenarios.
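
The two-step clustering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract names the four similarity signals but not how they are combined, so the weighted sum, the unit weights, the average-linkage merge rule, and both thresholds below are assumptions chosen for illustration. The names `Paper`, `similarity`, `link_papers`, and `merge_clusters` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Paper:
    """One publication carrying the same ambiguous author name (hypothetical record)."""
    pid: str
    coauthors: frozenset = frozenset()   # names of the other authors
    references: frozenset = frozenset()  # ids of papers this paper cites
    cited_by: frozenset = frozenset()    # ids of papers that cite this paper


def similarity(a, b, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four signals named in the abstract (weights are assumed)."""
    w_co, w_self, w_ref, w_cit = weights
    shared_coauthors = len(a.coauthors & b.coauthors)
    # A citation between two papers under the same name counts as a self-citation.
    self_citation = int(a.pid in b.references or b.pid in a.references)
    shared_references = len(a.references & b.references)
    shared_citations = len(a.cited_by & b.cited_by)
    return (w_co * shared_coauthors + w_self * self_citation
            + w_ref * shared_references + w_cit * shared_citations)


def link_papers(papers, threshold):
    """Step 1: connect individual papers whose similarity exceeds the threshold."""
    parent = {p.pid: p.pid for p in papers}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(papers, 2):
        if similarity(a, b) > threshold:
            parent[find(a.pid)] = find(b.pid)

    clusters = {}
    for p in papers:
        clusters.setdefault(find(p.pid), []).append(p)
    return list(clusters.values())


def cluster_similarity(c1, c2):
    """Mean pairwise similarity between two clusters (an assumed linkage rule)."""
    total = sum(similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def merge_clusters(clusters, threshold):
    """Step 2: repeatedly merge clusters whose mean similarity exceeds the threshold."""
    clusters = [list(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if cluster_similarity(clusters[i], clusters[j]) > threshold:
                clusters[i].extend(clusters.pop(j))
                merged = True
                break  # indices changed, so restart the scan
    return clusters
```

Calling `merge_clusters(link_papers(papers, t1), t2)` on all papers that share one author name returns a list of clusters, each standing for one presumed individual. This sketch compares all pairs within a name group; in the paper, the weights and thresholds are what the h-index-based recall and name-initials-based precision measures are used to optimize.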
