Using Similarity Measures to Select Pretraining Data for NER

Abstract

Word vectors and Language Models (LMs) pretrained on a large amount of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, the measure and impact of similarity between pretraining data and target task data are left to intuition. We propose three cost-effective measures to quantify different aspects of similarity between source pretraining and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) over 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but pretrained word vectors are better when pretraining data is dissimilar.
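
The abstract does not spell out the three proposed measures, but a minimal sketch of one plausible measure of this kind is shown below: the fraction of the target NER corpus's vocabulary that also appears in the source pretraining corpus. The function name, the top_k parameter, and the toy corpora are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch, assuming a vocabulary-overlap style similarity
# measure; the paper's three measures are not specified in this abstract.
from collections import Counter

def vocab_overlap(source_tokens, target_tokens, top_k=10000):
    """Share of the target corpus's top-k word types found in the source corpus."""
    source_vocab = set(source_tokens)
    target_counts = Counter(target_tokens)
    top_target = [w for w, _ in target_counts.most_common(top_k)]
    return sum(1 for w in top_target if w in source_vocab) / len(top_target)

if __name__ == "__main__":
    # Toy corpora (hypothetical): a clinical pretraining source vs. a clinical NER target.
    source = "the patient was given aspirin for chest pain".split()
    target = "patient reports chest pain relieved by aspirin".split()
    print(f"Target vocabulary covered: {vocab_overlap(source, target):.2f}")
```

A cheap lexical statistic like this can be computed without training anything, which matches the abstract's emphasis on cost-effective measures that predict whether pretraining data will transfer well to the target task.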