Using Similarity Measures to Select Pretraining Data for NER

Abstract

Word vectors and Language Models (LMs) pretrained on a large amount of unlabelled data can dramatically improve various Natural Language Processing (NLP) tasks. However, the measure and impact of similarity between pretraining data and target task data are left to intuition. We propose three cost-effective measures to quantify different aspects of similarity between source pretraining and target task data. We demonstrate that these measures are good predictors of the usefulness of pretrained models for Named Entity Recognition (NER) over 30 data pairs. Results also suggest that pretrained LMs are more effective and more predictable than pretrained word vectors, but pretrained word vectors are better when pretraining data is dissimilar.
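
The abstract does not spell out the three proposed measures, but a minimal sketch of one plausible measure of this kind is shown below: the fraction of the target NER corpus's vocabulary that also appears in the source pretraining corpus. The function name, the top_k parameter, and the toy corpora are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch, assuming a vocabulary-overlap style similarity
# measure; the paper's three measures are not specified in this abstract.
from collections import Counter

def vocab_overlap(source_tokens, target_tokens, top_k=10000):
    """Share of the target corpus's top-k word types found in the source corpus."""
    source_vocab = set(source_tokens)
    target_counts = Counter(target_tokens)
    top_target = [w for w, _ in target_counts.most_common(top_k)]
    return sum(1 for w in top_target if w in source_vocab) / len(top_target)

if __name__ == "__main__":
    # Toy corpora (hypothetical): a clinical pretraining source vs. a clinical NER target.
    source = "the patient was given aspirin for chest pain".split()
    target = "patient reports chest pain relieved by aspirin".split()
    print(f"Target vocabulary covered: {vocab_overlap(source, target):.2f}")
```

A cheap lexical statistic like this can be computed without training anything, which matches the abstract's emphasis on cost-effective measures that predict whether pretraining data will transfer well to the target task.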