SYSTEM AND METHOD FOR BUILDING DIVERSE LANGUAGE MODELS

Abstract

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for collecting web data in order to create diverse language models. A system configured to practice the method first crawls, such as via a crawler operating on a computing device, a set of documents in a network of interconnected devices according to a visitation policy. The visitation policy is configured to focus on novelty regions for a current language model built from previous crawling cycles, by crawling documents whose vocabulary is considered likely to fill gaps in the current language model. A language model from a previous cycle can be used to guide the creation of a language model in the following cycle. The novelty regions can include documents with high perplexity values over the current language model.
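The perplexity-guided visitation policy described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the add-one smoothed unigram model standing in for the current language model, the names `UnigramModel` and `crawl_cycle`, and the in-memory `pages` and `links` dictionaries standing in for a network of documents. The sketch also scores a candidate page from its full text, whereas a real crawler would have to estimate novelty before fetching the page.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()


class UnigramModel:
    """Add-one smoothed unigram model, a stand-in for the current language model."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)

    def perplexity(self, tokens):
        if not tokens:
            return 0.0  # empty documents get the lowest priority
        vocab = len(self.counts) + 1  # one extra slot for unseen words
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + vocab)) for t in tokens
        )
        return math.exp(-log_prob / len(tokens))


def crawl_cycle(pages, links, seeds, model, budget):
    """Visit up to `budget` pages, highest perplexity first, updating the model."""
    # heapq is a min-heap, so perplexity is negated to pop the largest first.
    frontier = [(-model.perplexity(tokenize(pages[u])), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        model.update(tokenize(pages[url]))  # the model grows as the crawl proceeds
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                score = model.perplexity(tokenize(pages[nxt]))
                heapq.heappush(frontier, (-score, nxt))
    return visited


# Toy document graph: page "c" uses vocabulary the model has never seen.
pages = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the mat again",
    "c": "quantum chromodynamics describes gluon interactions",
}
links = {"a": ["b", "c"]}

model = UnigramModel()
model.update(tokenize("the cat sat on the mat"))  # model from a previous cycle
order = crawl_cycle(pages, links, ["a"], model, budget=2)
print(order)
```

With a budget of two pages, the crawl visits the seed and then prefers page `c` over page `b`, because `c` has the higher perplexity under the current model. Scores in the frontier are computed once at push time and are not refreshed as the model changes, which keeps the sketch short at the cost of some staleness.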