Journal: Language Resources and Evaluation

Building and evaluating web corpora representing national varieties of English



Abstract

Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the Chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.
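The abstract names two of its comparison tools, a chi-square-based corpus measure and spelling variants, without giving their exact formulation. The following is a minimal sketch of one common way to compute each, not the authors' implementation. The function names, the `top_n` cutoff, the toy token lists, and the variant pair list are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora combined.

    Larger values suggest the two corpora are less alike. top_n is an
    illustrative default, not a value taken from the article.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    combined = freq_a + freq_b
    chi2 = 0.0
    for word, joint in combined.most_common(top_n):
        # Expected counts if both corpora were drawn from the same variety.
        expected_a = joint * size_a / total
        expected_b = joint * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


def variant_ratio(tokens, variant_pairs):
    """Share of first-member spellings among all occurrences of the listed pairs.

    Returns None when none of the listed spellings occur.
    """
    freq = Counter(tokens)
    first = sum(freq[a] for a, _ in variant_pairs)
    second = sum(freq[b] for _, b in variant_pairs)
    if first + second == 0:
        return None
    return first / (first + second)


# Toy token lists standing in for corpora built from two national domains.
corpus_ca = "the colour of the centre is a favourite color".split()
corpus_us = "the color of the center is a favorite color".split()
pairs = [("colour", "color"), ("centre", "center"), ("favourite", "favorite")]

print(chi_square_distance(corpus_ca, corpus_us, top_n=10))
print(variant_ratio(corpus_ca, pairs), variant_ratio(corpus_us, pairs))
```

On real data the token lists would come from tokenized web pages, and the variant list from an established inventory of spelling differences. Both are outside the scope of this sketch.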
