Conference: 9th International Conference on Language Resources and Evaluation

TaLAPi - A Thai Linguistically Annotated Corpus for Language Processing

Abstract

This paper discusses a Thai corpus, TaLAPi, fully annotated with word segmentation (WS), part-of-speech (POS) and named entity (NE) information, with the aim of providing a high-quality and sufficiently large corpus for real-life implementation of Thai language processing tools. The corpus contains 2,720 articles (1,043,471 words) from the entertainment and lifestyle (NE&L) domain and 5,489 articles (3,181,487 words) in the news (NEWS) domain, with a total of 35 POS tags and 10 named entity categories. In particular, we present an approach to segmenting and tagging foreign and loan words expressed in transliterated or original form in Thai text corpora. We see this as an area for study because adapted and un-adapted foreign language sequences have not been well addressed in the literature, which poses a challenge to the annotation process given the increasing use and adoption of foreign words in the Thai language. To reduce ambiguities in POS tagging and to provide rich information for facilitating Thai syntactic analysis, we adapted the POS tags used in ORCHID and propose a framework for tagging Thai text that also addresses the tagging of loan and foreign words based on the proposed segmentation strategy. TaLAPi also includes a detailed guideline for tagging the 10 named entity categories.
