Journal: Language Resources and Evaluation

Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence

Abstract

The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th-18th centuries), whose texts were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of testimonies of witnesses in trials and samples of private correspondence. In addition to the morphological analysis, each file carries metadata intended to facilitate sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian and was adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using an easy-to-use web-based manual disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for the refinement of the computational morphology and for the iterative retraining of the tagger's statistical models. The paper discusses some of the typical problems that occurred during the normalization procedure, together with their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, a special function of which also makes it possible to correct the annotation. The beta version of the first fully normalized and annotated historical corpus of Hungarian, which displays the original, the normalized, and the parsed versions of the selected texts, is freely accessible at http://tmk.nytud.hu/.
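
The workflow summarized in the abstract (normalize, analyze, disambiguate, then flag tokens for manual checking) can be illustrated with a minimal Python sketch. This is a toy illustration under stated assumptions, not the authors' system: it does not call PurePos or HuMor, and the normalization table, the lexicon, the tag names, and the one-line disambiguation rule are all hypothetical stand-ins invented for this example rather than data taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical normalization table (original spelling -> normalized form).
# In the project described above, normalization was done manually.
NORMALIZATION = {"hogÿ": "hogy", "war": "vár"}

# Hypothetical lexicon standing in for a morphological analyzer:
# normalized form -> candidate tags (tag names are illustrative only).
LEXICON = {"hogy": ["CONJ"], "a": ["DET"], "vár": ["VERB", "NOUN"]}


@dataclass
class Token:
    original: str                      # spelling as found in the source text
    normalized: str = ""               # normalized spelling
    candidates: List[str] = field(default_factory=list)  # analyzer output
    tag: str = ""                      # tag chosen by the disambiguator
    needs_review: bool = False         # flag for manual checking


def normalize(token: Token) -> None:
    """Look up the normalized form; fall back to the lowercased original."""
    key = token.original.lower()
    token.normalized = NORMALIZATION.get(key, key)


def analyze(token: Token) -> None:
    """Attach every candidate tag the toy lexicon knows for this form."""
    token.candidates = list(LEXICON.get(token.normalized, ["UNKNOWN"]))


def disambiguate(token: Token, previous_tag: str) -> None:
    """Pick one tag with a toy rule (a stand-in for a statistical tagger)."""
    if previous_tag == "DET" and "NOUN" in token.candidates:
        token.tag = "NOUN"
    else:
        token.tag = token.candidates[0]
    # Ambiguous or unknown tokens are queued for a human annotator.
    token.needs_review = len(token.candidates) > 1 or token.tag == "UNKNOWN"


def annotate(clause: str) -> List[Token]:
    """Run the three steps over one whitespace-tokenized clause."""
    tokens = [Token(word) for word in clause.split()]
    previous_tag = ""
    for token in tokens:
        normalize(token)
        analyze(token)
        disambiguate(token, previous_tag)
        previous_tag = token.tag
    return tokens


if __name__ == "__main__":
    for tok in annotate("hogÿ a war"):
        print(tok.original, tok.normalized, tok.tag, tok.needs_review)
```

Keeping the original form, the normalized form, and the chosen analysis on the same token record mirrors the three views (original, normalized, parsed) that the abstract says the query interface displays. The `needs_review` flag marks where a human annotator would confirm or correct the automatic choice.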