Conference: 2014 IEEE/ACM Joint Conference on Digital Libraries

The anatomy of a search and mining system for digital humanities

Abstract

Samtla (Search And Mining Tools with Linguistic Analysis) is an online integrated research environment designed in collaboration with historians and linguists to facilitate the study of digitised texts written in any language. It currently supports research on two corpora: the Genizah collection held by the Taylor-Schechter Genizah Research Unit at Cambridge University, and a collection of Aramaic incantation texts from late antiquity. In contrast to standard search engines and text mining systems, which rely on the bag-of-words representation of text, Samtla supports the retrieval and discovery of fuzzy text patterns or motifs (known to historians as "formulae"). It does so by applying a character-based n-gram statistical language model built on top of a generalised suffix tree data structure. This paper briefly describes the major components of Samtla and their underlying techniques.
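To illustrate the idea of a character-based n-gram language model used for fuzzy retrieval, here is a minimal sketch in Python. It is not Samtla's implementation: the abstract does not give the model's details, so the trigram order, the add-alpha smoothing, and the names `CharNgramModel` and `rank` are all assumptions made for illustration. The n-gram counts are also kept in a plain `Counter` rather than in a generalised suffix tree.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramModel:
    """Character n-gram model of one document (hypothetical, for illustration)."""

    def __init__(self, text, n=3, alpha=0.1, vocab_size=64):
        self.n = n
        self.alpha = alpha            # add-alpha smoothing constant (assumed)
        self.vocab_size = vocab_size  # alphabet size used for smoothing
        padded = " " * (n - 1) + text
        grams = char_ngrams(padded, n)
        self.ngram_counts = Counter(grams)
        self.context_counts = Counter(g[:-1] for g in grams)

    def log_prob(self, query):
        """Smoothed log-probability of `query` under this document's model."""
        padded = " " * (self.n - 1) + query
        total = 0.0
        for gram in char_ngrams(padded, self.n):
            num = self.ngram_counts[gram] + self.alpha
            den = self.context_counts[gram[:-1]] + self.alpha * self.vocab_size
            total += math.log(num / den)
        return total


def rank(documents, query, n=3):
    """Rank documents (dict: id -> text) by the query's log-likelihood, best first."""
    alphabet = set(query).union(*documents.values())
    scored = [
        (doc_id, CharNgramModel(text, n, vocab_size=len(alphabet)).log_prob(query))
        for doc_id, text in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because scoring works on characters rather than whole words, a query with a spelling variant (for example `"nane of the lord"` for `"name of the lord"`) still shares most of its trigrams with a document containing the intended phrase. A bag-of-words model, by contrast, would treat the misspelt word as an entirely unseen term.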
