System and method for transcribing historical records into digitized text

Abstract

A handwriting recognition system converts word images on documents, such as document images of historical records, into computer searchable text. Word images (snippets) on the document are located, and have multiple word features identified. For each word image, a word feature vector is created representing multiple word features. Based on the similarity of word features (e.g., the distance between feature vectors), similar words are grouped together in clusters, and a centroid that has features most representative of words in the cluster is selected. A digitized text word is selected for each cluster based on review of a centroid in the cluster, and is assigned to all words in that cluster and is used as computer searchable text for those word images where they appear in documents. An analyst may review clusters to permit refinement of the parameters used for grouping words in clusters, including the adjustment of weights and other factors used for determining the distance between feature vectors.
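
The pipeline described in the abstract (feature vectors, a weighted distance, clusters, a representative centroid, and one transcription shared by the whole cluster) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the three-number feature vectors, the weighted Euclidean distance, the greedy threshold grouping, the choice of the medoid as the "centroid", and all function names.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def weighted_distance(a: Vector, b: Vector, weights: Vector) -> float:
    """Weighted Euclidean distance between two word feature vectors.

    The weights stand in for the analyst-adjustable parameters (assumed form).
    """
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def cluster_words(vectors: List[Vector], weights: Vector,
                  threshold: float) -> List[List[int]]:
    """Greedy grouping (an assumed strategy): each word image joins the first
    cluster whose first member lies within `threshold`, else starts a new one."""
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if weighted_distance(vec, vectors[members[0]], weights) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def pick_centroid(members: List[int], vectors: List[Vector],
                  weights: Vector) -> int:
    """Return the member with the smallest total distance to the other
    members (the medoid), used here as the most representative word image."""
    return min(
        members,
        key=lambda i: sum(
            weighted_distance(vectors[i], vectors[j], weights) for j in members
        ),
    )


def propagate_labels(clusters: List[List[int]], centroid_labels: Dict[int, str],
                     vectors: List[Vector], weights: Vector) -> Dict[int, str]:
    """Assign the text chosen for each cluster's centroid to every member."""
    labels: Dict[int, str] = {}
    for members in clusters:
        text = centroid_labels[pick_centroid(members, vectors, weights)]
        for i in members:
            labels[i] = text
    return labels


# Hypothetical feature vectors for five word images.
vectors = [(2.0, 1.0, 0.0), (2.1, 1.0, 0.0), (5.0, 0.0, 2.0),
           (1.9, 1.0, 0.0), (5.2, 0.0, 2.0)]
weights = (1.0, 2.0, 2.0)

clusters = cluster_words(vectors, weights, threshold=0.5)
# A reviewer transcribes only the centroid of each cluster (indices 0 and 2 here).
labels = propagate_labels(clusters, {0: "Smith", 2: "Jones"}, vectors, weights)
print(clusters, labels)
```

In this sketch only two word images are transcribed by hand, and the remaining three inherit their text from the cluster they fall into. Changing `weights` or `threshold` changes which words are grouped together, which is the kind of refinement the abstract attributes to the analyst.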