Journal: BMC Medical Informatics and Decision Making

Detection of sentence boundaries and abbreviations in clinical narratives


Abstract

Background: In Western languages the period character is highly ambiguous because it serves both as a sentence delimiter and as an abbreviation marker. This is particularly relevant in clinical free text, which is characterized by numerous anomalies in spelling, punctuation and vocabulary, and by a high frequency of short forms.

Methods: The problem is addressed with two binary classifiers, one for abbreviation detection and one for sentence detection. For each classification task, a support vector machine with a linear kernel is trained on different combinations of feature sets. Feature relevance ranking is applied to investigate which features are important for each task. The methods are applied to German-language texts from a medical record system, authored by specialized physicians.

Results: Two collections of 3,024 text snippets were annotated for the role of period characters, for use in training and testing. Inter-annotator agreement, measured as Cohen's kappa, was 0.98. For abbreviation and sentence boundary detection, 10-fold cross-validation on the training set yielded an unweighted micro-averaged F-measure of 0.97. On the test set, the unweighted micro-averaged F-measure was 0.95 for abbreviation detection and 0.94 for sentence delineation. Language-dependent resources and rules were found to have less impact on abbreviation detection than on sentence delineation.

Conclusions: Sentence detection is an important task that should be performed at the beginning of a text processing pipeline. For the text genre under scrutiny, support vector machines with a linear kernel produce state-of-the-art results for sentence boundary detection. The results are comparable with those of other sentence boundary detection methods applied to English clinical texts. Abbreviation detection was identified as a supportive task for sentence delineation.
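The two evaluation measures named in the Results, Cohen's kappa and the micro-averaged F-measure, can be illustrated with a minimal Python sketch. This is not the authors' code: the label names `abbr` and `end` and the ten example labels are invented for illustration, and the functions only follow the standard textbook definitions of the two measures.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def micro_f1(gold, predicted):
    """Micro-averaged F-measure: pool TP, FP and FN over all classes."""
    tp = fp = fn = 0
    for cls in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == cls and g == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif g == cls:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for ten period characters:
# "abbr" = abbreviation marker, "end" = sentence delimiter.
annotator_a = ["abbr", "abbr", "end", "end", "abbr",
               "end", "abbr", "end", "abbr", "end"]
annotator_b = ["end", "abbr", "end", "end", "abbr",
               "end", "abbr", "end", "abbr", "end"]

kappa = cohens_kappa(annotator_a, annotator_b)
f1 = micro_f1(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}, micro-F1 = {f1:.2f}")
```

When every item receives exactly one label, pooling the counts over both classes makes the micro-averaged F-measure equal to plain accuracy, which is why a single disagreement in ten items yields 0.90 here.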