Research on Dictionary-Free Word Segmentation Methods for Chinese Biomedical Text

Wang Junhui (王军辉); Hu Tiejun (胡铁军); Li Danya (李丹亚); Qian Qing (钱庆); Fang An (方安)


Abstract

To segment Chinese biomedical text effectively without a dictionary, this paper takes into account the characteristics of such text (abundant specialized terminology, the continual emergence of new terms, and structured abstracts) and introduces a dictionary-free word segmentation method based on the recurrence principle. In practical application the method is improved in two respects. First, no upper limit is imposed on the length of a segmented term. Second, terms and their hierarchical feature terms are extracted in a single pass. Experimental results show that, without a dictionary or corpus-based training, the method effectively extracts the key specialized terms in biomedical text, with a segmentation accuracy of about 84.51%. Finally, based on the segmentation results, a preliminary study of the word-length distribution in the biomedical domain is presented. The results show that the word-length distribution of Chinese biomedical text differs greatly from that of general Chinese text. This finding offers a reference for choosing the value of N in N-gram models when processing Chinese biomedical text.
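The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general recurrence idea, not the authors' implementation. It assumes that a candidate term is any character string that occurs at least twice within punctuation-delimited clauses, that candidate length grows until nothing repeats (so no upper limit is set), and that a shorter string is discarded when it never occurs outside a longer repeated string. A nested string that also occurs independently is kept, which loosely mirrors the idea of hierarchical terms. All function names, thresholds, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; names and thresholds are assumptions,
# not taken from the paper.
CLAUSE_BREAK = re.compile(r"[，。；：、！？,.;:!?()（）\s]+")


def repeated_ngrams(text, min_len=2, min_count=2):
    """Collect every character n-gram that recurs, with no upper bound on n."""
    clauses = [c for c in CLAUSE_BREAK.split(text) if c]
    kept = {}
    n = min_len
    while True:
        counts = Counter(
            clause[i:i + n]
            for clause in clauses
            for i in range(len(clause) - n + 1)
        )
        frequent = {g: c for g, c in counts.items() if c >= min_count}
        if not frequent:
            # If no n-gram repeats, no longer string can repeat either.
            break
        kept.update(frequent)
        n += 1
    return kept


def drop_subsumed(kept):
    """Remove strings that only ever occur inside a longer repeated string."""
    subsumed = set()
    for gram, count in kept.items():
        # A prefix or suffix with the same count never occurs on its own.
        for part in (gram[:-1], gram[1:]):
            if kept.get(part) == count:
                subsumed.add(part)
    return {g: c for g, c in kept.items() if g not in subsumed}


def length_distribution(terms):
    """Map term length (in characters) to the number of extracted terms."""
    return Counter(len(t) for t in terms)


sample = "造血干细胞移植是治疗白血病的方法。造血干细胞来源于骨髓。干细胞研究进展迅速。"
terms = drop_subsumed(repeated_ngrams(sample))
print(terms)                       # {'干细胞': 3, '造血干细胞': 2}
print(length_distribution(terms))  # one term of length 3, one of length 5
```

In this toy input, "造血干细胞" recurs twice and "干细胞" recurs three times, so both survive, while fragments such as "血干细" are dropped because they never appear outside the longer term. A length histogram of this kind is one simple way to see which values of N an N-gram model would need to cover.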

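Bibliographic information

Source: 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2011, No. 2, pp. 197-203 (7 pages)
Authors: Wang Junhui (王军辉); Hu Tiejun (胡铁军); Li Danya (李丹亚); Qian Qing (钱庆); Fang An (方安)
Affiliation (all five authors): Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020
Original format: PDF
Language of full text: Chinese (chi)
CLC classification: not listed
Keywords: dictionary-free word segmentation (无词典分词); structured format (结构式)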

