Chinese part-of-speech (POS) tagging is a foundational task in Chinese information processing. This paper proposes an unsupervised Chinese POS tagging method based on conditional random fields (CRFs). First, a dictionary is used to assign POS tags to raw text that has already been word-segmented, which yields an initial tagged corpus. CRFs are then used to retag the corpus iteratively, refining the tagging result step by step. Using the Penn Treebank as the experimental corpus, the paper examines how the amount of tagged data affects model performance. Experiments on four corpora of different sizes show that POS tagging accuracy improves by 1.88% to 2.26%.
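The two-stage procedure described above (dictionary-based initial tagging, then iterative retagging) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract's CRF is replaced by a simple count-based stand-in tagger so that the example stays self-contained, and the function names, the default tag for out-of-dictionary words, the restriction of retagging to dictionary candidates, and the stopping rule are all hypothetical choices that the abstract does not specify.

```python
from collections import Counter, defaultdict


def initial_tag(sentences, lexicon, default_tag="NN"):
    """Stage 1: give each word the first tag listed for it in the dictionary.

    Falling back to `default_tag` for unknown words is an assumption.
    """
    return [[(w, lexicon.get(w, [default_tag])[0]) for w in sent]
            for sent in sentences]


def train_stand_in(tagged):
    """Stand-in for CRF training (NOT a CRF): count tag transitions and word/tag pairs."""
    trans = defaultdict(Counter)  # previous tag -> counts of the next tag
    emit = defaultdict(Counter)   # word -> counts of its tag
    for sent in tagged:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[word][tag] += 1
            prev = tag
    return trans, emit


def retag(sentences, model, lexicon, default_tag="NN"):
    """Retag greedily, choosing among dictionary candidates by count score."""
    trans, emit = model
    result = []
    for sent in sentences:
        prev, tagged = "<s>", []
        for word in sent:
            candidates = lexicon.get(word, [default_tag])
            tag = max(candidates, key=lambda t: trans[prev][t] + emit[word][t])
            tagged.append((word, tag))
            prev = tag
        result.append(tagged)
    return result


def iterative_tagging(sentences, lexicon, default_tag="NN", max_iter=10):
    """Stage 2: retrain on the current tags and retag until nothing changes."""
    tagged = initial_tag(sentences, lexicon, default_tag)
    for _ in range(max_iter):  # the iteration cap is an assumed stopping rule
        model = train_stand_in(tagged)
        new_tagged = retag(sentences, model, lexicon, default_tag)
        if new_tagged == tagged:
            break
        tagged = new_tagged
    return tagged


if __name__ == "__main__":
    # Toy pre-segmented input and a toy dictionary (both made up for illustration).
    sentences = [["我", "爱", "书"], ["他", "画", "画"]]
    lexicon = {"我": ["PN"], "他": ["PN"], "爱": ["VV"],
               "书": ["NN"], "画": ["NN", "VV"]}
    for sent in iterative_tagging(sentences, lexicon):
        print(sent)
```

To move closer to the method in the abstract, `train_stand_in` and `retag` would be swapped for training and decoding with an actual CRF toolkit; the outer loop in `iterative_tagging` would stay the same.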