The hot phrases in the social network text streams can reflect the hidden hot topics and sudden events.This paper proposes a hot phrase mining technology which can support various hot degree measures without word segmentation. We first construct an AC-Trie using the candidate phrases gathered from text streams.Based on such AC-Trie,we record the historical occurrence frequency of phrases on the Trie by scanning the following streams in single-pass.Furthermore,the AC-Trie needs to be reconstructed using the new samples in the text stream because of the evolution of hot phrases.Thus,we start the reconstruction dynamically according to estimating the occurrence frequency of the missed phrases.The experiments on the Sina micro-blog show that our approach is effective (precision of 89%)and efficient (overhead is 2%of naïve ap-proach).%在线社交网络文本流中的热点短语能反映文本流中隐含的热点话题和突发事件。本文提出了一种无需分词并能支持多种热度度量函数的热点短语挖掘技术。首先用文本流的某个典型时段采样得到候选短语,构建AC-Trie前缀树。然后,基于该前缀树,单遍扫描后续的文本流,将候选短语的历史出现频率记录在Trie相应节点上,从而支持多种基于历史频率的热度计算方法。此外,为及时发现新的热点短语并减少AC-Trie的构建次数,本文通过分析Trie树各节点上的遗漏短语频率,动态确定候选短语的更新时机。新浪微博数据集上的实验验证了本文方法的有效性(准确率达89%)和高效性(时空开销仅为基准算法的2%)。
展开▼