Foreign-language journal: International Journal of Software Innovation

A Novel Approach for Ontology-Based Feature Vector Generation for Web Text Document Classification


Abstract

Extracting the feature vector used in mining tasks (classification, clustering, etc.) is considered the most important step for enhancing text processing capabilities. This paper proposes a novel approach to building the feature vector used in web text document classification, one that adds semantics to the generated feature vector. The approach exploits the hierarchical structure of the WordNet ontology to eliminate from the feature vector meaningless words that have no semantic relation to any WordNet lexical category; this reduces the feature vector size without losing information about the text. It also enriches the feature vector by concatenating each word with its corresponding WordNet lexical category. For the mining tasks, the Vector Space Model (VSM) is used to represent text documents, and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach and against an ontology-based reduction technique that does not add semantics to the generated feature vector, in several experiments with five different classifiers (SVM, JRIP, J48, Naive Bayes, and kNN). The experimental results show that the authors' proposed approach is more effective than these traditional approaches, achieving better classification accuracy, F-measure, precision, and recall.
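The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a small hand-written dictionary, `LEXICAL_CATEGORY`, as a hypothetical stand-in for a real WordNet lookup. Words with no category are dropped, the remaining words are concatenated with their category, and the resulting terms are weighted with one common TF-IDF variant (relative term frequency times `log(N / df)`), which is also an assumption. The function names and the `word_category` joining convention are illustrative only.

```python
import math
from collections import Counter

# Hypothetical stand-in for a WordNet lookup (word -> lexical category).
# The category labels imitate WordNet's lexicographer-file names, but this
# table is illustrative and not taken from the paper.
LEXICAL_CATEGORY = {
    "dog": "noun.animal",
    "cat": "noun.animal",
    "bank": "noun.group",
    "loan": "noun.possession",
    "run": "verb.motion",
}


def semantic_tokens(text):
    """Drop words without a lexical category; tag the rest as word_category."""
    tokens = []
    for word in text.lower().split():
        category = LEXICAL_CATEGORY.get(word)
        if category is not None:
            tokens.append(f"{word}_{category}")
    return tokens


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) in a simple vector space model."""
    token_lists = [semantic_tokens(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = ["the dog and the cat run", "the bank approved the loan xyzzy"]
vocabulary, vectors = tfidf_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, [round(weight, 3) for weight in vector])
```

In this toy setup, words such as "the" and "xyzzy" never reach the vocabulary, so the vector has one dimension per categorized term rather than one per raw word. A real pipeline would replace the dictionary with an actual WordNet query and would need a rule for words that belong to several lexical categories.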
