A Novel Approach for Ontology-Based Feature Vector Generation for Web Text Document Classification

Mohamed K. Elhadad; Khaled M. Badran; Gouda I. Salama

Abstract

Extracting the feature vector used in mining tasks (classification, clustering, etc.) is considered the most important step in enhancing text processing capabilities. This paper proposes a novel approach to building the feature vector used in web text document classification, one that adds semantics to the generated feature vector. The approach exploits the hierarchical structure of the WordNet ontology in two ways. First, it eliminates from the generated feature vector meaningless words that have no semantic relation with any WordNet lexical category, which reduces the feature vector size without losing information about the text. Second, it enriches the feature vector by concatenating each word with its corresponding WordNet lexical category. For the mining tasks, the Vector Space Model (VSM) is used to represent text documents, and Term Frequency Inverse Document Frequency (TFIDF) is used as the term weighting technique. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach, and against an ontology-based reduction technique that does not add semantics to the generated feature vector, in several experiments with five different classifiers (SVM, JRIP, J48, Naive Bayes, and kNN). The experimental results show that the authors' proposed approach outperforms these traditional approaches, achieving better classification accuracy, F-measure, precision, and recall.
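
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the small `LEXICAL_CATEGORY` dictionary is a hypothetical stand-in for a WordNet lookup (with NLTK, something like `wordnet.synsets(word)[0].lexname()` could supply the category), the `word#category` separator is an assumption, and the TFIDF variant (relative term frequency times the natural log of inverse document frequency) is one common choice that the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet lexical categories. A real pipeline
# would query WordNet instead; these few entries are illustrative only.
LEXICAL_CATEGORY = {
    "dog": "noun.animal",
    "cat": "noun.animal",
    "run": "verb.motion",
    "loan": "noun.possession",
    "money": "noun.possession",
}

def semantic_tokens(text):
    """Drop words with no lexical category; tag the rest as word#category."""
    tokens = []
    for word in text.lower().split():
        category = LEXICAL_CATEGORY.get(word)
        if category is not None:
            # Enrichment step: concatenate the word with its category.
            tokens.append(f"{word}#{category}")
    return tokens

def tfidf_vectors(documents):
    """Build one sparse TFIDF vector (a dict) per document."""
    tokenized = [semantic_tokens(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the dog and the cat run", "the loan and the money xyzzy"]
vectors = tfidf_vectors(docs)
```

Words missing from the lookup ("the", "and", "xyzzy") never enter the vocabulary, which is how the filtering step shrinks the feature vector before any classifier sees it.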

Bibliographic details

Source
International Journal of Software Innovation | 2018, Issue 1 | pp. 1-10 | 10 pages
Authors
Mohamed K. Elhadad; Khaled M. Badran; Gouda I. Salama
Affiliation
Computer Engineering Department, Military Technical College, Cairo, Egypt (listed for each of the three authors)
Original format: PDF
Language: English
Keywords
Dimensionality Reduction; Feature Extraction; Feature Vector Generation; kNN; Natural Language Processing; Ontology; PCA; Semantic Similarity; Semantic Similarity Measures; Term Frequency Inverse Document Frequency; Vector Space Model; Web Text Documents Classification; WordNet

Similar literature

1. A document classification using feature vectors based on teaching guideline for educational information on the web [J]. Minoru Nakayama, Yasutaka Shimizu. 電子情報通信学会技術研究報告. 情報セキュリティ. Information Security. 2000, Issue 11
2. A novel approach for ontology-based dimensionality reduction for web text document classification [J]. Elhadad Mohamed K., Badran Khaled Shafee S., Salama Gouda I. International journal of software innovation. 2017, Issue 4
3. Hierarchical Approach to Select Feature Vectors for Classification of Text Documents [C]. Nagesh Kapalavayi, S. N. Jayaram Murthy, Gongzhu Hu. IEEE International Conference on Computer Systems and Applications. 2006
4. Categorization of Phishing Detection Features and Using the Feature Vectors to Classify Phishing Websites [D]. Namasivayam, Bhuvana. 2017
5. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents [O]. Deepak Agnihotri, Kesari Verma, Priyanka Tripathi
6. Text Document Pre-Processing Using the Bayes Formula for Classification Based on the Vector Space Model [O]. R. Rajkumar, V. P. Kallimani, Lee Lam Hong. 2009
