Automatic classification of text into predefined categories is an increasingly important task, given the vast number of electronic documents available on the Internet and on enterprise servers. Successful text classification relies heavily on dimensionality reduction, which aims to improve classification accuracy, make the classification process more expressive, and improve computational efficiency. In this paper, two feature selection algorithms are presented, one based on sampling and the other on weighted sampling, both building on the C4.5 algorithm. Classification is performed with the Naive Bayes model in the reduced-dimensionality space. The results demonstrate considerable improvements in classification accuracy (up to 10%) compared to traditional algorithms such as C4.5, Naive Bayes and Support Vector Machines. Experiments were carried out using data sets based on the Reuters-21578 collection.