Classification algorithms incur high runtime cost when the data are high-dimensional, a problem that becomes acute in the context of big data. Feature selection (FS) is a technique for reducing dimensionality and improving learning performance. In this paper, the authors propose a hybrid FS algorithm for classification in the context of big data. First, only the most relevant features are retained, using symmetric uncertainty (SU) as the measure of correlation. The features are distributed into subsets with Apache Spark so that the SU between each feature and the target class is calculated in parallel. Then a binary particle swarm optimization (BPSO) algorithm is used to search for the optimal feature subset. Because standard BPSO has limited convergence and restricted inertia weight adjustment, the authors suggest a multiple inertia weight strategy that influences particle motion so that the search process is more varied. The authors also propose evaluating particle fitness in parallel under Spark to accelerate the algorithm. The results show that the proposed FS method achieves higher classification performance with a smaller feature subset in reasonable time.
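The SU-based relevance filter described above can be illustrated with a minimal single-machine sketch. It assumes discrete (or already discretized) features and uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), which the abstract itself does not spell out. The function names `entropy`, `symmetric_uncertainty`, and `rank_features` are hypothetical, and the Spark distribution step is not reproduced here; this is not the authors' implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is no information to share.
        return 0.0
    # Mutual information via the identity I(X; Y) = H(X) + H(Y) - H(X, Y).
    h_xy = entropy(list(zip(x, y)))
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)


def rank_features(columns, target):
    """Return feature names sorted by decreasing SU with the target class.

    `columns` maps a feature name to its list of values (assumed layout).
    """
    scores = {name: symmetric_uncertainty(col, target) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because each feature's SU score depends only on that feature and the target class, the scores can be computed independently, which is what makes the per-feature parallelization described in the abstract possible.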