...
首页> 外文期刊>International Journal of Data Science and Analytics >Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis
【24h】

Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis

机译:Apache Spark-based scalable feature extraction approaches for protein sequence and their clustering performance analysis

获取原文
获取原文并翻译 | 示例
           

摘要

Abstract Genome sequencing projects are rapidly contributing to the rise of high-dimensional protein sequence datasets. Extracting features from a high-dimensional protein sequence dataset poses many challenges. However, many features extraction methods exist, but extracting features from millions of protein sequences becomes impractical because these approaches are not scalable. Therefore, to design an efficient scalable feature extraction approach that extracts significant features, we have proposed two Apache Spark-based scalable feature extraction approaches that extracts significantly important features based on statistical properties from huge protein sequences, which are termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). The proposed 60d-SPF and 6d-SCPSF approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence in terms of 60-dimensional and 6-dimensional features, respectively. The preprocessed huge protein sequences are used as an input in four clustering algorithms, i.e., scalable random sampling with iterative optimization fuzzy c-means (SRSIO-FCM), scalable literal fuzzy c-means (SLFCM), kernelized SRSIO-FCM (KSRSIO-FCM), and kernelized SLFCM (KSLFCM) for clustering. We have conducted extensive experiments on various soybean protein datasets to demonstrate the effectiveness of the proposed feature extraction methods, 60d-SPF, 6d-SCPSF, and existing feature extraction methods on SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms. The reported results in terms of the Silhouette index and the Davies–Bouldin index show that the proposed 60d-SPF extraction method on SRSIO-FCM, SLFCM, KSRSIO-FCM, and KSLFCM clustering algorithms achieve significantly better results than the proposed 6d-SCPSF and existing feature extraction approaches.

著录项

获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号