Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant feature from the sparse feature space. Thus, this paper proposed an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high dimensional text classificationhis technique has the ability to measure the feature’s importance in a high-dimensional text document. In addition, it aims to increase the efficiency of the feature selection. Hence, obtaining a promising text classification accuracy. TF-IDF act as a filter approach which measures features importance of the text documents at the first stage. SVM-RFE utilized a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research executes sets of experiments using a text document retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing processes are applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-rank features will be selected for text classification using the SVM classifier. Based on the experiments, it shows that the proposed technique able to achieve 98% accuracy that outperformed other existing techniques. In conclusion, the proposed technique able to select the significant features in the unstructured and high dimensional text document.
A substantial portion of today’s multimedia data exists in the form of unstructured text. However, the unstructured nature of text poses a significant task in meeting users’ information requirements. Text classification (TC) has been extensively employed in text mining to facilitate multimedia data processing. However, accurately categorizing texts becomes challenging due to the increasing presence of non-informative features within the corpus. Several reviews on TC, encompassing various feature selection (FS) approaches to eliminate non-informative features, have been previously published. However, these reviews do not adequately cover the recently explored approaches to TC problem-solving utilizing FS, such as optimization techniques.
... Show MoreFeature selection (FS) constitutes a series of processes used to decide which relevant features/attributes to include and which irrelevant features to exclude for predictive modeling. It is a crucial task that aids machine learning classifiers in reducing error rates, computation time, overfitting, and improving classification accuracy. It has demonstrated its efficacy in myriads of domains, ranging from its use for text classification (TC), text mining, and image recognition. While there are many traditional FS methods, recent research efforts have been devoted to applying metaheuristic algorithms as FS techniques for the TC task. However, there are few literature reviews concerning TC. Therefore, a comprehensive overview was systematicall
... Show MoreAnalysis of image content is important in the classification of images, identification, retrieval, and recognition processes. The medical image datasets for content-based medical image retrieval ( are large datasets that are limited by high computational costs and poor performance. The aim of the proposed method is to enhance this image retrieval and classification by using a genetic algorithm (GA) to choose the reduced features and dimensionality. This process was created in three stages. In the first stage, two algorithms are applied to extract the important features; the first algorithm is the Contrast Enhancement method and the second is a Discrete Cosine Transform algorithm. In the next stage, we used datasets of the medi
... Show MoreThis paper proposes two hybrid feature subset selection approaches based on the combination (union or intersection) of both supervised and unsupervised filter approaches before using a wrapper, aiming to obtain low-dimensional features with high accuracy and interpretability and low time consumption. Experiments with the proposed hybrid approaches have been conducted on seven high-dimensional feature datasets. The classifiers adopted are support vector machine (SVM), linear discriminant analysis (LDA), and K-nearest neighbour (KNN). Experimental results have demonstrated the advantages and usefulness of the proposed methods in feature subset selection in high-dimensional space in terms of the number of selected features and time spe
... Show MoreText Clustering consists of grouping objects of similar categories. The initial centroids influence operation of the system with the potential to become trapped in local optima. The second issue pertains to the impact of a huge number of features on the determination of optimal initial centroids. The problem of dimensionality may be reduced by feature selection. Therefore, Wind Driven Optimization (WDO) was employed as Feature Selection to reduce the unimportant words from the text. In addition, the current study has integrated a novel clustering optimization technique called the WDO (Wasp Swarm Optimization) to effectively determine the most suitable initial centroids. The result showed the new meta-heuristic which is WDO was employed as t
... Show MoreFeature selection represents one of the critical processes in machine learning (ML). The fundamental aim of the problem of feature selection is to maintain performance accuracy while reducing the dimension of feature selection. Different approaches were created for classifying the datasets. In a range of optimization problems, swarming techniques produced better outcomes. At the same time, hybrid algorithms have gotten a lot of attention recently when it comes to solving optimization problems. As a result, this study provides a thorough assessment of the literature on feature selection problems using hybrid swarm algorithms that have been developed over time (2018-2021). Lastly, when compared with current feature selection procedu
... Show MoreFeatures are the description of the image contents which could be corner, blob or edge. Scale-Invariant Feature Transform (SIFT) extraction and description patent algorithm used widely in computer vision, it is fragmented to four main stages. This paper introduces image feature extraction using SIFT and chooses the most descriptive features among them by blurring image using Gaussian function and implementing Otsu segmentation algorithm on image, then applying Scale-Invariant Feature Transform feature extraction algorithm on segmented portions. On the other hand the SIFT feature extraction algorithm preceded by gray image normalization and binary thresholding as another preprocessing step. SIFT is a strong algorithm and gives more accura
... Show MoreFeature selection algorithms play a big role in machine learning applications. There are several feature selection strategies based on metaheuristic algorithms. In this paper a feature selection strategy based on Modified Artificial Immune System (MAIS) has been proposed. The proposed algorithm exploits the advantages of Artificial Immune System AIS to increase the performance and randomization of features. The experimental results based on NSL-KDD dataset, have showed increasing in performance of accuracy compared with other feature selection algorithms (best first search, correlation and information gain).
In this paper, a new hybridization of supervised principal component analysis (SPCA) and stochastic gradient descent techniques is proposed, and called as SGD-SPCA, for real large datasets that have a small number of samples in high dimensional space. SGD-SPCA is proposed to become an important tool that can be used to diagnose and treat cancer accurately. When we have large datasets that require many parameters, SGD-SPCA is an excellent method, and it can easily update the parameters when a new observation shows up. Two cancer datasets are used, the first is for Leukemia and the second is for small round blue cell tumors. Also, simulation datasets are used to compare principal component analysis (PCA), SPCA, and SGD-SPCA. The results sh
... Show More