Computational Biology: Feature Selection, Clustering, and Classification
Cheat Sheet: Computational Biology Final Exam Revision
FEATURE SELECTION
What is feature selection?
Reducing the feature space by removing some of the (non-relevant) features. Also known as:
- variable selection
- feature reduction
- attribute selection
- variable subset selection
Feature selection methods aim to eliminate noisy, redundant, or irrelevant features that may degrade classification performance. In big-data mining and machine learning applications, feature selection plays an important role in disease classification.
Why use feature selection?
- It is cheaper to measure fewer variables.
- Prediction accuracy may improve.
- It mitigates the curse of dimensionality.
Feature Selection Methods
1. FILTER
Based on an evaluation criterion for quantifying how well feature (subsets) discriminate the classes. The selection is based on data-related measures, such as separability or crowding.
2. WRAPPER
A wrapper method uses a machine learning algorithm and takes its performance as the evaluation criterion. It searches for the feature subset best suited to that algorithm, aiming to improve mining performance. Feature subsets are evaluated by predictive accuracy for classification tasks and by cluster goodness for clustering tasks.
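The filter/wrapper contrast above can be sketched in Python with scikit-learn (a sketch on synthetic data, not the Weka implementation; the dataset here is illustrative):

```python
# Filter vs. wrapper feature selection on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       SequentialFeatureSelector)
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# FILTER: rank features by a data-only criterion (mutual information);
# no learning algorithm is involved in scoring.
filt = SelectKBest(mutual_info_classif, k=3).fit(X, y)

# WRAPPER: greedily grow a feature subset, scoring each candidate subset
# by the cross-validated accuracy of an actual classifier.
wrap = SequentialFeatureSelector(GaussianNB(), n_features_to_select=3,
                                 cv=3).fit(X, y)

print("filter picks: ", sorted(filt.get_support(indices=True)))
print("wrapper picks:", sorted(wrap.get_support(indices=True)))
```

The two methods may pick different subsets: the filter scores each feature against the labels alone, while the wrapper scores whole subsets through the classifier.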
DATASET
Figure 1: Prostate dataset
IMPLEMENTATION
Figure 2: Result of ranked attributes
1. Open WEKA Explorer, choose the Preprocess tab, and open the dataset file Prostate_5.arff (available in the computer folder).
2. Choose “Select attributes”.
3. Select the ‘InfoGainAttributeEval’ as Attribute Evaluator and Ranker as Search Method.
4. Click start button to run the feature selection.
5. The ranked attributes are displayed; choose how many to keep, for example the top 5, 10, 20, or 100.
6. Return to the Preprocess tab, keep the top 5 attributes, and remove the rest.
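The ranking-and-keep-top-5 steps above can be sketched in Python (a rough equivalent, not the Weka internals: `mutual_info_classif` stands in for InfoGainAttributeEval, and a synthetic dataset stands in for the Prostate data):

```python
# Rank attributes by an information-gain-style score and keep the top 5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)  # "Ranker" step
top5 = np.argsort(scores)[::-1][:5]                 # top 5 ranked attributes
X_reduced = X[:, top5]                              # "remove others" step

print("kept columns:", sorted(top5.tolist()))
print("reduced shape:", X_reduced.shape)
```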
Figure 3: Result of Naive Bayes with feature selection
7. In the Classify tab, select one of the five classification algorithms and run it to obtain the result.
Figure 4: Result of Naive Bayes without feature selection
Based on figure 4, if the dataset is classified without feature selection, the accuracy is lower than with feature selection. Naive Bayes was chosen to measure the accuracy of the dataset after feature selection was applied. The accuracy may vary depending on which features were selected.
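The figure 3 / figure 4 comparison can be sketched as follows (a sketch on synthetic data with deliberately many noise features, so selection has something to remove; the Prostate results are not reproduced):

```python
# Naive Bayes accuracy with and without feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=5, random_state=0)

# Without feature selection: all 100 attributes.
acc_all = cross_val_score(GaussianNB(), X, y, cv=10).mean()

# With feature selection: keep only the top 5 attributes (selection is
# placed inside the pipeline so it is refit within each CV fold).
acc_sel = cross_val_score(
    make_pipeline(SelectKBest(mutual_info_classif, k=5), GaussianNB()),
    X, y, cv=10).mean()

print(f"all features  : {acc_all:.3f}")
print(f"top-5 features: {acc_sel:.3f}")
```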
CLUSTERING
What is Clustering
A definition of clustering could be “the process of organising objects into groups whose members are similar in some way”.
Clustering is crucial because it discovers the intrinsic grouping in unlabeled data. Every clustering algorithm makes some assumption about what makes data points similar, and different assumptions can produce different but equally valid clusterings.
Why we need Clustering
Given a dataset you know nothing about, a clustering algorithm can discover groups of objects whose members are, on average, closer to each other than to members of other clusters.
What is the difference between Clustering and Classification
- Classification is the result of supervised learning, which means there is a known label that you want the system to generate. For example, a fruit classifier would say "this is an orange, this is an apple" after you show it labeled examples of apples and oranges.
- Clustering is the result of unsupervised learning which means that you’ve seen lots of examples, but don’t have labels.
There are two types of clustering algorithm:
- Partition algorithm
- Hierarchical algorithm
We will use K-means, a partition algorithm, for this example.
1. Open WEKA Explorer, choose the Preprocess tab, and open the dataset file Prostate_5.arff (available in the computer folder).
2. Choose “cluster” tab.
3. Select the ‘SimpleKMeans’ as clusterer.
4. Click start button to run the clustering.
Figure 5: Result of K-Means from clustering algorithm
Based on figure 5, the within-cluster sum of squared errors for this run is 493.13. Using Euclidean distance as the distance function, increasing the number of clusters lowers the within-cluster sum of squared errors, and a lower value indicates a tighter, better clustering. Using Manhattan distance leads to the same conclusion: a higher number of clusters gives a lower within-cluster sum of squared errors.
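The observation above can be sketched in Python, where KMeans exposes the within-cluster sum of squared errors as `inertia_` (a sketch on synthetic blob data; the 493.13 from the Weka run is not reproduced):

```python
# Within-cluster SSE decreases as the number of clusters k grows.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squared errors for the fit.
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in (2, 4, 8)}
print(sse)  # SSE shrinks here as k increases
```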
CLASSIFICATION
Classification is the result of supervised learning which means that there is a known label that you want the system to generate. It assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks.
There are five popular types of classification algorithms:
- Logistic Regression
- Naive Bayes
- Decision Tree
- k-Nearest Neighbors
- Support Vector Machines
These are 5 algorithms that you can try on your classification problem as a starting point.
A standard machine learning classification problem will be used to demonstrate each algorithm: the Ionosphere binary classification problem. This is a good dataset for demonstrating classification algorithms because the input variables are numeric and all on the same scale, and the problem has only two classes to discriminate. For this example, we will go through the Support Vector Machines (SVM) classification algorithm.
Figure 6: Result of SVM classification algorithm
1. Open WEKA Explorer, choose the Preprocess tab, and open the dataset file Prostate_5.arff (available in the computer folder).
2. Choose “classify” tab.
3. Select the ‘SMO’ as classifier.
4. Click start button to run the classification.
The C parameter, called the complexity parameter in Weka, controls how flexible the process of drawing the line that separates the classes can be: larger values of C penalize margin violations more heavily, and the default is 1. A key parameter in SVM is the type of kernel to use. The simplest is the linear kernel, which separates the data with a straight line or hyperplane. The default in Weka is a polynomial kernel, which separates the classes with a curved or wiggly line; the higher the exponent of the polynomial, the more wiggly the boundary. A popular and powerful choice is the RBF (Radial Basis Function) kernel, which can learn closed polygons and complex shapes to separate the classes.
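The kernel choices above can be compared in Python with scikit-learn's `SVC` (a sketch: `SVC` stands in for Weka's SMO, and a synthetic two-circles dataset, which no straight line can separate, stands in for the real data):

```python
# Compare linear, polynomial, and RBF kernels on data that is not
# linearly separable (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

accs = {}
for kernel in ("linear", "poly", "rbf"):
    accs[kernel] = cross_val_score(SVC(kernel=kernel, C=1.0),
                                   X, y, cv=5).mean()
    print(f"{kernel:6s} accuracy: {accs[kernel]:.3f}")
```

On this data the linear kernel hovers near chance while the RBF kernel separates the rings, illustrating why the kernel choice matters.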
CROSS VALIDATION & CONFUSION MATRIX
The confusion matrix shows the numbers of false positives and false negatives. In this matrix, produced by the cross-validation evaluation, there are 3 false positives and 5 false negatives: 3 instances of class "a" were assigned to class "b", and 5 instances of class "b" were assigned to class "a".
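Reading a confusion matrix from a cross-validated run can be sketched as follows (a sketch on synthetic data; the 3/5 counts from the text are not reproduced):

```python
# Build a confusion matrix from 10-fold cross-validated predictions;
# the off-diagonal cells are the false positives / false negatives.
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each instance is predicted by the fold in which it was held out.
y_pred = cross_val_predict(GaussianNB(), X, y, cv=10)
cm = confusion_matrix(y, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"false positives: {fp}, false negatives: {fn}")
```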
