Computational Biology Cheat Sheet
Feature Selection (FS)
Concept
- To reduce the feature space of a dataset by removing irrelevant features.
- AKA variable selection or attribute selection
Reason
- Reduce cost, since fewer variables are used
- Improve prediction accuracy
- Prevent the curse of dimensionality
- Example: overfitting
Methods
Wrapper
Concept:
- The performance of a wrapper method depends on the machine learning algorithm used.
- The features selected are best suited to that algorithm.
- The features selected are evaluated by the predictive accuracy of the classifier.
- Examples:
- forward feature selection
- Start with empty set of features (reduced set)
- At each iteration, the best of the remaining features from the original set is added to the reduced set.
- backward feature selection
- Start with the full set of original features.
- Remove the least relevant feature at each iteration.
- Combination of forward selection and backward elimination
- Recursive feature elimination
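The forward and backward selection steps above can be sketched with scikit-learn's SequentialFeatureSelector (assumes scikit-learn >= 0.24; the iris dataset and a KNN classifier are illustrative stand-ins, not from the notes):

```python
# Wrapper-method sketch: forward selection and backward elimination,
# each evaluated through the wrapped classifier's cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from an empty reduced set, add the best feature each round.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward").fit(X, y)

# Backward: start from the full set, drop the least useful feature each round.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of the selected features
print(backward.get_support())
```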
- In WEKA
- Attribute Evaluator (other option): WrapperSubsetEval
- Best First can be backward or forward.
Filter
- Relies on the general characteristics of the data to evaluate and pick a feature subset. (Does not involve any mining algorithm.)
- Criteria (ranking techniques based on statistical scores):
- distance
- information
- dependency
- consistency
- Ranking methods provide simplicity and produce relevant features.
- Independent of machine learning algorithm.
- Parameters:
- Correlation (Pearson’s correlation): Similarity of the information contributed by the features.
- Highly correlated features are redundant.
- Entropy: Amount of information contributed by the features.
- Is the measure of the average information content.
- Lower entropy means higher information contribution. (Calculated by excluding feature f1.)
- Formula: H(X) = −Σ p(x) log2 p(x), summed over all outcomes x
- A threshold value or relevancy check is used to determine the optimality of the features.
- Mostly used for Unsupervised learning.
- Mutual information: Amount of uncertainty in X removed by the knowledge of Y.
- Mostly used in calculating the amount of shared information about the class by a feature.
- Used for dimensionality reduction in Supervised learning.
- High mutual information value = optimal features (able to influence the predictive model toward the right prediction, increase accuracy)
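A minimal sketch of the entropy and mutual-information quantities above, computed for toy discrete variables (the data are illustrative, not from the notes):

```python
# Shannon entropy H(X) = -sum p(x) log2 p(x) and mutual information
# I(X;Y) = H(X) + H(Y) - H(X,Y) for small discrete variables.
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    # Joint entropy over the unique (x, y) pairs.
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    joint = -np.sum(p * np.log2(p))
    return entropy(x) + entropy(y) - joint

x = [0, 0, 1, 1]
y = [0, 0, 1, 1]   # perfectly informative about x
z = [0, 1, 0, 1]   # independent of x

print(entropy(x))                 # 1.0 bit
print(mutual_information(x, y))   # 1.0 -> highly relevant feature
print(mutual_information(x, z))   # 0.0 -> irrelevant feature
```

High mutual information with the class is exactly what makes a feature optimal in the filter sense described above.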
Embedded
- Iterative in the sense that each iteration of the model training process is examined, and the features that contribute the most to the training in that iteration are carefully extracted.
- Common method: Regularization (Penalization method)
- Penalize a feature given a coefficient threshold.
- Introduces additional constraints that bias the optimization of the algorithm.
- Example: LASSO, Elastic Net, Ridge Regression
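A minimal LASSO sketch of embedded selection, assuming synthetic data in which only two of five features are informative (the data and coefficients are illustrative):

```python
# Embedded selection: the L1 penalty drives the coefficients of
# uninformative features to exactly zero, so the nonzero coefficients
# act as the selected feature subset.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The target depends only on features 0 and 2; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of surviving features
print(lasso.coef_)
print(selected)   # expected to keep features 0 and 2
```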
FS on WEKA
- Attribute Evaluator (AE)
- Attribute is evaluated in the context of the output variable (e.g. the class).
- Some AEs require the use of a specific SM
- e.g. CorrelationAttributeEval with RankerSearchMethod
- Search Method (SM)
- Tries or navigates different combinations of attributes in the dataset in order to arrive at a short list of selected features.
FS on Python
Initialization:
Library -> sklearn.feature_selection (SelectKBest, chi2)
To print out the loaded dataset: data.head( )
To retrieve the best features:
k is the number of features to retrieve.
Save the scores of best features into data frame (dfscores).
Save the original features set into data frame (dfcolumns).
Save the selected best features with their respective scores into featureScores.
Display the selected features with scores.
To find the n best features, use variable.nlargest(n, ‘Score’).
The selected features can also be illustrated using features importance.
A graph can be plotted for the features.
To get the correlation of the features: data.corr()
Display the correlation heatmap: sns.heatmap()
Library -> seaborn
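The SelectKBest steps above, consolidated into one runnable sketch (the breast-cancer dataset is a stand-in for the course data, and the variable names follow the descriptions above):

```python
# Filter-style feature selection with SelectKBest + chi2, then the
# score table and correlation matrix described in the steps above.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

raw = load_breast_cancer()
data = pd.DataFrame(raw.data, columns=raw.feature_names)
labels = raw.target
print(data.head())                                  # inspect the loaded dataset

best = SelectKBest(score_func=chi2, k=10).fit(data, labels)  # k features to keep
dfscores = pd.DataFrame(best.scores_)               # chi2 score per feature
dfcolumns = pd.DataFrame(data.columns)              # original feature names
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ["Feature", "Score"]
print(featureScores.nlargest(10, "Score"))          # n best features by score

corr = data.corr()                                  # feature correlation matrix
# import seaborn as sns; sns.heatmap(corr)          # optional heatmap
```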
Clustering
Concept
- A type of unsupervised learning (no label data).
- Principle: Detect patterns.
- Basic Idea: Group together similar instances. (2D point patterns)
- Crucially dependent on the measure of similarity or distance between the points to be clustered
- Soft clustering -> Grouping the data items such that an object can exist in multiple clusters.
- Hard clustering -> Grouping the data items such that each piece is only assigned to one cluster.
Algorithm
- Partition algorithms (Flat)
- K-means (iterative)
- pick K random points as cluster center
- assign data points to closest cluster center
- change the cluster center to the average of its assigned points
- stop when no point assignments change
- Running time:
- assign data points to closest cluster center = O(KN) time
- change cluster center to the average of its assigned points = O(N) time
- Mixture of Gaussian
- Spectral Clustering
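The K-means steps above as a minimal NumPy sketch on toy 2-D data (two synthetic blobs; K = 2):

```python
# K-means from scratch, following the iterative steps listed above.
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated blobs of 2-D points.
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])

K = 2
# Pick K random data points as initial cluster centers.
centers = points[rng.choice(len(points), K, replace=False)]
while True:
    # Assign each point to its closest center: O(KN) distance computations.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each center to the mean of its assigned points: O(N).
    new_centers = np.array([points[labels == k].mean(axis=0)
                            if (labels == k).any() else centers[k]
                            for k in range(K)])
    # Stop when the centers (hence the assignments) no longer change.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(np.round(centers, 1))
```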
- Hierarchical algorithms
- Bottom up-agglomerative
- Produces not one clustering, but a family of clusterings represented by a dendrogram.
- Basic idea:
- First merge very similar instances.
- Incrementally build larger clusters out of smaller clusters.
- Stop when only one cluster is left.
- To define closest for cluster:
- Closest pair (single-link clustering)
- Farthest pair (complete-link clustering)
- Average of all pairs
- Top-down-divisive
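A bottom-up (agglomerative) sketch with SciPy on toy data; the method argument ("single", "complete", "average") corresponds to the closest-pair, farthest-pair and average-of-all-pairs definitions above:

```python
# Agglomerative clustering: linkage() builds the full dendrogram
# (a family of clusterings), fcluster() cuts it into a flat clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (10, 2)),
                    rng.normal(4, 0.3, (10, 2))])

Z = linkage(points, method="average")            # merge most similar clusters first
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```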
Performance Measurement
Good clustering algorithm:
- High intra-cluster similarity (similar inside the cluster)
- Low inter-cluster similarity (clusters are distinct from each other)
Silhouette Analysis
To study the separation distance between the resulting clusters.
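A minimal silhouette-analysis sketch with scikit-learn (toy blobs; scores near +1 indicate well-separated clusters, scores near 0 indicate points on a cluster boundary):

```python
# Silhouette score as a clustering quality measure: high intra-cluster
# similarity and low inter-cluster similarity push the score toward +1.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (30, 2)),
                    rng.normal(4, 0.3, (30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(silhouette_score(points, labels))   # high for well-separated blobs
```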
Example
- Image segmentation
- Clustering gene expression data
- Image pattern recognition
- Cloud computing environment
Clustering on WEKA
Attribute in Clustering Tab:
Clusterer: K-means (SimpleKMeans) or others
Cluster Mode: Can adjust the classes that need to be clustered. (Alternative: choose the training dataset when there are no class labels.)
Can visualize the result by right-clicking on the result list at the bottom left.
Classification
Concept
- Determining a (categorical) class (or label) for an element in a dataset.
- Data is grouped into categories based on a training dataset.
- Basic terminology:
- Classifier: An algorithm that maps input to specific category.
- Classification model: To draw conclusion from input values given for training.
- Feature
- Binary classification: Involve only 2 classes.
- Multiclass classification: Involve more than 2 classes.
- Multi-label classification: Each sample is mapped to a set of target labels (more than one class).
Types and Example
Type
- Linear classifier (Logistic regression, Naive Bayes, Fisher’s linear discriminant)
- Logistic Regression
- Estimates discrete (binary) values based on a given set of independent variables.
- Predict the probability (0-1) of occurrence of an event by fitting data to a logit function.
- Tuning method:
- Interaction terms
- Remove features
- Regularization techniques
- Non-linear model
- Naive Bayes
- Based on Bayes’ Theorem (assumption of independence between predictors).
- Able to calculate posterior probability.
- Support Vector Machine (Least square)
- Supervised learning
- For classification and regression
- Objective: To find an optimal hyperplane in an N-dimensional space that distinctly separates the data points.
- Quadratic classifier
- Kernel estimation (k-nearest neighbor/KNN)
- KNN
- For classification and regression
- Classify new case depends on the vote of its k neighbors.
- Similarity to a class is measured by a distance function (e.g. Hamming distance for categorical variables).
- Computationally expensive.
- Variables need to be normalized and pre-processed.
- Decision Tree (Random Forest)
- Supervised learning
- Handles both categorical and continuous dependent variables with versatile features.
- Basic idea: Splits the population into 2 or more homogeneous sets based on the most significant attributes, making the groups as distinct as possible.
- Neural Networks
- Learning Vector Quantization
Example
- Cancer tumor cell identification
- Sentiment analysis
- Facial keypoints detection
Classification on WEKA
Launch WEKA and turn to “Classify” Tab
Attribute: Classifier
**Right-clicking on the chosen classifier allows it to be tuned; the tuning interface differs according to the classifier chosen.
List of classifier
Test options: Choose supplied test set from the pre-defined folder.
Start the classification process.
The accuracy and confusion matrix can be obtained from the results provided by WEKA.
Classification on Python (NB)
Step 1
Import Library: import sklearn
Step 2
Load dataset: from sklearn.datasets import load_breast_cancer
To view the organized dataset.
Step 3
To split data into sets (Training and Testing).
from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=42)
Test size is set to 33% of the original volume.
Step 4
Import Library of Machine learning model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
model = gnb.fit(train, train_labels)
To make predictions on the test set using the trained model: gnb.predict(test)
Step 5
Evaluate the accuracy of the model:
from sklearn.metrics import accuracy_score
accuracy_score(test_labels, prediction)
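Steps 1–5 combined into one runnable script (the variable written "qnb" in the snippets above appears here as gnb):

```python
# Gaussian Naive Bayes on the breast-cancer dataset, following steps 1-5.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
features, labels = data.data, data.target

# Hold out 33% of the data for testing.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

gnb = GaussianNB()
model = gnb.fit(train, train_labels)       # train the model
prediction = gnb.predict(test)             # predict on the test set

print(accuracy_score(test_labels, prediction))   # ~0.94 on this split
```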
Principal Component Analysis PCA
Concept
- It is for dimensionality reduction
- It is an algorithm used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information.
(PCs = principal components; computed from the covariance matrix of the data)
- Used for exploratory data analysis and de-noising of signals in stock market trading, and for analysis of genome data and gene expression levels in bioinformatics.
- Help to identify patterns in data based on the correlation between features.
- Highly sensitive to data scaling.
Dimensionality Reduction
- To reduce the complexity of a model and avoid overfitting.
- Feature selection and Feature extraction
- To improve storage space, the computational efficiency of the learning algorithm, and predictive performance.
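A minimal PCA sketch; standardizing first matters because, as noted above, PCA is highly sensitive to data scaling (the iris dataset is an illustrative stand-in):

```python
# PCA for dimensionality reduction: standardize, then project the data
# onto the top principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

pca = PCA(n_components=2)                   # keep the top 2 PCs
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_)        # share of variance kept per PC
```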
Cross Validation
Concept
- To evaluate a machine learning model by training it on a subset of the available data and then evaluating it on the remaining data.
Types
- Holdout method
- Dataset divided into 2 sets (training and testing).
- Training set: Use to train and fit the model.
- Testing set: Use as input for model prediction.
- Performance measurement: Mean absolute test error.
- K-fold Cross Validation Method
- Dataset divided into k subsets (ideally k = 5 or 10)
- A higher value of k leads to a less biased model.
- Train the model using k-1 folds and validate and test the model on the remaining kth fold.
- The process is repeated until every fold has served as the test set.
- Advantage: Variance of the resulting estimate is reduced as k is increased.
- Disadvantage: The training algorithm has to be rerun from scratch k times (take more computation time).
- Leave-one-out Cross Validation
- Is k-fold cross validation taken to its logical extreme, with k equal to N (the number of data points in the set)
- The evaluation given is leave-one-out cross validation error (LOO-XVE)
- Drawback: Expensive to compute.
Performance measures
- Classification accuracy
- Is the ratio of number of correct predictions to the total number of input samples.
- Logarithmic loss
- Penalize the false classifications
- Work well for multiclass classification
- No upper bound => [0, ∞)
- near 0 = high accuracy
- Confusion matrix
- True positive (TP) = Predict: Yes, Actual: Yes
- True negative (TN) = Predict: No, Actual: No
- False positive (FP) = Predict: Yes, Actual: No
- False negative (FN) = Predict: No, Actual: Yes
- Area under curve (AUC)
- Used for binary classification problem
- Probability of a classifier that will rank a randomly chosen positive example higher than a randomly chosen negative example.
- True Positive rate (Sensitivity): TP/(TP+FN)
- False Positive rate (1 − Specificity): FP/(FP+TN)
- F1 score
- Harmonic mean (HM) of precision and recall.
- The greater the F1 score, the better the performance of the model.
- Precision: The number of correct positive results divided by the number of positive results predicted by the classifier.
- Recall: The number of correct positive results divided by the number of all samples that should have been identified.
- Mean absolute error
- The average of the difference between the original values and the predicted values.
- No direction of the error.
- Minimizing it requires complicated linear programming tools.
- Mean squared error (MSE)
- Similar to mean absolute error.
- Takes average of the square of the difference between the original values and the predicted values.
- Easy to compute.
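A sketch of several of the measures above on a tiny set of hand-made predictions (the labels are illustrative):

```python
# Classification and regression performance measures from scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score,
                             mean_absolute_error, mean_squared_error)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))     # correct predictions / total = 0.75
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                     # 3 3 1 1
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN), i.e. sensitivity
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall

# Regression-style errors on continuous predictions:
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))   # 0.5
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))    # 0.25
```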
Cross Validation on WEKA
Cross validation is in the box of “Test options”.
Inside the box, a few options are available:
- Training Dataset
- Use when all the data are available.
- No predictions are needed.
- Create a descriptive model.
- Supplied Test Set
- Use when data is very large.
- Useful when test set is defined by a third party.
- Percentage Split
- Use to get a quick idea of the performance of a model.
- Common split is 66%(train set) to 34%(test set).
- Cross Validation
- Use when unsure.
- Provide more accurate estimate of the performance than others.
- Not suitable for large data.
- Common k = 5 or 10
Cross Validation on Python
Train and Test set
Test size = 33%
Library: pandas, sklearn(model_selection), sklearn.linear_model(LogisticRegression)
k-fold Cross Validation
k = 10
Library: pandas, sklearn(model_selection), sklearn.linear_model(LogisticRegression)
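A sketch of the k-fold setup described above (k = 10, LogisticRegression; the breast-cancer dataset stands in for the course data):

```python
# 10-fold cross validation with cross_val_score: one accuracy per fold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)   # high max_iter so lbfgs converges

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold)
print(scores.mean(), scores.std())          # mean accuracy and its spread
```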
Leave-one-out Cross Validation
Library: pandas, sklearn(model_selection), sklearn.linear_model(LogisticRegression)
Repeated Random Test-Train Splits
This method has the speed of a train-test split and the reduced variance in the estimated performance of k-fold cross validation.
Drawback: May cause redundancy in evaluation.
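Repeated random test-train splits can be sketched with scikit-learn's ShuffleSplit (10 independent 67/33 splits, each scored separately; the same stand-in dataset as above):

```python
# Repeated random test-train splits: because splits are drawn independently,
# the same samples may appear in several test sets (the redundancy drawback).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

splits = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)
scores = cross_val_score(model, X, y, cv=splits)
print(scores.mean(), scores.std())
```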
