Computational Biology Cheat Sheet
Feature Selection (FS)
Concept
- To reduce the feature space of a dataset by removing irrelevant features.
- AKA variable selection or attribute selection
Reason
- Reduce cost, since fewer variables are used
- Improve prediction accuracy
- Prevent the curse of dimensionality
- Example: overfitting
Methods
Wrapper
Concept:
- The performance of a wrapper method depends on the machine learning algorithm used.
- The features selected are best suited to that algorithm.
- The features selected are evaluated by the predictive accuracy of the classifier.
- Examples:
- forward feature selection
- Start with empty set of features (reduced set)
- At each iteration, the best of the remaining features from the original set is added to the reduced set.
- backward feature selection
- Start with the full set of original features.
- Remove the least relevant feature at each iteration.
- Combination of forward selection and backward elimination
- Recursive feature elimination
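The forward and backward selection steps above can be sketched with scikit-learn's SequentialFeatureSelector (assumes scikit-learn >= 0.24; the iris dataset and a KNN classifier are illustrative stand-ins, not from the notes):

```python
# Wrapper-method sketch: forward selection and backward elimination,
# each evaluated through the wrapped classifier's cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from an empty reduced set, add the best feature each round.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward").fit(X, y)

# Backward: start from the full set, drop the least useful feature each round.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of the selected features
print(backward.get_support())
```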
- In WEKA
- Attribute Evaluator (other option): WrapperSubsetEval
- Best First can be backward or forward.
Filter
- Relies on the general characteristics of the data to evaluate and pick a feature subset. (Does not involve any mining algorithm.)
- Criteria (ranking techniques based on statistical scores):
- distance
- information
- dependency
- consistency
- Ranking methods provide simplicity and produce relevant features.
- Independent of machine learning algorithm.
- Parameters:
- Correlation (Pearson’s correlation): Similarity of the information contributed by the features.
- Highly correlated features are redundant.
- Entropy: Amount of information contributed by the features.
- Is the measure of the average information content.
- Lower entropy means higher information contribution. (Calculated by excluding feature f1.)
- Formula: H(X) = −Σ p(x) log2 p(x), summed over all outcomes x
- A threshold value or relevancy check is used to determine the optimality of the features.
- Mostly used for Unsupervised learning.
- Mutual information: Amount of uncertainty in X removed by the knowledge of Y.
- Mostly used in calculating the amount of shared information about the class by a feature.
- Used for dimensionality reduction in Supervised learning.
- High mutual information value = optimal features (able to influence the predictive model toward the right prediction, increase accuracy)
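A minimal sketch of the entropy and mutual-information quantities above, computed for toy discrete variables (the data are illustrative, not from the notes):

```python
# Shannon entropy H(X) = -sum p(x) log2 p(x) and mutual information
# I(X;Y) = H(X) + H(Y) - H(X,Y) for small discrete variables.
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    # Joint entropy over the unique (x, y) pairs.
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    joint = -np.sum(p * np.log2(p))
    return entropy(x) + entropy(y) - joint

x = [0, 0, 1, 1]
y = [0, 0, 1, 1]   # perfectly informative about x
z = [0, 1, 0, 1]   # independent of x

print(entropy(x))                 # 1.0 bit
print(mutual_information(x, y))   # 1.0 -> highly relevant feature
print(mutual_information(x, z))   # 0.0 -> irrelevant feature
```

High mutual information with the class is exactly what makes a feature optimal in the filter sense described above.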
Embedded
- Iterative in the sense that each iteration of the model training process is examined, and the features that contribute the most to the training in that iteration are carefully extracted.
- Common method: Regularization (Penalization method)
- Penalize a feature given a coefficient threshold.
- Introduces additional constraints that bias the optimization of the algorithm.
- Example: LASSO, Elastic Net, Ridge Regression
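A minimal LASSO sketch of embedded selection, assuming synthetic data in which only two of five features are informative (the data and coefficients are illustrative):

```python
# Embedded selection: the L1 penalty drives the coefficients of
# uninformative features to exactly zero, so the nonzero coefficients
# act as the selected feature subset.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# The target depends only on features 0 and 2; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of surviving features
print(lasso.coef_)
print(selected)   # expected to keep features 0 and 2
```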
FS on WEKA
- Attribute Evaluator (AE)
- Attribute is evaluated in the context of the output variable (e.g. the class).
- Some AEs require the use of a specific SM
- e.g. CorrelationAttributeEval with RankerSearchMethod
- Search Method (SM)
- Tries or navigates different combinations of attributes in the dataset in order to arrive at a short list of selected features.
FS on Python
Initialization:
Library -> sklearn.feature_selection (SelectKBest, chi2)
To print out the loaded dataset: data.head( )
To retrieve the best features:
k is the number of features to retrieve.
Save the scores of best features into data frame (dfscores).
Save the original features set into data frame (dfcolumns).
Save the selected best features with their respective scores into featureScores.
Display the selected features with scores.
To find the n best features, use variable.nlargest(n, ‘Score’).
The selected features can also be illustrated using features importance.
A graph can be plotted for the features.
To get the correlation of the features: data.corr()
Display the correlation heatmap: sns.heatmap()
Library -> seaborn
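The SelectKBest steps above, consolidated into one runnable sketch (the breast-cancer dataset is a stand-in for the course data, and the variable names follow the descriptions above):

```python
# Filter-style feature selection with SelectKBest + chi2, then the
# score table and correlation matrix described in the steps above.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

raw = load_breast_cancer()
data = pd.DataFrame(raw.data, columns=raw.feature_names)
labels = raw.target
print(data.head())                                  # inspect the loaded dataset

best = SelectKBest(score_func=chi2, k=10).fit(data, labels)  # k features to keep
dfscores = pd.DataFrame(best.scores_)               # chi2 score per feature
dfcolumns = pd.DataFrame(data.columns)              # original feature names
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ["Feature", "Score"]
print(featureScores.nlargest(10, "Score"))          # n best features by score

corr = data.corr()                                  # feature correlation matrix
# import seaborn as sns; sns.heatmap(corr)          # optional heatmap
```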
Clustering
Concept
- A type of unsupervised learning (no label data).
- Principle: Detect patterns.
- Basic Idea: Group together similar instances. (2D point patterns)
- Crucially dependent on the measure of similarity or distance between the points to be clustered
- Soft clustering -> Grouping the data items such that an object can exist in multiple clusters.
- Hard clustering -> Grouping the data items such that each piece is only assigned to one cluster.
Algorithm
- Partition algorithms (Flat)
- K-means (iterative)
- pick K random points as cluster center
- assign data points to closest cluster center
- change the cluster center to the average of its assigned points
- stop when no point assignments change
- Running time:
- assign data points to closest cluster center = O(KN) time
- change cluster center to the average of its assigned points = O(N) time
- Mixture of Gaussian
- Spectral Clustering
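The K-means steps above as a minimal NumPy sketch on toy 2-D data (two synthetic blobs; K = 2):

```python
# K-means from scratch, following the iterative steps listed above.
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated blobs of 2-D points.
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])

K = 2
# Pick K random data points as initial cluster centers.
centers = points[rng.choice(len(points), K, replace=False)]
while True:
    # Assign each point to its closest center: O(KN) distance computations.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each center to the mean of its assigned points: O(N).
    new_centers = np.array([points[labels == k].mean(axis=0)
                            if (labels == k).any() else centers[k]
                            for k in range(K)])
    # Stop when the centers (hence the assignments) no longer change.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(np.round(centers, 1))
```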
- Hierarchical algorithms
- Bottom up-agglomerative
- Produces not one clustering, but a family of clusterings represented by a dendrogram.
- Basic idea:
- First merge very similar instances.
- Incrementally build larger clusters out of smaller clusters.
- Stop when only one cluster is left.
- To define closest for cluster:
- Closest pair (single-link clustering)
- Farthest pair (complete-link clustering)
- Average of all pairs
- Top-down-divisive
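A bottom-up (agglomerative) sketch with SciPy on toy data; the method argument ("single", "complete", "average") corresponds to the closest-pair, farthest-pair and average-of-all-pairs definitions above:

```python
# Agglomerative clustering: linkage() builds the full dendrogram
# (a family of clusterings), fcluster() cuts it into a flat clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (10, 2)),
                    rng.normal(4, 0.3, (10, 2))])

Z = linkage(points, method="average")            # merge most similar clusters first
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
```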
Performance Measurement
Good clustering algorithm:
- High intra-cluster similarity (similar inside the cluster)
- Low inter-cluster similarity (clusters are distinct from each other)
Silhouette Analysis
To study the separation distance between the resulting clusters.
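A minimal silhouette-analysis sketch with scikit-learn (toy blobs; scores near +1 indicate well-separated clusters, scores near 0 indicate points on a cluster boundary):

```python
# Silhouette score as a clustering quality measure: high intra-cluster
# similarity and low inter-cluster similarity push the score toward +1.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (30, 2)),
                    rng.normal(4, 0.3, (30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(silhouette_score(points, labels))   # high for well-separated blobs
```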
Example
- Image segmentation
- Clustering gene expression data
- Image pattern recognition
- Cloud computing environment
Clustering on WEKA
Attribute in Clustering Tab:
Clusterer: K-means (SimpleKMeans) or others
Cluster Mode: Can adjust the classes that need to be clustered. (Alternative: choose the training dataset when there are no class labels.)
Can visualize the result by right-clicking on the result list at the bottom left.
Classification
Concept
- Determining a (categorical) class (or label) for an element in a dataset.
- Data is grouped into categories based on a training dataset.
- Basic terminology:
- Classifier: An algorithm that maps input to specific category.
- Classification model: To draw conclusion from input values given for training.
- Feature
- Binary classification: Involve only 2 classes.
- Multiclass classification: Involve more than 2 classes.
- Multi-label classification: Each sample is mapped to a set of target labels (more than one class).
Types and Example
Type
- Linear classifier (Logistic regression, Naive Bayes, Fisher’s linear discriminant)
- Logistic Regression
- Estimates discrete (binary) values based on a given set of independent variables.
- Predict the probability (0-1) of occurrence of an event by fitting data to a logit function.
- Tuning method:
- Interaction terms
- Remove features
- Regularization techniques
- Non-linear model
- Naive Bayes
- Based on Bayes’ Theorem (assumption of independence between predictors).
- Able to calculate posterior probability.
- Support Vector Machine (Least square)
- Supervised learning
- For classification and regression
- Objective: To find an optimal hyperplane in an N-dimensional space that distinctly separates the data points.
- Quadratic classifier
- Kernel estimation (k-nearest neighbor/KNN)
- KNN
- For classification and regression
- Classify new case depends on the vote of its k neighbors.
- Similarity to a class is measured by a distance function (e.g. Hamming distance for categorical variables).
- Computationally expensive.
- Variables need to be normalized and pre-processed.
- Decision Tree (Random Forest)
- Supervised learning
- Handles both categorical and continuous dependent variables with versatile features.
- Basic idea: Splits the population into 2 or more homogeneous sets based on the most significant attributes, making the groups as distinct as possible.
- Neural Networks
- Learning Vector Quantization
Example
- Cancer tumor cell identification
- Sentiment analysis
- Facial keypoints detection
Classification on WEKA
Launch WEKA and turn to “Classify” Tab
Attribute: Classifier
**Right-clicking on the chosen classifier allows it to be tuned; the tuning interface differs according to the classifier chosen.
List of classifier
Test options: Choose supplied test set from the pre-defined folder.
Start the classification process.
The accuracy and confusion matrix can be obtained from the results provided by WEKA.
Classification on Python (NB)
Step 1
Import Library: import sklearn
Step 2
Load dataset: from sklearn.datasets import load_breast_cancer
To view the organized dataset.
Step 3
To split data into sets (Training and Testing).
from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=42)
Test size is set to 33% of the original volume.
Step 4
Import Library of Machine learning model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
model = gnb.fit(train, train_labels)
To make predictions on the test set using the trained model: gnb.predict(test)
Step 5
Evaluate the accuracy of the model:
from sklearn.metrics import accuracy_score
accuracy_score(test_labels, prediction)
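Steps 1–5 combined into one runnable script (the variable written "qnb" in the snippets above appears here as gnb):

```python
# Gaussian Naive Bayes on the breast-cancer dataset, following steps 1-5.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
features, labels = data.data, data.target

# Hold out 33% of the data for testing.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

gnb = GaussianNB()
model = gnb.fit(train, train_labels)       # train the model
prediction = gnb.predict(test)             # predict on the test set

print(accuracy_score(test_labels, prediction))   # ~0.94 on this split
```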
Principal Component Analysis PCA
Concept
- It is for dimensionality reduction
- It is an algorithm used to compress a dataset onto a lower-dimensional feature subspace with the goal of maintaining most of the relevant information.
(PCs = principal components; computed from the covariance matrix of the data)
- Used for exploratory data analysis and de-noising of signals in stock market trading, and for analysis of genome data and gene expression levels in bioinformatics.
- Help to identify patterns in data based on the correlation between features.
- Highly sensitive to data scaling.
Dimensionality Reduction
- To reduce the complexity of a model and avoid overfitting.
- Feature selection and Feature extraction
- To improve storage space, the computational efficiency of the learning algorithm, and predictive performance.
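A minimal PCA sketch; standardizing first matters because, as noted above, PCA is highly sensitive to data scaling (the iris dataset is an illustrative stand-in):

```python
# PCA for dimensionality reduction: standardize, then project the data
# onto the top principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

pca = PCA(n_components=2)                   # keep the top 2 PCs
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_)        # share of variance kept per PC
```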
Cross Validation
Concept
- To evaluate a machine learning model by training it on a subset of the available data and then evaluating it on the remaining data.
Types
- Holdout method
- Dataset divided into 2 sets (training and testing).
- Training set: Use to train and fit the model.
- Testing set: Use as input for model prediction.
- Performance measurement: Mean absolute test error.
- K-fold Cross Validation Method
- Dataset divided into k subsets (ideally k = 5 or 10)
- A higher value of k leads to a less biased model.
- Train the model using k-1 folds and validate and test the model on the remaining kth fold.
- The process is repeated until every fold has served as the test set.
- Advantage: Variance of the resulting estimate is reduced as k is increased.
- Disadvantage: The training algorithm has to be rerun from scratch k times (take more computation time).
- Leave-one-out Cross Validation
- Is k-fold cross validation taken to its logical extreme, with k equal to N (the number of data points in the set)
- The evaluation given is leave-one-out cross validation error (LOO-XVE)
- Drawback: Expensive to compute.
Performance measures
- Classification accuracy
- Is the ratio of number of correct predictions to the total number of input samples.
- Logarithmic loss
- Penalize the false classifications
- Work well for multiclass classification
- No upper bound => [0, ∞)
- near 0 = high accuracy
- Confusion matrix
- True positive (TP) = Predict: Yes, Actual: Yes
- True negative (TN) = Predict: No, Actual: No
- False positive (FP) = Predict: Yes, Actual: No
- False negative (FN) = Predict: No, Actual: Yes
- Area under curve (AUC)
- Used for binary classification problem
- Probability of a classifier that will rank a randomly chosen positive example higher than a randomly chosen negative example.
- True Positive rate (Sensitivity): TP/(TP+FN)
- False Positive rate (1 − Specificity): FP/(FP+TN)
- F1 score
- Harmonic mean (HM) of precision and recall.
- The greater the F1 score, the better the performance of the model.
- Precision: The number of correct positive results divided by the number of positive results predicted by the classifier.
- Recall: The number of correct positive results divided by the number of all samples that should have been identified.
- Mean absolute error
- The average of the difference between the original values and the predicted values.
- No direction of the error.
- Minimizing it requires complicated linear programming tools.
- Mean squared error (MSE)
- Similar to mean absolute error.
- Takes average of the square of the difference between the original values and the predicted values.
- Easy to compute.
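A sketch of several of the measures above on a tiny set of hand-made predictions (the labels are illustrative):

```python
# Classification and regression performance measures from scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score,
                             mean_absolute_error, mean_squared_error)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))     # correct predictions / total = 0.75
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                     # 3 3 1 1
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN), i.e. sensitivity
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall

# Regression-style errors on continuous predictions:
print(mean_absolute_error([3.0, 5.0], [2.5, 5.5]))   # 0.5
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))    # 0.25
```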
Cross Validation on WEKA
Cross validation is in the box of “Test options”.
Inside the box, a few options are available:
- Training Dataset
- Use when all the data are available.
- No predictions are needed.
- Create a descriptive model.
- Supplied Test Set
- Use when data is very large.
- Useful when test set is defined by a third party.
- Percentage Split
- Use to get a quick idea of the performance of a model.
- Common split is 66%(train set) to 34%(test set).
- Cross Validation
- Use when unsure.
- Provide more accurate estimate of the performance than others.
- Not suitable for large data.
- Common k = 5 or 10
Cross Validation on Python
Train and Test set
Test size = 33%
Library: pandas, sklearn(model_selection), sklearn.linear_model(LogisticRegression)
k-fold Cross Validation
k = 10
Library: pandas, sklearn(model_selection), sklearn.linear_model(LogisticRegression)
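A sketch of the k-fold setup described above (k = 10, LogisticRegression; the breast-cancer dataset stands in for the course data):

```python
# 10-fold cross validation with cross_val_score: one accuracy per fold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)   # high max_iter so lbfgs converges

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold)
print(scores.mean(), scores.std())          # mean accuracy and its spread
```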
Leave-one-out Cross Validation
Library: pandas, sklearn(model_selection), sklearn.linear_model(LogisticRegression)
Repeated Random Test-Train Splits
This method has the speed of a train-test split and the reduced variance in the estimated performance of k-fold cross validation.
Drawback: May cause redundancy in evaluation.
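Repeated random test-train splits can be sketched with scikit-learn's ShuffleSplit (10 independent 67/33 splits, each scored separately; the same stand-in dataset as above):

```python
# Repeated random test-train splits: because splits are drawn independently,
# the same samples may appear in several test sets (the redundancy drawback).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

splits = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)
scores = cross_val_score(model, X, y, cv=splits)
print(scores.mean(), scores.std())
```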
