Demystify DNA Sequencing with Machine Learning and Python



A genome is the complete collection of DNA in an organism. All living species possess a genome, but genomes differ considerably in size. The human genome, for instance, is arranged into 23 chromosomes, a little like an encyclopedia organized into 23 volumes, and if you count all the characters (individual DNA “base pairs”), there are more than 6 billion in each human genome. So it is a huge compilation.

As a data-driven science, genomics extensively utilizes machine learning to capture dependencies in data and infer new biological hypotheses. Nonetheless, extracting new insights from the exponentially increasing volume of genomics data requires more powerful machine learning models. By efficiently leveraging large data sets, deep learning has transformed fields such as Computer Vision and Natural Language Processing. It has become the method of preference for many genomics modeling tasks, including predicting the influence of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.

Problem Statement

Interpret DNA structure with machine learning by building a prediction model on DNA sequence data.



DNA is built from four nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). The order, or sequence, of these bases determines what biological instructions are contained in a strand of DNA. For example, the sequence ATCGTT might instruct for blue eyes, while ATCGCT might instruct for brown.

DNA sequence data handling using Python:

Here, a DNA sequence is first converted into binary using a 2-bit encoding scheme that maps T to 00, C to 01, A to 10, and G to 11.
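A minimal sketch of that 2-bit scheme (the helper name `dna_to_bits` is illustrative, not from the original code):

```python
# 2-bit encoding described above: T -> 00, C -> 01, A -> 10, G -> 11
TWO_BIT = {'T': '00', 'C': '01', 'A': '10', 'G': '11'}

def dna_to_bits(seq):
    """Return the binary string for a DNA sequence (hypothetical helper)."""
    return ''.join(TWO_BIT[base] for base in seq.upper())

print(dna_to_bits('ATCG'))  # -> 10000111
```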

Machine learning and deep learning models require feature matrices or numerical values as input, but our data is in character (string) format. So, the next step is to encode these characters into matrices.

There are 3 general approaches to encode sequence data:

1. Ordinal encoding DNA Sequence

2. One-hot encoding DNA Sequence

3. DNA sequence as a “language”, known as k-mer counting

Let us implement each of them and see which one produces the most suitable input features.

Ordinal encoding DNA sequence data:

In this approach, we need to encode each nitrogen base as an ordinal value. For example “ATGC” becomes [0.25, 0.5, 0.75, 1.0]. Any other base such as “N” can be a 0.

Now, let us create functions for creating a NumPy array object from a sequence string, and a label encoder with the DNA sequence alphabet “a”, “c”, “g” and “t”. Also add a character for anything else, “n”.

One-hot encoding DNA Sequence:

Another approach is to use one-hot encoding to represent the DNA sequence. This is widely used in deep learning methods and lends itself well to algorithms like convolutional neural networks. In this example, “ATGC” would become [0,0,0,1], [0,0,1,0], [0,1,0,0], [1,0,0,0], and these one-hot encoded vectors can either be concatenated or turned into 2-dimensional arrays.
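A minimal one-hot sketch is shown below; the column order (C, G, T, A) is chosen only so the output matches the “ATGC” example above, and any unknown base maps to the all-zero vector:

```python
import numpy as np

# Column order chosen to reproduce the "ATGC" example in the text
BASE_INDEX = {'c': 0, 'g': 1, 't': 2, 'a': 3}

def one_hot_encode(seq_string):
    """Return a (len(seq), 4) matrix with one one-hot row per base."""
    seq_string = seq_string.lower()
    encoded = np.zeros((len(seq_string), 4), dtype=int)
    for i, base in enumerate(seq_string):
        if base in BASE_INDEX:  # unknown bases stay all-zero
            encoded[i, BASE_INDEX[base]] = 1
    return encoded

print(one_hot_encode('ATGC'))
```

The resulting 2-dimensional array can be fed directly to a convolutional network, or flattened/concatenated for other models.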

DNA sequence as a “language”, known as k-mer counting:

The third approach divides a long sequence into overlapping “words” of a fixed length k. It returns a list of k-mer “words”; we can then merge the “words” into a “sentence” and apply your favorite natural language processing methods to the “sentences” as you normally would.

You can tune both the word length and the amount of overlap, which determines how much sequence information is retained and how large the vocabulary becomes in your application. For example, if you use words of length 6 and there are 4 letters, you have a vocabulary of 4096 (4^6) possible words. You can then go on and create a bag-of-words model, as you would in NLP.

Human DNA sequence and class labels

Below we find the definitions for each of the 7 classes and how many there are in the human training data:
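The loading step can be sketched as follows. The datasets are tab-separated files with a `sequence` column and a numeric `class` column (the file name `human_data.txt` is an assumption based on the tutorial; a tiny stand-in DataFrame is used here so the snippet runs on its own):

```python
import pandas as pd

# In the real pipeline, load the tab-separated dataset:
# human_data = pd.read_table('human_data.txt')
# Tiny stand-in frame with the same layout, for illustration:
human_data = pd.DataFrame({
    'sequence': ['ATGCCCCAACTAAATACTACCGTATGG', 'ATGAACGAAAATCTGTTCGCTTCATTC'],
    'class': [4, 4],
})

print(human_data['class'].value_counts())  # genes per class
```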

Now that we have all our data loaded, the next step is to convert a sequence of characters into k-mer words, default size = 6 (hexamers). The function Kmers_funct() will collect all possible overlapping k-mers of a specified length from any sequence string.
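The function named in the text can be sketched as a one-line list comprehension:

```python
def Kmers_funct(seq, size=6):
    """Collect all overlapping k-mers of length `size` from a sequence string."""
    return [seq[x:x + size].lower() for x in range(len(seq) - size + 1)]

print(Kmers_funct('ATGCATGCA'))  # -> ['atgcat', 'tgcatg', 'gcatgc', 'catgca']
```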

The DNA sequence is changed to lowercase, divided into all possible k-mer words of length 6 and is ready for the next step.

We need to now convert the lists of k-mers for each gene into string sentences of words that can be used to create the Bag of Words model. We will make a target variable ‘y’ to hold the class labels.
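That conversion can be sketched like this (a tiny stand-in DataFrame keeps the snippet self-contained; in the pipeline, `human_data` is the frame loaded earlier):

```python
import pandas as pd

def Kmers_funct(seq, size=6):
    return [seq[x:x + size].lower() for x in range(len(seq) - size + 1)]

# Tiny stand-in for the human DataFrame loaded earlier
human_data = pd.DataFrame({
    'sequence': ['ATGCATGCAT', 'GGGTTTAAAC'],
    'class': [0, 1],
})

# Each gene's k-mer list becomes one space-separated "sentence"
human_data['words'] = human_data.apply(
    lambda row: Kmers_funct(row['sequence']), axis=1)
human_texts = [' '.join(words) for words in human_data['words']]

y = human_data['class'].values  # target variable holding the class labels
print(human_texts[0])  # -> atgcat tgcatg gcatgc catgca atgcat
```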

Now, for human we have 4380 genes converted into uniform-length feature vectors of counts of 4-grams of hexamer (length-6 k-mer) words. For chimp and dog, we have the same number of features, with 1682 and 820 genes respectively.
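The counting step uses a bag-of-words vectorizer over 4-grams of hexamer words; a sketch with a two-sentence stand-in corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# human_texts is the list of k-mer "sentences" built above (stand-in here)
human_texts = ['atgcat tgcatg gcatgc catgca atgcat',
               'gggttt ggttta gtttaa tttaaa ttaaac']

# Each feature counts a 4-gram, i.e. four consecutive hexamer words
cv = CountVectorizer(ngram_range=(4, 4))
X = cv.fit_transform(human_texts)
print(X.shape)  # one row per gene, one column per 4-gram in the vocabulary
```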

So now that we know how to transform our DNA sequences into uniform length numerical vectors in the form of k-mer counts and ngrams, we can now go ahead and build a classification model that can predict the DNA sequence function based only on the sequence itself.

Here we will use the human data to train the model, holding out 20% of the human data to test the model. Then we can challenge the model’s generalizability by trying to predict sequence function in other species (the chimpanzee and dog).

Next, we split the human dataset into training and test sets and build a simple multinomial naive Bayes classifier.
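A sketch of that step, using a small synthetic corpus in place of the real `X` and `y` so it runs stand-alone (the alpha value follows the text):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stand-in corpus of k-mer sentences and labels; in the pipeline, X and y
# come from the bag-of-words step above
texts = ['aaaaaa aaaaaa aaaaaa', 'tttttt tttttt tttttt'] * 10
y = [0, 1] * 10
X = CountVectorizer().fit_transform(texts)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

classifier = MultinomialNB(alpha=0.1)  # alpha chosen per the text
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classifier.score(X_test, y_test))
```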

You might want to do some parameter tuning and build a model with different ngram sizes. Here we’ll go ahead with an ngram size of 4 and a model alpha of 0.1.

Okay, so let’s look at some model performance metrics like the confusion matrix, accuracy, precision, recall, and f1 score. We are getting really good results on our unseen data, so it looks like our model did not overfit to the training data.
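A helper for those metrics might look like this (weighted averaging suits the multi-class setting; the toy labels are only for illustration):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def get_metrics(y_test, y_predicted):
    """Weighted-average accuracy, precision, recall and F1."""
    accuracy = accuracy_score(y_test, y_predicted)
    precision = precision_score(y_test, y_predicted, average='weighted')
    recall = recall_score(y_test, y_predicted, average='weighted')
    f1 = f1_score(y_test, y_predicted, average='weighted')
    return accuracy, precision, recall, f1

# Toy labels; in the pipeline these come from the fitted classifier
y_test = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 2, 0, 0]
print(confusion_matrix(y_test, y_pred))
accuracy, precision, recall, f1 = get_metrics(y_test, y_pred)
print('accuracy = %.3f, precision = %.3f, recall = %.3f, f1 = %.3f'
      % (accuracy, precision, recall, f1))
```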


Predictions on test Human DNA sequence:

Let’s see how our model performs on the DNA sequences from other species. First, we’ll try with the Chimpanzee, which we expect to be very similar to humans and then the Dog DNA sequences.
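The key point when scoring other species is to reuse the vectorizer fitted on human data (transform, never fit_transform) so the feature columns line up; a self-contained sketch with toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy human corpus and model, standing in for the fitted cv and classifier
human_texts = ['aaaaaa aaaaaa', 'tttttt tttttt'] * 5
y_human = [0, 1] * 5
cv = CountVectorizer()
X_human = cv.fit_transform(human_texts)
classifier = MultinomialNB(alpha=0.1).fit(X_human, y_human)

# Other species: transform with the SAME fitted vectorizer so the columns
# align with the human feature space
chimp_texts = ['aaaaaa aaaaaa', 'tttttt tttttt']
X_chimp = cv.transform(chimp_texts)
print(classifier.predict(X_chimp))
```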

Predictions on test Chimpanzee DNA sequence:

Predictions on test Dog DNA sequence:

The model seems to produce good results on human data. It also does well on chimpanzee data, as chimpanzees and humans are genetically very close. The performance on the dog data is not as good, as expected, given the greater divergence of dogs from humans than chimpanzees.

You can find the complete code on the DLTK GitHub: