The Problem with Healthcare and Genomics Today

Nearly 20 years ago, we sequenced the first human genome. After $3 billion in investment and 13 years of research, we had finally cracked the code of life: all 3 billion A's, C's, T's and G's that encode a human being. It looked like the future of personalized medicine.

Unfortunately, we're still far from achieving personalized medicine and curing genetic disease. And while the cost of sequencing a human genome has dropped from eight figures to only $47, genomics research has not moved forward at anything like the same pace. Why? Because biology is fundamentally hard to read. Humans cannot conceptualize the hundreds or possibly thousands of mutations that result in complex diseases such as cancer.

Inspired by this and the research being done by companies like Deep Genomics, I decided to try to tackle this problem exactly one year ago. With the rise of more powerful computing and a rapid increase in genomic data, machine learning can be used to help increase our understanding of the human genome. So much of the drug discovery process and genomics research involves brute-force, repetitive experiments and recognizing patterns in data, which is exactly the kind of task machine learning models are designed for. With better models for understanding our genome, we could cut the time it takes to discover targets for rare disease by at least 50-70%.

Last month, I got the incredible opportunity to present my research at the Re-Work Deep Learning Summit in Montreal, where I gave a talk on my project in front of machine learning researchers and practitioners from Google Brain, Facebook AI Research, and MIT!

Here's a link to the Jupyter notebook demo (run in Chrome). I'll be looking to deploy an online demo in February, so stay tuned!

Phase 1: Predicting Transcription Factor Binding using Convolutional Neural Networks

Identifying Regulatory Variants that impact CTCF Transcription-Factor DNA binding in A549 lung epithelial cells.

Phase 1 uses convolutional neural networks to learn transcription-factor binding patterns in A549 lung epithelial cells. The model is trained on one-hot encoded ChIP-seq data, which gives us signals for binding strength across an entire genome. From this data it learns motifs and can flag potentially disease-causing regulatory variants that negatively impact gene expression (the process of going from DNA to RNA to protein). It achieves an accuracy of 90.5%, surpassing traditional approaches based on Position Weight Matrix (PWM) models by nearly 20%.

The implications of a model that can learn transcription factor binding patterns are huge: clinical trials suggest that mutations in genes which bind to transcription factors cause forms of breast and lung cancer, as well as rare neurological diseases such as Rett syndrome.

One-hot encoded genomic data: converts raw sequencing data into a 4 x k image-like structure, with one channel for each of the four nucleotides.
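To make the 4 x k structure concrete, here is a minimal, dependency-free sketch of one-hot encoding. The channel order (A, C, G, T), the function name, and the choice to leave unknown bases such as N all-zero are assumptions made for illustration, not necessarily what the project's notebook does.

```python
# Channel order is an assumption of this sketch; any fixed order works.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return a 4 x k matrix (list of 4 rows), one row per nucleotide channel."""
    sequence = sequence.upper()
    matrix = [[0.0] * len(sequence) for _ in NUCLEOTIDES]
    for position, base in enumerate(sequence):
        channel = NUCLEOTIDES.find(base)
        if channel >= 0:  # unknown bases such as N stay all-zero
            matrix[channel][position] = 1.0
    return matrix


encoded = one_hot_encode("GATTACA")
for base, row in zip(NUCLEOTIDES, encoded):
    print(base, row)
```

Each column holds a single 1.0, marking which nucleotide sits at that position, so a sequence of length k becomes a 4 x k grid that a convolutional filter can slide over.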


Convolutional neural networks are able to learn inherent structures in genomic data by modelling them like images. Convolutional filters, which are typically used to detect shapes in images, act as motif detectors. These motif detectors are trained to recognize patterns in DNA that correlate with a strong binding site for a protein.

The adaptation of convolutional neural networks from computer vision to genomics can be accomplished by considering a window of genome sequence as an image. Instead of processing 2-D images with three color channels (R, G, B), we consider a genome sequence as a fixed length 1-D sequence window with four channels (A, C, G, T). Therefore the genomic task of modeling DNA sequence protein-binding specificity is analogous to the computer vision task of two-class image classification.

For this model, I use an architecture similar to DeepBind: 3 layers of 1-D convolutions with the ReLU activation function, global max pooling (instead of average pooling), and a 3-layer MLP whose output classification layer has 2 neurons (bind, no bind). During training, backpropagation simultaneously updates all motifs, thresholds and network weights of the model to improve prediction accuracy, as in the DeepBind paper.
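The pooling and classification steps can be sketched in plain Python. This is a simplified illustration, not the trained model: the convolution scores and weights are made-up numbers, the 3-layer MLP is collapsed into a single dense layer, and the softmax over the two output neurons is an assumption about how the bind / no bind scores are turned into probabilities.

```python
import math


def relu(value):
    return max(0.0, value)


def global_max_pool(feature_map):
    """feature_map: one row per sequence position, one column per motif detector."""
    # zip(*rows) transposes the map so each detector's scores are grouped together.
    return [max(column) for column in zip(*feature_map)]


def dense(inputs, weights, biases):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy convolution output: 5 positions x 2 motif detectors (made-up numbers).
conv_scores = [[-0.4, 0.1], [2.3, -1.0], [0.2, 0.0], [-0.7, 0.9], [0.5, -0.2]]
activated = [[relu(v) for v in row] for row in conv_scores]
pooled = global_max_pool(activated)  # strongest response of each detector

# Hypothetical output layer: row 0 scores "no bind", row 1 scores "bind".
weights = [[-1.0, -0.5], [1.0, 0.5]]
biases = [0.0, 0.0]
probabilities = softmax(dense(pooled, weights, biases))
print(pooled, probabilities)
```

Global max pooling keeps only the best match of each motif detector anywhere in the window, which suits a question like "is this motif present?", whereas average pooling would dilute a single strong match across the whole sequence.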

The model takes a set of sequences s as input and outputs a binding score for the protein it's trained on (for this model, the CTCF protein, which helps package DNA in cells). The score is compared to the target value, 0 (no bind) or 1 (bind), obtained from the ChIP-seq data.
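One common way to compare a predicted score with a 0/1 target is cross-entropy. The text above does not name the loss that was used, so treat the following as an assumed, illustrative choice rather than a description of the actual training code.

```python
import math


def cross_entropy(probabilities, target):
    """Negative log-probability assigned to the true class (0 = no bind, 1 = bind)."""
    return -math.log(probabilities[target])


# The same true label (bind) with a confident correct and a confident wrong prediction.
loss_when_right = cross_entropy([0.1, 0.9], target=1)
loss_when_wrong = cross_entropy([0.9, 0.1], target=1)
print(round(loss_when_right, 3), round(loss_when_wrong, 3))
```

The loss is small when the model puts most of its probability on the correct class and grows quickly when it is confidently wrong, which is the signal backpropagation uses to adjust the motif detectors and weights.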

We use ChIP-seq data to train our model, which allows us to analyze how transcription factor proteins bind to DNA sequences. ChIP-seq combines chromatin immunoprecipitation, which determines which regions of DNA a protein is associated with, with DNA sequencing.

The output of the convolution stage is an $(n + m - 1) \times d$ array $X$, where $d$ is the number of tunable motif detectors within the DeepBind model, $n$ is the length of the DNA sequence input, and $m$ is the length of each motif detector. Element $X_{i,k}$ is essentially the score of motif detector $k$ aligned to position $i$ of the padded sequence $S$. After training on how a gene binds to a single transcription factor, the filters should capture both positive and negative mutations (e.g., T instead of A).
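The shape of the convolution output is easiest to see in code. This is a minimal sketch with two made-up motif detectors. It assumes the sequence is padded with m - 1 uninformative rows (0.25 per nucleotide) on each side, following a DeepBind-style convention, and the function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def pad_and_encode(sequence, motif_length):
    """One-hot row per position, with m - 1 uninformative rows on each side."""
    pad = [[0.25] * 4 for _ in range(motif_length - 1)]
    body = [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in sequence]
    return pad + body + pad


def motif_scan(sequence, detectors):
    """Return X with shape (n + m - 1) x d; X[i][k] scores detector k at position i."""
    m = len(detectors[0])
    padded = pad_and_encode(sequence, m)
    scores = []
    for i in range(len(padded) - m + 1):
        scores.append([
            sum(detector[j][c] * padded[i + j][c]
                for j in range(m) for c in range(4))
            for detector in detectors
        ])
    return scores


# Two hypothetical detectors of length m = 3 (rows are positions, columns are A, C, G, T).
detector_tat = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1]]  # responds to T-A-T
detector_gcg = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # responds to G-C-G

X = motif_scan("GGTATCC", [detector_tat, detector_gcg])
print(len(X), len(X[0]))  # rows = n + m - 1, columns = d
best_position = max(range(len(X)), key=lambda i: X[i][0])
print(best_position, X[best_position][0])
```

With n = 7 and m = 3 the scan produces 9 rows, one per alignment of a detector against the padded sequence, and the T-A-T detector peaks where that motif actually occurs.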

One of the largest applications of these transcription-factor binding models is predicting the effect of genetic variants, to help expand our knowledge of all possible harmful and helpful mutations in a given DNA-protein interaction. Genetic variants that create or abrogate binding sites can alter gene expression patterns and potentially lead to disease. A promising direction in precision medicine is to use binding models to identify, group and visualize variants that potentially change protein binding.


To explore the effects of genetic variations using these models, we use mutation maps, which illustrate the effect that every possible mutation in a sequence may have on binding affinity. A mutation map conveys two types of information:

First, for a given sequence, the mutation map shows how important each base is by the height of the base letter, where taller bases indicate a stronger interaction. Second, the mutation map includes a heat map that indicates whether a given mutation will have a positive or negative effect on DNA-protein binding.
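Both pieces of a mutation map can be computed by scoring every single-base substitution and comparing it with the original sequence. The sketch below uses a toy scoring function as a stand-in for the trained model. The consensus string, the function names, and the definition of letter height as the largest score drop at a position are all assumptions for illustration.

```python
NUCLEOTIDES = "ACGT"


def mutation_map(sequence, score_fn):
    """For each base, list the change in score when each position is mutated to it."""
    reference = score_fn(sequence)
    table = {base: [] for base in NUCLEOTIDES}
    for i in range(len(sequence)):
        for base in NUCLEOTIDES:
            mutant = sequence[:i] + base + sequence[i + 1:]
            table[base].append(score_fn(mutant) - reference)
    return table


def base_importance(table, length):
    """Letter height: the largest score drop any substitution causes at a position."""
    return [max(0.0, -min(table[base][i] for base in NUCLEOTIDES))
            for i in range(length)]


CONSENSUS = "CCCTC"  # made-up motif, used only by the toy scorer below


def toy_score(sequence):
    """Hypothetical stand-in for a trained model: best match count against CONSENSUS."""
    best = 0
    for offset in range(len(sequence) - len(CONSENSUS) + 1):
        window = sequence[offset:offset + len(CONSENSUS)]
        best = max(best, sum(a == b for a, b in zip(window, CONSENSUS)))
    return float(best)


sequence = "ACCCTCA"
heat_map = mutation_map(sequence, toy_score)
heights = base_importance(heat_map, len(sequence))
for base in NUCLEOTIDES:
    print(base, heat_map[base])
print("importance", heights)
```

Negative entries in the heat map are substitutions that weaken the predicted binding, positive entries strengthen it, and positions where every substitution hurts get the tallest letters.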

For example, when we look at mutations where a nucleotide is replaced by 'T' across the LDR1 gene, we see a strong positive impact on binding. This model could aid the understanding of harmful genetic variants in genes and, in cases such as Rett syndrome, help identify more potentially harmful mutations that may cause the disease. Once we've identified target mutations, researchers can start to focus on determining whether negative mutations are pathogenic, that is, disease-causing.

The sequence map generated by my model for the CTCF transcription factor protein in humans.

My model achieves 91% accuracy on predicting DNA-protein binding across A549 lung epithelial cells. After training on CTCF binding patterns, the model successfully detects a motif sequence map, showcasing a common pattern seen in the exon region of DNA that causes protein-DNA binding. This information allows us to predict the secondary structure of proteins and gauge common positive and negative mutations for DNA-protein interactions.

Phase 2: Using Bidirectional LSTMs for Molecular De Novo Drug Design

The second phase of the project uses bidirectional long short-term memory networks for de novo molecular design (hence the name Project De Novo). The model can capture the structure and syntax of SMILES molecular representations with near-perfect accuracy, achieving a loss of ~1.3. After training on 100,000 SMILES strings (millions of characters of chemical information), the model can generate valid molecular structures 84% of the time.
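Before a language model can learn SMILES syntax, the strings have to be turned into token sequences. The sketch below shows one simple way to do that at the character level. The start and end tokens, the function names, and the left-to-right training pairs are assumptions for illustration. It does not reproduce the bidirectional LSTM itself, and checking whether a generated string is a valid molecule would need a cheminformatics toolkit, which is not shown here.

```python
START, END = "<s>", "</s>"  # hypothetical boundary tokens


def tokenize(smiles):
    """Character-level tokens; multi-character atoms like Cl are split in this sketch."""
    return [START] + list(smiles) + [END]


def build_vocabulary(smiles_list):
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {token: index for index, token in enumerate([START, END] + chars)}


def encode(smiles, vocab):
    return [vocab[token] for token in tokenize(smiles)]


def decode(indices, vocab):
    reverse = {index: token for token, index in vocab.items()}
    return "".join(reverse[i] for i in indices if reverse[i] not in (START, END))


def next_token_pairs(encoded):
    """(tokens seen so far, next token) examples for next-character prediction."""
    return [(encoded[:i], encoded[i]) for i in range(1, len(encoded))]


molecules = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
vocab = build_vocabulary(molecules)
ethanol = encode("CCO", vocab)
print(vocab)
print(ethanol, decode(ethanol, vocab))
print(next_token_pairs(ethanol))
```

Generation then amounts to starting from the start token and repeatedly sampling the next token from the model until the end token appears.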

The initial models (version 1: March 2019) remain open-source and allow genomics and cheminformatics researchers to train deep learning models on any SMILES string dataset and obtain detailed analysis within hours. Using fragment-based drug discovery, I'm currently applying bidirectional LSTMs and transformer models to create SMILES strings that are more likely to bind to targets.

A molecule generated by the language model I created and trained.

Read more about how I apply language models to chemical data in this blog post.

Initial Open-Source Starter Code

I've released some of Project De Novo's initial starter code for the transcription-factor binding and molecule generation models below. More models I've been developing over the summer and this fall will be released soon, with documentation.

Phase 1: Predicting transcription-factor binding

Phase 2: Generating SMILES molecular structures

Feel free to reach out to me at! The main project page is viewable here. You can view my portfolio here.