GitHub repo (80 stars and counting):

seyonechithrananda/bert-loves-chemistry

Overview of Work:

ChemBERTa is a language model that aims to make it easier to design new medicines, molecules, and materials, improving lead optimization in drug discovery. I began working on ChemBERTa around one year ago, after a research internship at the machine learning software company Integrate.ai. The project began as an open-source tool I released within the DeepChem library, which originated in Professor Vijay Pande's lab at Stanford and is where I met my research mentor Dr. Bharath Ramsundar, the library's co-creator. After joining DeepChem, I was inspired by the rise of natural language processing, specifically the 'transformer' architecture introduced by Google Brain and later popularized by models from OpenAI, and decided to investigate whether these attention-based models could be applied to drug discovery.

In February, I won the $15,000 Emergent Ventures Fellowship awarded by Dr. Tyler Cowen and the Thiel Foundation, which gave me the resources to develop large-scale models on GPUs. I was also selected for a research internship in Prof. Alan Aspuru-Guzik's lab at Harvard and the University of Toronto, where I worked on a different project but received mentorship and feedback on ChemBERTa.

In October, I was the lead author on the first version of the ChemBERTa paper, presented at the Neural Information Processing Systems (NeurIPS) conference in Vancouver. Building on the model's success, with over 250,000 users and multiple citations so far, the paper is currently being submitted to top journals, including Nature Chemistry and Chemical Science.

ChemBERTa is the first work to systematically evaluate transformers for molecular property prediction. By pretraining on SMILES strings, the text-based encoding of molecular structure used throughout chemistry, the model learns a strong molecular representation. ChemBERTa outperforms existing deep-learning predictive models on tasks such as predicting biochemical toxicity and inhibition of HIV replication, even with little training data.
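For readers who want to try the model themselves, the pretrained weights are available through the Hugging Face model hub. Below is a minimal sketch, assuming the transformers library and the seyonec/ChemBERTa-zinc-base-v1 checkpoint; it loads the model and uses its masked-language-modeling head to fill in a masked token of a SMILES string, which is a quick way to see what the pretrained representation has learned. It is an illustration, not the evaluation pipeline from the paper.

```python
# Minimal sketch: load a pretrained ChemBERTa checkpoint and fill in a
# masked token of a SMILES string. Assumes the `transformers` library and
# the `seyonec/ChemBERTa-zinc-base-v1` checkpoint on the Hugging Face hub.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Aspirin's SMILES with one token masked; the model proposes completions
# ranked by probability, reflecting what it learned during pretraining.
smiles = "CC(=O)Oc1ccccc1C(=O)<mask>"
for prediction in fill_mask(smiles):
    print(prediction["token_str"], round(prediction["score"], 3))
```

For downstream property-prediction tasks like the ones reported in the paper, the same pretrained checkpoint is fine-tuned with a classification head on labelled molecular data.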

Poster

Co-authors + Mentors

Publication Details:
