Molecular Property Prediction

Overview
The goal of this assignment is to explore the use of deep learning, and in particular Graph Neural Networks (GNNs), for molecular property prediction. Traditional cheminformatics relies on molecular fingerprints and other engineered features; GNNs instead learn representations directly from the molecular graph structure, starting from hand-designed atom and bond features. Both paradigms involve feature engineering, but GNNs automate the step of combining local features into a global molecular representation.
This assignment challenges you to understand both approaches (classical feature-based models and deep graph representation learning) and to critically compare their strengths and limitations. A key part of the exercise is honest evaluation: on small datasets like Tox21, you may find that fingerprint baselines are competitive with or outperform GNNs. This is a valid and important result, not a failure.
You will work with the Tox21 dataset, which consists of thousands of molecules labeled for activity against 12 different toxicity-related biological targets. Your task is to build and evaluate both traditional machine learning models and GNN-based approaches to predict molecular activity.
Dataset: Tox21
The Tox21 dataset (publicly available via MoleculeNet or Kaggle) is a benchmark in computational toxicology. It includes:
- Compounds represented as SMILES strings
- 12 binary classification tasks (e.g., activation of nuclear receptors or stress pathways)
- ~8,000 molecules for training/testing
- Multi-label setup (a molecule can be active in multiple assays)
The dataset has been preprocessed and split into training, validation, and test sets.
Practical considerations you must handle:
- Missing labels. Not every molecule has been tested on every assay. The label matrix contains missing values (NaNs). You must mask these missing entries during loss computation; do not treat them as negatives.
- Class imbalance. Several of the 12 targets are heavily imbalanced, with fewer than 5% positive examples. Be aware of this when interpreting your results and choosing evaluation metrics.
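To make the masking requirement concrete, here is a minimal NumPy sketch of a binary cross-entropy loss that averages only over observed (non-NaN) label entries. This is illustrative; in a deep learning framework you would typically implement the same idea with a mask tensor applied to an elementwise loss.

```python
import numpy as np

def masked_bce(logits, labels):
    """Binary cross-entropy averaged over observed (non-NaN) labels only.

    logits: (n_molecules, n_tasks) raw model outputs
    labels: (n_molecules, n_tasks), 0/1 where tested, NaN where untested
    """
    mask = ~np.isnan(labels)                  # True where the assay was run
    y = np.where(mask, labels, 0.0)           # placeholder values under the mask
    p = 1.0 / (1.0 + np.exp(-logits))         # sigmoid
    eps = 1e-7                                # numerical stability
    loss = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return (loss * mask).sum() / mask.sum()   # mean over observed entries only
```

Note that the untested entries contribute exactly zero to the loss; treating them as negatives instead would systematically bias the model toward predicting inactivity.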
Molecular Representations
SMILES (Simplified Molecular-Input Line-Entry System)
SMILES is a textual representation of chemical structures, where atoms and bonds are encoded as ASCII strings (e.g., CC(=O)Oc1ccccc1C(=O)O for aspirin). While convenient for storage and parsing, SMILES must be converted to structured formats (e.g., molecular graphs) to be usable for learning, or features need to be extracted for classical ML.
From SMILES to Graphs
You can use cheminformatics libraries like RDKit to convert a SMILES string into a molecular graph, where:
- Nodes = atoms (with features like element type, degree, formal charge, aromaticity)
- Edges = bonds (with features like bond type, conjugation, ring membership)
Note that this atom/bond featurization is itself a form of feature engineering. GNNs learn to compose these local features into molecular-level representations through message passing, but the input features are still hand-designed.
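The conversion above can be sketched with RDKit as follows; the specific features extracted here are illustrative choices, not a prescribed featurization.

```python
# Sketch of SMILES-to-graph conversion with RDKit. The atom/bond features
# chosen here are examples; you should decide on your own featurization.
from rdkit import Chem

def mol_to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    atom_feats = [
        (a.GetSymbol(), a.GetDegree(), a.GetFormalCharge(), a.GetIsAromatic())
        for a in mol.GetAtoms()
    ]
    # RDKit stores each bond once; add both directions for undirected
    # message passing.
    edges, bond_feats = [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        f = (str(b.GetBondType()), b.GetIsConjugated(), b.IsInRing())
        edges += [(i, j), (j, i)]
        bond_feats += [f, f]
    return atom_feats, edges, bond_feats

atoms, edges, bonds = mol_to_graph("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

In practice you would one-hot encode these features into numeric vectors before feeding them to a GNN; libraries like PyTorch Geometric and DeepChem ship featurizers that handle this step.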
Modeling with GNNs
GNNs operate directly on molecular graphs by passing messages between atoms, aggregating neighborhood information, and learning hierarchical molecular representations. In this assignment, you may explore architectures such as:
- Graph Convolutional Networks (GCNs)
- Message Passing Neural Networks (MPNNs)
- AttentiveFP or D-MPNN (Directed Message Passing Neural Network, implemented in the ChemProp package)
You do not have to explore all of these architectures: choose one or two that you find interesting and implement them. These models typically include:
- Atom/bond featurization
- Message passing layers
- A readout/pooling step to produce a fixed-size molecular embedding
- A final MLP for classification
Suggested libraries: PyTorch Geometric, DGL, or DeepChem. You may use existing implementations of the architectures you choose. The purpose of this assignment is not to implement GNNs from scratch, but to understand how they work, how to train them, and to critically evaluate their performance against classical baselines.
Compute: All experiments should be runnable on Google Colab (free tier GPU is sufficient). If training takes more than a few hours, simplify your model or reduce hyperparameter search.
Assignment Tasks
Part 1: Classical ML Baseline with Molecular Fingerprints
- Use RDKit to compute Morgan fingerprints for all molecules. Morgan fingerprints (also known as Extended Connectivity Fingerprints, or ECFP) encode the local chemical environment of each atom up to a given radius. ECFP4 refers specifically to Morgan fingerprints with radius 2, and is one of the most widely used representations for molecular similarity and property prediction. You are free to experiment with different radii and bit vector lengths, or to explore other fingerprint types (e.g., MACCS keys, RDKit topological fingerprints). For this, do not hesitate to read the literature and/or brainstorm with an LLM assistant.
- Train Random Forest or Gradient Boosting classifiers (e.g., XGBoost) using these fingerprints.
- Report performance on all 12 targets.
This serves as your baseline. You may also explore other classical models (e.g., SVM, logistic regression) or additional features (e.g., molecular descriptors) if you wish.
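A minimal sketch of this baseline, assuming RDKit and scikit-learn are installed (the molecules and labels below are toy placeholders, and in your experiments you would fit one model per target or a multi-output model):

```python
# Baseline sketch: ECFP4 fingerprints + random forest on a single toy task.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def ecfp4(smiles, n_bits=2048):
    """ECFP4 = Morgan fingerprint with radius 2, as a binary bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]   # toy molecules
y = [0, 1, 0, 0]                                      # toy labels
X = np.stack([ecfp4(s) for s in smiles_list])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

Note that recent RDKit versions also offer a fingerprint-generator API; either interface is fine for this assignment.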
Part 2: Graph Neural Network Modeling
- Convert each SMILES string into a molecular graph with appropriate atom and bond features.
- Build a GNN model using one of the suggested architectures.
- Train and evaluate on the same prediction tasks and metrics as Part 1.
- Splitting comparison. Compare model performance using random splits vs. scaffold splits (where molecules are grouped by their core chemical scaffold, so that structurally similar molecules do not appear in both train and test). Discuss which splitting strategy is more appropriate for real-world deployment, and how the choice of split affects your conclusions about model performance.
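One possible sketch of a scaffold split using RDKit's Bemis-Murcko scaffolds is shown below; the fill-largest-groups-into-train heuristic is one common convention (DeepChem uses a similar strategy), not the only valid one.

```python
# Sketch of a scaffold split: group molecules by Bemis-Murcko scaffold, then
# assign whole groups to train/test so similar cores never straddle the split.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    groups = defaultdict(list)
    for idx, s in enumerate(smiles_list):
        # Acyclic molecules all map to the empty scaffold "".
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=s)].append(idx)
    train, test = [], []
    n_test = int(test_frac * len(smiles_list))
    # Visit larger scaffold groups first; a group goes to test only if it fits,
    # so big common scaffolds end up in train and rarer ones in test.
    for group in sorted(groups.values(), key=len, reverse=True):
        (test if len(test) + len(group) <= n_test else train).extend(group)
    return train, test
```

Because test molecules share no scaffold with training molecules, scaffold splits give a harder (and usually more deployment-realistic) estimate of generalization than random splits.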
Part 3: Analysis and Ablation
Conduct a structured ablation study on GNN design choices. Investigate how at least two of the following affect performance:
- Number of message passing layers
- Atom featurization (which features to include, how to encode them)
- Readout/pooling function (sum, mean, attention-based)
- Combining GNN embeddings with classical fingerprint features
- Comparison to sequence models (e.g., LSTM, 1D-CNN) operating on SMILES strings directly
Report results with clear tables or figures showing the effect of each choice.
Additional datasets (optional)
You are encouraged to validate your findings on additional datasets from the MoleculeNet benchmark suite (e.g., BBBP, BACE) or the Open Graph Benchmark (ogbg-molhiv). These provide useful points of comparison, especially for understanding how dataset size affects the relative performance of fingerprints vs. GNNs.
Evaluation Metrics
Use Area Under the ROC Curve (AUC-ROC) for each of the 12 targets, and report:
- Individual AUC scores per target
- Mean and median AUC across tasks
Given the class imbalance in several Tox21 targets, also report AUC-PR (Area Under the Precision-Recall Curve) as a complementary metric. Discuss any cases where AUC-ROC and AUC-PR tell different stories.
Reproducibility: Run all experiments with at least 3 different random seeds and report mean and standard deviation of your metrics. Single-run numbers are not meaningful on a dataset of this size.
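Per-task AUC computation must respect the missing-label mask from the dataset section: each target is scored only on the molecules actually tested for it. A sketch, assuming scikit-learn:

```python
# Masked per-task AUC-ROC: score each target only on its non-missing labels.
import numpy as np
from sklearn.metrics import roc_auc_score

def per_task_auc(y_true, y_score):
    """y_true: (n, n_tasks) with NaN for untested entries; y_score: (n, n_tasks)."""
    aucs = []
    for t in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, t])
        labels = y_true[mask, t]
        if len(np.unique(labels)) < 2:
            aucs.append(np.nan)   # AUC is undefined without both classes
            continue
        aucs.append(roc_auc_score(labels, y_score[mask, t]))
    return aucs
```

The same masking applies to AUC-PR (e.g., scikit-learn's average_precision_score), and the mean/median across tasks should be taken over the defined (non-NaN) values.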
Deliverables
You must submit the following:
- A written report (PDF)
  - At most 10 pages; shorter is better if concise
  - Clearly describe your methodology, experiments, results, and findings
  - Include key plots or tables where relevant (e.g., ROC curves, performance comparisons across splits, ablation results)
  - Include a limitations section: discuss where your models fail, what they cannot do, and what you would do with more time or data
  - Do not include code in the report
- Your code
  - Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
  - All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
  - Include all scripts/notebooks for data loading, feature generation, model training, and evaluation
  - Include a requirements.txt or environment.yml
Why This Matters in Practice
Drug discovery increasingly depends on accurate, data-driven prediction of molecular properties. Traditional cheminformatics models remain highly competitive, especially on smaller datasets typical of early-stage drug discovery programs. GNNs offer a way to learn molecular representations from graph structure, but whether this translates to better predictions depends heavily on dataset size, molecular diversity, and the specific property being predicted. Understanding when and why one approach outperforms another is more valuable than assuming deep learning always wins.