ChEMBL Database Tutorial
This tutorial demonstrates how to use the ChEMBL interface in PROVESID to query chemical compounds, structures, and properties.
What is ChEMBL?
ChEMBL is a manually curated database of bioactive molecules with drug-like properties maintained by EMBL-EBI. It contains:
- Over 2.3 million distinct compounds
- Chemical structures and properties
- Bioactivity data from scientific literature
- Drug/clinical candidate information
Setup
First, import the CheMBL class (note the capitalization used by the package). On first run, the database is downloaded automatically (about 5 GB compressed, about 29 GB uncompressed), so make sure enough disk space is available.
from provesid import CheMBL
# Initialize ChEMBL (auto-downloads database if needed)
chembl = CheMBL()
print("ChEMBL database loaded successfully!")
ChEMBL database loaded successfully!
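Because the first run downloads a large file, it can help to check free disk space beforehand. The following is a minimal sketch using only the standard library. The sizes are the rough figures quoted above, and the helper names and the path being checked are assumptions: this tutorial does not say where PROVESID stores the database, so point the check at whichever location applies on your system.

```python
import shutil

# Rough sizes quoted in this tutorial; treat them as estimates.
COMPRESSED_GB = 5
UNCOMPRESSED_GB = 29


def free_space_gb(path: str = ".") -> float:
    """Return the free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_chembl(
    path: str = ".",
    required_gb: float = COMPRESSED_GB + UNCOMPRESSED_GB,
) -> bool:
    """Check for room to hold the archive and the unpacked database together."""
    return free_space_gb(path) >= required_gb


print(f"Free space: {free_space_gb():.1f} GB")
print(f"Enough room for ChEMBL: {has_room_for_chembl()}")
```

The default requirement adds both sizes on the assumption that the compressed archive and the unpacked database coexist on disk for a while during setup.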
1. Search by ChEMBL ID
The most direct way to retrieve a compound is by its ChEMBL ID.
# Search for aspirin (CHEMBL25)
aspirin = chembl.search_by_chembl_id('CHEMBL25')
print(f"ChEMBL ID: {aspirin['chembl_id']}")
print(f"Name: {aspirin['pref_name']}")
print(f"SMILES: {aspirin['canonical_smiles']}")
print(f"InChI: {aspirin['standard_inchi']}")
print(f"InChI Key: {aspirin['standard_inchi_key']}")
print(f"Max Phase: {aspirin['max_phase']}")
print(f"\nSynonyms ({len(aspirin['synonyms'])} total):")
for syn in aspirin['synonyms'][:5]:  # Show first 5 synonyms
    print(f"  - {syn}")
ChEMBL ID: CHEMBL25
Name: ASPIRIN
SMILES: CC(=O)Oc1ccccc1C(=O)O
InChI: InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
InChI Key: BSYNRYMUTXBXSQ-UHFFFAOYSA-N
Max Phase: 4

Synonyms (80 total):
  - Acetylsalicylic acid
  - Aspirin
  - Aspirin
  - Aspirin
  - Acetyl salicylate
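ChEMBL IDs consist of the prefix CHEMBL followed by an integer, so user input can be tidied up before it reaches the lookup. The sketch below is a hypothetical helper, not part of PROVESID; whether search_by_chembl_id itself tolerates lowercase or padded input is not stated in this tutorial, so normalizing first is a defensive assumption.

```python
import re

# "CHEMBL" followed by one or more digits, e.g. CHEMBL25.
_CHEMBL_ID = re.compile(r"CHEMBL\d+")


def normalize_chembl_id(raw: str) -> str:
    """Strip whitespace, uppercase, and validate a ChEMBL ID string."""
    cleaned = raw.strip().upper()
    if not _CHEMBL_ID.fullmatch(cleaned):
        raise ValueError(f"Not a ChEMBL ID: {raw!r}")
    return cleaned


print(normalize_chembl_id("  chembl25 "))  # CHEMBL25
```

The normalized value can then be passed to chembl.search_by_chembl_id as shown above.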
2. Search by Compound Name
Search for compounds by name (partial matching supported).
# Search for caffeine
results = chembl.search_by_name('caffeine')
print(f"Found {len(results)} compound(s) matching 'caffeine'\n")
for compound in results[:3]:  # Show first 3 results
    print(f"ChEMBL ID: {compound['chembl_id']}")
    print(f"Name: {compound['pref_name']}")
    print(f"SMILES: {compound['canonical_smiles']}")
    print("-" * 60)
Found 4 compound(s) matching 'caffeine'

ChEMBL ID: CHEMBL113
Name: CAFFEINE
SMILES: Cn1c(=O)c2c(ncn2C)n(C)c1=O
------------------------------------------------------------
ChEMBL ID: CHEMBL70246
Name: ACEFYLLINE
SMILES: Cn1c(=O)c2c(ncn2CC(=O)O)n(C)c1=O
------------------------------------------------------------
ChEMBL ID: CHEMBL1200569
Name: CAFFEINE CITRATE
SMILES: Cn1c(=O)c2c(ncn2C)n(C)c1=O.O=C(O)CC(O)(CC(=O)O)C(=O)O
------------------------------------------------------------
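Since partial matching can return several compounds, it is often useful to pick the one whose preferred name equals the query exactly. The sketch below is a hypothetical helper that works on any list of result dictionaries with a pref_name key; the sample data is copied from the output above so the block runs without the database.

```python
def pick_exact_match(results, query):
    """Return the first result whose preferred name equals the query (case-insensitive)."""
    wanted = query.strip().casefold()
    for compound in results:
        name = compound.get("pref_name") or ""  # pref_name may be missing or None
        if name.casefold() == wanted:
            return compound
    return None


# Stand-in for chembl.search_by_name('caffeine'), taken from the output above.
sample_results = [
    {"chembl_id": "CHEMBL113", "pref_name": "CAFFEINE"},
    {"chembl_id": "CHEMBL70246", "pref_name": "ACEFYLLINE"},
    {"chembl_id": "CHEMBL1200569", "pref_name": "CAFFEINE CITRATE"},
]

match = pick_exact_match(sample_results, "caffeine")
print(match["chembl_id"])  # CHEMBL113
```

Returning None when nothing matches exactly leaves the decision about a fallback, such as taking the first partial match, to the caller.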
3. Search by Chemical Structure
Search using InChI, InChI Key, or SMILES.
# Search by SMILES (aspirin)
smiles = 'CC(=O)Oc1ccccc1C(=O)O'
compound = chembl.search_by_smiles(smiles)
if compound:
    print(f"Found compound: {compound['pref_name']} ({compound['chembl_id']})")
else:
    print("Compound not found")
Found compound: ASPIRIN (CHEMBL25)
# Search by InChI Key (aspirin)
inchikey = 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'
compound = chembl.search_by_inchikey(inchikey)
if compound:
    print(f"Found compound: {compound['pref_name']} ({compound['chembl_id']})")
    print(f"SMILES: {compound['canonical_smiles']}")
else:
    print("Compound not found")
Found compound: ASPIRIN (CHEMBL25)
SMILES: CC(=O)Oc1ccccc1C(=O)O
# Search by InChI (aspirin)
inchi = 'InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)'
compound = chembl.search_by_inchi(inchi)
if compound:
    print(f"Found compound: {compound['pref_name']} ({compound['chembl_id']})")
else:
    print("Compound not found")
Found compound: ASPIRIN (CHEMBL25)
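When identifiers arrive in mixed formats, a small dispatcher can choose among the three structure searches shown above. This is a rough sketch, not part of PROVESID: it relies on the facts that a standard InChI string starts with "InChI=" and that an InChIKey has the fixed shape of 14, 10, and 1 uppercase letters separated by hyphens. Anything else is assumed to be SMILES, which is a simplification.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
_INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def guess_search_method(identifier: str) -> str:
    """Return the name of the search method that fits the identifier format."""
    text = identifier.strip()
    if text.startswith("InChI="):
        return "search_by_inchi"
    if _INCHIKEY.fullmatch(text):
        return "search_by_inchikey"
    return "search_by_smiles"  # fallback: treat everything else as SMILES


print(guess_search_method("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))  # search_by_inchikey
print(guess_search_method("CC(=O)Oc1ccccc1C(=O)O"))        # search_by_smiles
```

With a loaded database, the returned name could be resolved with getattr(chembl, guess_search_method(x))(x.strip()).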
4. Retrieve Physicochemical Properties
Get calculated molecular properties for a compound.
# Get aspirin's properties
aspirin = chembl.search_by_chembl_id('CHEMBL25')
props = chembl.get_properties(aspirin['molregno'])
if props:
    print(f"Molecular Properties for {aspirin['pref_name']}:")
    print(f"  Molecular Weight: {props['mw_freebase']:.2f}")
    print(f"  ALogP: {props['alogp']:.2f}")
    print(f"  Hydrogen Bond Acceptors: {props['hba']}")
    print(f"  Hydrogen Bond Donors: {props['hbd']}")
    print(f"  Polar Surface Area: {props['psa']:.2f}")
    print(f"  Rotatable Bonds: {props['rtb']}")
    print(f"  Aromatic Rings: {props['aromatic_rings']}")
    print(f"  Heavy Atoms: {props['heavy_atoms']}")
    print(f"  Lipinski Violations: {props['num_ro5_violations']}")
Molecular Properties for ASPIRIN:
  Molecular Weight: 180.16
  ALogP: 1.31
  Hydrogen Bond Acceptors: 3
  Hydrogen Bond Donors: 1
  Polar Surface Area: 63.60
  Rotatable Bonds: 2
  Aromatic Rings: 1
  Heavy Atoms: 13
  Lipinski Violations: 0
5. ID Conversion
Convert between ChEMBL IDs and internal molregno identifiers.
# ChEMBL ID to molregno
chembl_id = 'CHEMBL25'
molregno = chembl.chembl_id_to_molregno(chembl_id)
print(f"{chembl_id} -> molregno: {molregno}")
# molregno to ChEMBL ID
converted_id = chembl.molregno_to_chembl_id(molregno)
print(f"molregno {molregno} -> {converted_id}")
CHEMBL25 -> molregno: 1280
molregno 1280 -> CHEMBL25
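A quick sanity check on the two conversions is to confirm that an ID survives the round trip. The sketch below takes the two converters as arguments so it can run here with dictionary stand-ins built from values shown in this tutorial. It assumes a converter returns None for an unknown ID, which this tutorial does not confirm for the real methods.

```python
def round_trips(chembl_id, to_molregno, to_chembl_id) -> bool:
    """Return True if chembl_id -> molregno -> chembl_id gives back the same ID."""
    molregno = to_molregno(chembl_id)
    if molregno is None:  # assumed signal for "not found"
        return False
    return to_chembl_id(molregno) == chembl_id


# Dictionary stand-ins for the database lookups (values from this tutorial).
forward = {"CHEMBL25": 1280, "CHEMBL521": 11674}
backward = {molregno: cid for cid, molregno in forward.items()}

print(round_trips("CHEMBL25", forward.get, backward.get))  # True
print(round_trips("CHEMBL0", forward.get, backward.get))   # False
```

With a loaded database, the same check would be round_trips('CHEMBL25', chembl.chembl_id_to_molregno, chembl.molregno_to_chembl_id).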
6. Complete Workflow Example
Let's search for a drug, retrieve its properties, and analyze them.
# Search for ibuprofen
results = chembl.search_by_name('ibuprofen')
if results:
    ibuprofen = results[0]
    print(f"===== {ibuprofen['pref_name']} =====")
    print(f"\nIdentifiers:")
    print(f"  ChEMBL ID: {ibuprofen['chembl_id']}")
    print(f"  Molregno: {ibuprofen['molregno']}")
    print(f"\nStructure:")
    print(f"  SMILES: {ibuprofen['canonical_smiles']}")
    print(f"  InChI Key: {ibuprofen['standard_inchi_key']}")
    # Get properties
    props = chembl.get_properties(ibuprofen['molregno'])
    if props:
        print(f"\nProperties:")
        print(f"  MW: {props['mw_freebase']:.2f}")
        print(f"  LogP: {props['alogp']:.2f}")
        print(f"  HBA: {props['hba']}")
        print(f"  HBD: {props['hbd']}")
        print(f"  PSA: {props['psa']:.2f}")
        # Lipinski's Rule of Five: only a value above the limit is a violation
        print(f"\nLipinski's Rule of Five:")
        print(f"  MW <= 500: {props['mw_freebase'] <= 500}")
        print(f"  LogP <= 5: {props['alogp'] <= 5}")
        print(f"  HBA <= 10: {props['hba'] <= 10}")
        print(f"  HBD <= 5: {props['hbd'] <= 5}")
        print(f"  Violations: {props['num_ro5_violations']}")
===== IBUPROFEN =====

Identifiers:
  ChEMBL ID: CHEMBL521
  Molregno: 11674

Structure:
  SMILES: CC(C)Cc1ccc(C(C)C(=O)O)cc1
  InChI Key: HEFNNWSXXWATRW-UHFFFAOYSA-N

Properties:
  MW: 206.28
  LogP: 3.07
  HBA: 1
  HBD: 1
  PSA: 37.30

Lipinski's Rule of Five:
  MW <= 500: True
  LogP <= 5: True
  HBA <= 10: True
  HBD <= 5: True
  Violations: 0
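The individual rule-of-five checks can be folded into one helper that reports which limits a compound exceeds. The sketch below is a hypothetical function operating on a plain dictionary with the property keys used above. It counts a violation only when a value is strictly above its limit and skips missing values. The precomputed num_ro5_violations field in ChEMBL may be derived differently, so the two need not always agree. The second example compound is invented for illustration.

```python
# Upper limits from Lipinski's Rule of Five, keyed by the property names used above.
RO5_LIMITS = {"mw_freebase": 500, "alogp": 5, "hba": 10, "hbd": 5}


def ro5_violations(props: dict) -> list:
    """Return the property keys whose values exceed their rule-of-five limit."""
    violations = []
    for key, limit in RO5_LIMITS.items():
        value = props.get(key)
        if value is None:
            continue  # a missing value cannot be judged, so it is skipped
        if value > limit:
            violations.append(key)
    return violations


# Ibuprofen values as printed above.
ibuprofen_props = {"mw_freebase": 206.28, "alogp": 3.07, "hba": 1, "hbd": 1}
print(ro5_violations(ibuprofen_props))  # []

# Invented compound with two violations and one missing value.
heavy_props = {"mw_freebase": 650.0, "alogp": 6.2, "hba": 8, "hbd": None}
print(ro5_violations(heavy_props))  # ['mw_freebase', 'alogp']
```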
7. Batch Processing Example
Process multiple compounds at once.
import pandas as pd
# List of common drugs by ChEMBL ID
drug_ids = [
    'CHEMBL25',   # Aspirin
    'CHEMBL521',  # Ibuprofen
    'CHEMBL112',  # Acetaminophen
    'CHEMBL113',  # Caffeine
]
# Collect data
drug_data = []
for chembl_id in drug_ids:
    compound = chembl.search_by_chembl_id(chembl_id)
    if compound:
        props = chembl.get_properties(compound['molregno'])
        if props:
            drug_data.append({
                'ChEMBL ID': chembl_id,
                'Name': compound['pref_name'],
                'MW': props['mw_freebase'],
                'LogP': props['alogp'],
                'HBA': props['hba'],
                'HBD': props['hbd'],
                'PSA': props['psa'],
                'Ro5 Violations': props['num_ro5_violations']
            })
# Display as table
df = pd.DataFrame(drug_data)
print(df.to_string(index=False))
ChEMBL ID          Name     MW  LogP  HBA  HBD   PSA  Ro5 Violations
 CHEMBL25       ASPIRIN 180.16  1.31    3    1 63.60               0
CHEMBL521     IBUPROFEN 206.28  3.07    1    1 37.30               0
CHEMBL112 ACETAMINOPHEN 151.16  1.35    2    2 49.33               0
CHEMBL113      CAFFEINE 194.19 -1.03    6    0 61.82               0
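The loop above silently skips IDs that return no compound or no properties. For longer lists it may be worth recording which IDs were dropped. The sketch below is one possible shape for that, written against two lookup callables so it runs here with dictionary stand-ins. It assumes both lookups return a falsy value for unknown input, as the `if compound:` and `if props:` checks above suggest, and it keeps only a few columns for brevity.

```python
def collect_records(ids, lookup, get_props):
    """Build table rows for the given IDs and list the IDs that yielded no data."""
    rows, missing = [], []
    for chembl_id in ids:
        compound = lookup(chembl_id)
        props = get_props(compound["molregno"]) if compound else None
        if not props:
            missing.append(chembl_id)  # no compound or no properties found
            continue
        rows.append({
            "ChEMBL ID": chembl_id,
            "Name": compound["pref_name"],
            "MW": props["mw_freebase"],
        })
    return rows, missing


# Dictionary stand-ins for the database (values from this tutorial).
compounds = {"CHEMBL25": {"molregno": 1280, "pref_name": "ASPIRIN"}}
properties = {1280: {"mw_freebase": 180.16}}

rows, missing = collect_records(["CHEMBL25", "CHEMBL0"], compounds.get, properties.get)
print(rows)     # [{'ChEMBL ID': 'CHEMBL25', 'Name': 'ASPIRIN', 'MW': 180.16}]
print(missing)  # ['CHEMBL0']
```

With a loaded database, the call would be collect_records(drug_ids, chembl.search_by_chembl_id, chembl.get_properties), and the rows can be passed to pd.DataFrame as before.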
8. Advanced: Exploring Multiple Synonyms
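Compounds can have multiple names and synonyms. The results below suggest that the name search also matches synonyms, since some hits (for example ASPIRIN) do not contain the query in their preferred name.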
# Search for compounds with 'acetyl' in the name
results = chembl.search_by_name('acetyl', limit=5)
print(f"Found {len(results)} compounds with 'acetyl'\n")
for i, compound in enumerate(results, 1):
    print(f"{i}. {compound['pref_name']} ({compound['chembl_id']})")
    props = chembl.get_properties(compound['molregno'])
    if props:
        mw = props.get('mw_freebase')
        alogp = props.get('alogp')
        mw_str = f"{mw:.1f}" if isinstance(mw, (int, float)) else "N/A"
        alogp_str = f"{alogp:.2f}" if isinstance(alogp, (int, float)) else "N/A"
        print(f"   MW: {mw_str}, LogP: {alogp_str}")
    print()
Found 5 compounds with 'acetyl'

1. ASPIRIN (CHEMBL25)
   MW: 180.2, LogP: 1.31

2. SULFAMETHOXAZOLE (CHEMBL443)
   MW: 253.3, LogP: 1.37

3. ACETYLROTENOLENE (CHEMBL268834)
   MW: 452.5, LogP: 3.38

4. MELATONIN (CHEMBL45)
   MW: 232.3, LogP: 1.86

5. MERCAPTOACETYLTRIGLYCINE (CHEMBL9080)
   MW: 373.2, LogP: N/A
Summary
In this tutorial, we covered:
- ✓ Initializing the ChEMBL database
- ✓ Searching by ChEMBL ID, name, InChI, InChI Key, and SMILES
- ✓ Retrieving molecular properties
- ✓ Converting between ID formats
- ✓ Complete workflow examples
- ✓ Batch processing multiple compounds
Next Steps
- Explore the database schema in src/provesid/data/schema_documentation.txt
- Check out bioactivity data tables (activities, assays, targets)
- Combine with other PROVESID tools (PubChem, ChEBI, etc.)
Resources
- ChEMBL Database: https://www.ebi.ac.uk/chembl/
- ChEMBL Documentation: https://chembl.gitbook.io/chembl-interface-documentation/
- PROVESID Documentation: [Link to your docs]