ChEMBL Database Interface¶
provesid.chembl.CheMBL¶
Interface to the ChEMBL SQLite database for chemical compound queries.
The ChEMBL database contains manually curated bioactive compounds with drug-like properties. This class provides methods to search compounds by various identifiers and retrieve structural and property information.
Parameters¶
db_name : str, optional
    Name of the SQLite database file (default: 'chembl_36.db')
auto_download : bool, optional
    If True, automatically download the database if it is not found (default: True)
db_url : str, optional
    Custom URL for the database download (default: ChEMBL FTP URL)
Attributes¶
path : str
    Path to the data directory
db_path : str
    Full path to the SQLite database file
conn : sqlite3.Connection
    SQLite database connection
cursor : sqlite3.Cursor
    Database cursor for queries
Examples¶
chembl = CheMBL()
compound = chembl.search_by_chembl_id('CHEMBL25')  # Aspirin
print(compound['pref_name'])  # 'ASPIRIN'
props = chembl.get_properties(compound['molregno'])
print(f"MW: {props['mw_freebase']}")  # MW: 180.16
Notes¶
The database file is approximately 5 GB. On first use, the class downloads the compressed archive (~1.5 GB) from the EMBL-EBI FTP server and extracts the database from it.
Functions¶
__init__(db_name='chembl_36.db', auto_download=True, db_url=None)¶
Initialize ChEMBL database interface.
Parameters¶
db_name : str, optional
    Database filename (default: 'chembl_36.db')
auto_download : bool, optional
    Auto-download if the database is missing (default: True)
db_url : str, optional
    Custom download URL (default: ChEMBL FTP)
Raises¶
FileNotFoundError
    If the database is not found and auto_download is False
ChEMBLError
    If the database connection or validation fails
download_database(url=None, force=False)¶
Download and extract ChEMBL SQLite database from EMBL-EBI FTP.
Downloads the compressed tar.gz archive (~1.5GB), extracts the SQLite database (~5GB), and validates the database integrity by querying the molecule_dictionary table.
Parameters¶
url : str, optional
    Download URL (default: DEFAULT_DB_URL)
force : bool, optional
    If True, re-download even if the database exists (default: False)
Raises¶
ChEMBLError If download, extraction, or validation fails
Examples¶
chembl = CheMBL(auto_download=False)  # Raises FileNotFoundError if the database is missing
chembl.download_database(force=True)  # Explicit download
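The extract-and-validate steps described for download_database can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the helper names `extract_sqlite_db` and `validate_chembl_db` are hypothetical, and the sketch assumes the archive contains a single `.db` file.

```python
import os
import shutil
import sqlite3
import tarfile


def extract_sqlite_db(archive_path, dest_dir):
    """Copy the first *.db member of a tar.gz archive into dest_dir.

    Hypothetical helper: assumes the archive holds one SQLite file.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".db"):
                # Drop the archive's directory layout and keep only the file name.
                target = os.path.join(dest_dir, os.path.basename(member.name))
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError("no .db file found in archive")


def validate_chembl_db(db_path):
    """Return True if the molecule_dictionary table can be queried."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("SELECT COUNT(*) FROM molecule_dictionary").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        # Covers both "not a database" and "no such table".
        return False
```

Streaming the member with `shutil.copyfileobj` avoids loading a multi-gigabyte file into memory, which matters for an archive of this size.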
search_by_chembl_id(chembl_id)¶
Search for compound by ChEMBL ID.
Parameters¶
chembl_id : str ChEMBL identifier (e.g., 'CHEMBL25' for aspirin)
Returns¶
dict or None Compound information including structure, or None if not found
Examples¶
chembl = CheMBL()
aspirin = chembl.search_by_chembl_id('CHEMBL25')
print(aspirin['pref_name'])  # 'ASPIRIN'
search_by_name(name, limit=100)¶
Search for compounds by name (case-insensitive partial match).
Searches both preferred names and synonyms.
Parameters¶
name : str
    Compound name or partial name to search
limit : int, optional
    Maximum number of results (default: 100)
Returns¶
list of dict List of matching compounds with structure information
Examples¶
chembl = CheMBL()
results = chembl.search_by_name('aspirin')
print(len(results))  # 1
print(results[0].get("chembl_id"))  # 'CHEMBL25'
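A case-insensitive partial match over preferred names and synonyms can be expressed as a single LIKE query. The sketch below runs against a toy in-memory database whose table and column names follow the schema described on this page; the function name `search_by_name_sketch`, the SQL, and the sample rows are assumptions for illustration and may differ from what the library actually executes.

```python
import sqlite3

# Toy stand-in for the real database (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (
        molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT);
    CREATE TABLE molecule_synonyms (
        molregno INTEGER, synonyms TEXT, syn_type TEXT);
    INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL25', 'ASPIRIN');
    INSERT INTO molecule_dictionary VALUES (2, 'CHEMBL113', 'CAFFEINE');
    INSERT INTO molecule_synonyms VALUES (1, 'Acetylsalicylic acid', 'OTHER');
    INSERT INTO molecule_synonyms VALUES (2, 'Guaranine', 'OTHER');
    """
)


def search_by_name_sketch(conn, name, limit=100):
    """Partial match on pref_name or any synonym (hypothetical helper)."""
    pattern = f"%{name}%"  # SQLite LIKE ignores ASCII case by default
    rows = conn.execute(
        """
        SELECT DISTINCT md.molregno, md.chembl_id, md.pref_name
        FROM molecule_dictionary AS md
        LEFT JOIN molecule_synonyms AS ms ON ms.molregno = md.molregno
        WHERE md.pref_name LIKE ? OR ms.synonyms LIKE ?
        ORDER BY md.molregno
        LIMIT ?
        """,
        (pattern, pattern, limit),
    ).fetchall()
    return [dict(row) for row in rows]


hits = search_by_name_sketch(conn, "acetylsalicylic")
```

DISTINCT keeps a compound from appearing once per matching synonym, and binding the pattern as a parameter avoids building SQL from user input.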
search_by_inchi(inchi)¶
Search for compound by Standard InChI.
Parameters¶
inchi : str Standard InChI string
Returns¶
dict or None Compound information, or None if not found
Examples¶
chembl = CheMBL()
inchi = 'InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)'
compound = chembl.search_by_inchi(inchi)
print(compound['chembl_id'])  # 'CHEMBL25'
search_by_inchikey(inchikey)¶
Search for compound by Standard InChI Key.
Parameters¶
inchikey : str Standard InChI Key (e.g., 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N')
Returns¶
dict or None Compound information, or None if not found
Examples¶
chembl = CheMBL()
compound = chembl.search_by_inchikey('BSYNRYMUTXBXSQ-UHFFFAOYSA-N')
print(compound['pref_name'])  # 'ASPIRIN'
search_by_smiles(smiles)¶
Search for compound by canonical SMILES.
Note: This performs exact string matching. For similarity searches, consider using RDKit or other cheminformatics tools.
Parameters¶
smiles : str Canonical SMILES string
Returns¶
dict or None Compound information, or None if not found
Examples¶
chembl = CheMBL()
compound = chembl.search_by_smiles('CC(=O)Oc1ccccc1C(=O)O')
print(compound['chembl_id'])  # 'CHEMBL25'
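Exact string matching means that a valid SMILES written in a different atom order will not match the stored canonical form. The toy example below illustrates this with an in-memory table; the helper `find_by_smiles` and the single sample row are assumptions for illustration, not the library's code.

```python
import sqlite3

# Toy table mirroring compound_structures (one illustrative row).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compound_structures ("
    "molregno INTEGER, canonical_smiles TEXT, standard_inchi_key TEXT)"
)
conn.execute(
    "INSERT INTO compound_structures VALUES (?, ?, ?)",
    (1, "CC(=O)Oc1ccccc1C(=O)O", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
)


def find_by_smiles(conn, smiles):
    """Exact string comparison on canonical_smiles (hypothetical helper)."""
    row = conn.execute(
        "SELECT molregno FROM compound_structures WHERE canonical_smiles = ?",
        (smiles,),
    ).fetchone()
    return row[0] if row else None


stored_form = "CC(=O)Oc1ccccc1C(=O)O"
other_form = "O=C(O)c1ccccc1OC(C)=O"  # also aspirin, different atom order

match = find_by_smiles(conn, stored_form)   # found
no_match = find_by_smiles(conn, other_form)  # None: the strings differ
```

When the input SMILES comes from another toolkit, looking the compound up by InChI Key is likely to be more robust, since the key does not depend on atom ordering.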
get_compound(molregno)¶
Get complete compound information by internal molregno.
Retrieves data from molecule_dictionary, compound_structures, and molecule_synonyms tables.
Parameters¶
molregno : int Internal molecule registry number
Returns¶
dict or None
    Dictionary with compound information including:
    - molregno, chembl_id, pref_name, max_phase
    - canonical_smiles, standard_inchi, standard_inchi_key
    - molfile (if available)
    - synonyms: list of alternative names
    Returns None if not found
Examples¶
chembl = CheMBL()
compound = chembl.get_compound(15)
print(compound['pref_name'])  # 'ASPIRIN'
print(compound['canonical_smiles'])  # 'CC(=O)Oc1ccccc1C(=O)O'
print(compound.get("synonyms", [])[:3])  # ['Acetylsalicylic acid', 'Aspirin', '2-Acetoxybenzoic acid']
get_properties(molregno)¶
Get physicochemical properties for a compound.
Parameters¶
molregno : int Internal molecule registry number
Returns¶
dict or None
    Dictionary with properties including:
    - mw_freebase: Molecular weight
    - alogp: Calculated LogP
    - hba: Hydrogen bond acceptors
    - hbd: Hydrogen bond donors
    - psa: Polar surface area
    - rtb: Rotatable bonds
    - ro3_pass: Rule of 3 compliance
    - num_ro5_violations: Lipinski violations
    - aromatic_rings, heavy_atoms, etc.
    Returns None if not found
Examples¶
chembl = CheMBL()
props = chembl.get_properties(15)  # Aspirin
print(f"MW: {props['mw_freebase']:.2f}")  # MW: 180.16
print(f"LogP: {props['alogp']:.2f}")  # LogP: 1.19
chembl_id_to_molregno(chembl_id)¶
provesid.chembl.ChEMBLError¶
Bases: Exception
Custom exception for ChEMBL database errors
Overview¶
The ChEMBL module provides access to the ChEMBL SQLite database, a manually curated database of bioactive molecules with drug-like properties maintained by EMBL-EBI. The database contains over 2.3 million compounds with chemical structures, properties, and bioactivity data.
Database Information¶
- Database: ChEMBL v36
- Format: SQLite (~5GB uncompressed)
- Source: EMBL-EBI FTP
- Auto-download: Yes (on first use)
- Compressed size: ~1.5GB
Key Features¶
- Multiple search methods: Search by ChEMBL ID, name, InChI, InChI Key, or SMILES
- Local database: Fast queries with no API rate limits
- Offline access: Works offline after initial download
- Comprehensive data: Structures, properties, identifiers, and metadata
- Easy integration: Simple Python API consistent with other PROVESID modules
Database Schema¶
The CheMBL class queries the following main tables:
Core Tables¶
- molecule_dictionary: Primary compound information
    - molregno: Internal molecule registry number (primary key)
    - chembl_id: ChEMBL identifier (e.g., CHEMBL25)
    - pref_name: Preferred compound name
    - max_phase: Maximum clinical trial phase (0-4)
    - therapeutic_flag: Drug/therapeutic indicator
    - molecule_type: Type classification
- compound_structures: Chemical structure representations
    - molregno: Foreign key to molecule_dictionary
    - canonical_smiles: Canonical SMILES string
    - standard_inchi: Standard InChI representation
    - standard_inchi_key: Standard InChI Key
    - molfile: Molfile structure data
- compound_properties: Physicochemical properties
    - molregno: Foreign key to molecule_dictionary
    - mw_freebase: Molecular weight
    - alogp: Calculated LogP (lipophilicity)
    - hba: Hydrogen bond acceptors
    - hbd: Hydrogen bond donors
    - psa: Polar surface area
    - rtb: Rotatable bonds
    - aromatic_rings: Number of aromatic rings
    - heavy_atoms: Heavy atom count
    - num_ro5_violations: Lipinski Rule of Five violations
- molecule_synonyms: Alternative compound names
    - molregno: Foreign key to molecule_dictionary
    - synonyms: Synonym/alternative name
    - syn_type: Type of synonym
    - Note: All compound lookups automatically include a list of synonyms
- chembl_id_lookup: ChEMBL ID mapping table
    - chembl_id: ChEMBL identifier
    - entity_type: Type of entity (COMPOUND, TARGET, ASSAY, etc.)
    - entity_id: Internal ID (e.g., molregno for compounds)
For complete schema documentation, see src/provesid/data/schema_documentation.txt.
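Because every core table is keyed on molregno, a compound's name, structure, and properties can be assembled with plain joins. The sketch below builds a toy in-memory database with a subset of the columns listed above and joins the three tables; the sample row and the exact SQL are illustrative assumptions, not the queries the class runs.

```python
import sqlite3

# Toy database with a subset of the documented columns (illustrative data).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE molecule_dictionary (
        molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT);
    CREATE TABLE compound_structures (
        molregno INTEGER, canonical_smiles TEXT, standard_inchi_key TEXT);
    CREATE TABLE compound_properties (
        molregno INTEGER, mw_freebase REAL, alogp REAL);
    INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL25', 'ASPIRIN');
    INSERT INTO compound_structures VALUES (
        1, 'CC(=O)Oc1ccccc1C(=O)O', 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N');
    INSERT INTO compound_properties VALUES (1, 180.16, 1.19);
    """
)

# LEFT JOINs keep the compound even if structure or property rows are absent.
row = conn.execute(
    """
    SELECT md.chembl_id, md.pref_name, cs.canonical_smiles,
           cp.mw_freebase, cp.alogp
    FROM molecule_dictionary AS md
    LEFT JOIN compound_structures AS cs ON cs.molregno = md.molregno
    LEFT JOIN compound_properties AS cp ON cp.molregno = md.molregno
    WHERE md.chembl_id = ?
    """,
    ("CHEMBL25",),
).fetchone()

record = dict(row)
```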
Quick Start¶
from provesid import CheMBL
chembl = CheMBL()
# Search by ChEMBL ID
compound = chembl.search_by_chembl_id('CHEMBL25') # Aspirin
print(compound['pref_name']) # 'ASPIRIN'
# Get molecular properties
props = chembl.get_properties(compound['molregno'])
print(f"MW: {props['mw_freebase']:.2f}")
print(f"LogP: {props['alogp']:.2f}")
# View synonyms
print(f"Synonyms: {compound['synonyms'][:5]}") # First 5 synonyms
Usage Examples¶
Example 1: Search by Name¶
from provesid import CheMBL
chembl = CheMBL()
# Search for compounds by name
results = chembl.search_by_name('caffeine')
for compound in results:
    print(f"{compound['chembl_id']}: {compound['pref_name']}")
    print(f"  SMILES: {compound['canonical_smiles']}")
    if compound['synonyms']:
        print(f"  Synonyms: {', '.join(compound['synonyms'][:3])}")
Example 2: Structure-Based Search¶
from provesid import CheMBL
chembl = CheMBL()
# Search by SMILES
smiles = 'CC(=O)Oc1ccccc1C(=O)O'
compound = chembl.search_by_smiles(smiles)
# Search by InChI Key
inchikey = 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'
compound = chembl.search_by_inchikey(inchikey)
# Both return the same compound (aspirin)
print(compound['chembl_id']) # 'CHEMBL25'
Example 3: Property Analysis¶
from provesid import CheMBL
chembl = CheMBL()
# Get compound and properties
compound = chembl.search_by_chembl_id('CHEMBL25')
props = chembl.get_properties(compound['molregno'])
# Check Lipinski's Rule of Five (each limit is inclusive)
print("Lipinski's Rule of Five:")
print(f"  MW <= 500: {props['mw_freebase'] <= 500}")
print(f"  LogP <= 5: {props['alogp'] <= 5}")
print(f"  HBA <= 10: {props['hba'] <= 10}")
print(f"  HBD <= 5: {props['hbd'] <= 5}")
print(f"  Total violations: {props['num_ro5_violations']}")
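The per-rule checks can be wrapped in a small helper that also tolerates missing values. This sketch assumes that a property may come back as None for some compounds and treats each limit as inclusive; `count_ro5_violations` is a hypothetical name, and its result is not guaranteed to equal the database's own num_ro5_violations column, which may be computed differently.

```python
# Inclusive upper limits for Lipinski's Rule of Five, keyed by the
# property names used in this documentation.
RO5_LIMITS = {"mw_freebase": 500, "alogp": 5, "hba": 10, "hbd": 5}


def count_ro5_violations(props):
    """Count properties that exceed their limit; None values are skipped."""
    violations = 0
    for key, limit in RO5_LIMITS.items():
        value = props.get(key)
        if value is not None and value > limit:
            violations += 1
    return violations


# Illustrative property dictionaries (not taken from the database).
small_molecule = {"mw_freebase": 180.16, "alogp": 1.19, "hba": 3, "hbd": 1}
large_lipophilic = {"mw_freebase": 650.0, "alogp": 6.2, "hba": 4, "hbd": None}

small_count = count_ro5_violations(small_molecule)
large_count = count_ro5_violations(large_lipophilic)
```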
Example 4: Batch Processing¶
from provesid import CheMBL
import pandas as pd
chembl = CheMBL()
# Process multiple compounds
drug_ids = ['CHEMBL25', 'CHEMBL521', 'CHEMBL112']
data = []
for chembl_id in drug_ids:
    compound = chembl.search_by_chembl_id(chembl_id)
    if compound:
        props = chembl.get_properties(compound['molregno'])
        data.append({
            'ChEMBL ID': chembl_id,
            'Name': compound['pref_name'],
            'MW': props['mw_freebase'],
            'LogP': props['alogp']
        })
df = pd.DataFrame(data)
print(df)
Example 5: ID Conversion¶
from provesid import CheMBL
chembl = CheMBL()
# Convert ChEMBL ID to internal molregno
molregno = chembl.chembl_id_to_molregno('CHEMBL25')
print(f"CHEMBL25 -> molregno: {molregno}")
# Convert back
chembl_id = chembl.molregno_to_chembl_id(molregno)
print(f"molregno {molregno} -> {chembl_id}")
Manual Database Download¶
If you prefer to download the database manually:
from provesid import CheMBL
# Initialize without auto-download (raises FileNotFoundError if the database file is missing)
chembl = CheMBL(auto_download=False)
# Or, on an existing instance, force a fresh download
chembl.download_database(force=True)
Download from command line:
cd src/provesid/data
wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_36_sqlite.tar.gz
tar -xzf chembl_36_sqlite.tar.gz
mv chembl_36/chembl_36.db .
Performance Notes¶
- First query: May take a few seconds as SQLite loads indexes
- Subsequent queries: Very fast (local SQLite, no network overhead)
- Name searches: May be slower due to LIKE queries and synonym matching
- Exact matches: InChI Key and SMILES searches are fast (indexed)
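The difference between an indexed exact match and a LIKE scan can be inspected with SQLite's EXPLAIN QUERY PLAN. The sketch below uses a toy in-memory schema; the index name `idx_inchi_key` is hypothetical, and which indexes the downloaded ChEMBL file actually contains should be checked against the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE compound_structures (
        molregno INTEGER PRIMARY KEY,
        canonical_smiles TEXT,
        standard_inchi_key TEXT);
    CREATE INDEX idx_inchi_key ON compound_structures (standard_inchi_key);
    CREATE TABLE molecule_dictionary (
        molregno INTEGER PRIMARY KEY, pref_name TEXT);
    """
)


def query_plan(conn, sql, params):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Equality on an indexed column: SQLite can seek directly via the index.
exact_plan = query_plan(
    conn,
    "SELECT * FROM compound_structures WHERE standard_inchi_key = ?",
    ("BSYNRYMUTXBXSQ-UHFFFAOYSA-N",),
)

# Substring LIKE: every row has to be examined.
like_plan = query_plan(
    conn,
    "SELECT * FROM molecule_dictionary WHERE pref_name LIKE ?",
    ("%aspirin%",),
)

print(exact_plan)
print(like_plan)
```

Running the same helper against the real database file is a quick way to confirm whether a slow query is using an index.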
Error Handling¶
from provesid import CheMBL, ChEMBLError
try:
    chembl = CheMBL()
    compound = chembl.search_by_chembl_id('CHEMBL25')
    if compound is None:
        print("Compound not found")
    else:
        print(f"Found: {compound['pref_name']}")
except ChEMBLError as e:
    print(f"ChEMBL error: {e}")
except FileNotFoundError as e:
    print(f"Database not found: {e}")
Comparison with Other Data Sources¶
| Feature | ChEMBL | PubChem | ChEBI |
|---|---|---|---|
| Database Size | 2.3M compounds | 110M+ compounds | 190K+ entities |
| Focus | Bioactive drugs | All chemistry | Biology-focused |
| API | Local SQLite | REST API | REST API + SDF |
| Offline | Yes | No | Partial (SDF) |
| Speed | Very fast | Rate limited | Moderate |
| Bioactivity | Yes | Yes | Limited |
References¶
- ChEMBL Database: https://www.ebi.ac.uk/chembl/
- ChEMBL Documentation: https://chembl.gitbook.io/chembl-interface-documentation/
- Database Downloads: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/
- Schema Documentation: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/schema_documentation.html
See Also¶
- PubChem API - For broader compound coverage
- ChEBI - For biological entities and ontology
- NCI Resolver - For identifier resolution