ChEMBL Database Interface

provesid.chembl.CheMBL

Interface to the ChEMBL SQLite database for chemical compound queries.

The ChEMBL database contains manually curated bioactive compounds with drug-like properties. This class provides methods to search compounds by various identifiers and retrieve structural and property information.

Parameters

db_name : str, optional
    Name of the SQLite database file (default: 'chembl_36.db')
auto_download : bool, optional
    If True, automatically download database if not found (default: True)
db_url : str, optional
    Custom URL for database download (default: ChEMBL FTP URL)

Attributes

path : str
    Path to the data directory
db_path : str
    Full path to the SQLite database file
conn : sqlite3.Connection
    SQLite database connection
cursor : sqlite3.Cursor
    Database cursor for queries

Examples

chembl = CheMBL()
compound = chembl.search_by_chembl_id('CHEMBL25')  # Aspirin
print(compound['pref_name'])  # 'ASPIRIN'
props = chembl.get_properties(compound['molregno'])
print(f"MW: {props['mw_freebase']}")  # MW: 180.16

Notes

The database file is approximately 5GB. Initial setup will download and extract the database from the EMBL-EBI FTP server (~1.5GB compressed).
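
Given the sizes quoted above, it can be worth checking free disk space before the first run. The following is a minimal sketch using only the standard library; the helper name `has_room_for_chembl` and the idea of requiring room for both the archive and the extracted file are assumptions for illustration, not part of PROVESID.

```python
import shutil

# Approximate sizes taken from the note above.
ARCHIVE_BYTES = int(1.5 * 1024**3)   # ~1.5GB compressed download
DATABASE_BYTES = 5 * 1024**3         # ~5GB extracted SQLite file


def has_room_for_chembl(directory=".", free_bytes=None):
    """Return True if there is room for the archive plus the extracted database.

    `free_bytes` can be passed explicitly (handy for testing); otherwise the
    free space of `directory` is read with shutil.disk_usage.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= ARCHIVE_BYTES + DATABASE_BYTES


print("Enough space here:", has_room_for_chembl("."))
```

The threshold is deliberately conservative: it assumes the archive and the extracted database coexist on disk for a while.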

Functions

__init__(db_name='chembl_36.db', auto_download=True, db_url=None)

Initialize ChEMBL database interface.

Parameters

db_name : str, optional
    Database filename (default: 'chembl_36.db')
auto_download : bool, optional
    Auto-download if database missing (default: True)
db_url : str, optional
    Custom download URL (default: ChEMBL FTP)

Raises

FileNotFoundError
    If database not found and auto_download is False
ChEMBLError
    If database connection or validation fails

download_database(url=None, force=False)

Download and extract ChEMBL SQLite database from EMBL-EBI FTP.

Downloads the compressed tar.gz archive (~1.5GB), extracts the SQLite database (~5GB), and validates the database integrity by querying the molecule_dictionary table.
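
The extract-and-validate part of that sequence can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PROVESID implementation: the function name `extract_and_validate` is hypothetical, the network download is left out, and the demo runs on a tiny stand-in archive built on the spot.

```python
import os
import shutil
import sqlite3
import tarfile
import tempfile


def extract_and_validate(archive_path, dest_dir, db_name):
    """Extract `db_name` from a .tar.gz archive and run a sanity query.

    Returns the row count of molecule_dictionary (hypothetical helper).
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # Find the database file, whatever folder it sits in inside the archive.
        member = next(
            (m for m in tar.getmembers()
             if m.isfile() and os.path.basename(m.name) == db_name),
            None,
        )
        if member is None:
            raise FileNotFoundError(f"{db_name} not found in {archive_path}")
        # Stream the member to dest_dir/db_name, dropping the archive's folders.
        db_path = os.path.join(dest_dir, db_name)
        with tar.extractfile(member) as src, open(db_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    # Validation: the file must open as SQLite and contain the expected table.
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM molecule_dictionary"
        ).fetchone()
    finally:
        conn.close()
    return count


# Demo on a stand-in archive, so nothing is downloaded.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "chembl_demo")
    os.makedirs(src_dir)
    demo = sqlite3.connect(os.path.join(src_dir, "demo.db"))
    demo.execute(
        "CREATE TABLE molecule_dictionary "
        "(molregno INTEGER PRIMARY KEY, chembl_id TEXT)"
    )
    demo.execute("INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL_DEMO')")
    demo.commit()
    demo.close()

    archive = os.path.join(tmp, "demo_sqlite.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname="chembl_demo")

    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    row_count = extract_and_validate(archive, out_dir, "demo.db")
    print("Rows in molecule_dictionary:", row_count)
```

Streaming the single member instead of extracting the whole archive avoids writing unrelated files to disk and sidesteps path issues from archive-internal folder names.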

Parameters

url : str, optional
    Download URL (default: DEFAULT_DB_URL)
force : bool, optional
    If True, re-download even if database exists (default: False)

Raises

ChEMBLError
    If download, extraction, or validation fails

Examples

chembl = CheMBL(auto_download=False)  # Raises FileNotFoundError if the database file is missing
chembl.download_database(force=True)  # Explicit download; force=True re-downloads an existing database

search_by_chembl_id(chembl_id)

Search for compound by ChEMBL ID.

Parameters

chembl_id : str
    ChEMBL identifier (e.g., 'CHEMBL25' for aspirin)

Returns

dict or None
    Compound information including structure, or None if not found

Examples

chembl = CheMBL()
aspirin = chembl.search_by_chembl_id('CHEMBL25')
print(aspirin['pref_name'])  # 'ASPIRIN'

search_by_name(name, limit=100)

Search for compounds by name (case-insensitive partial match).

Searches both preferred names and synonyms.
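
A case-insensitive partial match over preferred names and synonyms can be expressed as a single SQL query with `LIKE`. The sketch below runs against a toy in-memory database that borrows the table and column names from the schema section; the function name `search_by_name_sketch`, the toy molregno values, and the exact query are assumptions, and the real method may be implemented differently.

```python
import sqlite3

# Toy in-memory stand-in for the real database (molregno values are made up).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE molecule_dictionary (
    molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT);
CREATE TABLE molecule_synonyms (
    molregno INTEGER, synonyms TEXT, syn_type TEXT);
INSERT INTO molecule_dictionary VALUES
    (1, 'CHEMBL25', 'ASPIRIN'),
    (2, 'CHEMBL113', 'CAFFEINE');
INSERT INTO molecule_synonyms VALUES
    (1, 'Acetylsalicylic acid', 'OTHER'),
    (2, 'Guaranine', 'OTHER');
""")


def search_by_name_sketch(conn, name, limit=100):
    """Partial, case-insensitive match on pref_name or any synonym."""
    pattern = f"%{name}%"  # SQLite's LIKE ignores case for ASCII by default
    rows = conn.execute(
        """
        SELECT DISTINCT md.molregno, md.chembl_id, md.pref_name
        FROM molecule_dictionary AS md
        LEFT JOIN molecule_synonyms AS ms ON ms.molregno = md.molregno
        WHERE md.pref_name LIKE ? OR ms.synonyms LIKE ?
        ORDER BY md.molregno
        LIMIT ?
        """,
        (pattern, pattern, limit),
    ).fetchall()
    return [dict(row) for row in rows]


for hit in search_by_name_sketch(conn, "acetylsal"):
    print(hit["chembl_id"], hit["pref_name"])
```

Note that `%` and `_` inside the search term act as wildcards in `LIKE`, so a name containing them matches more broadly than a literal substring search would.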

Parameters

name : str
    Compound name or partial name to search
limit : int, optional
    Maximum number of results (default: 100)

Returns

list of dict
    List of matching compounds with structure information

Examples

chembl = CheMBL()
results = chembl.search_by_name('aspirin')
print(len(results))  # 1
print(results[0].get("chembl_id"))  # 'CHEMBL25'

search_by_inchi(inchi)

Search for compound by Standard InChI.

Parameters

inchi : str
    Standard InChI string

Returns

dict or None
    Compound information, or None if not found

Examples

chembl = CheMBL()
inchi = 'InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)'
compound = chembl.search_by_inchi(inchi)
print(compound['chembl_id'])  # 'CHEMBL25'

search_by_inchikey(inchikey)

Search for compound by Standard InChI Key.

Parameters

inchikey : str
    Standard InChI Key (e.g., 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N')

Returns

dict or None
    Compound information, or None if not found

Examples

chembl = CheMBL()
compound = chembl.search_by_inchikey('BSYNRYMUTXBXSQ-UHFFFAOYSA-N')
print(compound['pref_name'])  # 'ASPIRIN'

search_by_smiles(smiles)

Search for compound by canonical SMILES.

Note: This performs exact string matching. For similarity searches, consider using RDKit or other cheminformatics tools.
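
The practical consequence of exact string matching is that a valid SMILES written in a different atom order will not be found, even though it describes the same molecule. The sketch below shows this on a toy in-memory table; the helper `find_by_exact_smiles` is hypothetical and only mimics the string comparison.

```python
import sqlite3

# Toy table holding one stored SMILES for aspirin (molregno is made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compound_structures "
    "(molregno INTEGER PRIMARY KEY, canonical_smiles TEXT)"
)
conn.execute(
    "INSERT INTO compound_structures VALUES (1, 'CC(=O)Oc1ccccc1C(=O)O')"
)


def find_by_exact_smiles(conn, smiles):
    """Return the molregno whose stored SMILES equals `smiles` exactly."""
    row = conn.execute(
        "SELECT molregno FROM compound_structures WHERE canonical_smiles = ?",
        (smiles,),
    ).fetchone()
    return row[0] if row else None


# Same string as stored: found.
stored_form = find_by_exact_smiles(conn, "CC(=O)Oc1ccccc1C(=O)O")
# Same molecule, atoms written in another order: not found.
other_form = find_by_exact_smiles(conn, "OC(=O)c1ccccc1OC(C)=O")

print(stored_form, other_form)
```

If the input SMILES comes from another tool, an identifier that is canonical by construction, such as the Standard InChI Key, is usually a more robust lookup key.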

Parameters

smiles : str
    Canonical SMILES string

Returns

dict or None
    Compound information, or None if not found

Examples

chembl = CheMBL()
compound = chembl.search_by_smiles('CC(=O)Oc1ccccc1C(=O)O')
print(compound['chembl_id'])  # 'CHEMBL25'

get_compound(molregno)

Get complete compound information by internal molregno.

Retrieves data from molecule_dictionary, compound_structures, and molecule_synonyms tables.
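
Combining those three tables amounts to one join for the single-row data plus a second query for the synonym list. The following sketch reproduces that shape on a toy in-memory database; `get_compound_sketch` and the toy rows are assumptions for illustration, and the actual method may select more columns or structure the queries differently.

```python
import sqlite3

# Toy in-memory database using table and column names from the schema section.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE molecule_dictionary (
    molregno INTEGER PRIMARY KEY, chembl_id TEXT, pref_name TEXT,
    max_phase INTEGER);
CREATE TABLE compound_structures (
    molregno INTEGER PRIMARY KEY, canonical_smiles TEXT,
    standard_inchi TEXT, standard_inchi_key TEXT);
CREATE TABLE molecule_synonyms (
    molregno INTEGER, synonyms TEXT, syn_type TEXT);
""")
conn.execute(
    "INSERT INTO molecule_dictionary VALUES (?, ?, ?, ?)",
    (15, "CHEMBL25", "ASPIRIN", 4),
)
conn.execute(
    "INSERT INTO compound_structures VALUES (?, ?, ?, ?)",
    (
        15,
        "CC(=O)Oc1ccccc1C(=O)O",
        "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    ),
)
conn.executemany(
    "INSERT INTO molecule_synonyms VALUES (?, ?, ?)",
    [(15, "Acetylsalicylic acid", "OTHER"), (15, "2-Acetoxybenzoic acid", "OTHER")],
)


def get_compound_sketch(conn, molregno):
    """Join dictionary and structure rows, then attach the synonym list."""
    row = conn.execute(
        """
        SELECT md.molregno, md.chembl_id, md.pref_name, md.max_phase,
               cs.canonical_smiles, cs.standard_inchi, cs.standard_inchi_key
        FROM molecule_dictionary AS md
        LEFT JOIN compound_structures AS cs ON cs.molregno = md.molregno
        WHERE md.molregno = ?
        """,
        (molregno,),
    ).fetchone()
    if row is None:
        return None
    compound = dict(row)
    synonym_rows = conn.execute(
        "SELECT DISTINCT synonyms FROM molecule_synonyms "
        "WHERE molregno = ? ORDER BY synonyms",
        (molregno,),
    ).fetchall()
    compound["synonyms"] = [r["synonyms"] for r in synonym_rows]
    return compound


compound = get_compound_sketch(conn, 15)
print(compound["pref_name"], compound["synonyms"])
```

The `LEFT JOIN` keeps compounds that have no structure row, in which case the structure fields come back as `None`.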

Parameters

molregno : int
    Internal molecule registry number

Returns

dict or None
    Dictionary with compound information including:
    - molregno, chembl_id, pref_name, max_phase
    - canonical_smiles, standard_inchi, standard_inchi_key
    - molfile (if available)
    - synonyms: list of alternative names
    Returns None if not found

Examples

chembl = CheMBL()
compound = chembl.get_compound(15)
print(compound['pref_name'])  # 'ASPIRIN'
print(compound['canonical_smiles'])  # 'CC(=O)Oc1ccccc1C(=O)O'
print(compound.get("synonyms", [])[:3])  # ['Acetylsalicylic acid', 'Aspirin', '2-Acetoxybenzoic acid']

get_properties(molregno)

Get physicochemical properties for a compound.

Parameters

molregno : int
    Internal molecule registry number

Returns

dict or None
    Dictionary with properties including:
    - mw_freebase: Molecular weight
    - alogp: Calculated LogP
    - hba: Hydrogen bond acceptors
    - hbd: Hydrogen bond donors
    - psa: Polar surface area
    - rtb: Rotatable bonds
    - ro3_pass: Rule of 3 compliance
    - num_ro5_violations: Lipinski violations
    - aromatic_rings, heavy_atoms, etc.
    Returns None if not found

Examples

chembl = CheMBL()
props = chembl.get_properties(15)  # Aspirin
print(f"MW: {props['mw_freebase']:.2f}")  # MW: 180.16
print(f"LogP: {props['alogp']:.2f}")  # LogP: 1.19

chembl_id_to_molregno(chembl_id)

Convert ChEMBL ID to internal molregno identifier.

Parameters

chembl_id : str
    ChEMBL identifier (e.g., 'CHEMBL25')

Returns

int or None
    Internal molregno ID, or None if not found

Examples

chembl = CheMBL()
molregno = chembl.chembl_id_to_molregno('CHEMBL25')
print(molregno)  # 15

molregno_to_chembl_id(molregno)

Convert internal molregno to ChEMBL ID.

Parameters

molregno : int
    Internal molecule registry number

Returns

str or None
    ChEMBL identifier, or None if not found

Examples

chembl = CheMBL()
chembl_id = chembl.molregno_to_chembl_id(15)
print(chembl_id)  # 'CHEMBL25'

provesid.chembl.ChEMBLError

Bases: Exception

Custom exception for ChEMBL database errors

Overview

The ChEMBL module provides access to the ChEMBL SQLite database, a manually curated database of bioactive molecules with drug-like properties maintained by EMBL-EBI. The database contains over 2.3 million compounds with chemical structures, properties, and bioactivity data.

Database Information

  • Database: ChEMBL v36
  • Format: SQLite (~5GB uncompressed)
  • Source: EMBL-EBI FTP
  • Auto-download: Yes (on first use)
  • Compressed size: ~1.5GB

Key Features

  • Multiple search methods: Search by ChEMBL ID, name, InChI, InChI Key, or SMILES
  • Local database: Fast queries with no API rate limits
  • Offline access: Works offline after initial download
  • Comprehensive data: Structures, properties, identifiers, and metadata
  • Easy integration: Simple Python API consistent with other PROVESID modules

Database Schema

The ChEMBL class queries the following main tables:

Core Tables

  • molecule_dictionary: Primary compound information

    • molregno: Internal molecule registry number (primary key)
    • chembl_id: ChEMBL identifier (e.g., CHEMBL25)
    • pref_name: Preferred compound name
    • max_phase: Maximum clinical trial phase (0-4)
    • therapeutic_flag: Drug/therapeutic indicator
    • molecule_type: Type classification
  • compound_structures: Chemical structure representations

    • molregno: Foreign key to molecule_dictionary
    • canonical_smiles: Canonical SMILES string
    • standard_inchi: Standard InChI representation
    • standard_inchi_key: Standard InChI Key
    • molfile: Molfile structure data
  • compound_properties: Physicochemical properties

    • molregno: Foreign key to molecule_dictionary
    • mw_freebase: Molecular weight
    • alogp: Calculated LogP (lipophilicity)
    • hba: Hydrogen bond acceptors
    • hbd: Hydrogen bond donors
    • psa: Polar surface area
    • rtb: Rotatable bonds
    • aromatic_rings: Number of aromatic rings
    • heavy_atoms: Heavy atom count
    • num_ro5_violations: Lipinski Rule of Five violations
  • molecule_synonyms: Alternative compound names

    • molregno: Foreign key to molecule_dictionary
    • synonyms: Synonym/alternative name
    • syn_type: Type of synonym
    • Note: All compound lookups automatically include a list of synonyms
  • chembl_id_lookup: ChEMBL ID mapping table

    • chembl_id: ChEMBL identifier
    • entity_type: Type of entity (COMPOUND, TARGET, ASSAY, etc.)
    • entity_id: Internal ID (e.g., molregno for compounds)

For complete schema documentation, see src/provesid/data/schema_documentation.txt.
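
Because chembl_id_lookup covers several entity types, a ChEMBL ID only maps to a molregno when its entity_type is COMPOUND. The sketch below illustrates that filter on a toy in-memory table; the helper `compound_molregno` and the toy target ID are hypothetical, and whether PROVESID resolves IDs through this table or through molecule_dictionary is not stated here.

```python
import sqlite3

# Toy lookup table using the column names listed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chembl_id_lookup (
    chembl_id TEXT PRIMARY KEY, entity_type TEXT, entity_id INTEGER);
INSERT INTO chembl_id_lookup VALUES
    ('CHEMBL25', 'COMPOUND', 15),
    ('CHEMBL_TOY_TARGET', 'TARGET', 15);
""")


def compound_molregno(conn, chembl_id):
    """Return the molregno for a compound ID, or None for other entity types."""
    row = conn.execute(
        "SELECT entity_id FROM chembl_id_lookup "
        "WHERE chembl_id = ? AND entity_type = 'COMPOUND'",
        (chembl_id,),
    ).fetchone()
    return row[0] if row else None


print(compound_molregno(conn, "CHEMBL25"))
print(compound_molregno(conn, "CHEMBL_TOY_TARGET"))
```

Without the entity_type filter, the target row in this toy data would resolve to the same internal ID as the compound, which is why the filter matters.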

Quick Start

from provesid import CheMBL

chembl = CheMBL()

# Search by ChEMBL ID
compound = chembl.search_by_chembl_id('CHEMBL25')  # Aspirin
print(compound['pref_name'])  # 'ASPIRIN'

# Get molecular properties
props = chembl.get_properties(compound['molregno'])
print(f"MW: {props['mw_freebase']:.2f}")
print(f"LogP: {props['alogp']:.2f}")

# View synonyms
print(f"Synonyms: {compound['synonyms'][:5]}")  # First 5 synonyms

Usage Examples

Example 1: Search by Name

from provesid import CheMBL

chembl = CheMBL()

# Search for compounds by name
results = chembl.search_by_name('caffeine')

for compound in results:
    print(f"{compound['chembl_id']}: {compound['pref_name']}")
    print(f"  SMILES: {compound['canonical_smiles']}")
    if compound['synonyms']:
        print(f"  Synonyms: {', '.join(compound['synonyms'][:3])}")

Example 2: Search by SMILES and InChI Key

from provesid import CheMBL

chembl = CheMBL()

# Search by SMILES
smiles = 'CC(=O)Oc1ccccc1C(=O)O'
compound = chembl.search_by_smiles(smiles)

# Search by InChI Key
inchikey = 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'
compound = chembl.search_by_inchikey(inchikey)

# Both return the same compound (aspirin)
print(compound['chembl_id'])  # 'CHEMBL25'

Example 3: Property Analysis

from provesid import CheMBL

chembl = CheMBL()

# Get compound and properties
compound = chembl.search_by_chembl_id('CHEMBL25')
props = chembl.get_properties(compound['molregno'])

# Check Lipinski's Rule of Five
print("Lipinski's Rule of Five:")
print(f"  MW <= 500: {props['mw_freebase'] <= 500}")
print(f"  LogP <= 5: {props['alogp'] <= 5}")
print(f"  HBA <= 10: {props['hba'] <= 10}")
print(f"  HBD <= 5: {props['hbd'] <= 5}")
print(f"  Total violations: {props['num_ro5_violations']}")
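
The same check can be wrapped in a small helper that tolerates missing values. This is a sketch, assuming the property keys documented for get_properties; the function `ro5_violations` is hypothetical, the hba and hbd values in the demo are illustrative, and its count may differ from the stored num_ro5_violations, which ChEMBL computes with its own descriptors.

```python
# Upper limits of Lipinski's Rule of Five, keyed by ChEMBL property name.
RO5_LIMITS = {"mw_freebase": 500, "alogp": 5, "hba": 10, "hbd": 5}


def ro5_violations(props):
    """Count properties that exceed their Rule of Five limit.

    Missing or None values are skipped rather than counted as violations.
    """
    violations = 0
    for key, limit in RO5_LIMITS.items():
        value = props.get(key)
        if value is not None and value > limit:
            violations += 1
    return violations


# MW and LogP follow the aspirin example above; hba and hbd are illustrative.
aspirin_like = {"mw_freebase": 180.16, "alogp": 1.19, "hba": 3, "hbd": 1}
print("Violations:", ro5_violations(aspirin_like))
```

Skipping `None` values keeps the helper usable on compounds whose property rows are incomplete, at the cost of possibly undercounting.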

Example 4: Batch Processing

from provesid import CheMBL
import pandas as pd

chembl = CheMBL()

# Process multiple compounds
drug_ids = ['CHEMBL25', 'CHEMBL521', 'CHEMBL112']
data = []

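for chembl_id in drug_ids:
    compound = chembl.search_by_chembl_id(chembl_id)
    if compound is None:
        continue  # Skip IDs that are not in the database

    # get_properties may return None, so fall back to an empty dict
    props = chembl.get_properties(compound['molregno']) or {}
    data.append({
        'ChEMBL ID': chembl_id,
        'Name': compound['pref_name'],
        'MW': props.get('mw_freebase'),
        'LogP': props.get('alogp')
    })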

df = pd.DataFrame(data)
print(df)

Example 5: ID Conversion

from provesid import CheMBL

chembl = CheMBL()

# Convert ChEMBL ID to internal molregno
molregno = chembl.chembl_id_to_molregno('CHEMBL25')
print(f"CHEMBL25 -> molregno: {molregno}")

# Convert back
chembl_id = chembl.molregno_to_chembl_id(molregno)
print(f"molregno {molregno} -> {chembl_id}")

Manual Database Download

If you prefer to download the database manually:

from provesid import CheMBL

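# Initialize without auto-download; raises FileNotFoundError if the database file is missing
chembl = CheMBL(auto_download=False)

# Or download explicitly (force=True re-downloads even if the database exists)
chembl.download_database(force=True)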

Download from command line:

cd src/provesid/data
wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_36_sqlite.tar.gz
tar -xzf chembl_36_sqlite.tar.gz
mv chembl_36/chembl_36.db .

Performance Notes

  • First query: May take a few seconds as SQLite loads indexes
  • Subsequent queries: Very fast (local SQLite, no network overhead)
  • Name searches: May be slower due to LIKE queries and synonym matching
  • Exact matches: InChI Key and SMILES searches are fast (indexed)
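
The difference between indexed exact matches and `LIKE` searches can be seen in SQLite's query plan. The sketch below uses a toy in-memory table with a hypothetical index name (`idx_inchi_key`); it illustrates the mechanism only and makes no claim about which indexes the shipped ChEMBL file contains.

```python
import sqlite3

# Toy table with an index on the InChI Key column (index name is made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound_structures (
    molregno INTEGER PRIMARY KEY,
    canonical_smiles TEXT,
    standard_inchi_key TEXT);
CREATE INDEX idx_inchi_key ON compound_structures (standard_inchi_key);
""")


def query_plan(conn, sql, params):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


base = "SELECT molregno, canonical_smiles FROM compound_structures WHERE "

# Equality on the indexed column: SQLite can seek directly via the index.
exact_plan = query_plan(
    conn, base + "standard_inchi_key = ?", ("BSYNRYMUTXBXSQ-UHFFFAOYSA-N",)
)
# Substring match with a leading wildcard: every row has to be examined.
like_plan = query_plan(
    conn, base + "standard_inchi_key LIKE ?", ("%UHFFFAOYSA%",)
)

print(exact_plan)
print(like_plan)
```

The exact wording of the plan text varies between SQLite versions, but the equality query should report a search using the index while the `LIKE` query reports a table scan.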

Error Handling

from provesid import CheMBL, ChEMBLError

try:
    chembl = CheMBL()
    compound = chembl.search_by_chembl_id('CHEMBL25')

    if compound is None:
        print("Compound not found")
    else:
        print(f"Found: {compound['pref_name']}")

except ChEMBLError as e:
    print(f"ChEMBL error: {e}")
except FileNotFoundError as e:
    print(f"Database not found: {e}")

Comparison with Other Data Sources

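| Feature       | ChEMBL          | PubChem         | ChEBI           |
|---------------|-----------------|-----------------|-----------------|
| Database Size | 2.3M compounds  | 110M+ compounds | 190K+ entities  |
| Focus         | Bioactive drugs | All chemistry   | Biology-focused |
| API           | Local SQLite    | REST API        | REST API + SDF  |
| Offline       | Yes             | No              | Partial (SDF)   |
| Speed         | Very fast       | Rate limited    | Moderate        |
| Bioactivity   | Yes             | Yes             | Limited         |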
