PubChem API tutorial¶
This tutorial walks through common use cases of the PubChem API. The package also provides a second API, pubchemview, which has its own tutorial.
# Import the required modules for PubChem API
from provesid.pubchem import PubChemAPI, Domain, CompoundProperties
import json # mostly for nicer printing :-)
# Initialize the PubChem API client
pc = PubChemAPI()
CID, SID, and AID¶
CID (Compound ID), SID (Substance ID), and AID (Assay ID) are unique identifiers used by PubChem:
CID (Compound ID): Identifies a unique chemical structure in the PubChem Compound database. Each distinct molecule has a single CID, regardless of how it was submitted or by whom. Example: formaldehyde has CID 712.
SID (Substance ID): Identifies a record in the PubChem Substance database, which represents a substance as submitted by a depositor. Multiple SIDs can map to the same CID if different sources submit the same compound. Example: formaldehyde may have many SIDs from different submitters.
AID (Assay ID): Identifies a bioassay record in the PubChem BioAssay database. Each AID corresponds to a specific biological test or experiment, which may reference one or more CIDs or SIDs.
In summary:
- CID = unique chemical structure
- SID = depositor-submitted sample/record
- AID = bioassay/experiment
Whatever we need to retrieve from PubChem, we first need to look for these IDs.
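To make the distinction concrete, the sketch below builds the web address at which each kind of record can be requested. It assumes the public PUG REST layout (rest/pug/<domain>/<namespace>/<identifier>/<output>). The helper name record_url is hypothetical and is not part of provesid, which may construct its requests differently.

```python
# Hypothetical helper, not part of provesid: maps each PubChem ID type
# to the PUG REST domain that serves it.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

DOMAIN_FOR_ID = {
    "cid": "compound",   # unique chemical structure
    "sid": "substance",  # depositor-submitted record
    "aid": "assay",      # bioassay/experiment
}


def record_url(id_type: str, identifier: int, output: str = "JSON") -> str:
    """Return the PUG REST URL that addresses a CID, SID, or AID."""
    id_type = id_type.lower()
    if id_type not in DOMAIN_FOR_ID:
        raise ValueError(f"Unknown ID type: {id_type!r}")
    return f"{BASE}/{DOMAIN_FOR_ID[id_type]}/{id_type}/{identifier}/{output}"


print(record_url("cid", 712))        # formaldehyde compound record
print(record_url("sid", 479884108))  # one depositor's substance record
```

The same identifier number means different things in different domains, which is why the ID type always has to travel with the number.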
How to search for IDs?¶
The next code cell demonstrates how to use the PubChemAPI to search for CIDs (Compound IDs) and SIDs (Substance IDs) in PubChem by querying with different types of identifiers such as chemical names, SMILES strings, or CAS numbers.
- pc.get_cids_by_name('aspirin') looks up CIDs by the compound name "aspirin".
- pc.get_cids_by_name('water', domain=Domain.COMPOUND) searches for CIDs in the compound domain using the name "water".
- pc.get_cids_by_name('8000-78-0', domain=Domain.SUBSTANCE) searches for CIDs in the substance domain using a CAS number.
- pc.find_cids_comprehensive('8000-78-0') performs a comprehensive search across both compound and substance domains for the given CAS number, returning the CIDs found and the domain in which each was found.
- pc.get_sids_by_name('8000-78-0') retrieves SIDs by searching with a CAS number.
You can use these methods to retrieve PubChem IDs by providing any identifier (name, SMILES, CAS, etc.). The API will return the corresponding CIDs or SIDs, making it easy to map between different chemical identifiers and PubChem records. This is useful for integrating chemical data from various sources or for further property lookups in PubChem.
# 1. Default behavior (backward compatible)
cids_aspirin = pc.get_cids_by_name('aspirin') # Returns clean list of CIDs
print(f"cids found for the name aspirin: {cids_aspirin}")
# 2. Explicit compound domain
cids_water = pc.get_cids_by_name('water', domain=Domain.COMPOUND)
print(f"cids found for the name water: {cids_water}")
# 3. Search in substance domain (new capability)
cids_garlic_oil = pc.get_cids_by_name('8000-78-0', domain=Domain.SUBSTANCE) # [6850738]
print(f"cid found for the CAS number 8000-78-0: {cids_garlic_oil}")
# 4. Comprehensive search across both domains
results = pc.find_cids_comprehensive('8000-78-0')
# Returns detailed results with recommendations
print(f"comprehensive search results for the CAS number 8000-78-0: {results}")
# 5. Enhanced SID search
sids = pc.get_sids_by_name('8000-78-0') # Returns clean list of SIDs
print(f"first 5 sids found for the CAS number 8000-78-0: {sids[:5]}")
cids found for the name aspirin: [2244, 1983, 9871508, 56842252, 145904, 3776, 3032790, 16099592, 24936226, 24847798, 23666729, 11980079, 3014024, 24847791, 169926, 91626, 702, 123131972, 69725476, 24847819, 12280114, 9905405, 133472, 119032, 68484, 15110, 6247, 10245201, 10245200, 56841602, 9938610, 10745, 68749, 23676700, 4064, 30987, 71586755, 25157143, 156866, 16126783, 155576, 12759847, 171511, 5748307, 5793, 222, 24847961, 24666, 44219, 9841438, 53040, 54681542, 8591, 9818919, 24847966, 31869, 131750206, 3080848, 137329, 57384021, 5161, 187065, 53477504, 46780045, 29971035, 90478514, 162733, 21102, 522325, 91758292, 46186934, 23680279, 69975280, 530150, 56843206, 199027, 131953074, 77845952, 72204814, 11508774, 6453785, 91820534, 79668, 67421543, 5492635, 139196123, 139196122, 139196121, 71508666, 129672411, 86676097, 9864979, 132282528, 67463240, 54404402, 9935793, 225394, 133162, 83966, 12490, 135, 176479303, 176479302, 134716626, 131716973, 129682947, 53633780, 51404094, 44153517, 20975655, 13037297, 12332833, 11182709, 11139098, 93093, 83151, 69039, 11812, 71593881, 67546490, 62705023, 23452462, 526502, 73922, 138319320, 91820043, 91819941, 2820813, 9863414, 338]
cids found for the name water
cid found for the CAS number 8000-78-0: [6850738]
comprehensive search results for the CAS number 8000-78-0: {'query': '8000-78-0', 'name_type': 'word', 'compound_domain': {'cids': [], 'success': False, 'error': 'Internal server error'}, 'substance_domain': {'cids': [6850738], 'success': True, 'error': None}, 'total_unique_cids': [6850738], 'recommended_domain': 'substance'}
first 5 sids found for the CAS number 8000-78-0: [442031951, 445479966, 446464370, 479884108, 480557704]
pc.find_cids_comprehensive('8000-78-0')
{'query': '8000-78-0', 'name_type': 'word', 'compound_domain': {'cids': [], 'success': False, 'error': 'Resource not found'}, 'substance_domain': {'cids': [6850738], 'success': True, 'error': None}, 'total_unique_cids': [6850738], 'recommended_domain': 'substance'}
One of my main use cases is to look up a compound by its CAS number and, if nothing is found, to look up a substance by the same CAS number. This is especially useful for CAS numbers that are not found in CAS Common Chemistry.
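The compound-then-substance fallback described above can be sketched as a small helper. This is a minimal sketch under stated assumptions: cids_with_fallback and fake_search are hypothetical names, the search callable stands in for the real client, and the broad except reflects the observation that a failed lookup surfaces as an error such as "Resource not found".

```python
from typing import Callable, List, Tuple


def cids_with_fallback(
    search: Callable[[str, str], List[int]],
    identifier: str,
    domains: Tuple[str, ...] = ("compound", "substance"),
) -> Tuple[List[int], str]:
    """Try each domain in order; return (cids, domain) for the first hit."""
    for domain in domains:
        try:
            cids = search(identifier, domain)
        except Exception:  # assumption: a failed lookup raises instead of returning []
            cids = []
        if cids:
            return cids, domain
    return [], ""


# Stand-in for the real client so the sketch runs offline.
def fake_search(identifier: str, domain: str) -> List[int]:
    if domain == "compound":
        raise LookupError("Resource not found")
    return [6850738]


print(cids_with_fallback(fake_search, "8000-78-0"))  # ([6850738], 'substance')
```

With the real client, search would be a thin wrapper that calls pc.get_cids_by_name(identifier, domain=Domain.COMPOUND) or domain=Domain.SUBSTANCE depending on the domain string. find_cids_comprehensive() already performs both lookups in one call, so the helper is mainly useful when you want to stop at the first domain that answers.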
Synonyms and specific IDs¶
After finding the cid using one identifier, we can obtain a list of synonyms (e.g. chemical names, CAS numbers, etc.) and also extract specific identifiers from that list:
synonyms = pc.get_compound_synonyms(cids_aspirin[0])
ids = pc.get_compound_identifiers(cids_aspirin[0])
print(synonyms[:5])
print(ids)
['aspirin', 'ACETYLSALICYLIC ACID', '50-78-2', '2-Acetoxybenzoic acid', '2-(Acetyloxy)benzoic acid']
{'success': True, 'cid': 2244, 'error': None, 'total_synonyms': 692, 'casrn': ['50-78-2', '001-16-2', '001-17-0', '001-18-8', '11126-35-5'], 'nsc': ['NSC406186', 'NSC27223', 'NSC755899'], 'dtxsid': ['DTXSID5020108'], 'dtxcid': ['DTXCID50108'], 'ec_number': [], 'chebi_id': ['CHEBI:15365'], 'chembl': ['CHEMBL25']}
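The casrn list above contains entries such as 001-16-2 that do not look like registry numbers. Assuming the list is collected by matching the CAS number pattern in the synonyms (the source does not say how it is built), the candidates can be screened with the CAS check digit: the last digit must equal the weighted sum of the preceding digits, weighted 1, 2, 3, ... from right to left, modulo 10. The helper is_valid_cas below is a hypothetical sketch, not a provesid function.

```python
import re

# 2-7 digits, 2 digits, 1 check digit
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = _CAS_PATTERN.match(cas)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(digit) * weight
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))


casrn = ['50-78-2', '001-16-2', '001-17-0', '001-18-8', '11126-35-5']
print([cas for cas in casrn if is_valid_cas(cas)])  # ['50-78-2', '11126-35-5']
```

A passing check digit only shows that the string is well formed; it does not prove that the number is assigned to this compound.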
Compound property¶
After obtaining the cid of a compound, we can retrieve its properties by calling one of the following functions, which return basic, selected, or all available properties.
res_basic = pc.get_basic_compound_info(cids_aspirin[0])
# Pretty print the result
print(json.dumps(res_basic, indent=2))
{ "CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16", "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O", "InChI": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)", "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "IUPACName": "2-acetyloxybenzoic acid", "success": true, "cid": 2244, "error": null }
res_selected = pc.get_compound_properties(cids_aspirin[0],
[CompoundProperties.SMILES,
CompoundProperties.INCHI,
CompoundProperties.INCHIKEY],
include_synonyms=False)
print(json.dumps(res_selected, indent=2))
{ "CID": 2244, "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O", "InChI": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)", "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "success": true, "cid": 2244, "error": null }
res_all = pc.get_all_compound_info(cids_aspirin[0])
print(json.dumps(res_all, indent=2))
{ "CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16", "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O", "ConnectivitySMILES": "CC(=O)OC1=CC=CC=C1C(=O)O", "InChI": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)", "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "IUPACName": "2-acetyloxybenzoic acid", "XLogP": 1.2, "ExactMass": "180.04225873", "MonoisotopicMass": "180.04225873", "TPSA": 63.6, "Complexity": 212, "Charge": 0, "HBondDonorCount": 1, "HBondAcceptorCount": 4, "RotatableBondCount": 3, "HeavyAtomCount": 13, "IsotopeAtomCount": 0, "AtomStereoCount": 0, "DefinedAtomStereoCount": 0, "UndefinedAtomStereoCount": 0, "BondStereoCount": 0, "DefinedBondStereoCount": 0, "UndefinedBondStereoCount": 0, "CovalentUnitCount": 1, "Volume3D": 136, "XStericQuadrupole3D": 3.86, "YStericQuadrupole3D": 2.45, "ZStericQuadrupole3D": 0.89, "FeatureCount3D": 5, "FeatureAcceptorCount3D": 3, "FeatureDonorCount3D": 0, "FeatureAnionCount3D": 1, "FeatureCationCount3D": 0, "FeatureRingCount3D": 1, "FeatureHydrophobeCount3D": 0, "ConformerModelRMSD3D": 0.6, "EffectiveRotorCount3D": 3, "ConformerCount3D": 10, "Fingerprint2D": "AAADccBwOAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAyDoAABgCIAiDSCAACCAAkIAAIiAEGCMgMJzaENRqCe2Cl4BEIuYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA==", "Title": "Aspirin", "PatentCount": 101562, "PatentFamilyCount": 45279, "LiteratureCount": 138993, "AnnotationTypes": "Biological Test Results: Active|Biological Test Results: Micromolar|Biological Test Results: Nanomolar|Associated Disorders and Diseases|Biological Test Results|Chemical and Physical Properties|Classification|Drug and Medication Information|Identification|Interactions and Pathways|Literature|Patents|Pharmacology and Biochemistry|Safety and Hazards|Spectral Information|Taxonomy|Toxicity", "AnnotationTypeCount": 17, "SourceCategories": "Chemical Vendors|Curation Efforts|Governmental Organizations|Journal Publishers|Legacy Depositors|NIH Initiatives|Research and 
Development|Subscription Services", "success": true, "cid": 2244, "error": null }
Substance properties¶
Substance properties can be retrieved by passing a sid to get_substance_by_sid():
sids = pc.get_sids_by_name("garlic oil")
res = pc.get_substance_by_sid(sids[1])
res
{'sid': {'id': 479884108, 'version': 1}, 'source': {'db': {'name': '21014', 'source_id': {'str': 'AA01FZ9M'}}}, 'synonyms': ['Garlic Oil', '8000-78-0'], 'xref': [{'rn': '8000-78-0'}, {'dburl': 'https://www.aablocks.com'}, {'sburl': 'https://www.aablocks.com/prod/8000-78-0'}, {'regid': 'AA01FZ9M'}], 'compound': [{'id': {'type': 0}, 'atoms': {'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], 'element': [6, 6, 6, 16, 16, 6, 6, 6, 6, 6, 6, 16, 16, 16, 6, 6, 6, 6, 6, 6, 16, 16, 8, 6, 6, 6]}, 'bonds': {'aid1': [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 22, 24, 25], 'aid2': [2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26], 'order': [1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2]}, 'coords': [{'type': [1, 3], 'aid': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], 'conformers': [{'x': [0, 1.3856, 2.7713, 4.1569, 5.5426, 6.9282, 8.3138, 9.6995, 12.8995, 14.2851, 15.6708, 17.0564, 18.442, 19.8277, 21.2133, 22.599, 23.9846, 0, 1.3856, 2.7713, 4.1569, 5.5426, 5.5426, 6.9282, 8.3138, 9.6995], 'y': [0, 0.8, 0, 0.8, 0, 0.8, 0, 0.8, 0, 0.8, 0, 0.8, 0, 0.8, 0, 0.8, 0, 5.6, 6.4, 5.6, 6.4, 5.6, 4, 6.4, 5.6, 6.4]}]}], 'charge': 0}, {'id': {'type': 1, 'id': {'cid': 6850738}}}, {'id': {'type': 2, 'id': {'cid': 16315}}}, {'id': {'type': 2, 'id': {'cid': 16591}}}, {'id': {'type': 2, 'id': {'cid': 65036}}}]}
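The compound list in the record above links the substance to compound records. The sketch below collects those CIDs by entry type. The labels are an assumption based on the PubChem data specification (0 = deposited structure, 1 = standardized compound, 2 = component), and linked_cids is a hypothetical helper rather than a provesid method.

```python
# Assumed meaning of the 'type' codes in a substance record's compound list.
COMPOUND_TYPE = {0: "deposited", 1: "standardized", 2: "component"}


def linked_cids(substance: dict) -> dict:
    """Group the CIDs referenced by a substance record by entry type."""
    result = {}
    for entry in substance.get("compound", []):
        ident = entry.get("id", {})
        cid = ident.get("id", {}).get("cid")
        if cid is None:
            continue  # the deposited structure carries no CID
        label = COMPOUND_TYPE.get(ident.get("type"), "other")
        result.setdefault(label, []).append(cid)
    return result


# Trimmed copy of the record above (atoms, bonds, and coordinates omitted).
record = {
    "sid": {"id": 479884108, "version": 1},
    "compound": [
        {"id": {"type": 0}, "charge": 0},
        {"id": {"type": 1, "id": {"cid": 6850738}}},
        {"id": {"type": 2, "id": {"cid": 16315}}},
        {"id": {"type": 2, "id": {"cid": 16591}}},
        {"id": {"type": 2, "id": {"cid": 65036}}},
    ],
}
print(linked_cids(record))
# {'standardized': [6850738], 'component': [16315, 16591, 65036]}
```

The standardized CID here, 6850738, is the same one that the substance-domain search returned for the CAS number 8000-78-0.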
PubChem View¶
from provesid import PubChemView, get_property_table
logp_table = get_property_table(cids_aspirin[0], "LogP")
logp_table
 | CID | StringWithMarkup | ExperimentalValue | Unit | Temperature | Conditions | FullReference |
---|---|---|---|---|---|---|---|
0 | 2244 | | None | None | None | None | DrugBank \| Acetylsalicylic acid \| The DrugBank... |
1 | 2244 | log Kow = 1.19 | 1.19 | None | None | None | Hazardous Substances Data Bank (HSDB) \| ACETYL... |
2 | 2244 | 1.19 | 1.19 | None | None | None | Human Metabolome Database (HMDB) \| Aspirin \| T... |
3 | 2244 | 1.19 | 1.19 | None | None | None | ILO-WHO International Chemical Safety Cards (I... |
pcv = PubChemView()
res_logP = pcv.get_property_summary(cids_aspirin[0], "LogP")
print(json.dumps(res_logP, indent=2))
{ "property": "LogP", "values": [ "", "log Kow = 1.19", "1.19", "1.19" ], "references": [ "https://www.fip.org/files/fip/BPS/BCS/Monographs/AcetylsalicylicAcid.pdf", "Hansch, C., Leo, A., D. Hoekman. Exploring QSAR - Hydrophobic, Electronic, and Steric Constants. Washington, DC: American Chemical Society., 1995., p. 54", "HANSCH,C ET AL. (1995)" ], "units": [], "conditions": [], "count": 4 }
Advanced PubChem API Features¶
The PubChem API has been improved to provide more elegant data access. Previously, methods like get_substance_by_sid() and get_compound_by_cid() returned data wrapped in redundant structures requiring access like result["PC_Substances"][0] or result["PC_Compounds"][0]. Now these methods automatically extract the relevant data for easier access.
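For code that has to handle both the old wrapped shape and the new direct shape, a small normalizing helper can help. The following is a sketch under that assumption only: unwrap_first is a hypothetical name, and the sample dictionaries are reduced stand-ins for real records.

```python
def unwrap_first(response: dict, wrapper_key: str) -> dict:
    """Return the first record under wrapper_key, or the input if already unwrapped."""
    records = response.get(wrapper_key)
    if isinstance(records, list) and records:
        return records[0]
    return response


wrapped = {"PC_Compounds": [{"id": {"id": {"cid": 2244}}, "charge": 0}]}
direct = {"id": {"id": {"cid": 2244}}, "charge": 0}

# Both shapes normalize to the same record.
print(unwrap_first(wrapped, "PC_Compounds") == unwrap_first(direct, "PC_Compounds"))  # True
```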
Batch Processing and Multiple Compounds¶
Let's explore how to work with multiple compounds and batch processing:
# Batch processing for multiple compounds
compound_names = ["aspirin", "caffeine", "acetaminophen", "ibuprofen"]
all_cids = []
for name in compound_names:
    cids = pc.get_cids_by_name(name)
    if cids:
        all_cids.append(cids[0])  # Take the first CID for each compound
        print(f"{name}: CID {cids[0]}")
print(f"\nCollected CIDs: {all_cids}")
# Batch property retrieval
properties = [CompoundProperties.MOLECULAR_WEIGHT,
CompoundProperties.MOLECULAR_FORMULA,
CompoundProperties.SMILES]
batch_results = pc.get_compound_properties_batch(all_cids, properties)
print("\nBatch property results:")
print(json.dumps(batch_results, indent=2))
aspirin: CID 2244
caffeine: CID 9871508
acetaminophen: CID 1983
ibuprofen: CID 24848049
Collected CIDs: [2244, 9871508, 1983, 24848049]
Batch property results:
[
{ "CID": 2244, "MolecularFormula": "C9H8O4", "MolecularWeight": "180.16", "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O", "success": true, "cid": 2244, "error": null },
{ "CID": 9871508, "MolecularFormula": "C25H27N5O8", "MolecularWeight": "525.5", "SMILES": "CC(=O)NC1=CC=C(C=C1)O.CC(=O)OC1=CC=CC=C1C(=O)O.CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "success": true, "cid": 9871508, "error": null },
{ "CID": 1983, "MolecularFormula": "C8H9NO2", "MolecularWeight": "151.16", "SMILES": "CC(=O)NC1=CC=C(C=C1)O", "success": true, "cid": 1983, "error": null },
{ "CID": 24848049, "MolecularFormula": "C36H47NO10", "MolecularWeight": "653.8", "SMILES": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O.CN(C)CCOC(C1=CC=CC=C1)C2=CC=CC=C2.C(C(=O)O)C(CC(=O)O)(C(=O)O)O", "success": true, "cid": 24848049, "error": null }
]
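The output above shows a pitfall of taking the first CID of a name search: for "caffeine" and "ibuprofen" the first hit is a multi-component record (the SMILES contains "." separators and the formula is far larger than the single drug). One possible heuristic, sketched below with a hypothetical helper, is to prefer the first candidate whose SMILES describes a single connected structure. Salts also contain "." and would be skipped, so treat this as a sketch rather than a general rule.

```python
from typing import List, Optional


def first_single_component(records: List[dict]) -> Optional[dict]:
    """Return the first record whose SMILES has no '.' component separator."""
    for record in records:
        smiles = record.get("SMILES") or ""
        if smiles and "." not in smiles:
            return record
    return None


# Property records in the shape shown above, for two "caffeine" candidates.
candidates = [
    {"CID": 9871508, "MolecularFormula": "C25H27N5O8",
     "SMILES": "CC(=O)NC1=CC=C(C=C1)O.CC(=O)OC1=CC=CC=C1C(=O)O.CN1C=NC2=C1C(=O)N(C(=O)N2C)C"},
    {"CID": 2519, "MolecularFormula": "C8H10N4O2",
     "SMILES": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"},
]
best = first_single_component(candidates)
print(best["CID"])  # 2519
```

In practice the candidate records could come from get_compound_properties_batch() called on the first few CIDs returned for a name.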
Chemical Structure Searching¶
PubChem API supports searching by various chemical identifiers including SMILES and InChI keys. Both get_cids_by_smiles() and get_cids_by_inchikey() methods return clean lists of CIDs, and the corresponding get_compounds_by_*() methods return the compound data directly without wrapper structures:
# Search by SMILES string (caffeine)
caffeine_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
cids_by_smiles = pc.get_cids_by_smiles(caffeine_smiles)
print(f"CIDs found by SMILES: {cids_by_smiles}")
# Get compound record by SMILES (new improved method - no wrapper needed!)
compound_by_smiles = pc.get_compounds_by_smiles(caffeine_smiles)
print(f"Compound data type: {type(compound_by_smiles)}")
if isinstance(compound_by_smiles, dict):
    print(f"Direct access to compound keys: {list(compound_by_smiles.keys())}")
# Search by InChI Key
inchikey = "RYYVLZVUVIJVGH-UHFFFAOYSA-N" # caffeine InChI key
cids_by_inchikey = pc.get_cids_by_inchikey(inchikey)
print(f"CIDs found by InChI Key: {cids_by_inchikey}")
# Get compound record by InChI Key (new improved method)
compound_by_inchikey = pc.get_compounds_by_inchikey(inchikey)
print(f"Compound by InChI Key - type: {type(compound_by_inchikey)}")
CIDs found by SMILES: [2519]
Compound data type: <class 'dict'>
Direct access to compound keys: ['id', 'atoms', 'bonds', 'coords', 'charge', 'props', 'count']
CIDs found by InChI Key: [2519]
Compound by InChI Key - type: <class 'dict'>
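Because each identifier type has its own search method, it can be convenient to guess the type of an incoming string before dispatching. The sketch below is an illustration only: guess_identifier_type is a hypothetical helper, and it deliberately does not try to tell names from SMILES, which cannot be done reliably by pattern alone.

```python
import re

_INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")  # 14-10-1 uppercase letters
_CAS = re.compile(r"^\d{2,7}-\d{2}-\d$")                # e.g. 50-78-2


def guess_identifier_type(identifier: str) -> str:
    """Classify an identifier as 'inchi', 'inchikey', 'cas', or 'name_or_smiles'."""
    text = identifier.strip()
    if text.startswith("InChI="):
        return "inchi"
    if _INCHIKEY.match(text):
        return "inchikey"
    if _CAS.match(text):
        return "cas"
    return "name_or_smiles"


for example in ["RYYVLZVUVIJVGH-UHFFFAOYSA-N", "8000-78-0",
                "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "caffeine"]:
    print(example, "->", guess_identifier_type(example))
```

An "inchikey" result would then go to get_cids_by_inchikey(), a "cas" result to get_cids_by_name() (possibly with the substance-domain fallback), and the remaining strings to a name or SMILES search chosen by the caller.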
Comprehensive PubChem View Tutorial¶
PubChemView provides access to experimental properties that are not available through the standard PubChem API. These include physical and chemical properties like melting point, boiling point, solubility, vapor pressure, and many others.
Available Experimental Properties¶
Let's first explore what experimental properties are available for a compound:
# Check what experimental properties are available for aspirin
available_props = pcv.get_available_properties(cids_aspirin[0])
print(f"Available experimental properties for aspirin ({len(available_props)} total):")
for prop in available_props:
    print(f" - {prop}")
# Show the standard experimental properties mapping
print(f"\nTotal standard experimental properties supported: {len(pcv.experimental_properties)}")
print("Some examples:")
for i, (key, value) in enumerate(list(pcv.experimental_properties.items())[:10]):
    print(f" {key} -> {value}")
print(" ...")
Available experimental properties for aspirin (16 total):
 - Physical Description
 - Color/Form
 - Odor
 - Boiling Point
 - Melting Point
 - Flash Point
 - Solubility
 - Density
 - Vapor Pressure
 - LogP
 - Stability/Shelf Life
 - Decomposition
 - Dissociation Constants
 - Collision Cross Section
 - Kovats Retention Index
 - Other Experimental Properties
Total standard experimental properties supported: 44
Some examples:
 Accelerating Rate Calorimetry (ARC) -> Accelerating+Rate+Calorimetry+(ARC)
 Acid Value -> Acid+Value
 Autoignition Temperature -> Autoignition+Temperature
 Boiling Point -> Boiling+Point
 Caco2 Permeability -> Caco2+Permeability
 Collision Cross Section -> Collision+Cross+Section
 Color/Form -> Color/Form
 Corrosivity -> Corrosivity
 Decomposition -> Decomposition
 Density -> Density
 ...
Common Physical Properties¶
Let's extract some common physical and chemical properties using the convenience methods:
# Melting Point
melting_point = pcv.get_melting_point(cids_aspirin[0])
print("Melting Point data:")
for i, mp in enumerate(melting_point[:3]):  # Show first 3 entries
    print(f" {i+1}: {mp.value} (Ref: {mp.reference_number})")
# Boiling Point
boiling_point = pcv.get_boiling_point(cids_aspirin[0])
print(f"\nBoiling Point data ({len(boiling_point)} entries):")
for i, bp in enumerate(boiling_point[:2]):
    print(f" {i+1}: {bp.value}")
# Solubility
solubility = pcv.get_solubility(cids_aspirin[0])
print(f"\nSolubility data ({len(solubility)} entries):")
for i, sol in enumerate(solubility[:3]):
    print(f" {i+1}: {sol.value}")
# Density
density = pcv.get_density(cids_aspirin[0])
print(f"\nDensity data ({len(density)} entries):")
for i, dens in enumerate(density[:2]):
    print(f" {i+1}: {dens.value}")
Melting Point data:
 1: 275 °F (NTP, 1992) (Ref: 7)
 2: 138-140 (Ref: 35)
 3: 135 °C (rapid heating) (Ref: 60)
Boiling Point data (4 entries):
 1: 284 °F at 760 mmHg (decomposes) (NTP, 1992)
 2:
Solubility data (6 entries):
 1: less than 1 mg/mL at 73 °F (NTP, 1992)
 2:
 3: 1 g sol in: 300 mL water at 25 °C, 100 mL water at 37 °C, 5 mL alcohol, 17 mL chloroform, 10-15 mL ether; less sol in anhydrous ether
Density data (5 entries):
 1: 1.4 (NTP, 1992) - Denser than water; will sink
 2: 1.40
Note that the experimental data are not reported in a uniform way. The values are always strings, so it is difficult to come up with a single method that extracts values, units, and experimental conditions from every entry. We will gradually improve this feature by adding more formats to our regex code as we encounter them.
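To illustrate why these strings are awkward to parse, here is a minimal regex sketch for the melting point entries shown above. It is not the pattern used inside PubChemView. The names parse_temperature and to_celsius are hypothetical, a range such as 138-140 is reduced to its midpoint, and only °C and °F are recognized.

```python
import re
from typing import Optional, Tuple

# A number, an optional "-high" range end, and an optional °C/°F unit.
_TEMPERATURE = re.compile(
    r"^\s*(?P<low>-?\d+(?:\.\d+)?)"
    r"(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*(?:°\s*(?P<unit>[CF]))?"
)


def parse_temperature(text: str) -> Optional[Tuple[float, Optional[str]]]:
    """Return (value, unit) for a leading temperature, or None if absent."""
    match = _TEMPERATURE.match(text)
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    return (low + high) / 2, match.group("unit")


def to_celsius(value: float, unit: Optional[str]) -> Optional[float]:
    """Convert to °C; return None when the unit is unknown."""
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "C":
        return value
    return None


for raw in ["275 °F (NTP, 1992)", "138-140", "135 °C (rapid heating)", ""]:
    parsed = parse_temperature(raw)
    celsius = to_celsius(*parsed) if parsed else None
    print(repr(raw), "->", parsed, "->", celsius)
```

The first and third entries agree once converted (275 °F is 135 °C), while the unitless range cannot be converted without guessing the unit, which is exactly the kind of ambiguity the paragraph above describes.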
Property Tables with Full References¶
The get_property_table() function provides comprehensive property data in a pandas DataFrame format with full reference information and parsed experimental values:
# Get comprehensive LogP data with references
logp_table = get_property_table(cids_aspirin[0], "LogP")
print("LogP Property Table:")
print(logp_table)
print(f"\nColumns: {list(logp_table.columns)}")
# Show some specific data
if len(logp_table) > 0:
    print(f"\nExample extracted values:")
    for i, row in logp_table.iterrows():
        if row['ExperimentalValue'] is not None:
            print(f" Original: '{row['StringWithMarkup']}'")
            print(f" Extracted: {row['ExperimentalValue']} {row['Unit'] if row['Unit'] else '(unitless)'}")
            break
LogP Property Table:
    CID StringWithMarkup ExperimentalValue  Unit Temperature Conditions  \
0  2244                               None  None        None       None
1  2244   log Kow = 1.19              1.19  None        None       None
2  2244             1.19              1.19  None        None       None
3  2244             1.19              1.19  None        None       None

                                       FullReference
0  DrugBank | Acetylsalicylic acid | The DrugBank...
1  Hazardous Substances Data Bank (HSDB) | ACETYL...
2  Human Metabolome Database (HMDB) | Aspirin | T...
3  ILO-WHO International Chemical Safety Cards (I...

Columns: ['CID', 'StringWithMarkup', 'ExperimentalValue', 'Unit', 'Temperature', 'Conditions', 'FullReference']

Example extracted values:
 Original: 'log Kow = 1.19'
 Extracted: 1.19 (unitless)
# Compare different properties for aspirin
properties_to_check = ["Vapor Pressure", "Melting Point", "Boiling Point", "Solubility"]
for prop in properties_to_check:
    table = get_property_table(cids_aspirin[0], prop)
    if len(table) > 0:
        valid_values = table[table['ExperimentalValue'].notna()]
        print(f"{prop}: {len(valid_values)} experimental values extracted from {len(table)} total entries")
        if len(valid_values) > 0:
            # Show one example
            example = valid_values.iloc[0]
            print(f" Example: {example['ExperimentalValue']} {example['Unit'] if example['Unit'] else ''}")
    else:
        print(f"{prop}: No data available")
    print()
Vapor Pressure: 4 experimental values extracted from 5 total entries
 Example: 0 mmHg
Melting Point: 5 experimental values extracted from 7 total entries
 Example: 275 °F
Boiling Point: 3 experimental values extracted from 4 total entries
 Example: 284 °F
Solubility: 1 experimental values extracted from 6 total entries
 Example: 1 mg/mL
Advanced Pattern Recognition¶
PubChemView includes sophisticated pattern recognition for extracting experimental values from various text formats. The recent improvements include support for formats like "log Kow = 1.19" for LogP data:
# Demonstrate the improved LogP pattern recognition
logp_data = pcv.extract_property_data(cids_aspirin[0], "LogP")
print("LogP pattern recognition examples:")
for i, data in enumerate(logp_data):
    if data.value:  # Only show non-empty values
        # Test the extraction function directly
        exp_value, unit, temp, cond = pcv._extract_experimental_value_and_unit(data.value, "LogP")
        print(f" {i+1}: '{data.value}' -> {exp_value} {unit if unit else '(unitless)'}")
# Test with a compound that has vapor pressure data
caffeine_cid = pc.get_cids_by_name("caffeine")[0]
vp_data = pcv.extract_property_data(caffeine_cid, "Vapor Pressure")
print(f"\nVapor Pressure pattern recognition examples (CID {caffeine_cid}):")
for i, data in enumerate(vp_data[:3]):  # Show first 3
    if data.value:
        exp_value, unit, temp, cond = pcv._extract_experimental_value_and_unit(data.value, "Vapor Pressure")
        print(f" {i+1}: '{data.value}' -> {exp_value} {unit if unit else 'no unit'}")
LogP pattern recognition examples:
 2: 'log Kow = 1.19' -> 1.19 (unitless)
 3: '1.19' -> 1.19 (unitless)
 4: '1.19' -> 1.19 (unitless)
Property 'Vapor Pressure' not found for CID 9871508
Vapor Pressure pattern recognition examples (CID 9871508):
No vapor pressure examples are printed because the first CID returned for the name "caffeine" (9871508) is the aspirin/acetaminophen/caffeine combination record seen in the batch results earlier (formula C25H27N5O8), not caffeine itself. Caffeine is CID 2519, as returned by the SMILES and InChI Key searches.
Batch Property Extraction¶
You can extract several properties of a compound in a single call, or all of its experimental properties at once:
# Batch extraction for multiple properties of aspirin
target_properties = ["LogP", "Melting Point", "Boiling Point", "Solubility", "Vapor Pressure"]
batch_results = pcv.batch_extract_properties(cids_aspirin[0], target_properties)
print("Batch extraction results for aspirin:")
for prop_name, prop_data in batch_results.items():
    print(f"\n{prop_name}: {len(prop_data)} entries")
    if prop_data:
        # Show first non-empty value
        for data in prop_data:
            if data.value and data.value.strip():
                print(f" Example: {data.value}")
                break
# Extract all experimental properties for a compound
print(f"\n" + "="*50)
print("ALL EXPERIMENTAL PROPERTIES")
print("="*50)
all_properties = pcv.extract_all_experimental_properties(cids_aspirin[0])
print(f"Total experimental property categories found: {len(all_properties)}")
for prop_name, data_list in list(all_properties.items())[:5]:  # Show first 5
    print(f"{prop_name}: {len(data_list)} entries")
Batch extraction results for aspirin:
LogP: 4 entries
 Example: log Kow = 1.19
Melting Point: 7 entries
 Example: 275 °F (NTP, 1992)
Boiling Point: 4 entries
 Example: 284 °F at 760 mmHg (decomposes) (NTP, 1992)
Solubility: 6 entries
 Example: less than 1 mg/mL at 73 °F (NTP, 1992)
Vapor Pressure: 5 entries
 Example: 0 mmHg (approx) (NIOSH, 2024)
==================================================
ALL EXPERIMENTAL PROPERTIES
==================================================
Total experimental property categories found: 16
Physical Description: 6 entries
Color/Form: 2 entries
Odor: 2 entries
Boiling Point: 4 entries
Melting Point: 7 entries
import pandas as pd
# Example: Build a small database of pharmaceutical compounds
pharma_compounds = {
"aspirin": 2244,
"ibuprofen": 3672,
"acetaminophen": 1983,
"caffeine": 2519
}
# Create a comprehensive database
database_records = []
for name, cid in pharma_compounds.items():
    print(f"Processing {name} (CID: {cid})...")
    # Get basic compound info
    basic_info = pc.get_basic_compound_info(cid)
    # Get experimental properties
    logp_table = get_property_table(cid, "LogP")
    mp_table = get_property_table(cid, "Melting Point")
    sol_table = get_property_table(cid, "Solubility")
    # Extract first valid experimental value for each property
    logp_exp = logp_table[logp_table['ExperimentalValue'].notna()]['ExperimentalValue'].iloc[0] if len(logp_table[logp_table['ExperimentalValue'].notna()]) > 0 else None
    mp_exp = mp_table[mp_table['ExperimentalValue'].notna()]['ExperimentalValue'].iloc[0] if len(mp_table[mp_table['ExperimentalValue'].notna()]) > 0 else None
    sol_count = len(sol_table[sol_table['ExperimentalValue'].notna()])
    record = {
        'name': name,
        'cid': cid,
        'molecular_formula': basic_info.get('MolecularFormula'),
        'molecular_weight': basic_info.get('MolecularWeight'),
        # get_basic_compound_info() returns the structure under the key 'SMILES'
        # (see its output earlier), so 'CanonicalSMILES' is absent and this
        # column comes back as None; use basic_info.get('SMILES') to fill it.
        'smiles': basic_info.get('CanonicalSMILES'),
        'logp_experimental': logp_exp,
        'melting_point_experimental': mp_exp,
        'solubility_data_points': sol_count
    }
    database_records.append(record)
# Create DataFrame
pharma_db = pd.DataFrame(database_records)
print("\nPharmaceutical Compounds Database:")
print(pharma_db.to_string(index=False))
Processing aspirin (CID: 2244)...
Processing ibuprofen (CID: 3672)...
Processing acetaminophen (CID: 1983)...
Processing caffeine (CID: 2519)...

Pharmaceutical Compounds Database:
         name  cid molecular_formula molecular_weight smiles logp_experimental melting_point_experimental  solubility_data_points
      aspirin 2244            C9H8O4           180.16   None              1.19                        275                       1
    ibuprofen 3672          C13H18O2           206.28   None              3.97                      75-77                       1
acetaminophen 1983           C8H9NO2           151.16   None              0.46                        168                       2
     caffeine 2519         C8H10N4O2           194.19   None             -0.07                        460                       2
Error Handling and Best Practices¶
When working with PubChem APIs, it's important to handle errors gracefully:
from provesid import PubChemNotFoundError, PubChemError, PubChemViewNotFoundError
# Example of robust compound lookup
def safe_compound_lookup(identifier, search_type="name"):
    """Safely look up a compound with error handling"""
    try:
        if search_type == "name":
            cids = pc.get_cids_by_name(identifier)
        elif search_type == "smiles":
            cids = pc.get_cids_by_smiles(identifier)
        else:
            raise ValueError(f"Unsupported search type: {search_type}")
        if not cids:
            print(f"No compounds found for '{identifier}'")
            return None
        print(f"Found {len(cids)} compound(s) for '{identifier}': {cids[:5]}...")  # Show first 5
        return cids[0]  # Return first CID
    except PubChemNotFoundError:
        print(f"Compound '{identifier}' not found in PubChem")
        return None
    except PubChemError as e:
        print(f"PubChem API error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None
# Test with various inputs
test_compounds = [
("aspirin", "name"),
("invalid_compound_name_xyz", "name"),
("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles"), # aspirin SMILES
("invalid_smiles", "smiles")
]
for compound, search_type in test_compounds:
    print(f"\nTesting: {compound} (search type: {search_type})")
    cid = safe_compound_lookup(compound, search_type)
    if cid:
        print(f" Success! CID: {cid}")
# Safe property extraction
def safe_property_extraction(cid, property_name):
    """Safely extract property data with error handling"""
    try:
        data = pcv.extract_property_data(cid, property_name)
        return data
    except PubChemViewNotFoundError:
        print(f"Property '{property_name}' not found for CID {cid}")
        return []
    except Exception as e:
        print(f"Error extracting {property_name} for CID {cid}: {e}")
        return []
print(f"\n" + "="*40)
print("Safe property extraction example:")
logp_safe = safe_property_extraction(cids_aspirin[0], "LogP")
print(f"LogP data extracted safely: {len(logp_safe)} entries")
Testing: aspirin (search type: name)
Found 130 compound(s) for 'aspirin': [2244, 1983, 9871508, 56842252, 145904]...
 Success! CID: 2244
Testing: invalid_compound_name_xyz (search type: name)
Unexpected error: Resource not found
Testing: CC(=O)OC1=CC=CC=C1C(=O)O (search type: smiles)
Found 1 compound(s) for 'CC(=O)OC1=CC=CC=C1C(=O)O': [2244]...
 Success! CID: 2244
Testing: invalid_smiles (search type: smiles)
Unexpected error: Bad request: { "Fault": { "Code": "PUGREST.BadRequest", "Message": "Unable to standardize the given structure - perhaps some special characters need to be escaped or data packed in a MIME form?", "Details": [ "error: ", "status: 400", "output: Caught ncbi::CException: Standardization failed", "Output Log:", "Record 1: Warning: Cactvs Ensemble cannot be created from input string", "Record 1: Error: Unable to convert input into a compound object", "", "" ] } }
========================================
Safe property extraction example:
LogP data extracted safely: 4 entries
Note that in this run both failed lookups were reported by the generic except Exception branch ("Unexpected error: ...") rather than by the PubChemNotFoundError or PubChemError handlers, so the final catch-all handler is worth keeping.
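Some failures are transient: the comprehensive search earlier in this tutorial reported "Internal server error" in one run and "Resource not found" in another. A retry with exponential backoff handles the first kind without repeating hopeless lookups. The sketch below is a hypothetical helper, and deciding retryability from the message text is an assumption that should be replaced by a check on the exception type or status code when one is available.

```python
import time


def is_transient(exc: Exception) -> bool:
    """Assumption: server-side failures mention 'internal server error'."""
    return "internal server error" in str(exc).lower()


def call_with_retry(func, *args, attempts=3, base_delay=1.0,
                    should_retry=is_transient, sleep=time.sleep, **kwargs):
    """Call func, retrying transient failures with delays of 1x, 2x, 4x, ... base_delay."""
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if attempt == attempts - 1 or not should_retry(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# Offline demonstration with a lookup that fails twice before succeeding.
calls = {"n": 0}

def flaky_lookup(name):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Internal server error")
    return [2244]

delays = []
print(call_with_retry(flaky_lookup, "aspirin", sleep=delays.append), delays)
# [2244] [1.0, 2.0]
```

With the real client the call would look like call_with_retry(pc.get_cids_by_name, "aspirin"), wrapped in the same try/except structure as safe_compound_lookup().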
Summary¶
This tutorial covered the comprehensive functionality of both PubChem APIs in the PROVESID package:
PubChemAPI (Standard API)¶
- Improved Data Access: Methods like get_compound_by_cid() and get_substance_by_sid() now return data directly without redundant wrapper structures
- Multiple Search Methods: Search by name, SMILES, InChI key, CAS number
- Comprehensive ID Resolution: Find CIDs across both compound and substance domains
- Batch Processing: Handle multiple compounds efficiently
- Property Extraction: Get basic, selected, or all compound properties
PubChemView (Experimental Properties API)¶
- Experimental Properties: Access to 40+ experimental properties not available in the standard API
- Advanced Pattern Recognition: Sophisticated text parsing for extracting numerical values from diverse formats
- Property Tables: Comprehensive DataFrames with full reference information
- Batch Extraction: Extract multiple properties for compounds efficiently
- Temperature and Conditions: Automatic extraction of experimental conditions
Key Improvements¶
- Elegant Data Access: No more ["PC_Compounds"][0] or ["PC_Substances"][0] needed
- Enhanced LogP Recognition: Now supports "log Kow = 1.19" and similar formats
- Robust Error Handling: Proper exception handling for network and data issues
- Comprehensive Pattern Support: Handles scientific notation, comparison operators, and diverse units
Best Practices¶
- Always use error handling for production code
- Use batch methods for multiple compounds
- Check data availability before processing
- Respect PubChem's rate limits (built into the APIs)
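Rate limiting is built into the provesid clients, as noted in the list above. If you make your own requests to PubChem alongside them, a small throttle keeps you under PubChem's published guideline of no more than five requests per second. The Throttle class below is a hypothetical sketch and not the mechanism provesid uses internally.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_second=5)
for cid in (2244, 1983, 2519):
    throttle.wait()  # at most one pass every 0.2 s
    print(f"would request CID {cid}")
```

The clock and sleep parameters exist only so the class can be tested without real delays.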
The PROVESID package provides a powerful and user-friendly interface to PubChem's vast chemical database, making it easy to integrate chemical data into your research workflows.