# Reload the module to pick up the new method
import importlib
import provesid.zeropm
importlib.reload(provesid.zeropm)
from provesid.zeropm import ZeroPM
# Initialize the ZeroPM database connection
zpm = ZeroPM()
print("✓ ZeroPM reloaded and initialized successfully!")
✓ ZeroPM reloaded and initialized successfully!
ZeroPM Tutorial
This notebook demonstrates the functionality of the ZeroPM class, which provides efficient access to the ZeroPM SQLite database containing chemical identifiers and properties.
Overview
ZeroPM allows you to:
- Query chemicals by CAS number or name
- Query chemicals by regulatory inventory, country, or region
- Convert between different chemical identifiers (CAS, InChI, InChIKey, SMILES)
- Perform fuzzy name searches
- Batch process multiple chemicals
- Search by substructure
- Export results to CSV
Database Auto-Download
The ZeroPM database (~400 MB) is automatically downloaded from GitHub on first use. The download happens once, and the database is cached locally for future use.
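The download-once behaviour described above can be pictured with a small caching helper. This is only a sketch of the general pattern, not ZeroPM's actual implementation: `ensure_cached`, `fake_fetch`, and the file name are hypothetical, and the "download" is a stand-in that writes placeholder bytes instead of contacting GitHub.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(target: Path, fetch: Callable[[Path], None]) -> bool:
    """Fetch `target` only if it is not already on disk.

    Returns True when a download happened, False when the cache was reused.
    """
    if target.exists() and target.stat().st_size > 0:
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary sibling first so an interrupted download
    # never leaves a truncated file at the final path.
    partial = target.with_name(target.name + ".part")
    fetch(partial)
    os.replace(partial, target)
    return True


def fake_fetch(dest: Path) -> None:
    # Hypothetical stand-in for a real HTTP download (no network access here).
    dest.write_bytes(b"sqlite placeholder")


cache_dir = Path(tempfile.mkdtemp())
db_file = cache_dir / "zeropm-example.sqlite"
first = ensure_cached(db_file, fake_fetch)   # performs the "download"
second = ensure_cached(db_file, fake_fetch)  # reuses the cached file
print(first, second)
```

Writing to a `.part` file and renaming it afterwards is one way to avoid treating a half-downloaded database as a valid cache.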
Let's get started!
1. Installation and Setup
First, make sure you have the required dependencies installed.
# Import the ZeroPM class
from provesid.zeropm import ZeroPM
# Initialize the ZeroPM database connection
zpm = ZeroPM()
print("✓ ZeroPM initialized successfully!")
print(f"Database path: {zpm.db_path}")
✓ ZeroPM initialized successfully!
Database path: /home/ali/projects/USETOX/PROVESID/src/provesid/data/zeropm-v0-0-4.sqlite
# Quick check: look up formaldehyde (CAS 50-00-0) and fetch its InChI IDs and ranks
query_id = zpm.query_cas("50-00-0")
inchi_ids, ranks = zpm.get_inchi_id(query_id)
print(f"InChI IDs for CAS 50-00-0: {inchi_ids} with ranks {ranks}")
InChI IDs for CAS 50-00-0: [32227] with ranks [1]
Testing the new get_id_table_from_cas method
This method returns a pandas DataFrame with all identifiers (CAS, InChI, InChIKey, synonyms) for a given CAS number.
# Get identifier table for formaldehyde (CAS: 50-00-0)
df = zpm.get_id_table_from_cas("50-00-0")
print(df)
print(f"\nNumber of rows: {len(df)}")
cas query_id inchi_id rank inchi \
0 50-00-0 8671 32227 1 InChI=1S/CH2O/c1-2/h1H2
inchikey zeropm_id \
0 WSFSSNUMVMOOMR-UHFFFAOYSA-N 3224
synonyms \
0 formaldehyde ...%; FORMALIN; Formaldehyde; for...
sources
0 Chemical Data Reporting Inventory, Industrial ...
Number of rows: 1
# Display the full table with better formatting
print("\nFull table with all columns:")
print(df.to_string())
# Show specific columns
print("\n\nKey columns:")
print(df[['cas', 'inchi', 'inchikey']].to_string())
Full table with all columns:
cas query_id inchi_id rank inchi inchikey zeropm_id synonyms sources
0 50-00-0 8671 32227 1 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 3224 formaldehyde ...%; FORMALIN; Formaldehyde; formaldehyde ... %; Formalin; formaldehyde Chemical Data Reporting Inventory, Industrial Processing and Use; Chemical Data Reporting Inventory, Consumer and Commercial Use; Chemical Data Reporting Inventory, Manufacturing-Import Information; Chemical Data Reporting Inventory, Nationally Aggregated Production Volumes; Industrial Chemicals Inventory; Inventory of Existing Chemical Substances and Chemicals; Inventory of Chemicals and Chemical Substances; NITE; Hazardous Chemical Information System; Substances in Preparations in Nordic Countries Inventory; CLP Annex VI; Chemical Information Management System Inventory; Chemicals Information System ; Chemical Substance Inventory; Chemical Substances Control Law Existing Chemical Substances; Inventory of Chemicals; EC Inventory; Toxic Substances Control Act (TSCA) Chemical Substance Inventory; Domestic Substance List
Key columns:
cas inchi inchikey
0 50-00-0 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N
# Test with another compound that might have multiple entries
# Let's try with aspirin
df2 = zpm.get_id_table_from_cas("50-78-2") # Aspirin
if df2 is not None:
    print("\nAspirin (CAS: 50-78-2):")
    print(df2[['cas', 'inchi_id', 'rank', 'inchikey']].to_string())
else:
    print("\nAspirin not found in database")
Aspirin (CAS: 50-78-2):
cas inchi_id rank inchikey
0 50-78-2 45333 1 BSYNRYMUTXBXSQ-UHFFFAOYSA-N
1 50-78-2 182692 2 BSYNRYMUTXBXSQ-UHFFFAOYSA-M
2 50-78-2 9014 3 XDZMPRGFOOFSBL-UHFFFAOYSA-N
3 50-78-2 208438 4 BSYNRYMUTXBXSQ-FIBGUPNXSA-N
Benefits of get_id_table_from_cas
The get_id_table_from_cas() method provides several advantages:
- Comprehensive identifier retrieval: Returns CAS, InChI, InChIKey, and synonyms in one call
- Handles multiple structures: Some CAS numbers map to multiple InChI structures (different forms, salts, etc.)
- Rank information: Includes rank to indicate the relevance/confidence of each match
- Easy data manipulation: Returns a pandas DataFrame for easy filtering, sorting, and export
df2
| | cas | query_id | inchi_id | rank | inchi | inchikey | zeropm_id | synonyms | sources |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50-78-2 | 11272 | 45333 | 1 | InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)... | BSYNRYMUTXBXSQ-UHFFFAOYSA-N | 4267 | 2-acetoxybenzoic acid; Acetylsalicyclic acid; ... | Substances in Preparations in Nordic Countries... |
| 1 | 50-78-2 | 11272 | 182692 | 2 | InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)... | BSYNRYMUTXBXSQ-UHFFFAOYSA-M | <NA> | 2-acetoxybenzoic acid; Acetylsalicyclic acid; ... | Substances in Preparations in Nordic Countries... |
| 2 | 50-78-2 | 11272 | 9014 | 3 | InChI=1S/C9H10O3/c1-2-12-8-6-4-3-5-7(8)9(10)11... | XDZMPRGFOOFSBL-UHFFFAOYSA-N | 6402 | 2-acetoxybenzoic acid; Acetylsalicyclic acid; ... | Substances in Preparations in Nordic Countries... |
| 3 | 50-78-2 | 11272 | 208438 | 4 | InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)... | BSYNRYMUTXBXSQ-FIBGUPNXSA-N | <NA> | 2-acetoxybenzoic acid; Acetylsalicyclic acid; ... | Substances in Preparations in Nordic Countries... |
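When a CAS number maps to several structures, as aspirin does above, a common follow-up is to keep only the top-ranked one. The sketch below does this in plain Python on rows copied from the outputs above. It assumes that a lower `rank` means a better match, which the tutorial suggests but does not state outright; `best_per_cas` is a hypothetical helper, not part of ZeroPM.

```python
rows = [
    {"cas": "50-78-2", "inchi_id": 45333, "rank": 1, "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"cas": "50-78-2", "inchi_id": 182692, "rank": 2, "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-M"},
    {"cas": "50-78-2", "inchi_id": 9014, "rank": 3, "inchikey": "XDZMPRGFOOFSBL-UHFFFAOYSA-N"},
    {"cas": "50-78-2", "inchi_id": 208438, "rank": 4, "inchikey": "BSYNRYMUTXBXSQ-FIBGUPNXSA-N"},
    {"cas": "50-00-0", "inchi_id": 32227, "rank": 1, "inchikey": "WSFSSNUMVMOOMR-UHFFFAOYSA-N"},
]


def best_per_cas(records):
    """Keep the record with the smallest rank for each CAS number."""
    best = {}
    for rec in records:
        current = best.get(rec["cas"])
        if current is None or rec["rank"] < current["rank"]:
            best[rec["cas"]] = rec
    return best


top = best_per_cas(rows)
for cas, rec in top.items():
    print(cas, rec["inchi_id"], rec["inchikey"])
```

On a DataFrame, the same selection can be written as `df.sort_values("rank").drop_duplicates("cas")`.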
Batch Processing Multiple CAS Numbers
For processing multiple CAS numbers at once, use batch_get_id_table_from_cas():
# Process multiple CAS numbers at once
cas_list = ["50-00-0", "50-78-2", "64-17-5"] # formaldehyde, aspirin, ethanol
batch_df = zpm.batch_get_id_table_from_cas(cas_list)
print(f"Total rows: {len(batch_df)}")
print(f"\nNumber of structures per CAS:")
print(batch_df.groupby('cas').size())
print(f"\nFirst few rows:")
print(batch_df[['cas', 'inchi_id', 'rank', 'inchikey']].head(10))
Total rows: 6
Number of structures per CAS:
cas
50-00-0 1
50-78-2 4
64-17-5 1
dtype: int64
First few rows:
cas inchi_id rank inchikey
0 50-00-0 32227 1 WSFSSNUMVMOOMR-UHFFFAOYSA-N
1 50-78-2 45333 1 BSYNRYMUTXBXSQ-UHFFFAOYSA-N
2 50-78-2 182692 2 BSYNRYMUTXBXSQ-UHFFFAOYSA-M
3 50-78-2 9014 3 XDZMPRGFOOFSBL-UHFFFAOYSA-N
4 50-78-2 208438 4 BSYNRYMUTXBXSQ-FIBGUPNXSA-N
5 64-17-5 60355 1 LFQSCWFLJHTTHZ-UHFFFAOYSA-N
# The batch method handles missing CAS numbers gracefully
mixed_cas_list = ["50-00-0", "999-99-9", "50-78-2"] # valid, invalid, valid
batch_df_mixed = zpm.batch_get_id_table_from_cas(mixed_cas_list)
print(f"Requested {len(mixed_cas_list)} CAS numbers")
print(f"Found {len(batch_df_mixed['cas'].unique())} in database")
print(f"CAS numbers found: {list(batch_df_mixed['cas'].unique())}")
WARNING:root:CAS number 999-99-9 not found in database
Requested 3 CAS numbers
Found 2 in database
CAS numbers found: ['50-00-0', '50-78-2']
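A lookup miss can mean either that the substance is absent from the database or that the CAS number itself is malformed. The standard CAS check digit lets you tell the two apart before querying. The helper below is a standalone sketch (`is_valid_cas` is a hypothetical name, not a ZeroPM method) that applies the usual rule: weight the digits 1, 2, 3, ... from right to left, sum them, and compare the sum modulo 10 with the final digit.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` is well formed and its check digit is correct."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


for cas in ["50-00-0", "999-99-9", "50-78-2"]:
    print(cas, is_valid_cas(cas))
```

Here `999-99-9` fails the checksum, so it could be rejected without touching the database. A number that passes the check is only well formed; it may still be missing from ZeroPM.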
2. Database Statistics
Let's start by exploring what's in the database.
# Get database statistics
stats = zpm.get_database_stats()
print("Database Statistics:")
print("=" * 50)
for key, value in stats.items():
    print(f"{key:30s}: {value:>15,}" if isinstance(value, int) else f"{key:30s}: {value}")
Database Statistics:
==================================================
api_ready_query               :         447,617
api_results                   :       1,349,825
substances                    :         359,221
inventories                   :         543,851
inventory_summary             :       1,045,887
cleanventory_chemicals        :         483,388
zeropm_chemicals              :         126,369
components                    :          22,380
multi_components              :          38,696
unique_cas_numbers            :         164,513
unique_chemical_names         :         283,104
3. Querying by CAS Number
The most common way to query the database is using a CAS Registry Number.
# Get a sample CAS number from the database
zpm.cursor.execute("""
SELECT query
FROM api_ready_query
WHERE type = 'CAS Registry Number'
LIMIT 1
""")
sample_cas = zpm.cursor.fetchone()[0]
print(f"Sample CAS number: {sample_cas}")
# Query the CAS number to get a query_id
query_id = zpm.query_cas(sample_cas)
print(f"Query ID: {query_id}")
# Get InChI information
inchi_ids, ranks = zpm.get_inchi_id(query_id)
if inchi_ids:
    inchi, inchikey = zpm.get_inchi(inchi_ids[0])
    print(f"\nInChI: {inchi[:50]}..." if len(inchi) > 50 else f"\nInChI: {inchi}")
    print(f"InChIKey: {inchikey}")
Sample CAS number: 121-20-0
Query ID: 1
InChI: InChI=1S/C21H28O5/c1-7-8-9-14-13(3)17(11-16(14)22)...
InChIKey: SHCRDCOTRILILT-WOBDGSLYSA-N
4. Converting CAS to SMILES
ZeroPM can convert CAS numbers to SMILES using RDKit.
# Get SMILES from CAS number
smiles = zpm.get_smiles_from_cas(sample_cas)
print(f"CAS: {sample_cas}")
print(f"SMILES: {smiles}")
# Get chemical names
names = zpm.get_names(sample_cas)
if names:
    print(f"\nAlternative names ({len(names)}):")
    for i, name in enumerate(names[:5], 1): # Show first 5 names
        print(f" {i}. {name}")
    if len(names) > 5:
        print(f" ... and {len(names) - 5} more")
CAS: 121-20-0
SMILES: C/C=C\CC1=C(C)[C@@H](OC(=O)[C@@H]2[C@@H](/C=C(\C)C(=O)OC)C2(C)C)CC1=O
Alternative names (7):
 1. Cyclopropanecarboxylic acid, 3-(3-methoxy-2-methyl-3-oxo-1-propenyl)-2,2-dimethyl-, 3-(2-butenyl)-2-methyl-4-oxo-2-cyclopenten-1-yl ester, [1R-[1.alpha.[S*(Z)],3.beta.(E)]]-
 2. CINERIN II
 3. 3-(but-2-enyl)-2-methyl-4-oxocyclopent-2-enyl2,2-dimethyl-3-(3-methoxy-2-methyl-3-oxoprop-1-enyl)cyclopropanecarboxylate
 4. Jasmolin II
 5. Cyclopropanecarboxylic acid, 3-[(1E)-3-methoxy-2-methyl-3-oxo-1-propenyl]-2,2-dimethyl-, (1S)-3-(2Z)-2-butenyl-2-methyl-4-oxo-2-cyclopenten-1-yl ester, (1R,3R)-
 ... and 2 more
5. Querying by Chemical Name
You can search for chemicals by their exact name or use fuzzy matching.
# Get a sample chemical name
zpm.cursor.execute("""
SELECT query
FROM api_ready_query
WHERE type = 'chemical name'
LIMIT 1
""")
sample_name = zpm.cursor.fetchone()[0]
# Exact match
query_id = zpm.query_name(sample_name)
print(f"Exact search for '{sample_name}'")
print(f"Query ID: {query_id}")
# Fuzzy match (with partial name)
if len(sample_name) >= 5:
    partial_name = sample_name[:5]
    print(f"\nFuzzy search for '{partial_name}':")
    similar_ids = zpm.query_similar_name(partial_name, number_of_results=5, score_cutoff=70)
    if similar_ids:
        print(f"Found {len(similar_ids)} similar matches")
        for qid in similar_ids[:3]:
            zpm.cursor.execute("SELECT query FROM api_ready_query WHERE query_id = ?", (qid,))
            result = zpm.cursor.fetchone()
            if result:
                print(f" - {result[0]}")
Exact search for 'cinerin II'
Query ID: 2
Fuzzy search for 'ciner':
Found 5 similar matches
 - cinerin II
 - cinerin I
 - ne
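To build intuition for what a score cutoff does, the sketch below scores candidate names against a query on a 0 to 100 scale and keeps those at or above the cutoff. It uses the standard library's `difflib.SequenceMatcher` purely as a stand-in: the scorer behind `query_similar_name()` is not shown in this notebook, so actual scores may differ. `fuzzy_search` is a hypothetical helper whose parameter names simply mirror the call above.

```python
from difflib import SequenceMatcher


def fuzzy_search(query, names, number_of_results=5, score_cutoff=70):
    """Return (name, score) pairs scoring at least `score_cutoff`, best first."""
    scored = []
    for name in names:
        # ratio() is in [0, 1]; scale to 0-100 to match the cutoff convention.
        score = 100 * SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= score_cutoff:
            scored.append((name, round(score, 1)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number_of_results]


names = ["Formaldehyde", "Ethanol", "Acetic acid"]
matches = fuzzy_search("formaldehide", names)  # query contains a typo
print(matches)
```

Raising `score_cutoff` trades recall for precision: a misspelling that differs by one letter still scores high, while unrelated names fall well below the threshold.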
6. Converting Between Different Identifiers
ZeroPM supports conversion between various chemical identifiers.
# Get a sample InChI and InChIKey
zpm.cursor.execute("""
SELECT inchi, inchikey
FROM substances
LIMIT 1
""")
test_inchi, test_inchikey = zpm.cursor.fetchone()
print("Identifier Conversions:")
print("=" * 60)
# InChI to CAS
cas_from_inchi = zpm.get_cas_from_inchi(test_inchi)
print(f"InChI → CAS: {cas_from_inchi}")
# InChIKey to CAS
cas_from_key = zpm.get_cas_from_inchikey(test_inchikey)
print(f"InChIKey → CAS: {cas_from_key}")
# InChIKey to SMILES
smiles_from_key = zpm.get_smiles_from_inchikey(test_inchikey)
print(f"InChIKey → SMILES: {smiles_from_key}")
# SMILES to CAS (if we have a valid SMILES)
if smiles_from_key:
    cas_from_smiles = zpm.get_cas_from_smiles(smiles_from_key)
    print(f"SMILES → CAS: {cas_from_smiles}")
Identifier Conversions:
============================================================
InChI → CAS: ['100-00-5', '68239-23-6']
InChIKey → CAS: ['100-00-5', '68239-23-6']
InChIKey → SMILES: O=[N+]([O-])c1ccc(Cl)cc1
SMILES → CAS: ['100-00-5', '68239-23-6']
[12:26:29] WARNING: Charges were rearranged
7. Batch Processing
For efficiency, ZeroPM provides batch methods to process multiple chemicals at once.
# Get multiple CAS numbers
zpm.cursor.execute("""
SELECT query
FROM api_ready_query
WHERE type = 'CAS Registry Number'
LIMIT 5
""")
cas_list = [row[0] for row in zpm.cursor.fetchall()]
print(f"Batch processing {len(cas_list)} CAS numbers:")
print("=" * 60)
# Batch query CAS numbers
query_ids = zpm.batch_query_cas(cas_list)
for cas, qid in query_ids.items():
    print(f"{cas}: Query ID = {qid}")
print("\n" + "=" * 60)
# Batch get SMILES
smiles_dict = zpm.batch_get_smiles_from_cas(cas_list)
for cas, smiles in smiles_dict.items():
    print(f"{cas}: {smiles if smiles else 'N/A'}")
Batch processing 5 CAS numbers:
============================================================
121-20-0: Query ID = 1
25646-71-3: Query ID = 4
76823-93-3: Query ID = 9
177964-68-0: Query ID = 11
27955-94-8: Query ID = 14
============================================================
121-20-0: C/C=C\CC1=C(C)[C@@H](OC(=O)[C@@H]2[C@@H](/C=C(\C)C(=O)OC)C2(C)C)CC1=O
25646-71-3: CCN(CCNS(C)(=O)=O)c1ccc(N)c(C)c1.CCN(CCNS(C)(=O)=O)c1ccc(N)c(C)c1.O=S(=O)(O)O.O=S(=O)(O)O.O=S(=O)(O)O
76823-93-3: N#CCCSCc1csc(NC(=N)N)n1
177964-68-0: COCc1c(C(C)C)nc(C(C)C)c(C=CC=O)c1-c1ccc(F)cc1
27955-94-8: CC(c1ccc(O)cc1)(c1ccc(O)cc1)c1ccc(O)cc1
8. Batch InChIKey to CAS Conversion
# Get multiple InChIKeys
zpm.cursor.execute("""
SELECT inchikey
FROM substances
LIMIT 5
""")
inchikey_list = [row[0] for row in zpm.cursor.fetchall()]
print(f"Batch converting {len(inchikey_list)} InChIKeys to CAS:")
print("=" * 60)
cas_dict = zpm.batch_get_cas_from_inchikey(inchikey_list)
for key, cas in cas_dict.items():
    print(f"{key}: {cas if cas else 'N/A'}")
Batch converting 5 InChIKeys to CAS:
============================================================
: 71889-03-7
AAADGWUCZIMSKQ-UHFFFAOYSA-M: 16509-22-1
AAADKYXUTOBAGS-UHFFFAOYSA-N: 78-99-9
AAAFFJJBQGZTFF-UHFFFAOYSA-N: 5355-88-4
AAAFYYTUBLYGNG-UHFFFAOYSA-N: N/A
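The first entry in the output above has an empty InChIKey, which suggests that the substances table contains blank keys. One option is to screen keys by shape before converting them. The sketch below checks the standard 27-character InChIKey layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, one uppercase letter). `split_valid_inchikeys` is a hypothetical helper, not a ZeroPM method.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def split_valid_inchikeys(keys):
    """Partition `keys` into (well-formed, rejected) lists, preserving order."""
    valid, rejected = [], []
    for key in keys:
        cleaned = (key or "").strip()
        if INCHIKEY_PATTERN.fullmatch(cleaned):
            valid.append(cleaned)
        else:
            rejected.append(key)
    return valid, rejected


keys = ["", "AAADGWUCZIMSKQ-UHFFFAOYSA-M", "AAADKYXUTOBAGS-UHFFFAOYSA-N", None]
valid, rejected = split_valid_inchikeys(keys)
print(valid)
print(rejected)
```

The `valid` list could then be passed to `batch_get_cas_from_inchikey()`, with the `rejected` entries logged for inspection. The pattern checks only the shape of a key, not whether it corresponds to a real structure.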
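9. Advanced Search: Regex Pattern Matching
Search chemical names by pattern. Note that the example below passes %acid% to query_name_regex(); judging by the results, % acts here as a SQL LIKE-style wildcard that matches any run of characters, rather than as regular-expression syntax.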
# Search for chemicals with names containing a pattern
# For example, search for names containing "acid"
pattern = "%acid%"
results = zpm.query_name_regex(pattern, case_sensitive=False, limit=10)
print(f"Chemical names matching pattern '{pattern}':")
print("=" * 60)
for query_id, name in results[:5]:
    print(f" {name}")
print(f"\nTotal matches: {len(results)}")
Chemical names matching pattern '%acid%':
============================================================
2-chloroethylphosphonic acid
Isocyanic acid, 2-methyl-m-phenylene ester
3-(4-aminophenyl)-2-cyano-2-propenoic acid
5-{4-[5-5-amino-2-[4-(2-sulfoxyethylsulfonyl)phenylazo]-4-sulfo-phenylamino]-6-chloro-1,3,5-triazin-2-ylamino}}-4-hydroxy-3-(1-sulfo-naphthalen-2-ylazo)-naphthalene-2,7-disulfonicacid sodium salt
acetic acid ... %
Total matches: 10
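The % wildcards in the pattern above are SQL LIKE syntax, which is less expressive than a regular expression. The sketch below contrasts the two on a toy in-memory table: a LIKE query, and a true regex query enabled by registering a `REGEXP` function through `sqlite3`'s `create_function`. The `names` table is hypothetical and the rows are names borrowed from outputs in this notebook. How `query_name_regex()` is implemented internally is not shown here.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?)",
    [
        ("2-chloroethylphosphonic acid",),
        ("acetic acid ... %",),
        ("butanone oxime",),
        ("Chlorethephon",),
    ],
)


def regexp(pattern, value):
    # SQLite rewrites "value REGEXP pattern" as regexp(pattern, value).
    if value is None:
        return 0
    return 1 if re.search(pattern, value, re.IGNORECASE) else 0


conn.create_function("REGEXP", 2, regexp)

# LIKE: % matches any run of characters, so this means "contains 'acid'".
like_hits = [r[0] for r in conn.execute(
    "SELECT name FROM names WHERE name LIKE ? ORDER BY name", ("%acid%",))]

# REGEXP: anchors such as $ become available, so this means "ends with 'acid'".
regex_hits = [r[0] for r in conn.execute(
    "SELECT name FROM names WHERE name REGEXP ? ORDER BY name", (r"acid$",))]

print(like_hits)
print(regex_hits)
```

The LIKE query returns both names containing "acid", while the anchored regex keeps only the one that ends with it.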
10. Advanced Search: Substructure Search
Search for chemicals containing a specific substructure (SMARTS pattern).
⚠️ Note: This operation can be slow for large searches as it needs to check each molecule.
# Search for molecules containing a benzene ring
smarts_pattern = "c1ccccc1" # Benzene ring
print(f"Searching for molecules with benzene ring (max 5 results)...")
results = zpm.get_cas_by_substructure(smarts_pattern, max_results=5)
print(f"Found {len(results)} molecules with benzene ring:")
print("=" * 60)
for i, compound in enumerate(results, 1):
    print(f"\n{i}. CAS: {compound['cas']}")
    print(f" SMILES: {compound['smiles']}")
    print(f" InChIKey: {compound['inchikey'][:27]}...")
Searching for molecules with benzene ring (max 5 results)...
Found 5 molecules with benzene ring:
============================================================
1. CAS: ['100-00-5', '68239-23-6']
 SMILES: O=[N+]([O-])c1ccc(Cl)cc1
 InChIKey: CZGCEKJOLUNIFY-UHFFFAOYSA-N...
2. CAS: ['100-01-6', '10040-98-9', '68239-24-7']
 SMILES: Nc1ccc([N+](=O)[O-])cc1
 InChIKey: TYMLOMAKGOJONV-UHFFFAOYSA-N...
3. CAS: ['100-02-7', '25154-55-6']
 SMILES: O=[N+]([O-])c1ccc(O)cc1
 InChIKey: BTJIUGUIPKRLHP-UHFFFAOYSA-N...
4. CAS: 100-03-8
 SMILES: O=S(O)c1ccc(Cl)cc1
 InChIKey: AOQYAMDZQAEDLO-UHFFFAOYSA-N...
5. CAS: ['100-04-9', '13533-17-0', '24564-52-1']
 SMILES: CN(C)c1ccc([N+]#N)cc1.[Cl-]
 InChIKey: CCIAVEMREXZXAK-UHFFFAOYSA-M...
11. Exporting Data to CSV
You can export query results to CSV files for further analysis.
import tempfile
import os
# Create a temporary directory for exports
temp_dir = tempfile.mkdtemp()
# Export batch results
output_file = os.path.join(temp_dir, 'cas_smiles_export.csv')
zpm.export_to_csv(
    list(smiles_dict.items()),
    output_file,
    columns=['CAS', 'SMILES']
)
print(f"✓ Exported data to: {output_file}")
# Export custom query results
sql_query = """
SELECT aq.query AS CAS, s.inchikey
FROM api_ready_query aq
JOIN api_results ar ON aq.query_id = ar.query_id
JOIN substances s ON ar.inchi_id = s.inchi_id
WHERE aq.type = 'CAS Registry Number'
LIMIT 10
"""
output_file2 = os.path.join(temp_dir, 'cas_inchikey_export.csv')
zpm.export_query_results(sql_query, output_file2, include_headers=True)
print(f"✓ Exported custom query to: {output_file2}")
# List exported files
print(f"\nExported files in {temp_dir}:")
for file in os.listdir(temp_dir):
    filepath = os.path.join(temp_dir, file)
    size = os.path.getsize(filepath)
    print(f" - {file} ({size:,} bytes)")
✓ Exported data to: /tmp/tmpc3_93yy2/cas_smiles_export.csv
✓ Exported custom query to: /tmp/tmpc3_93yy2/cas_inchikey_export.csv
Exported files in /tmp/tmpc3_93yy2:
 - cas_inchikey_export.csv (400 bytes)
 - cas_smiles_export.csv (353 bytes)
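If you need more control over the CSV layout than the export helpers give you, the same kind of (CAS, SMILES) pairs can be written with the standard `csv` module. This is a sketch rather than a description of what `export_to_csv()` does internally. `export_rows` is a hypothetical helper, and the `999-99-9` entry with a missing SMILES is illustrative.

```python
import csv
import os
import tempfile


def export_rows(rows, path, columns):
    """Write a header row followed by `rows`; return the number of data rows."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)
    return len(rows)


smiles_by_cas = {"96-29-7": "CCC(C)=NO", "111-40-0": "NCCNCCN", "999-99-9": None}
out_path = os.path.join(tempfile.mkdtemp(), "cas_smiles_example.csv")
written = export_rows(list(smiles_by_cas.items()), out_path, ["CAS", "SMILES"])

# Read the file back to confirm what was stored.
with open(out_path, newline="", encoding="utf-8") as handle:
    round_trip = list(csv.reader(handle))
print(written, round_trip)
```

Note that `None` is written as an empty field, so a missing SMILES reads back as an empty string rather than as a null value.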
12. Performance Optimization: Creating Indexes
Create database indexes to speed up queries.
# Create indexes for better query performance
print("Creating database indexes...")
index_results = zpm.create_indexes()
print("\nIndex Status:")
print("=" * 60)
for index_name, status in index_results.items():
    print(f"{index_name:30s}: {status}")
print("\n✓ Indexes created successfully!")
print("Note: Subsequent queries will be faster with these indexes.")
Creating database indexes...
Index Status:
============================================================
idx_query                     : exists
idx_type                      : exists
idx_query_id_results          : exists
idx_inchi_id_results          : exists
idx_inchi                     : exists
idx_inchikey                  : exists
idx_inventory_query           : exists
idx_inventory_id              : exists
✓ Indexes created successfully!
Note: Subsequent queries will be faster with these indexes.
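To see why an index helps, you can ask SQLite for its query plan before and after creating one. The sketch below does this on a small in-memory table with synthetic values. The table is a simplified stand-in for `api_ready_query`, and the choice of indexed column for `idx_query` is an assumption, since the notebook lists only index names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_ready_query (query_id INTEGER PRIMARY KEY, query TEXT, type TEXT)"
)
# Synthetic placeholder values, not real CAS numbers.
conn.executemany(
    "INSERT INTO api_ready_query (query, type) VALUES (?, ?)",
    [(f"{i}-00-0", "CAS Registry Number") for i in range(1000)],
)


def plan(sql, params=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[3] for row in rows)


lookup = "SELECT query_id FROM api_ready_query WHERE query = ?"
before = plan(lookup, ("50-00-0",))
conn.execute("CREATE INDEX IF NOT EXISTS idx_query ON api_ready_query (query)")
after = plan(lookup, ("50-00-0",))
print("before:", before)
print("after: ", after)
```

Without the index the plan is a full table scan; with it, the plan becomes a search that names `idx_query`. The exact wording of the plan text varies between SQLite versions.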
13. Creating Custom Views
Create database views for frequently used queries.
# Create a view for CAS to InChI mapping
view_name = "cas_to_inchi_view"
sql_query = """
SELECT aq.query AS cas, s.inchi, s.inchikey
FROM api_ready_query aq
JOIN api_results ar ON aq.query_id = ar.query_id
JOIN substances s ON ar.inchi_id = s.inchi_id
WHERE aq.type = 'CAS Registry Number' AND ar.rank = 1
"""
success = zpm.create_view(view_name, sql_query)
if success:
    print(f"✓ View '{view_name}' created successfully!")
    # Query the view
    zpm.cursor.execute(f"SELECT * FROM {view_name} LIMIT 5")
    print(f"\nSample data from view:")
    print("=" * 60)
    for row in zpm.cursor.fetchall():
        print(f"CAS: {row[0]}, InChIKey: {row[2][:27]}...")
    # Clean up - drop the view
    zpm.cursor.execute(f"DROP VIEW IF EXISTS {view_name}")
    zpm.conn.commit()
    print(f"\n✓ View dropped for cleanup")
else:
    print(f"✗ Failed to create view")
✓ View 'cas_to_inchi_view' created successfully!
Sample data from view:
============================================================
CAS: 121-20-0, InChIKey: SHCRDCOTRILILT-WOBDGSLYSA-N...
CAS: 121-20-0, InChIKey: SHCRDCOTRILILT-WOBDGSLYSA-N...
CAS: 121-20-0, InChIKey: SHCRDCOTRILILT-WOBDGSLYSA-N...
CAS: 121-20-0, InChIKey: SHCRDCOTRILILT-WOBDGSLYSA-N...
CAS: 25646-71-3, InChIKey: NPDFXFLCEDDWEG-UHFFFAOYSA-N...
✓ View dropped for cleanup
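The sample output above repeats the same CAS/InChIKey pair several times. The notebook does not say why the join yields duplicates, but if one row per pair is what you want, adding `SELECT DISTINCT` to the view definition is one option. The sketch below demonstrates this on a toy in-memory schema. The three tables are simplified stand-ins, the InChI string is a placeholder, and the duplicated `api_results` row is a deliberate assumption that reproduces the symptom.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE api_ready_query (query_id INTEGER PRIMARY KEY, query TEXT, type TEXT);
CREATE TABLE api_results (query_id INTEGER, inchi_id INTEGER, rank INTEGER);
CREATE TABLE substances (inchi_id INTEGER PRIMARY KEY, inchi TEXT, inchikey TEXT);

INSERT INTO api_ready_query VALUES (1, '121-20-0', 'CAS Registry Number');
INSERT INTO substances VALUES (10, 'InChI=1S/placeholder', 'SHCRDCOTRILILT-WOBDGSLYSA-N');
-- The same result stored twice, standing in for whatever causes the
-- repeated rows in the real database.
INSERT INTO api_results VALUES (1, 10, 1), (1, 10, 1);

CREATE VIEW cas_to_inchi_distinct AS
SELECT DISTINCT aq.query AS cas, s.inchi, s.inchikey
FROM api_ready_query aq
JOIN api_results ar ON aq.query_id = ar.query_id
JOIN substances s ON ar.inchi_id = s.inchi_id
WHERE aq.type = 'CAS Registry Number' AND ar.rank = 1;
""")

# Row count of the plain join versus the DISTINCT view.
raw_count = conn.execute("""
SELECT COUNT(*)
FROM api_ready_query aq
JOIN api_results ar ON aq.query_id = ar.query_id
JOIN substances s ON ar.inchi_id = s.inchi_id
""").fetchone()[0]
rows = conn.execute("SELECT cas, inchikey FROM cas_to_inchi_distinct").fetchall()
print(raw_count, rows)
```

The plain join returns two rows, while the view collapses them into a single (CAS, InChIKey) pair.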
14. Complete Example: Workflow for Multiple Chemicals
Here's a complete workflow demonstrating how to process multiple chemicals efficiently.
import pandas as pd
# Get a sample of CAS numbers to process
zpm.cursor.execute("""
SELECT query
FROM api_ready_query
WHERE type = 'CAS Registry Number'
LIMIT 10
""")
cas_numbers = [row[0] for row in zpm.cursor.fetchall()]
print(f"Processing {len(cas_numbers)} chemicals...")
print("=" * 80)
# Batch get all the data we need
query_ids = zpm.batch_query_cas(cas_numbers)
smiles_data = zpm.batch_get_smiles_from_cas(cas_numbers)
names_data = zpm.batch_get_names(cas_numbers)
# Create a pandas DataFrame
data = []
for cas in cas_numbers:
    data.append({
        'CAS': cas,
        'Query_ID': query_ids.get(cas),
        'SMILES': smiles_data.get(cas),
        'Names_Count': len(names_data.get(cas, [])),
        'First_Name': names_data.get(cas, [''])[0] if names_data.get(cas) else ''
    })
df = pd.DataFrame(data)
print("\nResults Summary:")
print(df.to_string(index=False, max_colwidth=50))
print(f"\n✓ Processed {len(cas_numbers)} chemicals successfully!")
print(f" - {df['SMILES'].notna().sum()} have SMILES")
print(f" - {df[df['Names_Count'] > 0].shape[0]} have alternative names")
Processing 10 chemicals...
================================================================================
Results Summary:
CAS Query_ID SMILES Names_Count First_Name
121-20-0 1 C/C=C\CC1=C(C)[C@@H](OC(=O)[C@@H]2[C@@H](/C=C(\... 7 Cyclopropanecarboxylic acid, 3-(3-methoxy-2-met...
25646-71-3 4 CCN(CCNS(C)(=O)=O)c1ccc(N)c(C)c1.CCN(CCNS(C)(=O... 10 (2:3)
76823-93-3 9 N#CCCSCc1csc(NC(=N)N)n1 3 3-(2-(diaminomethyleneamino)thiazol-4-ylmethylt...
177964-68-0 11 COCc1c(C(C)C)nc(C(C)C)c(C=CC=O)c1-c1ccc(F)cc1 4 (E)-3-(4-(4-(E)-2-butenal
27955-94-8 14 CC(c1ccc(O)cc1)(c1ccc(O)cc1)c1ccc(O)cc1 7 4,4',4''-(etaani-1,1,1-triyyli)trifenoli
122886-55-9 16 CCCCCCCCN=C(O)Nc1ccc(Cc2ccc(NC(O)=NCCCCCCCC)cc2... 8 Urea, N-octyl-N'-[4-[[4-[[(octylamino)carbonyl]...
111-40-0 18 NCCNCCN 19 1,2-Ethanediamine, N1-(2-aminoethyl)-
151798-26-4 25 Cc1cccc(N=C(O)c2cc3ccccc3c(N=Nc3ccc4c(c3)C(=O)c... 2 2-[2-hydroksi-3-(2-kloorifenyyli)karbamoyyli-1-...
96-29-7 27 CCC(C)=NO 18 butanone oxime
16672-87-0 32 O=P(O)(O)CCCl 8 Chlorethephon
✓ Processed 10 chemicals successfully!
- 10 have SMILES
- 10 have alternative names
15. Summary and Best Practices
Key Features:
- Simple Queries: query_cas(), query_name()
- Fuzzy Matching: query_similar_name() with configurable score cutoff
- Identifier Conversion: Convert between CAS, InChI, InChIKey, and SMILES
- Batch Operations: Process multiple chemicals efficiently
- Advanced Search: Regex patterns and substructure matching
- Export: Save results to CSV for further analysis
- Performance: Create indexes for faster queries
Best Practices:
- Use batch methods when processing multiple chemicals
- Create indexes before running many queries
- Use fuzzy matching for user input with potential typos
- Set appropriate score cutoffs for fuzzy matching (70-90 is typical)
- Export results to CSV for sharing or further analysis
Performance Tips:
- Batch operations are much faster than individual queries
- Create indexes once at the start if doing many queries
- Use the score_cutoff parameter to limit fuzzy search results
- Limit substructure searches with the max_results parameter
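One reason batching tends to be faster is that many lookups can be folded into a single SQL statement instead of one round trip per chemical. The sketch below compares the two styles on a toy in-memory table, using query IDs taken from the batch output earlier in this notebook. Both helper functions are hypothetical, and how ZeroPM's own batch methods are implemented is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_ready_query (query_id INTEGER PRIMARY KEY, query TEXT)")
conn.executemany(
    "INSERT INTO api_ready_query VALUES (?, ?)",
    [(1, "121-20-0"), (4, "25646-71-3"), (9, "76823-93-3")],
)


def lookup_one_by_one(cas_list):
    """One SELECT per CAS number."""
    result = {}
    for cas in cas_list:
        row = conn.execute(
            "SELECT query_id FROM api_ready_query WHERE query = ?", (cas,)
        ).fetchone()
        result[cas] = row[0] if row else None
    return result


def lookup_batched(cas_list):
    """A single SELECT ... IN (...) for the whole list."""
    result = {cas: None for cas in cas_list}
    if not cas_list:
        return result
    placeholders = ", ".join("?" for _ in cas_list)
    sql = f"SELECT query, query_id FROM api_ready_query WHERE query IN ({placeholders})"
    for cas, query_id in conn.execute(sql, list(cas_list)):
        result[cas] = query_id
    return result


wanted = ["121-20-0", "999-99-9", "76823-93-3"]
print(lookup_one_by_one(wanted))
print(lookup_batched(wanted))
```

Both functions return the same mapping, with `None` for the CAS number that is not present. For very long lists the `IN (...)` form may need to be split into chunks, because SQLite limits the number of bound parameters per statement.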
14. New ID Table Methods - Get Complete Identifier Tables
The ZeroPM class now provides four new methods to retrieve complete identifier tables from different starting points: InChI, InChIKey, and chemical names. These complement the existing get_id_table_from_cas() method.
14.1 Get ID Table from InChI
The get_id_table_from_inchi() method returns all identifiers for a given InChI string.
# First, let's get an InChI from a known CAS to use as an example
formaldehyde_table = zpm.get_id_table_from_cas("50-00-0")
if formaldehyde_table is not None and len(formaldehyde_table) > 0:
    example_inchi = formaldehyde_table['inchi'].iloc[0]
    print(f"Example InChI: {example_inchi}\n")
    # Now get the ID table from InChI
    df_from_inchi = zpm.get_id_table_from_inchi(example_inchi)
    print("ID Table from InChI:")
    print(df_from_inchi)
    print(f"\nColumns: {list(df_from_inchi.columns)}")
else:
    print("Could not find formaldehyde in database")
Example InChI: InChI=1S/CH2O/c1-2/h1H2
ID Table from InChI:
inchi inchikey inchi_id query_id \
0 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 8671
1 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 8672
2 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 35726
3 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 35725
4 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 310465
5 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 367895
6 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 325078
7 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 325578
8 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 361811
9 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 18623
10 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 1759
11 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 7762
12 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 1760
13 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 395145
14 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 124573
15 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 365734
16 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 91354
17 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 361811
18 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 310465
19 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 36880
20 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 124574
21 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 11512
22 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 366233
23 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 325879
24 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 325877
25 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 11565
26 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 367017
27 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 32227 5472
rank cas sources \
0 1 50-00-0 Inventory of Chemicals; Chemical Data Reportin...
1 1 NaN Inventory of Chemicals; Chemical Data Reportin...
2 1 NaN Inventory of Chemicals; Chemical Data Reportin...
3 1 30525-89-4 Inventory of Chemicals; Chemical Data Reportin...
4 1 NaN Inventory of Chemicals; Chemical Data Reportin...
5 1 NaN Inventory of Chemicals; Chemical Data Reportin...
6 1 NaN Inventory of Chemicals; Chemical Data Reportin...
7 1 NaN Inventory of Chemicals; Chemical Data Reportin...
8 1 NaN Inventory of Chemicals; Chemical Data Reportin...
9 2 NaN Inventory of Chemicals; Chemical Data Reportin...
10 2 630-08-0 Inventory of Chemicals; Chemical Data Reportin...
11 2 108-62-3 Inventory of Chemicals; Chemical Data Reportin...
12 2 NaN Inventory of Chemicals; Chemical Data Reportin...
13 2 63101-50-8 Inventory of Chemicals; Chemical Data Reportin...
14 2 1664-98-8 Inventory of Chemicals; Chemical Data Reportin...
15 2 NaN Inventory of Chemicals; Chemical Data Reportin...
16 2 NaN Inventory of Chemicals; Chemical Data Reportin...
17 2 NaN Inventory of Chemicals; Chemical Data Reportin...
18 2 NaN Inventory of Chemicals; Chemical Data Reportin...
19 3 NaN Inventory of Chemicals; Chemical Data Reportin...
20 6 NaN Inventory of Chemicals; Chemical Data Reportin...
21 7 NaN Inventory of Chemicals; Chemical Data Reportin...
22 7 NaN Inventory of Chemicals; Chemical Data Reportin...
23 7 NaN Inventory of Chemicals; Chemical Data Reportin...
24 7 NaN Inventory of Chemicals; Chemical Data Reportin...
25 11 NaN Inventory of Chemicals; Chemical Data Reportin...
26 11 NaN Inventory of Chemicals; Chemical Data Reportin...
27 11 NaN Inventory of Chemicals; Chemical Data Reportin...
synonyms
0 formaldehyde ...%; FORMALIN; Formaldehyde; for...
1 formaldehyde ...%; FORMALIN; Formaldehyde; for...
2 formaldehyde ...%; FORMALIN; Formaldehyde; for...
3 formaldehyde ...%; FORMALIN; Formaldehyde; for...
4 formaldehyde ...%; FORMALIN; Formaldehyde; for...
5 formaldehyde ...%; FORMALIN; Formaldehyde; for...
6 formaldehyde ...%; FORMALIN; Formaldehyde; for...
7 formaldehyde ...%; FORMALIN; Formaldehyde; for...
8 formaldehyde ...%; FORMALIN; Formaldehyde; for...
9 formaldehyde ...%; FORMALIN; Formaldehyde; for...
10 formaldehyde ...%; FORMALIN; Formaldehyde; for...
11 formaldehyde ...%; FORMALIN; Formaldehyde; for...
12 formaldehyde ...%; FORMALIN; Formaldehyde; for...
13 formaldehyde ...%; FORMALIN; Formaldehyde; for...
14 formaldehyde ...%; FORMALIN; Formaldehyde; for...
15 formaldehyde ...%; FORMALIN; Formaldehyde; for...
16 formaldehyde ...%; FORMALIN; Formaldehyde; for...
17 formaldehyde ...%; FORMALIN; Formaldehyde; for...
18 formaldehyde ...%; FORMALIN; Formaldehyde; for...
19 formaldehyde ...%; FORMALIN; Formaldehyde; for...
20 formaldehyde ...%; FORMALIN; Formaldehyde; for...
21 formaldehyde ...%; FORMALIN; Formaldehyde; for...
22 formaldehyde ...%; FORMALIN; Formaldehyde; for...
23 formaldehyde ...%; FORMALIN; Formaldehyde; for...
24 formaldehyde ...%; FORMALIN; Formaldehyde; for...
25 formaldehyde ...%; FORMALIN; Formaldehyde; for...
26 formaldehyde ...%; FORMALIN; Formaldehyde; for...
27 formaldehyde ...%; FORMALIN; Formaldehyde; for...
Columns: ['inchi', 'inchikey', 'inchi_id', 'query_id', 'rank', 'cas', 'sources', 'synonyms']
14.2 Batch Get ID Tables from InChI List
Process multiple InChI strings at once:
# Get multiple InChIs from the database to use as examples
example_cas_list = ["50-00-0", "50-78-2", "64-17-5"] # formaldehyde, aspirin, ethanol
inchi_list = []
for cas in example_cas_list:
    table = zpm.get_id_table_from_cas(cas)
    if table is not None and len(table) > 0:
        inchi_list.append(table['inchi'].iloc[0])
print(f"Testing with {len(inchi_list)} InChI strings\n")
# Batch process
batch_df = zpm.batch_get_id_table_from_inchi(inchi_list)
print("Batch ID Table from InChI list:")
print(batch_df[['inchi', 'inchikey', 'cas', 'rank']].head(10))
print(f"\nTotal rows: {len(batch_df)}")
print(f"Unique InChIs processed: {batch_df['inchi'].nunique()}")
Testing with 3 InChI strings
Batch ID Table from InChI list:
inchi inchikey cas rank
0 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 50-00-0 1
1 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
2 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
3 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N 30525-89-4 1
4 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
5 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
6 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
7 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
8 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
9 InChI=1S/CH2O/c1-2/h1H2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 2
Total rows: 86
Unique InChIs processed: 3
14.3 Get ID Table from InChIKey
The get_id_table_from_inchikey() method returns all identifiers for a given InChIKey string.
# Get an InChIKey example from formaldehyde
if formaldehyde_table is not None and len(formaldehyde_table) > 0:
    example_inchikey = formaldehyde_table['inchikey'].iloc[0]
    print(f"Example InChIKey: {example_inchikey}\n")
    # Now get the ID table from InChIKey
    df_from_inchikey = zpm.get_id_table_from_inchikey(example_inchikey)
    print("ID Table from InChIKey:")
    print(df_from_inchikey)
    print(f"\nColumns: {list(df_from_inchikey.columns)}")
else:
    print("Could not find formaldehyde in database")
Example InChIKey: WSFSSNUMVMOOMR-UHFFFAOYSA-N
ID Table from InChIKey:
inchikey inchi inchi_id query_id \
0 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 8671
1 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 8672
2 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 35726
3 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 35725
4 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 310465
5 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 367895
6 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 325078
7 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 325578
8 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 361811
9 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 18623
10 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 1759
11 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 7762
12 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 1760
13 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 395145
14 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 124573
15 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 365734
16 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 91354
17 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 361811
18 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 310465
19 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 36880
20 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 124574
21 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 11512
22 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 366233
23 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 325879
24 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 325877
25 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 11565
26 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 367017
27 WSFSSNUMVMOOMR-UHFFFAOYSA-N InChI=1S/CH2O/c1-2/h1H2 32227 5472
rank cas sources \
0 1 50-00-0 Inventory of Chemicals; Chemical Data Reportin...
1 1 NaN Inventory of Chemicals; Chemical Data Reportin...
2 1 NaN Inventory of Chemicals; Chemical Data Reportin...
3 1 30525-89-4 Inventory of Chemicals; Chemical Data Reportin...
4 1 NaN Inventory of Chemicals; Chemical Data Reportin...
5 1 NaN Inventory of Chemicals; Chemical Data Reportin...
6 1 NaN Inventory of Chemicals; Chemical Data Reportin...
7 1 NaN Inventory of Chemicals; Chemical Data Reportin...
8 1 NaN Inventory of Chemicals; Chemical Data Reportin...
9 2 NaN Inventory of Chemicals; Chemical Data Reportin...
10 2 630-08-0 Inventory of Chemicals; Chemical Data Reportin...
11 2 108-62-3 Inventory of Chemicals; Chemical Data Reportin...
12 2 NaN Inventory of Chemicals; Chemical Data Reportin...
13 2 63101-50-8 Inventory of Chemicals; Chemical Data Reportin...
14 2 1664-98-8 Inventory of Chemicals; Chemical Data Reportin...
15 2 NaN Inventory of Chemicals; Chemical Data Reportin...
16 2 NaN Inventory of Chemicals; Chemical Data Reportin...
17 2 NaN Inventory of Chemicals; Chemical Data Reportin...
18 2 NaN Inventory of Chemicals; Chemical Data Reportin...
19 3 NaN Inventory of Chemicals; Chemical Data Reportin...
20 6 NaN Inventory of Chemicals; Chemical Data Reportin...
21 7 NaN Inventory of Chemicals; Chemical Data Reportin...
22 7 NaN Inventory of Chemicals; Chemical Data Reportin...
23 7 NaN Inventory of Chemicals; Chemical Data Reportin...
24 7 NaN Inventory of Chemicals; Chemical Data Reportin...
25 11 NaN Inventory of Chemicals; Chemical Data Reportin...
26 11 NaN Inventory of Chemicals; Chemical Data Reportin...
27 11 NaN Inventory of Chemicals; Chemical Data Reportin...
synonyms
0 formaldehyde ...%; FORMALIN; Formaldehyde; for...
1 formaldehyde ...%; FORMALIN; Formaldehyde; for...
2 formaldehyde ...%; FORMALIN; Formaldehyde; for...
3 formaldehyde ...%; FORMALIN; Formaldehyde; for...
4 formaldehyde ...%; FORMALIN; Formaldehyde; for...
5 formaldehyde ...%; FORMALIN; Formaldehyde; for...
6 formaldehyde ...%; FORMALIN; Formaldehyde; for...
7 formaldehyde ...%; FORMALIN; Formaldehyde; for...
8 formaldehyde ...%; FORMALIN; Formaldehyde; for...
9 formaldehyde ...%; FORMALIN; Formaldehyde; for...
10 formaldehyde ...%; FORMALIN; Formaldehyde; for...
11 formaldehyde ...%; FORMALIN; Formaldehyde; for...
12 formaldehyde ...%; FORMALIN; Formaldehyde; for...
13 formaldehyde ...%; FORMALIN; Formaldehyde; for...
14 formaldehyde ...%; FORMALIN; Formaldehyde; for...
15 formaldehyde ...%; FORMALIN; Formaldehyde; for...
16 formaldehyde ...%; FORMALIN; Formaldehyde; for...
17 formaldehyde ...%; FORMALIN; Formaldehyde; for...
18 formaldehyde ...%; FORMALIN; Formaldehyde; for...
19 formaldehyde ...%; FORMALIN; Formaldehyde; for...
20 formaldehyde ...%; FORMALIN; Formaldehyde; for...
21 formaldehyde ...%; FORMALIN; Formaldehyde; for...
22 formaldehyde ...%; FORMALIN; Formaldehyde; for...
23 formaldehyde ...%; FORMALIN; Formaldehyde; for...
24 formaldehyde ...%; FORMALIN; Formaldehyde; for...
25 formaldehyde ...%; FORMALIN; Formaldehyde; for...
26 formaldehyde ...%; FORMALIN; Formaldehyde; for...
27 formaldehyde ...%; FORMALIN; Formaldehyde; for...
Columns: ['inchikey', 'inchi', 'inchi_id', 'query_id', 'rank', 'cas', 'sources', 'synonyms']
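The table above lists query_id 310465 and 361811 twice, once at rank 1 and once at rank 2. A minimal sketch of keeping only the best-ranked row per query_id follows. It assumes that a lower rank means a better match (the tutorial does not state the direction), and the rows are a hand-copied subset of the printed output rather than a live query. The helper name `best_rank_per_query` is hypothetical.

```python
# Hand-copied subset of the rows printed above (query_id, rank, cas).
rows = [
    {"query_id": 8671, "rank": 1, "cas": "50-00-0"},
    {"query_id": 310465, "rank": 1, "cas": None},
    {"query_id": 361811, "rank": 1, "cas": None},
    {"query_id": 361811, "rank": 2, "cas": None},
    {"query_id": 310465, "rank": 2, "cas": None},
]


def best_rank_per_query(rows):
    """Keep one row per query_id: the one with the smallest rank.

    Assumption: a smaller rank is a better match.
    """
    best = {}
    for row in rows:
        current = best.get(row["query_id"])
        if current is None or row["rank"] < current["rank"]:
            best[row["query_id"]] = row
    return list(best.values())


deduplicated = best_rank_per_query(rows)
for row in deduplicated:
    print(row)
```

On the DataFrame itself, the same idea can be written with pandas as `df.sort_values("rank").drop_duplicates("query_id")`.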
14.4 Batch Get ID Tables from InChIKey List¶
Process multiple InChIKey strings at once:
# Get InChIKeys from our example CAS list
inchikey_list = []
for cas in example_cas_list:
    table = zpm.get_id_table_from_cas(cas)
    if table is not None and len(table) > 0:
        inchikey_list.append(table['inchikey'].iloc[0])
print(f"Testing with {len(inchikey_list)} InChIKey strings:")
for key in inchikey_list:
    print(f" - {key}")
# Batch process
batch_df_keys = zpm.batch_get_id_table_from_inchikey(inchikey_list)
print("\nBatch ID Table from InChIKey list:")
print(batch_df_keys[['inchikey', 'cas', 'rank']].head(10))
print(f"\nTotal rows: {len(batch_df_keys)}")
print(f"Unique InChIKeys processed: {batch_df_keys['inchikey'].nunique()}")
Testing with 3 InChIKey strings:
- WSFSSNUMVMOOMR-UHFFFAOYSA-N
- BSYNRYMUTXBXSQ-UHFFFAOYSA-N
- LFQSCWFLJHTTHZ-UHFFFAOYSA-N
Batch ID Table from InChIKey list:
inchikey cas rank
0 WSFSSNUMVMOOMR-UHFFFAOYSA-N 50-00-0 1
1 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
2 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
3 WSFSSNUMVMOOMR-UHFFFAOYSA-N 30525-89-4 1
4 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
5 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
6 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
7 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
8 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 1
9 WSFSSNUMVMOOMR-UHFFFAOYSA-N NaN 2
Total rows: 86
Unique InChIKeys processed: 3
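When InChIKeys come from spreadsheets or publications, it can help to screen them before a batch call. The sketch below checks only the standard InChIKey layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, 1 uppercase letter). It does not verify that a key exists in the database, and the helper name `split_valid_inchikeys` is hypothetical, not part of ZeroPM.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def split_valid_inchikeys(candidates):
    """Strip whitespace and split candidates into (valid, rejected) lists."""
    valid, rejected = [], []
    for key in candidates:
        cleaned = key.strip()
        if INCHIKEY_PATTERN.fullmatch(cleaned):
            valid.append(cleaned)
        else:
            rejected.append(cleaned)
    return valid, rejected


valid_keys, rejected_keys = split_valid_inchikeys([
    "WSFSSNUMVMOOMR-UHFFFAOYSA-N",
    " BSYNRYMUTXBXSQ-UHFFFAOYSA-N ",   # stray whitespace is tolerated
    "wsfssnumvmoomr-uhfffaoysa-n",     # lower case is rejected
    "WSFSSNUMVMOOMR",                  # truncated key is rejected
])
print("valid:", valid_keys)
print("rejected:", rejected_keys)
```

Only the `valid_keys` list would then be passed on to `batch_get_id_table_from_inchikey()`.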
14.5 Get ID Table from Chemical Name¶
The get_id_table_from_name() method returns all identifiers for a given chemical name (exact match).
# Try to find a chemical name in the database
# First, let's see what names are available for formaldehyde
formaldehyde_names = zpm.get_names("50-00-0")
print(f"Available names for formaldehyde (CAS 50-00-0): {formaldehyde_names[:5]}\n")
if formaldehyde_names:
    # Use the first name
    example_name = formaldehyde_names[0]
    print(f"Using name: '{example_name}'\n")
    # Get ID table from name
    df_from_name = zpm.get_id_table_from_name(example_name)
    if df_from_name is not None:
        print("ID Table from Name:")
        print(df_from_name)
        print(f"\nColumns: {list(df_from_name.columns)}")
    else:
        print(f"Name '{example_name}' not found in database as a query")
else:
    print("No names found for formaldehyde")
Available names for formaldehyde (CAS 50-00-0): ['formaldehyde ...%', 'FORMALIN', 'Formaldehyde', 'formaldehyde ... %', 'Formalin']
Using name: 'formaldehyde ...%'
ID Table from Name:
name query_id inchi_id rank inchi inchikey cas \
0 formaldehyde ...% 104113 None None None None None
sources
0 CLP Annex VI
Columns: ['name', 'query_id', 'inchi_id', 'rank', 'inchi', 'inchikey', 'cas', 'sources']
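Because the lookup is an exact match, spelling variants such as 'formaldehyde ...%' and 'formaldehyde ... %' are separate queries. The sketch below groups the five names printed above by a simple normalized form so that such variants become visible. The normalization rule (drop whitespace, ignore case) is only an illustrative assumption and is not something ZeroPM applies.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Drop all whitespace and ignore case (illustrative rule only)."""
    return re.sub(r"\s+", "", name).casefold()


# The five names printed in the output above.
names = [
    "formaldehyde ...%",
    "FORMALIN",
    "Formaldehyde",
    "formaldehyde ... %",
    "Formalin",
]

# Map each normalized form to the original spellings that collapse onto it.
variants = defaultdict(list)
for name in names:
    variants[normalize_name(name)].append(name)

for normalized, originals in variants.items():
    print(f"{normalized!r}: {originals}")
```

Each original spelling in a group can then be tried in turn with `get_id_table_from_name()`, since one variant may resolve to structure identifiers where another does not.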
14.6 Batch Get ID Tables from Name List¶
Process multiple chemical names at once:
# Get a few chemical names by querying the database directly
zpm.cursor.execute("""
    SELECT query
    FROM api_ready_query
    WHERE type = 'chemical name'
    LIMIT 5
""")
name_results = zpm.cursor.fetchall()
name_list = [row[0] for row in name_results]
print(f"Testing with {len(name_list)} chemical names:")
for name in name_list:
    print(f" - {name}")
# Batch process
batch_df_names = zpm.batch_get_id_table_from_name(name_list)
if batch_df_names is not None and len(batch_df_names) > 0:
    print("\nBatch ID Table from Name list:")
    print(batch_df_names[['name', 'cas', 'inchikey', 'rank']].head(10))
    print(f"\nTotal rows: {len(batch_df_names)}")
    print(f"Unique names processed: {batch_df_names['name'].nunique()}")
else:
    print("\nNo results found for the name list")
Testing with 5 chemical names:
- cinerin II
- 3-(but-2-enyl)-2-methyl-4-oxocyclopent-2-enyl 2,2-dimethyl-3-(3-methoxy-2-methyl-3-oxoprop-1-enyl)cyclopropanecarboxylate
- N-(2-(4-amino-N-ethyl-m-toluidino)ethyl)methanesulphonamide sesquisulphate
- 4-(N-ethyl-N-2-methanesulphonylaminoethyl)-2-methylphenylenediamine sesquisulphate monohydrate
- N-(2-(4-amino-N-ethyl-m-toluidino)ethyl)methanesulfonamide sesquisulfate
Batch ID Table from Name list:
name cas inchikey rank
0 cinerin II 121-20-0 SHCRDCOTRILILT-WOBDGSLYSA-N 1
1 cinerin II 121-20-0 SHCRDCOTRILILT-DCXZXJRMSA-N 2
2 cinerin II 121-54-0 SHCRDCOTRILILT-DCXZXJRMSA-N 2
3 cinerin II 121-20-0 SHCRDCOTRILILT-UHFFFAOYSA-N 3
4 cinerin II NaN SHCRDCOTRILILT-LGPFIRNVSA-N 4
5 cinerin II NaN SHCRDCOTRILILT-YZDDFNNUSA-N 5
6 cinerin II NaN SHCRDCOTRILILT-SIMJFJABSA-N 6
7 cinerin II NaN SHCRDCOTRILILT-GFCFHAQJSA-N 7
8 cinerin II NaN WZRUHNUBXOTVHG-KTVYOGNXSA-N 8
9 cinerin II NaN KHNZEDVIMMWGSA-DCXZXJRMSA-N 9
Total rows: 19
Unique names processed: 5
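The cell above reads names straight from the `api_ready_query` table with literal values in the SQL text. A minimal sketch of the same query with bound parameters is shown below against an in-memory stand-in. Only the table name and the `query` and `type` columns come from the cell above; the sample rows, the `'CAS'` type value, and the helper `sample_queries` are assumptions, and the real ZeroPM schema may differ.

```python
import sqlite3

# In-memory stand-in for the real database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_ready_query (query TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO api_ready_query (query, type) VALUES (?, ?)",
    [
        ("cinerin II", "chemical name"),
        ("50-00-0", "CAS"),  # hypothetical type label
        ("Formaldehyde", "chemical name"),
    ],
)


def sample_queries(connection, query_type, limit):
    """Return up to `limit` query strings of the given type, in insertion order."""
    cursor = connection.execute(
        "SELECT query FROM api_ready_query WHERE type = ? ORDER BY rowid LIMIT ?",
        (query_type, limit),
    )
    return [row[0] for row in cursor.fetchall()]


name_sample = sample_queries(conn, "chemical name", 5)
print(name_sample)
```

Binding values with `?` placeholders avoids quoting problems when a type or name contains apostrophes, which is common in chemical nomenclature.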
14.7 Summary of ID Table Methods¶
All six ID table methods provide:
- Complete identifier mapping: Returns all related identifiers in a DataFrame
- Rank information: Shows the relevance/confidence of each match
- Batch processing: Efficient handling of multiple queries
- Consistent output format: Easy to combine and analyze results
Use cases:
- get_id_table_from_cas() - When you have CAS numbers and need all associated identifiers
- get_id_table_from_inchi() - When you have InChI strings from calculations or other sources
- get_id_table_from_inchikey() - When you have InChIKeys from databases or publications
- get_id_table_from_name() - When you have exact chemical names (not fuzzy matching)
The batch versions are recommended when processing multiple chemicals for better performance.
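One common reason a batch lookup is faster is that it issues a single SQL statement instead of one per identifier. The sketch below illustrates that pattern on an in-memory SQLite table. It is not ZeroPM's actual implementation: the `identifiers` table and both helper functions are hypothetical.

```python
import sqlite3

# Hypothetical two-column table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE identifiers (cas TEXT, inchikey TEXT)")
conn.executemany(
    "INSERT INTO identifiers VALUES (?, ?)",
    [
        ("50-00-0", "WSFSSNUMVMOOMR-UHFFFAOYSA-N"),
        ("67-56-1", "OKKJLVBELUTLKV-UHFFFAOYSA-N"),
        ("64-17-5", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
    ],
)


def lookup_one_by_one(connection, cas_numbers):
    """One SELECT per CAS number."""
    result = {}
    for cas in cas_numbers:
        row = connection.execute(
            "SELECT inchikey FROM identifiers WHERE cas = ?", (cas,)
        ).fetchone()
        result[cas] = row[0] if row else None
    return result


def lookup_batched(connection, cas_numbers):
    """A single SELECT with an IN clause covering all CAS numbers."""
    result = {cas: None for cas in cas_numbers}
    if not cas_numbers:
        return result
    placeholders = ", ".join("?" for _ in cas_numbers)
    sql = f"SELECT cas, inchikey FROM identifiers WHERE cas IN ({placeholders})"
    for cas, inchikey in connection.execute(sql, list(cas_numbers)):
        result[cas] = inchikey
    return result


wanted = ["50-00-0", "64-17-5", "00-00-0"]
print(lookup_one_by_one(conn, wanted))
print(lookup_batched(conn, wanted))
```

Both functions return the same mapping. SQLite limits the number of bound parameters per statement, so very long lists would need to be split into chunks.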
15. CAS Conversion Methods - Convert to CAS from Various Identifiers¶
The ZeroPM class provides methods to convert from different types of chemical identifiers to CAS numbers.
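Since every method in this section returns CAS numbers, a quick format and check-digit test can catch typos before results are passed on. The sketch below implements the standard CAS check-digit rule: the last digit equals the weighted sum of the preceding digits, weighted 1, 2, 3, ... from right to left, modulo 10. The function name `is_valid_cas` is hypothetical and not part of ZeroPM.

```python
import re

# CAS layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas):
    """Return True if `cas` has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    weighted_sum = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(digits), start=1)
    )
    return weighted_sum % 10 == check_digit


for candidate in ["50-00-0", "7732-18-5", "50-00-1", "formaldehyde"]:
    print(f"{candidate!r}: {is_valid_cas(candidate)}")
```

A valid check digit only shows that the number is well formed; it does not guarantee that the CAS number is present in the database.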
15.1 Get CAS from Chemical Name¶
Convert exact chemical names to CAS numbers:
# Example: Get CAS from a chemical name
# Use a name from our previous examples
if formaldehyde_names:
    test_name = formaldehyde_names[0]  # first stored name, 'formaldehyde ...%' in this run
    print(f"Chemical name: {test_name}")
    cas_from_name = zpm.get_cas_from_name(test_name)
    print(f"CAS number(s): {cas_from_name}")
    print(f"Type: {type(cas_from_name)}")
    # If multiple CAS numbers are returned
    if isinstance(cas_from_name, list):
        print(f"\nFound {len(cas_from_name)} CAS numbers for this name")
        for i, cas in enumerate(cas_from_name, 1):
            print(f" {i}. {cas}")
else:
    # Try with another name
    cas_from_name = zpm.get_cas_from_name("methanol")
    print(f"Methanol CAS: {cas_from_name}")
Chemical name: formaldehyde ...%
CAS number(s): None
Type: <class 'NoneType'>
15.2 Get CAS from SMILES¶
Convert SMILES strings to CAS numbers. This method already existed before the other conversion methods in this section and is shown here for completeness:
# Example: Get CAS from SMILES
smiles_examples = {
    "C": "Methane",
    "CO": "Methanol",
    "CCO": "Ethanol",
    "C=O": "Formaldehyde"
}
print("Converting SMILES to CAS:")
for smiles, name in smiles_examples.items():
    cas = zpm.get_cas_from_smiles(smiles)
    print(f" {smiles:6s} ({name:15s}): {cas}")
Converting SMILES to CAS:
C      (Methane        ): ['8006-14-2', '74-82-8', '1333-86-4', '7440-44-0', '7782-42-5', '308068-56-6', '76-49-3', '133-11-9', '16291-96-6', '676-80-2', '125612-26-2', '115383-22-7', '7782-40-3', '1034343-98-0', '99685-96-8', '6532-48-5', '14762-74-4', '131159-39-2', '90597-58-3', '3109-63-5', '64365-11-3']
CO     (Methanol       ): ['67-56-1', '3473-63-0', '125-04-2', '1849-29-2', '90-05-1', '276863-95-7', '533-67-5', '97-67-6', '10399-13-0', '7682-20-4', '2969-81-5', '122-66-7', '135646-98-9', '526-73-8', '5026-62-0', '72-18-4', '147-85-3', '73231-34-2', '14166-21-3', '80875-98-5', '288-32-4', '53-03-2', '59-51-8', '62211-93-2', '76721-89-6', '526-83-0', '15307-79-6', '2016-36-6', '119-56-2', '520-26-3', '63675-74-1', '2746-19-2', '2295-31-0', '61278-21-5', '38966-21-1', '93-60-7', '41340-36-7', '507-09-5', '101020-79-5', '14742-26-8', '24198-97-8', '1116-77-4', '63-91-2', '221176-49-4', '26473-47-2', '27918-19-0', '42288-26-6', '73851-70-4', '73-40-5', '120-94-5', '183288-43-9', '112827-99-3', '18297-63-7', '28049-61-8', '122111-11-9', '6368-20-3', '123-30-8', '141645-16-1', '1264-62-6', '481-29-8', '1455-13-6', '104987-12-4', '86393-33-1', '110567-22-1', '5202-89-1', '192725-50-1', '40431-63-8', '134-85-0', '179463-17-3', '219861-08-2', '56-12-2', '51146-56-6', '149437-76-3', '256-96-2', '5543-57-7', '72126-78-4', '143322-58-1', '183193-59-1', '109-11-5', '41078-70-0', '138530-94-6', '24169-02-6', '125-65-5', '811-98-3', '27262-45-9', '147-71-7', '84680-54-6', '51146-57-7', '61036-62-2', '27262-47-1', '162515-68-6', '114772-34-8', '720-94-5', '50-03-3', '112704-79-7', '93957-50-7', '5445-51-2', '1075-89-4', '473-98-3', '25322-68-3', '102767-31-7', '636-21-5', '13408-09-8', '58-00-4', '78441-62-0', '131-01-1', '3788-94-1', '42835-25-6', '80841-78-7', '46022-05-3', '43229-70-5', '382-67-2', '86604-78-6', '183288-44-0', '5543-58-8', '113082-99-8', '71749-03-6', '18725-37-6', '4441-63-8', '324763-51-1', '85933-19-3', '35271-74-0', '16673-34-0', '465-99-6', '976-71-6', '117976-90-6', '139481-59-7']
CCO    (Ethanol        ): ['64-17-5', '68475-56-9', '68476-78-8', '1516-08-1', '14742-23-5', '925-93-9', '9002-89-5', '90604-31-2']
C=O    (Formaldehyde   ): ['50-00-0', '30525-89-4', '630-08-0', '108-62-3', '63101-50-8', '1664-98-8']
15.3 Get CAS from Molecular Formula¶
Convert molecular formulas to CAS numbers. Note that formulas are not unique - many isomers can share the same formula:
# Example: Get CAS from molecular formula
# Warning: This can be slow for the first run as it scans the entire database
formulas_to_test = ["CH2O", "H2O", "CH4O"]
print("Converting molecular formulas to CAS numbers:\n")
for formula in formulas_to_test:
    print(f"Formula: {formula}")
    cas_list = zpm.get_cas_from_formula(formula)
    if cas_list:
        print(f" Found {len(cas_list)} chemicals with this formula")
        # Show first 5 CAS numbers
        for i, cas in enumerate(cas_list[:5], 1):
            print(f" {i}. {cas}")
        if len(cas_list) > 5:
            print(f" ... and {len(cas_list) - 5} more")
    else:
        print(f" No chemicals found with formula {formula}")
    print()
Converting molecular formulas to CAS numbers:
Formula: CH2O
Found 7 chemicals with this formula
1. 108-62-3
2. 1664-98-8
3. 30525-89-4
4. 3228-27-1
5. 50-00-0
... and 2 more
Formula: H2O
Found 6 chemicals with this formula
1. 13768-40-6
2. 14280-30-9
3. 14314-42-2
4. 17778-80-2
5. 7732-18-5
... and 1 more
Formula: CH4O
Found 129 chemicals with this formula
1. 101020-79-5
2. 102767-31-7
3. 10399-13-0
4. 104987-12-4
5. 1075-89-4
... and 124 more
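The formulas queried above ("CH2O", "H2O", "CH4O") are written in Hill order: carbon first, hydrogen second, then the remaining elements alphabetically. Whether `get_cas_from_formula()` accepts other orderings is not stated here, so the sketch below shows one way to rewrite a simple formula into Hill order before querying. It handles only flat formulas without parentheses, charges, or hydrate dots, and the helper names are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C2H5OH' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def hill_formula(formula):
    """Rewrite a flat formula in Hill order."""
    counts = parse_formula(formula)
    if "C" in counts:
        # With carbon present: C, then H, then everything else alphabetically.
        others = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + others
    else:
        # Without carbon: all elements alphabetically.
        order = sorted(counts)
    return "".join(
        f"{element}{counts[element] if counts[element] > 1 else ''}"
        for element in order
    )


for raw in ["H2CO", "OH2", "C2H5OH", "CH4O"]:
    print(f"{raw:8s} -> {hill_formula(raw)}")
```

Normalizing first means that "H2CO" and "CH2O" lead to the same lookup string.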
15.4 Batch CAS Conversion Methods¶
Process multiple identifiers efficiently:
# Batch convert SMILES to CAS
smiles_list = ["C", "CO", "CCO", "C=O"]
print("Batch SMILES to CAS conversion:")
results_smiles = zpm.batch_get_cas_from_smiles(smiles_list)
for smiles, cas in results_smiles.items():
    print(f" {smiles:6s} -> {cas}")
print("\n" + "="*50 + "\n")
# Batch convert names to CAS
if formaldehyde_names:
    # Use a few names from formaldehyde
    test_names = formaldehyde_names[:3]
    print("Batch names to CAS conversion:")
    results_names = zpm.batch_get_cas_from_name(test_names)
    for name, cas in results_names.items():
        print(f" {name:20s} -> {cas}")
print("\n" + "="*50 + "\n")
# Batch convert formulas to CAS
formula_batch = ["CH4", "CH4O", "C2H6O"]
print("Batch formulas to CAS conversion:")
results_formulas = zpm.batch_get_cas_from_formula(formula_batch)
for formula, cas_list in results_formulas.items():
    if cas_list:
        print(f" {formula:8s} -> {len(cas_list)} chemicals found")
    else:
        print(f" {formula:8s} -> Not found")
Batch SMILES to CAS conversion:
C      -> ['8006-14-2', '74-82-8', '1333-86-4', '7440-44-0', '7782-42-5', '308068-56-6', '76-49-3', '133-11-9', '16291-96-6', '676-80-2', '125612-26-2', '115383-22-7', '7782-40-3', '1034343-98-0', '99685-96-8', '6532-48-5', '14762-74-4', '131159-39-2', '90597-58-3', '3109-63-5', '64365-11-3']
CO     -> ['67-56-1', '3473-63-0', '125-04-2', '1849-29-2', '90-05-1', '276863-95-7', '533-67-5', '97-67-6', '10399-13-0', '7682-20-4', '2969-81-5', '122-66-7', '135646-98-9', '526-73-8', '5026-62-0', '72-18-4', '147-85-3', '73231-34-2', '14166-21-3', '80875-98-5', '288-32-4', '53-03-2', '59-51-8', '62211-93-2', '76721-89-6', '526-83-0', '15307-79-6', '2016-36-6', '119-56-2', '520-26-3', '63675-74-1', '2746-19-2', '2295-31-0', '61278-21-5', '38966-21-1', '93-60-7', '41340-36-7', '507-09-5', '101020-79-5', '14742-26-8', '24198-97-8', '1116-77-4', '63-91-2', '221176-49-4', '26473-47-2', '27918-19-0', '42288-26-6', '73851-70-4', '73-40-5', '120-94-5', '183288-43-9', '112827-99-3', '18297-63-7', '28049-61-8', '122111-11-9', '6368-20-3', '123-30-8', '141645-16-1', '1264-62-6', '481-29-8', '1455-13-6', '104987-12-4', '86393-33-1', '110567-22-1', '5202-89-1', '192725-50-1', '40431-63-8', '134-85-0', '179463-17-3', '219861-08-2', '56-12-2', '51146-56-6', '149437-76-3', '256-96-2', '5543-57-7', '72126-78-4', '143322-58-1', '183193-59-1', '109-11-5', '41078-70-0', '138530-94-6', '24169-02-6', '125-65-5', '811-98-3', '27262-45-9', '147-71-7', '84680-54-6', '51146-57-7', '61036-62-2', '27262-47-1', '162515-68-6', '114772-34-8', '720-94-5', '50-03-3', '112704-79-7', '93957-50-7', '5445-51-2', '1075-89-4', '473-98-3', '25322-68-3', '102767-31-7', '636-21-5', '13408-09-8', '58-00-4', '78441-62-0', '131-01-1', '3788-94-1', '42835-25-6', '80841-78-7', '46022-05-3', '43229-70-5', '382-67-2', '86604-78-6', '183288-44-0', '5543-58-8', '113082-99-8', '71749-03-6', '18725-37-6', '4441-63-8', '324763-51-1', '85933-19-3', '35271-74-0', '16673-34-0', '465-99-6', '976-71-6', '117976-90-6', '139481-59-7']
CCO    -> ['64-17-5', '68475-56-9', '68476-78-8', '1516-08-1', '14742-23-5', '925-93-9', '9002-89-5', '90604-31-2']
C=O    -> ['50-00-0', '30525-89-4', '630-08-0', '108-62-3', '63101-50-8', '1664-98-8']
==================================================
Batch names to CAS conversion:
formaldehyde ...%    -> None
FORMALIN             -> ['108-62-3', '1664-98-8', '30525-89-4', '50-00-0', '630-08-0', '63101-50-8']
Formaldehyde         -> ['1003-90-3', '1034343-98-0', '106349-49-9', '108-62-3', '115383-22-7', '125612-26-2', '131159-39-2', '133-11-9', '1333-86-4', '14762-74-4', '16291-96-6', '1664-98-8', '25301-02-4', '30525-89-4', '308068-56-6', '3109-63-5', '50-00-0', '630-08-0', '63101-50-8', '64365-11-3', '6532-48-5', '676-80-2', '74-82-8', '7440-44-0', '76-49-3', '7782-40-3', '7782-42-5', '8006-14-2', '90597-58-3', '99685-96-8', '99896-05-6']
==================================================
Batch formulas to CAS conversion:
CH4      -> 25 chemicals found
CH4O     -> 129 chemicals found
C2H6O    -> 10 chemicals found
15.5 Summary of CAS Conversion Methods¶
Available methods:
- get_cas_from_name(name) - Convert exact chemical name to CAS
- get_cas_from_smiles(smiles) - Convert SMILES to CAS (via InChI)
- get_cas_from_formula(formula) - Convert molecular formula to CAS (returns list, non-unique)
- get_cas_from_inchi(inchi) - Convert InChI to CAS
- get_cas_from_inchikey(inchikey) - Convert InChIKey to CAS
Batch versions:
- batch_get_cas_from_name(name_list)
- batch_get_cas_from_smiles(smiles_list)
- batch_get_cas_from_formula(formula_list)
- batch_get_cas_from_inchikey(inchikey_list)
Important notes:
- Name conversion requires exact matches. Use query_similar_name() for fuzzy matching
- Formula conversion is not unique - multiple chemicals can have the same formula
- SMILES conversion works by first converting to InChI
- All methods return None if no match is found
- Some methods may return a list of CAS numbers if multiple matches exist
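Because a conversion can yield None or a list, downstream code is simpler if every result is coerced to a list first. The sketch below shows one way to do that. The helper `as_cas_list` is hypothetical, it also accepts a single string in case a method returns one, and the sample results are copied from the batch name output earlier in this section rather than fetched live.

```python
def as_cas_list(result):
    """Coerce None, a single CAS string, or an iterable of CAS strings to a list."""
    if result is None:
        return []
    if isinstance(result, str):
        return [result]
    return list(result)


# Results copied from the batch name conversion output shown earlier.
results_names = {
    "formaldehyde ...%": None,
    "FORMALIN": ["108-62-3", "1664-98-8", "30525-89-4",
                 "50-00-0", "630-08-0", "63101-50-8"],
}

for name, result in results_names.items():
    cas_numbers = as_cas_list(result)
    print(f"{name}: {len(cas_numbers)} CAS number(s)")
```

With this in place, loops and length checks work the same way whether a name matched nothing, one CAS number, or several.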
Conclusion¶
This notebook has demonstrated the main features of the ZeroPM class:
✓ Querying by CAS number and chemical name
✓ Fuzzy name matching for handling variations
✓ Converting between chemical identifiers
✓ Batch processing for efficiency
✓ Advanced searches (regex, substructure)
✓ Exporting results to CSV
✓ Performance optimization with indexes
The ZeroPM class provides a convenient Python interface to the ZeroPM database, making it easy to work with chemical identifiers in your research and applications.
For more information, see the PROVESID documentation.