Online and Offline Data Methods
This page explains how PROVESID separates online API access from local/offline dataset access, and how to choose the right method for each workflow.
Why this distinction matters
PROVESID supports two execution styles:
- Online methods call remote services and return current data from external APIs.
- Offline methods query local datasets and databases for fast, reproducible lookup.
In many workflows, a practical pattern is:
- Use local/offline lookup first for speed and reproducibility.
- Use online services for missing records, richer metadata, or live updates.
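The offline-first pattern above can be sketched as a small wrapper. This is an illustration under stated assumptions, not PROVESID's own API: `lookup_offline_first` and the two lookup callables are hypothetical names, and the dictionaries stand in for a local dataset and a remote service.

```python
from typing import Callable, Optional, Tuple

Record = dict  # placeholder type; real return types depend on the class used


def lookup_offline_first(
    identifier: str,
    offline_lookup: Callable[[str], Optional[Record]],
    online_lookup: Callable[[str], Optional[Record]],
) -> Tuple[Optional[Record], str]:
    """Try the local source first, then fall back to the remote one.

    Returns the record (or None) and a label for where the answer came from.
    """
    record = offline_lookup(identifier)
    if record is not None:
        return record, "offline"
    try:
        record = online_lookup(identifier)
    except OSError:
        # Assumption: network failures surface as OSError subclasses
        # (ConnectionError, TimeoutError). Treat them as "no answer"
        # so a batch run can continue.
        return None, "unavailable"
    if record is None:
        return None, "missing"
    return record, "online"


# Stand-in data sources for illustration only.
local_snapshot = {"50-00-0": {"name": "formaldehyde"}}
remote_service = {"64-17-5": {"name": "ethanol"}}

print(lookup_offline_first("50-00-0", local_snapshot.get, remote_service.get))
print(lookup_offline_first("64-17-5", local_snapshot.get, remote_service.get))
```

Returning the source label alongside the record makes it easy to log which results came from a local snapshot and which came from a live service, which helps when reproducibility matters.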
Online methods
These classes primarily use network APIs:
- PubChemAPI
- PubChemView
- NCIChemicalIdentifierResolver
- CASCommonChem
- ChEBI
- OPSIN and PYOPSIN
- ClassyFireAPI
Online method characteristics
- Requires internet connectivity.
- Subject to remote service availability and response-time variance.
- May be rate-limited depending on provider.
- Usually provides the latest upstream data.
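Because remote services can be slow, unavailable, or rate-limited, callers often wrap online requests in a retry loop with exponential backoff. The sketch below is generic: `call_with_backoff` is a hypothetical helper, not part of PROVESID, and it assumes that transient failures are raised as `OSError` subclasses. Check which exceptions the class you use actually raises before relying on it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on OSError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return request()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real delays. A typical call would pass a zero-argument function, for example `call_with_backoff(lambda: fetch(identifier))`, where `fetch` is whatever online method you are wrapping.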
Offline and local dataset methods
These classes read local data files/databases:
- CheMBL (local SQLite)
- PubChemID (local SQLite identifier database)
- CompToxID (local SQLite)
- ZeroPM (local SQLite)
- REACHDossierID (local REACH dataset)
- ChebiSDF (local SDF parsing)
Offline method characteristics
- Fast and stable lookup once data is available locally.
- Better reproducibility for pipelines and batch processing.
- Large datasets may require significant storage.
- Some classes can auto-download missing datasets.
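To show why local lookup suits batch processing, here is a minimal sketch of a batched query against a SQLite table using Python's standard `sqlite3` module. The table name, columns, and `lookup_names` helper are hypothetical and chosen only for illustration. The actual PROVESID databases have their own schemas and are normally accessed through the classes listed above.

```python
import sqlite3

# Hypothetical schema for illustration; real PROVESID databases will differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE identifiers (cas TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO identifiers VALUES (?, ?)",
    [("50-00-0", "formaldehyde"), ("64-17-5", "ethanol")],
)


def lookup_names(conn, cas_numbers):
    """Return {cas: name} for the CAS numbers present in the local table."""
    if not cas_numbers:
        return {}
    # One parameter placeholder per value keeps the query safely parameterized.
    placeholders = ",".join("?" for _ in cas_numbers)
    rows = conn.execute(
        f"SELECT cas, name FROM identifiers WHERE cas IN ({placeholders})",
        list(cas_numbers),
    )
    return dict(rows)


print(lookup_names(conn, ["50-00-0", "64-17-5", "71-43-2"]))
```

A single query resolves many identifiers at once with no network round trips, and the result depends only on the local file, so repeated runs against the same snapshot return the same answer. For very long identifier lists, split the input into chunks, since SQLite limits the number of parameters per statement.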
Storage and environment strategy
Because some local datasets are large, uv is the recommended installation workflow:
uv pip install provesid
For development:
git clone https://github.com/USEtox/PROVESID.git
cd PROVESID
uv pip install -e .
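uv installs packages from a shared global cache and links them into each environment where the filesystem allows, which avoids repeated copies across multiple virtual environments and keeps dependency management efficient while you work with large local assets.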
Practical selection guide
Use online methods when:
- you need live upstream updates,
- you need fields not present in local snapshots,
- or a local dataset for that source is not available.
Use offline methods when:
- you process many records,
- you need predictable/reproducible runs,
- you work in constrained or intermittent network environments,
- or you want lower latency per lookup.
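The guide above can also be written down as a small decision helper, which is useful when a pipeline has to pick a mode per data source. This is a sketch only: `Workflow` and `choose_method` are hypothetical names, and the order in which the rules are checked is an assumption rather than something PROVESID prescribes.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    needs_live_data: bool = False
    needs_fields_missing_locally: bool = False
    local_dataset_available: bool = True
    network_reliable: bool = True


def choose_method(w: Workflow) -> str:
    """Return "online" or "offline" following the selection guide."""
    if not w.local_dataset_available:
        return "online"  # no local snapshot exists, so online is the only option
    if not w.network_reliable:
        return "offline"  # constrained or intermittent network
    if w.needs_live_data or w.needs_fields_missing_locally:
        return "online"
    return "offline"  # default: faster and reproducible


print(choose_method(Workflow()))
print(choose_method(Workflow(needs_live_data=True)))
```

Defaulting to offline reflects the offline-first pattern described earlier on this page. Adjust the rule order if live data matters more to you than availability.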