CLI tools for discovering, enriching, and annotating bio.tools entries with help from Pub2Tools, heuristic scraping, and Ollama-based scoring.
- Overview
- Installation
- Quick Start
- Configuration
- Running the Pipeline
- Generated Outputs
- Resume & Caching
- Troubleshooting & Tips
- Development
- License
- Documentation
## Overview

- Fetch candidate records from Pub2Tools exports or existing JSON files.
- Enrich candidates with homepage metadata, documentation links, repositories, and publication context.
- Score bioinformatics relevance and documentation quality using an Ollama model.
- Generate concise, optimized descriptions via the LLM while preserving EDAM annotations.
- Produce strict biotoolsSchema payloads plus human-readable assessment reports.
- Resume any stage (gather, enrich, score) from cached artifacts to accelerate iteration.
## Installation

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
ollama pull llama3.2  # optional, only needed for LLM scoring
```

Need defaults? Generate a starter configuration with:

```bash
biotoolsannotate --write-default-config
```
## Quick Start

```bash
# Dry-run against sample data (no network calls)
biotoolsannotate --input tests/fixtures/pub2tools/sample.json --dry-run

# Fetch fresh candidates from the last 7 days and score them
biotoolsannotate --from-date 7d --min-score 0.6

# Re-run only the scoring step with cached enrichment
biotoolsannotate --resume-from-enriched --resume-from-scoring --dry-run
```

## Configuration

Configuration is YAML-driven. The CLI loads `config.yaml` from the project root by default and falls back to internal defaults when absent. All placeholders marked `__VERSION__` resolve to the installed package version at runtime.
During each run the pipeline scans the Pub2Tools export folders (for example `pub2tools/to_biotools.json` or `pub2tools/biotools_entries.json`) and uses them to flag candidates already present in bio.tools. When the Pub2Tools CLI is invoked, outputs are written directly to `out/range_<from>_to_<to>/pub2tools/`. You can also point to any standalone registry snapshot via `pipeline.registry_path` or `--registry` when you want to bypass Pub2Tools exports entirely.
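For intuition, the membership flagging is essentially a name lookup against the snapshot. The following is a simplified illustration, not the pipeline's actual code; it assumes a flat JSON array of entries with `name` fields and case-insensitive matching.

```python
"""Illustration of the registry membership check (not the pipeline's code).

Assumes the snapshot is a flat JSON array of bio.tools entries with a
"name" field; real snapshots may wrap the list differently.
"""
import json
from pathlib import Path

snapshot = json.loads(Path("data/biotools_snapshot.json").read_text())
known_names = {entry["name"].strip().lower() for entry in snapshot}

candidates = json.loads(Path("data/candidates.json").read_text())
for tool in candidates:
    tool["in_biotools"] = tool["name"].strip().lower() in known_names

print(sum(t["in_biotools"] for t in candidates), "candidates already registered")
```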
| Purpose | Config key | CLI flag | Notes |
|---|---|---|---|
| Custom input | `pipeline.custom_pub2tools_biotools_json` | `--custom-pub2tools-json PATH` | Use a custom Pub2Tools `to_biotools.json` export instead of date-based fetching |
| Registry snapshot | `pipeline.registry_path` | `--registry PATH` | Supply an external bio.tools JSON/JSONL snapshot for membership checks |
| Date range | `pipeline.from_date`, `pipeline.to_date` | `--from-date`, `--to-date` | Accept relative windows like `7d` or ISO dates |
| Thresholds | `pipeline.min_bio_score`, `pipeline.min_documentation_score` | `--min-bio-score`, `--min-doc-score` | Set both via the legacy `--min-score` if desired |
| bio.tools API validation | `pipeline.validate_biotools_api` | `--validate-biotools-api` | Validate payload entries against the live bio.tools API after scoring (default: `false`). See bio.tools API Validation Setup below. |
| Upload to bio.tools | `pipeline.upload.enabled` | `--upload` | Upload new entries to the bio.tools registry after payload generation (requires a `.bt_token` file). See Uploading to bio.tools below. |
| Offline mode | `pipeline.offline` | `--offline` | Disables homepage scraping and Europe PMC enrichment |
| Ollama model | `ollama.model` | `--model` | Defaults to `llama3.2`; override per run |
| LLM temperature | `ollama.temperature` | (config only) | Lower values tighten determinism; the default of `0.01` favors high-precision scoring |
| Concurrency | `ollama.concurrency` | `--concurrency` | Controls the number of parallel scoring workers |
| Logging | `logging.level`, `logging.file` | `--verbose`, `--quiet` | Flags override the configured log level; the log file path is set in config |
## Running the Pipeline

Common invocations:

```bash
# Custom date window
biotoolsannotate --from-date 2024-01-01 --to-date 2024-03-31

# Offline mode (no network scraping or Europe PMC requests)
biotoolsannotate --offline

# Start from local candidates and a separate registry snapshot
biotoolsannotate --input data/candidates.json --registry data/biotools_snapshot.json --offline

# Limit the number of candidates processed
biotoolsannotate --limit 25

# Point to a specific config file
biotoolsannotate --config myconfig.yaml
```

Use `biotoolsannotate --help` to explore all available flags, including concurrency settings, progress display, and resume options.
### bio.tools API Validation Setup

To enable live validation of generated payloads against the bio.tools schema:

1. Obtain a token from the bio.tools team for the development server.
2. Create a `.bt_token` file in the repository root:

   ```bash
   echo "your-dev-token" > .bt_token
   ```

3. Configure the dev endpoints in your config file (e.g., `myconfig.yaml`):

   ```yaml
   pipeline:
     validate_biotools_api: true
     biotools_api_base: "https://bio-tools-dev.sdu.dk/api/tool/"
     biotools_validate_api_base: "https://bio-tools-dev.sdu.dk/api/tool/validate/"
   ```

4. Run the pipeline:

   ```bash
   biotoolsannotate --config myconfig.yaml
   ```

Look for `✓ Found bio.tools authentication token` in the console output. If no token is present, the pipeline falls back to local Pydantic validation automatically.

For detailed setup, troubleshooting, and current limitations, see `docs/BIOTOOLS_API_VALIDATION.md`.
### Uploading to bio.tools

After generating and validating payloads, you can upload new entries directly to the bio.tools registry:

1. Obtain a token from the bio.tools team (development or production).
2. Create a `.bt_token` file in the repository root:

   ```bash
   echo "your-token-here" > .bt_token
   ```

3. Run the pipeline with upload enabled:

   ```bash
   biotoolsannotate --upload
   ```

The pipeline will:

- Check each entry against the bio.tools registry (via a GET request)
- Upload new entries using POST (returns `201 Created`)
- Skip entries that already exist with status `"skipped"`
- Retry transient server errors (503, 504) with exponential backoff
- Log all outcomes to `upload_results.jsonl`

Note: the upload feature only creates new entries. Existing tools will NOT be updated; they are skipped with a status message.

Control upload behavior in your `config.yaml`:

```yaml
pipeline:
  upload:
    enabled: false                    # Set true to enable by default
    retry_attempts: 3                 # Number of retries for transient errors
    retry_delay: 1.0                  # Initial delay in seconds (exponential backoff)
    batch_delay: 0.5                  # Delay between entries to avoid rate limits
    log_file: "upload_results.jsonl"  # Result tracking file
```

Upload results include:

- `biotools_id`: Tool identifier
- `status`: `"uploaded"`, `"failed"`, or `"skipped"`
- `error`: Error message if the upload failed
- `response_code`: HTTP status code
- `timestamp`: Upload attempt timestamp
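To audit a run afterwards, the JSONL log folds easily into a quick summary. The sketch below uses only the fields listed above; the run-folder path is a placeholder.

```python
"""Summarize upload outcomes from upload_results.jsonl (illustrative)."""
import json
from collections import Counter
from pathlib import Path

log_path = Path("out/custom_tool_set/upload_results.jsonl")  # adjust to your run folder

counts: Counter = Counter()
for line in log_path.read_text().splitlines():
    record = json.loads(line)
    counts[record["status"]] += 1
    if record["status"] == "failed":
        # Surface the HTTP code and error message for each failed upload
        print(record["biotools_id"], record.get("response_code"), record.get("error"))

print(dict(counts))  # e.g. {'uploaded': 12, 'skipped': 4, 'failed': 1}
```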
## Generated Outputs

Each run writes artifacts to one of the following folders:

- `out/<range_start>_to_<range_end>/...` when the pipeline gathers candidates from Pub2Tools.
- `out/custom_tool_set/...` whenever you supply `--input` or set `BIOTOOLS_ANNOTATE_INPUT/JSON` (resume flags reuse the same directory).

The selected folder contains:
| Path | Description |
|---|---|
| `exports/biotools_payload.json` | biotoolsSchema-compliant payload ready for upload |
| `exports/biotools_entries.json` | Full entries including enriched metadata |
| `reports/assessment.csv` | Primary assessment file: spreadsheet-friendly scoring results. Includes `in_biotools`, `confidence_score`, and `manual_decision` for overriding decisions; adds `biotools_api_status`, `api_name`, and `api_description` when `--validate-biotools-api` is enabled |
| `cache/enriched_candidates.json.gz` | Cached candidates after enrichment for quick resumes |
| `logs/ollama/ollama.log` | Human-readable, append-only log of every LLM request and response |
| `ollama/trace.jsonl` | Machine-readable trace with prompt variants, options, statuses, and parsed JSON payloads |
| `config.generated.yaml` or `<original-config>.yaml` | Snapshot of the configuration used for the run |
Each record in `reports/assessment.jsonl` carries a `model_params` object describing how the LLM behaved during scoring. Key fields include:

| Field | Meaning |
|---|---|
| `attempts` | Number of prompt/response cycles the scorer performed (minimum 1) |
| `schema_errors` | Ordered list of validation errors returned by the JSON schema validator for each failed attempt |
| `prompt_augmented` | `true` when the scorer appended schema-error feedback to the prompt before retrying |
| `trace_attempts` | Ordered list of trace metadata objects (`trace_id`, `attempt`, `prompt_kind`, `status`, `schema_errors`) that aligns with the JSONL trace for reproducible auditing |
These diagnostics mirror the telemetry requirement captured in OpenSpec and help correlate downstream decisions with LLM stability. Consumers that previously ignored `model_params` should update their parsers to accommodate the new keys.
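If you consume these diagnostics programmatically, a loop along the following lines can surface unstable scoring runs. This is a sketch, not shipped tooling: it assumes the `model_params` fields above, and the run path and `name` identifier field are guesses for illustration.

```python
"""Flag assessment records where the scorer needed retries (illustrative)."""
import json
from pathlib import Path

path = Path("out/custom_tool_set/reports/assessment.jsonl")  # example run folder
for line in path.read_text().splitlines():
    record = json.loads(line)
    params = record.get("model_params", {})
    # Retries or prompt augmentation indicate the first response failed validation.
    if params.get("attempts", 1) > 1 or params.get("prompt_augmented"):
        # schema_errors holds the validator feedback for each failed attempt
        print(record.get("name"), params.get("attempts"), params.get("schema_errors"))
```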
The LLM now follows an explicit rubric when emitting the `confidence_score` field. Expect values near 0.9–1.0 only when every subcriterion is backed by clear evidence from multiple sources; mixed or inferred evidence should land around 0.3–0.8, and scarce or conflicting evidence should drop to 0.0–0.2. Review the prompt template (either in `config.yaml` or your custom config) for the full guidance and adjust it further if your use case calls for a different calibration.
## Resume & Caching

- `--resume-from-pub2tools`: Reuse the latest `to_biotools.json` export for the active time range.
- `--resume-from-enriched`: Skip ingestion and reuse `cache/enriched_candidates.json.gz`.
- `--resume-from-scoring`: Reapply thresholds to the assessment in `assessment.csv` without invoking the LLM. Supports manual editing of scores and decisions.
Combine the flags to iterate quickly on scoring thresholds and payload exports without repeating expensive steps.
The `assessment.csv` file now includes a `manual_decision` column (the first column) that lets you override the automatic classification:

| `manual_decision` | Effect |
|---|---|
| `add` | Force the tool into the "add" payload regardless of scores |
| `review` | Force the tool into the "review" payload |
| `do_not_add` | Exclude the tool from all payloads |
| (empty) | Use the automatic classification based on scores and thresholds |
When using `--resume-from-scoring`, you can also edit:

- Scores: `bio_score`, `documentation_score`, and the subscores (`bio_A1`-`A5`, `doc_B1`-`B5`)
- Registry flags: `in_biotools_name`, `in_biotools`
- Other metadata: `homepage`, `publication_ids`, `rationale`, etc.
The pipeline reads your changes from the CSV and:

- Uses `manual_decision` if provided (it overrides everything)
- Otherwise re-classifies based on the edited scores and current thresholds

This enables human-in-the-loop refinement without re-running the expensive LLM scoring step.
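For bulk overrides, you can script the CSV edit before resuming. The sketch below relies only on the columns documented above; the 0.5–0.7 threshold rule and the path are arbitrary examples.

```python
"""Bulk-set manual_decision in assessment.csv before --resume-from-scoring."""
import csv
from pathlib import Path

path = Path("out/custom_tool_set/reports/assessment.csv")  # example run folder
with path.open(newline="") as fh:
    rows = list(csv.DictReader(fh))

for row in rows:
    # Example rule: route borderline bio_score values to manual review.
    if row["bio_score"] and 0.5 <= float(row["bio_score"]) < 0.7:
        row["manual_decision"] = "review"

with path.open("w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# Then re-apply thresholds without re-scoring:
#   biotoolsannotate --resume-from-scoring
```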
## Troubleshooting & Tips

- Use `--offline` when working without network access; the pipeline disables homepage scraping and publication enrichment automatically.
- To inspect what the model saw, open `reports/assessment.csv` in any spreadsheet program.
- Use the `manual_decision` column in the CSV to override decisions when resuming from scoring.
- Health checks against the Ollama host run before scoring; failures fall back to heuristics and are summarized in the run footer.
- Adjust logging verbosity with `--quiet` or `--verbose` as needed.
## Development

- Lint: `ruff check .`
- Format: `black .`
- Type check: `mypy src`
- Tests: `pytest -q`
- Coverage: `pytest --cov=biotoolsllmannotate --cov-report=term-missing`
## License

MIT