FOI-Bioinformatics/taxbencher

Introduction

FOI-Bioinformatics/taxbencher is a bioinformatics pipeline that benchmarks taxonomic classifiers by evaluating their predictions against known ground truth. It accepts standardized taxpasta profiles (typically generated by nf-core/taxprofiler) and uses the CAMI OPAL evaluation framework to compute comprehensive performance metrics. The pipeline produces HTML reports with precision, recall, F1 scores, UniFrac distances, and other metrics to help researchers compare and select the best taxonomic classification tools for their data.

Pipeline steps:

1. Convert taxpasta profiles to CAMI Bioboxes format (TAXPASTA_TO_BIOBOXES)
2. Evaluate predictions against the gold standard (OPAL)
3. Aggregate results (MultiQC)
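
To make the evaluation step more concrete, the following is a minimal sketch of presence/absence precision, recall, and F1 at a single taxonomic rank. It is not OPAL's implementation; the function name and the taxonomy IDs are illustrative assumptions, and OPAL additionally reports abundance-based metrics that this sketch does not cover.

```python
def presence_metrics(gold_taxa, predicted_taxa):
    """Presence/absence precision, recall and F1 for one taxonomic rank."""
    gold = set(gold_taxa)
    predicted = set(predicted_taxa)
    true_pos = len(gold & predicted)    # taxa correctly reported
    false_pos = len(predicted - gold)   # taxa reported but absent from the gold standard
    false_neg = len(gold - predicted)   # gold standard taxa that were missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative taxonomy IDs only; not taken from the pipeline's test data.
gold = {"562", "1280", "1423", "1773"}
predicted = {"562", "1280", "287"}
metrics = presence_metrics(gold, predicted)
print(metrics)
```

In this example two of the three predicted taxa are in the gold standard and two of the four gold standard taxa are recovered, so precision is 2/3 and recall is 1/2.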

Features

Standardized Format Conversion: Converts taxpasta TSV to CAMI Bioboxes format
Comprehensive Metrics: Precision, recall, F1, UniFrac, Shannon diversity, Bray-Curtis, and more
Built-in Validation: Pre-flight validation tools for taxpasta and bioboxes formats
nf-core Compliance: Built using nf-core template with best practices
Full Container Support: 100% Docker coverage via Seqera Wave, plus Singularity and Conda
Extensive Testing: Full nf-test suite with validated test data
Integration Ready: Works seamlessly with nf-core/taxprofiler outputs
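
For readers unfamiliar with the abundance-based metrics listed above, here is a small sketch of the textbook definitions of Shannon diversity and Bray-Curtis dissimilarity. The function names and example profiles are assumptions for illustration; OPAL may normalize or report these quantities differently.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance."""
    total = sum(abundances.values())
    if total <= 0:
        return 0.0
    h = 0.0
    for value in abundances.values():
        if value > 0:
            p = value / total
            h -= p * math.log(p)
    return h


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity: sum|a - b| / (sum a + sum b), ranging from 0 to 1."""
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    denominator = sum(profile_a.values()) + sum(profile_b.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical relative abundances (percent) keyed by placeholder taxon names.
gold = {"taxon_A": 50.0, "taxon_B": 50.0}
predicted = {"taxon_A": 50.0, "taxon_C": 50.0}
print(shannon_index(gold))
print(bray_curtis(gold, predicted))
```

A Bray-Curtis value of 0 means the two profiles are identical, and 1 means they share no taxa.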

Usage

Note

If you are new to Nextflow and nf-core, please refer to the nf-core documentation on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Test Profiles

Two test profiles are available for validating your setup:

test: Minimal test dataset (12 taxa, 1 classifier) - Quick validation of pipeline structure
test_realistic: Comprehensive test dataset (40+ taxa, 3 classifiers) - Recommended for full functionality validation

# Quick structure validation (may fail at OPAL due to minimal data)
nextflow run FOI-Bioinformatics/taxbencher -profile test,conda

# Full functionality validation (recommended - proves pipeline works correctly)
nextflow run FOI-Bioinformatics/taxbencher -profile test_realistic,conda

Tip

The test_realistic profile is recommended for validating your installation because it includes realistic data that passes all OPAL evaluation steps. The minimal test profile may fail with OPAL spider plot errors because it contains too few taxa (a known limitation of OPAL 1.0.13).

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample_id,label,classifier,taxpasta_file,taxonomy_db
sample1,sample1_kraken2,kraken2,results/taxprofiler/sample1_kraken2.tsv,NCBI
sample1,sample1_metaphlan,metaphlan,results/taxprofiler/sample1_metaphlan.tsv,NCBI
sample2,sample2_kraken2,kraken2,results/taxprofiler/sample2_kraken2.tsv,NCBI
sample2,sample2_metaphlan,metaphlan,results/taxprofiler/sample2_metaphlan.tsv,NCBI

Each row represents a taxpasta profile from a specific classifier.

sample_id: Biological sample identifier (groups profiles from the same biological sample)
label: Unique identifier for this taxonomic profile
classifier: Taxonomic classifier tool name
taxpasta_file: Path to taxpasta profile or raw profiler output
taxonomy_db: Taxonomy database (optional, default: NCBI)

Profiles with the same sample_id are evaluated together in a single OPAL run, enabling per-sample comparative analysis.
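
The grouping described above can be sketched in a few lines of Python. This is only an illustration of how rows sharing a sample_id end up in the same comparison; it is not the pipeline's own samplesheet handling, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

SAMPLESHEET = """sample_id,label,classifier,taxpasta_file,taxonomy_db
sample1,sample1_kraken2,kraken2,results/taxprofiler/sample1_kraken2.tsv,NCBI
sample1,sample1_metaphlan,metaphlan,results/taxprofiler/sample1_metaphlan.tsv,NCBI
sample2,sample2_kraken2,kraken2,results/taxprofiler/sample2_kraken2.tsv,NCBI
sample2,sample2_metaphlan,metaphlan,results/taxprofiler/sample2_metaphlan.tsv,NCBI
"""


def group_by_sample(handle):
    """Map each sample_id to the labels of its profiles, rejecting duplicate labels."""
    groups = defaultdict(list)
    seen_labels = set()
    for row in csv.DictReader(handle):
        label = row["label"]
        if label in seen_labels:
            raise ValueError(f"duplicate label: {label}")
        seen_labels.add(label)
        groups[row["sample_id"]].append(label)
    return dict(groups)


groups = group_by_sample(io.StringIO(SAMPLESHEET))
for sample_id, labels in groups.items():
    print(sample_id, "->", ", ".join(labels))
```

Each key of the resulting dictionary corresponds to one OPAL run, and each label in its list corresponds to one classifier profile compared in that run.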

Note

Taxpasta files are generated by nf-core/taxprofiler. Run taxprofiler first to generate taxonomic profiles in standardized format.

Tip

Always validate your input files before running the pipeline:

# Validate taxpasta files
python3 bin/validate_taxpasta.py sample1_kraken2.tsv

# Validate gold standard (CRITICAL - catches column mismatches and unsupported ranks)
python3 bin/validate_bioboxes.py gold_standard.bioboxes

# If validation fails, automatically fix common issues:
python3 bin/fix_gold_standard.py \
  -i gold_standard.bioboxes \
  -o gold_standard_fixed.bioboxes \
  -s sample_id

See Gold Standard Troubleshooting for detailed validation and fixing guide.
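
As a rough illustration of what "column mismatches and unsupported ranks" means, the sketch below checks the header of a CAMI Bioboxes profile. It does not reproduce the logic of bin/validate_bioboxes.py. The expected column names follow the CAMI profiling format, while the set of accepted ranks and the function name are assumptions made for this example; the ranks that OPAL actually supports may differ.

```python
EXPECTED_COLUMNS = ["TAXID", "RANK", "TAXPATH", "TAXPATHSN", "PERCENTAGE"]
# Assumption: rank names commonly seen in CAMI profiles; OPAL's supported set may differ.
ASSUMED_RANKS = {
    "superkingdom", "phylum", "class", "order",
    "family", "genus", "species", "strain",
}


def check_bioboxes_header(text):
    """Return a list of human-readable problems found in the profile header."""
    sample_id, ranks, columns = None, None, None
    for line in text.splitlines():
        if line.startswith("@@"):
            columns = line[2:].split("\t")
            break  # data rows follow the column header
        if line.startswith("@SampleID:"):
            sample_id = line.split(":", 1)[1].strip()
        elif line.startswith("@Ranks:"):
            ranks = line.split(":", 1)[1].strip().split("|")

    problems = []
    if not sample_id:
        problems.append("missing or empty @SampleID")
    if ranks is None:
        problems.append("missing @Ranks line")
    else:
        unknown = [rank for rank in ranks if rank not in ASSUMED_RANKS]
        if unknown:
            problems.append("unexpected ranks: " + ", ".join(unknown))
    if columns != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {columns}")
    return problems


good = "\n".join([
    "@SampleID:sample1",
    "@Version:0.9.1",
    "@Ranks:superkingdom|phylum|class|order|family|genus|species|strain",
    "@@" + "\t".join(EXPECTED_COLUMNS),
])
bad = "\n".join([
    "@SampleID:sample1",
    "@Ranks:kingdom|phylum",
    "@@TAXID\tRANK\tTAXPATH\tPERCENTAGE",
])
print(check_bioboxes_header(good))
print(check_bioboxes_header(bad))
```

The second header is flagged twice: once for the rank name outside the assumed set and once for the missing TAXPATHSN column.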

Now, you can run the pipeline using:

nextflow run FOI-Bioinformatics/taxbencher \
   -profile <docker/singularity/conda/.../institute> \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir <OUTDIR>

Recommended Profiles

| Platform | Recommended Profile | Notes |
| --- | --- | --- |
| Linux x86_64 | `docker,wave` | Best performance, full functionality, 100% module coverage |
| Linux ARM64 | `conda` | Docker/Singularity containers are AMD64 only |
| macOS (Intel) | `docker,wave` | Full functionality with Wave containers |
| macOS (Apple Silicon) | `conda` | ⚠️ Docker has limitations (see below) |
| HPC/Cluster | `singularity,wave` or `conda` | Depends on cluster configuration |

Warning

Apple Silicon (M1/M2/M3) Limitations: When using -profile docker,wave on Apple Silicon Macs, the MultiQC step may fail with "Illegal instruction" errors due to AMD64/ARM64 architecture incompatibility. The core benchmarking processes (TAXPASTA_STANDARDISE, TAXPASTA_TO_BIOBOXES, OPAL) work correctly and produce all evaluation metrics. Recommended solution: Use -profile conda on Apple Silicon for full compatibility.
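
The platform recommendations above can be expressed as a small lookup. The helper below is a hypothetical convenience, not part of the pipeline; it only encodes the operating system and CPU architecture rows of the table and cannot tell whether it is running on an HPC cluster, where singularity,wave or conda may be the better choice.

```python
import platform


def recommend_profile(system, machine):
    """Suggest a -profile value from OS name and CPU architecture, or None if unknown."""
    system = system.lower()
    is_arm = machine.lower() in {"arm64", "aarch64"}
    if system in {"linux", "darwin"}:
        # Containers are AMD64 only, so ARM hosts fall back to conda.
        return "conda" if is_arm else "docker,wave"
    return None  # platform not covered by the recommendations


print(recommend_profile(platform.system(), platform.machine()))
```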

Profile Details

Conda (Recommended for macOS):

nextflow run FOI-Bioinformatics/taxbencher \
   -profile conda \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir results

Docker with Wave (Recommended for Linux/Intel Mac):

nextflow run FOI-Bioinformatics/taxbencher \
   -profile docker,wave \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir results

Wave automatically builds containers for the modules that require scientific Python packages (TAXPASTA_TO_BIOBOXES, COMPARATIVE_ANALYSIS).

Singularity with Wave (HPC/Linux):

nextflow run FOI-Bioinformatics/taxbencher \
   -profile singularity,wave \
   --input samplesheet.csv \
   --gold_standard gold_standard.bioboxes \
   --outdir results

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
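
If you prefer a parameters file over CLI flags, one way to produce it is sketched below. The parameter names are the ones used in the commands above; the helper name and the params.json file name are assumptions for illustration.

```python
import json

# Parameter names match the --input, --gold_standard and --outdir flags shown above.
params = {
    "input": "samplesheet.csv",
    "gold_standard": "gold_standard.bioboxes",
    "outdir": "results",
}


def render_params(values):
    """Serialize pipeline parameters as JSON suitable for Nextflow's -params-file."""
    return json.dumps(values, indent=2) + "\n"


text = render_params(params)
print(text)
```

Saving the printed text as params.json (a hypothetical file name) would let you run nextflow run FOI-Bioinformatics/taxbencher -profile conda -params-file params.json instead of passing each parameter on the command line.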

Documentation

For detailed information, see:

User Documentation

Usage Guide - Complete usage instructions, parameters, and configuration
Output Files - Description of output files and OPAL metrics interpretation
Gold Standard Troubleshooting - Validate and fix gold standard files
Validation Tools - Pre-flight validation scripts and local testing

Developer Documentation

Developer Hub - Complete developer documentation
Contributing Guide - Quick start for developers and architecture overview
Code Quality Report - Latest quality assessment (Grade: A-)
Test Coverage Report - Detailed test analysis
Validation Report - Technical validation infrastructure details

Credits

FOI-Bioinformatics/taxbencher was originally written by Andreas Sjödin at the Swedish Defence Research Agency (FOI).

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Key tools used in this pipeline:

OPAL (CAMI taxonomic profiling evaluation)

Meyer F, Bremges A, Belmann P, Janssen S, McHardy AC, Koslicki D. Assessing taxonomic metagenome profilers with OPAL. Genome Biology. 2019;20:51. doi: 10.1186/s13059-019-1646-y

taxpasta (Taxonomic profile standardization)

Beber ME, Borry M, Stamouli S, Fellows Yates JA. taxpasta: TAXonomic Profile Aggregation and STAndardisation. Journal of Open Source Software. 2023;8(87):5627. doi: 10.21105/joss.05627

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020;38:276-278. doi: 10.1038/s41587-020-0439-x
