docs/source/execution.rst

Choosing an orchestrator
------------------------

Orchestrators are responsible for preparing the remote and collecting the results.
The complete set of orchestrators, accompanied by descriptions, can be seen by
calling ``reproman run --list=orchestrators``.

.. note::

   ReproMan's ``run`` functionality works best with DataLad. When DataLad
   is not available on a resource, only a limited set of functionality is
   available. If you are new to DataLad, consider reading the `DataLad
   handbook`_.

Choose an orchestrator based on your setup and needs:

**For remote resources with DataLad (recommended):**

- ``datalad-pair`` - Best for persistent remote datasets

  - Creates and maintains DataLad datasets on the remote
  - Commits results directly on the remote with full provenance
  - Retrieves results using `datalad update`_ and `datalad get`_
  - Marks completed jobs with git refs (``refs/reproman/JOBID``)

- ``datalad-pair-run`` - Best for capturing runs in the local dataset

  - Prepares the remote dataset like ``datalad-pair``
  - Packages results in a tarball based on file modification times
  - Creates a `datalad run`_ commit in your *local* repository
  - Marks the local commit with a git ref (``refs/reproman/JOBID``)

**For remote resources without DataLad:**

- ``datalad-local-run`` - Remote execution, local DataLad integration

  - Uses a plain remote directory (no DataLad required on the remote)
  - Captures results as a `datalad run`_ commit locally
  - Good when the remote lacks DataLad but you want local provenance

- ``plain`` - Simple remote execution

  - Basic file transfer using ``session.put()`` and ``session.get()``
  - No DataLad integration or provenance tracking
  - Creates a working directory named with the job ID
  - Sufficient for simple tasks, but the DataLad orchestrators are recommended

**For local execution:**

- ``datalad-no-remote`` - Local dataset execution

  - Executes in the current local dataset directory
  - Behaves like ``datalad-pair`` but stays local
  - Available for local shell resources only
  - Good for testing workflows locally

Revisiting :ref:`our concrete example <rr-refex>` and assuming we have
an SSH resource named "foo" in our inventory, here's how we could run
the command there.
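
A minimal sketch of such an invocation (the orchestrator choice, the
``inputs/`` and ``results/`` paths, and the script name are illustrative
placeholders, not values from the example)::

    reproman run --resource foo \
        --orchestrator datalad-pair \
        --input inputs/ \
        --output results/ \
        ./analysis.sh
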
docs/source/index.rst

.. toctree::
   :maxdepth: 1

   overview
   tutorial-ssh
   acknowledgements

Concepts and technologies

docs/source/tutorial-ssh.rst

.. _tutorial-ssh:

Tutorial: SSH Resource Workflows
*********************************

This tutorial walks you through ReproMan workflows using SSH resources,
from simple command execution to a complete data analysis. We'll start
with a basic hello-world example, then progress to processing
neuroimaging data, showing how ReproMan creates reproducible, traceable
computational workflows across SSH-accessible computing environments.

Overview
========

We'll cover two workflows:

**Part 1: Hello World Example**

1. Create a ReproMan SSH resource
2. Execute a simple command remotely
3. Fetch and examine results

**Part 2: Dataset Analysis Example**

1. Set up a DataLad dataset with input data
2. Execute MRIQC quality control analysis remotely
3. Collect and examine results with full provenance

Prerequisites
=============

For Part 1:

- ReproMan installed on your local machine (``pip install reproman``)
- Access to a remote server via SSH

For Part 2:

- DataLad support (``pip install 'reproman[full]'``)
- DataLad installed on the remote server

Part 1: Hello World Example
============================

Step 1: Create an SSH Resource
-------------------------------

First, let's add an SSH resource to ReproMan's inventory. Replace ``your-server.edu`` with your actual server::

reproman create myserver --resource-type ssh --backend-parameters host=your-server.edu
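
If your connection needs more than a host name, additional ``key=value``
pairs can be given to ``--backend-parameters``. A sketch (the parameter
names ``user`` and ``port`` here are assumptions; verify them against
``reproman create --help``)::

    reproman create myserver --resource-type ssh \
        --backend-parameters host=your-server.edu user=alice port=2222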

Verify the resource was created::

reproman ls --refresh

.. note::

The ``--refresh`` flag is needed to check the current status of resources. Without it, you'll only see cached status information.

You should see output similar to::

RESOURCE NAME TYPE ID STATUS
------------- ---- -- ------
myserver ssh 1a23b456-789c- ONLINE

Step 2: Execute a Simple Command
---------------------------------

Let's start with a simple test to verify our setup works. Create a working directory and run a basic command::

mkdir -p hello-world
cd hello-world

reproman run --resource myserver \
--submitter local \
--orchestrator plain \
--output results \
sh -c 'mkdir -p results && echo "Hello from ReproMan on $(hostname)" > results/hello.txt'


Step 3: Fetch Results
---------------------

The job will execute on the remote. To check status and fetch results::

# Check job status and get job ID
reproman jobs

# Fetch results for completed job (replace JOB_ID with actual ID)
reproman jobs JOB_ID

When you run ``reproman jobs JOB_ID``, ReproMan will automatically:

- Fetch the output files from the remote to your local working directory
- Display job information and logs
- Unregister the completed job

You should now see the results locally::

cat results/hello.txt
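
Given the command we submitted, the file should contain a line like the
following (with your server's actual hostname)::

    Hello from ReproMan on your-server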

.. note::

ReproMan creates a working directory on the remote resource automatically. By default, it uses ``~/.reproman/run-root`` on the remote. You can verify the file exists there with ``reproman login myserver``.

Part 2: Dataset Analysis Example
=================================

Now let's try a more realistic example with DataLad dataset management and neuroimaging analysis.

Step 1: Set Up the Analysis Dataset
------------------------------------

Create a new DataLad dataset for our analysis::

# Create dataset for MRIQC quality control results
datalad create -d demo-mriqc -c text2git
cd demo-mriqc

Install input data (using a demo BIDS dataset)::

# Install demo neuroimaging dataset
datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata/raw

.. note::
This only installs the dataset structure - the actual data files are not
downloaded locally. DataLad will automatically fetch any data specified
by ``--input`` when the analysis runs.
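
If you would like to pre-fetch the input data yourself (optional, since
``--input`` normally takes care of it), you can retrieve it with DataLad
directly::

    datalad get sourcedata/raw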


Set up the processing working directory to be ignored by git::

datalad run -m "Ignore processing workdir" 'echo "workdir/" > .gitignore'

Step 2: Execute Analysis with DataLad Integration
-------------------------------------------------

For full provenance tracking with DataLad::

reproman run --resource myserver \
--submitter local \
--orchestrator datalad-pair-run \
--input sourcedata/raw \
--output . \
bash -c 'podman run --rm -v "$(pwd):/work:rw" nipreps/mriqc:latest /work/sourcedata/raw /work/results participant group --participant-label 02'

.. note::
The ``-v "$(pwd):/work:rw"`` part mounts your current directory into the
container at ``/work``, allowing the containerized software to access the
top-level dataset.
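
This example also assumes ``podman`` is available on the remote server;
that is a prerequisite of the analysis rather than something ReproMan
installs. You can check from a login shell::

    reproman login myserver
    podman --version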

Step 3: Monitor Execution
-------------------------

ReproMan jobs run in detached mode by default. Monitor progress::

# List all jobs
reproman jobs

# Check specific job status (replace JOB_ID with actual ID)
reproman jobs JOB_ID

# Fetch completed job results
reproman jobs JOB_ID --fetch

For attached execution (wait for completion)::

reproman run --resource myserver --follow \
[... rest of command ...]

Step 4: Examine Results and Provenance
--------------------------------------

Once the job completes, examine what was captured::

# View the provenance record
git log --oneline -1

# Look at captured job information
ls .reproman/jobs/myserver/

# View job specification
cat .reproman/jobs/myserver/JOB_ID/spec.yaml

# Check MRIQC outputs
ls -la results/

The DataLad orchestrators create rich provenance records::

# View the detailed run record
git show --stat

# See what files were modified/added
git show --name-status
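
For orientation, the run record that ``datalad run`` embeds in the commit
message looks roughly like this (abbreviated and illustrative; the exact
fields and values come from your job)::

    [DATALAD RUNCMD] bash -c 'podman run --rm ...'

    === Do not change lines below ===
    {
     "cmd": "bash -c 'podman run --rm ...'",
     "exit": 0,
     "inputs": ["sourcedata/raw"],
     "outputs": ["."]
    }
    ^^^ Do not change lines above ^^^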