
VIBE - Visual Instruction Based Editor

[Paper] [Demo] [🤗 HuggingFace Model]

VIBE is an open-source framework for text-guided image editing. It pairs the efficient Sana1.5-1.6B diffusion model with the Qwen3-VL-2B-Instruct vision-language model to deliver fast, high-quality, instruction-based image manipulation.

Features

  • Text-Guided Editing: Edit images using natural language instructions (e.g., "Add a cat on the sofa").
  • Compact Size: a 1.6B-parameter diffusion model and a 2B-parameter condition encoder.
  • High-Speed Inference: Powered by Sana1.5's efficient linear attention mechanism, enabling rapid image editing.
  • Multimodal Understanding: Uses Qwen3-VL for strong visual understanding and Sana1.5 for high-fidelity image generation.
  • Flexible Pipeline: Built on top of diffusers and transformers, making it easy to extend and customize (see the programmatic sketch below).
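
For programmatic use rather than the CLI, a minimal sketch follows. The import path matches the project structure (vibe/editor.py exposes the ImageEditor class), but the constructor and edit() signatures shown here are illustrative assumptions; check the source for the actual API. The parameter values mirror the CLI flags documented under Usage.

    # Minimal sketch of programmatic editing. The import path follows the
    # project structure, but the constructor/method signatures are assumptions.
    from PIL import Image

    from vibe.editor import ImageEditor

    editor = ImageEditor(
        checkpoint_path="/path/to/pipeline/checkpoint",  # same as --checkpoint-path
        device="cuda:0",
    )

    image = Image.open("examples/dog.jpg").convert("RGB")
    edited = editor.edit(
        image=image,
        instruction="Make the dog look like a painting",
        image_guidance_scale=1.2,  # influence of the input image
        guidance_scale=4.5,        # strength of the text instruction
        num_inference_steps=20,
    )
    edited.save("outputs/dog_painting.jpg")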

Installation

Prerequisites

  • Linux
  • Python 3.11
  • Conda (recommended)

Setup Environment

  1. Create and activate a Conda environment:

    conda create -y -q --prefix ./vibe_env python=3.11
    conda activate ./vibe_env
  2. Install CUDA Toolkit (for NVIDIA GPUs):

    conda install -y -c nvidia/label/cuda-12.3.0 --override-channels cuda-compiler
    conda install -y -c nvidia/label/cuda-12.3.0 --override-channels cuda-toolkit
  3. Install Dependencies:

    pip install -r requirements/requirements.txt
  4. Install Package:

    pip install .
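
Once the steps above complete, a quick sanity check confirms the environment; this assumes the package installs under the vibe import name, matching the vibe/ source directory:

    # Post-install sanity check: verify torch sees the GPU and the package imports.
    import torch
    import vibe  # import name assumed to match the vibe/ source directory

    print("torch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())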

Docker Setup

Alternatively, you can use Docker to run the project without installing dependencies locally.

  1. Prerequisites: Docker and the Docker Compose plugin installed on the host.

  2. Build and Run:

    # Build and start the container
    docker compose up -d --build
    
    # Enter the container
    docker compose exec vibe bash

    Inside the container, you can run the inference scripts as usual. The current directory is mounted to /app, so changes are reflected immediately.

Usage

Single Image Processing

You can use the provided shell script to edit single images directly from the terminal.

chmod +x scripts/inference_template.sh
./scripts/inference_template.sh

Alternatively, you can call the Python script directly:

PYTHONPATH="${PYTHONPATH:-}:$(pwd)" \
python scripts/inference.py edit-single-image \
    --image-path "examples/dog.jpg" \
    --instruction "Make the dog look like a painting" \
    --checkpoint-path "/path/to/pipeline/checkpoint" \
    --output-path "outputs/" \
    --num-images-per-prompt 1 \
    --image-guidance-scale 1.2 \
    --guidance-scale 4.5 \
    --num-inference-steps 20 \
    --device "cuda:0"

Multiple Images Processing

For processing multiple images, you can use the edit-multiple-images command. This allows you to provide a list of images and prompts via a JSON file.

  1. Create a mapping file (e.g., mapping.json); a helper for generating one programmatically is sketched after these steps:

    [
        {
            "image_path": "examples/dog.jpg",
            "editing_prompt": "Make the dog look like a painting"
        },
        {
            "image_path": "examples/dog.jpg",
            "editing_prompt": "turn the dog into a cat"
        }
    ]
  2. Run the batch command:

PYTHONPATH="${PYTHONPATH:-}:$(pwd)" \
python scripts/inference.py edit-multiple-images \
    --mapping-path examples/mapping.json \
    --checkpoint-path "/path/to/pipeline/checkpoint" \
    --output-path "outputs/multi_img_results" \
    --num-images-per-prompt 1 \
    --image-guidance-scale 1.2 \
    --guidance-scale 4.5 \
    --num-inference-steps 20 \
    --device "cuda:0"

Arguments:

  • --image-path: Path to the input image file (single image mode).
  • --instruction: One or more text instructions for editing (single image mode).
  • --mapping-path: Path to the JSON mapping file (batch mode).
  • --checkpoint-path: Path to the local pipeline checkpoint (directory containing weights).
  • --output-path: Directory where the result will be saved.
  • --num-images-per-prompt: Number of variations to generate (default: 1).
  • --image-guidance-scale: Controls the influence of the input image (default: 1.2).
  • --guidance-scale: Controls the strength of the text prompt guidance (default: 4.5); see the note after this list for how the two scales typically interact.
  • --num-inference-steps: Number of denoising steps (default: 20).
  • --device: The device to run inference on (e.g., "cuda:0" or "cpu").
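
Note on the two guidance scales: instruction-editing pipelines of this kind commonly use InstructPix2Pix-style dual classifier-free guidance, where the image scale s_I (--image-guidance-scale) and the text scale s_T (--guidance-scale) weight two separate correction terms. Whether VIBE uses exactly this formulation is an assumption; the sketch below shows the common pattern.

    % Dual classifier-free guidance (InstructPix2Pix-style); it is an assumption
    % that VIBE follows this exact formulation.
    % z: current latent, c_I: image condition, c_T: text instruction condition.
    \tilde{\epsilon}(z, c_I, c_T)
      = \epsilon(z, \varnothing, \varnothing)
      + s_I \bigl( \epsilon(z, c_I, \varnothing) - \epsilon(z, \varnothing, \varnothing) \bigr)
      + s_T \bigl( \epsilon(z, c_I, c_T) - \epsilon(z, c_I, \varnothing) \bigr)

Raising either scale above 1 pushes the sample further toward the corresponding condition.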

Project Structure

  • vibe/: Core source code.
    • editor.py: Main ImageEditor class for high-level interaction.
    • generative_pipeline/: Sana1.5-based diffusion pipeline logic.
    • transformer/: Custom transformer models and editing head.
  • scripts/: Utility scripts.
    • inference.py: CLI entry point for image editing.

Acknowledgements

This project builds upon Sana1.5, Qwen3-VL, diffusers, and transformers.

Citation

@misc{vibe2026,
  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  Title = {VIBE: Visual Instruction Based Editor},
  Year = {2026},
  Eprint = {arXiv:2601.02242},
}
