[Paper] [Demo] [🤗 HuggingFace Model]
VIBE is an open-source framework for text-guided image editing. It pairs the efficient Sana1.5-1.6B diffusion model with Qwen3-VL-2B-Instruct to provide fast, high-quality, instruction-based image manipulation.
- Text-Guided Editing: Edit images using natural language instructions (e.g., "Add a cat on the sofa").
- Compact Size: a 1.6B-parameter diffusion model and a 2B-parameter condition encoder.
- High-Speed Inference: Powered by Sana1.5's efficient linear attention mechanism, enabling rapid image editing.
- Multimodal Understanding: Uses Qwen3VL for strong visual understanding and Sana1.5 for high-fidelity image generation.
- Flexible Pipeline: Built on top of `diffusers` and `transformers`, making it easy to extend and customize.
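For a quick look at programmatic use, the sketch below drives the high-level `ImageEditor` class from `vibe/editor.py`. The constructor and method names are assumptions for illustration only (a checkpoint path, an instruction, and the same guidance parameters the CLI exposes); check `vibe/editor.py` for the actual interface.

```python
# Illustrative sketch only: the ImageEditor constructor and edit() signature
# below are assumed, not taken from the repository; see vibe/editor.py.
from PIL import Image
from vibe.editor import ImageEditor  # high-level editing class

editor = ImageEditor(
    checkpoint_path="/path/to/pipeline/checkpoint",  # local pipeline weights
    device="cuda:0",
)

image = Image.open("examples/dog.jpg").convert("RGB")
results = editor.edit(
    image=image,
    instruction="Make the dog look like a painting",
    num_inference_steps=20,
    guidance_scale=4.5,
    image_guidance_scale=1.2,
)
results[0].save("outputs/dog_painting.png")  # assuming a list of PIL images is returned
```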
Requirements:

- Linux
- Python 3.11
- Conda (recommended)
Installation:

- Create and activate a Conda environment:

  ```bash
  conda create -y -q --prefix ./vibe_env python=3.11
  conda activate ./vibe_env
  ```

- Install CUDA Toolkit (for NVIDIA GPUs):

  ```bash
  conda install -y -c nvidia/label/cuda-12.3.0 --override-channels cuda-compiler
  conda install -y -c nvidia/label/cuda-12.3.0 --override-channels cuda-toolkit
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements/requirements.txt
  ```

- Install Package:

  ```bash
  pip install .
  ```
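After installation, you can run a quick sanity check. The snippet below is a hypothetical check, not a script from the repository; it assumes `torch`, `diffusers`, and `transformers` are pulled in by `requirements/requirements.txt`.

```python
# Quick environment check (hypothetical; not part of the repository).
# Verifies that the core dependencies import and that a CUDA device is visible.
import torch
import diffusers
import transformers

print(f"torch {torch.__version__} | CUDA available: {torch.cuda.is_available()}")
print(f"diffusers {diffusers.__version__} | transformers {transformers.__version__}")
```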
Alternatively, you can use Docker to run the project without installing dependencies locally.
- Prerequisites:

  - Docker with the Docker Compose plugin installed.

- Build and Run:

  ```bash
  # Build and start the container
  docker compose up -d --build

  # Enter the container
  docker compose exec vibe bash
  ```
Inside the container, you can run the inference scripts as usual. The current directory is mounted to `/app`, so changes are reflected immediately.
Inference:

You can use the provided shell script to edit single images directly from the terminal.
```bash
chmod +x scripts/inference_template.sh
./scripts/inference_template.sh
```

Alternatively, you can call the Python script directly:
```bash
PYTHONPATH="${PYTHONPATH:-}:$(pwd)" \
python scripts/inference.py edit-single-image \
--image-path "examples/dog.jpg" \
--instruction "Make the dog look like a painting" \
--checkpoint-path "/path/to/pipeline/checkpoint" \
--output-path "outputs/" \
--num-images-per-prompt 1 \
--image-guidance-scale 1.2 \
--guidance-scale 4.5 \
--num-inference-steps 20 \
--device "cuda:0"
```

For processing multiple images, you can use the `edit-multiple-images` command. This allows you to provide a list of images and prompts via a JSON file.
- Create a mapping file (e.g., `mapping.json`):

  ```json
  [
    {
      "image_path": "examples/dog.jpg",
      "editing_prompt": "Make the dog look like a painting"
    },
    {
      "image_path": "examples/dog.jpg",
      "editing_prompt": "turn the dog into a cat"
    }
  ]
  ```

  A sketch for generating such a file programmatically from a folder of images appears after the argument list below.

- Run the batch command:
```bash
PYTHONPATH="${PYTHONPATH:-}:$(pwd)" \
python scripts/inference.py edit-multiple-images \
--mapping-path examples/mapping.json \
--checkpoint-path "/path/to/pipeline/checkpoint" \
--output-path "outputs/multi_img_results" \
--num-images-per-prompt 1 \
--image-guidance-scale 1.2 \
--guidance-scale 4.5 \
--num-inference-steps 20 \
--device "cuda:0"
```

Arguments:
- `--image-path`: Path to the input image file (single-image mode).
- `--instruction`: The text instruction or instructions for editing (single-image mode).
- `--mapping-path`: Path to the JSON mapping file (batch mode).
- `--checkpoint-path`: Path to the local pipeline checkpoint (directory containing weights).
- `--output-path`: Directory where the results will be saved.
- `--num-images-per-prompt`: Number of variations to generate (default: 1).
- `--image-guidance-scale`: Controls the influence of the input image (default: 1.2).
- `--guidance-scale`: Controls the strength of the text prompt guidance (default: 4.5).
- `--num-inference-steps`: Number of denoising steps (default: 20).
- `--device`: The device to use (e.g., "cuda:0").
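If you want to apply the same instruction to every image in a directory, a small helper like the one below can generate the mapping file. This is a convenience sketch, not a script shipped with the repository; it only relies on the JSON schema shown in the `mapping.json` example above.

```python
# Convenience sketch (not part of the repository): build a mapping.json for
# edit-multiple-images by pairing every .jpg in a folder with one instruction.
import json
from pathlib import Path

image_dir = Path("examples")                        # folder with input images
instruction = "Make the photo look like a painting"

entries = [
    {"image_path": str(path), "editing_prompt": instruction}
    for path in sorted(image_dir.glob("*.jpg"))
]

with open("examples/mapping.json", "w") as f:
    json.dump(entries, f, indent=2)

print(f"Wrote {len(entries)} entries to examples/mapping.json")
```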
Project structure:

- `vibe/`: Core source code.
  - `editor.py`: Main `ImageEditor` class for high-level interaction.
  - `generative_pipeline/`: Sana1.5-based diffusion pipeline logic.
  - `transformer/`: Custom transformer models and editing head.
- `scripts/`: Utility scripts.
  - `inference.py`: CLI entry point for image editing.
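Because the generative pipeline builds on `diffusers`, the directory passed as `--checkpoint-path` is presumably a standard pipeline folder. The sketch below loads it with the generic `DiffusionPipeline` loader to inspect its components; treat this as an assumption to verify rather than the documented loading path, since the concrete pipeline class in `vibe/generative_pipeline/` may be required instead.

```python
# Sketch under an assumption: the checkpoint directory follows the standard
# diffusers pipeline layout (model_index.json plus component subfolders).
# If the generic loader cannot resolve the custom classes, import and use the
# pipeline class from vibe/generative_pipeline/ instead.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "/path/to/pipeline/checkpoint",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda:0")
print(pipe)  # lists the loaded components (transformer, text encoder, VAE, scheduler)
```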
This project builds upon the work of:

- Sana1.5
- Qwen3-VL
- Hugging Face `diffusers` and `transformers`
To cite VIBE, please use:

```bibtex
@misc{vibe2026,
  author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  title  = {VIBE: Visual Instruction Based Editor},
  year   = {2026},
  eprint = {arXiv:2601.02242},
}
```