
Conversation

@DannyYuyang-quic (Contributor) commented Jan 12, 2026

Summary:

  • Runtime support for models
    • SmolVLM 500M
    • InternVL3 1B
    • add hybrid mode runtime requantization in multimodal runner
      • Background: In LLMs, annotate_prefill_kv_output effectively narrows the output gap
        between hybrid mode and KV mode. However, applying the same method to multimodal
        models does not work (it yields bad results). To achieve decent results in hybrid
        mode, we dequantize the KV cache right after prefill and re-quantize it against the
        decoder's input cache quantization parameters at runtime (see the sketch after this
        list).
  • CI
    • refactor VLM test script
    • add VLM acc/perf runtime tests
  • Refactor (VLM)
    • rename embedding forward input for CPU quantization
    • Update VLM vision encoder architecture to align with transformers 5.0 changes
  • Documentation
    • add readme for multimodal VLM
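
A minimal sketch of the requantization step referenced above, assuming per-tensor uint8 affine quantization (all names here are hypothetical; the actual logic lives in the C++ multimodal runner):

import numpy as np

def requantize_kv_cache(prefill_kv: np.ndarray,
                        prefill_scale: float, prefill_zp: int,
                        decode_scale: float, decode_zp: int) -> np.ndarray:
    """Hypothetical sketch: after prefill, move the quantized KV cache from
    the prefill graph's output quantization grid onto the decode graph's
    input quantization grid."""
    # Dequantize with the prefill output quantization parameters.
    kv_fp32 = (prefill_kv.astype(np.float32) - prefill_zp) * prefill_scale
    # Re-quantize with the decoder's input cache quantization parameters.
    requant = np.round(kv_fp32 / decode_scale) + decode_zp
    return np.clip(requant, 0, 255).astype(np.uint8)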

Test plan

SmolVLM

Perf: ~63 TPS on SM8750

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_vlm --model_name smolvlm_500m_instruct -b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

InternVL3

Perf: ~17 TPS on SM8750

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_vlm --model_name internvl3_1b -b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

Script

SmolVLM

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

InternVL3

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model internvl3_1b --model_mode kv --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "http://images.cocodataset.org/val2017/000000039769.jpg"

@pytorch-bot bot commented Jan 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16536

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 5627d44 with merge base 9ba1b5d:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label on Jan 12, 2026
@DannyYuyang-quic (Contributor Author)

@pytorchbot label "release notes: qualcomm"

pytorch-bot bot added the "release notes: qualcomm" label on Jan 12, 2026
@DannyYuyang-quic (Contributor Author)

Hi @cccclai,
This PR adds VLM runtime support as a follow-up to #16292.
Below are some HTP runtime results for SmolVLM-500M (hybrid mode) and InternVL3-1B (hybrid mode):

SmolVLM 500M (Hybrid):

Image: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
Query:"Can you describe this image?"
Answer:

I 00:00:09.501139 executorch:multimodal_runner.cpp:646] RSS after finishing text generation: 419.035156 MiB (0 if unsupported)
I 00:00:09.501241 executorch:stats.h:143]       Prompt Tokens: 81    Generated Tokens: 454
I 00:00:09.501285 executorch:stats.h:149]       Model Load Time:                0.370000 (seconds)
I 00:00:09.501327 executorch:stats.h:159]       Total inference time:           8.935000 (seconds)               Rate:  50.811416 (tokens/second)
I 00:00:09.501369 executorch:stats.h:167]               Prompt evaluation:      0.117000 (seconds)               Rate:  692.307692 (tokens/second)
I 00:00:09.501412 executorch:stats.h:178]               Generated 454 tokens:   8.818000 (seconds)               Rate:  51.485598 (tokens/second)
I 00:00:09.501472 executorch:stats.h:186]       Time to first generated token:  0.117000 (seconds)
I 00:00:09.501512 executorch:stats.h:193]       Sampling time over 535 tokens:  0.782000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

PyTorchObserver {"prompt_tokens":81,"generated_tokens":454,"model_load_start_ms":1751021368585,"model_load_end_ms":1751021368955,"inference_start_ms":1751021368955,"inference_end_ms":1751021377890,"prompt_eval_end_ms":1751021369072,"first_token_ms":1751021369072,"aggregate_sampling_time_ms":782,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.1 MB/s (2809 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>Can you describe this image?<end_of_utterance>
Assistant: The image depicts a serene and picturesque scene of a cityscape. The focal point of the image is a prominent, rectangular structure that appears to be a monument or a significant landmark. This structure is situated on a small, elevated platform, which is likely part of a monument or a memorial. The monument is rectangular in shape and has a smooth, reflective surface, suggesting it might be made of stone or a similar material.

The monument is surrounded by a small, circular area, which is likely a small plaza or a small park. This area is enclosed by a low, low-rise building, which is partially visible in the background. The building has a modern design, with a flat roof and a few windows visible.

In the foreground, there is a large, rectangular stone slab that seems to be part of the monument's base. The stone slab is smooth and reflective, indicating it might be made of granite or another similar material.

The sky above the monument is clear and blue, with no visible clouds, suggesting it is a sunny day. The overall scene is calm and peaceful, with no signs of human activity or movement.

The image does not contain any people, vehicles, or other objects, which helps to focus the viewer's attention on the monument and its surroundings. The absence of any urban elements, such as buildings or roads, also helps to keep the focus on the monument itself.

Given the description, a pure text model can answer questions related to the image by providing a detailed and logical analysis of the elements present in the image. For example, if asked about the type of monument, the model can explain that it is a rectangular stone structure with a smooth, reflective surface. Additionally, if asked about the surrounding area, the model can describe the small, circular plaza or park enclosed by the building.

In summary, the image depicts a rectangular stone monument with a smooth, reflective surface, surrounded by a small, circular plaza or park enclosed by a low, low-rise building. The sky is clear and blue, with no visible clouds, and the overall scene is calm and peaceful. The absence of any human activity or other elements helps to keep the focus on the monument itself.<end_of_utterance>
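
As a side note, the decode rate in the stats above can be recomputed from the PyTorchObserver line; a small sketch, assuming the exact JSON format printed above:

import json

def decode_tps(observer_line: str) -> float:
    """Recompute decode tokens/second from a PyTorchObserver log line."""
    payload = json.loads(observer_line.split("PyTorchObserver ", 1)[1])
    decode_ms = payload["inference_end_ms"] - payload["prompt_eval_end_ms"]
    return (payload["generated_tokens"]
            * payload["SCALING_FACTOR_UNITS_PER_SECOND"] / decode_ms)

# For the SmolVLM run above: 454 tokens over 8.818 s ≈ 51.49 TPS,
# matching the "Generated 454 tokens" rate reported by stats.h.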

InternVL3 1B (Hybrid):

Image: http://images.cocodataset.org/val2017/000000039769.jpg
Query:"Can you describe this image?"
Answer:

I 00:00:03.613379 executorch:multimodal_runner.cpp:627] RSS after finishing text generation: 612.617188 MiB (0 if unsupported)
I 00:00:03.613412 executorch:stats.h:143]       Prompt Tokens: 272    Generated Tokens: 118
I 00:00:03.613422 executorch:stats.h:149]       Model Load Time:                0.761000 (seconds)
I 00:00:03.613432 executorch:stats.h:159]       Total inference time:           1.770000 (seconds)               Rate:  66.666667 (tokens/second)
I 00:00:03.613441 executorch:stats.h:167]               Prompt evaluation:      0.197000 (seconds)               Rate:  1380.710660 (tokens/second)
I 00:00:03.613451 executorch:stats.h:178]               Generated 118 tokens:   1.573000 (seconds)               Rate:  75.015893 (tokens/second)
I 00:00:03.613462 executorch:stats.h:186]       Time to first generated token:  0.197000 (seconds)
I 00:00:03.613476 executorch:stats.h:193]       Sampling time over 390 tokens:  0.097000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

PyTorchObserver {"prompt_tokens":272,"generated_tokens":118,"model_load_start_ms":1749836595295,"model_load_end_ms":1749836596056,"inference_start_ms":1749836596056,"inference_end_ms":1749836597826,"prompt_eval_end_ms":1749836596253,"first_token_ms":1749836596253,"aggregate_sampling_time_ms":97,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.6 MB/s (3960 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>user:
<img><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT></img>
Can you describe this image?<|im_end|>assistant
The image shows two cats lying on a pink surface, which appears to be a bed or a couch. The cat on the left is a tabby with dark stripes and is lying on its side with its front paws stretched out. The cat on the right is also tabby but with a mix of brown and black stripes. Both cats are relaxed and appear to be sleeping or lying down comfortably. There are two remote controls placed on the pink surface near the cats. The overall scene suggests a cozy and peaceful setting, ideal for a cat to rest.<|im_end|>

cc: @haowhsu-quic

Please have a look!
Thanks!

@luffy-yu (Contributor)

@DannyYuyang-quic Thank you for this PR!

I just tested it on an SM8550 SoC. It failed at a very late stage.

Here is the output of the 1st command.

======================================================================
FAIL: test_static_vlm (__main__.TestExampleMultimodalityScript.test_static_vlm)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/n10288/Documents/Code/executorch/backends/qualcomm/tests/test_qnn_delegate.py", line 6561, in test_static_vlm
    self.fail(msg["Error"])
AssertionError: [Errno 2] No such file or directory: './outputs/outputs.txt'

----------------------------------------------------------------------
Ran 1 test in 225.912s

FAILED (failures=1)

Full log is also attached.

smolvlm_500m_instruct.log

The 2nd command also failed due to the same error.

I noticed that adb seems to have disconnected and reconnected, as seen in the Ubuntu taskbar. Here is an error from the 3rd command: adb: error: failed to get feature set: device 'RFCW40R0PHW' not found.

@DannyYuyang-quic (Contributor Author) commented Jan 13, 2026

> @DannyYuyang-quic Thank you for this PR!
>
> I just tested it on an SM8550 SoC. It failed at a very late stage.

It looks like the issue is a failing device connection. Could you share the exact command you used? If your device is connected to a remote host, make sure to include the -H ${host_name} flag in your command.

Based on your log, the compilation completed successfully, so you can use --pre_gen_pte ${artifacts_folder_path} to reuse the generated PTE file and verify the runtime results without recompiling:

python backends/qualcomm/tests/test_qnn_delegate.py TestExampleMultimodalityScript.test_static_vlm -v -b build-android -H ${host_name} -m SM8550 -s ${SERIAL_NUM} -a . --model_name smolvlm_500m_instruct --executorch_root . --pre_gen_pte .

@luffy-yu (Contributor)

@DannyYuyang-quic Thank you for your reply! I was using the following command.

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_vlm --model_name smolvlm_500m_instruct -b build-android --executorch_root . -a . -m SM8550 -s ${SERIAL_NUM}

The Android phone is connected to the Linux PC via Type-C. adb devices outputs the desired device.

meta-codesync bot commented Jan 13, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D90555945.

@DannyYuyang-quic (Contributor Author) commented Jan 13, 2026

> The Android phone is connected to the Linux PC via Type-C. adb devices outputs the desired device.

Hi @luffy-yu,
Did you rebuild the entire environment after checking out this commit?

git submodule sync
git submodule update --init
./install_executorch.sh
./backends/qualcomm/scripts/build.sh

I noticed you’re using a very old PyTorch version: 2.8.0+cu128.

After rebuilding, could you also run a simple model test to confirm everything works?
Sample test here:

python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_linear -b build-android -m SM8550 -s ${SERIAL_NUM} -a . --executorch_root .
