
Conversation

@DannyYuyang-quic (Contributor) commented Jan 12, 2026

Summary:

  • Runtime support for models
    • SmolVLM 500M
    • InternVL3 1B
    • add hybrid mode runtime requantization in multimodal runner
      • Background: In LLMs, annotate_prefill_kv_output effectively narrows the output gap
        between hybrid mode and KV mode. However, applying the same method to multimodal
        models does not work (it yields bad results). To achieve decent results in hybrid
        mode, we dequantize the KV cache right after prefill and re-quantize it against the
        decoder's input cache quantization parameters at runtime (see the sketch after this
        list).
  • CI
    • refactor VLM test script
    • add VLM acc/perf runtime tests
  • Refactor (VLM)
    • rename embedding forward input for CPU quantization
    • Update VLM vision encoder architecture to align with transformers 5.0 changes
  • Documentation
    • add readme for multimodal VLM
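
A minimal sketch of the requantization step referenced above, assuming per-tensor uint8 affine quantization (all names here are hypothetical; the actual logic lives in the C++ multimodal runner):

import numpy as np

def requantize_kv_cache(prefill_kv: np.ndarray,
                        prefill_scale: float, prefill_zp: int,
                        decode_scale: float, decode_zp: int) -> np.ndarray:
    """Hypothetical sketch: after prefill, move the quantized KV cache from
    the prefill graph's output quantization grid onto the decode graph's
    input quantization grid."""
    # Dequantize with the prefill output quantization parameters.
    kv_fp32 = (prefill_kv.astype(np.float32) - prefill_zp) * prefill_scale
    # Re-quantize with the decoder's input cache quantization parameters.
    requant = np.round(kv_fp32 / decode_scale) + decode_zp
    return np.clip(requant, 0, 255).astype(np.uint8)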

Test plan

SmolVLM

Perf: ~63 TPS on SM8750

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_vlm --model_name smolvlm_500m_instruct -b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

InternVL3

Perf: ~17 TPS on SM8750

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_vlm --model_name internvl3_1b -b build-android --executorch_root . -a . -m SM8750 -s ${SERIAL_NUM}

Script

SmolVLM

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model smolvlm_500m_instruct --model_mode kv --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

InternVL3

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --decoder_model internvl3_1b --model_mode kv --max_seq_len 1024 --prompt "Can you describe this image?" --image_path "http://images.cocodataset.org/val2017/000000039769.jpg"

@pytorch-bot bot commented Jan 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16536

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 1 Unrelated Failure

As of commit 5627d44 with merge base 9ba1b5d:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the "CLA Signed" label on Jan 12, 2026
@DannyYuyang-quic (Contributor Author)

@pytorchbot label "release notes: qualcomm"

pytorch-bot bot added the "release notes: qualcomm" label on Jan 12, 2026
@DannyYuyang-quic (Contributor Author)

Hi @cccclai,
This PR adds VLM runtime support as a follow-up to #16292.
Below are some HTP runtime results for SmolVLM-500M (hybrid mode) and InternVL3-1B (hybrid mode):

SmolVLM 500M (Hybrid):

Image: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
Query:"Can you describe this image?"
Answer:

I 00:00:09.501139 executorch:multimodal_runner.cpp:646] RSS after finishing text generation: 419.035156 MiB (0 if unsupported)
I 00:00:09.501241 executorch:stats.h:143]       Prompt Tokens: 81    Generated Tokens: 454
I 00:00:09.501285 executorch:stats.h:149]       Model Load Time:                0.370000 (seconds)
I 00:00:09.501327 executorch:stats.h:159]       Total inference time:           8.935000 (seconds)               Rate:  50.811416 (tokens/second)
I 00:00:09.501369 executorch:stats.h:167]               Prompt evaluation:      0.117000 (seconds)               Rate:  692.307692 (tokens/second)
I 00:00:09.501412 executorch:stats.h:178]               Generated 454 tokens:   8.818000 (seconds)               Rate:  51.485598 (tokens/second)
I 00:00:09.501472 executorch:stats.h:186]       Time to first generated token:  0.117000 (seconds)
I 00:00:09.501512 executorch:stats.h:193]       Sampling time over 535 tokens:  0.782000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

PyTorchObserver {"prompt_tokens":81,"generated_tokens":454,"model_load_start_ms":1751021368585,"model_load_end_ms":1751021368955,"inference_start_ms":1751021368955,"inference_end_ms":1751021377890,"prompt_eval_end_ms":1751021369072,"first_token_ms":1751021369072,"aggregate_sampling_time_ms":782,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.1 MB/s (2809 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>Can you describe this image?<end_of_utterance>
Assistant: The image depicts a serene and picturesque scene of a cityscape. The focal point of the image is a prominent, rectangular structure that appears to be a monument or a significant landmark. This structure is situated on a small, elevated platform, which is likely part of a monument or a memorial. The monument is rectangular in shape and has a smooth, reflective surface, suggesting it might be made of stone or a similar material.

The monument is surrounded by a small, circular area, which is likely a small plaza or a small park. This area is enclosed by a low, low-rise building, which is partially visible in the background. The building has a modern design, with a flat roof and a few windows visible.

In the foreground, there is a large, rectangular stone slab that seems to be part of the monument's base. The stone slab is smooth and reflective, indicating it might be made of granite or another similar material.

The sky above the monument is clear and blue, with no visible clouds, suggesting it is a sunny day. The overall scene is calm and peaceful, with no signs of human activity or movement.

The image does not contain any people, vehicles, or other objects, which helps to focus the viewer's attention on the monument and its surroundings. The absence of any urban elements, such as buildings or roads, also helps to keep the focus on the monument itself.

Given the description, a pure text model can answer questions related to the image by providing a detailed and logical analysis of the elements present in the image. For example, if asked about the type of monument, the model can explain that it is a rectangular stone structure with a smooth, reflective surface. Additionally, if asked about the surrounding area, the model can describe the small, circular plaza or park enclosed by the building.

In summary, the image depicts a rectangular stone monument with a smooth, reflective surface, surrounded by a small, circular plaza or park enclosed by a low, low-rise building. The sky is clear and blue, with no visible clouds, and the overall scene is calm and peaceful. The absence of any human activity or other elements helps to keep the focus on the monument itself.<end_of_utterance>
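
As a side note, the decode rate in the stats above can be recomputed from the PyTorchObserver line; a small sketch, assuming the exact JSON format printed above:

import json

def decode_tps(observer_line: str) -> float:
    """Recompute decode tokens/second from a PyTorchObserver log line."""
    payload = json.loads(observer_line.split("PyTorchObserver ", 1)[1])
    decode_ms = payload["inference_end_ms"] - payload["prompt_eval_end_ms"]
    return (payload["generated_tokens"]
            * payload["SCALING_FACTOR_UNITS_PER_SECOND"] / decode_ms)

# For the SmolVLM run above: 454 tokens over 8.818 s ≈ 51.49 TPS,
# matching the "Generated 454 tokens" rate reported by stats.h.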

InternVL3 1B (Hybrid):

Image: http://images.cocodataset.org/val2017/000000039769.jpg
Query:"Can you describe this image?"
Answer:

I 00:00:03.613379 executorch:multimodal_runner.cpp:627] RSS after finishing text generation: 612.617188 MiB (0 if unsupported)
I 00:00:03.613412 executorch:stats.h:143]       Prompt Tokens: 272    Generated Tokens: 118
I 00:00:03.613422 executorch:stats.h:149]       Model Load Time:                0.761000 (seconds)
I 00:00:03.613432 executorch:stats.h:159]       Total inference time:           1.770000 (seconds)               Rate:  66.666667 (tokens/second)
I 00:00:03.613441 executorch:stats.h:167]               Prompt evaluation:      0.197000 (seconds)               Rate:  1380.710660 (tokens/second)
I 00:00:03.613451 executorch:stats.h:178]               Generated 118 tokens:   1.573000 (seconds)               Rate:  75.015893 (tokens/second)
I 00:00:03.613462 executorch:stats.h:186]       Time to first generated token:  0.197000 (seconds)
I 00:00:03.613476 executorch:stats.h:193]       Sampling time over 390 tokens:  0.097000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

PyTorchObserver {"prompt_tokens":272,"generated_tokens":118,"model_load_start_ms":1749836595295,"model_load_end_ms":1749836596056,"inference_start_ms":1749836596056,"inference_end_ms":1749836597826,"prompt_eval_end_ms":1749836596253,"first_token_ms":1749836596253,"aggregate_sampling_time_ms":97,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.6 MB/s (3960 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>user:
<img><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT></img>
Can you describe this image?<|im_end|>assistant
The image shows two cats lying on a pink surface, which appears to be a bed or a couch. The cat on the left is a tabby with dark stripes and is lying on its side with its front paws stretched out. The cat on the right is also tabby but with a mix of brown and black stripes. Both cats are relaxed and appear to be sleeping or lying down comfortably. There are two remote controls placed on the pink surface near the cats. The overall scene suggests a cozy and peaceful setting, ideal for a cat to rest.<|im_end|>

cc: @haowhsu-quic

Please have a look!
Thanks!

@luffy-yu (Contributor)

@DannyYuyang-quic Thank you for this PR!

I just tested it on an SM8550 SoC. It failed at a very late stage.

Here is the output of the 1st command.

======================================================================
FAIL: test_static_vlm (__main__.TestExampleMultimodalityScript.test_static_vlm)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/n10288/Documents/Code/executorch/backends/qualcomm/tests/test_qnn_delegate.py", line 6561, in test_static_vlm
    self.fail(msg["Error"])
AssertionError: [Errno 2] No such file or directory: './outputs/outputs.txt'

----------------------------------------------------------------------
Ran 1 test in 225.912s

FAILED (failures=1)

Full log is also attached.

smolvlm_500m_instruct.log

The 2nd command also failed due to the same error.

I noticed that adb seems to have disconnected and reconnected, as seen in the Ubuntu taskbar. Here is an error from the 3rd command: adb: error: failed to get feature set: device 'RFCW40R0PHW' not found.

@DannyYuyang-quic (Contributor Author) commented Jan 13, 2026

> @DannyYuyang-quic Thank you for this PR!
>
> I just tested it on an SM8550 SoC. It failed at a very late stage.

It looks like the issue is a failing device connection. Could you share the exact command you used? If your device is connected to a remote host, make sure to include the -H ${host_name} flag in your command.

Based on your log, the compilation completed successfully, so you can use --pre_gen_pte ${artifacts_folder_path} to reuse the generated PTE file and verify the runtime results without recompiling:

python backends/qualcomm/tests/test_qnn_delegate.py TestExampleMultimodalityScript.test_static_vlm -v -b build-android -H ${host_name} -m SM8550 -s ${SERIAL_NUM} -a . --model_name smolvlm_500m_instruct --executorch_root . --pre_gen_pte .

@luffy-yu (Contributor)

@DannyYuyang-quic Thank you for your reply! I was using the following command.

python -m backends.qualcomm.tests.test_qnn_delegate TestExampleMultimodalityScript.test_static_vlm --model_name smolvlm_500m_instruct -b build-android --executorch_root . -a . -m SM8550 -s ${SERIAL_NUM}

The Android phone is connected to the Linux PC via Type-C. adb devices outputs the desired device.

meta-codesync bot commented Jan 13, 2026

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D90555945.

@DannyYuyang-quic (Contributor Author) commented Jan 13, 2026

> The Android phone is connected to the Linux PC via Type-C. adb devices outputs the desired device.

Hi @luffy-yu,
Did you rebuild the entire environment after checking out this commit?

git submodule sync
git submodule update --init
./install_executorch.sh
./backends/qualcomm/scripts/build.sh

I noticed you’re using a very old PyTorch version: 2.8.0+cu128.

After rebuilding, could you also run a simple model test to confirm everything works?
Sample test here:

python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_linear -b build-android -m SM8550 -s ${SERIAL_NUM} -a . --executorch_root .
