Qualcomm AI Engine Direct - Support multimodal(VLM) runner #16536
Conversation
Summary:
- Runtime support for models
  - SmolVLM 500M
  - InternVL3 1B
- Add hybrid-mode runtime requantization in the multimodal runner
- CI
  - Refactor the VLM test script
  - Add VLM accuracy/perf runtime tests
- Refactor (VLM)
  - Rename the embedding forward input for CPU quantization
  - Update the VLM vision encoder architecture to align with upcoming transformers 5.0 changes
- Documentation
  - Add a README for the multimodal VLM
@pytorchbot label "release notes: qualcomm"
Hi @cccclai,

SmolVLM 500M (Hybrid):
Image: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg

I 00:00:09.501139 executorch:multimodal_runner.cpp:646] RSS after finishing text generation: 419.035156 MiB (0 if unsupported)
I 00:00:09.501241 executorch:stats.h:143] Prompt Tokens: 81 Generated Tokens: 454
I 00:00:09.501285 executorch:stats.h:149] Model Load Time: 0.370000 (seconds)
I 00:00:09.501327 executorch:stats.h:159] Total inference time: 8.935000 (seconds) Rate: 50.811416 (tokens/second)
I 00:00:09.501369 executorch:stats.h:167] Prompt evaluation: 0.117000 (seconds) Rate: 692.307692 (tokens/second)
I 00:00:09.501412 executorch:stats.h:178] Generated 454 tokens: 8.818000 (seconds) Rate: 51.485598 (tokens/second)
I 00:00:09.501472 executorch:stats.h:186] Time to first generated token: 0.117000 (seconds)
I 00:00:09.501512 executorch:stats.h:193] Sampling time over 535 tokens: 0.782000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
PyTorchObserver {"prompt_tokens":81,"generated_tokens":454,"model_load_start_ms":1751021368585,"model_load_end_ms":1751021368955,"inference_start_ms":1751021368955,"inference_end_ms":1751021377890,"prompt_eval_end_ms":1751021369072,"first_token_ms":1751021369072,"aggregate_sampling_time_ms":782,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.1 MB/s (2809 bytes in 0.002s)
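As a sanity check on the numbers above, the decode rate can be re-derived from the PyTorchObserver JSON; a minimal Python sketch (the helper and its name are mine, not part of the runner):

import json

def decode_rate(observer_line: str) -> float:
    # Everything after the "PyTorchObserver " prefix is plain JSON.
    stats = json.loads(observer_line.split("PyTorchObserver ", 1)[1])
    unit = stats["SCALING_FACTOR_UNITS_PER_SECOND"]  # 1000 -> timestamps are in ms
    decode_seconds = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / unit
    return stats["generated_tokens"] / decode_seconds

# For the SmolVLM line above: 454 / ((1751021377890 - 1751021369072) / 1000)
# = 454 / 8.818 ≈ 51.49 tokens/second, matching the stats.h output.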
INFO:root:Results[0]:
<|im_start|>User:<fake_token_around_image><global-img><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><fake_token_around_image>Can you describe this image?<end_of_utterance>
Assistant: The image depicts a serene and picturesque scene of a cityscape. The focal point of the image is a prominent, rectangular structure that appears to be a monument or a significant landmark. This structure is situated on a small, elevated platform, which is likely part of a monument or a memorial. The monument is rectangular in shape and has a smooth, reflective surface, suggesting it might be made of stone or a similar material.
The monument is surrounded by a small, circular area, which is likely a small plaza or a small park. This area is enclosed by a low, low-rise building, which is partially visible in the background. The building has a modern design, with a flat roof and a few windows visible.
In the foreground, there is a large, rectangular stone slab that seems to be part of the monument's base. The stone slab is smooth and reflective, indicating it might be made of granite or another similar material.
The sky above the monument is clear and blue, with no visible clouds, suggesting it is a sunny day. The overall scene is calm and peaceful, with no signs of human activity or movement.
The image does not contain any people, vehicles, or other objects, which helps to focus the viewer's attention on the monument and its surroundings. The absence of any urban elements, such as buildings or roads, also helps to keep the focus on the monument itself.
Given the description, a pure text model can answer questions related to the image by providing a detailed and logical analysis of the elements present in the image. For example, if asked about the type of monument, the model can explain that it is a rectangular stone structure with a smooth, reflective surface. Additionally, if asked about the surrounding area, the model can describe the small, circular plaza or park enclosed by the building.
In summary, the image depicts a rectangular stone monument with a smooth, reflective surface, surrounded by a small, circular plaza or park enclosed by a low, low-rise building. The sky is clear and blue, with no visible clouds, and the overall scene is calm and peaceful. The absence of any human activity or other elements helps to keep the focus on the monument itself.<end_of_utterance>

InternVL3 1B (Hybrid):
Image: http://images.cocodataset.org/val2017/000000039769.jpg

I 00:00:03.613379 executorch:multimodal_runner.cpp:627] RSS after finishing text generation: 612.617188 MiB (0 if unsupported)
I 00:00:03.613412 executorch:stats.h:143] Prompt Tokens: 272 Generated Tokens: 118
I 00:00:03.613422 executorch:stats.h:149] Model Load Time: 0.761000 (seconds)
I 00:00:03.613432 executorch:stats.h:159] Total inference time: 1.770000 (seconds) Rate: 66.666667 (tokens/second)
I 00:00:03.613441 executorch:stats.h:167] Prompt evaluation: 0.197000 (seconds) Rate: 1380.710660 (tokens/second)
I 00:00:03.613451 executorch:stats.h:178] Generated 118 tokens: 1.573000 (seconds) Rate: 75.015893 (tokens/second)
I 00:00:03.613462 executorch:stats.h:186] Time to first generated token: 0.197000 (seconds)
I 00:00:03.613476 executorch:stats.h:193] Sampling time over 390 tokens: 0.097000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
PyTorchObserver {"prompt_tokens":272,"generated_tokens":118,"model_load_start_ms":1749836595295,"model_load_end_ms":1749836596056,"inference_start_ms":1749836596056,"inference_end_ms":1749836597826,"prompt_eval_end_ms":1749836596253,"first_token_ms":1749836596253,"aggregate_sampling_time_ms":97,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
/data/local/tmp/yuyazhua/executorch/single_llama/outputs/: 2 files pulled. 1.6 MB/s (3960 bytes in 0.002s)
INFO:root:Results[0]:
<|im_start|>user:
<img><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT><IMG_CONTEXT></img>
Can you describe this image?<|im_end|>assistant
The image shows two cats lying on a pink surface, which appears to be a bed or a couch. The cat on the left is a tabby with dark stripes and is lying on its side with its front paws stretched out. The cat on the right is also tabby but with a mix of brown and black stripes. Both cats are relaxed and appear to be sleeping or lying down comfortably. There are two remote controls placed on the pink surface near the cats. The overall scene suggests a cozy and peaceful setting, ideal for a cat to rest.<|im_end|>

cc: @haowhsu-quic Please have a look!
@DannyYuyang-quic Thank you for this PR! I just tested it on SoC SM8550, and it failed at a very late stage. Here is the output of the 1st command; the full log is also attached. The 2nd command failed with the same error. I noticed that adb seems to disconnect and reconnect, as seen in the Ubuntu taskbar. Here is an error from the 3rd command:
It looks like the issue is related to the device connection failing. Could you share the exact command you used? If your device is connected to a remote host, make sure to include the -H ${host_name} option. Based on your log, the compilation completed successfully, so you can rerun with the pre-generated PTE:

python backends/qualcomm/tests/test_qnn_delegate.py TestExampleMultimodalityScript.test_static_vlm -v -b build-android -H ${host_name} -m SM8550 -s ${SERIAL_NUM} -a . --model_name smolvlm_500m_instruct --executorch_root . --pre_gen_pte .
@DannyYuyang-quic Thank you for your reply! I was using the following command. The Android phone is connected to the Linux PC via Type-C.
Hi @luffy-yu,

git submodule sync
git submodule update --init
./install_executorch.sh
./backends/qualcomm/scripts/build.sh

I noticed you're using a very old PyTorch version: 2.8.0+cu128. After rebuilding, could you also run a simple model test to confirm everything works?

python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_linear -b build-android -m SM8550 -s ${SERIAL_NUM} -a . --executorch_root .
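Since the log suggests the device dropped off adb mid-run, it may also help to confirm the device is still online before rerunning; a minimal sketch (the helper name is mine, relying only on the standard `adb devices` output format):

import subprocess

def device_online(serial: str) -> bool:
    # `adb devices` prints one "SERIAL\tSTATE" line per attached device
    # after a header line; STATE is "device" when the connection is usable.
    out = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == serial:
            return parts[1] == "device"
    return False

# e.g. check device_online(SERIAL_NUM) before launching the runtime test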
Summary:
annotate_prefill_kv_output effectively narrows the output gap between hybrid mode and KV mode. However, applying the same method to multimodal models does not work (bad results). To achieve decent results in hybrid mode, we dequantize the KV cache right after prefilling and requantize it based on the decoder input cache at runtime.
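A minimal NumPy sketch of that dequantize-then-requantize step, assuming per-tensor affine quantization (names and dtype are illustrative; the actual runner performs this in C++ at runtime):

import numpy as np

def requantize_kv(kv_q, prefill_scale, prefill_zero, decode_scale, decode_zero,
                  dtype=np.uint8):
    # Dequantize the KV cache produced by the prefill graph...
    kv_fp = (kv_q.astype(np.float32) - prefill_zero) * prefill_scale
    # ...then requantize it under the decode graph's input-cache parameters.
    q = np.round(kv_fp / decode_scale) + decode_zero
    info = np.iinfo(dtype)
    return np.clip(q, info.min, info.max).astype(dtype)

When the prefill and decode graphs carry different quantization parameters, skipping this step would feed the decoder KV values interpreted under the wrong scale, which matches the "bad results" described above.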
Test plan
- SmolVLM: ~63 TPS on SM8750
- InternVL3: ~17 TPS on SM8750
Script
- SmolVLM
- InternVL3