An experimental Android application that integrates multiple on-device inference engines, allowing you to run models with different engines within a single app.
- Multi-Engine Support: Load and chat with models supported by different inference engines.
- Adjustable Parameters: Tune sampling parameters such as top-k, top-p, and temperature.
- Performance Metrics: View detailed inference data (time to first token, prefill speed, decode speed).
First, clone the repository and its submodules:
git clone https://github.com/FilipFan/PolyEngineInfer.git
cd PolyEngineInfer
git submodule update --init --recursive
Next, build the project using Gradle:
./gradlew clean build
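If you only need the release APK used in the next step, you can build just that variant. This is a sketch assuming the standard Android Gradle project layout and default output path, which may differ in this repository:
# Build only the release variant (assumes the default Android Gradle output location).
./gradlew assembleRelease
# The APK is typically produced under app/build/outputs/apk/release/.
ls app/build/outputs/apk/release/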
Install the application on a real device or an emulator using ADB. The app currently supports arm64-v8a and x86_64 architectures.
adb install app-release.apk
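If the install fails because of an unsupported architecture, you can check which ABI your device or emulator reports:
# Prints the primary ABI, e.g. arm64-v8a or x86_64.
adb shell getprop ro.product.cpu.abi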
The application loads models from its app-specific directory in external storage. Before selecting a model, you must first push the model files to this directory (typically /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files).
The app automatically selects the appropriate inference engine based on the model file's extension and directory structure. Pre-converted models for popular open-source LLMs are available for each engine. For example, for Llama-3.2-1B you can download any of the following versions (a download example follows the list):
- llama.cpp: hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF
- ONNX: onnx-community/Llama-3.2-1B-Instruct
- ExecuTorch: executorch-community/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8-ET
- LiteRT: litert-community/Llama-3.2-1B-Instruct
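As one way to fetch these, the GGUF file used in the next step can be downloaded with the Hugging Face CLI; this assumes huggingface_hub is installed, and any other download method works equally well:
# Download only the quantized GGUF file into the current directory.
huggingface-cli download hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF \
  llama-3.2-1b-instruct-q8_0.gguf --local-dir .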
Download the model and push it to your device using adb push. For example:
adb push llama-3.2-1b-instruct-q8_0.gguf /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files
Note
For ExecuTorch: You must push both the model (.pte) and tokenizer (.model) files.
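For example, both files can be pushed to the same directory; the file names below are placeholders, so substitute the actual files from the ExecuTorch model release:
# Placeholder names -- replace with the real .pte and tokenizer files you downloaded.
adb push model.pte /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files
adb push tokenizer.model /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files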
Note
For ONNX: You need to push the entire directory containing the model and configuration files. When selecting the model in the app, choose this directory as the model path.
To use the ONNX Runtime generate() API, you may need to further process the downloaded model files to create genai_config.json, tokenizer.json, etc. Refer to the ONNX Runtime GenAI Model Builder for details.
For instance, after downloading onnx-community/Llama-3.2-1B-Instruct, run the following command to generate the necessary files to be pushed to the device:
python3 -m onnxruntime_genai.models.builder \
--input Llama-3.2-1B-Instruct \
-o onnx_test_dir \
-p int4 \
-e cpu \
-c onnx_test_dir/cache \
--extra_options config_only=true
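After this step, make sure the directory also contains the .onnx model and tokenizer files alongside the generated configuration, then push the whole directory to the device. This is a sketch using the onnx_test_dir example directory from above:
# Pushes the directory recursively; select this directory as the model path in the app.
adb push onnx_test_dir /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files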
Once the files are on your device, you can select the model from the app's settings page and start chatting.
This project utilizes the following inference engines and versions:
ExecuTorch provides GPU acceleration through its Vulkan Backend.
As detailed in How ExecuTorch Works, achieving hardware acceleration requires an ahead-of-time (offline) compilation step that targets a specific hardware backend (such as Vulkan) and produces a specialized .pte model file compiled explicitly for that backend.
The ExportRecipe_Llama-3.2-1B_Vulkan_Backend_Instruct.ipynb notebook provides a practical example. It shows the commands needed to convert the Llama-3.2-1B model into a .pte file tailored for the Vulkan backend.
- Stateless Conversations: The context of multi-turn conversations is not preserved; each interaction is a new session.
- Text-Only: The app does not handle multimodal inputs.
This project was developed with reference to the official documentation and open-source examples from the following sources: