An experimental Android application that integrates multiple on-device inference engines, allowing you to run models with different engines within a single app.
- Multi-Engine Support: Load and chat with models supported by different inference engines.
- Adjustable Parameters: Tune sampling parameters such as top-k, top-p, and temperature.
- Performance Metrics: View detailed inference data (time to first token, prefill speed, decode speed).
First, clone the repository and its submodules:
git clone https://github.com/FilipFan/PolyEngineInfer.git
cd PolyEngineInfer
git submodule update --init --recursive
Next, build the project using Gradle:
./gradlew clean build
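If you only need the release APK used in the next step, you can build just that variant. This is a sketch assuming the standard Android Gradle project layout and default output path, which may differ in this repository:
# Build only the release variant (assumes the default Android Gradle output location).
./gradlew assembleRelease
# The APK is typically produced under app/build/outputs/apk/release/.
ls app/build/outputs/apk/release/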
Install the application on a real device or an emulator using ADB. The app currently supports arm64-v8a and x86_64 architectures.
adb install app-release.apk
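If the install fails because of an unsupported architecture, you can check which ABI your device or emulator reports:
# Prints the primary ABI, e.g. arm64-v8a or x86_64.
adb shell getprop ro.product.cpu.abi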
The application loads models from its app-specific directory in external storage. Before selecting a model, you must first push the model files to this directory (typically /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files).
The app automatically selects the appropriate inference engine based on the model file's extension and directory structure. Pre-converted models for popular open-source LLMs are available for each engine. For example, for Llama-3.2-1B you can download any of the following versions (a download example follows the list):
- llama.cpp: hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF
- ONNX: onnx-community/Llama-3.2-1B-Instruct
- ExecuTorch: executorch-community/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8-ET
- LiteRT: litert-community/Llama-3.2-1B-Instruct
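As one way to fetch these, the GGUF file used in the next step can be downloaded with the Hugging Face CLI; this assumes huggingface_hub is installed, and any other download method works equally well:
# Download only the quantized GGUF file into the current directory.
huggingface-cli download hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF \
  llama-3.2-1b-instruct-q8_0.gguf --local-dir .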
Download the model and push it to your device using adb push. For example:
adb push llama-3.2-1b-instruct-q8_0.gguf /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files
Note
For ExecuTorch: You must push both the model (.pte) and tokenizer (.model) files.
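For example, both files can be pushed to the same directory; the file names below are placeholders, so substitute the actual files from the ExecuTorch model release:
# Placeholder names -- replace with the real .pte and tokenizer files you downloaded.
adb push model.pte /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files
adb push tokenizer.model /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files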
Note
For ONNX: You need to push the entire directory containing the model and configuration files. When selecting the model in the app, choose this directory as the model path.
To use the ONNX Runtime generate() API, you may need to further process the downloaded model files to create genai_config.json, tokenizer.json, etc. Refer to the ONNX Runtime GenAI Model Builder for details.
For instance, after downloading onnx-community/Llama-3.2-1B-Instruct, run the following command to generate the necessary files to be pushed to the device:
python3 -m onnxruntime_genai.models.builder \
--input Llama-3.2-1B-Instruct \
-o onnx_test_dir \
-p int4 \
-e cpu \
-c onnx_test_dir/cache \
--extra_options config_only=true
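After this step, make sure the directory also contains the .onnx model and tokenizer files alongside the generated configuration, then push the whole directory to the device. This is a sketch using the onnx_test_dir example directory from above:
# Pushes the directory recursively; select this directory as the model path in the app.
adb push onnx_test_dir /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files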
Once the files are on your device, you can select the model from the app's settings page and start chatting.
This project utilizes the following inference engines and versions:
ExecuTorch provides GPU acceleration through its Vulkan Backend.
As detailed in How ExecuTorch Works, achieving hardware acceleration requires an ahead-of-time (offline) compilation step that targets a specific hardware backend (such as Vulkan) and produces a specialized .pte model file compiled explicitly for that backend.
The ExportRecipe_Llama-3.2-1B_Vulkan_Backend_Instruct.ipynb notebook provides a practical example. It shows the commands needed to convert the Llama-3.2-1B model into a .pte file tailored for the Vulkan backend.
- Stateless Conversations: The context of multi-turn conversations is not preserved; each interaction is a new session.
- Text-Only: The app does not handle multimodal inputs.
This project was developed with reference to the official documentation and open-source examples from the following sources: