Poly Engine Inference 🪄

An experimental Android application that integrates multiple on-device inference engines, allowing you to run inference with different engines within a single app.

Features

  • Multi-Engine Support: Load and chat with models supported by different inference engines.
  • Adjustable Parameters: Adjust inference parameters such as top-k, top-p and temperature.
  • Performance Metrics: View detailed inference data (time to first token, prefill speed, decode speed).

Build Instructions

First, clone the repository and its submodules:

git clone https://github.com/FilipFan/PolyEngineInfer.git
cd PolyEngineInfer
git submodule update --init --recursive

Next, build the project using Gradle:

./gradlew clean build
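
If you only need the release APK used in the Installation step, the assembleRelease task typically writes it under app/build/outputs/apk/release/. This assumes the standard Android Gradle plugin output layout; a custom build or signing configuration can change the exact path and file name:

./gradlew assembleRelease
ls app/build/outputs/apk/release/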

Installation

Install the application on a real device or an emulator using ADB. The app currently supports arm64-v8a and x86_64 architectures.

adb install app-release.apk

How to Use the App

The application loads models from the app-specific directory in external storage. Before selecting a model, you need to first push the model files to this directory (typically /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files).
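
As a quick sanity check, you can list this directory over ADB before and after pushing files. This assumes the directory already exists (it is usually created once the app has been launched at least once) and that shell access to app-specific external storage is not restricted on your Android version:

adb shell ls /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files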

The app automatically selects the appropriate inference engine based on the model file's extension and directory structure. Pre-converted models for the supported engines are available for many popular open-source LLMs, including Llama-3.2-1B.

Download the model and push it to your device using adb push. For example:

adb push llama-3.2-1b-instruct-q8_0.gguf /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files

Note

For ExecuTorch: You must push both the model (.pte) and tokenizer (.model) files.
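
For example (the file names below are illustrative placeholders; use the actual names of your exported model and tokenizer):

adb push llama3_2.pte /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files
adb push tokenizer.model /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files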

Note

For ONNX: You need to push the entire directory containing the model and configuration files. When selecting the model in the app, choose this directory as the model path.

To use the ONNX Runtime generate() API, you may need to further process the downloaded model files to create genai_config.json, tokenizer.json, etc. Refer to the ONNX Runtime GenAI Model Builder for details.

For instance, after downloading onnx-community/Llama-3.2-1B-Instruct, run the following command to generate the necessary files to be pushed to the device:

python3 -m onnxruntime_genai.models.builder \
  --input Llama-3.2-1B-Instruct \
  -o onnx_test_dir \
  -p int4 \
  -e cpu \
  -c onnx_test_dir/cache \
  --extra_options config_only=true
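
You can then push the generated directory in one step, since adb push copies directories recursively (the directory name matches the -o argument above). In the app, select this directory as the model path:

adb push onnx_test_dir /storage/emulated/0/Android/data/dev.filipfan.polyengineinfer/files/onnx_test_dir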

Once the files are on your device, you can select the model from the app's settings page and start chatting.

Dependencies

This project utilizes multiple inference engines, including llama.cpp, ExecuTorch, LiteRT, and ONNX Runtime.

Hardware Acceleration

GPU

ExecuTorch Vulkan Backend

ExecuTorch provides GPU acceleration through its Vulkan Backend.

As detailed in How ExecuTorch Works, achieving hardware acceleration requires an offline compilation step. This process targets a specific hardware backend (like Vulkan). The output is a specialized .pte model file compiled explicitly for that backend.

The ExportRecipe_Llama-3.2-1B_Vulkan_Backend_Instruct.ipynb notebook provides a practical example. It shows the commands needed to convert the Llama-3.2-1B model into a .pte file tailored for the Vulkan backend.
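
Before running a Vulkan-compiled .pte file, it can be worth confirming that the target device actually advertises Vulkan support. This is a general Android check rather than anything specific to this project:

adb shell pm list features | grep vulkan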

Current Limitations

  • Stateless Conversations: The context of multi-turn conversations is not preserved; each interaction is a new session.
  • Text-Only: The app does not handle multimodal inputs.

References

This project was developed with reference to official documentation and open-source examples.
