TinyLlama CoreML iOS 18 Quantization 🦙📱

Welcome to the TinyLlama CoreML iOS 18 Quantization repository! This project converts the TinyLlama-1.1B-Chat model from PyTorch to CoreML, producing float16, int8, and int4 variants for efficient on-device inference on iOS 18 and later.

You can find the latest model files on the project's Releases page. Download the files you need to get started.

Table of Contents

  • Overview
  • Features
  • Installation
  • Usage
  • Model Details
  • Quantization Techniques
  • Contributing
  • License
  • Contact

Overview

TinyLlama is a compact 1.1B-parameter language model whose small footprint makes it well suited to mobile deployment. By quantizing this model, we make it lightweight and efficient enough for iOS devices. This repository provides the tools necessary to convert and optimize the TinyLlama model, ensuring it runs smoothly on Apple Silicon.

Features

  • Efficient Quantization: Convert models to float16, int8, and int4 formats.
  • On-Device Inference: Optimized for iOS 18 and later.
  • Easy Integration: Simple setup for developers.
  • Hugging Face Compatibility: Leverage the power of Hugging Face Transformers.

Installation

To install the necessary tools and libraries, follow these steps:

  1. Clone the repository:

    git clone https://github.com/ambv231/tinyllama-coreml-ios18-quantization.git
    cd tinyllama-coreml-ios18-quantization
  2. Install dependencies using pip:

    pip install -r requirements.txt
  3. Ensure you have the latest version of Xcode installed on your machine.

  4. Download the latest model files from the Releases section.

Usage

After installation, you can begin using the TinyLlama model in your iOS applications. Here's a simple example of how to load and use the model:

import CoreML

// `TinyLlama` and `TinyLlamaInput` are the classes Xcode auto-generates when
// the .mlpackage is added to the app target; the names follow the model file.
guard let model = try? TinyLlama(configuration: MLModelConfiguration()) else {
    fatalError("Could not load model")
}

// Perform inference; `response` is the model's output feature name.
let input = TinyLlamaInput(text: "Hello, world!")
let output = try? model.prediction(input: input)
print(output?.response ?? "No response")

Model Details

TinyLlama-1.1B-Chat

  • Parameters: 1.1 billion
  • Architecture: Transformer (Llama 2 architecture and tokenizer)
  • Training Data: Pretrained on roughly 3 trillion tokens of web and code text (SlimPajama and StarCoderData)

Supported Formats

  • float16: A half-precision floating-point format that halves weight storage relative to float32.
  • int8: An 8-bit integer format that shrinks weights further and can speed up computation.
  • int4: A 4-bit format for the smallest model sizes, at some cost in accuracy.
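
The float16 variant is produced at conversion time rather than by a separate compression pass. The following is a minimal sketch of that conversion with coremltools, assuming coremltools 8+ for the iOS 18 deployment target; the stub module, tensor name, and shape are placeholders, not this repository's actual conversion script:

    import torch
    import coremltools as ct

    # Placeholder module standing in for the traced TinyLlama graph.
    class TinyStub(torch.nn.Module):
        def forward(self, x):
            return x * 2.0

    traced = torch.jit.trace(TinyStub().eval(), torch.zeros(1, 8))

    # Convert to Core ML, storing weights and computing in float16.
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="tokens", shape=(1, 8))],
        compute_precision=ct.precision.FLOAT16,
        minimum_deployment_target=ct.target.iOS18,
    )
    mlmodel.save("TinyLlama_fp16.mlpackage")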

Quantization Techniques

Quantization is the process of mapping a large set of values to a smaller set. In the context of machine learning, it helps in reducing the model size and improving inference speed without significantly sacrificing accuracy.
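
To make the mapping concrete, here is a small worked example of affine int8 quantization in NumPy; the weight values are invented for illustration:

    import numpy as np

    # Toy weight values; made up for the example.
    w = np.array([-0.62, -0.10, 0.00, 0.35, 0.91], dtype=np.float32)

    qmin, qmax = -128, 127                           # int8 range
    scale = (w.max() - w.min()) / (qmax - qmin)      # float step per integer level
    zero_point = int(round(qmin - w.min() / scale))  # integer encoding of 0.0

    # Quantize: map floats onto the 256 int8 levels.
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)

    # Dequantize: recover approximate floats and inspect the rounding error.
    w_hat = (q.astype(np.float32) - zero_point) * scale
    print(q)          # e.g. [-128  -42  -25   33  127]
    print(w_hat - w)  # small per-element reconstruction error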

Techniques Used

  1. Post-Training Quantization: This technique applies quantization after the model has been trained. It allows for efficient conversion with minimal loss in performance (a coremltools sketch follows this list).

  2. Dynamic Quantization: This approach keeps weights quantized ahead of time and quantizes activations on-the-fly during inference, trading a little speed for flexibility.

  3. Quantization-Aware Training: This method involves training the model with quantization in mind, helping it adapt to the reduced precision.
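
For the post-training path, coremltools provides ready-made weight compressors. The sketch below shows one plausible way to derive the int8 and int4 variants from an already-converted float16 package (linear quantization for int8, 4-bit palettization for int4); it assumes coremltools 8+ and is an illustration, not this repository's exact pipeline:

    import coremltools as ct
    from coremltools.optimize.coreml import (
        OpLinearQuantizerConfig,
        OpPalettizerConfig,
        OptimizationConfig,
        linear_quantize_weights,
        palettize_weights,
    )

    mlmodel = ct.models.MLModel("TinyLlama_fp16.mlpackage")

    # int8: symmetric linear quantization of the weights.
    int8_cfg = OptimizationConfig(
        global_config=OpLinearQuantizerConfig(mode="linear_symmetric")
    )
    linear_quantize_weights(mlmodel, config=int8_cfg).save(
        "TinyLlama_int8.mlpackage"
    )

    # int4: k-means palettization, clustering each weight tensor
    # into a 16-entry lookup table.
    int4_cfg = OptimizationConfig(
        global_config=OpPalettizerConfig(nbits=4, mode="kmeans")
    )
    palettize_weights(mlmodel, config=int4_cfg).save(
        "TinyLlama_int4.mlpackage"
    )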

Contributing

We welcome contributions to improve this project. If you want to help, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Make your changes and commit them.
  4. Push your changes to your fork.
  5. Submit a pull request.

Please ensure your code adheres to the project's coding standards and includes relevant tests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or support, please open an issue on GitHub or contact the repository owner.

You can also find the latest releases on the project's Releases page. Download the files you need and start working with TinyLlama today!


This README provides an overview of the TinyLlama CoreML iOS 18 Quantization project. For further details and updates, please check the repository frequently.
