LLaVA Computer Use Agent

A comprehensive fine-tuning pipeline for creating vision-language models specialized in UI automation and computer use tasks. This project fine-tunes LLaVA-1.5 on multiple UI datasets to create an agent capable of understanding and interacting with user interfaces.

🎯 Overview

This repository provides a complete training pipeline that combines multiple UI automation datasets to train a vision-based computer-use agent. The model learns from mobile app interfaces, web pages, diagrams, and document layouts to understand UI elements and support automation tasks.

📊 Datasets Used

The training pipeline automatically downloads and processes the following datasets:

Core UI Datasets (~36k samples)

  • SoM-LLaVA (20,160 samples): UI screenshots with Set-of-Marks annotations
  • RICO-Screen2Words (15,743 samples): Mobile app screenshots with natural language descriptions

Additional UI Datasets (~45k samples)

  • GUI-World (10,000 samples): General GUI understanding
  • WebSight (15,000 samples): Web page screenshots
  • Mind2Web (8,000 samples): Web interaction tasks
  • AI2D (5,000 samples): Diagram understanding for dashboards
  • RVL-CDIP (7,000 samples): Document layouts for form understanding

Total: roughly 81,000 training samples
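
The download and mixing logic lives in train.py; the snippet below is only a minimal sketch of how such a blend can be assembled with the HuggingFace datasets library. The dataset IDs and sampling probabilities are hypothetical placeholders, not the values the script uses.

from datasets import interleave_datasets, load_dataset

# Hypothetical dataset IDs -- the real ones are resolved inside train.py.
# Each source would first be mapped to a common schema (image + conversation)
# before interleaving.
ui_a = load_dataset("org/ui-dataset-a", split="train")
ui_b = load_dataset("org/ui-dataset-b", split="train")

# Sample from each source with a fixed probability: one way to realize
# configurable mixing ratios across dataset types.
mixed = interleave_datasets([ui_a, ui_b], probabilities=[0.6, 0.4], seed=42)
print(mixed.num_rows)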

🚀 Quick Start

Prerequisites

  • Python 3.10
  • CUDA-capable GPU (RTX 3090/4080+ recommended)
  • 32GB+ RAM
  • 500GB+ storage for datasets and models

Installation

  1. Clone the repository:
git clone git@github.com:Filocava99/LLaVA-computer-use-agent.git
cd LLaVA-computer-use-agent
  2. Create the conda environment:
conda create -n llava-computer-use python=3.10 -y
conda activate llava-computer-use
  3. Install dependencies:
pip install -r requirements.txt
  4. Log in to HuggingFace (required to download the datasets):
huggingface-cli login
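
Before launching a long run, it can be worth a quick check that PyTorch sees the GPU and supports BF16 (used by the --bf16 flag in the training commands below):

import torch

# Sanity-check the environment before a multi-hour training run.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())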

Training

Full Training (Recommended)

python train.py \
  --output_dir ./llava-ui-finetuned \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 1e-5 \
  --warmup_steps 1000 \
  --logging_steps 50 \
  --save_steps 1000 \
  --save_total_limit 3 \
  --dataloader_num_workers 8 \
  --bf16 \
  --remove_unused_columns False \
  --report_to tensorboard \
  --logging_dir ./logs
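
The flags above are standard HuggingFace TrainingArguments, so assuming train.py forwards them unchanged, an interrupted run can likely be resumed from the most recent checkpoint (the checkpoint path below is illustrative):

python train.py \
  --output_dir ./llava-ui-finetuned \
  --resume_from_checkpoint ./llava-ui-finetuned/checkpoint-1000
# ...plus the same remaining flags as in the full command above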

Quick Test (100 steps)

python train.py \
  --output_dir ./llava-ui-test \
  --max_steps 100 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --learning_rate 1e-5 \
  --bf16

Monitor Training

tensorboard --logdir ./logs --port 6006
# Open http://localhost:6006 in your browser

πŸ—οΈ Architecture

Base Model

  • LLaVA-1.5-7B: State-of-the-art vision-language model
  • Vision Encoder: CLIP ViT-L/14
  • Language Model: Vicuna-7B

Fine-tuning Strategy

  • LoRA: Parameter-efficient fine-tuning targeting vision and language projection layers (see the sketch after this list)
  • Mixed Precision: BF16 for faster training on modern GPUs
  • Gradient Accumulation: Effective batch size of 16 (per-device batch of 2 × 8 accumulation steps)
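
The actual adapter configuration is defined in train.py; the following is a minimal sketch of a LoRA setup with the peft library, assuming illustrative values for the rank, alpha, dropout, and target modules:

import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Illustrative hyperparameters -- not the values used by train.py.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention projections in the language model; projection layers of the
    # vision-language connector could be targeted as well.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights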

Dataset Processing

  • Multimodal Conversations: Image + text conversation format (example record after this list)
  • Dataset Mixing: Configurable ratios for different dataset types
  • Robust Processing: Handles various dataset formats and missing data
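
For illustration, a single training record in the common LLaVA conversation convention might look like the dictionary below; the field names are an assumption based on that convention, not necessarily what train.py emits:

# One multimodal training record (illustrative field names).
sample = {
    "image": "rico/screenshot_01234.png",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this screen and list its interactive elements."},
        {"from": "gpt", "value": "A login screen with a username field, a password field, and a 'Sign in' button."},
    ],
}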

📁 Project Structure

LLaVA-computer-use-agent/
├── train.py              # Main training script
├── requirements.txt      # Python dependencies
├── README.md             # This file
├── logs/                 # TensorBoard logs (created during training)
├── llava-ui-finetuned/   # Output directory for trained model
└── dataset_cache/        # Cached datasets (created automatically)

🔧 Configuration

Hardware Requirements

Hardware   Minimum           Recommended
GPU        RTX 3090 (24GB)   RTX 4080/4090 (16GB+)
RAM        32GB              64GB+
Storage    500GB             1TB+
CUDA       11.8+             12.1+

Training Parameters

Parameter                     Default   Description
num_train_epochs              2         Number of training epochs
per_device_train_batch_size   2         Batch size per GPU
gradient_accumulation_steps   8         Steps to accumulate gradients
learning_rate                 1e-5      Learning rate for optimizer
warmup_steps                  1000      Warmup steps for learning rate

📈 Expected Results

Training Metrics

  • Training Time: 6-10 hours on RTX 4080
  • Peak Memory: ~14GB VRAM
  • Final Loss: <1.0 (typically 0.5-0.8)

Model Capabilities

  • UI element detection and description
  • Screen content understanding
  • Form and document layout analysis
  • Web page structure recognition
  • Mobile app interface comprehension

🧪 Usage After Training

import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

# Load the fine-tuned model
model = LlavaForConditionalGeneration.from_pretrained(
    "./llava-ui-finetuned", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("./llava-ui-finetuned")

# LLaVA-1.5 expects the <image> placeholder inside its conversation template
image = Image.open("screenshot.png")
prompt = "USER: <image>\nDescribe this user interface and identify interactive elements. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)

print(response)
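
Note that the decoded string echoes the full prompt before the model's reply; if you only want the answer, keep the text after the final "ASSISTANT:" marker.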

🤝 Contributing

Contributions are welcome! Please feel free to:

  • Report bugs and issues
  • Suggest new datasets or improvements
  • Submit pull requests
  • Share training results and insights

📝 License

This project is licensed under the MIT License. See individual dataset licenses for their respective terms.

🙏 Acknowledgments

  • LLaVA Team: For the base vision-language model
  • Dataset Contributors: For providing high-quality UI automation datasets
  • HuggingFace: For the transformers library and dataset hosting
  • Community: For feedback and contributions

📚 Citation

If you use this work in your research, please cite:

@misc{llava-computer-use-agent,
  title={LLaVA Computer Use Agent: Fine-tuning Vision-Language Models for UI Automation},
  author={Your Name},
  year={2024},
  url={https://github.com/Filocava99/LLaVA-computer-use-agent}
}

For questions and support, please open an issue on GitHub or contact the maintainers.
