A comprehensive fine-tuning pipeline for creating vision-language models specialized in UI automation and computer use tasks. This project fine-tunes LLaVA-1.5 on multiple UI datasets to create an agent capable of understanding and interacting with user interfaces.
The training pipeline mixes several UI automation datasets covering mobile app interfaces, web pages, diagrams, and document layouts, so the resulting model learns to recognize UI elements and support automation tasks.
The training pipeline automatically downloads and processes the following datasets:
- SoM-LLaVA (20,160 samples): UI screenshots with Set-of-Marks annotations
- RICO-Screen2Words (15,743 samples): Mobile app screenshots with natural language descriptions
- GUI-World (10,000 samples): General GUI understanding
- WebSight (15,000 samples): Web page screenshots
- Mind2Web (8,000 samples): Web interaction tasks
- AI2D (5,000 samples): Diagram understanding for dashboards
- RVLCDIP (7,000 samples): Document layouts for form understanding
Total: ~81,000 training samples
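For orientation, here is a minimal sketch of how sources like these can be loaded and mixed with the HuggingFace `datasets` library. The repository IDs and mixing ratios below are placeholders, not the identifiers actually used by `train.py`:

```python
# Hypothetical sketch of dataset mixing with the HuggingFace datasets library.
# The dataset IDs and sampling probabilities are placeholders; see train.py
# for the datasets and ratios actually used.
from datasets import load_dataset, interleave_datasets

screen2words = load_dataset("some-org/rico-screen2words", split="train", streaming=True)  # placeholder ID
websight     = load_dataset("some-org/websight-subset", split="train", streaming=True)    # placeholder ID

# Interleave the sources with configurable mixing ratios.
mixed = interleave_datasets(
    [screen2words, websight],
    probabilities=[0.6, 0.4],
    seed=42,
    stopping_strategy="all_exhausted",
)

for sample in mixed.take(3):
    print(sample.keys())
```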
Prerequisites:
- Python 3.10
- CUDA-capable GPU (RTX 3090/4080+ recommended)
- 32GB+ RAM
- 500GB+ storage for datasets and models
Installation:
- Clone the repository:
```bash
git clone git@github.com:Filocava99/LLaVA-computer-use-agent.git
cd LLaVA-computer-use-agent
```
- Create a conda environment:
```bash
conda create -n llava-computer-use python=3.10 -y
conda activate llava-computer-use
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Log in to HuggingFace (required to download the datasets):
```bash
huggingface-cli login
```
Start a full training run:
```bash
python train.py \
    --output_dir ./llava-ui-finetuned \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-5 \
    --warmup_steps 1000 \
    --logging_steps 50 \
    --save_steps 1000 \
    --save_total_limit 3 \
    --dataloader_num_workers 8 \
    --bf16 \
    --remove_unused_columns False \
    --report_to tensorboard \
    --logging_dir ./logs
```
For a quick smoke test of the pipeline (100 steps), run:
```bash
python train.py \
    --output_dir ./llava-ui-test \
    --max_steps 100 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --bf16
```
Monitor training with TensorBoard:
```bash
tensorboard --logdir ./logs --port 6006
# Open http://localhost:6006 in your browser
```
Model and training approach:
- LLaVA-1.5-7B: State-of-the-art vision-language model
- Vision Encoder: CLIP ViT-L/14
- Language Model: Vicuna-7B
- LoRA: Parameter-efficient fine-tuning targeting vision and language projection layers (see the configuration sketch after this list)
- Mixed Precision: BF16 for faster training on modern GPUs
- Gradient Accumulation: Effective batch size of 16 (2 per device × 8 accumulation steps)
- Multimodal Conversations: Image + text conversation format
- Dataset Mixing: Configurable ratios for different dataset types
- Robust Processing: Handles various dataset formats and missing data
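As an illustration of the LoRA setup described above, the sketch below uses the `peft` library; the rank, alpha, dropout, and target module names are assumptions and may differ from the values in `train.py`:

```python
# Illustrative LoRA configuration with peft; hyperparameters and target
# module names are assumptions, not necessarily those used by train.py.
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections plus the multimodal projector layers (assumed names).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "linear_1", "linear_2"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Only a small fraction of weights remain trainable.
```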
Project layout:
```
LLaVA-computer-use-agent/
├── train.py               # Main training script
├── requirements.txt       # Python dependencies
├── README.md              # This file
├── logs/                  # TensorBoard logs (created during training)
├── llava-ui-finetuned/    # Output directory for the trained model
└── dataset_cache/         # Cached datasets (created automatically)
```
| Hardware | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24GB) | RTX 4080/4090 (16GB+) |
| RAM | 32GB | 64GB+ |
| Storage | 500GB | 1TB+ |
| CUDA | 11.8+ | 12.1+ |
| Parameter | Default | Description |
|---|---|---|
| `num_train_epochs` | 2 | Number of training epochs |
| `per_device_train_batch_size` | 2 | Batch size per GPU |
| `gradient_accumulation_steps` | 8 | Steps over which gradients are accumulated |
| `learning_rate` | 1e-5 | Learning rate for the optimizer |
| `warmup_steps` | 1000 | Learning-rate warmup steps |
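As a rough sanity check on these defaults (assuming a single GPU), the effective batch size works out to 16, and the full dataset implies roughly 10,000 optimizer steps over two epochs:

```python
# Rough step-count estimate from the defaults above (single-GPU assumption).
samples = 20160 + 15743 + 10000 + 15000 + 8000 + 5000 + 7000  # ~80,903 total samples
effective_batch = 2 * 8      # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = samples // effective_batch                  # ~5,056
total_steps = 2 * steps_per_epoch                             # ~10,112 over 2 epochs
print(effective_batch, steps_per_epoch, total_steps)
```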
- Training Time: 6-10 hours on RTX 4080
- Peak Memory: ~14GB VRAM
- Final Loss: <1.0 (typically 0.5-0.8)
The fine-tuned model targets the following capabilities:
- UI element detection and description
- Screen content understanding
- Form and document layout analysis
- Web page structure recognition
- Mobile app interface comprehension
Example inference with the fine-tuned model:
```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

# Load the fine-tuned model and its processor
model = LlavaForConditionalGeneration.from_pretrained("./llava-ui-finetuned")
processor = AutoProcessor.from_pretrained("./llava-ui-finetuned")

# Build the prompt in LLaVA-1.5 chat format; the <image> placeholder marks
# where the image features are inserted
image = Image.open("screenshot.png")
prompt = "USER: <image>\nDescribe this user interface and identify interactive elements. ASSISTANT:"

# Process image and prompt, then generate
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
Contributions are welcome! Please feel free to:
- Report bugs and issues
- Suggest new datasets or improvements
- Submit pull requests
- Share training results and insights
This project is licensed under the MIT License. See individual dataset licenses for their respective terms.
- LLaVA Team: For the base vision-language model
- Dataset Contributors: For providing high-quality UI automation datasets
- HuggingFace: For the transformers library and dataset hosting
- Community: For feedback and contributions
If you use this work in your research, please cite:
```bibtex
@misc{llava-computer-use-agent,
  title={LLaVA Computer Use Agent: Fine-tuning Vision-Language Models for UI Automation},
  author={Your Name},
  year={2024},
  url={https://github.com/Filocava99/LLaVA-computer-use-agent}
}
```
- LLaVA: Large Language and Vision Assistant
- Set-of-Mark Prompting for Vision-Language Models
- Mind2Web: Towards a Generalist Agent for the Web
For questions and support, please open an issue on GitHub or contact the maintainers.