Voice-to-Text AI Agent

An autonomous agent for converting M4A voice recordings to text with flexible processing options. It supports fully local transcription with no LLM dependency, transcription via OpenAI's API, and optional LLM post-processing.

Features

  • Convert M4A audio files to text transcriptions
  • Two transcription modes:
    • Local mode: Uses Whisper locally without requiring an API key
    • LLM mode: Uses OpenAI's API for transcription
  • Optional LLM post-processing for enhanced formatting and summarization
  • Preprocess audio for improved transcription quality (see the sketch after this list)
  • Autonomous workflow with error handling
  • Simple command-line interface
  • Optional web interface (with Streamlit)
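
The preprocessing mentioned in the feature list usually means normalizing the input to mono 16 kHz WAV before it reaches Whisper. Here is a minimal sketch using pydub; the function name and parameters are illustrative, not the project's actual code:

# Preprocessing sketch using pydub (requires ffmpeg on PATH).
# 16 kHz mono is what Whisper expects internally; whether the project
# uses these exact settings is an assumption.
from pydub import AudioSegment

def preprocess(m4a_path: str, wav_path: str) -> str:
    audio = AudioSegment.from_file(m4a_path, format="m4a")
    audio = audio.set_frame_rate(16000).set_channels(1)  # resample + downmix
    audio.export(wav_path, format="wav")
    return wav_path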

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/voice-to-text-ai.git
cd voice-to-text-ai
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up API keys (optional for local transcription mode):
# Create a .env file with your API keys
echo "OPENAI_API_KEY=your_openai_api_key" > .env

Note: API keys are only required if you use the LLM transcription mode or LLM post-processing. Local transcription mode works without any API keys.
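
If you do use an LLM feature, the key from .env can be loaded at startup. A minimal sketch using python-dotenv (assumed here; the project's actual loading mechanism may differ):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current directory
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not set (only needed for LLM mode)")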

Local Transcription Dependencies

To use the local transcription mode, you need to install the following dependencies:

pip install openai-whisper torch "numpy<2.0.0" numba

Alternatively, you can use the provided installation script:

python install_local_deps.py

This script will check for missing dependencies and install them for you.
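
For illustration, such a check-and-install script can be as simple as the following sketch (the bundled install_local_deps.py may differ):

# Sketch of a dependency check-and-install loop; illustrative only.
import importlib.util
import subprocess
import sys

REQUIRED = {"whisper": "openai-whisper", "torch": "torch",
            "numpy": "numpy<2.0.0", "numba": "numba"}

for module, pip_name in REQUIRED.items():
    if importlib.util.find_spec(module) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])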

Important Notes:

  • Local transcription requires NumPy < 2.0.0 due to compatibility issues with Whisper
  • For GPU acceleration, ensure you have a CUDA-compatible GPU and the appropriate CUDA toolkit installed
  • The first time you use local transcription with a given model size, Whisper downloads that model, which may take a while depending on your internet connection (see the sketch below)
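
For reference, local transcription with openai-whisper boils down to two calls, and it is load_model that triggers the one-time download; this is presumably what the local mode wraps:

import whisper

model = whisper.load_model("base")      # downloads to ~/.cache/whisper on first use
result = model.transcribe("audio.m4a")  # ffmpeg decodes the m4a for you
print(result["text"])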

Usage

Command Line Interface

Convert an M4A file to text using local transcription (default):

python -m src.main /path/to/audio.m4a

Use OpenAI API for transcription:

python -m src.main /path/to/audio.m4a --mode=llm
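
LLM mode sends the audio to OpenAI's transcription endpoint. A standalone sketch of that call with the current openai Python client (not the project's actual code path):

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
with open("/path/to/audio.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)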

Enable LLM post-processing for enhanced formatting:

python -m src.main /path/to/audio.m4a --use-llm --enhance-formatting --fix-punctuation --summarize

Specify Whisper model size for local transcription:

python -m src.main /path/to/audio.m4a --model-size=small

Use GPU acceleration for local transcription (if available):

python -m src.main /path/to/audio.m4a --device=cuda

Python API

from src.agent.agent import VoiceToTextAgent, AgentOptions
from src.transcription.models import TranscriptionOptions
from src.llm.processor import ProcessingOptions

# Initialize with local transcription (no LLM)
agent_options = AgentOptions(
    transcription_mode="local",  # Use "llm" for OpenAI API
    skip_llm_processing=True,  # Skip LLM post-processing
    transcription_options=TranscriptionOptions(
        model_size="base",
        device="cpu"  # Use "cuda" for GPU acceleration
    )
)

# Initialize the agent
agent = VoiceToTextAgent(options=agent_options)

# Process a file
result = agent.process_file("/path/to/audio.m4a")

# Access the result
print(result.text)

# Example with LLM post-processing
agent_with_llm = VoiceToTextAgent(
    options=AgentOptions(
        transcription_mode="local",
        skip_llm_processing=False,
        transcription_options=TranscriptionOptions(model_size="base"),
        processing_options=ProcessingOptions(
            enhance_formatting=True,
            fix_punctuation=True,
            summarize=True
        )
    )
)

result_with_llm = agent_with_llm.process_file("/path/to/audio.m4a")
print(result_with_llm.text)
print(result_with_llm.summary)

Web Interface (Optional)

Start the Streamlit web interface:

streamlit run src/web/app.py

Then open your browser at http://localhost:8501
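
To give a feel for what the web UI does, a stripped-down Streamlit app along the same lines might look like this (illustrative only; the bundled src/web/app.py is more complete, and the AgentOptions defaults assumed here are not confirmed):

import tempfile
import streamlit as st
from src.agent.agent import VoiceToTextAgent, AgentOptions

st.title("Voice-to-Text")
uploaded = st.file_uploader("Upload an M4A recording", type=["m4a"])
if uploaded is not None and st.button("Transcribe"):
    # Persist the upload to disk, since the agent expects a file path
    with tempfile.NamedTemporaryFile(suffix=".m4a", delete=False) as tmp:
        tmp.write(uploaded.getvalue())
    agent = VoiceToTextAgent(options=AgentOptions(transcription_mode="local"))
    result = agent.process_file(tmp.name)
    st.text_area("Transcript", result.text)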

Dependencies

  • Python 3.9+
  • pydub/ffmpeg for audio conversion
  • OpenAI Whisper for speech recognition (runs locally or via API)
  • PyTorch for running Whisper locally
  • OpenAI API for LLM processing (optional)
  • Pydantic for data validation
  • Streamlit (optional) for web UI
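
A requirements.txt covering the stack above might look like this (the entries other than the NumPy ceiling noted earlier are unpinned, and python-dotenv is an assumption):

pydub
openai-whisper
torch
numpy<2.0.0
numba
openai
pydantic
python-dotenv
streamlit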

Project Structure

voice_to_text_ai/
├── src/
│   ├── audio/                # Audio processing modules
│   ├── transcription/        # Transcription modules
│   ├── llm/                  # LLM integration
│   ├── agent/                # Agent logic
│   └── utils/                # Utility functions
├── tests/                    # Test directory
└── examples/                 # Example usage and sample files

Development

Run tests:

pytest

Run linting:

flake8 src tests
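
A starting point for a new unit test, assuming process_file rejects missing inputs with an exception (the exact exception type is an assumption; pin it to the agent's real error handling):

# tests/test_agent.py -- sketch only
import pytest
from src.agent.agent import VoiceToTextAgent, AgentOptions

def test_missing_file_raises():
    agent = VoiceToTextAgent(options=AgentOptions(transcription_mode="local",
                                                  skip_llm_processing=True))
    with pytest.raises(Exception):  # real code should assert the specific exception
        agent.process_file("/nonexistent/audio.m4a")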

License

MIT
