
✨ Voice & Vision Assistant for Blind ✨

An Advanced Voice & Vision Assistant for Blind and Visually Impaired Users

Bridging the visual gap through AI-powered assistance


📋 Overview

Voice & Vision Assistant for Blind combines cutting-edge speech recognition, natural language processing, and computer vision to create an intuitive assistant specifically designed for blind and visually impaired users. This thoughtfully crafted solution helps users better understand their surroundings and interact with the world more confidently and independently.


🏗️ System Architecture

The system utilizes an elegant multi-component architecture to process user inputs and generate helpful responses:

graph TD
    %% Main flow - simplified
    User([User]) --> |"Voice Input"| Router["Function Router"]
    Router --> QueryType{"Query Type"}
    
    %% Simplified branches
    QueryType -->|"Visual"| VisualProcess["Visual Analysis"]
    QueryType -->|"Search"| SearchProcess["Internet Search"]
    QueryType -->|"Places"| PlacesProcess["Places Search"]
    QueryType -->|"Calendar"| CalendarProcess["Calendar Management"]
    QueryType -->|"Contact"| ContactProcess["Contact Management"]
    QueryType -->|"Email"| EmailProcess["Email Management"]
    QueryType -->|"General"| TextProcess["Direct Text Response"]
    
    %% Simplified visual path
    VisualProcess --> ModelChoice{"Model Selection"}
    ModelChoice -->|"GPT-4o"| GPTAnalysis["GPT Analysis Stream"]
    ModelChoice -->|"LLAMA"| LLAMAAnalysis["LLAMA Analysis"]
    
    %% Places path
    PlacesProcess --> GooglePlaces["Google Places API"]
    GooglePlaces --> PlacesResults["Location Details"]
    
    %% Calendar path
    CalendarProcess --> GoogleCalendar["Google Calendar API"]
    GoogleCalendar --> CalendarResults["Event Management"]
    
    %% Contact path
    ContactProcess --> GoogleContacts["Google Contacts API"]
    GoogleContacts --> ContactResults["Contact Information"]
    
    %% Email path
    EmailProcess --> Gmail["Gmail API"]
    Gmail --> EmailResults["Email Management"]
    
    %% Output consolidation - simplified
    GPTAnalysis --> Response["TTS Processing"]
    LLAMAAnalysis --> Response
    SearchProcess --> Response
    PlacesResults --> Response
    CalendarResults --> Response
    ContactResults --> Response
    EmailResults --> Response
    TextProcess --> Response
    Response --> Deliver["Voice Response to User"]
    
    %% Styling
    classDef interface fill:#e6f7ff,stroke:#1890ff,stroke-width:2px
    classDef process fill:#f6ffed,stroke:#52c41a,stroke-width:1px
    classDef decision fill:#fff7e6,stroke:#fa8c16,stroke-width:1px
    classDef output fill:#f9f0ff,stroke:#722ed1,stroke-width:1px
    classDef api fill:#fff1f0,stroke:#f5222d,stroke-width:1px
    
    class User,Deliver interface
    class Router,VisualProcess,GPTAnalysis,LLAMAAnalysis,SearchProcess,TextProcess,PlacesProcess,CalendarProcess,ContactProcess,EmailProcess process
    class QueryType,ModelChoice decision
    class Response,PlacesResults,CalendarResults,ContactResults,EmailResults output
    class GooglePlaces,GoogleCalendar,GoogleContacts,Gmail api

✨ Key Features

🔄 Dual Vision Model Approach

  • People Detection First: Llama-4-Scout-17B checks for the presence of people
  • Conditional Processing: GPT-4o handles scenes without people; Llama handles scenes with people
  • Privacy-Aware: Describes people thoughtfully while respecting their privacy

🌊 Real-time Response Streaming

  • Progressive output for improved user experience
  • Natural conversational flow with minimal latency
  • Immediate feedback during interaction (see the streaming sketch below)
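
The progressive streaming described above might be wired up roughly as follows. This is a minimal sketch assuming the official openai Python client; the speak() helper and the sentence-level flushing are illustrative assumptions, not the repository's actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def speak(text: str) -> None:
    """Placeholder for the TTS call (e.g., ElevenLabs); prints instead of speaking."""
    print(text)


def stream_reply(prompt: str) -> None:
    """Stream GPT-4o output and hand completed sentences to TTS immediately."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # tokens arrive as they are generated
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush complete sentences so the user hears output with minimal latency.
        while "." in buffer:
            sentence, buffer = buffer.split(".", 1)
            speak(sentence + ".")
    if buffer.strip():
        speak(buffer)
```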

👁️ Blind/Low-Vision Optimizations

  • Detailed Descriptions: Prioritizes key elements for visually impaired users
  • Voice-First Design: Intuitive speech interface reduces barriers
  • Concise Analysis: Thorough yet efficient scene descriptions

🎭 Virtual Avatar Integration

  • Visual Representation: Optional virtual avatar for video calls
  • Enhanced Communication: Helps blind users interact with sighted individuals
  • Professional Presence: Maintains visual engagement in meetings and calls

🧩 Comprehensive Capabilities

  • Voice Interaction: Natural conversation using speech
  • Visual Understanding: Camera-based vision to describe surroundings
  • Internet Search: Real-time information lookup
  • Location Search: Find nearby businesses, restaurants, and points of interest
  • Calendar Management: Add and view calendar events
  • Contact Management: Find contact information from your Google Contacts
  • Email Management: Read emails and send messages
  • Seamless Integration: Coordinated operation between components

🔧 Technical Implementation

Model Selection

We carefully selected meta-llama/llama-4-scout-17b-16e-instruct as our primary people-detection model based on the following criteria (a timing sketch follows the table):

| Criteria | Performance |
| --- | --- |
| Response Time | TTFT < 150ms (well below the 500ms requirement) |
| Batch Processing | Handles 10+ consecutive image queries without degradation |
| Streaming | Provides token-by-token streaming for responsive UX |
| People Recognition | Reliably identifies presence of people in images |
| Image Limits | 4MB (base64), 20MB (URL); multiple images supported |
| Success Rate | >95% in testing |
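
As a rough illustration of how the TTFT figure above could be reproduced, the sketch below times the first streamed token from the Groq endpoint using the official groq client. The prompt is an assumption, and this is not the benchmark harness behind the table.

```python
import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user", "content": "Describe this scene in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:  # first non-empty token
        ttft_ms = (time.perf_counter() - start) * 1000
        print(f"TTFT: {ttft_ms:.0f} ms")
        break
```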

Groq API Integration

The Groq API powers our Llama model implementation when people are detected in a scene (a minimal client sketch follows the list below):

  • ⚡ Fast Processing: Sub-500ms TTFT meets accessibility requirements
  • 🧠 Advanced Models: Leverages state-of-the-art Llama 4 Scout capabilities
  • 🔌 Simple Integration: Clean API with official Python client library
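
A minimal people-detection call through the Groq client might look like the sketch below; the prompt wording, image handling, and helper name are assumptions rather than the repository's groq_handler.py code.

```python
import base64
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])


def contains_people(image_path: str) -> bool:
    """Ask Llama 4 Scout whether the captured frame shows any people."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=os.getenv("GROQ_MODEL_ID", "meta-llama/llama-4-scout-17b-16e-instruct"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer only yes or no: are there any people in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return "yes" in response.choices[0].message.content.lower()
```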

🗂️ Project Structure

Ally/
├── app.py                  # Main entry point
├── requirements.txt        # Dependencies
├── .env                    # Environment variables
├── images/                 # Images and diagrams
└── src/
    ├── main.py             # Entry point and agent implementation
    ├── config.py           # Configuration handling
    ├── utils.py            # Utility functions for Google API integration
    └── tools/
        ├── visual.py       # Visual processing (camera, frames, image analysis)
        ├── groq_handler.py # Groq API integration for enhanced image analysis
        ├── internet_search.py # Web search functionality
        ├── google_places.py # Places search using Google Places API
        ├── calendar.py     # Calendar integration for managing events
        └── communication.py # Contact and email management

🚀 Getting Started

Prerequisites

  • Python 3.9+ - Core programming language
  • LiveKit API - For real-time communication
  • OpenAI API - For GPT-4o capabilities
  • Deepgram API - For speech-to-text functionality
  • ElevenLabs API - For text-to-speech synthesis
  • Groq API - For fallback vision processing
  • Google APIs - For Places, Calendar, Contacts, and Gmail functionality

Installation

1. Clone the repository
git clone https://github.com/codingaslu/Ally-Clone-Assignment.git
cd Ally-Clone-Assignment
2. Set up environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
3. Configure environment variables

Create a .env file with the following (a small loader sketch follows the listing):

LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_key
LIVEKIT_API_SECRET=your_livekit_secret
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
ELEVEN_API_KEY=your_elevenlabs_key

# Vision configuration
VISION_PROVIDER=groq

# Groq API configuration
GROQ_API_KEY=your_groq_api_key  # Get your API key from https://console.groq.com/keys
GROQ_MODEL_ID=meta-llama/llama-4-scout-17b-16e-instruct
   
# Google Places API configuration
GPLACES_API_KEY=your_google_places_api_key  # Get your API key from Google Cloud Console https://console.cloud.google.com/google/maps-apis/credentials
GMAIL_MAIL=your_gmail_address
GMAIL_APP_PASSWORD=your_gmail_app_password

# Tavus virtual avatar configuration (optional)
ENABLE_AVATAR=false
TAVUS_API_KEY=your_tavus_api_key
TAVUS_REPLICA_ID=your_replica_id
TAVUS_PERSONA_ID=your_persona_id
TAVUS_AVATAR_NAME=ally-vision-avatar
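
How these values are consumed is handled by src/config.py; as a hypothetical sketch (assuming the python-dotenv package, which may not match the project's actual approach), the loader could validate the required keys up front:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is used for .env loading

load_dotenv()  # reads .env from the project root

REQUIRED_KEYS = [
    "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET",
    "DEEPGRAM_API_KEY", "OPENAI_API_KEY", "ELEVEN_API_KEY",
    "GROQ_API_KEY", "GPLACES_API_KEY",
]

missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```
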
4. Set up Google API credentials
  1. Create a new project in the Google Cloud Console
  2. Enable the required APIs:
    • Google Places API Web Service
    • Google Calendar API
    • People API (Contacts)
    • Gmail API
  3. Create OAuth 2.0 credentials:
    • Go to "Credentials" and click "Create Credentials" > "OAuth client ID"
    • Select "Desktop app" as the application type
    • Give it a name and click "Create"
    • Download the JSON file
  4. Important: Rename the downloaded file to credentials.json and place it in the project root directory
  5. Create API key for Places API:
    • Go to "Credentials" and click "Create Credentials" > "API Key"
    • Restrict the key to only the Google Places API
    • Copy this key to your .env file as GPLACES_API_KEY
  6. When you first run the application and try to use calendar or email features, it will (see the OAuth sketch after this step):
    • Open a browser window for authentication
    • Ask you to sign in to your Google account
    • Request permission to access your calendar, contacts, and email
    • After granting permission, it will create a token.json file for future use
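
For reference, the first-run flow described in step 6 is the standard installed-app OAuth pattern. The sketch below shows how token.json is typically created and refreshed with google-auth-oauthlib; the scopes and file handling are illustrative and may differ from src/utils.py.

```python
import os

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow

# Illustrative scopes; the application may request a different set.
SCOPES = [
    "https://www.googleapis.com/auth/calendar",
    "https://www.googleapis.com/auth/contacts.readonly",
    "https://www.googleapis.com/auth/gmail.modify",
]


def load_credentials() -> Credentials:
    creds = None
    if os.path.exists("token.json"):
        creds = Credentials.from_authorized_user_file("token.json", SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())  # silent refresh, no browser or sighted help needed
        else:
            flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
            creds = flow.run_local_server(port=0)  # opens the browser for one-time consent
        with open("token.json", "w") as token:
            token.write(creds.to_json())
    return creds
```
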
5. Special Setup Instructions for Blind Users

Important: One-Time Authentication Process

For blind users, the Google OAuth authentication process requires sighted assistance only once during initial setup:

  1. Initial Setup (One-Time with Assistance):

    • After installing the application, the first time you use any Google service (calendar, contacts, email), the system will need authentication
    • A browser window will open with Google's authentication page
    • This step requires sighted assistance to complete the login and permission granting
    • The assistant should:
      • Help navigate to the URL provided in the console
      • Log in to the blind user's Google account
      • Grant the requested permissions
      • Confirm when the "Authentication successful" message appears
  2. After Initial Setup (No Assistance Needed):

    • The system creates a token.json file that stores authentication securely
    • This token remains valid indefinitely with regular use
    • No further visual authentication is typically needed
    • Re-authentication is only required if access is explicitly revoked or the application goes unused for months
  3. Long-Term Solution (Optional):

    • For completely independent use, a developer can modify the application to use Service Account authentication (a hedged sketch follows this list)
    • This alternative method doesn't require browser authentication but needs more technical setup
    • Contact the support email below if you need help implementing this solution
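
For orientation only, a service-account flow generally looks like the sketch below (using google-auth). Note the caveat: accessing a specific user's Gmail or Calendar this way usually requires Google Workspace domain-wide delegation or explicitly sharing the resource with the service account, so treat this as a starting point rather than a drop-in replacement.

```python
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/calendar"]  # illustrative scope

# Key file downloaded from the Google Cloud Console for the service account.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
)

# With Workspace domain-wide delegation, impersonate the end user's account:
delegated = creds.with_subject("user@example.com")  # hypothetical address
```
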
6. Virtual Avatar Setup (Optional)

Tavus Virtual Avatar Integration

  1. Prerequisites:

    • Sign up for a Tavus account
    • Create a replica (virtual avatar)
    • Generate an API key
  2. Installation:

    • The required dependencies are included in the requirements.txt file
    • Make sure you have livekit-agents[tavus] and livekit-plugins-tavus installed
  3. Configuration: Add the following to your .env file:

    # Tavus virtual avatar configuration
    ENABLE_AVATAR=true
    TAVUS_API_KEY=your_tavus_api_key
    TAVUS_REPLICA_ID=your_replica_id
    TAVUS_PERSONA_ID=your_persona_id
    TAVUS_AVATAR_NAME=ally-vision-avatar
    
  4. Persona Setup: You need to create a Tavus persona with specific settings using the Tavus API:

    curl --request POST \
      --url https://tavusapi.com/v2/personas \
      -H "Content-Type: application/json" \
      -H "x-api-key: <your-tavus-api-key>" \
      -d '{
        "layers": {
            "transport": {
                "transport_type": "livekit"
            }
        },
        "persona_name": "Ally Assistant Avatar",
        "pipeline_mode": "echo"
    }'
    • Save the id from the response as your TAVUS_PERSONA_ID
    • Use your replica ID as TAVUS_REPLICA_ID
  5. When to Use:

    • Professional meetings where visual presence helps
    • Family video calls with sighted relatives
    • Educational or work environments
    • Any situation where blind users benefit from having a visual representation

The avatar will automatically handle the audio output when enabled, creating a seamless experience for both the blind user and sighted participants in the conversation.

Running the Application

| Step | Command | Description |
| --- | --- | --- |
| 1 | python app.py download-files | Download dependencies |
| 2a | python app.py start | Start in standard mode |
| 2b | python app.py dev | Start in development mode |
| 3 | Connect via LiveKit playground | Begin interaction |

🛠️ Implementation Details

Core Components

🎙️ Voice Pipeline

  • STT: Deepgram for real-time speech-to-text conversion
  • TTS: ElevenLabs for natural-sounding text-to-speech

🧠 Language Processing

  • Primary LLM: OpenAI GPT-4o for conversational intelligence
  • Function Routing: Dynamic selection of appropriate capabilities

👁️ Vision Processing

  • People Detection: Llama-4-Scout-17B to determine presence of people
  • Scene Analysis: GPT-4o for scenes without people, Llama for scenes with people

🌎 Location Services

  • Places Search: Google Places API for finding businesses and points of interest (a request sketch appears after this list)
  • Relevant Results: Provides address, ratings, opening hours, and other details
  • Accessibility Focus: Tailored information relevant for blind and visually impaired users
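
As an illustration of the Places lookup, the sketch below queries the classic Places Text Search HTTP endpoint with requests and keeps only the fields a blind user is likely to need; src/tools/google_places.py may use a client library and different fields.

```python
import os

import requests


def find_places(query: str, limit: int = 3) -> list[dict]:
    """Return a few matches with the details most useful to speak aloud."""
    response = requests.get(
        "https://maps.googleapis.com/maps/api/place/textsearch/json",
        params={"query": query, "key": os.environ["GPLACES_API_KEY"]},
        timeout=10,
    )
    response.raise_for_status()
    places = []
    for place in response.json().get("results", [])[:limit]:
        places.append({
            "name": place.get("name"),
            "address": place.get("formatted_address"),
            "rating": place.get("rating"),
            "open_now": place.get("opening_hours", {}).get("open_now"),
        })
    return places


# Example: find_places("pharmacy near Central Station")
```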

🧩 Challenges and Solutions

Privacy-Preserving Vision

Challenge: GPT-4o sometimes refuses to describe people in images due to privacy guardrails.

Solution (sketched in code below):

  1. Llama model first checks for presence of people in images
  2. Route to appropriate model based on content (GPT-4o for no people, Llama for people)
  3. Response normalization for consistent user experience
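
Put together, the routing above can be sketched as follows. The helper names (detect_people, analyze_with_llama, analyze_with_gpt4o, normalize_response) are placeholders for the repository's actual functions.

```python
# Placeholders for the repository's actual helpers (names are assumptions).
def detect_people(image_bytes: bytes) -> bool: ...
def analyze_with_llama(image_bytes: bytes) -> str: ...
def analyze_with_gpt4o(image_bytes: bytes) -> str: ...
def normalize_response(text: str) -> str: ...


def describe_scene(image_bytes: bytes) -> str:
    # Step 1: lightweight people check with the Llama model via Groq.
    has_people = detect_people(image_bytes)

    # Step 2: route to the model best suited for the content.
    if has_people:
        raw = analyze_with_llama(image_bytes)   # avoids GPT-4o privacy refusals
    else:
        raw = analyze_with_gpt4o(image_bytes)   # richer detail for people-free scenes

    # Step 3: normalize tone and length so both paths sound the same to the user.
    return normalize_response(raw)
```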

Performance Optimization

Approaches:

  • ⚡ Efficient API client configuration
  • 🖼️ Image preprocessing and optimization (see the sketch below)
  • ⏱️ Parallel processing where appropriate
  • 🌊 Response streaming for immediate feedback
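
The image-preprocessing step could be as simple as the Pillow sketch below, which downscales and re-encodes a camera frame so the base64 payload stays well under the 4MB limit; the exact parameters here are assumptions.

```python
import base64
import io

from PIL import Image


def prepare_frame(image_bytes: bytes, max_side: int = 1024, quality: int = 85) -> str:
    """Downscale and JPEG-encode a frame, returning a base64 string for the vision APIs."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img.thumbnail((max_side, max_side))  # preserves aspect ratio
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality)
    return base64.b64encode(buffer.getvalue()).decode()
```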

📊 Performance and Limitations

| Aspect | Details |
| --- | --- |
| Connectivity | Requires stable internet connection |
| API Rate Limits | Subject to provider limitations |
| Image Size | Max 4MB (base64), 20MB (URL) |
| Context Window | 128K tokens in preview |

🔮 Future Roadmap

| Planned Feature | Description |
| --- | --- |
| 🖼️ Advanced preprocessing | Enhanced image optimization pipeline |
| 🗺️ Location integration | Google Maps integration for location context |
| 🌤️ Environmental data | Weather, distance, and temporal information |
| 📱 Code recognition | QR and barcode detection and processing |
| ⚡ Performance upgrades | Response caching for improved speed |
| 🎞️ Sequential analysis | Multi-image sequence processing |
| 🎙️ Voice personalization | Customizable voice profile selection |

📄 License

This project is proprietary and confidential. All rights reserved.


👏 Acknowledgments

  • LiveKit - WebRTC infrastructure
  • OpenAI - GPT-4o capabilities
  • Groq - Llama model API access
  • Deepgram - Speech recognition technology
  • ElevenLabs - Voice synthesis technology

📞 Support

For issues or questions, please contact:

Email: muhammedaslam179@gmail.com
GitHub: Open an Issue

🔧 Troubleshooting

Google API Authentication Issues

| Issue | Solution |
| --- | --- |
| "credentials.json file not found" | Ensure you've renamed the downloaded OAuth credentials file to credentials.json and placed it in the project root directory |
| "Token has been expired or revoked" | The application refreshes tokens automatically. Once authenticated, you typically won't need to log in again unless you explicitly revoke access in your Google account or leave the application unused for an extended period (months). If re-authentication is ever needed, sighted assistance is required only for that one-time process |
| Authentication window doesn't open | Run the application from a terminal with GUI access. If using SSH, ensure X11 forwarding is enabled |
| Calendar events not showing | Check that you've enabled the Calendar API in Google Cloud Console and granted the necessary permissions |
| Contacts not found | Verify that you've enabled the People API and that contacts exist in your Google Contacts |
| Email sending fails | Generate a Google App Password (requires 2-Step Verification) and set it as GMAIL_APP_PASSWORD; Google no longer supports "Less secure app access" |

General Issues

| Issue | Solution |
| --- | --- |
| Missing dependencies | Run pip install -r requirements.txt to install all required packages |
| API keys not working | Double-check your .env file for correct API keys and ensure all services are properly configured |
| Camera not enabling | Ensure your device has a camera and the necessary permissions are granted |
| Voice not working | Check your microphone settings and verify that your Deepgram API key is valid |
