Bridging the visual gap through AI-powered assistance
Voice & Vision Assistant for Blind combines cutting-edge speech recognition, natural language processing, and computer vision to create an intuitive assistant specifically designed for blind and visually impaired users. This thoughtfully crafted solution helps users better understand their surroundings and interact with the world more confidently and independently.
The system utilizes an elegant multi-component architecture to process user inputs and generate helpful responses:
graph TD
%% Main flow - simplified
User([User]) --> |"Voice Input"| Router["Function Router"]
Router --> QueryType{"Query Type"}
%% Simplified branches
QueryType -->|"Visual"| VisualProcess["Visual Analysis"]
QueryType -->|"Search"| SearchProcess["Internet Search"]
QueryType -->|"Places"| PlacesProcess["Places Search"]
QueryType -->|"Calendar"| CalendarProcess["Calendar Management"]
QueryType -->|"Contact"| ContactProcess["Contact Management"]
QueryType -->|"Email"| EmailProcess["Email Management"]
QueryType -->|"General"| TextProcess["Direct Text Response"]
%% Simplified visual path
VisualProcess --> ModelChoice{"Model Selection"}
ModelChoice -->|"GPT-4o"| GPTAnalysis["GPT Analysis Stream"]
ModelChoice -->|"LLAMA"| LLAMAAnalysis["LLAMA Analysis"]
%% Places path
PlacesProcess --> GooglePlaces["Google Places API"]
GooglePlaces --> PlacesResults["Location Details"]
%% Calendar path
CalendarProcess --> GoogleCalendar["Google Calendar API"]
GoogleCalendar --> CalendarResults["Event Management"]
%% Contact path
ContactProcess --> GoogleContacts["Google Contacts API"]
GoogleContacts --> ContactResults["Contact Information"]
%% Email path
EmailProcess --> Gmail["Gmail API"]
Gmail --> EmailResults["Email Management"]
%% Output consolidation - simplified
GPTAnalysis --> Response["TTS Processing"]
LLAMAAnalysis --> Response
SearchProcess --> Response
PlacesResults --> Response
CalendarResults --> Response
ContactResults --> Response
EmailResults --> Response
TextProcess --> Response
Response --> Deliver["Voice Response to User"]
%% Styling
classDef interface fill:#e6f7ff,stroke:#1890ff,stroke-width:2px
classDef process fill:#f6ffed,stroke:#52c41a,stroke-width:1px
classDef decision fill:#fff7e6,stroke:#fa8c16,stroke-width:1px
classDef output fill:#f9f0ff,stroke:#722ed1,stroke-width:1px
classDef api fill:#fff1f0,stroke:#f5222d,stroke-width:1px
class User,Deliver interface
class Router,VisualProcess,GPTAnalysis,LLAMAAnalysis,SearchProcess,TextProcess,PlacesProcess,CalendarProcess,ContactProcess,EmailProcess process
class QueryType,ModelChoice decision
class Response,PlacesResults,CalendarResults,ContactResults,EmailResults output
class GooglePlaces,GoogleCalendar,GoogleContacts,Gmail api
- People Detection First: Llama-4-Scout-17B checks for presence of people
- Conditional Processing: GPT-4o for scenes without people, Llama for scenes with people
- Privacy-Aware: Thoughtful descriptions while respecting privacy
- Progressive output for improved user experience
- Natural conversational flow with minimal latency
- Immediate feedback during interaction
- Detailed Descriptions: Prioritizes key elements for visually impaired users
- Voice-First Design: Intuitive speech interface reduces barriers
- Concise Analysis: Thorough yet efficient scene descriptions
- Visual Representation: Optional virtual avatar for video calls
- Enhanced Communication: Helps blind users interact with sighted individuals
- Professional Presence: Maintains visual engagement in meetings and calls
- Voice Interaction: Natural conversation using speech
- Visual Understanding: Camera-based vision to describe surroundings
- Internet Search: Real-time information lookup
- Location Search: Find nearby businesses, restaurants, and points of interest
- Calendar Management: Add and view calendar events
- Contact Management: Find contact information from your Google Contacts
- Email Management: Read emails and send messages
- Seamless Integration: Coordinated operation between components
We carefully selected meta-llama/llama-4-scout-17b-16e-instruct
as our primary people detection model based on:
Criteria | Performance |
---|---|
Response Time | TTFT < 150ms (well below 500ms requirement) |
Batch Processing | Handles 10+ consecutive image queries without degradation |
Streaming | Provides token-by-token streaming for responsive UX |
People Recognition | Reliably identifies presence of people in images |
Image Limits | 4MB (base64), 20MB (URL), multiple images supported |
Success Rate | >95% in testing |
The Groq API powers our Llama model implementation when people are detected in scenes:
- โก Fast Processing: Sub-500ms TTFT meets accessibility requirements
- ๐ง Advanced Models: Leverages state-of-the-art Llama 4 Scout capabilities
- ๐ Simple Integration: Clean API with official Python client library
Ally/
โโโ app.py # Main entry point
โโโ requirements.txt # Dependencies
โโโ .env # Environment variables
โโโ images/ # Images and diagrams
โโโ src/
โโโ main.py # Entry point and agent implementation
โโโ config.py # Configuration handling
โโโ utils.py # Utility functions for Google API integration
โโโ tools/
โโโ visual.py # Visual processing (camera, frames, image analysis)
โโโ groq_handler.py # Groq API integration for enhanced image analysis
โโโ internet_search.py # Web search functionality
โโโ google_places.py # Places search using Google Places API
โโโ calendar.py # Calendar integration for managing events
โโโ communication.py # Contact and email management
- Python 3.9+ - Core programming language
- LiveKit API - For real-time communication
- OpenAI API - For GPT-4o capabilities
- Deepgram API - For speech-to-text functionality
- ElevenLabs API - For text-to-speech synthesis
- Groq API - For fallback vision processing
- Google APIs - For Places, Calendar, Contacts, and Gmail functionality
1. Clone the repository
git clone https://github.com/codingaslu/Ally-Clone-Assignment.git
cd Ally-Clone-Assignment
2. Set up environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
3. Configure environment variables
Create a .env
file with the following:
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_key
LIVEKIT_API_SECRET=your_livekit_secret
DEEPGRAM_API_KEY=your_deepgram_key
OPENAI_API_KEY=your_openai_key
ELEVEN_API_KEY=your_elevenlabs_key
# Vision configuration
VISION_PROVIDER=groq
.
# Groq API configuration
GROQ_API_KEY=your_groq_api_key # Get your API key from https://console.groq.com/keys
GROQ_MODEL_ID=meta-llama/llama-4-scout-17b-16e-instruct
# Google Places API configuration
GPLACES_API_KEY=your_google_places_api_key # Get your API key from Google Cloud Console https://console.cloud.google.com/google/maps-apis/credentials
GMAIL_MAIL=your_gmail_address
GMAIL_APP_PASSWORD=your_gmail_app_password
# Tavus virtual avatar configuration (optional)
ENABLE_AVATAR=false
TAVUS_API_KEY=your_tavus_api_key
TAVUS_REPLICA_ID=your_replica_id
TAVUS_PERSONA_ID=your_persona_id
TAVUS_AVATAR_NAME=ally-vision-avatar
4. Set up Google API credentials
- Create a new project in the Google Cloud Console
- Enable the required APIs:
- Google Places API Web Service
- Google Calendar API
- People API (Contacts)
- Gmail API
- Create OAuth 2.0 credentials:
- Go to "Credentials" and click "Create Credentials" > "OAuth client ID"
- Select "Desktop app" as the application type
- Give it a name and click "Create"
- Download the JSON file
- Important: Rename the downloaded file to
credentials.json
and place it in the project root directory - Create API key for Places API:
- Go to "Credentials" and click "Create Credentials" > "API Key"
- Restrict the key to only the Google Places API
- Copy this key to your
.env
file asGPLACES_API_KEY
- When you first run the application and try to use calendar or email features, it will:
- Open a browser window for authentication
- Ask you to sign in to your Google account
- Request permission to access your calendar, contacts, and email
- After granting permission, it will create a
token.json
file for future use
5. Special Setup Instructions for Blind Users
For blind users, the Google OAuth authentication process requires sighted assistance only once during initial setup:
-
Initial Setup (One-Time with Assistance):
- After installing the application, the first time you use any Google service (calendar, contacts, email), the system will need authentication
- A browser window will open with Google's authentication page
- This step requires sighted assistance to complete the login and permission granting
- The assistant should:
- Help navigate to the URL provided in the console
- Log in to the blind user's Google account
- Grant the requested permissions
- Confirm when the "Authentication successful" message appears
-
After Initial Setup (No Assistance Needed):
- The system creates a
token.json
file that stores authentication securely - This token remains valid indefinitely with regular use
- No further visual authentication is typically needed
- Re-authentication is only required if access is explicitly revoked or unused for months
- The system creates a
-
Long-Term Solution (Optional):
- For completely independent use, a developer can modify the application to use Service Account authentication
- This alternative method doesn't require browser authentication but needs more technical setup
- Contact the support email below if you need help implementing this solution
6. Virtual Avatar Setup (Optional)
-
Prerequisites:
- Sign up for a Tavus account
- Create a replica (virtual avatar)
- Generate an API key
-
Installation:
- The required dependencies are included in the requirements.txt file
- Make sure you have
livekit-agents[tavus]
andlivekit-plugins-tavus
installed
-
Configuration: Add the following to your
.env
file:# Tavus virtual avatar configuration ENABLE_AVATAR=true TAVUS_API_KEY=your_tavus_api_key TAVUS_REPLICA_ID=your_replica_id TAVUS_PERSONA_ID=your_persona_id TAVUS_AVATAR_NAME=ally-vision-avatar
-
Persona Setup: You need to create a Tavus persona with specific settings using the Tavus API:
curl --request POST \ --url https://tavusapi.com/v2/personas \ -H "Content-Type: application/json" \ -H "x-api-key: <your-tavus-api-key>" \ -d '{ "layers": { "transport": { "transport_type": "livekit" } }, "persona_name": "Ally Assistant Avatar", "pipeline_mode": "echo" }'
- Save the
id
from the response as yourTAVUS_PERSONA_ID
- Use your replica ID as
TAVUS_REPLICA_ID
- Save the
-
When to Use:
- Professional meetings where visual presence helps
- Family video calls with sighted relatives
- Educational or work environments
- Any situation where blind users benefit from having a visual representation
The avatar will automatically handle the audio output when enabled, creating a seamless experience for both the blind user and sighted participants in the conversation.
Step | Command | Description |
---|---|---|
1 | python app.py download-files |
Download dependencies |
2a | python app.py start |
Start in standard mode |
2b | python app.py dev |
Start in development mode |
3 | Connect via LiveKit playground | Begin interaction |
- STT: Deepgram for real-time speech-to-text conversion
- TTS: ElevenLabs for natural-sounding text-to-speech
- Primary LLM: OpenAI GPT-4o for conversational intelligence
- Function Routing: Dynamic selection of appropriate capabilities
- People Detection: Llama-4-Scout-17B to determine presence of people
- Scene Analysis: GPT-4o for scenes without people, Llama for scenes with people
- Places Search: Google Places API for finding businesses and points of interest
- Relevant Results: Provides address, ratings, opening hours, and other details
- Accessibility Focus: Tailored information relevant for blind and visually impaired users
Challenge: GPT-4o sometimes refuses to describe people in images due to privacy guardrails.
Solution:
- Llama model first checks for presence of people in images
- Route to appropriate model based on content (GPT-4o for no people, Llama for people)
- Response normalization for consistent user experience
Approaches:
- โก Efficient API client configuration
- ๐ผ๏ธ Image preprocessing and optimization
- โฑ๏ธ Parallel processing where appropriate
- ๐ Response streaming for immediate feedback
Aspect | Details |
---|---|
Connectivity | Requires stable internet connection |
API Rate Limits | Subject to provider limitations |
Image Size | Max 4MB (base64), 20MB (URL) |
Context Window | 128K tokens in preview |
Planned Feature | Description |
---|---|
๐ผ๏ธ Advanced preprocessing | Enhanced image optimization pipeline |
๐บ๏ธ Location integration | Google Maps integration for location context |
๐ค๏ธ Environmental data | Weather, distance, and temporal information |
๐ฑ Code recognition | QR and barcode detection and processing |
โก Performance upgrades | Response caching for improved speed |
๐๏ธ Sequential analysis | Multi-image sequence processing |
๐๏ธ Voice personalization | Customizable voice profile selection |
This project is proprietary and confidential. All rights reserved.
- LiveKit - WebRTC infrastructure
- OpenAI - GPT-4o capabilities
- Groq - Llama model API access
- Deepgram - Speech recognition technology
- ElevenLabs - Voice synthesis technology
For issues or questions, please contact:
Email: muhammedaslam179@gmail.com
GitHub: Open an Issue
Issue | Solution |
---|---|
"credentials.json file not found" | Ensure you've renamed the downloaded OAuth credentials file to credentials.json and placed it in the project root directory |
"Token has been expired or revoked" | The application handles token refreshing automatically. Once authenticated, you typically won't need to log in again unless you explicitly revoke access in your Google account or don't use the application for an extended period (months). If re-authentication is ever needed, sighted assistance would be required only for that one-time process. |
Authentication window doesn't open | Run the application from a terminal with GUI access. If using SSH, ensure X11 forwarding is enabled |
Calendar events not showing | Check that you've enabled the Calendar API in Google Cloud Console and granted the necessary permissions |
Contacts not found | Verify that you've enabled the People API and that contacts exist in your Google Contacts |
Email sending fails | Make sure you've enabled "Less secure app access" in your Google account or generated an App Password if using 2FA |
Issue | Solution |
---|---|
Missing dependencies | Run pip install -r requirements.txt to install all required packages |
API keys not working | Double-check your .env file for correct API keys and ensure all services are properly configured |
Camera not enabling | Ensure your device has a camera and the necessary permissions are granted |
Voice not working | Check your microphone settings and verify Deepgram API key is valid |