1. Vision
On-Device RAG has a bright future. Just as traditional desktop apps (Word, Excel, Photoshop) expanded to the web, AI is now moving from the cloud to your device, bringing privacy, speed, and offline capability.
2. Methodology
Overview of the internal RAG (Retrieval-Augmented Generation) pipeline:
- Ingestion: Documents are split into segments (chunks) and converted into numerical vectors using a local embedding model.
- Storage: Vectors are stored locally in your browser (IndexedDB), ensuring complete privacy.
- Retrieval: A hybrid search approach combines Semantic Search (meaning-based) and Keyword Search (BM25) to find the most relevant context.
- Generation: An optimized LLM running locally via WebGPU uses the retrieved context to generate an answer grounded in your documents.
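The ingestion and retrieval steps can be illustrated with a minimal sketch. It is not the app's actual code. The function names, the chunk sizes, the ASCII-only tokenizer, and the BM25 parameters `k1 = 1.2` and `b = 0.75` are all assumptions, and the `vector` values stand in for whatever the local embedding model produces. The source does not say how semantic and keyword results are combined, so this sketch uses reciprocal rank fusion (RRF) with `k = 60` as one common choice.

```typescript
// Minimal hybrid-retrieval sketch (hypothetical names; not the app's actual implementation).

type Chunk = { id: number; text: string; vector: number[] }; // vector: output of an embedding model

// Ingestion: split text into overlapping windows of words (sizes are illustrative assumptions).
function chunkText(text: string, size = 50, overlap = 10): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = Math.max(1, size - overlap);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window already reaches the end
  }
  return chunks;
}

// Simple lowercase tokenizer; keeps only ASCII letters and digits (an assumption).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Keyword search: Okapi BM25 score of every document for the query.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const docTokens = docs.map(tokenize);
  const n = docs.length;
  const avgLen = docTokens.reduce((sum, t) => sum + t.length, 0) / Math.max(1, n);
  // Document frequency: how many documents contain each term.
  const df = new Map<string, number>();
  for (const tokens of docTokens) {
    for (const term of Array.from(new Set(tokens))) df.set(term, (df.get(term) ?? 0) + 1);
  }
  const queryTerms = tokenize(query);
  return docTokens.map((tokens) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      const nq = df.get(term) ?? 0;
      const idf = Math.log(1 + (n - nq + 0.5) / (nq + 0.5));
      const norm = tf + k1 * (1 - b + (b * tokens.length) / avgLen);
      score += (idf * tf * (k1 + 1)) / norm;
    }
    return score;
  });
}

// Semantic search: cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// 1-based rank of each item when sorted by descending score.
function ranks(scores: number[]): number[] {
  const order = scores.map((_, i) => i).sort((x, y) => scores[y] - scores[x]);
  const rank = new Array<number>(scores.length);
  order.forEach((docIndex, position) => {
    rank[docIndex] = position + 1;
  });
  return rank;
}

// Hybrid search: fuse keyword and semantic rankings with reciprocal rank fusion.
function hybridSearch(queryText: string, queryVector: number[], chunks: Chunk[], topK = 3, k = 60): Chunk[] {
  const keywordRank = ranks(bm25Scores(queryText, chunks.map((c) => c.text)));
  const semanticRank = ranks(chunks.map((c) => cosine(queryVector, c.vector)));
  return chunks
    .map((chunk, i) => ({ chunk, score: 1 / (k + keywordRank[i]) + 1 / (k + semanticRank[i]) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((entry) => entry.chunk);
}
```

RRF combines ranks rather than raw scores, so BM25 scores and cosine similarities do not need to be put on a common scale first. In use, a question such as "Where is the Eiffel Tower?" would be embedded with the same model as the chunks and passed to `hybridSearch` together with its text.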
3. Current Capabilities
- Excels at direct Q&A (e.g., "What is photosynthesis?", "Where is the Eiffel Tower?").
- Best suited for short, factual answers based on your documents.
4. Assumptions
- Users prefer privacy and clear, direct answers over conversational chit-chat.
5. Limitations
- Indirect Questions: May struggle with implied or highly nuanced questions compared to larger models.
- Model Size: Running in the browser means using smaller, optimized models, which may have less general "world knowledge."
- Browser Limits: The browser must support WebGPU, and performance depends on your device's GPU and memory.
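Because generation runs on the device's GPU, an app like this would typically check for WebGPU before loading a model. The sketch below assumes a hypothetical `checkWebGpu` helper and return shape; `navigator.gpu`, `requestAdapter()`, and `adapter.limits` are standard WebGPU APIs. The limits it reports bound how large a GPU buffer the device allows; they are not a measure of speed.

```typescript
// Hypothetical helper: checks whether this browser can run a WebGPU-backed model.
type GpuSupport =
  | { supported: false; reason: string }
  | { supported: true; maxBufferSize: number; maxStorageBufferBindingSize: number };

async function checkWebGpu(): Promise<GpuSupport> {
  // navigator.gpu exists only where WebGPU is implemented, and only in secure contexts
  // (HTTPS or localhost). `any` avoids depending on the @webgpu/types typings package.
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) {
    return { supported: false, reason: "WebGPU is not available in this environment" };
  }
  // requestAdapter() resolves to null when no suitable GPU adapter is available.
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "No WebGPU adapter was found" };
  }
  // Adapter limits cap buffer sizes (e.g. for model weights); they say nothing about speed.
  return {
    supported: true,
    maxBufferSize: adapter.limits.maxBufferSize,
    maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
  };
}
```

A result with `supported: false` could, for example, be used to show the `reason` to the user before any model download starts.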
6. How to Test
- Load Data: Click "Load Wikipedia Demo" or drag & drop your own PDF/text files.
- Go Offline (optional): Disconnect from the internet to verify that the app works entirely offline.
- Ask Questions: Try direct questions about your data, such as "What is [Topic]?".
- Push Limits: Try asking indirect questions to see where the smaller model might struggle.