Senior LLM Engineer — Build a Private AI Assistant (RAG, FastAPI, Streamlit, ChromaDB, OpenAI)
Project Overview

I'm looking for a senior-level AI engineer with real experience designing and implementing LLM-powered applications, especially those involving Retrieval-Augmented Generation (RAG), vector databases, multi-prompt agent behavior, and clean, production-grade Python architectures. The goal is to build my internal private AI assistant ("TomGPT") that will run locally and serve as:

• A Tax Planning Advisor
• A Profitability & Business Advisory Assistant
• A Content Creation Assistant for my CPA practice

This project requires someone who understands how to build modular LLM systems, not someone who glues together LangChain tutorials.

________________________________________

What I Need Built

A complete private AI system with:

1. Backend (FastAPI)
• A /chat endpoint that:
  - Loads mode-specific system prompts
  - Performs vector retrieval (Chroma)
  - Constructs messages for the LLM
  - Calls OpenAI models (GPT-4.x / 5.x class)
  - Returns assistant responses

2. Frontend (Streamlit)
• Password-gated access
• Mode selector (Tax / Profit / Content / General)
• Full chat interface with history in session state
• Fast, responsive UI

3. Document Knowledge Base (RAG)
• Document ingestion pipeline:
  - PDF/DOCX text extraction
  - Chunking (configurable)
  - Embedding (OpenAI)
  - Storage in ChromaDB with metadata
• Runtime retrieval:
  - Query embedding
  - Top-k similarity search
  - Automatic context injection

4. Mode-Based Agent Behavior
Load prompts from external files, one per mode (4 modes):
• Tax Planner
• Profitability Coach
• Content Writer
• General Advisor
The backend should orchestrate prompts cleanly, not hard-code them.

5. Security & Config
• Password protection for the UI
• .env for secrets
• No API keys exposed to the frontend
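To illustrate the kind of configurable chunking expected in the ingestion pipeline, here is a minimal character-based sketch. The function name and parameters are hypothetical, not a prescribed design; a production version might chunk by tokens or sentences instead:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks.

    chunk_size and overlap are in characters here for simplicity;
    overlap preserves context that would otherwise be cut at chunk edges.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Larger overlap improves retrieval recall at chunk boundaries but inflates storage and embedding cost; candidates should be able to discuss this tradeoff.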
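As a sketch of the prompt orchestration described in item 4, loading mode prompts from external files and injecting retrieved context into the messages sent to the model. The file layout, names, and message shape below are illustrative assumptions, not requirements:

```python
from pathlib import Path

MODES = ("tax", "profit", "content", "general")

def load_system_prompt(mode: str, prompt_dir: Path) -> str:
    # One external plain-text prompt file per mode, e.g. prompts/tax.txt,
    # so new modes can be added without touching backend code.
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return (prompt_dir / f"{mode}.txt").read_text(encoding="utf-8")

def build_messages(system_prompt: str, retrieved_chunks: list[str],
                   history: list[dict], user_query: str) -> list[dict]:
    # Inject retrieved context into the system message, keep prior turns,
    # then append the new user query; the resulting list is what gets
    # passed to the chat-completions call.
    context = "\n\n".join(retrieved_chunks)
    messages = [{
        "role": "system",
        "content": f"{system_prompt}\n\nRelevant context:\n{context}",
    }]
    messages.extend(history)  # earlier {"role": ..., "content": ...} turns
    messages.append({"role": "user", "content": user_query})
    return messages
```

The point is that the backend composes prompts at request time from external files; nothing mode-specific is hard-coded in the route handlers.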
6. Documentation
A professional-quality README explaining:
• How to run the system
• How to add documents
• How to create new modes
• How to change models
• Optional: how to run everything via Docker

________________________________________

Tech Stack Requirements

Required Experience
You must be strong in:
• Python (senior-level)
• FastAPI (production-quality routes & architecture)
• Streamlit (clean user interfaces)
• OpenAI API (chat + embeddings)
• Vector DBs (Chroma, Pinecone, Qdrant, etc.)
• RAG design patterns:
  - Chunking strategies
  - Embedding management
  - Context window optimization
  - Metadata filters
• Prompt architecture & multi-agent patterns

Strongly Preferred
• Experience with Ollama or other local models
• Docker
• Experience building similar "private GPT" solutions
• Understanding of the tax or financial domain (not required, but helpful)

Not Interested In
• Beginners
• People who only use LangChain without understanding what happens under the hood
• No-code tools (e.g., Bubble, WordPress plugins)
• "Chatbot builders" with no real backend knowledge

If you cannot explain embeddings, chunking, and RAG tradeoffs clearly, please do not apply.

________________________________________

Deliverables
• Fully working FastAPI backend
• Fully working Streamlit frontend
• Ingestion script
• Vector DB setup (Chroma)
• Mode-based prompt system
• Clean, simple project structure (folders provided upon hire)
• Excellent documentation

________________________________________

Budget & Timeline
• Budget: $2,000–$3,500 (fixed price or milestone-based)
• Timeline: 2–3 weeks
I'm willing to pay top rate within the budget for senior talent who can build this cleanly, modularly, and efficiently.

________________________________________

To Apply (Important)

Please include the following in your proposal:
1. A short summary of your experience building LLM/RAG systems.
2. One example of an LLM app you built (no NDAs needed—just describe the architecture & decisions).
3. Confirmation that you are comfortable with:
  - FastAPI
  - Streamlit
  - Chroma or similar
  - RAG design
4. Your estimated timeline and approach to this project.

Shortlisted candidates will be asked one technical question about embeddings and chunking to verify expertise.

Apply to this job.