Knowledge Copilot is a Retrieval-Augmented Generation (RAG) application for companies, with LDAP-based multi-tenant isolation. It lets organizations query their private document libraries securely and returns accurate answers with transparent quality scoring, source citations, and traceability.
Status: Alpha Release. This project is in active development.
- Intelligent, Customizable RAG Pipeline: Streaming ingestion of PDF, DOCX, MD, and TXT files, plus a customizable query engine.
- Multi-Tenant & Cross-Department RAG Isolation: Strict LDAP-based multi-tenancy ensures document-level access control, while enabling controlled cross-department knowledge sharing through group-scoped retrieval and fine-grained permission filters at query time.
- Dense Vector Search with pgvector: High-performance semantic retrieval powered by PostgreSQL.
- Sparse Full-Text Search with Elasticsearch: BM25-based keyword retrieval that complements dense search on exact matches and rare tokens.
- Citation & Quality Scoring: Every AI response includes direct citations to source chunks, complete with Similarity Scores (0.0-1.0) and color-coded quality levels.
- Enterprise-Ready Infrastructure: Built-in support for LDAP authentication, S3-compatible storage (SeaweedFS), and Kafka-driven asynchronous processing.
- Modern AI Stack: Java 25 & Spring AI. Tested with local LLMs via Ollama.
- Premium UI: A sleek, dark-themed React dashboard with glassmorphism and smooth animations.
The file ingestion pipeline is designed around a strict separation between synchronous durability and asynchronous processing. When a user uploads a document, the system handles only the critical operations synchronously: validating the request, uploading the binary to an S3-compatible object storage, and persisting the file metadata. Crucially, an outbox event is created within the same transaction. This guarantees atomicity across storage and metadata, eliminating the risk of orphaned files or missing processing events.
Instead of directly publishing to Kafka, the system uses the transactional outbox pattern. All events are first written into an outbox table, and a scheduler periodically scans and publishes them. This design avoids the classic dual-write problem and ensures at-least-once delivery semantics. The scheduler updates event states based on outcomes, marking them as SENT or FAILED, and integrates with retry and dead-letter queue (DLQ) mechanisms. This makes the ingestion pipeline resilient under transient failures and observable under permanent ones.
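The outbox flow can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's code: `OutboxRelay`, `OutboxEvent`, the `PENDING` state, and the `maxAttempts` limit are hypothetical names, a list stands in for the outbox table, and a `Predicate` stands in for the Kafka producer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for the outbox table and its scheduler.
final class OutboxRelay {
    enum Status { PENDING, SENT, FAILED }

    // One row of the (assumed) outbox table.
    static final class OutboxEvent {
        final String payload;
        Status status = Status.PENDING;
        int attempts = 0;
        OutboxEvent(String payload) { this.payload = payload; }
    }

    private final List<OutboxEvent> table = new ArrayList<>();
    private final int maxAttempts;

    OutboxRelay(int maxAttempts) { this.maxAttempts = maxAttempts; }

    // In a real system this insert would share a transaction with the metadata write.
    OutboxEvent enqueue(String payload) {
        OutboxEvent event = new OutboxEvent(payload);
        table.add(event);
        return event;
    }

    // One scheduler tick: try to publish every PENDING event.
    // The publisher returns true on success; it stands in for a Kafka send.
    void publishPending(Predicate<String> publisher) {
        for (OutboxEvent event : table) {
            if (event.status != Status.PENDING) continue;
            event.attempts++;
            boolean ok;
            try {
                ok = publisher.test(event.payload);
            } catch (RuntimeException e) {
                ok = false;
            }
            if (ok) {
                event.status = Status.SENT;
            } else if (event.attempts >= maxAttempts) {
                event.status = Status.FAILED; // candidate for the DLQ
            }
            // Otherwise the event stays PENDING and is retried on the next tick.
        }
    }
}
```

Because a crash between a successful send and the status update would cause the event to be published again, this style of relay gives at-least-once delivery, which is why downstream consumers need to be idempotent.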
Once the document-uploaded event is consumed, the system transitions into asynchronous processing. The document is fetched from object storage and parsed using Apache Tika, which extracts raw text regardless of file format. The extracted content is then split into semantically meaningful chunks using a token-aware splitter. This step is tuned to balance context preservation and embedding efficiency, avoiding both overly large chunks that degrade embedding quality and overly small fragments that lose semantic coherence.
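To make the chunk-size trade-off concrete, here is a simplified sketch of a sliding-window splitter. It is an assumption-laden illustration: whitespace-delimited words stand in for model tokens, and `ChunkSplitter`, `maxTokens`, and `overlap` are hypothetical names rather than the splitter the project uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkSplitter {
    // Splits text into chunks of at most maxTokens whitespace-delimited tokens,
    // repeating the last `overlap` tokens of each chunk at the start of the next.
    static List<String> split(String text, int maxTokens, int overlap) {
        if (maxTokens <= 0 || overlap < 0 || overlap >= maxTokens) {
            throw new IllegalArgumentException("require 0 <= overlap < maxTokens");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return chunks;

        String[] tokens = trimmed.split("\\s+");
        int step = maxTokens - overlap; // how far the window advances each time
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + maxTokens, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break; // last window reached the end of the text
        }
        return chunks;
    }
}
```

For example, `ChunkSplitter.split("a b c d e f g", 3, 1)` yields the three chunks `a b c`, `c d e`, and `e f g`. The overlap keeps some shared context between neighboring chunks at the cost of embedding a few tokens twice.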
Each generated chunk triggers further processing. Metadata such as LDAP groups, tags, and document-level attributes is resolved and attached to the chunk. Embeddings are then generated and stored in a vector database, while metadata and processing states are persisted in a relational database. The system is explicitly designed to be idempotent at this stage, meaning duplicate event consumption or retries do not corrupt state or produce inconsistent results. File and chunk statuses are updated incrementally, allowing partial failures to be handled gracefully without compromising the entire ingestion flow.
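One common way to get this kind of idempotency is to key every write by a stable chunk identifier, so a redelivered event finds the work already done. The sketch below assumes that approach; `IdempotentChunkProcessor`, the chunk id format, and the in-memory map standing in for the vector store are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class IdempotentChunkProcessor {
    // chunkId -> embedding; stands in for a vector store keyed by a stable chunk id.
    private final Map<String, float[]> store = new ConcurrentHashMap<>();
    private final Function<String, float[]> embedder;

    IdempotentChunkProcessor(Function<String, float[]> embedder) {
        this.embedder = embedder;
    }

    // Returns true if this call did the work, false if the chunk was already processed.
    boolean process(String chunkId, String text) {
        boolean[] created = {false};
        store.computeIfAbsent(chunkId, id -> {
            created[0] = true;
            return embedder.apply(text); // only runs the first time this id is seen
        });
        return created[0];
    }

    int storedChunks() {
        return store.size();
    }
}
```

Processing the same chunk id twice leaves a single stored embedding and calls the embedder only once, which is the property that makes retries and duplicate deliveries safe.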
Overall, the ingestion architecture provides strong guarantees: consistency through transactional boundaries, scalability via asynchronous Kafka-based processing, and fault tolerance through retries and DLQ handling. This ensures the system can handle high-throughput ingestion workloads while maintaining correctness and recoverability.
The query engine is built as a modular pipeline. Incoming queries first pass through protective layers such as bulkhead isolation and rate limiting, ensuring that resource-intensive operations like LLM inference do not degrade overall system stability. From there, the query enters a structured pipeline composed of sequential and parallel steps, enabling flexible and extensible execution strategies.
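As an illustration of bulkhead isolation, the sketch below caps the number of concurrent calls with a semaphore and rejects work when no slot is free. It is a minimal example under assumptions: the `Bulkhead` class is hypothetical, and the project may well rely on a resilience library instead.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the task if a slot is free; returns empty without running it when saturated.
    <T> Optional<T> tryRun(Supplier<T> task) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // fail fast instead of queueing behind slow LLM calls
        }
        try {
            return Optional.ofNullable(task.get());
        } finally {
            permits.release(); // always free the slot, even if the task throws
        }
    }

    int availableSlots() {
        return permits.availablePermits();
    }
}
```

Rejecting immediately rather than blocking keeps a burst of expensive queries from tying up every request thread.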
The first stage is query enrichment, where the system enhances the original user query using LLM-assisted transformations. This may include normalization, semantic expansion, or intent extraction. The goal is to produce a richer representation of the query that improves downstream retrieval quality.
Retrieval is performed using a hybrid, multi-strategy approach executed in parallel. Semantic retrieval operates over vector embeddings to capture conceptual similarity, while keyword-based retrieval extracts important terms for precise filtering. In parallel, a BM25-based search engine performs sparse retrieval, which excels at exact matches and rare tokens. Additionally, tag-based retrieval introduces structured domain signals derived from metadata. These strategies complement each other, mitigating individual weaknesses and producing a more comprehensive candidate set.
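The sketch below shows one way several retrievers could run in parallel and have their ranked results merged. The text does not say how the signals are combined, so Reciprocal Rank Fusion (RRF) is used here purely as an assumed stand-in; `HybridRetriever` and the constant `k` are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class HybridRetriever {
    // Runs every retriever concurrently, then merges the ranked chunk-id lists with RRF.
    static List<String> retrieve(String query,
                                 List<Function<String, List<String>>> retrievers,
                                 int k) {
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (Function<String, List<String>> retriever : retrievers) {
            futures.add(CompletableFuture.supplyAsync(() -> retriever.apply(query)));
        }

        Map<String, Double> scores = new HashMap<>();
        for (CompletableFuture<List<String>> future : futures) {
            List<String> ranked = future.join();
            for (int i = 0; i < ranked.size(); i++) {
                // RRF: each list contributes 1 / (k + rank), with rank starting at 1.
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }

        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b); // deterministic tie-break
        });
        return fused;
    }
}
```

A chunk that appears in more than one ranked list accumulates score from each, so it tends to rise above chunks found by a single strategy.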
The system then performs chunk retrieval using the combined signals. This is not a simple vector similarity lookup; instead, it applies complex filtering based on LDAP group permissions, tags, and keywords. As a result, retrieval is both relevance-aware and access-controlled. This ensures that users only see content they are authorized to access while still benefiting from high-quality semantic matching.
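The access-control part of this step can be pictured as a filter over candidate chunks. This is a simplified in-memory sketch: `AccessFilter`, `Chunk`, and its fields are hypothetical, and a real deployment would more likely push these conditions into the database query than filter in application code.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class AccessFilter {
    // Hypothetical chunk shape: an id plus the LDAP groups and tags attached at ingestion.
    record Chunk(String id, Set<String> allowedGroups, Set<String> tags) {}

    // Keeps a chunk only if the user shares at least one LDAP group with it and,
    // when tags are requested, the chunk carries at least one of them.
    static List<Chunk> filter(List<Chunk> candidates,
                              Set<String> userGroups,
                              Set<String> requiredTags) {
        return candidates.stream()
                .filter(c -> c.allowedGroups().stream().anyMatch(userGroups::contains))
                .filter(c -> requiredTags.isEmpty()
                        || c.tags().stream().anyMatch(requiredTags::contains))
                .collect(Collectors.toList());
    }
}
```

Applying the group check before any relevance scoring means unauthorized chunks never reach the reranker or the prompt.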
To further improve result quality, the retrieved chunks are passed through a reranking stage. A dedicated reranker microservice, powered by a cross-encoder model, evaluates each chunk in the context of the query and assigns a refined relevance score. This step is critical because raw vector similarity does not always correlate with true relevance. By reordering results based on deeper semantic understanding, the system significantly improves answer accuracy and reduces hallucination risk.
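In outline, reranking scores each (query, chunk) pair and keeps the best results. The sketch below assumes a pluggable scoring function in place of the cross-encoder microservice; `Reranker`, `Scored`, and `topK` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class Reranker {
    record Scored(String chunk, double score) {}

    // Scores every chunk against the query and returns the topK highest-scoring ones.
    // The scorer stands in for a call to a cross-encoder model.
    static List<Scored> rerank(String query,
                               List<String> chunks,
                               ToDoubleBiFunction<String, String> scorer,
                               int topK) {
        List<Scored> scored = new ArrayList<>();
        for (String chunk : chunks) {
            scored.add(new Scored(chunk, scorer.applyAsDouble(query, chunk)));
        }
        scored.sort(Comparator.comparingDouble(Scored::score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(topK, scored.size())));
    }
}
```

Unlike a vector lookup, the scorer sees the query and the chunk together, which is what allows a cross-encoder to judge relevance more precisely than embedding distance alone.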
After reranking, the top-ranked chunks are used to construct the final prompt. The system carefully injects these chunks into the LLM context, including associated metadata such as source references and relevance scores. This produces a clean, high-signal prompt that maximizes the effectiveness of the generation step while minimizing noise.
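A prompt of this kind might be assembled along the following lines. The layout, the instruction wording, and the `PromptBuilder` and `Source` names are assumptions made for illustration; the project's actual template is not shown in this README.

```java
import java.util.List;
import java.util.Locale;

final class PromptBuilder {
    // Hypothetical view of a reranked chunk: where it came from, its score, and its text.
    record Source(String document, double score, String text) {}

    // Numbers each source so the model can cite it as [n], then appends the question.
    static String build(String question, List<Source> sources) {
        StringBuilder prompt = new StringBuilder(
                "Answer using only the numbered sources below and cite them as [n].\n\n");
        for (int i = 0; i < sources.size(); i++) {
            Source s = sources.get(i);
            prompt.append(String.format(Locale.ROOT, "[%d] %s (score %.2f)\n%s\n\n",
                    i + 1, s.document(), s.score(), s.text()));
        }
        return prompt.append("Question: ").append(question).toString();
    }
}
```

Numbering the sources in the prompt gives the generated answer a stable handle for citations that can later be mapped back to the original chunks.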
The final stage is LLM execution, which is optimized through infrastructure-level routing. Requests are directed via a reverse proxy to different backends depending on their nature, such as GPU-backed services for text generation and CPU-bound services for embeddings. This separation prevents resource contention and allows parallel workloads to be handled efficiently.
Responses are streamed back to the client using Server-Sent Events (SSE). The stream includes structured event types such as metadata (for explainability), citations (for source transparency), and generated text. This approach not only improves user experience through real-time feedback but also introduces a high level of transparency, enabling users to understand how answers were derived.
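On the wire, each SSE message is a block of `field: value` lines terminated by a blank line. The helper below sketches that framing; `SseFormatter` is a hypothetical class, and the event names used in the example (`citation`, `text`) are assumed rather than taken from the project's API.

```java
final class SseFormatter {
    // Formats one Server-Sent Event: an "event:" line, one "data:" line per
    // line of payload, and a blank line that terminates the message.
    static String format(String eventType, String data) {
        StringBuilder sb = new StringBuilder();
        sb.append("event: ").append(eventType).append('\n');
        for (String line : data.split("\n", -1)) {
            sb.append("data: ").append(line).append('\n');
        }
        return sb.append('\n').toString();
    }
}
```

For instance, `SseFormatter.format("text", "a\nb")` produces an `event: text` line followed by `data: a` and `data: b`, so a multi-line payload is reassembled by the client into a single event.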
Overall, the query engine represents a sophisticated retrieval and reasoning system. Its strengths lie in hybrid retrieval, modular pipeline composition, explainability, and production-grade resilience. Rather than a basic RAG implementation, it functions as an extensible knowledge orchestration platform capable of adapting to complex enterprise requirements.
Beyond the client, the stream traces can be inspected to detect bottlenecks in each pipeline phase and to optimize the resources the RAG pipeline uses.
- Backend: Java 25, Spring Boot 4, Spring AI, Hibernate, Flyway, Zipkin.
- Frontend: React 19, TypeScript, Vite, Framer Motion, Tailwind CSS (Vanilla CSS modules).
- Data & AI: PostgreSQL, Redis Stack, Ollama.
- Messaging: Apache Kafka, ZooKeeper.
- Storage: SeaweedFS.
- Clone the repository:

      git clone https://github.com/yourusername/knowledge-copilot.git
      cd knowledge-copilot

- Start the full stack (infrastructure, backend, and frontend):

      docker-compose --profile backend --profile frontend -f docker/docker-compose.yaml up
- Clone the repository:

      git clone https://github.com/yourusername/knowledge-copilot.git
      cd knowledge-copilot

- Start the infrastructure:

      docker-compose -f docker/docker-compose.yaml up -d

- Run the Backend:

      ./mvnw spring-boot:run

- Run the Frontend:

      cd client
      npm install
      npm run dev
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Developed by Andres - Powering Private Knowledge with AI.










