MindBridge: Agentic RAG Documentation System

An intelligent documentation and code analysis system that combines vector-based semantic search with graph database relationships to provide comprehensive answers about software tools, libraries, frameworks, and programming languages.

🚀 Features

🔍 Hybrid Search: Combines vector-based semantic search with graph relationship traversal
📊 Repository Analysis: Automated processing of GitHub repositories to extract code structure and documentation
🏷️ Version Management: Handles multiple versions of tools/libraries with version-specific queries
🤖 Contextual RAG: Advanced retrieval-augmented generation with cross-referenced information
🔒 Secure API: JWT-based authentication with comprehensive REST endpoints
⚡ Async Processing: Background job processing for large repository analysis
📈 Observability: Built-in monitoring, logging, and health checks

🏗️ Architecture

graph TB
    Client[Client Applications] --> API[FastAPI Backend]
    API --> Auth[JWT Authentication]
    API --> Queue[Celery Workers]
    Queue --> GitHub[GitHub Integration]
    Queue --> Parser[Code/Doc Parser]
    Parser --> DB[(PostgreSQL + pgvector)]
    Parser --> Cache[(Redis Cache/Queue)]
    API --> Search[Search Service]
    Search --> DB
    Search --> Cache

Technology Stack

Backend: FastAPI with async support
Database: PostgreSQL with pgvector extension for vector operations
Cache/Queue: Redis 7+ for caching and job processing with persistence
Workers: Celery for background task processing
Authentication: JWT-based security
Monitoring: OpenTelemetry with structured logging
Deployment: Docker Compose with Kubernetes readiness

🛠️ Installation

Prerequisites

Python 3.12+
PostgreSQL 15+ with pgvector extension
Redis 7+
Git
Docker & Docker Compose (for containerized setup)

Quick Start with Docker

# Clone the repository
git clone https://github.com/nhlongnguyen/mindbridge.git
cd mindbridge

# Start all services using the automated setup script
./scripts/docker-setup.sh

# Or manually start services
docker-compose --env-file .env.docker up -d

# Check service health
curl http://localhost:8000/health

Complete Docker Setup

The project includes a comprehensive Docker setup with production-ready configurations:

🚀 Automated Setup Script

# One-command setup (recommended)
./scripts/docker-setup.sh

This script will:

Build all Docker images
Start PostgreSQL and Redis services
Run database migrations with all enhancements
Start the full application stack
Verify all services are healthy

🐳 Manual Docker Setup

# 1. Build and start database services
docker-compose --env-file .env.docker up -d postgres redis

# 2. Wait for services to be healthy
docker-compose --env-file .env.docker logs postgres
docker-compose --env-file .env.docker logs redis

# 3. Run database migrations (includes pgvector, HNSW indexes, constraints)
docker-compose --env-file .env.docker run --rm -e PYTHONPATH=/app/src app alembic upgrade head

# 4. Start the full application stack
docker-compose --env-file .env.docker up -d

# 5. Verify all services
docker-compose --env-file .env.docker ps
curl http://localhost:8000/health

📊 Docker Services

Service	Port	Description
API Server	8000	FastAPI application with health checks
PostgreSQL	5432	Database with pgvector extension
Redis	6379	Cache and job queue
Redis Insight	8001	Redis management interface
Celery Worker	-	Background job processing
Celery Beat	-	Scheduled task management

🗄️ Enhanced Database Features

The Docker setup includes all performance and data integrity enhancements:

✅ PostgreSQL with pgvector - Vector similarity search ✅ HNSW Indexing - 100-1000x faster vector queries ✅ Composite Indexes - Query optimization ✅ CHECK Constraints - Database-level validation ✅ Cascade Deletes - Data integrity ✅ Enum Types - Type safety

🔧 Environment Configuration

The Docker setup uses .env.docker for configuration:

# Database Configuration
POSTGRES_DB=mindbridge
POSTGRES_USER=mindbridge
POSTGRES_PASSWORD=docker-dev-password-123

# Redis Configuration
REDIS_PASSWORD=docker-redis-password-123

# Application Configuration
JWT_SECRET_KEY=docker-development-secret-key-256-bits-long
ALLOWED_ORIGINS=http://localhost:3000,http://localhost:8080,http://localhost:8000

🔍 Docker Management Commands

# View service status
docker-compose --env-file .env.docker ps

# View service logs
docker-compose --env-file .env.docker logs -f app
docker-compose --env-file .env.docker logs -f postgres
docker-compose --env-file .env.docker logs -f redis

# Stop all services
docker-compose --env-file .env.docker down

# Reset database (removes all data)
docker-compose --env-file .env.docker down -v

# Run migrations manually
docker-compose --env-file .env.docker run --rm app alembic upgrade head

# Access database directly
docker-compose --env-file .env.docker exec postgres psql -U mindbridge -d mindbridge

# Test vector operations
docker-compose --env-file .env.docker exec postgres psql -U mindbridge -d mindbridge -c "
SELECT COUNT(*) as total_tables FROM information_schema.tables WHERE table_schema = 'public';
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'vector_documents';
"

🧪 Docker Testing & Validation

# Validate vector operations
docker-compose --env-file .env.docker exec postgres psql -U mindbridge -d mindbridge -c "
INSERT INTO vector_documents (content, title, embedding)
VALUES ('Docker test content', 'Docker Test', ARRAY(SELECT random() FROM generate_series(1, 1536))::vector);

SELECT id, title, vector_dims(embedding) as dimensions
FROM vector_documents WHERE title = 'Docker Test';
"

# Test API endpoints
curl http://localhost:8000/health
curl http://localhost:8000/docs
curl http://localhost:8000/metrics

# Check Redis connectivity
docker-compose --env-file .env.docker exec redis redis-cli -a docker-redis-password-123 ping

Development Setup

# Install Python dependencies (Poetry is already available)
python3 -m poetry install

# Install pre-commit hooks
python3 -m poetry run pre-commit install

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration (ensure ALLOWED_ORIGINS is set)

# Start local development services (PostgreSQL + Redis)
docker-compose -f docker-compose.dev.yml up -d

# Initialize database
python3 -m poetry run alembic upgrade head

# Start development server
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run uvicorn src.mindbridge.main:app --reload --host 0.0.0.0 --port 8000

# Start worker process (in separate terminal)
python3 -m poetry run celery -A src.worker.celery_app worker --loglevel=info

Development Commands Quick Reference

Code Quality & Linting:

# Run all pre-commit hooks (recommended)
python3 -m poetry run pre-commit run --all-files

# Individual commands (if needed)
python3 -m poetry run ruff check src/ tests/ --fix
python3 -m poetry run black src/ tests/
python3 -m poetry run mypy src/

Testing:

# Run all tests with verbose output (with required environment variable)
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v

# Run tests with coverage report
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ --cov=src --cov-report=html

# Run specific test file
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/database/test_models.py -v

# Quality gate: Run tests with minimum coverage
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ --cov=src --cov-fail-under=85

# Start development services for integration testing
docker-compose -f docker-compose.dev.yml up -d

Application Management:

# Start development server
python3 -m poetry run uvicorn src.mindbridge.main:app --reload

# Run database migrations
python3 -m poetry run alembic upgrade head

# Generate new migration
python3 -m poetry run alembic revision --autogenerate -m "Description"

# Start Celery worker
python3 -m poetry run celery -A src.worker.celery_app worker --loglevel=info

Dependency Management:

# Add new dependency
python3 -m poetry add package-name

# Add development dependency
python3 -m poetry add --group dev package-name

# Update all dependencies
python3 -m poetry update

# Show dependency tree
python3 -m poetry show --tree

🔧 Configuration

Environment Variables

# Database Configuration
DATABASE_URL=postgresql://user:password@localhost:5432/mindbridge
REDIS_URL=redis://localhost:6379/0

# Redis Configuration
REDIS_MAX_CONNECTIONS=10
REDIS_CONNECTION_TIMEOUT=5
REDIS_SOCKET_TIMEOUT=2

# Celery Configuration
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/1
CELERY_VISIBILITY_TIMEOUT=3600
CELERY_WORKER_PREFETCH_MULTIPLIER=4
CELERY_TASK_SOFT_TIME_LIMIT=300
CELERY_TASK_TIME_LIMIT=600

# Security
JWT_SECRET_KEY=your-super-secret-jwt-key
JWT_ALGORITHM=HS256
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=30

# GitHub Integration
GITHUB_TOKEN=your-github-personal-access-token

# Application Settings
LOG_LEVEL=INFO
ENVIRONMENT=development
DEBUG=false

# Embedding Configuration
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_DIMENSION=384
BATCH_SIZE=32

# Search Configuration
DEFAULT_SEARCH_LIMIT=10
MAX_SEARCH_LIMIT=100
SIMILARITY_THRESHOLD=0.7

Database Setup

# Install PostgreSQL and pgvector extension
# Ubuntu/Debian:
sudo apt-get install postgresql-15 postgresql-15-pgvector

# macOS with Homebrew:
brew install postgresql pgvector

# Enable pgvector extension in your database
psql -d mindbridge -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Run database migrations
python3 -m poetry run alembic upgrade head

📚 API Documentation

Once the server is running, access the interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
OpenAPI JSON: http://localhost:8000/openapi.json

Authentication

# Get JWT token
curl -X POST "http://localhost:8000/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin"}'

# Use token in subsequent requests
curl -X GET "http://localhost:8000/repositories" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Core Endpoints

Repository Management

# Analyze a GitHub repository
curl -X POST "http://localhost:8000/repositories" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/user/repo"}'

# Check analysis progress
curl -X GET "http://localhost:8000/jobs/{job_id}" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

# List analyzed repositories
curl -X GET "http://localhost:8000/repositories" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Search Operations

# Semantic search across documentation
curl -X POST "http://localhost:8000/search" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how to implement authentication",
    "repository_id": "optional-repo-filter",
    "limit": 10
  }'

# Search within specific repository
curl -X POST "http://localhost:8000/search" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "database connection",
    "repository_id": "uuid-of-repository",
    "limit": 5
  }'

🧪 Testing

The project uses pytest with comprehensive test coverage requirements (85%+). All tests require the ALLOWED_ORIGINS environment variable to be set.

Running Tests

# Run all tests (with required environment variable)
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v

# Run with coverage report
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ --cov=src --cov-report=html

# Run specific test categories
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/database/ -v      # Database tests

ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -k "integration" -v  # Integration tests only

# Run tests with detailed output
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v --tb=short

# Run specific test file
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/database/test_models.py -v

# Run tests matching pattern
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -k "test_vector" -v

Local Development Services

For integration tests that require database and Redis connections:

# Start PostgreSQL and Redis services for testing
docker-compose -f docker-compose.dev.yml up -d

# Check service health
docker-compose -f docker-compose.dev.yml ps

# View service logs
docker-compose -f docker-compose.dev.yml logs postgres
docker-compose -f docker-compose.dev.yml logs redis

# Stop services when done
docker-compose -f docker-compose.dev.yml down

Test Status

✅ All 167 tests passing ✅ 70% test coverage achieved ✅ CI/CD pipeline fixed and operational

Test Structure

Each test module follows the mandatory pattern:

Expected Use Case: Normal operation with valid inputs
Edge Case: Boundary conditions, empty inputs, maximum limits
Failure Case: Invalid inputs, error conditions, exception handling

Example:

class TestDocumentProcessor:
    def test_process_document_success(self):
        """Expected use case: Process valid document successfully"""
        # Test implementation

    def test_process_empty_document(self):
        """Edge case: Handle empty document"""
        # Test implementation

    def test_process_document_embedding_failure(self):
        """Failure case: Handle embedding service failure"""
        # Test implementation

🚀 Deployment

Production with Docker Compose

# Production deployment
docker-compose -f docker-compose.prod.yml up -d

# Health check
curl http://your-domain.com/health

# View logs
docker-compose logs -f api
docker-compose logs -f worker

Kubernetes Deployment

# Apply Kubernetes manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secrets.yaml
kubectl apply -f k8s/postgresql.yaml
kubectl apply -f k8s/redis.yaml
kubectl apply -f k8s/api.yaml
kubectl apply -f k8s/worker.yaml
kubectl apply -f k8s/ingress.yaml

# Check deployment status
kubectl get pods -n mindbridge
kubectl get services -n mindbridge

Environment-Specific Configurations

Development: docker-compose.yml
Staging: docker-compose.staging.yml
Production: docker-compose.prod.yml

📊 Monitoring and Observability

Health Checks

# Application health
curl http://localhost:8000/health

# Database connectivity
curl http://localhost:8000/health/db

# Redis connectivity
curl http://localhost:8000/health/redis

# Detailed system metrics
curl http://localhost:8000/metrics

Logging

The application uses structured logging with OpenTelemetry integration:

import structlog

logger = structlog.get_logger(__name__)

# Example usage
logger.info(
    "Repository analysis started",
    repository_id=repo_id,
    github_url=url,
    user_id=user.id
)

Metrics and Tracing

Metrics: Prometheus-compatible metrics available at /metrics
Tracing: OpenTelemetry traces for request tracking
Dashboards: Grafana dashboards for visualization

🤝 Contributing

We follow strict development standards outlined in CLAUDE.md.

Development Workflow

Fetch Task: Get assigned task from GitHub Projects
Understand Requirements: Analyze issue thoroughly
Create Implementation Plan: Detail technical approach in issue comments
Wait for Approval: Get plan reviewed before coding
TDD Development: Write tests first, implement second
Quality Gates: Pass all tests, linting, and coverage requirements
Update Status: Keep GitHub issue status current

Code Quality Requirements

Test Coverage: Minimum 85%
Linting: Ruff, Black, isort compliance
Type Checking: mypy strict mode
Documentation: Google-style docstrings for all public APIs

# Run quality checks
python3 -m poetry run pytest tests/ --cov=src --cov-fail-under=85
python3 -m poetry run pre-commit run --all-files

Pre-commit Hooks

# Install pre-commit hooks
python3 -m poetry run pre-commit install

# Run hooks manually
python3 -m poetry run pre-commit run --all-files

📈 Performance

Benchmarks

Repository Analysis: ~2-5 minutes for typical Python repository (1000+ files)
Search Latency: <100ms for vector similarity search
Throughput: 100+ concurrent search requests
Storage: ~10MB per 1000 documents (with embeddings)

Optimization Tips

Batch Processing: Use appropriate batch sizes for embedding generation
Database Indexing: Ensure proper indexes on frequently queried columns
Caching: Leverage Redis for frequently accessed data
Connection Pooling: Configure appropriate database connection pools

🔒 Security

Security Measures

Authentication: JWT-based with configurable expiration
Authorization: Role-based access control (RBAC)
Input Validation: Comprehensive request validation
Rate Limiting: Protection against abuse
HTTPS: TLS encryption for all communications
Secrets Management: Environment-based configuration

Security Auditing

# Run security scan
python3 -m poetry run bandit -r src/

# Check for vulnerabilities (if safety is installed)
python3 -m poetry run safety check

# Audit dependencies
python3 -m poetry audit

📋 Roadmap

Current Version (v1.0 - MVP)

✅ PostgreSQL + pgvector for vector operations
✅ Python repository analysis
✅ Markdown documentation processing
✅ Semantic search functionality
✅ JWT authentication
✅ Background job processing

Future Versions

v2.0 - Graph Database Integration

Neo4j integration for code relationships
Hybrid search (vector + graph traversal)
Advanced dependency analysis
Cross-reference linking

v3.0 - Multi-Language Support

JavaScript/TypeScript analysis
Java code processing
Go language support
Generic language framework

v4.0 - Advanced Agentic Features

Query planning and optimization
Tool usage and function calling
Learning from user feedback
Context-aware recommendations

v5.0 - Enterprise Features

Multi-tenant architecture
Advanced analytics and reporting
Custom embedding models
Enterprise SSO integration

🆘 Troubleshooting

Common Issues

Database Connection Issues

# Check PostgreSQL service
sudo systemctl status postgresql

# Test database connectivity
psql -h localhost -U username -d mindbridge -c "SELECT version();"

# Check pgvector extension
psql -d mindbridge -c "SELECT * FROM pg_extension WHERE extname = 'vector';"

Redis Connection Issues

# Check Redis service
redis-cli ping

# Monitor Redis activity
redis-cli monitor

# Check Redis memory usage
redis-cli info memory

# Check Redis configuration
redis-cli config get "*"

# Monitor Redis slow queries
redis-cli slowlog get 10

# Check connected clients
redis-cli client list

Cache and Queue Issues

# Test cache operations
python3 -c "
import asyncio
from mindbridge.cache.redis_cache import get_redis_cache

async def test_cache():
    cache = get_redis_cache()
    await cache.connect()
    print('Cache connected:', await cache.ping())
    await cache.set('test', 'value')
    print('Cache get:', await cache.get('test'))
    await cache.disconnect()

asyncio.run(test_cache())
"

# Check Celery broker connection
python3 -c "
from mindbridge.jobs.celery_config import get_celery_config
from mindbridge.jobs.celery_app import check_broker_connection
import asyncio

async def test_broker():
    config = get_celery_config()
    connected = await check_broker_connection(config)
    print('Broker connected:', connected)

asyncio.run(test_broker())
"

# Monitor Celery queue lengths
redis-cli -h localhost -p 6379 -n 0 llen celery

# Clear Redis cache (use with caution)
redis-cli flushdb

Embedding Generation Slow

# Check available system resources
htop

# Monitor GPU usage (if available)
nvidia-smi

# Adjust batch size in configuration
export BATCH_SIZE=16  # Reduce if memory constrained

Job Processing Issues

# Check Celery worker status
python3 -m poetry run celery -A src.worker.celery_app inspect active

# Monitor job queue
python3 -m poetry run celery -A src.worker.celery_app inspect reserved

# Restart workers
docker-compose restart worker

Test Environment Issues

# Issue: Tests fail with "ALLOWED_ORIGINS environment variable must be set"
# Solution: Always run tests with the environment variable
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v

# Issue: Health check tests fail
# Solution: Start local development services
docker-compose -f docker-compose.dev.yml up -d

# Issue: AsyncMock import errors in tests
# Solution: Ensure proper async mocking in test files
from unittest.mock import AsyncMock

# Issue: CI tests failing but local tests pass
# Solution: Check CI has ALLOWED_ORIGINS environment variable in workflow

Getting Help

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Project Wiki

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

sentence-transformers for embedding generation
pgvector for PostgreSQL vector operations
FastAPI for the excellent web framework
Celery for distributed task processing

Built with ❤️ for the developer community

For detailed development guidelines, see CLAUDE.md

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.github		.github
config		config
migrations		migrations
scripts		scripts
src/mindbridge		src/mindbridge
tests		tests
.bandit		.bandit
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.safety-policy.json		.safety-policy.json
.tool-versions		.tool-versions
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
alembic.ini		alembic.ini
docker-compose.dev.yml		docker-compose.dev.yml
docker-compose.yml		docker-compose.yml
poetry.toml		poetry.toml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini

nhlongnguyen/mindbridge

Folders and files

Latest commit

History

Repository files navigation