Skip to content

nhlongnguyen/mindbridge

Repository files navigation

MindBridge: Agentic RAG Documentation System

CI/CD Pipeline Code Coverage Python Version License

An intelligent documentation and code analysis system that combines vector-based semantic search with graph database relationships to provide comprehensive answers about software tools, libraries, frameworks, and programming languages.

πŸš€ Features

  • πŸ” Hybrid Search: Combines vector-based semantic search with graph relationship traversal
  • πŸ“Š Repository Analysis: Automated processing of GitHub repositories to extract code structure and documentation
  • 🏷️ Version Management: Handles multiple versions of tools/libraries with version-specific queries
  • πŸ€– Contextual RAG: Advanced retrieval-augmented generation with cross-referenced information
  • πŸ”’ Secure API: JWT-based authentication with comprehensive REST endpoints
  • ⚑ Async Processing: Background job processing for large repository analysis
  • πŸ“ˆ Observability: Built-in monitoring, logging, and health checks

πŸ—οΈ Architecture

graph TB
    Client[Client Applications] --> API[FastAPI Backend]
    API --> Auth[JWT Authentication]
    API --> Queue[Celery Workers]
    Queue --> GitHub[GitHub Integration]
    Queue --> Parser[Code/Doc Parser]
    Parser --> DB[(PostgreSQL + pgvector)]
    Parser --> Cache[(Redis Cache/Queue)]
    API --> Search[Search Service]
    Search --> DB
    Search --> Cache
Loading

Technology Stack

  • Backend: FastAPI with async support
  • Database: PostgreSQL with pgvector extension for vector operations
  • Cache/Queue: Redis 7+ for caching and job processing with persistence
  • Workers: Celery for background task processing
  • Authentication: JWT-based security
  • Monitoring: OpenTelemetry with structured logging
  • Deployment: Docker Compose with Kubernetes readiness

πŸ› οΈ Installation

Prerequisites

  • Python 3.12+
  • PostgreSQL 15+ with pgvector extension
  • Redis 7+
  • Git
  • Docker & Docker Compose (for containerized setup)

Quick Start with Docker

# Clone the repository
git clone https://github.com/nhlongnguyen/mindbridge.git
cd mindbridge

# Start all services using the automated setup script
./scripts/docker-setup.sh

# Or manually start services
docker-compose --env-file .env.docker up -d

# Check service health
curl http://localhost:8000/health

Complete Docker Setup

The project includes a comprehensive Docker setup with production-ready configurations:

πŸš€ Automated Setup Script

# One-command setup (recommended)
./scripts/docker-setup.sh

This script will:

  • Build all Docker images
  • Start PostgreSQL and Redis services
  • Run database migrations with all enhancements
  • Start the full application stack
  • Verify all services are healthy

🐳 Manual Docker Setup

# 1. Build and start database services
docker-compose --env-file .env.docker up -d postgres redis

# 2. Wait for services to be healthy
docker-compose --env-file .env.docker logs postgres
docker-compose --env-file .env.docker logs redis

# 3. Run database migrations (includes pgvector, HNSW indexes, constraints)
docker-compose --env-file .env.docker run --rm -e PYTHONPATH=/app/src app alembic upgrade head

# 4. Start the full application stack
docker-compose --env-file .env.docker up -d

# 5. Verify all services
docker-compose --env-file .env.docker ps
curl http://localhost:8000/health

πŸ“Š Docker Services

Service Port Description
API Server 8000 FastAPI application with health checks
PostgreSQL 5432 Database with pgvector extension
Redis 6379 Cache and job queue
Redis Insight 8001 Redis management interface
Celery Worker - Background job processing
Celery Beat - Scheduled task management

πŸ—„οΈ Enhanced Database Features

The Docker setup includes all performance and data integrity enhancements:

βœ… PostgreSQL with pgvector - Vector similarity search βœ… HNSW Indexing - 100-1000x faster vector queries βœ… Composite Indexes - Query optimization βœ… CHECK Constraints - Database-level validation βœ… Cascade Deletes - Data integrity βœ… Enum Types - Type safety

πŸ”§ Environment Configuration

The Docker setup uses .env.docker for configuration:

# Database Configuration
POSTGRES_DB=mindbridge
POSTGRES_USER=mindbridge
POSTGRES_PASSWORD=docker-dev-password-123

# Redis Configuration
REDIS_PASSWORD=docker-redis-password-123

# Application Configuration
JWT_SECRET_KEY=docker-development-secret-key-256-bits-long
ALLOWED_ORIGINS=http://localhost:3000,http://localhost:8080,http://localhost:8000

πŸ” Docker Management Commands

# View service status
docker-compose --env-file .env.docker ps

# View service logs
docker-compose --env-file .env.docker logs -f app
docker-compose --env-file .env.docker logs -f postgres
docker-compose --env-file .env.docker logs -f redis

# Stop all services
docker-compose --env-file .env.docker down

# Reset database (removes all data)
docker-compose --env-file .env.docker down -v

# Run migrations manually
docker-compose --env-file .env.docker run --rm app alembic upgrade head

# Access database directly
docker-compose --env-file .env.docker exec postgres psql -U mindbridge -d mindbridge

# Test vector operations
docker-compose --env-file .env.docker exec postgres psql -U mindbridge -d mindbridge -c "
SELECT COUNT(*) as total_tables FROM information_schema.tables WHERE table_schema = 'public';
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'vector_documents';
"

πŸ§ͺ Docker Testing & Validation

# Validate vector operations
docker-compose --env-file .env.docker exec postgres psql -U mindbridge -d mindbridge -c "
INSERT INTO vector_documents (content, title, embedding)
VALUES ('Docker test content', 'Docker Test', ARRAY(SELECT random() FROM generate_series(1, 1536))::vector);

SELECT id, title, vector_dims(embedding) as dimensions
FROM vector_documents WHERE title = 'Docker Test';
"

# Test API endpoints
curl http://localhost:8000/health
curl http://localhost:8000/docs
curl http://localhost:8000/metrics

# Check Redis connectivity
docker-compose --env-file .env.docker exec redis redis-cli -a docker-redis-password-123 ping

Development Setup

# Install Python dependencies (Poetry is already available)
python3 -m poetry install

# Install pre-commit hooks
python3 -m poetry run pre-commit install

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration (ensure ALLOWED_ORIGINS is set)

# Start local development services (PostgreSQL + Redis)
docker-compose -f docker-compose.dev.yml up -d

# Initialize database
python3 -m poetry run alembic upgrade head

# Start development server
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run uvicorn src.mindbridge.main:app --reload --host 0.0.0.0 --port 8000

# Start worker process (in separate terminal)
python3 -m poetry run celery -A src.worker.celery_app worker --loglevel=info

Development Commands Quick Reference

Code Quality & Linting:

# Run all pre-commit hooks (recommended)
python3 -m poetry run pre-commit run --all-files

# Individual commands (if needed)
python3 -m poetry run ruff check src/ tests/ --fix
python3 -m poetry run black src/ tests/
python3 -m poetry run mypy src/

Testing:

# Run all tests with verbose output (with required environment variable)
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v

# Run tests with coverage report
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ --cov=src --cov-report=html

# Run specific test file
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/database/test_models.py -v

# Quality gate: Run tests with minimum coverage
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ --cov=src --cov-fail-under=85

# Start development services for integration testing
docker-compose -f docker-compose.dev.yml up -d

Application Management:

# Start development server
python3 -m poetry run uvicorn src.mindbridge.main:app --reload

# Run database migrations
python3 -m poetry run alembic upgrade head

# Generate new migration
python3 -m poetry run alembic revision --autogenerate -m "Description"

# Start Celery worker
python3 -m poetry run celery -A src.worker.celery_app worker --loglevel=info

Dependency Management:

# Add new dependency
python3 -m poetry add package-name

# Add development dependency
python3 -m poetry add --group dev package-name

# Update all dependencies
python3 -m poetry update

# Show dependency tree
python3 -m poetry show --tree

πŸ”§ Configuration

Environment Variables

# Database Configuration
DATABASE_URL=postgresql://user:password@localhost:5432/mindbridge
REDIS_URL=redis://localhost:6379/0

# Redis Configuration
REDIS_MAX_CONNECTIONS=10
REDIS_CONNECTION_TIMEOUT=5
REDIS_SOCKET_TIMEOUT=2

# Celery Configuration
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/1
CELERY_VISIBILITY_TIMEOUT=3600
CELERY_WORKER_PREFETCH_MULTIPLIER=4
CELERY_TASK_SOFT_TIME_LIMIT=300
CELERY_TASK_TIME_LIMIT=600

# Security
JWT_SECRET_KEY=your-super-secret-jwt-key
JWT_ALGORITHM=HS256
JWT_ACCESS_TOKEN_EXPIRE_MINUTES=30

# GitHub Integration
GITHUB_TOKEN=your-github-personal-access-token

# Application Settings
LOG_LEVEL=INFO
ENVIRONMENT=development
DEBUG=false

# Embedding Configuration
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_DIMENSION=384
BATCH_SIZE=32

# Search Configuration
DEFAULT_SEARCH_LIMIT=10
MAX_SEARCH_LIMIT=100
SIMILARITY_THRESHOLD=0.7

Database Setup

# Install PostgreSQL and pgvector extension
# Ubuntu/Debian:
sudo apt-get install postgresql-15 postgresql-15-pgvector

# macOS with Homebrew:
brew install postgresql pgvector

# Enable pgvector extension in your database
psql -d mindbridge -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Run database migrations
python3 -m poetry run alembic upgrade head

πŸ“š API Documentation

Once the server is running, access the interactive API documentation:

Authentication

# Get JWT token
curl -X POST "http://localhost:8000/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin"}'

# Use token in subsequent requests
curl -X GET "http://localhost:8000/repositories" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Core Endpoints

Repository Management

# Analyze a GitHub repository
curl -X POST "http://localhost:8000/repositories" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/user/repo"}'

# Check analysis progress
curl -X GET "http://localhost:8000/jobs/{job_id}" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

# List analyzed repositories
curl -X GET "http://localhost:8000/repositories" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN"

Search Operations

# Semantic search across documentation
curl -X POST "http://localhost:8000/search" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how to implement authentication",
    "repository_id": "optional-repo-filter",
    "limit": 10
  }'

# Search within specific repository
curl -X POST "http://localhost:8000/search" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "database connection",
    "repository_id": "uuid-of-repository",
    "limit": 5
  }'

πŸ§ͺ Testing

The project uses pytest with comprehensive test coverage requirements (85%+). All tests require the ALLOWED_ORIGINS environment variable to be set.

Running Tests

# Run all tests (with required environment variable)
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v

# Run with coverage report
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ --cov=src --cov-report=html

# Run specific test categories
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/database/ -v      # Database tests

ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -k "integration" -v  # Integration tests only

# Run tests with detailed output
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v --tb=short

# Run specific test file
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/database/test_models.py -v

# Run tests matching pattern
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -k "test_vector" -v

Local Development Services

For integration tests that require database and Redis connections:

# Start PostgreSQL and Redis services for testing
docker-compose -f docker-compose.dev.yml up -d

# Check service health
docker-compose -f docker-compose.dev.yml ps

# View service logs
docker-compose -f docker-compose.dev.yml logs postgres
docker-compose -f docker-compose.dev.yml logs redis

# Stop services when done
docker-compose -f docker-compose.dev.yml down

Test Status

βœ… All 167 tests passing βœ… 70% test coverage achieved βœ… CI/CD pipeline fixed and operational

Test Structure

Each test module follows the mandatory pattern:

  • Expected Use Case: Normal operation with valid inputs
  • Edge Case: Boundary conditions, empty inputs, maximum limits
  • Failure Case: Invalid inputs, error conditions, exception handling

Example:

class TestDocumentProcessor:
    def test_process_document_success(self):
        """Expected use case: Process valid document successfully"""
        # Test implementation

    def test_process_empty_document(self):
        """Edge case: Handle empty document"""
        # Test implementation

    def test_process_document_embedding_failure(self):
        """Failure case: Handle embedding service failure"""
        # Test implementation

πŸš€ Deployment

Production with Docker Compose

# Production deployment
docker-compose -f docker-compose.prod.yml up -d

# Health check
curl http://your-domain.com/health

# View logs
docker-compose logs -f api
docker-compose logs -f worker

Kubernetes Deployment

# Apply Kubernetes manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secrets.yaml
kubectl apply -f k8s/postgresql.yaml
kubectl apply -f k8s/redis.yaml
kubectl apply -f k8s/api.yaml
kubectl apply -f k8s/worker.yaml
kubectl apply -f k8s/ingress.yaml

# Check deployment status
kubectl get pods -n mindbridge
kubectl get services -n mindbridge

Environment-Specific Configurations

  • Development: docker-compose.yml
  • Staging: docker-compose.staging.yml
  • Production: docker-compose.prod.yml

πŸ“Š Monitoring and Observability

Health Checks

# Application health
curl http://localhost:8000/health

# Database connectivity
curl http://localhost:8000/health/db

# Redis connectivity
curl http://localhost:8000/health/redis

# Detailed system metrics
curl http://localhost:8000/metrics

Logging

The application uses structured logging with OpenTelemetry integration:

import structlog

logger = structlog.get_logger(__name__)

# Example usage
logger.info(
    "Repository analysis started",
    repository_id=repo_id,
    github_url=url,
    user_id=user.id
)

Metrics and Tracing

  • Metrics: Prometheus-compatible metrics available at /metrics
  • Tracing: OpenTelemetry traces for request tracking
  • Dashboards: Grafana dashboards for visualization

🀝 Contributing

We follow strict development standards outlined in CLAUDE.md.

Development Workflow

  1. Fetch Task: Get assigned task from GitHub Projects
  2. Understand Requirements: Analyze issue thoroughly
  3. Create Implementation Plan: Detail technical approach in issue comments
  4. Wait for Approval: Get plan reviewed before coding
  5. TDD Development: Write tests first, implement second
  6. Quality Gates: Pass all tests, linting, and coverage requirements
  7. Update Status: Keep GitHub issue status current

Code Quality Requirements

  • Test Coverage: Minimum 85%
  • Linting: Ruff, Black, isort compliance
  • Type Checking: mypy strict mode
  • Documentation: Google-style docstrings for all public APIs
# Run quality checks
python3 -m poetry run pytest tests/ --cov=src --cov-fail-under=85
python3 -m poetry run pre-commit run --all-files

Pre-commit Hooks

# Install pre-commit hooks
python3 -m poetry run pre-commit install

# Run hooks manually
python3 -m poetry run pre-commit run --all-files

πŸ“ˆ Performance

Benchmarks

  • Repository Analysis: ~2-5 minutes for typical Python repository (1000+ files)
  • Search Latency: <100ms for vector similarity search
  • Throughput: 100+ concurrent search requests
  • Storage: ~10MB per 1000 documents (with embeddings)

Optimization Tips

  1. Batch Processing: Use appropriate batch sizes for embedding generation
  2. Database Indexing: Ensure proper indexes on frequently queried columns
  3. Caching: Leverage Redis for frequently accessed data
  4. Connection Pooling: Configure appropriate database connection pools

πŸ”’ Security

Security Measures

  • Authentication: JWT-based with configurable expiration
  • Authorization: Role-based access control (RBAC)
  • Input Validation: Comprehensive request validation
  • Rate Limiting: Protection against abuse
  • HTTPS: TLS encryption for all communications
  • Secrets Management: Environment-based configuration

Security Auditing

# Run security scan
python3 -m poetry run bandit -r src/

# Check for vulnerabilities (if safety is installed)
python3 -m poetry run safety check

# Audit dependencies
python3 -m poetry audit

πŸ“‹ Roadmap

Current Version (v1.0 - MVP)

  • βœ… PostgreSQL + pgvector for vector operations
  • βœ… Python repository analysis
  • βœ… Markdown documentation processing
  • βœ… Semantic search functionality
  • βœ… JWT authentication
  • βœ… Background job processing

Future Versions

v2.0 - Graph Database Integration

  • Neo4j integration for code relationships
  • Hybrid search (vector + graph traversal)
  • Advanced dependency analysis
  • Cross-reference linking

v3.0 - Multi-Language Support

  • JavaScript/TypeScript analysis
  • Java code processing
  • Go language support
  • Generic language framework

v4.0 - Advanced Agentic Features

  • Query planning and optimization
  • Tool usage and function calling
  • Learning from user feedback
  • Context-aware recommendations

v5.0 - Enterprise Features

  • Multi-tenant architecture
  • Advanced analytics and reporting
  • Custom embedding models
  • Enterprise SSO integration

πŸ†˜ Troubleshooting

Common Issues

Database Connection Issues

# Check PostgreSQL service
sudo systemctl status postgresql

# Test database connectivity
psql -h localhost -U username -d mindbridge -c "SELECT version();"

# Check pgvector extension
psql -d mindbridge -c "SELECT * FROM pg_extension WHERE extname = 'vector';"

Redis Connection Issues

# Check Redis service
redis-cli ping

# Monitor Redis activity
redis-cli monitor

# Check Redis memory usage
redis-cli info memory

# Check Redis configuration
redis-cli config get "*"

# Monitor Redis slow queries
redis-cli slowlog get 10

# Check connected clients
redis-cli client list

Cache and Queue Issues

# Test cache operations
python3 -c "
import asyncio
from mindbridge.cache.redis_cache import get_redis_cache

async def test_cache():
    cache = get_redis_cache()
    await cache.connect()
    print('Cache connected:', await cache.ping())
    await cache.set('test', 'value')
    print('Cache get:', await cache.get('test'))
    await cache.disconnect()

asyncio.run(test_cache())
"

# Check Celery broker connection
python3 -c "
from mindbridge.jobs.celery_config import get_celery_config
from mindbridge.jobs.celery_app import check_broker_connection
import asyncio

async def test_broker():
    config = get_celery_config()
    connected = await check_broker_connection(config)
    print('Broker connected:', connected)

asyncio.run(test_broker())
"

# Monitor Celery queue lengths
redis-cli -h localhost -p 6379 -n 0 llen celery

# Clear Redis cache (use with caution)
redis-cli flushdb

Embedding Generation Slow

# Check available system resources
htop

# Monitor GPU usage (if available)
nvidia-smi

# Adjust batch size in configuration
export BATCH_SIZE=16  # Reduce if memory constrained

Job Processing Issues

# Check Celery worker status
python3 -m poetry run celery -A src.worker.celery_app inspect active

# Monitor job queue
python3 -m poetry run celery -A src.worker.celery_app inspect reserved

# Restart workers
docker-compose restart worker

Test Environment Issues

# Issue: Tests fail with "ALLOWED_ORIGINS environment variable must be set"
# Solution: Always run tests with the environment variable
ALLOWED_ORIGINS="http://localhost:3000,http://localhost:8080" \
python3 -m poetry run pytest tests/ -v

# Issue: Health check tests fail
# Solution: Start local development services
docker-compose -f docker-compose.dev.yml up -d

# Issue: AsyncMock import errors in tests
# Solution: Ensure proper async mocking in test files
from unittest.mock import AsyncMock

# Issue: CI tests failing but local tests pass
# Solution: Check CI has ALLOWED_ORIGINS environment variable in workflow

Getting Help

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments


Built with ❀️ for the developer community

For detailed development guidelines, see CLAUDE.md

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •