This repository contains a locally run prototype for AI-assisted coding and planning, backed by a llama.cpp LLM. The latest build replaces the heuristic agent with a multi-step LangGraph pipeline (planner → coder → reviewer) so the model produces higher-quality plans and code snippets.
- `app/agent/engine.py` – LangGraph orchestration that forwards chat history to a llama.cpp HTTP server. It now chains planner, coder, and reviewer nodes and streams each stage (SSE) back to the UI for real-time visibility.
- `app/server.py` – threaded HTTP server that exposes `/api/session` and `/api/agent`, and serves the static frontend.
- `public/` – minimal UI written in vanilla JS + CSS that lets you iterate on prompts and read the agent's output.
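For orientation, here is a minimal sketch of how a planner → coder → reviewer chain can be wired with LangGraph. The state fields and node bodies are illustrative placeholders, not the actual implementation in `app/agent/engine.py`, where each node calls the llama.cpp server and streams its output.

```python
# Illustrative planner → coder → reviewer chain; node bodies are placeholders.
from typing import TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    prompt: str
    plan: str
    code: str
    review: str


def planner(state: AgentState) -> dict:
    # The real node would call the llama.cpp server with the chat history.
    return {"plan": f"1. Sketch an approach for: {state['prompt']}"}


def coder(state: AgentState) -> dict:
    return {"code": f"# code derived from plan:\n# {state['plan']}"}


def reviewer(state: AgentState) -> dict:
    return {"review": f"Reviewed {len(state['code'])} characters of generated code."}


graph = StateGraph(AgentState)
graph.add_node("planner", planner)
graph.add_node("coder", coder)
graph.add_node("reviewer", reviewer)
graph.set_entry_point("planner")
graph.add_edge("planner", "coder")
graph.add_edge("coder", "reviewer")
graph.add_edge("reviewer", END)

pipeline = graph.compile()
result = pipeline.invoke({"prompt": "Write a function that reverses a string"})
print(result["review"])
```

In the real engine, each stage's output is additionally streamed over SSE so the UI can render the plan, code, and review as they are produced.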
Future work can swap the engine with an OpenAI/Anthropic client, add persistent vector memory, and stream tokens back to the UI.
- Install dependencies:
pip install -r requirements.txt
- Install llama.cpp (example on macOS):
brew install llama.cpp
- Start the DeepSeek model with the built-in server:
llama-server -hf lmstudio-community/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
By default this listens on http://127.0.0.1:8080. Export a different URL via `LLAMA_SERVER_URL` if needed.
- Launch the web app:
python app/server.py
Then open http://127.0.0.1:8000 in a browser. Each browser tab initializes a new session so you can keep experiments separate.
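If you want to poke at the model outside the UI, llama-server exposes an OpenAI-compatible chat endpoint. The sketch below shows one way a client could honor `LLAMA_SERVER_URL`; it is a standalone example, not the code in `app/agent/engine.py`.

```python
# Standalone sketch: call the llama.cpp server directly, honoring LLAMA_SERVER_URL.
import json
import os
import urllib.request

LLAMA_SERVER_URL = os.environ.get("LLAMA_SERVER_URL", "http://127.0.0.1:8080")


def chat(messages: list[dict]) -> str:
    """Send a chat request to llama-server's OpenAI-compatible endpoint."""
    payload = json.dumps({"messages": messages, "temperature": 0.2}).encode()
    request = urllib.request.Request(
        f"{LLAMA_SERVER_URL}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat([{"role": "user", "content": "Plan a function that reverses a string."}]))
```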
You can batch-evaluate prompts/models/agent variants with the built-in runner.
- Define tasks in `benchmarks/tasks.jsonl` (or the new `benchmarks/tasks_swe.jsonl` for larger SWE-style evaluations). Each line is a JSON object with the fields `id`, `prompt`, optional `language`, and `test` (path to a Python checker that receives the generated code). Algorithmic checkers now live in `benchmarks/algorithm_test/`, while the higher-context SWE exercises reside in `benchmarks/swe_benchmark_test/`. A sample task line and checker sketch follow the run command below.
- Choose an engine: `local-multi` and `local-single` are llama.cpp-backed agents (they need the local server running, configured via `LLAMA_SERVER_URL`/`LLAMA_SERVER_MODEL`); `api-multi` and `api-single` are OpenAI-backed agents (they require `OPENAI_API_KEY`).
- Run the suite:
python app/run_bench.py \
  --engine local-multi \
  --tasks benchmarks/tasks.jsonl \
  --output results/multi.jsonl
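For the `test` field above, the exact contract between the runner and a checker is defined in `app/run_bench.py`; the hypothetical example below assumes the checker receives the path to the generated code as its first argument and signals failure with a non-zero exit code. A matching task line might look like `{"id": "reverse-string", "prompt": "Write reverse_string(s) that returns s reversed.", "language": "python", "test": "benchmarks/algorithm_test/check_reverse.py"}` (the id, prompt, and file name are made up).

```python
# benchmarks/algorithm_test/check_reverse.py — hypothetical checker.
# Assumes the runner passes the generated code's path as argv[1] and treats a
# non-zero exit status as failure; adapt to the runner's actual convention.
import importlib.util
import sys


def load_candidate(path: str):
    """Import the generated code as a throwaway module."""
    spec = importlib.util.spec_from_file_location("candidate", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def main() -> int:
    candidate = load_candidate(sys.argv[1])
    if candidate.reverse_string("abc") != "cba":
        print("reverse_string('abc') did not return 'cba'")
        return 1
    print("ok")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```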
The runner stores newline-delimited JSON outputs (success, elapsed seconds,
checker logs, raw responses) so you can compute aggregate metrics later. Use
`--limit N` for smoke tests.
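Because results are plain JSONL, aggregation takes only a few lines of Python. The field names below (`success`, `elapsed`) are assumptions based on the description above; inspect a line of your results file for the actual keys.

```python
# Summarize a results file produced by app/run_bench.py.
# Field names ("success", "elapsed") are assumptions — check your own output.
import json
import sys


def summarize(path: str) -> None:
    with open(path) as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    passed = sum(1 for record in records if record.get("success"))
    total = len(records)
    avg_seconds = sum(record.get("elapsed", 0.0) for record in records) / max(total, 1)
    print(f"{passed}/{total} tasks passed, avg {avg_seconds:.1f}s per task")


if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "results/latest.jsonl")
```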
- Local execution agent (no toolchain, just codegen + checker):
python app/run_bench.py --engine local-exec --label exec-loop --output results/exec.jsonl
- Local self-test agent (agent writes and runs its own tests):
python app/run_bench.py --engine local-selftest --tasks benchmarks/tasks.jsonl --output results/english_selftest.jsonl
- API self-test (OpenAI):
python app/run_bench.py --engine api-selftest --tasks benchmarks/tasks.jsonl --output results/english_selftest_api.jsonl
- Smoke run on the first 5 tasks:
python app/run_bench.py --engine local-multi --limit 5
- Default output path (if `--output` is omitted):
results/latest.jsonl
- Add LangGraph subgraphs for tool selection + retrieval-augmented planning.
- Persist session history (Redis/Postgres) for multi-device continuity.
- Add WebSocket streaming so plans render progressively.
- Package the server as FastAPI / Next.js API routes for production use.