Feature Request: Proper GitHub Repository Crawling #578
Replies: 5 comments 1 reply
- I have been thinking about this as well, possibly using repomix for it: basically repomixing the codebase and then parsing the Markdown/XML file it creates.
- Moving this to discussions and feature requests for now; will move it to the backlog once scoped.
- Assuming this is achievable with Crawl4AI, how about adding an option to limit the crawling to URLs matching a certain pattern (e.g. https://docs.github.com/en/github-cli/github-cli)? If it were possible to limit the crawling only to URLs starting with https://github.com/github/spec-kit, the crawl would have been a lot more efficient and effective. I could even safely instruct it to crawl more deeply without risking tons of unrelated content.
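A filter like the one this comment asks for could be as simple as glob matching against include/exclude lists. A minimal sketch; the function name and patterns are illustrative, not an existing Archon or Crawl4AI option:

```python
from fnmatch import fnmatch

# Hypothetical include/exclude patterns; not actual Archon/Crawl4AI settings.
INCLUDE_PATTERNS = ["https://github.com/github/spec-kit*"]
EXCLUDE_PATTERNS = ["*/login*", "*/signup*"]

def should_crawl(url: str) -> bool:
    """Keep a URL only if it matches an include pattern and no exclude pattern."""
    if not any(fnmatch(url, pat) for pat in INCLUDE_PATTERNS):
        return False
    return not any(fnmatch(url, pat) for pat in EXCLUDE_PATTERNS)

# Example: a page inside the repo passes, an unrelated docs page does not.
assert should_crawl("https://github.com/github/spec-kit/tree/main/docs")
assert not should_crawl("https://docs.github.com/en/github-cli/github-cli")
```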
- Well, that sounds interesting. We definitely need a better crawling method for GitHub! Gitingest has all the features you described: includes, excludes, branches, tags, private repositories, and in addition the directory structure can be inserted. With it, we could take Git crawling to the next level! 😄
- This may also be something to look at: https://github.com/cocoindex-io/cocoindex. It uses tree-sitter to identify code and chunk it accordingly, grouping functions/methods together. It supports multiple sources (S3, Google Drive, local), with a remote Git source hidden behind their paid version. I've done some local testing myself using the GitHub API. It works, but reading file contents via the API, then chunking and embedding them, is very time-consuming. One feature that would be useful, but maybe more difficult to implement, would be keeping repos up to date: some way to tell if there's been a commit, and then only re-ingesting files that have been changed/removed/added. Edit: Actually, I think cocoindex already tracks file hashes and supports incremental updates.
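For the incremental-update idea above, one common approach is to store a content hash per file and re-ingest only what changed between crawls. A rough sketch under that assumption; the state file and helper names are hypothetical:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("ingested_hashes.json")  # hypothetical local state file

def changed_files(repo_dir: Path) -> tuple[list[Path], list[str]]:
    """Return (files to re-ingest, paths deleted since the last run)."""
    old = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    new: dict[str, str] = {}
    to_ingest: list[Path] = []
    for path in repo_dir.rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
        if old.get(str(path)) != digest:  # new file or modified content
            to_ingest.append(path)
    deleted = [p for p in old if p not in new]
    STATE_FILE.write_text(json.dumps(new))
    return to_ingest, deleted
```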
Feature Request: GitHub Repository Crawling Support
Summary
Add comprehensive GitHub repository crawling capabilities to enable complete code repository analysis, not just individual file crawling.
Current Behavior
Archon currently supports crawling individual GitHub files by transforming URLs:
- github.com/user/repo/blob/main/file.py → raw.githubusercontent.com/user/repo/main/file.py
- github.com/user/repo (repository root) → not supported
- github.com/user/repo/tree/main/src (directories) → logs a warning, no crawling
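For reference, the supported blob-to-raw rewrite can be expressed in a few lines (the function is illustrative; Archon's actual implementation may differ):

```python
import re

def to_raw_url(url: str) -> str | None:
    """Rewrite a github.com blob URL to its raw.githubusercontent.com form."""
    m = re.match(r"https://github\.com/([^/]+)/([^/]+)/blob/(.+)", url)
    if m:
        user, repo, path = m.groups()
        return f"https://raw.githubusercontent.com/{user}/{repo}/{path}"
    return None  # repository roots and /tree/ directory URLs are not handled

print(to_raw_url("https://github.com/user/repo/blob/main/file.py"))
# -> https://raw.githubusercontent.com/user/repo/main/file.py
```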
Requested Feature
Enable full GitHub repository crawling with the following capabilities:
1. Repository Structure Discovery
2. Branch and Tag Support
3. Smart File Filtering
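As a rough illustration of capability 3, smart filtering might combine extension and directory rules. The extension and directory sets below are assumptions, not proposed defaults:

```python
from pathlib import PurePosixPath

# Illustrative rules; actual values would be configurable settings.
KEEP_EXTENSIONS = {".py", ".md", ".rst", ".ts", ".js", ".toml", ".yaml"}
SKIP_DIRS = {"node_modules", "vendor", "dist", ".git"}

def keep_file(repo_path: str) -> bool:
    """Decide whether a repo-relative path is worth ingesting."""
    path = PurePosixPath(repo_path)
    if any(part in SKIP_DIRS for part in path.parts):
        return False
    return path.suffix in KEEP_EXTENSIONS
```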
Technical Implementation Suggestions
Option 1: GitHub API Integration (Recommended)
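Option 1 could build on the public Git Trees endpoint, which returns a repository's full file listing in a single call. The endpoint and response fields are real GitHub API features; the wrapper function is illustrative:

```python
import requests

def list_repo_files(owner: str, repo: str, branch: str = "main") -> list[str]:
    """List every file path in a repository with one Git Trees API call."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}",
        params={"recursive": "1"},
        # A token raises the rate limit and enables private repos (see Challenges):
        # headers={"Authorization": "Bearer <token>"},
    )
    resp.raise_for_status()
    return [item["path"] for item in resp.json()["tree"] if item["type"] == "blob"]

for path in list_repo_files("github", "spec-kit")[:10]:
    print(path)
```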
Option 2: Enhanced Web Crawling
Option 3: Hybrid Approach
Proposed Configuration
New Settings
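For illustration, the new settings might look like the following; every key name and default here is hypothetical:

```python
# Hypothetical settings; names and defaults are illustrative only.
GITHUB_CRAWL_SETTINGS = {
    "branch": "main",
    "include_globs": ["docs/**", "src/**"],
    "exclude_globs": ["**/node_modules/**"],
    "max_file_size_kb": 512,
    "include_directory_tree": True,
}
```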
API Endpoints
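Likewise, a crawl endpoint might accept a payload along these lines; the route and fields are invented for illustration and do not exist in Archon today:

```python
import requests

# Hypothetical route and payload shape; not an existing Archon endpoint.
requests.post(
    "http://localhost:8080/api/crawl/github",
    json={
        "repo": "https://github.com/github/spec-kit",
        "branch": "main",
        "include_globs": ["docs/**"],
    },
)
```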
Use Cases
1. Code Documentation and Analysis
2. Project Onboarding
3. Research and Learning
Expected Benefits
For Users
For AI/RAG Systems
Potential Challenges
Rate Limiting
Large Repositories
Authentication
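On the rate-limiting challenge: the GitHub REST API reports remaining quota in the X-RateLimit-* response headers, so a crawler can sleep until the quota resets instead of failing. A minimal sketch (the wrapper function is illustrative):

```python
import time
import requests

def get_with_rate_limit(url: str, **kwargs) -> requests.Response:
    """GET a GitHub API URL, sleeping until the quota resets if exhausted."""
    resp = requests.get(url, **kwargs)
    if (resp.status_code in (403, 429)
            and resp.headers.get("X-RateLimit-Remaining") == "0"):
        reset_at = int(resp.headers["X-RateLimit-Reset"])  # Unix timestamp
        time.sleep(max(0, reset_at - time.time()) + 1)
        resp = requests.get(url, **kwargs)
    return resp
```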
Alternative Implementations
If full GitHub API integration is too complex, consider:
- Limiting the crawl to specific paths, e.g. /docs, /src
Priority Assessment
Impact: High - Enables comprehensive code repository analysis
Complexity: Medium - Requires GitHub API integration or enhanced web crawling
User Demand: High - Common request for code-based RAG systems
Related Issues