Feature Request: Proper GitHub Repository Crawling #578
Replies: 5 comments 1 reply
- I have been thinking about this as well, possibly using repomix for it: basically repomixing the codebase and then parsing the Markdown/XML file it creates.
- Moving this to discussions and feature requests for now; will move it to the backlog once scoped.
- Assuming this is achievable with Crawl4AI, how about adding an option to limit the crawling to URLs matching a certain pattern (e.g. https://docs.github.com/en/github-cli/github-cli)? If it were possible to limit the crawling only to URLs starting with https://github.com/github/spec-kit, the crawl would have been a lot more efficient and effective. I could even safely instruct it to crawl more deeply without risking tons of unrelated content.
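A filter like the one this comment asks for could be as simple as glob matching against include/exclude lists. A minimal sketch; the function name and patterns are illustrative, not an existing Archon or Crawl4AI option:

```python
from fnmatch import fnmatch

# Hypothetical include/exclude patterns; not actual Archon/Crawl4AI settings.
INCLUDE_PATTERNS = ["https://github.com/github/spec-kit*"]
EXCLUDE_PATTERNS = ["*/login*", "*/signup*"]

def should_crawl(url: str) -> bool:
    """Keep a URL only if it matches an include pattern and no exclude pattern."""
    if not any(fnmatch(url, pat) for pat in INCLUDE_PATTERNS):
        return False
    return not any(fnmatch(url, pat) for pat in EXCLUDE_PATTERNS)

# Example: a page inside the repo passes, an unrelated docs page does not.
assert should_crawl("https://github.com/github/spec-kit/tree/main/docs")
assert not should_crawl("https://docs.github.com/en/github-cli/github-cli")
```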
- Well, that sounds interesting. We definitely need a better crawling method for GitHub! Gitingest has all the features you described: includes, excludes, branches, tags, private repositories, and in addition the directory structure can be inserted. With it, we could take Git crawling to the next level! 😄
- This may also be something to look at: https://github.com/cocoindex-io/cocoindex. It uses tree-sitter to identify code and chunk it accordingly, grouping functions/methods together. It supports multiple sources (S3, Google Drive, local), with a remote Git source hidden behind their paid version. I've done some local testing myself using the GitHub API. It works, but reading file contents via the API, then chunking and embedding them, is very time-consuming. One feature that would be useful, but maybe more difficult to implement, would be keeping repos up to date: some way to tell if there's been a commit, and then only re-ingesting files that have been changed/removed/added. Edit: Actually, I think cocoindex already tracks file hashes and supports incremental updates.
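For the incremental-update idea above, one common approach is to store a content hash per file and re-ingest only what changed between crawls. A rough sketch under that assumption; the state file and helper names are hypothetical:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("ingested_hashes.json")  # hypothetical local state file

def changed_files(repo_dir: Path) -> tuple[list[Path], list[str]]:
    """Return (files to re-ingest, paths deleted since the last run)."""
    old = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    new: dict[str, str] = {}
    to_ingest: list[Path] = []
    for path in repo_dir.rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
        if old.get(str(path)) != digest:  # new file or modified content
            to_ingest.append(path)
    deleted = [p for p in old if p not in new]
    STATE_FILE.write_text(json.dumps(new))
    return to_ingest, deleted
```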
Feature Request: GitHub Repository Crawling Support
Summary
Add comprehensive GitHub repository crawling capabilities to enable complete code repository analysis, not just individual file crawling.
Current Behavior
Archon currently supports crawling individual GitHub files by transforming URLs:
- github.com/user/repo/blob/main/file.py → raw.githubusercontent.com/user/repo/main/file.py
- github.com/user/repo (repository root) → not supported
- github.com/user/repo/tree/main/src (directories) → logs a warning, no crawling
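For reference, the supported blob-to-raw rewrite can be expressed in a few lines (the function is illustrative; Archon's actual implementation may differ):

```python
import re

def to_raw_url(url: str) -> str | None:
    """Rewrite a github.com blob URL to its raw.githubusercontent.com form."""
    m = re.match(r"https://github\.com/([^/]+)/([^/]+)/blob/(.+)", url)
    if m:
        user, repo, path = m.groups()
        return f"https://raw.githubusercontent.com/{user}/{repo}/{path}"
    return None  # repository roots and /tree/ directory URLs are not handled

print(to_raw_url("https://github.com/user/repo/blob/main/file.py"))
# -> https://raw.githubusercontent.com/user/repo/main/file.py
```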
Requested Feature
Enable full GitHub repository crawling with the following capabilities:
1. Repository Structure Discovery
2. Branch and Tag Support
3. Smart File Filtering
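As a rough illustration of capability 3, smart filtering might combine extension and directory rules. The extension and directory sets below are assumptions, not proposed defaults:

```python
from pathlib import PurePosixPath

# Illustrative rules; actual values would be configurable settings.
KEEP_EXTENSIONS = {".py", ".md", ".rst", ".ts", ".js", ".toml", ".yaml"}
SKIP_DIRS = {"node_modules", "vendor", "dist", ".git"}

def keep_file(repo_path: str) -> bool:
    """Decide whether a repo-relative path is worth ingesting."""
    path = PurePosixPath(repo_path)
    if any(part in SKIP_DIRS for part in path.parts):
        return False
    return path.suffix in KEEP_EXTENSIONS
```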
Technical Implementation Suggestions
Option 1: GitHub API Integration (Recommended)
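Option 1 could build on the public Git Trees endpoint, which returns a repository's full file listing in a single call. The endpoint and response fields are real GitHub API features; the wrapper function is illustrative:

```python
import requests

def list_repo_files(owner: str, repo: str, branch: str = "main") -> list[str]:
    """List every file path in a repository with one Git Trees API call."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}",
        params={"recursive": "1"},
        # A token raises the rate limit and enables private repos (see Challenges):
        # headers={"Authorization": "Bearer <token>"},
    )
    resp.raise_for_status()
    return [item["path"] for item in resp.json()["tree"] if item["type"] == "blob"]

for path in list_repo_files("github", "spec-kit")[:10]:
    print(path)
```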
Option 2: Enhanced Web Crawling
Option 3: Hybrid Approach
Proposed Configuration
New Settings
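For illustration, the new settings might look like the following; every key name and default here is hypothetical:

```python
# Hypothetical settings; names and defaults are illustrative only.
GITHUB_CRAWL_SETTINGS = {
    "branch": "main",
    "include_globs": ["docs/**", "src/**"],
    "exclude_globs": ["**/node_modules/**"],
    "max_file_size_kb": 512,
    "include_directory_tree": True,
}
```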
API Endpoints
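Likewise, a crawl endpoint might accept a payload along these lines; the route and fields are invented for illustration and do not exist in Archon today:

```python
import requests

# Hypothetical route and payload shape; not an existing Archon endpoint.
requests.post(
    "http://localhost:8080/api/crawl/github",
    json={
        "repo": "https://github.com/github/spec-kit",
        "branch": "main",
        "include_globs": ["docs/**"],
    },
)
```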
Use Cases
1. Code Documentation and Analysis
2. Project Onboarding
3. Research and Learning
Expected Benefits
For Users
For AI/RAG Systems
Potential Challenges
Rate Limiting
Large Repositories
Authentication
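On the rate-limiting challenge: the GitHub REST API reports remaining quota in the X-RateLimit-* response headers, so a crawler can sleep until the quota resets instead of failing. A minimal sketch (the wrapper function is illustrative):

```python
import time
import requests

def get_with_rate_limit(url: str, **kwargs) -> requests.Response:
    """GET a GitHub API URL, sleeping until the quota resets if exhausted."""
    resp = requests.get(url, **kwargs)
    if (resp.status_code in (403, 429)
            and resp.headers.get("X-RateLimit-Remaining") == "0"):
        reset_at = int(resp.headers["X-RateLimit-Reset"])  # Unix timestamp
        time.sleep(max(0, reset_at - time.time()) + 1)
        resp = requests.get(url, **kwargs)
    return resp
```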
Alternative Implementations
If full GitHub API integration is too complex, consider:
- Limiting the crawl to specific paths, e.g. /docs, /src
Priority Assessment
Impact: High - Enables comprehensive code repository analysis
Complexity: Medium - Requires GitHub API integration or enhanced web crawling
User Demand: High - Common request for code-based RAG systems
Related Issues