
🤖 AI vs Human Code Detection System

Advanced machine learning system for detecting whether code was written by artificial intelligence or humans. Features intelligent ensemble models, GitHub repository analysis, and comprehensive explainability with smart contradiction detection.

✨ Key Features

🎯 Multi-Mode Analysis

  • Single Code Analysis: Analyze individual code snippets with detailed breakdown
  • GitHub Repository Scanning: Complete repository analysis with file-by-file insights
  • Batch File Processing: Upload and analyze multiple files simultaneously

🧠 Intelligent Detection Engine

  • Ensemble ML Models: 4 classical models (LogisticRegression, RandomForest, GradientBoosting, XGBoost)
  • Smart Voting System: Advanced consensus mechanism with confidence weighting
  • Contradiction Detection: Automatically corrects predictions when line-level analysis conflicts with file-level results
  • Multi-Language Support: Python, Java, and JavaScript code detection

📊 Advanced Analysis & Explanations

  • Line-by-Line Breakdown: Detailed analysis of individual code lines with pattern detection
  • Confidence Scoring: Precision confidence metrics for all predictions
  • Model Agreement Tracking: Shows which models agree/disagree and why
  • Pattern Recognition: Detects coding patterns like functions, loops, imports, etc.
  • Consistency Validation: Cross-validates file-level vs line-level predictions

🔍 GitHub Integration

  • Repository Scanning: Analyzes entire GitHub repositories automatically
  • Progress Tracking: Real-time analysis progress with status updates
  • Comprehensive Reports: Downloadable analysis reports with detailed insights
  • API Integration: Direct GitHub API integration for seamless repository access

🏗️ Project Architecture

Code_Detector/
├── 📱 Web Application
│   └── app.py                    # Main Streamlit application with 3 analysis modes
├── 🤖 Machine Learning Pipeline  
│   ├── ml_train.py              # Classical ML model training (4 algorithms)
│   └── dl_train.py              # Deep learning model training (Transformers)
├── 📊 Data & Models
│   ├── Dataset/                 # Training data organized by language
│   │   ├── Python/             # Python samples (AI vs HUMAN)
│   │   ├── Java/               # Java samples (AI vs HUMAN) 
│   │   └── JS/                 # JavaScript samples (AI vs HUMAN)
│   ├── model/                  # Trained classical ML models
│   │   ├── logisticregression.pkl
│   │   ├── randomforest.pkl
│   │   ├── gradientboosting.pkl
│   │   ├── xgboost.pkl
│   │   ├── vectorizer.pkl
│   │   └── labelencoder.pkl
│   └── output/                 # Trained transformer models
│       ├── CodeBERT/           # Microsoft CodeBERT model
│       ├── CodeT5/             # Salesforce CodeT5 model  
│       └── GraphCodeBERT/      # Microsoft GraphCodeBERT model
├── 📋 Documentation
│   ├── README.md               # This file
│   └── requirements.txt        # Python dependencies
└── 🗂️ Cache & Temp Files
    └── __pycache__/            # Python bytecode cache

🚀 Quick Start Guide

Prerequisites

  • Python 3.8 or higher
  • 4GB+ RAM (8GB+ recommended for transformer models)
  • Internet connection (for GitHub repository analysis)

1. Installation

# Clone the repository
git clone https://github.com/muhammadnavas/Code_Detector.git
cd Code_Detector

# Install required dependencies
pip install -r requirements.txt

2. Launch the Application

# Start the Streamlit web interface
streamlit run app.py

🌐 Access the app at: http://localhost:8501

3. Model Training (Optional)

If you want to retrain models with custom data:

# Train classical ML models (faster, CPU-friendly)
python ml_train.py

# Train transformer models (requires GPU for optimal performance)
python dl_train.py

🎮 Usage Modes

1. 📝 Single Code Analysis

Perfect for analyzing individual code snippets:

  1. Input Methods:

    • Paste code directly into the text area
    • Upload single Python/Java/JavaScript files
  2. Analysis Output:

    • 🎯 Overall Prediction: AI vs Human with confidence score
    • 🔧 Model Breakdown: Individual model predictions and confidence
    • 📋 Line-by-Line Analysis: Detailed analysis of each code line
    • 🏷️ Pattern Detection: Identified coding patterns and structures

2. 🐙 GitHub Repository Analysis

Comprehensive analysis of entire GitHub repositories:

  1. Repository Input:

    https://github.com/username/repository
    
  2. Analysis Process:

    • 🔍 Auto-Discovery: Finds all Python files in the repository
    • Progress Tracking: Real-time analysis with progress indicators
    • 📊 Summary Statistics: Repository-wide AI vs Human breakdown
    • 📁 File-by-File Results: Detailed analysis for each file
  3. Advanced Features:

    • 🎯 Smart Corrections: Automatically corrects contradictory predictions
    • ⚠️ Warning System: Flags suspicious patterns or inconsistencies
    • 📄 Report Generation: Download comprehensive analysis reports

3. 📂 Batch File Analysis

Upload and analyze multiple files simultaneously:

  1. Multi-File Upload: Support for .py, .java, .js files
  2. Batch Processing: Analyze all files with progress tracking
  3. Consolidated Results: Summary statistics across all uploaded files

🧠 Machine Learning Architecture

🎯 Ensemble Prediction System

Our intelligent ensemble combines four classical models so that no single model's bias dominates the final prediction:

Classical ML Models (4 Models)

  1. 🔗 Logistic Regression

    • Linear classification with TF-IDF features
    • Fast prediction, good baseline performance
    • Confidence: Probability scores from sigmoid function
  2. 🌲 Random Forest

    • Ensemble of decision trees with voting
    • Handles feature interactions well
    • Confidence: Vote proportion from trees
  3. 📈 Gradient Boosting

    • Sequential ensemble with error correction
    • Strong performance on structured data
    • Confidence: Probability estimates from the boosted ensemble
  4. ⚡ XGBoost

    • Optimized gradient boosting framework
    • State-of-the-art classical ML performance
    • Confidence: Native probability estimation

🤖 Smart Ensemble Logic

  • Majority Voting: 3+ models must agree for high confidence
  • Confidence Weighting: Uses model-specific confidence scores
  • Contradiction Detection: Compares file-level vs line-level predictions
  • Smart Corrections: Automatically adjusts predictions when inconsistencies are detected (see the sketch below)
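
A minimal sketch of how this confidence-weighted vote might look; the function and variable names are illustrative, not the exact implementation in app.py:

# Hypothetical confidence-weighted majority vote over the four models.
def ensemble_vote(model_results):
    """model_results: list of (prediction, confidence) pairs, one per model."""
    ai = [conf for pred, conf in model_results if pred == "AI"]
    human = [conf for pred, conf in model_results if pred == "HUMAN"]
    if len(ai) >= 3:                        # 3+ of 4 agree: majority rule
        return "AI", sum(ai) / len(ai)
    if len(human) >= 3:
        return "HUMAN", sum(human) / len(human)
    # 2/2 split: fall back to total confidence weight per class
    total = sum(ai) + sum(human)
    if sum(ai) >= sum(human):
        return "AI", sum(ai) / total
    return "HUMAN", sum(human) / total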

🔍 Advanced Analysis Features

📋 Line-by-Line Analysis

  • Smart Filtering: Skips comments, imports, and trivial lines
  • Pattern Detection: Identifies functions, loops, conditionals, etc.
  • Confidence Thresholding: Only includes high-confidence line predictions (>60%)
  • Context Preservation: Maintains code structure understanding
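
A sketch of the filtering rule described above, assuming Python-style comments and the 60% threshold (the real filter in app.py may differ):

# Hypothetical line filter: skip trivial lines, comments, and imports.
def should_analyze_line(line):
    stripped = line.strip()
    if not stripped or len(stripped) < 5:
        return False                                   # blank or trivial
    if stripped.startswith(("#", "//", "import ", "from ")):
        return False                                   # comment or import
    return True

LINE_CONFIDENCE_THRESHOLD = 0.6  # keep only line predictions above 60%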

⚠️ Intelligent Contradiction Detection

Our system automatically detects and corrects contradictory predictions:

# Example: File predicted as AI, but 73% of lines are Human
Original Prediction: AI (confidence: 0.86)
Line Analysis: 73% Human lines
Smart Correction: → HUMAN (adjusted confidence: 0.72)
Status: [PREDICTION CORRECTED: AI → HUMAN]
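
A rough sketch of the correction rule behind this example; the 70% threshold and the confidence adjustment here are assumptions:

# Hypothetical correction: flip the file-level prediction when a clear
# majority of analyzed lines disagrees with it.
def correct_contradiction(file_pred, file_conf, line_preds, threshold=0.7):
    if not line_preds:
        return file_pred, file_conf, False
    disagree = sum(1 for p in line_preds if p != file_pred) / len(line_preds)
    if disagree >= threshold:
        flipped = "HUMAN" if file_pred == "AI" else "AI"
        return flipped, round(disagree, 2), True   # confidence from line majority
    return file_pred, file_conf, False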

🎨 Pattern Recognition Engine

Detects various coding patterns:

  • Structural: Functions, classes, imports
  • Control Flow: Loops, conditionals, exception handling
  • Modern Python: F-strings, list comprehensions, lambda functions
  • Style Indicators: Docstrings, comments, naming conventions
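
A regex-based sketch of what such a pattern table might look like for Python (the actual engine's patterns may differ):

import re

# Illustrative pattern table; one compiled regex per coding pattern.
PATTERNS = {
    "function":      re.compile(r"^\s*def\s+\w+"),
    "class":         re.compile(r"^\s*class\s+\w+"),
    "import":        re.compile(r"^\s*(import|from)\s+\w+"),
    "loop":          re.compile(r"^\s*(for|while)\b"),
    "conditional":   re.compile(r"^\s*(if|elif|else)\b"),
    "exception":     re.compile(r"^\s*(try|except|finally|raise)\b"),
    "f_string":      re.compile(r"\bf['\"]"),
    "comprehension": re.compile(r"\[.+\bfor\b.+\]"),
    "lambda":        re.compile(r"\blambda\b"),
    "docstring":     re.compile(r'^\s*("""|\'\'\')'),
}

def detect_patterns(line):
    return [name for name, rx in PATTERNS.items() if rx.search(line)]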

📊 Understanding the Results

🎯 Prediction Confidence Levels

  • 🔵 High Confidence (>0.8): Very reliable prediction
  • 🟡 Medium Confidence (0.6-0.8): Generally reliable with some uncertainty
  • 🔴 Low Confidence (<0.6): Results may be unreliable, manual review recommended

🤖 Model Agreement Indicators

  • ✅ Unanimous: All models agree (highest confidence)
  • 📊 Majority: 3/4 models agree (good confidence)
  • ⚠️ Split Decision: 2/2 split (requires careful interpretation)
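
These labels follow directly from the vote counts; a tiny helper of this kind (illustrative, not the app's exact code) is enough to compute them:

# Map the four model votes to an agreement label.
def agreement_level(predictions):
    top = max(predictions.count(p) for p in set(predictions))
    if top == 4:
        return "unanimous"
    return "majority" if top == 3 else "split"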

🔍 Consistency Analysis

  • ✅ Consistent: File and line predictions align
  • 📊 Mixed Signals: Some disagreement between levels
  • 🔄 Auto-Corrected: System detected and fixed contradiction
  • ❌ Major Contradiction: Significant disagreement requiring manual review

🛠️ Technical Implementation

📦 Dependencies & Requirements

# Core Framework
streamlit>=1.28.0        # Web application framework

# Machine Learning  
scikit-learn>=1.3.0      # Classical ML algorithms
xgboost>=1.7.0          # Gradient boosting framework
numpy>=1.24.0           # Numerical computing
pandas>=2.0.0           # Data manipulation

# Deep Learning (Optional)
torch>=2.0.0            # PyTorch framework
transformers>=4.30.0    # Hugging Face transformers

# Web & API
requests>=2.31.0        # HTTP requests for GitHub API
joblib>=1.3.0          # Model serialization

# Utilities  
pathlib                 # Path handling (built-in)
re                      # Regular expressions (built-in)
typing                  # Type hints (built-in)

🎯 Feature Engineering

Text Preprocessing Pipeline

  1. Code Cleaning: Remove excess whitespace, normalize line endings
  2. TF-IDF Vectorization: Character-level n-grams (3-5) for classical models
  3. Feature Extraction: Syntactic patterns, complexity metrics
  4. Tokenization: Language-specific tokenization for transformers
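
For the TF-IDF step, a character-level n-gram vectorizer like the one below is the standard scikit-learn construction (the exact hyperparameters in ml_train.py may differ; max_features here is an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams of length 3-5, case-preserving since casing is a
# meaningful style signal in code.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                             lowercase=False, max_features=50000)
X = vectorizer.fit_transform(code_samples)  # code_samples: list of str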

Advanced Features

  • Syntactic Patterns: Language constructs (functions, classes, loops)
  • Stylistic Features: Naming conventions, spacing patterns
  • Complexity Metrics: Code depth, nesting levels, line lengths
  • AI Indicators: Patterns typical in AI-generated code
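
The complexity metrics can be computed with a few lines of plain Python; this sketch (names are illustrative) shows the idea:

# Illustrative complexity metrics: line counts, lengths, nesting depth.
def complexity_features(code):
    lines = [l for l in code.splitlines() if l.strip()]
    indents = [len(l) - len(l.lstrip()) for l in lines]
    return {
        "num_lines": len(lines),
        "avg_line_length": sum(map(len, lines)) / max(len(lines), 1),
        "max_nesting": max(indents, default=0) // 4,  # assumes 4-space indents
    }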

⚙️ System Architecture

Model Loading & Caching

# Smart model loading with caching
@st.cache_resource
def load_models():
    models = {
        'logistic': joblib.load('model/logisticregression.pkl'),
        'rf': joblib.load('model/randomforest.pkl'),
        'gb': joblib.load('model/gradientboosting.pkl'),
        'xgb': joblib.load('model/xgboost.pkl')
    }
    vectorizer = joblib.load('model/vectorizer.pkl')
    return models, vectorizer

GitHub API Integration

  • Rate Limiting: Respects GitHub API limits
  • Error Handling: Robust error handling for network issues
  • Recursive Scanning: Deep repository traversal for Python files
  • Content Processing: Handles various file encodings
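
One way to do the recursive scan is GitHub's git trees endpoint with recursive=1, as sketched below; the app's actual traversal may differ, and note that unauthenticated requests are limited to 60 per hour:

import requests

# List all .py files in a public repository in a single API call.
def list_python_files(owner, repo, branch="main"):
    url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}"
    resp = requests.get(url, params={"recursive": "1"}, timeout=30)
    resp.raise_for_status()               # surface network/API errors
    return [item["path"] for item in resp.json().get("tree", [])
            if item["type"] == "blob" and item["path"].endswith(".py")]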

Performance Optimizations

  • Streamlit Caching: Models loaded once and cached
  • Batch Processing: Efficient handling of multiple files
  • Memory Management: Optimized for large repositories
  • Progress Tracking: Real-time user feedback

🎮 Advanced Usage Examples

🔧 Programmatic Usage

# Example: Analyzing code with the system
from app import CodeAnalyzer

# Initialize analyzer
analyzer = CodeAnalyzer()

# Analyze code snippet
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

results, prediction, confidence = analyzer.analyze_code(code)

print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.3f}")

# Get individual model results
for result in results:
    print(f"{result.name}: {result.prediction} ({result.confidence:.3f})")

📊 Batch Analysis

# Example: Analyzing multiple files
files = ['file1.py', 'file2.py', 'file3.py']
results = []

for file_path in files:
    with open(file_path, 'r') as f:
        code = f.read()
    
    file_result = analyzer.analyze_file(file_path, code)
    results.append(file_result)

# Generate summary
summary = SummarizationEngine.summarize_file_analysis(results)
print(f"AI Files: {summary['ai_files']}/{summary['total_files']}")

🔧 Configuration & Customization

⚙️ Model Configuration

You can customize which models to use:

# In app.py, modify the models dictionary
model_config = {
    'logistic': True,       # Enable/disable Logistic Regression
    'random_forest': True,  # Enable/disable Random Forest
    'gradient_boost': True, # Enable/disable Gradient Boosting
    'xgboost': True         # Enable/disable XGBoost
}

🎨 UI Customization

Modify the Streamlit interface:

# Custom page configuration
st.set_page_config(
    page_title="Custom AI Detector",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom styling
st.markdown("""
<style>
    .main-header { color: #1e88e5; }
    .prediction-ai { background-color: #ffebee; }
    .prediction-human { background-color: #e8f5e8; }
</style>
""", unsafe_allow_html=True)

📈 Performance Tuning

For Large Repositories

# Adjust these parameters in app.py
MAX_FILES_ANALYZE = 100      # Limit files to analyze
LINE_CONFIDENCE_THRESHOLD = 0.7  # Higher threshold for line analysis
ENABLE_LINE_ANALYSIS = False     # Disable for faster processing

Memory Optimization

# Process files in batches
BATCH_SIZE = 10
for i in range(0, len(files), BATCH_SIZE):
    batch = files[i:i+BATCH_SIZE]
    process_batch(batch)

🚀 Model Training Guide

📚 Dataset Preparation

Organize your training data in this structure:

Dataset/
├── Python/
│   ├── AI/           # AI-generated Python code samples
│   │   └── A1.py, A2.py, ...
│   └── HUMAN/        # Human-written Python code samples
│       └── H1.py, H2.py, ...
├── Java/
│   ├── AI/           # AI-generated Java code samples
│   └── HUMAN/        # Human-written Java code samples
└── JS/
    ├── AI/           # AI-generated JavaScript code samples  
    └── HUMAN/        # Human-written JavaScript code samples
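
A minimal loader for this layout might look as follows; the labels come straight from the AI/HUMAN folder names, and ml_train.py's actual loader may differ:

from pathlib import Path

# Walk Dataset/ and collect (code, label) pairs from the folder names.
def load_dataset(root="Dataset"):
    samples, labels = [], []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in {".py", ".java", ".js"}:
            samples.append(path.read_text(encoding="utf-8", errors="ignore"))
            labels.append(path.parent.name)   # "AI" or "HUMAN"
    return samples, labels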

🎯 Training Classical ML Models

# Train all classical models with cross-validation
python ml_train.py

Training Process:

  1. Data Loading: Loads code samples from Dataset/ directories
  2. Preprocessing: TF-IDF vectorization with character n-grams
  3. Class Balancing: Handles imbalanced datasets with class weights
  4. Model Training: Trains 4 different algorithms with hyperparameter tuning
  5. Validation: Stratified cross-validation for robust evaluation
  6. Model Saving: Saves trained models to model/ directory
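
Condensed into scikit-learn calls, one iteration of this pipeline looks roughly like the sketch below, reusing the loader and vectorizer sketched earlier (Logistic Regression only; ml_train.py trains all four models and tunes hyperparameters):

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

samples, labels = load_dataset()                     # from the sketch above
X = vectorizer.fit_transform(samples)                # TF-IDF char n-grams
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("CV accuracy:", cross_val_score(clf, X, labels, cv=cv).mean())
clf.fit(X, labels)
joblib.dump(clf, "model/logisticregression.pkl")
joblib.dump(vectorizer, "model/vectorizer.pkl")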

Expected Output:

Loading dataset...
Found 1000 Python samples (500 AI, 500 Human)
Training Logistic Regression... Accuracy: 0.85
Training Random Forest...      Accuracy: 0.88  
Training Gradient Boosting...  Accuracy: 0.87
Training XGBoost...           Accuracy: 0.89
Models saved to model/ directory

🤖 Training Deep Learning Models

# Train transformer models (requires GPU for optimal speed)
python dl_train.py

Supported Models:

  • CodeBERT: Microsoft's code understanding model
  • CodeT5: Salesforce's code generation model
  • GraphCodeBERT: Enhanced with data flow understanding

Training Features:

  • Custom Trainer: Weighted loss for class imbalance
  • Early Stopping: Prevents overfitting
  • Learning Rate Scheduling: Optimizes training convergence
  • Evaluation Metrics: F1-macro score for balanced evaluation
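
The weighted-loss trainer described above is a common Hugging Face pattern; here is a sketch under the assumption that dl_train.py does something similar:

import torch
from transformers import Trainer, EarlyStoppingCallback

# Subclass Trainer to weight the cross-entropy loss by class frequency.
class WeightedLossTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights          # torch.Tensor of size 2

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights)
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Early stopping: halt after 2 evaluations without improvement
# (requires load_best_model_at_end=True in TrainingArguments).
callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]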

🔧 Performance Tips

Faster Analysis

  1. Disable Line Analysis: For quick file-level predictions only
  2. Use Fewer Models: Enable only fast models (Logistic, Random Forest)
  3. Batch Processing: Analyze multiple files together
  4. GPU Acceleration: Use CUDA for transformer models

Better Accuracy

  1. Enable All Models: Use full ensemble for best results
  2. Line Analysis: Enable for detailed insights
  3. Large Training Data: More diverse training samples improve accuracy
  4. Regular Retraining: Update models with new AI-generated code patterns

📊 Understanding Model Behavior

Why Models Disagree

  • Different Feature Focus: Each model looks at different code aspects
  • Training Data Variance: Models trained on slightly different samples
  • Algorithm Differences: Linear vs tree-based vs ensemble approaches
  • Overfitting: Some models may overfit to specific patterns

When to Trust Results

  • High Agreement: All 4 models agree → High confidence
  • High Confidence: Individual confidence scores > 0.8
  • Line Consistency: File prediction matches line analysis
  • Pattern Recognition: Clear AI/Human coding patterns detected

🤝 Contributing

We welcome contributions to improve the AI detection system!

🎯 Areas for Contribution

  1. New Programming Languages

    • Add support for C++, Go, Rust, etc.
    • Language-specific pattern detection
    • Training data collection
  2. Model Improvements

    • Advanced ensemble techniques
    • New feature engineering approaches
    • Deep learning architecture improvements
  3. User Interface Enhancements

    • Better visualization components
    • Real-time analysis features
    • API endpoint development
  4. Dataset Expansion

    • More diverse AI-generated code samples
    • Different AI model outputs (GPT, Claude, etc.)
    • Domain-specific code samples

📋 Development Setup

# 1. Fork and clone the repository
git clone https://github.com/your-username/Code_Detector.git
cd Code_Detector

# 2. Create development environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install development dependencies  
pip install -r requirements.txt
pip install pytest black flake8  # Additional dev tools

# 4. Run tests
pytest tests/

# 5. Format code
black .
flake8 .

🔄 Contribution Workflow

  1. Create Issue: Describe the feature/bug
  2. Fork Repository: Create your own copy
  3. Create Branch: git checkout -b feature/your-feature
  4. Make Changes: Implement your improvements
  5. Add Tests: Ensure functionality works
  6. Submit PR: Create pull request with description

🆘 Support & Community


📈 Roadmap

  • Multi-language expansion (C++, Go, Rust)
  • Real-time API endpoints for integration
  • Advanced visualizations for pattern analysis
  • Cloud deployment options
  • Mobile app for on-the-go analysis
  • Plugin development for popular IDEs

🌟 Star History

If you find this project useful, please ⭐ star it on GitHub to help others discover it!


Built with ❤️ for the developer community

Empowering developers with intelligent AI detection capabilities
