🔍 URL Checker - Turing ES

Automated URL validation system for Turing ES using Selenium for real browser navigation, avoiding anti-bot blocks.

📋 Features

✅ Real browser navigation via Selenium (Chrome/Chromium)
✅ Parallel execution with 1-5 simultaneous browsers (configurable; 1 runs sequentially)
✅ Web interface with Streamlit for visual configuration and real-time monitoring
✅ Command-line interface (CLI) for automation
✅ Smart retry system (up to 3 attempts per URL)
✅ Automatic API pagination
✅ Reports in TXT and JSON formats
✅ Automatic email sending via Brevo (formerly Sendinblue)
✅ Performance optimizations (disables images/CSS/JS)
✅ Configuration via INI file or environment variables
✅ Detailed logging
✅ Multi-platform support (Windows/Linux/Mac)

🚀 Installation and Execution

Prerequisites

Python 3.8 or higher
Google Chrome or Chromium installed
Internet connection

Architecture

The application is unified in a single file (app.py) that automatically detects the execution mode:

GUI mode: When launched via Streamlit (run-web.bat or streamlit run app.py)
CLI mode: When launched directly (run-cli.bat or python app.py)

Both modes share the same background services and architecture for consistency.
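
One way this kind of mode detection could work is to ask Streamlit whether a runtime is active in the current process. The sketch below is an assumption about how such a check might look, not the project's actual code; `running_under_streamlit` and `main` are hypothetical names.

```python
def running_under_streamlit() -> bool:
    """Return True only when a Streamlit runtime is active in this process."""
    try:
        # streamlit.runtime.exists() reports whether a Streamlit runtime is running.
        from streamlit import runtime
    except ImportError:
        return False  # Streamlit is unavailable, so this must be a CLI run
    return runtime.exists()


def main() -> str:
    """Pick the mode; a real app would call its GUI or CLI entry point here."""
    mode = "gui" if running_under_streamlit() else "cli"
    print(f"Starting in {mode.upper()} mode")
    return mode


if __name__ == "__main__":
    main()
```

Launched with `python app.py`, no runtime exists and the CLI branch is taken; launched with `streamlit run app.py`, the same file takes the GUI branch.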

Quick Start

Option 1: Web Interface (Recommended)

Visual interface with real-time monitoring:

Windows (PowerShell):

.\run-web.bat

Linux/Mac:

chmod +x run-web.sh
./run-web.sh

Then open your browser at: http://localhost:8501

Option 2: Command Line Interface

Automated execution for scripts:

Windows (PowerShell):

# Interactive menu (select URL from list)
.\run-cli.bat

# Direct URL selection by name
.\run-cli.bat prod-publish

# With options
.\run-cli.bat prod-publish --verbose
.\run-cli.bat stage-author --no-email

Linux/Mac:

chmod +x run-cli.sh

# Interactive menu (select URL from list)
./run-cli.sh

# Direct URL selection by name
./run-cli.sh prod-publish

# With options
./run-cli.sh prod-publish --verbose
./run-cli.sh stage-author --no-email

Available CLI options:

--url-name <name>: Select URL by name (e.g., prod-publish, stage-author)
--verbose: Enable detailed logging
--headless: Run browser in headless mode
--no-email: Disable email sending
--config <file>: Use custom config file
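
The options above map naturally onto `argparse`. The following is a minimal sketch of such a parser, assuming the URL name may be given either positionally (as in `run-cli.sh prod-publish`) or through `--url-name`; the function names and the `selected` attribute are illustrative and may not match the real implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="app.py", description="URL Checker CLI")
    # Optional positional form, as in `run-cli.sh prod-publish`
    parser.add_argument("name", nargs="?", default=None, help="URL name (positional)")
    parser.add_argument("--url-name", dest="url_name", default=None,
                        help="URL name, e.g. prod-publish or stage-author")
    parser.add_argument("--verbose", action="store_true", help="detailed logging")
    parser.add_argument("--headless", action="store_true", help="no browser window")
    parser.add_argument("--no-email", dest="no_email", action="store_true",
                        help="disable email sending")
    parser.add_argument("--config", default="config.ini", help="custom config file")
    return parser


def parse_cli(argv):
    args = build_parser().parse_args(argv)
    # Either form selects the URL; None means "show the interactive menu".
    args.selected = args.url_name or args.name
    return args


if __name__ == "__main__":
    print(parse_cli(["prod-publish", "--verbose"]))
```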

What the scripts do:

✓ Check Python version
✓ Create virtual environment (venv)
✓ Install dependencies
✓ Run app.py in the appropriate mode

Manual Installation

If you prefer manual installation:

# 1. Create virtual environment
python -m venv venv

# 2. Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run
python app.py

⚙️ Configuration

config.ini File

The config.ini file contains all system settings:

[API]
# Multiple base URLs - Format: base_url.ID = name;url
base_url.1 = prod-publish;https://turing.insper.edu.br/api/sn/insper-prod-publish/search
base_url.2 = prod-author;https://turing.insper.edu.br/api/sn/insper-prod-author/search
base_url.3 = stage-publish;https://hml-turing.insper.edu.br/api/sn/insper-stage-publish/search
base_url.4 = stage-author;https://hml-turing.insper.edu.br/api/sn/insper-stage-author/search
locale = pt

[EMAIL]
recipient = recipient@example.com
sender_email = sender@example.com
sender_name = URL Checker - Turing

[BREVO]
# Get your key at: https://app.brevo.com/settings/keys/api
api_key = your-api-key-here

[SELENIUM]
page_load_timeout = 15
element_wait_timeout = 10
headless = false

[RETRY]
max_attempts = 3
retry_delay = 2

[PERFORMANCE]
page_delay = 0.5
url_check_delay = 0.3
disable_images = true
# 1-5 simultaneous browsers (1 = sequential)
parallel_browsers = 3

[REPORT]
output_dir = reports
max_urls_in_email = 50

Key Configuration Parameters

Multiple Base URLs

You can configure multiple base URLs in config.ini:

[API]
base_url.1 = prod-publish;https://turing.insper.edu.br/api/sn/insper-prod-publish/search
base_url.2 = stage-author;https://hml-turing.insper.edu.br/api/sn/insper-stage-author/search

Format: base_url.ID = name;url

ID: Unique number (1, 2, 3...)
name: Short identifier (used in CLI)
url: Full API endpoint
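
To illustrate the `name;url` format, here is a small sketch that reads the entries with Python's standard `configparser`. The helper `load_base_urls` is hypothetical and only shows one plausible way to parse the format; the project's own loader may differ.

```python
import configparser

SAMPLE = """
[API]
base_url.1 = prod-publish;https://turing.insper.edu.br/api/sn/insper-prod-publish/search
base_url.2 = stage-author;https://hml-turing.insper.edu.br/api/sn/insper-stage-author/search
locale = pt
"""


def load_base_urls(config: configparser.ConfigParser) -> dict:
    """Map each configured name to its URL, ordered by numeric ID."""
    entries = []
    for key, value in config["API"].items():
        if not key.startswith("base_url."):
            continue  # skip unrelated keys such as `locale`
        entry_id = int(key.split(".", 1)[1])
        name, _, url = value.partition(";")  # split on the first ';' only
        entries.append((entry_id, name.strip(), url.strip()))
    return {name: url for _, name, url in sorted(entries)}


config = configparser.ConfigParser()
config.read_string(SAMPLE)
base_urls = load_base_urls(config)
print(list(base_urls))
```

The resulting dictionary keys are the names accepted on the command line.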

CLI Usage:

# Interactive menu
./run-cli.sh

# Direct selection by name
./run-cli.sh prod-publish
./run-cli.sh stage-author

Parallel Execution

The system supports parallel URL checking with multiple simultaneous browsers:

parallel_browsers = 1: Sequential mode (original behavior)
parallel_browsers = 2-5: Parallel mode with N simultaneous Chrome instances
Recommended: 3 browsers for balanced performance
Console Output: Shows "⚡ Parallel mode: N simultaneous browsers"

Performance Tips:

More browsers = faster checking but higher CPU/memory usage
Start with 3 browsers and adjust based on your system resources
Sequential mode (1 browser) is more stable but slower
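
The trade-off between sequential and parallel mode can be sketched with a thread pool. This example assumes each worker would own one browser instance; `check_url` is a stand-in that does no real navigation, and all names here are illustrative rather than taken from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def clamp_workers(requested: int) -> int:
    """Keep the worker count inside the documented 1-5 range."""
    return max(1, min(5, requested))


def check_url(url: str) -> tuple:
    """Stand-in for a real browser check: any URL containing 'missing' fails."""
    return url, "missing" not in url


def check_all(urls, parallel_browsers: int = 3) -> list:
    workers = clamp_workers(parallel_browsers)
    if workers == 1:
        return [check_url(u) for u in urls]  # sequential mode
    print(f"Parallel mode: {workers} simultaneous browsers")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check_url, urls))  # map preserves input order


if __name__ == "__main__":
    print(check_all(["https://example.com/a", "https://example.com/missing"]))
```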

Other Settings

Environment Variables

You can use environment variables to override config.ini settings:

# Windows (PowerShell):
$env:BREVO_API_KEY="your-key-here"
$env:EMAIL_RECIPIENT="recipient@example.com"

# Linux/Mac:
export BREVO_API_KEY="your-key-here"
export EMAIL_RECIPIENT="recipient@example.com"

Variable format: SECTION_KEY (e.g., BREVO_API_KEY, EMAIL_RECIPIENT)
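
The override rule can be sketched as a lookup that prefers the `SECTION_KEY` environment variable and falls back to the file. The helper `get_setting` below is hypothetical, and treating an empty variable as unset is an assumption made for this example.

```python
import configparser
import os


def get_setting(config, section, key, environ=None):
    """Return the SECTION_KEY environment variable if set, else the config.ini value."""
    environ = os.environ if environ is None else environ
    env_name = f"{section}_{key}".upper()  # e.g. BREVO + api_key -> BREVO_API_KEY
    if environ.get(env_name):
        return environ[env_name]
    return config.get(section, key, fallback=None)


if __name__ == "__main__":
    cfg = configparser.ConfigParser()
    cfg.read_string("[BREVO]\napi_key = from-file\n")
    print(get_setting(cfg, "BREVO", "api_key"))
```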

Brevo Configuration

To send emails, you need to configure Brevo (formerly Sendinblue):

Access: https://app.brevo.com/
Create an account or log in (the free tier allows 300 emails/day)
Go to Settings → SMTP & API → API Keys
Create new API Key
Configure in config.ini or via environment variable

📖 Usage

Web Interface

Start the Streamlit application:

# Windows
.\run-web.bat

# Linux/Mac
./run-web.sh

The web interface will automatically open in your browser at http://localhost:8501

Web Interface Features

📊 Dashboard Metrics (4-column layout):

Total URLs Checked: Shows current/total URLs (e.g., 150/500)
Failed URLs: Real-time count of problematic URLs
Requests/s: Current processing speed
Est. Time Remaining: Dynamic time estimate to completion
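
These metrics can be derived from three numbers: URLs checked, total URLs, and elapsed time. The function below is a rough sketch under that assumption; `progress_metrics` is a hypothetical name and the dashboard may compute its values differently.

```python
def progress_metrics(checked: int, total: int, elapsed_seconds: float) -> dict:
    """Compute progress, throughput, and a simple linear time estimate."""
    rate = checked / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = max(total - checked, 0)
    # Without any throughput yet, no estimate can be made.
    eta = remaining / rate if rate > 0 else None
    return {
        "progress": f"{checked}/{total}",
        "requests_per_second": rate,
        "eta_seconds": eta,
    }


if __name__ == "__main__":
    print(progress_metrics(150, 500, 60.0))
```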

📈 Real-time Performance Chart:

Response time tracking for each URL check
Moving average trend line (10-request window)
Live updates as URLs are being checked
Visual performance monitoring
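
A moving average over the last 10 requests can be kept with a bounded deque. This is a minimal illustration of the idea; the `MovingAverage` class is hypothetical and not taken from the project.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` response times."""

    def __init__(self, window: int = 10):
        # A deque with maxlen drops the oldest value automatically.
        self._values = deque(maxlen=window)

    def add(self, value_ms: float) -> float:
        self._values.append(value_ms)
        return sum(self._values) / len(self._values)


if __name__ == "__main__":
    trend = MovingAverage(window=10)
    for response_ms in (850, 900, 780):
        print(trend.add(response_ms))
```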

📝 Recent Activity Log (Last 15 entries):

Sequential ID tracking (e.g., #123, #124, #125)
Timestamp for each check (HH:MM:SS format)
Status indicators:
- ✅ OK: Successful check with response time (e.g., "850ms")
- ❌ Failed: Error with HTTP status code (e.g., "Status: 404")
- 🔍 Checking: Currently in progress with attempt number
Full URL display
Reverse chronological order (most recent first)
Smooth scrolling with no jumping items

🎛️ Visual Configuration Panel:

Base URL Selection: Choose from multiple configured endpoints
Parallel Execution: Slider to set 1-5 simultaneous browsers
Timeout Settings: Adjust page load and element wait times
Delays Configuration: Set delays between pages and URL checks
Retry Settings: Configure max attempts and retry delay
Performance Options: Toggle image loading, CSS, and JavaScript
Email Settings: Configure recipient and sender details
Real-time Validation: All settings validated before starting

⏯️ Execution Control:

Start Button: Begin URL checking with current configuration
Stop Button: Gracefully stop the checking process anytime
Progress Bar: Visual progress indicator with percentage and elapsed time
Status Messages: Real-time feedback on execution state

📧 Automatic Email Notifications:

Sends report automatically when check completes with failures
Visual confirmation when email is sent
Warning messages if email configuration is missing
Error handling with clear feedback messages

💾 Results Export:

Download failed URLs as JSON file
Includes complete metadata for each failure
Ready for further processing or analysis

🔄 Live Updates:

Batch processing: Handles up to 10 updates per refresh cycle
Short refresh interval (50 ms) for a smooth real-time experience
No duplicate counting: Each failed URL counted once (even with retries)
Consistent display: All metrics update simultaneously

Command Line Interface

For automation and scripts:

python app.py

CLI Options:

# Verbose mode (detailed logs)
python app.py --verbose

# Headless mode (no browser window)
python app.py --headless

# Don't send email
python app.py --no-email

# Use custom configuration file
python app.py --config my-config.ini

# Combine options
python app.py --verbose --headless --no-email

📂 Project Structure

url-checker/
├── app.py                  # Unified entry point (Streamlit GUI and CLI modes)
├── config.ini             # Configuration
├── requirements.txt       # Python dependencies
├── run-cli.bat           # Windows CLI execution
├── run-cli.sh            # Linux/Mac CLI execution
├── run-web.bat           # Windows web interface
├── run-web.sh            # Linux/Mac web interface
├── README.md             # This documentation
├── .gitignore            # Git ignored files
├── .env.example          # Environment variables example
├── src/                  # Source code
│   ├── __init__.py
│   ├── checker.py        # Main URLChecker class
│   ├── config_loader.py  # Configuration manager
│   ├── report_generator.py  # Report generator
│   └── email_sender.py   # Email sending
├── reports/              # Generated reports (TXT/JSON)
└── logs/                 # Log files

📊 Reports

The system generates two types of reports:

TXT Report

Human-readable format
Includes date, time, and summary
Lists all problematic URLs

Example:

================================================================================
URL VALIDATION REPORT
================================================================================
Date: 01/15/2024 14:30:00
Total problematic URLs: 5
================================================================================

URL: https://example.com/page1
Status Code: 404
API Page: 3
Attempts: 3
--------------------------------------------------------------------------------
...

JSON Report

Structured format for automated processing
Includes complete metadata
Easy integration with other tools

Example:

{
  "timestamp": "2024-01-15T14:30:00",
  "total_failed": 5,
  "failed_urls": [
    {
      "url": "https://example.com/page1",
      "status_code": 404,
      "page": 3,
      "attempts": 3
    }
  ]
}
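
As an example of integrating the JSON report with other tools, the sketch below groups failed URLs by status code. It relies only on the fields shown in the example above; `summarize_report` is a hypothetical helper.

```python
import json

SAMPLE_REPORT = """
{
  "timestamp": "2024-01-15T14:30:00",
  "total_failed": 1,
  "failed_urls": [
    {"url": "https://example.com/page1", "status_code": 404, "page": 3, "attempts": 3}
  ]
}
"""


def summarize_report(report_text: str) -> dict:
    """Group failed URLs by HTTP status code."""
    report = json.loads(report_text)
    by_status = {}
    for entry in report["failed_urls"]:
        by_status.setdefault(entry["status_code"], []).append(entry["url"])
    return by_status


if __name__ == "__main__":
    print(summarize_report(SAMPLE_REPORT))
```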

📧 Automatic Email

When problematic URLs are found, the system automatically sends an email containing:

Summary: Total problematic URLs
URL List: Up to 50 URLs in the email body (set by max_urls_in_email)
Attachments: Complete reports (TXT and JSON)
HTML Formatting: Styled and easy-to-read email

Email is NOT sent if:

No problematic URLs found
Brevo API key not configured
--no-email option used
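
The three conditions above amount to a simple guard before sending. The function below is a sketch of that decision only; `should_send_email` and its parameters are illustrative names, and no email is actually sent.

```python
def should_send_email(failed_count: int, api_key, no_email: bool = False) -> bool:
    """Return True only when every condition for sending is met."""
    if no_email:
        return False  # --no-email option used
    if failed_count == 0:
        return False  # no problematic URLs found
    if not api_key:
        return False  # API key not configured
    return True


if __name__ == "__main__":
    print(should_send_email(failed_count=5, api_key="example-key"))
```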

🔧 Customization

Adjust Timeouts

In config.ini:

[SELENIUM]
# Increase if pages take long to load
page_load_timeout = 20
element_wait_timeout = 15

[RETRY]
# More attempts for unstable sites
max_attempts = 5
# More time between attempts
retry_delay = 3
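
To show how `max_attempts` and `retry_delay` could interact, here is a minimal retry loop. It is an assumption about the general shape of such logic, not the project's code; `check_with_retry` and its `check` callback are hypothetical, and `sleep` is injectable so the example can be exercised without waiting.

```python
import time


def check_with_retry(check, url, max_attempts=3, retry_delay=2.0, sleep=time.sleep):
    """Call check(url) until it succeeds or the attempts run out.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if check(url):
            return True, attempt
        if attempt < max_attempts:
            sleep(retry_delay)  # wait before the next attempt
    return False, max_attempts


if __name__ == "__main__":
    print(check_with_retry(lambda u: True, "https://example.com/page1"))
```

With `max_attempts = 5` and `retry_delay = 3`, a URL that never succeeds costs at least 12 seconds of waiting (four delays) before it is reported as failed.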

Performance

To speed up verification:

[PERFORMANCE]
# Reduce delay between pages
page_delay = 0.2
# Reduce delay between URLs
url_check_delay = 0.1
# Keep disabled
disable_images = true
# Maximum parallelization (use with caution)
parallel_browsers = 5

For slow connections or resource-constrained systems:

[PERFORMANCE]
# Increase delay
page_delay = 1.0
url_check_delay = 0.5
# Sequential mode (most stable)
parallel_browsers = 1

Note: Parallel execution with multiple browsers significantly speeds up checking but increases CPU and memory usage. Monitor your system resources and adjust accordingly.

Headless Mode

To run without a visible browser window (for example, on servers):

[SELENIUM]
headless = true

Or via command line:

python app.py --headless

🐛 Troubleshooting

Error: "Chrome/Chromium not found"

Linux:

# Ubuntu/Debian
sudo apt-get install chromium-browser chromium-chromedriver

# Fedora/RHEL
sudo dnf install chromium chromium-chromedriver

Windows/Mac: Install Google Chrome from the official website.

Error: "ModuleNotFoundError"

Make sure the virtual environment is activated and the dependencies are installed:

pip install -r requirements.txt

Email not being sent

Check:

✓ Brevo API key configured
✓ Recipient email configured
✓ Email client library from requirements.txt installed
✓ Problematic URLs were found
✓ --no-email option not used

Pages taking too long

To give up on slow pages sooner, reduce the timeouts in config.ini:

[SELENIUM]
page_load_timeout = 10
element_wait_timeout = 5

📝 Logs

Logs are saved in logs/url_checker_YYYYMMDD.log:

INFO: Normal operations
WARNING: Non-critical issues
ERROR: Errors needing attention

Use --verbose for more detailed logs:

python app.py --verbose
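
A daily log file with this naming scheme can be set up with the standard `logging` module. The sketch below assumes that verbose mode lowers the level to DEBUG; `build_logger`, the logger name, and the line format are illustrative choices, not confirmed details of the project.

```python
import logging
import os
from datetime import datetime


def build_logger(log_dir="logs", verbose=False, now=None):
    """Create a logger writing to <log_dir>/url_checker_YYYYMMDD.log."""
    now = now or datetime.now()
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"url_checker_{now:%Y%m%d}.log")
    logger = logging.getLogger("url_checker")
    # Assumption: --verbose maps to DEBUG, the default to INFO.
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s: %(message)s"))
        logger.addHandler(handler)
    return logger, log_path


if __name__ == "__main__":
    logger, path = build_logger(verbose=True)
    logger.info("Normal operation")
    print(path)
```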

🤝 Contributing

Contributions are welcome! To contribute:

Fork the project
Create a branch for your feature (git checkout -b feature/MyFeature)
Commit your changes (git commit -m 'Add MyFeature')
Push to the branch (git push origin feature/MyFeature)
Open a Pull Request

📄 License

This project is licensed under the MIT license. See the LICENSE file for more details.

Version: 2025.4
Last update: December 2025
