Automated URL validation system for Turing ES using Selenium for real browser navigation, avoiding anti-bot blocks.
- Real browser navigation via Selenium (Chrome/Chromium)
- Parallel execution with 2-5 simultaneous browsers (configurable)
- Web interface with Streamlit for visual configuration and real-time monitoring
- Command-line interface (CLI) for automation
- Smart retry system (up to 3 attempts per URL)
- Automatic API pagination
- Reports in TXT and JSON formats
- Automatic email sending via Mailchimp Transactional
- Performance optimizations (disables images/CSS/JS)
- Configuration via INI file or environment variables
- Detailed logging
- Multi-platform support (Windows/Linux/Mac)
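The retry behaviour listed above (up to 3 attempts per URL, with a delay between attempts) can be pictured roughly as follows. This is a minimal sketch, not the project's actual code: `check_with_retry` and its `check` callback are hypothetical names, and the defaults simply mirror the `max_attempts` and `retry_delay` values shown in the sample configuration.

```python
import time
from typing import Callable, Tuple


def check_with_retry(
    check: Callable[[str], bool],
    url: str,
    max_attempts: int = 3,
    retry_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Call check(url) until it succeeds or max_attempts is reached.

    Returns (succeeded, attempts_used). `check` is a stand-in for whatever
    performs the real browser navigation (hypothetical).
    """
    for attempt in range(1, max_attempts + 1):
        if check(url):
            return True, attempt
        if attempt < max_attempts:
            sleep(retry_delay)  # pause before the next attempt
    return False, max_attempts
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.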
- Python 3.8 or higher
- Google Chrome or Chromium installed
- Internet connection
The application is unified in a single file (app.py) that automatically detects the execution mode:
- GUI mode: When launched via Streamlit (run-web.bat or streamlit run app.py)
- CLI mode: When launched directly (run-cli.bat or python app.py)
Both modes share the same background services and architecture for consistency.
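The README does not say how app.py tells the two modes apart. One plausible approach, shown here only as a sketch, is to ask Streamlit whether a script runtime is active. The function name `detect_mode` is hypothetical; `streamlit.runtime.exists()` is available in recent Streamlit releases, and the sketch falls back to CLI mode when Streamlit (or that module) is missing.

```python
def detect_mode() -> str:
    """Return "gui" when running under `streamlit run`, otherwise "cli".

    Assumption: the real app.py may use a different detection strategy.
    """
    try:
        from streamlit import runtime  # absent in very old Streamlit versions
    except ImportError:
        return "cli"
    return "gui" if runtime.exists() else "cli"


if __name__ == "__main__":
    print(f"Detected mode: {detect_mode()}")
```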
Visual interface with real-time monitoring:
Windows (PowerShell):
.\run-web.bat
Linux/Mac:
chmod +x run-web.sh
./run-web.sh
Then open your browser at: http://localhost:8501
Automated execution for scripts:
Windows (PowerShell):
# Interactive menu (select URL from list)
.\run-cli.bat
# Direct URL selection by name
.\run-cli.bat prod-publish
# With options
.\run-cli.bat prod-publish --verbose
.\run-cli.bat stage-author --no-email
Linux/Mac:
chmod +x run-cli.sh
# Interactive menu (select URL from list)
./run-cli.sh
# Direct URL selection by name
./run-cli.sh prod-publish
# With options
./run-cli.sh prod-publish --verbose
./run-cli.sh stage-author --no-email
Available CLI options:
- --url-name <name>: Select URL by name (e.g., prod-publish, stage-author)
- --verbose: Enable detailed logging
- --headless: Run browser in headless mode
- --no-email: Disable email sending
- --config <file>: Use custom config file
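For readers who want to see how these options fit together, here is a minimal argparse sketch that accepts the flags listed above plus the positional name used in the examples. It is an illustration under assumptions: `build_parser` and `selected_url_name` are hypothetical helpers, and the default of config.ini for `--config` is inferred from the rest of this README rather than confirmed.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI options (illustrative)."""
    parser = argparse.ArgumentParser(description="URL checker CLI (sketch)")
    # Positional form, as in `run-cli.sh prod-publish`
    parser.add_argument("name", nargs="?", default=None,
                        help="URL name, e.g. prod-publish")
    parser.add_argument("--url-name", dest="url_name", default=None,
                        help="select URL by name")
    parser.add_argument("--verbose", action="store_true",
                        help="enable detailed logging")
    parser.add_argument("--headless", action="store_true",
                        help="run browser in headless mode")
    parser.add_argument("--no-email", dest="no_email", action="store_true",
                        help="disable email sending")
    parser.add_argument("--config", default="config.ini",
                        help="use custom config file")
    return parser


def selected_url_name(args: argparse.Namespace) -> Optional[str]:
    """Prefer --url-name, then the positional name; None means 'show the menu'."""
    return args.url_name or args.name
```

With no name given, `selected_url_name` returns None, which corresponds to the interactive menu case.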
What the scripts do:
- Check Python version
- Create virtual environment (venv)
- Install dependencies
- Run app.py in the appropriate mode
If you prefer manual installation:
# 1. Create virtual environment
python -m venv venv
# 2. Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Run
python main.py
The config.ini file contains all system settings:
[API]
# Multiple base URLs - Format: base_url.ID = name;url
base_url.1 = prod-publish;https://turing.insper.edu.br/api/sn/insper-prod-publish/search
base_url.2 = prod-author;https://turing.insper.edu.br/api/sn/insper-prod-author/search
base_url.3 = stage-publish;https://hml-turing.insper.edu.br/api/sn/insper-stage-publish/search
base_url.4 = stage-author;https://hml-turing.insper.edu.br/api/sn/insper-stage-author/search
locale = pt
[EMAIL]
recipient = [email protected]
sender_email = [email protected]
sender_name = URL Checker - Turing
[BREVO]
# Get your key at: https://app.brevo.com/settings/keys/api
api_key = your-api-key-here
[SELENIUM]
page_load_timeout = 15
element_wait_timeout = 10
headless = false
[RETRY]
max_attempts = 3
retry_delay = 2
[PERFORMANCE]
page_delay = 0.5
url_check_delay = 0.3
disable_images = true
parallel_browsers = 3 # 1-5 simultaneous browsers (1 = sequential)
[REPORT]
output_dir = reports
max_urls_in_email = 50
You can configure multiple base URLs in config.ini:
[API]
base_url.1 = prod-publish;https://turing.insper.edu.br/api/sn/insper-prod-publish/search
base_url.2 = stage-author;https://hml-turing.insper.edu.br/api/sn/insper-stage-author/search
Format: base_url.ID = name;url
- ID: Unique number (1, 2, 3...)
- name: Short identifier (used in CLI)
- url: Full API endpoint
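The `base_url.ID = name;url` format can be read with the standard configparser module. The sketch below is one way to do it, not necessarily how config_loader.py works; `load_base_urls` is a hypothetical helper that returns a name-to-URL mapping ordered by ID.

```python
import configparser
from typing import Dict

SAMPLE = """
[API]
base_url.1 = prod-publish;https://turing.insper.edu.br/api/sn/insper-prod-publish/search
base_url.2 = stage-author;https://hml-turing.insper.edu.br/api/sn/insper-stage-author/search
locale = pt
"""


def load_base_urls(config: configparser.ConfigParser) -> Dict[str, str]:
    """Collect base_url.<ID> entries from [API] as {name: url}, sorted by ID."""
    entries = []
    for key, value in config["API"].items():
        if not key.startswith("base_url."):
            continue  # skip unrelated keys such as `locale`
        entry_id = int(key.split(".", 1)[1])
        name, _, url = value.partition(";")  # split on the first ';' only
        entries.append((entry_id, name.strip(), url.strip()))
    return {name: url for _, name, url in sorted(entries)}


if __name__ == "__main__":
    parser = configparser.ConfigParser()
    parser.read_string(SAMPLE)
    for name, url in load_base_urls(parser).items():
        print(f"{name}: {url}")
```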
CLI Usage:
# Interactive menu
./run-cli.sh
# Direct selection by name
./run-cli.sh prod-publish
./run-cli.sh stage-author
The system supports parallel URL checking with multiple simultaneous browsers:
- parallel_browsers = 1: Sequential mode (original behavior)
- parallel_browsers = 2-5: Parallel mode with N simultaneous Chrome instances
- Recommended: 3 browsers for balanced performance
- Console Output: Shows "Parallel mode: N simultaneous browsers"
Performance Tips:
- More browsers = faster checking but higher CPU/memory usage
- Start with 3 browsers and adjust based on your system resources
- Sequential mode (1 browser) is more stable but slower
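The trade-off between sequential and parallel mode can be sketched with a thread pool. This is an assumption about the general shape only: `check_all` and the `check` callback are hypothetical, and a real implementation would give each worker its own browser instance instead of sharing a plain function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_all(
    urls: Iterable[str],
    check: Callable[[str], bool],
    parallel_browsers: int = 3,
) -> Dict[str, bool]:
    """Run check(url) for every URL with 1-5 workers.

    parallel_browsers is clamped to the documented 1-5 range;
    a value of 1 takes the plain sequential path.
    """
    url_list = list(urls)
    workers = max(1, min(5, parallel_browsers))
    if workers == 1:
        return {url: check(url) for url in url_list}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, url_list))  # map preserves input order
    return dict(zip(url_list, results))
```

Because `pool.map` preserves input order, results line up with the original URL list regardless of which worker finishes first.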
You can use environment variables to override config.ini settings:
# Windows (PowerShell):
$env:BREVO_API_KEY="your-key-here"
$env:EMAIL_RECIPIENT="[email protected]"
# Linux/Mac:
export BREVO_API_KEY="your-key-here"
export EMAIL_RECIPIENT="[email protected]"
Variable format: SECTION_KEY (e.g., BREVO_API_KEY, EMAIL_RECIPIENT)
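The SECTION_KEY override rule can be expressed in a few lines. The helper below is a hypothetical sketch (`get_setting` is not a name taken from the project): it looks for an environment variable built from the section and key, and falls back to the INI value.

```python
import configparser
import os
from typing import Mapping, Optional


def get_setting(
    config: configparser.ConfigParser,
    section: str,
    key: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the env override SECTION_KEY if set, else the config.ini value."""
    env_name = f"{section}_{key}".upper()  # e.g. BREVO + api_key -> BREVO_API_KEY
    if env_name in env:
        return env[env_name]
    return config.get(section, key, fallback=None)


if __name__ == "__main__":
    cfg = configparser.ConfigParser()
    cfg.read_string("[BREVO]\napi_key = from-file\n")
    print(get_setting(cfg, "BREVO", "api_key", env={"BREVO_API_KEY": "from-env"}))
    print(get_setting(cfg, "BREVO", "api_key", env={}))
```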
To send emails, you need to configure Brevo (formerly Sendinblue):
- Access: https://app.brevo.com/
- Create account or login (300 emails/day free)
- Go to Settings → SMTP & API → API Keys
- Create new API Key
- Configure in config.ini or via environment variable
Start the Streamlit application:
# Windows
.\run-web.bat
# Linux/Mac
./run-web.sh
The web interface will automatically open in your browser at http://localhost:8501
Dashboard Metrics (4-column layout):
- Total URLs Checked: Shows current/total URLs (e.g., 150/500)
- Failed URLs: Real-time count of problematic URLs
- Requests/s: Current processing speed
- Est. Time Remaining: Dynamic time estimate to completion
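The Requests/s and Est. Time Remaining metrics follow from simple arithmetic: rate is checked URLs divided by elapsed time, and the estimate is remaining URLs divided by that rate. The function below is an illustrative sketch (`progress_metrics` is a hypothetical name), using the 150/500 example from the list above with an assumed 60 seconds elapsed.

```python
from typing import Optional, Tuple


def progress_metrics(
    checked: int, total: int, elapsed_seconds: float
) -> Tuple[float, Optional[float]]:
    """Return (requests_per_second, estimated_seconds_remaining).

    The estimate is None until at least one URL has been checked.
    """
    rate = checked / elapsed_seconds if elapsed_seconds > 0 else 0.0
    remaining = total - checked
    eta = remaining / rate if rate > 0 else None
    return rate, eta


if __name__ == "__main__":
    # 150 of 500 URLs checked after 60 s: 2.5 requests/s, 350 / 2.5 = 140 s left
    print(progress_metrics(150, 500, 60.0))
```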
Real-time Performance Chart:
- Response time tracking for each URL check
- Moving average trend line (10-request window)
- Live updates as URLs are being checked
- Visual performance monitoring
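A moving average over the last 10 response times can be computed with a bounded deque. This sketch only shows the calculation; `moving_average` is a hypothetical helper, and how the dashboard actually derives its trend line is not stated in this README.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 10) -> List[float]:
    """Average of the most recent `window` values at each point in the series.

    Early points average over however many values are available so far.
    """
    recent: deque = deque(maxlen=window)  # oldest value is evicted automatically
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages
```

For example, `moving_average([100, 200, 300], window=2)` averages 100 alone, then 100 and 200, then 200 and 300.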
Recent Activity Log (Last 15 entries):
- Sequential ID tracking (e.g., #123, #124, #125)
- Timestamp for each check (HH:MM:SS format)
- Status indicators:
  - OK: Successful check with response time (e.g., "850ms")
  - Failed: Error with HTTP status code (e.g., "Status: 404")
  - Checking: Currently in progress with attempt number
- Full URL display
- Most recent entries shown first; the oldest entries drop off as new ones arrive
- Smooth scrolling with no jumping items
Visual Configuration Panel:
- Base URL Selection: Choose from multiple configured endpoints
- Parallel Execution: Slider to set 1-5 simultaneous browsers
- Timeout Settings: Adjust page load and element wait times
- Delays Configuration: Set delays between pages and URL checks
- Retry Settings: Configure max attempts and retry delay
- Performance Options: Toggle image loading, CSS, and JavaScript
- Email Settings: Configure recipient and sender details
- Real-time Validation: All settings validated before starting
Execution Control:
- Start Button: Begin URL checking with current configuration
- Stop Button: Gracefully stop the checking process anytime
- Progress Bar: Visual progress indicator with percentage and elapsed time
- Status Messages: Real-time feedback on execution state
Automatic Email Notifications:
- Sends report automatically when check completes with failures
- Visual confirmation when email is sent
- Warning messages if email configuration is missing
- Error handling with clear feedback messages
Results Export:
- Download failed URLs as JSON file
- Includes complete metadata for each failure
- Ready for further processing or analysis
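As an example of further processing, the snippet below groups failures by HTTP status code. It assumes the downloaded file has the same shape as the JSON report example in this README (`failed_urls` entries with `url`, `status_code`, `page`, `attempts`); the sample data and the `failures_by_status` helper are made up for illustration.

```python
import json
from collections import Counter

# Sample data in the report shape documented in this README (values invented)
SAMPLE_REPORT = """
{
  "timestamp": "2024-01-15T14:30:00",
  "total_failed": 3,
  "failed_urls": [
    {"url": "https://example.com/page1", "status_code": 404, "page": 3, "attempts": 3},
    {"url": "https://example.com/page2", "status_code": 500, "page": 4, "attempts": 3},
    {"url": "https://example.com/page3", "status_code": 404, "page": 7, "attempts": 3}
  ]
}
"""


def failures_by_status(report_text: str) -> Counter:
    """Count failed URLs per HTTP status code."""
    report = json.loads(report_text)
    return Counter(item["status_code"] for item in report["failed_urls"])


if __name__ == "__main__":
    for status, count in failures_by_status(SAMPLE_REPORT).most_common():
        print(f"{status}: {count} URL(s)")
```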
Live Updates:
- Batch processing: Handles up to 10 updates per refresh cycle
- Optimized refresh rate (50ms) for smooth real-time experience
- No duplicate counting: Each failed URL counted once (even with retries)
- Consistent display: All metrics update simultaneously
For automation and scripts:
python main.py
CLI Options:
# Verbose mode (detailed logs)
python main.py --verbose
# Headless mode (no browser window)
python main.py --headless
# Don't send email
python main.py --no-email
# Use custom configuration file
python main.py --config my-config.ini
# Combine options
python main.py --verbose --headless --no-email
url-checker/
├── main.py                  # CLI entry point
├── app.py                   # Streamlit web interface
├── config.ini               # Configuration
├── requirements.txt         # Python dependencies
├── run.bat                  # Windows CLI execution
├── run.sh                   # Linux/Mac CLI execution
├── run-web.bat              # Windows web interface
├── run-web.sh               # Linux/Mac web interface
├── README.md                # This documentation
├── .gitignore               # Git ignored files
├── .env.example             # Environment variables example
├── src/                     # Source code
│   ├── __init__.py
│   ├── checker.py           # Main URLChecker class
│   ├── config_loader.py     # Configuration manager
│   ├── report_generator.py  # Report generator
│   └── email_sender.py      # Email sending
├── reports/                 # Generated reports (TXT/JSON)
└── logs/                    # Log files
The system generates two types of reports:
TXT report:
- Human-readable format
- Includes date, time, and summary
- Lists all problematic URLs
Example:
================================================================================
URL VALIDATION REPORT
================================================================================
Date: 01/15/2024 14:30:00
Total problematic URLs: 5
================================================================================
URL: https://example.com/page1
Status Code: 404
API Page: 3
Attempts: 3
--------------------------------------------------------------------------------
...
JSON report:
- Structured format for automated processing
- Includes complete metadata
- Easy integration with other tools
Example:
{
"timestamp": "2024-01-15T14:30:00",
"total_failed": 5,
"failed_urls": [
{
"url": "https://example.com/page1",
"status_code": 404,
"page": 3,
"attempts": 3
}
]
}
When problematic URLs are found, the system automatically sends an email containing:
- Summary: Total problematic URLs
- URL List: Up to 50 URLs in email body
- Attachments: Complete reports (TXT and JSON)
- HTML Formatting: Styled and easy-to-read email
Email is NOT sent if:
- No problematic URLs found
- Mailchimp API Key not configured
- --no-email option used
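The three conditions above, together with the 50-URL cap on the email body, reduce to a small decision function. The sketch below uses hypothetical names (`should_send_email`, `urls_for_email_body`) and is meant to summarize the documented rules, not to reproduce email_sender.py.

```python
from typing import List, Optional, Sequence


def should_send_email(
    failed_urls: Sequence[str], api_key: Optional[str], no_email: bool
) -> bool:
    """Send only when there are failures, a key is configured, and --no-email is absent."""
    return bool(failed_urls) and bool(api_key) and not no_email


def urls_for_email_body(failed_urls: Sequence[str], limit: int = 50) -> List[str]:
    """Keep at most `limit` URLs for the body; full lists go in the attachments."""
    return list(failed_urls[:limit])
```

The default `limit` of 50 mirrors the `max_urls_in_email` value from the sample configuration.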
In config.ini:
[SELENIUM]
page_load_timeout = 20 # Increase if pages take long to load
element_wait_timeout = 15
[RETRY]
max_attempts = 5 # More attempts for unstable sites
retry_delay = 3 # More time between attempts
To speed up verification:
[PERFORMANCE]
page_delay = 0.2 # Reduce delay between pages
url_check_delay = 0.1 # Reduce delay between URLs
disable_images = true # Keep disabled
parallel_browsers = 5 # Maximum parallelization (use with caution)
For slow connections or resource-constrained systems:
[PERFORMANCE]
page_delay = 1.0 # Increase delay
url_check_delay = 0.5
parallel_browsers = 1 # Sequential mode (most stable)
Note: Parallel execution with multiple browsers significantly speeds up checking but increases CPU and memory usage. Monitor your system resources and adjust accordingly.
To run without graphical interface (servers):
[SELENIUM]
headless = true
Or via command line:
python main.py --headless
Linux:
# Ubuntu/Debian
sudo apt-get install chromium-browser chromium-chromedriver
# Fedora/RHEL
sudo dnf install chromium chromium-chromedriver
Windows/Mac: Install Google Chrome from the official website
Make sure virtual environment is activated and dependencies installed:
pip install -r requirements.txt
Check:
- Mailchimp API Key configured
- Recipient email configured
- mailchimp-transactional library installed
- Problematic URLs were found
- --no-email option not used
Reduce timeouts in config.ini:
[SELENIUM]
page_load_timeout = 10
element_wait_timeout = 5
Logs are saved in logs/url_checker_YYYYMMDD.log:
- INFO: Normal operations
- WARNING: Non-critical issues
- ERROR: Errors needing attention
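A dated log file of this form is straightforward to set up with the standard logging module. The sketch below is an assumption about the setup, not the project's code: `log_file_path` and `configure_logging` are hypothetical names, and mapping `--verbose` to the DEBUG level is a guess.

```python
import logging
from datetime import date
from pathlib import Path
from typing import Optional


def log_file_path(log_dir: str = "logs", day: Optional[date] = None) -> Path:
    """Build logs/url_checker_YYYYMMDD.log for the given day (default: today)."""
    day = day or date.today()
    return Path(log_dir) / f"url_checker_{day:%Y%m%d}.log"


def configure_logging(verbose: bool = False, log_dir: str = "logs") -> Path:
    """Route log records to the dated file; verbose lowers the threshold to DEBUG."""
    path = log_file_path(log_dir)
    path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if missing
    logging.basicConfig(
        filename=path,
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    return path
```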
Use --verbose for more detailed logs:
python main.py --verbose
Contributions are welcome! To contribute:
- Fork the project
- Create a branch for your feature (git checkout -b feature/MyFeature)
- Commit your changes (git commit -m 'Add MyFeature')
- Push to the branch (git push origin feature/MyFeature)
- Open a Pull Request
This project is under the MIT license. See the LICENSE file for more details.
Version: 2025.4
Last update: December 2025