A Python script to download competition problem sets from URLs and stitch the images together vertically into a single combined image.
- Downloads competition problem pages from a list of URLs
- Extracts images from the pages in order
- Stitches images vertically to create a single combined image
- Saves output images to the output/ directory with properly sanitized filenames
- Includes retry mechanisms and proper error handling
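The stitching step can be sketched with Pillow as follows. This is a minimal illustration, not the script's actual code: the function name `stitch_vertically`, the white background, and the choice to left-align narrower images are all assumptions.

```python
from PIL import Image


def stitch_vertically(images, background=(255, 255, 255)):
    """Stack PIL images top to bottom on a canvas as wide as the widest one.

    Hypothetical helper: narrower images are left-aligned and the unused
    area is filled with `background`.
    """
    if not images:
        raise ValueError("need at least one image")
    width = max(img.width for img in images)
    height = sum(img.height for img in images)
    canvas = Image.new("RGB", (width, height), background)
    y = 0
    for img in images:
        # Convert so palette or RGBA sources paste cleanly onto the RGB canvas
        canvas.paste(img.convert("RGB"), (0, y))
        y += img.height
    return canvas


if __name__ == "__main__":
    pages = [Image.new("RGB", (200, 100), "red"), Image.new("RGB", (150, 50), "blue")]
    stitch_vertically(pages).save("combined.png")
```

Sizing the canvas from the widest image avoids resizing any page, which keeps the problem text at its original resolution.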
- Python 3.x
- requests
- BeautifulSoup4
- Pillow (PIL)
- re (standard library, used for sanitizing filenames)
- Install the required packages:
  pip install requests beautifulsoup4 pillow
- Activate your virtual environment (recommended):
  source .venv/bin/activate # On Linux/Mac
  # or
  .venv\Scripts\activate # On Windows
- Create a target_urls file with the URLs you want to process, one per line
- Run the script:
  python paper_getter.py
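For reference, a one-URL-per-line file like target_urls could be read along these lines. This is a sketch only: the helper name `read_target_urls` is hypothetical, and skipping blank lines and `#` comment lines is an assumption rather than documented behavior of the script.

```python
from pathlib import Path


def read_target_urls(path="target_urls"):
    """Return the URLs listed in `path`, one per line, in file order.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


if __name__ == "__main__":
    for url in read_target_urls():
        print(url)
```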
- paper_getter.py: Main script that handles newer page structures (post-2021)
  - Designed to handle current page layouts
  - Looks for elements with classes content-desc and content-text
- paper_getter_old.py: Specialized script for 2021 and earlier competition problems
  - Only capable of extracting competition problem images from 2021
  - Does not support competition problem retrieval for years prior to 2020
  - Handles older page structures with classes like newsMain-content-title
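The class-based lookup described above might look roughly like this with BeautifulSoup. It is a sketch under assumptions: the function name `extract_image_urls`, the idea that the images sit in `<img>` tags inside those containers, and the de-duplication step are illustrative and not taken from the scripts themselves.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url, container_classes=("content-desc", "content-text")):
    """Collect absolute image URLs found inside the given containers, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    wanted = set(container_classes)
    # Match any tag that carries at least one of the wanted CSS classes
    containers = soup.find_all(lambda tag: wanted.intersection(tag.get("class", [])))
    urls = []
    for container in containers:
        for img in container.find_all("img"):
            src = img.get("src")
            if not src:
                continue  # skip <img> tags without a source
            full = urljoin(base_url, src)  # resolve relative paths against the page URL
            if full not in urls:  # nested containers could yield the same image twice
                urls.append(full)
    return urls
```

A different page layout would only need a different `container_classes` tuple, provided its images also live in `<img>` tags inside those containers.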
Processed images are saved in the output/ directory with filenames based on the competition title and a timestamp to prevent duplicates.
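One way to build such a filename is sketched below. The helper names, the exact set of replaced characters, the 100-character cap, and the `YYYYMMDD_HHMMSS` timestamp format are all assumptions made for illustration; the script may use different rules.

```python
import re
from datetime import datetime
from pathlib import Path


def sanitize_filename(title, max_length=100):
    """Replace characters that are invalid or awkward in filenames with underscores."""
    # Collapse runs of whitespace and Windows-reserved characters into one underscore
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title.strip())
    cleaned = cleaned.strip("._")
    return cleaned[:max_length] or "untitled"


def build_output_path(title, output_dir="output", now=None):
    """Return output_dir/<sanitized title>_<timestamp>.png (hypothetical naming scheme)."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{sanitize_filename(title)}_{stamp}.png"


if __name__ == "__main__":
    print(build_output_path("2023 Math: Paper 1/2?"))
```

A second-resolution timestamp keeps repeated runs on the same title from overwriting each other, as long as they do not finish within the same second.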
- The script adds delays between requests to be respectful to the server
- Images are saved in PNG format to preserve quality
- Filenames are sanitized to remove potentially problematic characters
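The delay and retry behavior could be combined in a small wrapper like the one below. This is a sketch, not the script's implementation: `fetch_with_retries` is a hypothetical name, the linearly growing pause is an assumed policy, and `fetch` stands for any callable that downloads a URL, for example a thin wrapper around `requests.get`.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing and retrying when it raises an exception.

    The pause grows with each failed attempt (delay, 2 * delay, ...), and the
    last error is chained to the RuntimeError raised once attempts run out.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real script would likely catch narrower errors
            last_error = exc
            if attempt < retries:
                sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing `sleep` as a parameter makes the wrapper easy to exercise without real waiting; for polite crawling, a fixed `time.sleep` between successive URLs would be added by the caller.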