You get:
- brittle selectors that break when the site changes
- no error handling for missing elements
- no rate limiting, which gets the scraper blocked
- no pagination handling
- scrapers that work once and never again
But a scraper is not a one-off script.
It is a maintainable data extraction system.
- Requests: handle headers, cookies, rate limiting
- Parsing: BeautifulSoup for static HTML, Selenium for JavaScript-rendered content
- Selectors: CSS or XPath (robust, not brittle)
- Error handling: missing elements, timeouts, retries
- Data storage: CSV, JSON, or database
Without structure, scrapers break and get blocked.
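To make the Requests and Error handling items concrete, here is a minimal fetch sketch. It uses the standard library's urllib in place of requests so it runs without extra installs. The `fetch_html` name, the User-Agent string, and the retry defaults are assumptions chosen for illustration, and the `opener` and `sleep` parameters exist only so the function can be exercised without a network.

```python
import time
import urllib.request

# Hypothetical identifying User-Agent; replace with your own contact details.
HEADERS = {"User-Agent": "example-scraper/0.1 (+https://example.com/contact)"}


def fetch_html(url, retries=3, backoff=1.0, timeout=10,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the page body as text, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        request = urllib.request.Request(url, headers=HEADERS)
        try:
            with opener(request, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except OSError as exc:
            # URLError, HTTPError, and socket timeouts all derive from OSError.
            print(f"attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                # Exponential backoff: 1s, 2s, 4s, ... with the defaults.
                sleep(backoff * 2 ** (attempt - 1))
    return None
```

Returning None instead of raising lets the caller log the failure and move on to the next page. A fuller version would probably skip retries on 4xx responses such as 404, since repeating those rarely helps.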
This framework makes the AI address every one of these components explicitly.
Assume the role of a web scraping engineer who builds robust, maintainable scrapers. Your task is to generate a web scraper script.

Generate:

1. IMPORTS
- requests, BeautifulSoup (or selenium)
- time, csv/json

2. SCRAPER CONFIGURATION
- Headers (User-Agent)
- Rate limiting (time.sleep)

3. FETCH FUNCTION
- Get page HTML
- Handle HTTP errors
- Retry logic

4. PARSE FUNCTION
- Extract data using CSS selectors or XPath
- Handle missing elements gracefully

5. PAGINATION HANDLING (if applicable)
- Loop through pages
- Stop when no more pages

6. DATA STORAGE
- Save to CSV or JSON

7. MAIN FUNCTION
- Orchestrate the scraping process

INPUTS:
Target URL: [INSERT]
Data to Extract (list fields with descriptions): [LIST]
Site Structure (how data is organized): [E.G., "Product listings on grid, each product has title, price, rating"]
Pagination Type: [NONE / NEXT BUTTON / URL PARAMETER / INFINITE SCROLL]
Dynamic Content (JavaScript rendered): [YES / NO]
Rate Limit (requests per second): [INSERT OR "1"]

RULES:
- Always set a User-Agent header (identify your scraper)
- Add delays between requests (be respectful)
- Use CSS selectors or XPath (not regex for HTML)
- Handle missing elements (set to None, don't crash)
- Save data incrementally (don't lose progress on error)
- Check robots.txt before scraping
- Respect website terms of service

- Check robots.txt before scraping (e.g., example.com/robots.txt).
- Start with a small test (limit pages to 5) before scaling.
- Add delays between requests (1-2 seconds minimum).
- If the data is rendered by JavaScript, load the page with Selenium; requests with BeautifulSoup only sees the initial HTML.
- Save data to CSV incrementally to avoid losing progress on errors.
- Be respectful — don’t hammer the server.
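The robots.txt check in the tips above can be automated with Python's built-in `urllib.robotparser`. The sketch below parses a made-up robots.txt string so it runs offline; the `example-scraper` user agent, the `load_rules` helper, and the sample rules are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical scraper name


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Made-up robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rules = load_rules(SAMPLE_ROBOTS)

# Allowed path: no Disallow rule matches it.
print(rules.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html"))
# Blocked path: it starts with /private/.
print(rules.can_fetch(USER_AGENT, "https://example.com/private/report.html"))
# Use the site's requested delay if it declares one, else fall back to 1 second.
delay = rules.crawl_delay(USER_AGENT) or 1.0
print(delay)
```

Against a live site, calling `rules.set_url(...)` with the robots.txt address followed by `rules.read()` downloads and parses the file in one step. Skipping any URL where `can_fetch` returns False keeps the scraper inside the site's stated rules.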
Target URL: https://books.toscrape.com
Data to Extract: Title (h3 > a title attribute), Price (price_color class), Rating (star-rating class), Availability (instock availability class)
Site Structure: Each book is in an article with class “product_pod”. Next page button exists.
Pagination Type: NEXT BUTTON
Dynamic Content: NO (static HTML)
Rate Limit: 1 request per second
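For inputs like these, the "set to None, don't crash" rule mostly comes down to how raw values are cleaned after extraction. Here is a small sketch of that step. It assumes the rating arrives as a list of class names such as `["star-rating", "Three"]` (BeautifulSoup returns `class` as a list) and the price as text with a currency symbol. The dictionary keys, helper names, and sample values are all hypothetical.

```python
from decimal import Decimal, InvalidOperation

# Assumed mapping from the rating word in the class list to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Turn text like '£12.50' into Decimal('12.50'); None if unusable."""
    if not text:
        return None
    digits = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


def parse_rating(class_names):
    """Turn ['star-rating', 'Three'] into 3; None if no rating word is found."""
    for name in class_names or []:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


def clean_book(raw):
    """Normalize one raw record; any missing field becomes None."""
    return {
        "title": (raw.get("title") or "").strip() or None,
        "price": parse_price(raw.get("price")),
        "rating": parse_rating(raw.get("rating_classes")),
        "availability": (raw.get("availability") or "").strip() or None,
    }


# Made-up sample record for illustration.
print(clean_book({
    "title": " Example Book ",
    "price": "£12.50",
    "rating_classes": ["star-rating", "Three"],
    "availability": "\n In stock \n",
}))
print(clean_book({}))  # every field comes back as None
```

Keeping the cleaning separate from the selectors means a changed page layout only breaks the extraction step, while the output format stays stable.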
This framework improves outcomes by forcing:
- proper headers (avoid blocks)
- rate limiting (be respectful)
- error handling (resilience)
- pagination (completeness)
- data storage (usability)
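The data storage point, combined with the rule about saving incrementally, might look like the sketch below. It appends rows to a CSV file after each page and writes the header only once. The `append_rows` name and the field list are assumptions for illustration.

```python
import csv
import os

# Assumed column order for the output file.
FIELDS = ["title", "price", "rating", "availability"]


def append_rows(path, rows, fieldnames=FIELDS):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        # None values are written as empty cells.
        writer.writerows(rows)
```

Calling `append_rows` once per scraped page means a crash on page 40 still leaves pages 1 through 39 on disk.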
Great web scrapers don’t just extract data — they handle errors, respect rate limits, and save progress.
