Scrapers that collect training data break every 2-4 weeks because sites change their HTML structure
You built a scraper to collect product reviews from e-commerce sites for sentiment-analysis training data. It worked perfectly for three weeks. Then the site redesigned its review section: the div class changed from 'review-content' to 'ReviewCard__content' and the star rating moved from an aria-label to a data attribute. Your scraper returns empty arrays. You fix it. Two weeks later, they add a React hydration layer that loads reviews client-side, so your scraper gets the server-rendered skeleton with no reviews in it. You switch to a Playwright headless browser. It works for a month. Then they add Cloudflare bot detection.

So what? Web scraping is the primary method for collecting real-world text data (reviews, forums, news, social media) for ML training. But every scraper is brittle, tied to specific HTML structures that change without warning. A pipeline that depends on 50 scrapers sees two or three of them break every week. Maintaining scrapers is a full-time job that produces nothing new: you are not building anything, just keeping existing collection alive.

Why does this persist? Websites have no obligation to keep their HTML stable. They optimize for user experience and A/B testing, not for scraper compatibility. APIs exist for some platforms (Reddit, Twitter) but are increasingly paywalled or rate-limited. Common Crawl provides historical snapshots but not real-time data. And there is no standard for machine-readable website content beyond RSS, which most sites have abandoned.
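To make the brittleness concrete, here is a minimal sketch of the selector-drift problem using BeautifulSoup and the class names from the scenario above. The selector list and the rating lookup are assumptions for illustration, not a fix: every redesign still means a human noticing the breakage and appending another guess by hand.

```python
# Sketch of a scraper that survives *known* markup changes by trying a
# fallback chain of selectors. Selectors are taken from the scenario above;
# anything not seen yet still produces nothing.
from bs4 import BeautifulSoup

REVIEW_SELECTORS = [
    "div.review-content",        # original markup
    "div.ReviewCard__content",   # after the redesign
]

def extract_reviews(html: str) -> list[str]:
    """Return review texts, trying each known selector in turn."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in REVIEW_SELECTORS:
        nodes = soup.select(selector)
        if nodes:
            return [node.get_text(strip=True) for node in nodes]
    # Every selector missed: the page changed again, or reviews now load
    # client-side and the server-rendered HTML is just a skeleton.
    raise RuntimeError("No review nodes matched any known selector")

def extract_rating(review_node) -> str | None:
    """Star rating: try the old aria-label first, then the new data attribute."""
    star = review_node.select_one("[aria-label*='star'], [data-rating]")
    if star is None:
        return None
    return star.get("data-rating") or star.get("aria-label")
```

The fallback chain buys a little resilience, but it only covers changes you have already seen; a rename you have not anticipated still returns nothing, which is exactly the silent empty-array failure described above.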
Evidence
- Common Crawl: 250B+ pages, but monthly snapshots, not real-time.
- Reddit API pricing: $0.24 per 1K API calls since June 2023.
- Twitter/X API: $42K/month for full access.
- Cloudflare bot detection blocks 30%+ of automated requests.
- Scrapy and BeautifulSoup are the most popular tools, but neither auto-repairs when the HTML changes.
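Once reviews hydrate client-side, static-HTML parsers like Scrapy and BeautifulSoup only ever see the empty skeleton, which is why the scenario above switches to a headless browser. A hedged sketch with Playwright's sync API follows; the URL, selector, and timeout are placeholder assumptions, and Cloudflare-style bot detection can still block it.

```python
# Sketch of the headless-browser fallback: render the page in Chromium,
# wait for the client-side reviews to appear, then hand the HTML to a
# normal parser (e.g. the extract_reviews() sketch above).
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str,
                        review_selector: str = "div.ReviewCard__content") -> str:
    """Load the page in headless Chromium and wait for reviews to hydrate."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Raises on timeout, which is itself a useful signal that the
        # structure (or the selector) has changed again.
        page.wait_for_selector(review_selector, timeout=15_000)
        html = page.content()
        browser.close()
        return html
```

This keeps the collection alive for a while, but it is slower, heavier, and, as the scenario notes, only holds until bot detection starts rejecting the automated browser.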