Collecting training data from APIs now costs 10-100x what it cost in 2022 because platforms monetized their data

In 2022, you could scrape Reddit through the free API (60 requests/minute), download Twitter's entire academic archive for free, use the Google Search API at $5 per 1,000 queries, and get Stack Overflow data dumps for free. In 2026, the Reddit API costs $0.24 per 1,000 calls (capped at specific tiers), the Twitter/X API is $42,000/month for full archive access, the Google Search API is $10 per 1,000 queries, and Stack Overflow has licensed its data exclusively to specific AI companies. Building the same training dataset that cost $500 in API calls in 2022 now costs $50,000-500,000.

So what? The era of cheap data collection is over. Platforms realized that their user-generated content is the raw material for AI training, and they want to be paid for it. This is economically rational for the platforms but devastating for AI startups and researchers. Only companies that can afford $42K/month for Twitter API access or multi-million-dollar Reddit data licenses can build models on social media data. University researchers are priced out entirely. The result: AI training data becomes a moat that favors incumbents who collected their data before the paywalls went up.

Why does this persist? Platforms have every right to monetize their data, but the transition was sudden: free APIs that researchers and startups depended on were shut down or repriced 100x within months. No alternative data sources emerged to fill the gap. Common Crawl provides web snapshots but excludes API-gated content. The data that is most valuable for AI (conversations, opinions, expert knowledge) is exactly the data that platforms are locking down.

Evidence

Reddit API pricing: $0.24/1K calls, effective June 2023. Twitter/X API: Free tier removed Jan 2023, Basic $200/month (limited), Pro $42K/month. Stack Overflow data licensing: exclusive deals with AI companies (reported 2024). Google Custom Search API: $5-10 per 1K queries. The Pile and RedPajama were built before API paywalls — equivalent datasets could not be built today at the same cost.
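
To make the arithmetic behind these prices concrete, here is a minimal sketch of a cost estimate in Python. The unit prices are the ones quoted in this post. The workload sizes (`REDDIT_CALLS`, `SEARCH_QUERIES`, `X_MONTHS`) are hypothetical values chosen only for illustration, and the function names are made up for this example rather than taken from any platform's SDK.

```python
# Rough API cost estimate using the prices quoted in this post.
# Workload sizes below are hypothetical, not measurements.

def per_call_cost(calls: int, usd_per_1k: float) -> float:
    """Cost of a metered API billed per 1,000 calls."""
    return calls / 1_000 * usd_per_1k


def subscription_cost(months: int, usd_per_month: float) -> float:
    """Cost of a flat monthly subscription tier."""
    return months * usd_per_month


# Prices as stated in the post.
REDDIT_USD_PER_1K = 0.24      # Reddit API, per 1,000 calls
SEARCH_USD_PER_1K = 10.0      # Google Search API, per 1,000 queries
X_USD_PER_MONTH = 42_000.0    # Twitter/X full archive access, per month

# Hypothetical workload (assumption for illustration only).
REDDIT_CALLS = 10_000_000
SEARCH_QUERIES = 100_000
X_MONTHS = 3


def estimate() -> dict:
    """Return a per-source cost breakdown plus the total, in USD."""
    costs = {
        "reddit": per_call_cost(REDDIT_CALLS, REDDIT_USD_PER_1K),
        "search": per_call_cost(SEARCH_QUERIES, SEARCH_USD_PER_1K),
        "x": subscription_cost(X_MONTHS, X_USD_PER_MONTH),
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    for source, usd in estimate().items():
        print(f"{source:>7}: ${usd:,.0f}")
```

Under these assumed volumes the flat subscription dominates the total, which is why metered and flat-fee sources are modeled separately: the per-call sources scale with dataset size, while the subscription scales with how long collection takes.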
