You've got a list of 1,000 URLs. You want to scrape them. About 300 of them are going to die on you, courtesy of NXDOMAIN, expired certs, dead origins, or parked domains. And you don't know which 300.
So what do you do?
The naive path
Hit the scrape endpoint with all 1,000. At $0.01 each, that's $10. Around 30% will fail, which is the empirical rate I see on URL lists older than a couple of months. You've spent $3 on garbage responses, burned ~300 worker calls, and probably tripped a few rate limiters on parked-domain servers that serve nothing but ads and JS challenges. Some of those failures look like real responses (200 OK, parking-page HTML), so your downstream pipeline thinks it got data. That's worse than a 4xx.
The cheaper path
Probe first. WebProbe runs DNS resolution, SSL chain validation, WHOIS expiry, and a HEAD-style liveness ping at $0.001 per URL. For 1,000 URLs that's a single dollar.
Here's the shape of the response:
{
  "url": "https://example.tld",
  "dns": { "resolved": true, "a": ["93.184.216.34"], "ns": ["a.iana-servers.net"] },
  "ssl": { "valid": true, "expires": "2026-08-13", "issuer": "DigiCert" },
  "whois": { "registered": true, "expires": "2027-12-09", "registrar": "RESERVED-Internet" },
  "live": { "status": 200, "ms": 142, "redirect_chain": [] },
  "verdict": "ok"
}
Drop every row where verdict != "ok" and you're left with maybe 700 URLs worth scraping. Cost so far: $1.
Then you scrape those 700 at $0.01 each. $7.
Total: $8 instead of $10, and you didn't send a single scrape request to a domain that was going to fail anyway.
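Here is a minimal sketch of that filter-then-scrape step, assuming the probe results arrive as a list of dicts shaped like the response above. The function names, the sample rows, and the non-"ok" verdict strings are made up for illustration; only the prices and the "ok" verdict come from this post.

```python
from typing import Iterable

PROBE_PRICE = 0.001   # dollars per URL, as quoted above
SCRAPE_PRICE = 0.01   # dollars per URL, as quoted above


def urls_worth_scraping(probe_results: Iterable[dict]) -> list[str]:
    """Keep only the URLs whose probe verdict is exactly "ok"."""
    return [r["url"] for r in probe_results if r.get("verdict") == "ok"]


def probe_first_cost(total_urls: int, ok_urls: int) -> float:
    """Probe every URL, then scrape only the ones that passed."""
    return total_urls * PROBE_PRICE + ok_urls * SCRAPE_PRICE


# Hypothetical sample rows; the failing verdict strings are assumptions.
sample = [
    {"url": "https://example.tld", "verdict": "ok"},
    {"url": "https://dead.example", "verdict": "dns_fail"},
    {"url": "https://parked.example", "verdict": "parked"},
]
keep = urls_worth_scraping(sample)  # only the first URL survives
```

Rows with a missing verdict are dropped too, which is the safer default when the goal is to avoid paying for doomed scrapes.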
Why this matters beyond two bucks
The two-dollar saving is the boring part. What matters more is what doesn't happen:
- You don't make 300 doomed scrape calls from your IP pool. Hitting parked domains and dead servers in sequence is a fast way to get fingerprinted as a bot, because no human user-agent hits 300 dead sites in a row.
- You don't pollute your dataset with parking-page HTML that returns 200. The WHOIS check catches the "registered to a domain squatter" case before the scraper ever sees the page.
- You don't waste retry budget. Most scrape pipelines retry 5xx three times. A dead origin eats the first attempt plus three retries, up to 4× the cost. The probe says "DNS doesn't resolve" once and the URL is gone from the queue.
When to skip the probe
If your list is hand-curated and fresh (under a week old, sourced from a live API), the failure rate drops to maybe 2%. At that point paying $1 to catch $0.20 of failures is silly. Just scrape.
If your list came from a crawl of a directory, an old sitemap, a CSV someone emailed you, or anything older than a month? Probe first. Break-even sits at a 10% failure rate: the probe costs a tenth of a scrape, so it pays for itself once it removes one URL in ten. Below that, scrape blind. Above that, $0.001 per URL is the cheapest insurance you'll buy.
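The break-even arithmetic can be sketched in a few lines. This assumes one billed scrape call per URL with no retries, and that the probe catches every URL that would have failed; both are simplifications, and the function names are hypothetical. Under those assumptions, probing first costs `n * probe + n * (1 - f) * scrape` against `n * scrape` for scraping blind, so the two are equal when the failure rate `f` equals `probe / scrape`.

```python
def blind_cost(n: int, scrape_price: float) -> float:
    """Scrape every URL without probing."""
    return n * scrape_price


def expected_probe_first_cost(
    n: int, failure_rate: float, probe_price: float, scrape_price: float
) -> float:
    """Probe all n URLs, then scrape only the fraction expected to be alive."""
    return n * probe_price + n * (1 - failure_rate) * scrape_price


def break_even_failure_rate(probe_price: float, scrape_price: float) -> float:
    """Failure rate at which probing first costs the same as scraping blind."""
    return probe_price / scrape_price


# Prices quoted in this post: $0.001 per probe, $0.01 per scrape.
threshold = break_even_failure_rate(0.001, 0.01)
stale_list = expected_probe_first_cost(1000, 0.30, 0.001, 0.01)  # 30% dead
fresh_list = expected_probe_first_cost(1000, 0.02, 0.001, 0.01)  # 2% dead
```

If failed scrapes are retried and each attempt is billed, every dead URL costs more than one call and the break-even point moves lower.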
The agent angle
For routing LLMs picking which endpoint to call, the rule is simple. If you have more than ~50 URLs of unknown freshness and the downstream task is expensive (scraping, screenshotting, archive submission, OCR), insert a WebProbe call first. The probe returns structured JSON that's trivial to filter on. No HTML parsing required, no regex on error pages, no guessing whether a 200 means a real page or a parking page, no manual spot-checks.
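As a rough illustration, that routing rule fits in a single predicate. Everything named here (the function, the task labels, the `freshness_known` flag) is a hypothetical stand-in, not part of any WebProbe API; the threshold of 50 and the list of expensive tasks are taken from the paragraph above.

```python
# Hypothetical labels for the expensive downstream tasks listed above.
EXPENSIVE_TASKS = {"scrape", "screenshot", "archive", "ocr"}


def should_probe_first(
    url_count: int,
    freshness_known: bool,
    task: str,
    threshold: int = 50,
) -> bool:
    """Return True when a probe pass should run before the expensive task.

    Mirrors the rule of thumb: more than ~50 URLs, unknown freshness,
    and an expensive downstream task.
    """
    return (
        url_count > threshold
        and not freshness_known
        and task in EXPENSIVE_TASKS
    )


decision = should_probe_first(url_count=1000, freshness_known=False, task="scrape")
```

A router could call this before dispatching a batch and fall through to the downstream endpoint when it returns False.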
WebProbe is one call with four checks at a penny per ten URLs. The scrape endpoint is ten times that. So when in doubt, probe.