Thirty thousand ATS feeds, one Discord channel, hourly heartbeat.

An autonomous job-listing scraper that quietly polls 30,000+ company ATS endpoints, deduplicates across runs with hash-keyed atomic inserts, and broadcasts fresh roles to Discord without ever posting the same job twice.

JobClaw
Period: Feb 2026 – present
Role: Sole engineer

Job boards lie about freshness. The same listing surfaces on three aggregators, two cross-posts, and a recruiter's feed, frequently after it has already closed. JobClaw skips the middlemen and pulls directly from each company's applicant tracking system (Greenhouse, Lever, Ashby, Workday, SmartRecruiters, Rippling), hashes every role, inserts it atomically, and posts only what's genuinely new.

Five workers, one hour, every hour

Different ATS feeds change at different speeds and have very different rate-limit profiles. Hammering every endpoint constantly is wasteful and gets you banned. The fix is five workers chained one after the other, running through the entire 30,000-company catalogue exactly once an hour, so every listing is at most ~60 minutes stale and a single ATS outage can't take down the rest of the chain:

  1. Worker 1: Fast Tier. RSS and GitHub feeds plus Greenhouse, Lever, and Ashby. The cheapest, fastest-moving sources, fetched first so the rest of the run already has a hot cache.
  2. Worker 2: Medium Tier. Workday, Rippling, and SmartRecruiters. Heavier endpoints with stricter rate limits, polled with backoff.
  3. Worker 3: Deep Push. A wide crawl across the full 30,000-company catalogue. Catches everything the tier scrapers don't cover and bounds long-tail staleness.
  4. Worker 4: Discord Push. Atomic broadcast of every newly hashed listing from this hour. Nothing is posted unless it cleared the dedup gate.
  5. Worker 5: Registry Expander. Discovers new ATS endpoints and feeds them back into the catalogue so it grows without me babysitting it.
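
The "polled with backoff" step in the Medium Tier could be sketched roughly as follows. This is a minimal illustration, not JobClaw's actual code: the `RateLimited` exception, the `poll_with_backoff` helper, and the default delays are all hypothetical, and the strategy shown is capped exponential backoff with full jitter.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that an endpoint asked the client to slow down (e.g. HTTP 429)."""


def poll_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on RateLimited with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # The ceiling doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount below the ceiling so that
            # many pollers do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.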

Each worker writes only its own slice of the SQLite table and triggers the next worker in line. The chain itself is the schedule.

[Image: JobClaw GitHub Actions workflows page showing the five sequential workers]
The five workers as GitHub Actions workflows — each completes before the next starts, and the chain runs every hour on the hour.
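
The real chain lives in GitHub Actions workflows, but the control flow can be sketched in a few lines of Python. This is an illustration under assumptions: `run_chain` and the worker names are hypothetical, and it assumes a failed worker is recorded and the chain simply moves on to the next one.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[], None]


def run_chain(workers: List[Tuple[str, Worker]]) -> Dict[str, str]:
    """Run workers strictly in order; a failure is recorded, not propagated."""
    results: Dict[str, str] = {}
    for name, worker in workers:
        try:
            worker()
            results[name] = "ok"
        except Exception as exc:
            # Assumption: one broken tier should not stop the later workers.
            results[name] = f"failed: {exc}"
    return results
```

Because each worker only starts once the previous one has returned, the ordering is the schedule: no timers or locks are needed inside the chain.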

Atomic dedup is the whole game

The interesting part isn't the scraping; it's the dedup. Each listing gets a stable hash computed from (company, title, location, source_url). Inserts go into a table with a uniqueness constraint on the hash, in a SQLite database running in WAL mode. If the insert succeeds, the listing is genuinely new and gets broadcast. If it conflicts, it's silently dropped. No race windows, no “was this posted yet?” lookups before the insert.
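
A minimal sketch of that gate, using Python's standard `sqlite3` and `hashlib` modules. The table name, column names, SHA-256 choice, and the normalisation applied before hashing are assumptions for illustration, not JobClaw's actual schema.

```python
import hashlib
import sqlite3


def listing_hash(company: str, title: str, location: str, source_url: str) -> str:
    """Stable hash over the four identifying fields."""
    # Assumed normalisation: trim whitespace everywhere, ignore case in the
    # human-written fields, leave the URL's case alone.
    parts = [
        company.strip().casefold(),
        title.strip().casefold(),
        location.strip().casefold(),
        source_url.strip(),
    ]
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    canonical = "\x1f".join(parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # WAL is set per database
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               hash       TEXT PRIMARY KEY,
               company    TEXT NOT NULL,
               title      TEXT NOT NULL,
               location   TEXT NOT NULL,
               source_url TEXT NOT NULL
           )"""
    )
    return conn


def insert_if_new(conn: sqlite3.Connection, company: str, title: str,
                  location: str, source_url: str) -> bool:
    """Return True only if this listing has never been stored before."""
    h = listing_hash(company, title, location, source_url)
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?)",
            (h, company, title, location, source_url),
        )
    # An ignored duplicate changes zero rows, so rowcount doubles as the verdict.
    return cur.rowcount == 1
```

The insert itself is the check: there is no separate SELECT whose answer could go stale between the lookup and the write.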

The decoupling matters too: each micro-scraper writes only its slice of the table and never reads another scraper's state. That lets me kill a single ATS poller without bringing down the rest of the fleet — and it makes the GitHub Actions concurrency model match the data model exactly.

Discord as the read path

Discord ended up being the right read path. It gives me free per-role channel routing (frontend, ML, infra, research), threading for follow-ups, rich embed cards with company + location + posted time, and a chat surface where I can drop “applied” / “skipping” reactions on each listing. A web dashboard would have been more polish for less signal.

[Image: Discord #general channel showing JobClaw broadcasting Software Engineer Frontend roles at Ramp and Suno with company, location, posted time, and source]
The output side — JobClaw posting fresh frontend roles. Each card carries source, company, location, and posted-time so triage takes seconds, not minutes.
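
The per-role routing and embed cards could be assembled along these lines. This is a sketch under assumptions: the keyword table, `route_channel`, and `build_embed_payload` are hypothetical names, and keyword matching on the title is only one plausible way to route. The payload shape (an `embeds` list whose entries carry `title`, `url`, `timestamp`, and `fields`) follows the Discord webhook message format.

```python
import re
from datetime import datetime, timezone
from typing import Any, Dict

# Hypothetical keyword table: channel name -> title tokens that select it.
ROLE_KEYWORDS = {
    "frontend": {"frontend", "front-end", "ui"},
    "ml": {"ml", "machine"},
    "infra": {"infra", "infrastructure", "sre", "devops", "platform"},
    "research": {"research", "researcher", "scientist"},
}


def route_channel(title: str) -> str:
    """Pick a channel from whole-word matches in the job title."""
    tokens = set(re.findall(r"[a-z0-9-]+", title.lower()))
    for channel, keywords in ROLE_KEYWORDS.items():
        if tokens & keywords:
            return channel  # first matching channel wins
    return "general"


def build_embed_payload(company: str, title: str, location: str, source: str,
                        source_url: str, posted_at: datetime) -> Dict[str, Any]:
    """Build the JSON body for one listing card (posted_at must be timezone-aware)."""
    return {
        "embeds": [
            {
                "title": title,
                "url": source_url,
                # Discord expects an ISO 8601 timestamp.
                "timestamp": posted_at.astimezone(timezone.utc).isoformat(),
                "fields": [
                    {"name": "Company", "value": company, "inline": True},
                    {"name": "Location", "value": location, "inline": True},
                    {"name": "Source", "value": source, "inline": True},
                ],
            }
        ]
    }
```

The resulting dictionary would be sent as the JSON body of a POST to the webhook URL of whichever channel `route_channel` selects.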

What I'd do next

  • Score listings before broadcasting. Right now every match for my role filter goes through. A small ranker on top of the dedup table would let me push only the top 20% to the main channel and send the rest to an archive.
  • Auto-extract apply links. Some ATSes hide the direct apply URL behind a redirect that breaks copy-paste. A tiny unwrap step would close the loop from Discord card to apply screen.
  • Promote the Registry Expander. Right now it only discovers Greenhouse/Lever endpoints because they have public company lists. Adding Workday subdomain enumeration would roughly double the catalogue.
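
The top-20% split from the first item could start as small as this. The `split_top_percent` helper and the idea of passing in a scoring function are assumptions for illustration; no ranker exists yet, so the scoring itself is left to the caller.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def split_top_percent(
    listings: List[T],
    score: Callable[[T], float],
    top_percent: int = 20,
) -> Tuple[List[T], List[T]]:
    """Return (main_channel, archive), best-scoring listings first."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be between 1 and 100")
    ranked = sorted(listings, key=score, reverse=True)
    # Integer ceiling division: any non-empty batch promotes at least one listing.
    cutoff = (len(ranked) * top_percent + 99) // 100
    return ranked[:cutoff], ranked[cutoff:]
```

Rounding the cutoff up is a design choice: a quiet hour with three new listings still surfaces one of them instead of archiving everything.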