Data Methodology & Sources
Everything published on Datamata Studios about tech skills demand is derived from real job listings. This page explains exactly where the data comes from, how it is processed, and what its limitations are.
Data Sources
All job listing data is sourced from the public APIs of three applicant-tracking system (ATS) providers:
- Greenhouse — One of the most widely adopted ATS platforms among mid-market and enterprise tech companies. We query the Greenhouse Job Board API for active, publicly visible listings.
- Lever — A popular choice for growth-stage technology companies. We use the Lever Postings API to fetch open roles.
- Ashby — A newer ATS gaining traction with engineering-led organisations. We query the Ashby Job Board API for current openings.
We do not scrape career pages directly or parse HTML from job boards. All data is fetched from official, rate-limited APIs in compliance with each provider's terms of service.
Collection Cadence
The pipeline runs daily at midnight Sydney time (AEDT/AEST, DST-safe). Each run:
- Fetches all currently active listings from each ATS API.
- Deduplicates by listing ID to avoid counting refreshed postings twice.
- Marks listings no longer present in the API as inactive.
- Extracts and normalises skill signals from role descriptions and required-skills fields.
- Writes a snapshot of skill demand percentages per category to the database.
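The deduplication and inactive-marking steps above can be sketched as a single reconciliation pass. This is a minimal illustration, not the production pipeline; the function name and data shapes are assumptions.

```python
def reconcile(previous_active_ids, fetched):
    """One run's core bookkeeping: dedupe fetched listings by ID and
    work out which previously active listings have disappeared.

    `fetched` is a list of dicts with an 'id' key (shape is illustrative).
    Returns (unique_listings, inactive_ids).
    """
    seen = set()
    unique = []
    for listing in fetched:
        if listing["id"] in seen:
            continue  # refreshed posting already counted this run
        seen.add(listing["id"])
        unique.append(listing)
    inactive = previous_active_ids - seen  # no longer present in the API
    return unique, inactive
```

Listings in `inactive` are marked inactive rather than deleted, so historical snapshots remain intact.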
Blog content (skill spotlights, comparisons, weekly pulses) is regenerated and published on the following schedule when data quality checks pass:
- Monday — Weekly skills pulse
- Tuesday & Thursday — Up to 20 skill spotlight pages (top-demanded skills)
- Wednesday — One head-to-head skill comparison
- 1 January, 1 April, 1 July, 1 October — Quarterly hiring report
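The schedule above amounts to a simple calendar check before each generation run. A minimal sketch, assuming a hypothetical `content_due` helper (the content-type names are illustrative):

```python
from datetime import date

def content_due(today: date) -> list[str]:
    """Return the content types scheduled for this date."""
    due = []
    if today.day == 1 and today.month in (1, 4, 7, 10):
        due.append("quarterly-report")
    weekday = today.weekday()  # Monday == 0
    if weekday == 0:
        due.append("weekly-pulse")
    if weekday in (1, 3):      # Tuesday, Thursday
        due.append("skill-spotlights")
    if weekday == 2:
        due.append("comparison")
    return due
```

Each item on the returned list is still subject to the quality gates described below before anything is published.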
Skill Normalisation
Raw listing text contains many variants of the same skill (e.g. "Python 3", "Python programming", "python (3.10+)"). We apply a multi-step normalisation pipeline:
- Case normalisation — all skill tokens are lowercased for matching.
- Canonical alias mapping — a curated dictionary maps known aliases to a single canonical name (e.g. "node" → "node.js").
- Stop-word filtering — generic terms like "programming", "experience" and "proficiency" are stripped from extracted tokens.
- Minimum frequency threshold — a skill must appear in at least 5 listings to be included in demand calculations.
Demand percentages are computed as:
skill_demand_pct = (listings_with_skill / total_active_listings) × 100
This is calculated separately for each role category (Data & Analytics, Software Engineering, Product & Design, DevOps & Infrastructure, Security, AI & Machine Learning).
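In code, the per-category calculation looks roughly like this; the listing shape (`category` and `skills` fields) is an assumption for illustration:

```python
from collections import defaultdict

def demand_by_category(listings):
    """Compute skill_demand_pct separately for each role category.

    Each listing is assumed to look like
    {'category': 'Software Engineering', 'skills': {'python', 'sql'}}.
    """
    totals = defaultdict(int)
    with_skill = defaultdict(lambda: defaultdict(int))
    for listing in listings:
        totals[listing["category"]] += 1
        for skill in listing["skills"]:
            with_skill[listing["category"]][skill] += 1
    return {
        cat: {s: n / totals[cat] * 100 for s, n in skills.items()}
        for cat, skills in with_skill.items()
    }
```

Because `skills` is a set per listing, a skill mentioned several times in one description still counts that listing only once.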
Quality Gates
Before any generated post is published, it must pass automated quality checks:
- Minimum word count — posts shorter than 180 words are held as drafts.
- Minimum sample size — skill spotlights require at least 30 listings; weekly pulses require at least 300 total active listings; quarterly reports require at least 1,000.
- Anomaly detection — any single skill movement exceeding 35 percentage points week-over-week triggers a hold, as this likely indicates a data collection issue rather than genuine market movement.
- Meta field validation — meta title (40–65 chars) and meta description (120–165 chars) length checks run automatically.
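Taken together, the gates reduce to a single pass/fail check before publication. A minimal sketch; the post field names are assumptions, and the real checks run against the database:

```python
def passes_quality_gates(post):
    """Return (ok, reasons) for a generated post dict (fields illustrative)."""
    reasons = []
    if post["word_count"] < 180:
        reasons.append("below minimum word count")
    if post["sample_size"] < post["min_sample"]:
        reasons.append("sample size too small")
    if abs(post.get("max_wow_move_pp", 0)) > 35:
        reasons.append("anomalous week-over-week movement")
    if not 40 <= len(post["meta_title"]) <= 65:
        reasons.append("meta title length out of range")
    if not 120 <= len(post["meta_description"]) <= 165:
        reasons.append("meta description length out of range")
    return (not reasons), reasons
```

A post failing any check is held as a draft, with the reasons logged for review.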
Posts that fail a quality gate are saved as drafts and not surfaced to readers until the underlying issue is resolved.
Known Limitations
- ATS coverage is not universal. Many companies use custom career pages, LinkedIn, Indeed or other platforms not covered by Greenhouse, Lever or Ashby. Our dataset is a representative sample, not the full market.
- Large companies are over-represented. Enterprise organisations with high listing volumes contribute proportionally more signal. Smaller companies and non-tech industries are under-indexed.
- Salary data is sparse. Many listings do not publish compensation ranges. Salary figures shown are medians from the subset of listings that do include salary data and should not be treated as market benchmarks.
- Role category assignment is imperfect. Categories are assigned by the ATS customer at listing creation and may not match a standardised industry taxonomy.
- Geographic scope varies. We do not currently filter by geography. Demand percentages blend remote, hybrid and on-site roles globally, with a concentration in English-speaking markets.
- Trend data requires history. 30-day trends are only meaningful once sufficient snapshot history exists (approximately 5+ data points). Early snapshots may show large apparent movements as the baseline stabilises.
