Pipeline architecture
ECI HTML (results.eci.gov.in)         ─┐
                                       │
Census 2011 XLS (censusindia.gov.in) ──┤
                                       │
kracekumar 2021 CSV (GitHub)         ──┼─► Python (Polars + DuckDB)
                                       │      │
ECI Form 20 PDFs (TN CEO)            ──┤      ▼
  ↓ pdfplumber                         │   Hetzner S3 (s3://tnelection2026/)
                                       │     ├── raw/      ← exact bytes fetched
TN CEO voter rolls (CAPTCHA + PDF)   ──┤     └── curated/  ← typed Parquet, zstd
  ↓ 2captcha + tesseract               │      │
                                       │      ▼
DataMeet shapefile (GitHub)          ──┘   DuckDB queries → docs/insights/*.json
                                              │
                                              ▼
                                           VitePress site (this site)
Two hosts, one bucket
| Host | What runs here | Why |
|---|---|---|
| Laptop (Indian IP) | form20.py, voters.py — TN CEO scrapes | TN State sites geo-block foreign IPs |
| deemwar-app1 (Hetzner Germany, 12 cores, 62 GB RAM) | results.py, historical.py, demographics.py, path_a_build.py, voters_ocr.py | Compute-heavy work; no geo-blocking issues |
Both hosts use an env file of the same shape (Hetzner S3 credentials) and write to the same s3://tnelection2026/ bucket. Spikes are pure functions of network → S3, so it makes no difference which host runs them.
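One way to read "pure functions of network → S3" is that a spike takes its storage settings from the environment and nothing else from the host. A minimal sketch, assuming hypothetical variable names (`S3_ENDPOINT`, `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_BUCKET`); the real env file may use different keys:

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; the real env file may use different ones.
REQUIRED = ("S3_ENDPOINT", "S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY", "S3_BUCKET")


@dataclass(frozen=True)
class S3Config:
    endpoint: str
    access_key_id: str
    secret_access_key: str
    bucket: str


def load_s3_config(env: Mapping[str, str] = os.environ) -> S3Config:
    """Build the storage config from environment variables only."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Fail early so a misconfigured host never writes to the wrong place.
        raise KeyError(f"missing env vars: {', '.join(missing)}")
    return S3Config(
        endpoint=env["S3_ENDPOINT"],
        access_key_id=env["S3_ACCESS_KEY_ID"],
        secret_access_key=env["S3_SECRET_ACCESS_KEY"],
        bucket=env["S3_BUCKET"],
    )
```

Because the function accepts any mapping, the laptop and the server can run identical code and differ only in the file that populates the environment.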
Storage layout
s3://tnelection2026/
├── results/        raw HTML + curated parquet (AC × candidate × round, 2026)
├── form20/         raw PDF + curated parquet (booth × candidate, 2026; partial)
├── voters/         raw PDF + OCR'd voter list (booth-level, partial)
├── candidates/     MyNeta scrapes (deprioritized; present but not used here)
├── historical/     2011/2016/2021 raw + parquet
├── caste/          per-AC reservation flag + Wikipedia prose
├── religion/       state baseline + (when available) booth-inferred from voter names
├── alliance/       Wikipedia-sourced party↔alliance map per year
├── geo/            DataMeet AC polygons (WKT)
├── demographics/   Census 2011 district religion mix + AC-joined view
└── insights/       2021→2026 swing × demographic joins (curated)
Phases
Phase 1 — Raw dump
Every spike downloads exact bytes from the original source to s3://.../raw/. This is the audit trail. Never mutated. If a parser changes, we re-derive from raw.
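The "exact bytes, never mutated" rule can be sketched as a write-once put. This is an illustration only: `raw_key`, its `<dataset>/raw/<timestamp>_<url-hash>` layout, and the in-memory `WriteOnceStore` are hypothetical stand-ins for whatever the spikes actually do against S3.

```python
import hashlib
from datetime import datetime, timezone


def raw_key(dataset: str, url: str, fetched_at: datetime, ext: str) -> str:
    """Key for one fetch: dataset, UTC timestamp, and a short hash of the URL."""
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return f"{dataset}/raw/{stamp}_{url_hash}.{ext}"


class WriteOnceStore:
    """In-memory stand-in for the bucket that refuses to overwrite a key."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> str:
        if key in self._objects:
            raise FileExistsError(f"raw object already exists: {key}")
        self._objects[key] = body  # stored exactly as fetched, no decoding
        return hashlib.sha256(body).hexdigest()  # digest for the audit trail

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

Storing the body as `bytes` without decoding is what makes re-deriving from raw safe: a parser fix never has to guess at an encoding that an earlier run already applied.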
Phase 2 — Curated parquet
Each spike's curate() reads from raw/, normalises the data, and writes typed Parquet to s3://.../curated/year=YYYY/ac=NN/*.parquet. Polars handles in-memory work; DuckDB handles SQL and S3 reads.
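The year=YYYY/ac=NN path is a Hive-style partition layout: the partition values live in the object key, so a reader can filter on them without opening any file. A small sketch of building and parsing such keys; the helper names and the unpadded `ac` value are assumptions, not the pipeline's actual code:

```python
import re


def curated_key(dataset: str, year: int, ac_no: int, table: str) -> str:
    """Hive-style key: partition values are encoded in the path itself."""
    return f"{dataset}/curated/year={year}/ac={ac_no}/{table}.parquet"


# Matches every "name=value" path segment in a key.
_PARTITION = re.compile(r"(?:^|/)([A-Za-z_]+)=([^/]+)")


def partitions(key: str) -> dict[str, str]:
    """Recover partition columns from a key, as a Hive-aware reader would."""
    return dict(_PARTITION.findall(key))
```

With Polars, the file itself could then be written with `df.write_parquet(path, compression="zstd")`, which matches the "typed Parquet, zstd" note in the architecture diagram.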
Phase 3 — Insights
DuckDB queries across curated/ parquets produce small JSON insight files for the site. See pipelines/path_a_build.py for the complete end-to-end script.
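The last step of this phase, turning query rows into a small JSON file, might look like the sketch below. `write_insight` is a hypothetical helper, and it assumes the rows arrive as dictionaries (for example, built from a DuckDB result); pipelines/path_a_build.py remains the reference for what the pipeline really does.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def write_insight(rows: Iterable[Mapping[str, object]], path: Path) -> int:
    """Serialise query rows to a compact, stable JSON file; returns the row count."""
    records = [dict(row) for row in rows]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        # sort_keys keeps diffs between runs small; separators drop whitespace.
        json.dumps(records, ensure_ascii=False, sort_keys=True, separators=(",", ":")),
        encoding="utf-8",
    )
    return len(records)
```

Setting `ensure_ascii=False` keeps non-ASCII names readable in the output instead of escaping them.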
Why Parquet + zstd
A 174 KB HTML page (one AC's Roundwise) becomes an 8.7 KB Parquet file after parsing: about 20× smaller, with no information loss. DuckDB reads Parquet from S3 through its httpfs extension at near-native speed:
INSTALL httpfs; LOAD httpfs;
SET s3_endpoint='hel1.your-objectstorage.com';
SET s3_access_key_id='...';
SET s3_secret_access_key='...';

SELECT ac_no, ac_name, SUM(total_votes) AS total
FROM read_parquet(
    's3://tnelection2026/results/curated/year=2026/ac=*/candidate_totals.parquet',
    hive_partitioning=true
)
GROUP BY 1, 2
ORDER BY total DESC
LIMIT 5;
That's the whole architecture. No database server, no Spark, no cron Airflow. Bucket + Parquet + DuckDB on a 12-core box.
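The same query can be driven from Python with the `duckdb` package. The sketch below is an assumption about how that might look, not the pipeline's code: `top_acs_sql` and `run_top_acs` are hypothetical names, and the credentials are passed in rather than hard-coded.

```python
def top_acs_sql(bucket: str, year: int, limit: int = 5) -> str:
    """Render the top-ACs aggregation for a given bucket and year."""
    glob = f"s3://{bucket}/results/curated/year={int(year)}/ac=*/candidate_totals.parquet"
    return (
        "SELECT ac_no, ac_name, SUM(total_votes) AS total\n"
        f"FROM read_parquet('{glob}', hive_partitioning=true)\n"
        "GROUP BY 1, 2\n"
        f"ORDER BY total DESC LIMIT {int(limit)};"
    )


def run_top_acs(endpoint: str, key_id: str, secret: str, bucket: str, year: int) -> list:
    """Execute the query against S3; needs the third-party `duckdb` package."""
    import duckdb  # imported lazily so top_acs_sql() works without it

    con = duckdb.connect()  # in-memory database, nothing persisted
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    # Plain string interpolation: acceptable for trusted config, not for user input.
    con.execute(f"SET s3_endpoint='{endpoint}'")
    con.execute(f"SET s3_access_key_id='{key_id}'")
    con.execute(f"SET s3_secret_access_key='{secret}'")
    return con.execute(top_acs_sql(bucket, year)).fetchall()
```

Keeping the SQL in a pure function means the query text can be inspected or tested without network access or credentials.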
Two products under one repo
This pipeline is the analyser. There's a separate live scoreboard in the same repo (scrape.ts, update.sh, docs/insights.json) that ran every 2 minutes during the May 2026 counting day and was the user-facing dashboard. The two share no code; the analyser is additive.