Pipeline architecture

ECI HTML (results.eci.gov.in)        ─┐
                                       │
Census 2011 XLS (censusindia.gov.in) ──┤
                                       │
kracekumar 2021 CSV (GitHub)         ──┼─►  Python (Polars + DuckDB)
                                       │         │
ECI Form 20 PDFs (TN CEO)            ──┤         ▼
   ↓ pdfplumber                        │  Hetzner S3 (s3://tnelection2026/)
                                       │   ├── raw/      ← exact bytes fetched
TN CEO voter rolls (CAPTCHA + PDF)   ──┤   └── curated/  ← typed Parquet, zstd
   ↓ 2captcha + tesseract              │         │
                                       │         ▼
DataMeet shapefile (GitHub)          ──┘  DuckDB queries → docs/insights/*.json
                                                │
                                                ▼
                                       VitePress site (this site)

Two hosts, one bucket

Host	What runs here	Why
Laptop (Indian IP)	`form20.py`, `voters.py` — TN CEO scrapes	TN State sites geo-block foreign IPs
deemwar-app1 (Hetzner Germany, 12 cores, 62GB RAM)	`results.py`, `historical.py`, `demographics.py`, `path_a_build.py`, `voters_ocr.py`	Compute-heavy + no geo issues

Both hosts load an env file of the same shape (Hetzner S3 credentials) and write to the same s3://tnelection2026/ bucket. Each spike is a pure function from network to S3, so it does not matter which host runs it.
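As a minimal sketch of what "same env file shape" could mean in practice, the snippet below builds one config object from an environment mapping and fails fast if a key is missing. The variable names (S3_ENDPOINT, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY) and the S3Config / load_s3_config names are assumptions for illustration; the real env file may use different keys.

```python
from dataclasses import dataclass
from typing import Mapping

BUCKET = "tnelection2026"  # bucket name taken from this page

# Hypothetical variable names; the real env file may use different keys.
REQUIRED_KEYS = ("S3_ENDPOINT", "S3_ACCESS_KEY_ID", "S3_SECRET_ACCESS_KEY")


@dataclass(frozen=True)
class S3Config:
    endpoint: str
    access_key_id: str
    secret_access_key: str
    bucket: str = BUCKET


def load_s3_config(env: Mapping[str, str]) -> S3Config:
    """Build the S3 config from an environment mapping, failing fast on gaps."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing S3 settings: {', '.join(missing)}")
    return S3Config(
        endpoint=env["S3_ENDPOINT"],
        access_key_id=env["S3_ACCESS_KEY_ID"],
        secret_access_key=env["S3_SECRET_ACCESS_KEY"],
    )
```

On either host a script would call load_s3_config(os.environ) once at startup; nothing in the function depends on which machine it runs on.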

Storage layout

s3://tnelection2026/
├── results/      raw HTML + curated parquet (AC × candidate × round, 2026)
├── form20/       raw PDF + curated parquet (booth × candidate, 2026 — partial)
├── voters/       raw PDF + OCR'd voter list (booth-level, partial)
├── candidates/   MyNeta scrapes (deprioritized — present but not used here)
├── historical/   2011/2016/2021 raw + parquet
├── caste/        per-AC reservation flag + Wikipedia prose
├── religion/     state baseline + (when available) booth-inferred from voter names
├── alliance/     Wikipedia-sourced party↔alliance map per year
├── geo/          DataMeet AC polygons (WKT)
├── demographics/ Census 2011 district religion mix + AC-joined view
└── insights/     2021→2026 swing × demographic joins (curated)

Phases

Phase 1 — Raw dump

Every spike downloads the exact bytes from the original source to s3://.../raw/. This is the audit trail, and it is never mutated. If a parser changes, we re-derive everything downstream from raw.
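The write-once rule can be sketched as below. This is an illustration under assumptions, not the project's code: dump_raw is a hypothetical name, the fetcher is injected so the function stays host-independent, a plain dict stands in for the bucket, and the `<dataset>/raw/<name>` key layout is inferred from the storage layout above.

```python
import hashlib
from typing import Callable, Dict


def dump_raw(
    fetch: Callable[[str], bytes],
    store: Dict[str, bytes],
    dataset: str,
    name: str,
    url: str,
) -> str:
    """Fetch url and store the exact bytes under <dataset>/raw/<name>.

    Returns the SHA-256 hex digest of the stored bytes.
    """
    key = f"{dataset}/raw/{name}"
    if key in store:
        # Raw objects are never mutated: keep the first copy, report its digest.
        return hashlib.sha256(store[key]).hexdigest()
    payload = fetch(url)
    store[key] = payload  # stored as fetched: no decoding, no re-encoding
    return hashlib.sha256(payload).hexdigest()
```

Recording the digest next to each object would let a later run confirm that a curated file was derived from the same bytes.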

Phase 2 — Curated parquet

Each spike's curate() reads from raw/, normalises the data, and writes typed Parquet to s3://.../curated/year=YYYY/ac=NN/*.parquet. Polars handles in-memory work; DuckDB handles SQL and S3 reads.
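A rough sketch of the two ideas in this phase, typed rows and Hive-style partition keys, using only the standard library. The column names ac_no, ac_name, and total_votes come from the query on this page; the candidate field, the CandidateTotal and normalise names, and the unpadded ac= value are assumptions. The real curate() writes Parquet with Polars, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTotal:
    ac_no: int
    ac_name: str
    candidate: str  # hypothetical column name
    total_votes: int


def normalise(raw: dict) -> CandidateTotal:
    """Turn scraped strings into typed values (trim text, drop thousands separators)."""
    return CandidateTotal(
        ac_no=int(raw["ac_no"]),
        ac_name=raw["ac_name"].strip(),
        candidate=raw["candidate"].strip(),
        total_votes=int(raw["total_votes"].replace(",", "").strip()),
    )


def curated_key(dataset: str, year: int, ac_no: int, filename: str) -> str:
    """Hive-style key: <dataset>/curated/year=YYYY/ac=NN/<filename>."""
    return f"{dataset}/curated/year={year}/ac={ac_no}/{filename}"
```

Because the partition values live in the path, a reader that understands Hive partitioning can recover year and ac as columns without them being stored in each file.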

Phase 3 — Insights

DuckDB queries across curated/ parquets produce small JSON insight files for the site. See pipelines/path_a_build.py for the complete end-to-end script.
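The final step, turning query rows into a small JSON file, might look like the sketch below. It assumes rows arrive as tuples alongside a list of column names, which is how a DB-API style fetch returns them; write_insight and the example file name are hypothetical, and pipelines/path_a_build.py may do this differently.

```python
import json
from pathlib import Path
from typing import Iterable, Sequence


def write_insight(
    out_dir: Path, name: str, columns: Sequence[str], rows: Iterable[Sequence]
) -> Path:
    """Write query rows as a JSON array of objects to <out_dir>/<name>.json."""
    records = [dict(zip(columns, row)) for row in rows]
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.json"
    # ensure_ascii=False keeps Tamil names readable in the output file
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

A call such as write_insight(Path("docs/insights"), "top_acs", ["ac_no", "total"], rows) would place the file where the site expects its insight data.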

Why Parquet + zstd

A 174 KB HTML page (one AC's Roundwise) becomes an 8.7 KB Parquet file after parsing: roughly 20× smaller, with no loss of the tabular data. DuckDB reads Parquet from S3 through the httpfs extension at near-native speed:

sql

INSTALL httpfs; LOAD httpfs;
SET s3_endpoint='hel1.your-objectstorage.com';
SET s3_access_key_id='...';
SET s3_secret_access_key='...';

SELECT ac_no, ac_name, SUM(total_votes) AS total
FROM read_parquet('s3://tnelection2026/results/curated/year=2026/ac=*/candidate_totals.parquet',
                  hive_partitioning=true)
GROUP BY 1, 2
ORDER BY total DESC LIMIT 5;

That's the whole architecture. No database server, no Spark, no cron, no Airflow. Bucket + Parquet + DuckDB on a 12-core box.

Two products under one repo

This pipeline is the analyser. There's a separate live scoreboard in the same repo (scrape.ts, update.sh, docs/insights.json) that ran every 2 minutes during the May 2026 counting day and was the user-facing dashboard. The two share no code; the analyser is additive.
