Reproducing this analysis
Everything is deterministic from public sources. Going from a clean clone to fully regenerated insights takes about 10 minutes, assuming you already have Hetzner S3 credentials.
Prerequisites
- Python 3.11+
- uv for Python deps
- Hetzner Object Storage credentials (or any S3-compatible storage; swap the endpoint)
Setup
bash
git clone git@github.com:muthuishere/tnelection2026.git
cd tnelection2026
uv sync # installs boto3, duckdb, polars, httpx, pdfplumber...
Drop S3 creds into a file like infra/vault/production/.env.production:
bash
S3_ENDPOINT=hel1.your-objectstorage.com
S3_BUCKET=tnelection2026 # or whatever you provision
S3_REGION=hel1
S3_ACCESS_KEY=...
S3_SECRET_KEY=...
The infra/ directory is gitignored, so no credentials leak into the repository.
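If you would rather check the credentials file from Python before running any pipeline, a minimal sketch along these lines may help. The helper names `parse_env` and `missing_keys` are hypothetical and not part of the repository; the key names are the ones listed in the file above. It handles plain `KEY=VALUE` lines and trailing ` # ...` comments only, not quoting or variable expansion.

```python
from pathlib import Path

# Keys taken from the .env.production example above.
REQUIRED_KEYS = ("S3_ENDPOINT", "S3_BUCKET", "S3_REGION", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and full-line comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment such as "tnelection2026 # or whatever you provision".
        value = value.split(" #", 1)[0].strip()
        env[key.strip()] = value
    return env


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]


if __name__ == "__main__":
    path = Path("infra/vault/production/.env.production")
    if path.exists():
        print("missing keys:", missing_keys(parse_env(path.read_text())))
```

An empty list from `missing_keys` means every variable the later steps read from the environment is present.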
Provision the bucket
bash
set -a; . infra/vault/production/.env.production; set +a
uv run python pipelines/bootstrap_bucket.py
Idempotent, so it is safe to re-run. Creates the 9-dataset folder skeleton in your S3 bucket.
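To illustrate what "idempotent" means for a folder skeleton, here is a rough sketch. It is not the actual contents of pipelines/bootstrap_bucket.py; the function name `ensure_prefixes` and the `.keep` marker object are assumptions. The `client` argument is expected to behave like a boto3 S3 client, using only its `list_objects_v2` and `put_object` calls.

```python
def ensure_prefixes(client, bucket: str, prefixes: list[str]) -> list[str]:
    """Create a zero-byte marker under each prefix that holds no objects yet.

    Returns the prefixes that were created on this run, so a second run
    against the same bucket returns an empty list.
    """
    created: list[str] = []
    for prefix in prefixes:
        prefix = prefix.rstrip("/") + "/"
        resp = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        if resp.get("KeyCount", 0) > 0:
            continue  # something already lives under this prefix; leave it alone
        # Hypothetical marker object; S3 has no real directories.
        client.put_object(Bucket=bucket, Key=prefix + ".keep", Body=b"")
        created.append(prefix)
    return created
```

With real credentials the client would come from `boto3.client("s3", endpoint_url=..., region_name=..., aws_access_key_id=..., aws_secret_access_key=...)`; note that `endpoint_url` needs a scheme such as `https://` in front of the host.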
Run the ingestion spikes
bash
# 2026 ECI results — all 234 ACs
uv run python pipelines/results.py --ac all
# 2021 historical + Census 2011 demographics + final analysis
uv run python pipelines/path_a_build.py
path_a_build.py downloads the Census XLS and the kracekumar 2021 CSV the first time it runs, then builds the swing-by-religion join and writes docs/insights/path_a_summary.{json,md} plus all the curated parquets to S3.
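For readers unfamiliar with the term, the sketch below shows one common way to compute a per-party swing for a single AC. It assumes swing is the percentage-point change in vote share between the two elections and that a party absent from one election counts as 0% there; the real definition lives in path_a_build.py and may differ. The function name and the party labels in the usage note are placeholders, not real results.

```python
def swing_by_party(
    shares_2021: dict[str, float],
    shares_2026: dict[str, float],
) -> dict[str, float]:
    """Percentage-point change in vote share per party for one AC.

    Parties missing from either election are treated as having 0% there
    (an assumption, relevant for parties that contested only once).
    """
    parties = set(shares_2021) | set(shares_2026)
    return {
        party: round(shares_2026.get(party, 0.0) - shares_2021.get(party, 0.0), 2)
        for party in sorted(parties)
    }
```

For example, with made-up shares `{"PARTY_A": 45.0, "PARTY_B": 40.0}` for 2021 and `{"PARTY_A": 38.5, "PARTY_B": 30.0, "NEW_PARTY": 25.0}` for 2026, the new entrant's swing equals its full 2026 share.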
Render this site
bash
bun install # vitepress is in devDependencies
bun run docs:dev # local preview at http://localhost:5173
bun run docs:build # static build to site/.vitepress/dist
Re-run a single insight query
If you just want to query the existing parquets in S3 without re-ingesting:
python
import os
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# Connection settings come from the same variables as .env.production
con.execute(f"SET s3_endpoint='{os.environ['S3_ENDPOINT']}'")
con.execute(f"SET s3_region='{os.environ['S3_REGION']}'")
con.execute(f"SET s3_access_key_id='{os.environ['S3_ACCESS_KEY']}'")
con.execute(f"SET s3_secret_access_key='{os.environ['S3_SECRET_KEY']}'")

# Top 5 ACs by TVK swing (change the bucket name if you provisioned your own)
rows = con.execute("""
SELECT ac_no, ac_name, district_name, swing_pct
FROM read_parquet('s3://tnelection2026/insights/curated/swing_2021_to_2026.parquet')
WHERE party_norm = 'TVK'
ORDER BY swing_pct DESC
LIMIT 5
""").fetchall()
print(rows)
What's NOT in the public flow
- Hetzner S3 bucket itself — you'd need to provision your own (~$5/month for ~10 GB).
- 2captcha API key (only needed for Path B voter-roll OCR; not used in Path A).
- The site deploy step — currently manual; Cloudflare Pages or GitHub Pages both work.
Where to extend
- pipelines/path_a_build.py is the orchestrator. Add new joins / metrics here.
- _name_inference.py is the Tamil-aware surname classifier. Augment patterns there.
- voters_ocr.py is ready for Path B if you want booth-level data.
See pipeline architecture for the bigger picture.