Reproducing this analysis

Every step is deterministic and draws only on public sources. Going from a clean clone to fully regenerated insights takes about 10 minutes, assuming you already have Hetzner S3 credentials.

Prerequisites

  • Python 3.11+
  • uv for Python deps
  • Hetzner Object Storage credentials (or any S3-compatible storage — swap the endpoint)

Setup

bash
git clone git@github.com:muthuishere/tnelection2026.git
cd tnelection2026
uv sync                              # installs boto3, duckdb, polars, httpx, pdfplumber...

Drop S3 creds into a file like infra/vault/production/.env.production:

bash
S3_ENDPOINT=hel1.your-objectstorage.com
S3_BUCKET=tnelection2026   # or whatever you provision
S3_REGION=hel1
S3_ACCESS_KEY=...
S3_SECRET_KEY=...

The infra/ directory is gitignored — no creds leak.
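Before running any pipeline, it can help to confirm that the five variables from the file above actually reached the environment. The following is a minimal sketch, not part of the repository: the helper name `missing_s3_vars` is hypothetical, and the variable names are simply the ones shown in the example `.env.production`.

```python
import os

# Variable names copied from the .env.production example; nothing else is assumed.
REQUIRED_VARS = ("S3_ENDPOINT", "S3_BUCKET", "S3_REGION", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def missing_s3_vars(env):
    """Return the required variable names that are unset or empty in `env`.

    `env` is any mapping of names to values, for example os.environ.
    Hypothetical helper for illustration; the pipelines may validate differently.
    """
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_s3_vars(os.environ)` after sourcing the file should return an empty list; any names it does return are the ones still to be set.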

Provision the bucket

bash
set -a; . infra/vault/production/.env.production; set +a
uv run python pipelines/bootstrap_bucket.py

The script is idempotent, so it is safe to re-run. It creates the 9-dataset folder skeleton in your S3 bucket.
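To illustrate why a bootstrap step like this can be re-run safely, here is a rough sketch of one common approach: compute which placeholder keys are missing and create only those. This is an assumption about the general technique, not a description of what `bootstrap_bucket.py` does. The function name and the three prefixes are hypothetical; the real nine dataset names live in the script.

```python
def missing_placeholders(existing_keys, dataset_prefixes):
    """Return the placeholder keys that still need to be created.

    S3 has no real folders, only key prefixes, so a "folder skeleton" is
    commonly a set of zero-byte objects whose keys end in "/".
    """
    existing = set(existing_keys)
    wanted = [prefix.rstrip("/") + "/" for prefix in dataset_prefixes]
    return [key for key in wanted if key not in existing]


# Hypothetical prefixes for illustration only.
prefixes = ["results", "census", "insights"]

first_run = missing_placeholders([], prefixes)           # empty bucket
second_run = missing_placeholders(first_run, prefixes)   # bucket after first run

print(first_run)   # ['results/', 'census/', 'insights/']
print(second_run)  # []
```

The second call returns nothing to create, which is the property that makes a re-run a no-op.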

Run the ingestion spikes

bash
# 2026 ECI results — all 234 ACs
uv run python pipelines/results.py --ac all

# 2021 historical + Census 2011 demographics + final analysis
uv run python pipelines/path_a_build.py

On its first run, path_a_build.py downloads the Census XLS and the kracekumar 2021 CSV. It then builds the swing-by-religion join, writes docs/insights/path_a_summary.{json,md}, and writes all the curated parquets to S3.
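As a rough picture of what a swing figure means, the sketch below computes the percentage-point change in each party's vote share within an AC between two elections. It assumes that `swing_pct` follows this conventional definition, which the source does not confirm; the function names, the party labels `P1` to `P3`, and the vote counts are all made up for illustration.

```python
def vote_shares(rows):
    """Map (ac_no, party) to that party's percentage of the AC's total votes.

    `rows` is a list of (ac_no, party, votes) tuples, one row per AC and party.
    """
    totals = {}
    for ac_no, _party, votes in rows:
        totals[ac_no] = totals.get(ac_no, 0) + votes
    return {(ac_no, party): 100.0 * votes / totals[ac_no]
            for ac_no, party, votes in rows}


def swing(rows_old, rows_new):
    """Percentage-point change in vote share per (ac_no, party)."""
    old, new = vote_shares(rows_old), vote_shares(rows_new)
    # A party missing from one election is treated as having a 0% share.
    keys = set(old) | set(new)
    return {key: round(new.get(key, 0.0) - old.get(key, 0.0), 2) for key in keys}


# Toy data, not real results.
rows_2021 = [(1, "P1", 600), (1, "P2", 400)]
rows_2026 = [(1, "P1", 300), (1, "P2", 300), (1, "P3", 400)]

result = swing(rows_2021, rows_2026)
print(result[(1, "P3")])  # 40.0
```

Treating an absent party as 0% is what lets a party that first contested in the later election show up with a positive swing.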

Render this site

bash
bun install                          # vitepress is in devDependencies
bun run docs:dev                     # local preview at http://localhost:5173
bun run docs:build                   # static build to site/.vitepress/dist

Re-run a single insight query

If you just want to query the existing parquets in S3 without re-ingesting:

python
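import os

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Connection settings come from the same variables as .env.production.
con.execute(f"SET s3_endpoint='{os.environ['S3_ENDPOINT']}'")
con.execute(f"SET s3_region='{os.environ['S3_REGION']}'")
con.execute(f"SET s3_access_key_id='{os.environ['S3_ACCESS_KEY']}'")
con.execute(f"SET s3_secret_access_key='{os.environ['S3_SECRET_KEY']}'")

# Fall back to the default bucket name if S3_BUCKET is not set.
bucket = os.environ.get("S3_BUCKET", "tnelection2026")

# Top 5 ACs by TVK swing
rows = con.execute(f"""
  SELECT ac_no, ac_name, district_name, swing_pct
  FROM read_parquet('s3://{bucket}/insights/curated/swing_2021_to_2026.parquet')
  WHERE party_norm = 'TVK'
  ORDER BY swing_pct DESC
  LIMIT 5
""").fetchall()

for row in rows:
    print(row)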

What's NOT in the public flow

  • Hetzner S3 bucket itself — you'd need to provision your own (~$5/month for ~10 GB).
  • 2captcha API key (only needed for Path B voter-roll OCR; not used in Path A).
  • The site deploy step — currently manual; Cloudflare Pages or GitHub Pages both work.

Where to extend

  • pipelines/path_a_build.py is the orchestrator. Add new joins / metrics here.
  • _name_inference.py is the Tamil-aware surname classifier. Augment patterns there.
  • voters_ocr.py is ready for Path B if you want booth-level data.

See pipeline architecture for the bigger picture.

Built from public data — ECI, Census 2011, kracekumar/tn_elections.