Finding 5 — What this analysis cannot see
Honest gaps in the data, not in the method. Knowing what you can't answer is part of the answer.
Booth-level patterns
This site works at Assembly Constituency (AC) level. Within an AC, voting behaviour can vary widely: Anna Nagar might vote differently from Mogappair within Chennai-Avadi. None of that variation is visible here.
The booth-level data exists — ECI publishes Form 20 PDFs per AC, and TN CEO publishes electoral rolls per booth — but both are image PDFs requiring OCR. The pipeline supports this (pipelines/voters_ocr.py) but it would cost ~$2K of Google Cloud Vision API spend to OCR all 80,000 TN booths. See data sources for the cost breakdown.
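As a rough sanity check on that figure, the sketch below back-computes the per-booth budget implied by ~$2K across 80,000 booths, then shows how a total would scale with pages per booth. Only the two totals come from the text. The pages-per-booth values and the per-1,000-page price are placeholder assumptions, not numbers from this site's cost breakdown, so check current Cloud Vision pricing before relying on them.

```python
# Rough OCR budget arithmetic. Only TOTAL_BUDGET_USD and BOOTHS come from the
# text above; the price and page counts used below are illustrative assumptions.

TOTAL_BUDGET_USD = 2_000   # "~$2K" from the text
BOOTHS = 80_000            # "80,000 TN booths" from the text


def implied_budget_per_booth(total_usd: float, booths: int) -> float:
    """Budget available per booth if the total is spread evenly."""
    return total_usd / booths


def estimate_ocr_cost(booths: int, pages_per_booth: float,
                      usd_per_1000_pages: float) -> float:
    """Total OCR spend for a given page count and per-1,000-page price."""
    total_pages = booths * pages_per_booth
    return total_pages / 1000 * usd_per_1000_pages


per_booth = implied_budget_per_booth(TOTAL_BUDGET_USD, BOOTHS)
print(f"Implied budget per booth: ${per_booth:.3f}")

# Hypothetical inputs: vary them to see how sensitive the total is.
for pages in (10, 20, 40):
    cost = estimate_ocr_cost(BOOTHS, pages, usd_per_1000_pages=1.50)
    print(f"{pages} pages/booth -> ${cost:,.0f}")
```

The point of the loop is that the total is linear in page count, so the estimate is only as good as the pages-per-booth assumption.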
Caste at jati level
Religion split (Hindu / Muslim / Christian / Sikh / Buddhist / Jain / Other) is in the Census 2011 data at sub-district level. Caste at jati level (Iyer, Mudaliar, Vanniyar, Thevar, etc.) does not exist in any open structured dataset.
What we have and what we don't:
- Reservation status per AC (General / SC / ST): official, exact.
- SC% and ST% per district: available via Census A-10 / A-11, but not yet ingested in this pipeline.
- Jati composition per AC: not in open data. SECC 2011 collected it, but it was never publicly released.
What we could build with extra work:
- Apply a name-classifier (pipelines/_name_inference.py) to OCR'd voter rolls → infer caste at booth level. Heuristic, not authoritative. Would work better on full names than the initial-style names common in TN.
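To make the initial-style caveat concrete, here is a minimal sketch that estimates what share of a roll has a spelled-out second name at all, which is the upper bound on what any name-based heuristic could attempt. It is not the logic in pipelines/_name_inference.py. The function names are hypothetical, and it assumes romanised names in which tokens of one or two letters are initials.

```python
import re


def spelled_out_tokens(name: str) -> list[str]:
    """Split on whitespace and dots; keep alphabetic tokens of 3+ letters.

    Assumption: romanised names, where 1-2 letter tokens are initials.
    A genuine two-letter name would be miscounted as an initial.
    """
    tokens = re.split(r"[\s.]+", name.strip())
    return [t for t in tokens if len(t) >= 3 and t.isalpha()]


def is_initial_style(name: str) -> bool:
    """True when at most one token is spelled out, e.g. 'K. Ramesh'."""
    return len(spelled_out_tokens(name)) <= 1


def classifiable_share(names: list[str]) -> float:
    """Fraction of names with a spelled-out second token a classifier could use."""
    if not names:
        return 0.0
    usable = sum(1 for n in names if not is_initial_style(n))
    return usable / len(names)


# Made-up sample names, for illustration only.
sample = ["K. Ramesh", "S.K. Ramesh", "Ramesh Kumar", "R Lakshmi"]
print(classifiable_share(sample))
```

Running a check like this on a few OCR'd rolls before building the classifier would show whether coverage is high enough to justify the effort.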
Post-2011 new districts
Tamil Nadu added several districts after the 2011 Census (Tirupathur, Mayiladuthurai, Ranipet, Chengalpattu, Kallakurichi, Tenkasi). The 234 ACs span all 38 current districts, but Census religion data is at the 32-district 2011 boundary. We attached religion to 203 / 234 ACs; the 31 unmatched are in those new districts.
A manual parent-district mapping (each new district has a parent on the 2011 list) would close this gap. Future work.
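A minimal sketch of that mapping follows. The parent assignments reflect the districts each new one was carved from. The spellings, the row layout, and the function names are assumptions and would need to match whatever the Census table and the AC table actually use. Using a parent's district-level figure for a new district is an approximation, not a re-aggregation.

```python
# New district -> 2011 parent district. Spellings are assumptions and must be
# aligned with the spellings used in the Census table.
PARENT_DISTRICT_2011 = {
    "Tirupathur": "Vellore",
    "Ranipet": "Vellore",
    "Chengalpattu": "Kancheepuram",
    "Kallakurichi": "Viluppuram",
    "Tenkasi": "Tirunelveli",
    "Mayiladuthurai": "Nagapattinam",
}


def census_district(current_district: str) -> str:
    """Return the 2011 district whose Census row applies to this district."""
    return PARENT_DISTRICT_2011.get(current_district, current_district)


def attach_religion(ac_rows, religion_by_district):
    """Attach a Census record to each AC; return (matched, unmatched) lists."""
    matched, unmatched = [], []
    for row in ac_rows:
        key = census_district(row["district"])
        if key in religion_by_district:
            matched.append({**row, "census_district": key,
                            "religion": religion_by_district[key]})
        else:
            unmatched.append(row)
    return matched, unmatched


# Placeholder rows: hypothetical AC ids and an empty stand-in Census record.
acs = [{"ac": "AC-001", "district": "Ranipet"},
       {"ac": "AC-002", "district": "Vellore"},
       {"ac": "AC-003", "district": "Unknown"}]
census = {"Vellore": {"source": "Census 2011 (placeholder)"}}

matched, unmatched = attach_religion(acs, census)
print(len(matched), len(unmatched))
```

Keeping the unmatched list explicit makes it easy to confirm that the count of unattached ACs actually drops to zero once the mapping is in place.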
Causality
This is correlation analysis. Even where signal exists (BJP × Muslim%, BJP × reserved seats), we can't say why.
A dataset combining booth-level results, candidate quality, and ground-game measures would let us attempt proper causal inference. We don't have any of those.
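For readers unfamiliar with what a pairing like "BJP × Muslim%" measures, the sketch below computes a correlation coefficient across a set of units. It assumes Pearson's r, which may not be the exact statistic used on this site, and the input numbers are made up purely to exercise the function. A value near +1 or -1 says the two series move together across ACs. It says nothing about why.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Synthetic values, not real AC data.
demographic_share = [5.0, 10.0, 15.0, 20.0]
vote_swing = [1.0, 0.5, 0.8, 0.2]
print(round(pearson_r(demographic_share, vote_swing), 3))
```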
2024 LS swing as second baseline
The 2024 Lok Sabha election in TN is a useful intermediate signal between 2021 and 2026 (TVK didn't contest 2024; DMK alliance dominated). Including it would sharpen the swing analysis — was the DMK→TVK movement already starting in LS 2024, or did it materialise in AE 2026?
Not ingested in this v1. Wikipedia mirror + ECI archive both have it.
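Once ingested, the second baseline would let the 2021 to 2026 swing be split into two legs per AC. The sketch below shows that arithmetic. It assumes LS 2024 results are available per assembly segment and that the party or alliance being tracked is comparable across all three elections. The function name and the vote shares are hypothetical.

```python
def decompose_swing(share_2021: float, share_2024: float, share_2026: float) -> dict:
    """Split a 2021 -> 2026 vote-share swing (percentage points) into two legs."""
    return {
        "2021_to_2024": share_2024 - share_2021,
        "2024_to_2026": share_2026 - share_2024,
        "total": share_2026 - share_2021,
    }


# Made-up vote shares for one AC, for illustration only.
legs = decompose_swing(share_2021=45.0, share_2024=47.0, share_2026=38.0)
print(legs)
```

If most of the total sits in the second leg, the movement materialised after LS 2024. If the first leg is already negative, it was under way earlier.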
Lokniti / CSDS post-poll caste data
CSDS Lokniti conducts post-poll surveys with self-reported caste questions. Releases typically land 6-9 months after counting. When the TN 2026 release drops, joining the AC-level swing here to Lokniti's per-AC caste tables would give a much sharper answer to "did caste predict the vote".
In short
The honest answer to "did caste/religion predict the 2026 TN swing" is:
- At AC level, no — not in a way that this open data can see.
- At booth level, we don't know — the data exists but isn't OCR'd.
- At individual voter level, Lokniti post-poll will tell us in ~6-9 months.
Path A (this site) is the AC-level answer. Path B (booth OCR + name-inference) is the next plateau.