Data sources
Every row in every parquet on this site is traceable to a publicly accessible source. We respect each source's license and attribute per row where the data permits.
Primary sources used in this analysis (Path A)
1. 2026 results — ECI
- URL: https://results.eci.gov.in/ResultAcGenMay2026/
- What: HTML pages: ConstituencywiseS22{ac}.htm (final tallies) and RoundwiseS22{ac}.htm (per-round detail), one per AC.
- Method: Custom-UA scrape (curl/8.4.0; ECI 403s default UAs). See pipelines/results.py.
- Granularity: AC × candidate × round.
- Cost: Free, public domain.
- In S3: s3://tnelection2026/results/raw/year=2026/ac=*/ (HTML) + curated/.../candidate_totals.parquet and round_votes.parquet.
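The custom-UA fetch described above can be sketched roughly as follows. This is a minimal illustration, not the contents of pipelines/results.py: the helper names (page_url, build_request, fetch) are hypothetical, and it assumes {ac} is substituted as the unpadded AC number.

```python
from urllib.request import Request, urlopen

BASE_URL = "https://results.eci.gov.in/ResultAcGenMay2026/"
# The UA string named in the Method line; default library UAs get a 403.
USER_AGENT = "curl/8.4.0"


def page_url(kind, ac):
    """Build a page URL; kind is "Constituencywise" or "Roundwise".

    Assumption: {ac} is the plain AC number with no zero padding.
    """
    return f"{BASE_URL}{kind}S22{ac}.htm"


def build_request(kind, ac):
    """Attach the custom User-Agent header to the request."""
    return Request(page_url(kind, ac), headers={"User-Agent": USER_AGENT})


def fetch(kind, ac, timeout=30):
    """Download one page and return its raw HTML bytes."""
    with urlopen(build_request(kind, ac), timeout=timeout) as resp:
        return resp.read()
```

Keeping URL construction separate from the network call makes the UA header and the URL pattern easy to check without hitting the ECI server.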
2. 2021 results — kracekumar/tn_elections
- URL: https://github.com/kracekumar/tn_elections/blob/master/2021_detailed_results.csv
- What: Pre-extracted CSV of all 234 TN AE 2021 results at candidate level.
- Method: Direct download.
- Granularity: AC × candidate.
- Cost: Free, open repo (no license stated but data is public-domain ECI scrape).
- In S3: s3://tnelection2026/historical/raw/year=2021/kracekumar_detailed_results.csv + parquet.
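Note that the URL listed is a GitHub blob page, which serves an HTML viewer rather than the CSV bytes. One way to do the direct download, assuming the pipeline does not already use the raw URL, is to rewrite the blob URL to its raw.githubusercontent.com equivalent. The blob_to_raw helper below is a hypothetical name, and the sketch assumes the branch name contains no slash.

```python
from urllib.parse import urlsplit


def blob_to_raw(blob_url):
    """Convert a github.com blob URL into its raw-content URL.

    Assumes the path has the shape owner/repo/blob/ref/path... and that
    ref (branch or tag) is a single path segment.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner, repo, _, ref, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, ref, *path])


raw_url = blob_to_raw(
    "https://github.com/kracekumar/tn_elections/blob/master/2021_detailed_results.csv"
)
```

The resulting raw_url can then be fetched with any HTTP client.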
3. Religion mix — Census of India 2011
- URL: https://censusindia.gov.in/nada/index.php/catalog/11392/download/14505/DDW33C-01%20MDDS.XLS
- What: C-01 "Population by Religious Community" for Tamil Nadu (state code 33), sub-district granularity.
- Method: Direct XLS download (requires verify_ssl=False; Census site has a known cert issue).
- Granularity: Sub-district (tehsil); we aggregate to district for the join.
- Cost: Free, public domain.
- In S3: s3://tnelection2026/demographics/raw/census2011/DDW33C-01_TN_religion.xls + curated/district_religion.parquet.
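The sub-district to district aggregation can be sketched as below. This is an illustration under stated assumptions, not the pipeline's actual code: the row keys (district, religion, persons) and the function name are hypothetical, and the real C-01 sheet has a different column layout that would need to be mapped into this shape first.

```python
from collections import defaultdict


def district_religion_shares(rows):
    """Sum sub-district counts per district and convert to shares.

    rows: iterable of dicts with hypothetical keys "district",
    "religion", and "persons", one per sub-district x religion.
    Returns {district: {religion: share_of_district_population}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row["district"]][row["religion"]] += int(row["persons"])

    shares = {}
    for district, by_religion in counts.items():
        total = sum(by_religion.values())
        # Guard against empty districts to avoid division by zero.
        shares[district] = (
            {rel: n / total for rel, n in by_religion.items()} if total else {}
        )
    return shares
```

Summing raw counts before dividing, rather than averaging sub-district percentages, keeps the district share weighted by population.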
4. Reservation status — derived from ECI AC names
- ECI tags reserved seats with (SC) or (ST) in the AC name string (e.g. "Ponneri (SC)").
- Confidence: high. Official source.
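Deriving the reservation status from the name string might look like the following. This is a sketch only: split_reservation is a hypothetical helper, and the "GEN" label for unreserved seats is an assumption, not a value taken from ECI.

```python
import re

# Match a trailing "(SC)" or "(ST)" tag; other parentheticals are left alone.
_RESERVED = re.compile(r"^(?P<name>.*?)\s*\((?P<tag>SC|ST)\)\s*$")


def split_reservation(ac_name):
    """Return (clean_name, status) where status is "SC", "ST", or "GEN"."""
    cleaned = ac_name.strip()
    match = _RESERVED.match(cleaned)
    if match:
        return match.group("name"), match.group("tag")
    # Assumption: seats without a tag are labeled "GEN" (general).
    return cleaned, "GEN"
```

Anchoring the pattern to the end of the string and restricting it to SC or ST means names that carry other parentheticals are not misread as reserved.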
Other sources we evaluated but didn't use here
| Source | Why considered | Why not used here |
|---|---|---|
| TCPD Lokdhaba (Ashoka) | AC-level results 1962-2021, free CSV | 2026 not yet ingested (typical 6-18 month lag); 2021 covered by kracekumar |
| SHRUG v2.1 (DevDataLab) | Sub-district demographics with built-in shrid→AC mapping | Auth-walled (registration required); we used coarser district fallback |
| Dataful.in TN 1967-2026 | Pre-joined historical CSV | Login-walled |
| MyNeta TN 2026 | Candidate assets / criminal cases | User deprioritized candidate-level data |
| Form 20 PDFs (TN CEO) | Booth-level vote tallies | Image PDFs, would require OCR |
| Electoral rolls (TN CEO) | Voter names per booth | CAPTCHA + image PDF, requires OCR |
| Susewind Nature 2025 | Booth-level results 2009-2019 (11 states incl. TN) | Pre-2026, valuable for future swing baselines |
| Lokniti CSDS post-poll | Survey-level caste-by-vote crosstabs | TN 2026 not yet released |
| SECC 2011 | Caste detail at sub-district | Caste names never publicly released |
| Grey-market voter list resellers | Booth-level + caste-tagged voter CSVs | DPDPA 2023 grey zone; ~Rs 50K per AC |
For the full Buy-vs-DIY matrix see pipelines/SOURCES_BUY_VS_DIY.md.
Cost summary
| Resource | Money spent |
|---|---|
| Hetzner Object Storage | covered by existing bucket subscription |
| 2captcha (used for ~10 booth PDFs during Path B prototyping) | ~$0.01 |
| Cloud OCR | $0 (not used for Path A) |
| Commercial voter data | $0 (didn't go grey-market) |
| Total Path A cost | $0 |
Licensing
- ECI data: public domain (Government of India).
- Census 2011: public domain.
- kracekumar/tn_elections: no explicit license stated; the underlying data is public domain. We treat it as MIT-equivalent.
- This site's analysis + code: MIT (when committed). All derived parquets attribute their source per row.