Real-World Government Scheme Data Audit (RGSSS)
This project covers data cleaning, validation, and anomaly detection on real records from the Rajiv Gandhi Social Security Scheme (RGSSS), implemented by the Revenue Department of the Government of Puducherry. The goal was to detect potential duplicate beneficiaries across multiple official PDF orders issued between October 2024 and April 2025.
Tools & Technologies Used
- Python (PyMuPDF, pandas, fuzzywuzzy, regex)
- CSV export for structured data reporting
- Command-line automation for batch processing of PDFs
What I did
- Skipped scanned first pages during PDF parsing.
- Extracted deceased name, claimant name, full address, amount, and file source.
- Removed relational markers and special characters from names.
- Applied fuzzy matching to detect duplicate claimants across files (name similarity ≥ 90%, address similarity ≥ 85%).
- Exported the final structured dataset to a CSV with 1,146 entries.
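The name-cleaning step could look roughly like the sketch below. It is an illustration, not the project's actual code: the function name `clean_name`, the regular expressions, and the rule that everything after a relational marker or an alias sign belongs to someone other than the claimant are all assumptions.

```python
import re

# Assumption: a relational marker (S/o, W/o, D/o) and everything after it
# names a relative, not the claimant, so the whole tail is dropped.
RELATION_TAIL = re.compile(r"\b[swd]\s*/\s*o\b.*$", re.IGNORECASE)
# Assumption: aliases are written as "@ Other Name" or "alias Other Name".
ALIAS_TAIL = re.compile(r"(?:@|\balias\b).*$", re.IGNORECASE)
# Anything that is not a letter or whitespace counts as a special character.
NON_NAME_CHARS = re.compile(r"[^A-Za-z\s]")


def clean_name(raw: str) -> str:
    """Return an upper-case name without relational markers, aliases, or punctuation."""
    name = RELATION_TAIL.sub("", raw)
    name = ALIAS_TAIL.sub("", name)
    name = NON_NAME_CHARS.sub(" ", name)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(name.split()).upper()


if __name__ == "__main__":
    print(clean_name("R. Lakshmi @ Latha, W/o. Raman"))
    print(clean_name("K. Murugan S/o Kannan"))
```

Upper-casing and whitespace collapsing make later comparisons insensitive to formatting differences between PDF orders.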
Objectives
- Extract structured data from scanned and semi-structured government PDFs.
- Normalize and clean claimant names (removing S/o, W/o, D/o, aliases).
- Separate amount data from unstructured address fields.
- Perform fuzzy matching to detect potential duplicate claims.
- Export the clean data into a CSV for verification and future analysis.
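As an illustration of separating the amount from an address field, here is a minimal sketch. It assumes the amount is the last number in the field, optionally prefixed with "Rs." or "₹" and suffixed with "/-"; the real PDF layout may differ. The function name `split_address_amount` and the pattern are hypothetical.

```python
import re

# Assumption: the amount sits at the very end of the field, written either
# with comma grouping (30,000 or 1,00,000) or as a plain run of 4+ digits.
AMOUNT_AT_END = re.compile(
    r"(?:Rs\.?|₹)?\s*(\d{1,3}(?:,\d{2,3})+|\d{4,})\s*(?:/-)?\s*$",
    re.IGNORECASE,
)


def split_address_amount(field: str):
    """Split a raw field into (address, amount); amount is None if not found."""
    match = AMOUNT_AT_END.search(field)
    if match is None:
        return field.strip(" ,"), None
    address = field[: match.start()].strip(" ,")
    amount = int(match.group(1).replace(",", ""))
    return address, amount


if __name__ == "__main__":
    print(split_address_amount("12, Main Road, Puducherry Rs. 30,000/-"))
    print(split_address_amount("12, Main Road"))
```

A pattern like this would mistake a trailing PIN code for an amount when the field contains no amount, so a real pipeline would need an extra check, for example a list of the sanctioned amounts.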
Data Sources


Screenshots Included
- Terminal output of data processing summary
- Preview of cleaned CSV file in tabular format
Key Highlights
- 1,090 real-world records extracted from official RGSSS PDF documents.
- Automated parsing of scanned/semi-scanned documents using Python.
- Cleaned and normalized claimant names by removing S/o, W/o, D/o, and aliases.
- Amount data separated from raw address fields using regex pattern matching.
- Fuzzy duplicate check across all files (name similarity ≥ 90%, address similarity ≥ 85%).
- CSV export ready for audit, dashboarding, or further analysis.
- Validation found zero critical anomalies in the official beneficiary records.
- Built a reusable toolkit for future scheme-level audits.
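The duplicate check with the two thresholds can be sketched as follows. To keep the example dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so scores may differ slightly from the project's. The record keys (`claimant`, `address`, `source`) and the function names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

NAME_THRESHOLD = 90     # name similarity threshold quoted in the text
ADDRESS_THRESHOLD = 85  # address similarity threshold quoted in the text


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity score from 0 to 100 (stdlib stand-in)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def find_duplicates(records):
    """Return (left, right, name_score, address_score) for every suspicious pair."""
    flagged = []
    for left, right in combinations(records, 2):
        name_score = similarity(left["claimant"], right["claimant"])
        if name_score < NAME_THRESHOLD:
            continue  # skip the address comparison when names already differ
        address_score = similarity(left["address"], right["address"])
        if address_score >= ADDRESS_THRESHOLD:
            flagged.append((left, right, name_score, address_score))
    return flagged


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"claimant": "R LAKSHMI", "address": "12 Main Road, Puducherry", "source": "order_a.pdf"},
        {"claimant": "R LAKSHMI", "address": "12 Main Rd, Puducherry", "source": "order_b.pdf"},
        {"claimant": "K MURUGAN", "address": "4 Beach Street, Karaikal", "source": "order_b.pdf"},
    ]
    for left, right, name_score, address_score in find_duplicates(sample):
        print(left["source"], right["source"], name_score, address_score)
```

Comparing every pair is quadratic, which means several hundred thousand comparisons for a dataset of roughly a thousand records. That is still practical at this size; larger audits would likely need blocking, for example comparing only names that share a first letter.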
Links
Presentation: Presentation 14.07.2025
Infographic Poster: Data Analytics for Monitoring & Evaluation of Government Schemes