- Challenge: Evaluate the performance of a constructed wetland treatment solution using 1,449 days of unstructured wastewater flow and water-quality sensor readings — no existing schema, no defined baseline, and findings subject to regulatory scrutiny.
- What I built: A Python and Pandas pipeline to clean, structure, and analyse the time-series sensor data, applying threshold-based detection to flag overflow events and surface seasonal flow patterns.
- Impact: Identified 463 overflow events and recurring seasonal trends; delivered a 5-KPI executive dashboard across 6 visualisations in Power BI, automating a previously manual compliance assessment process and supporting a projected 100% regulatory compliance outcome.
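The threshold-based detection described above can be sketched roughly as follows. This is a minimal illustration and not the production pipeline: the function name `flag_overflow_events`, the column names, and the threshold value are all assumptions, and it treats each unbroken run of above-threshold readings as one event.

```python
import pandas as pd


def flag_overflow_events(flow: pd.Series, threshold: float) -> pd.DataFrame:
    """Group consecutive above-threshold readings into discrete events.

    `flow` is assumed to be a time-indexed Series of flow readings.
    """
    above = flow > threshold
    # A new run starts whenever the above/below state flips.
    run_id = above.ne(above.shift(fill_value=False)).cumsum()
    frame = pd.DataFrame(
        {
            "timestamp": flow.index,
            "flow": flow.to_numpy(),
            "run": run_id.to_numpy(),
        }
    )
    # Keep only above-threshold rows, then summarise each run as one event.
    events = (
        frame.loc[above.to_numpy()]
        .groupby("run")
        .agg(
            start=("timestamp", "min"),
            end=("timestamp", "max"),
            peak_flow=("flow", "max"),
            readings=("flow", "count"),
        )
        .reset_index(drop=True)
    )
    return events
```

Counting events by `events["start"].dt.month` would then be one simple way to look for seasonal patterns, assuming a datetime index.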
Indhu Shree Prakash
Data Engineer
- ⚙️ I build pipelines. Not PowerPoints about pipelines.
- 🇬🇧 UK commercial experience: Severn Trent Water (FTSE 100).
Spent 1.5 years designing ETL pipelines, optimising SQL and data models, and automating the workflows behind enterprise reporting.
Experience across Boomi, Severn Trent Water, Cognizant and Infosys.
SQL for breakfast, data models for lunch, pipelines all afternoon, and Python refactoring for dinner.
From Analytics to Engineering
How reporting turned into engineering
I started in reporting and analytics. The job was to help stakeholders understand data. The unintended consequence was that I kept ending up nose-deep in the pipelines feeding those reports — asking why they were slow, who broke them, and whether anyone had written documentation. (The answer was usually no.)
At Boomi, working with 20M+ records across a 32-table platform exposed me to the real cost of poor data modelling: a 6 GB report that took too long, cost too much, and broke too often. Fixing it meant restructuring data models, rewriting SQL, and automating workflows that had been done by hand for months.
At Severn Trent Water, I built a pipeline from scratch — ingesting 1,449 days of sensor data, surfacing 463 overflow events, and delivering a compliance dashboard for an executive audience operating under regulatory scrutiny.
Most people see dashboards.
I became more interested in the pipelines behind them.
How It Evolved
- Reporting & Analytics
- SQL Optimisation
- Data Modelling
- Data Quality
- Automation
- Cloud Systems
- Data Engineering
ETL Pipelines
Designing reliable data pipelines that move information from operational systems to analytics platforms. Built ingestion, transformation, and validation workflows handling production-scale datasets across enterprise and utility environments.
Tools:
Python, Pandas, AWS Lambda, S3, Snowflake
Data Modelling
Building data models that make reporting faster, simpler, and more reliable. Restructured schemas, removed redundant joins, and improved data quality across analytics platforms serving business stakeholders.
Tools:
SQL, Snowflake, Power BI
SQL Optimisation
Optimising queries and reporting systems to reduce load times, lower costs, and improve maintainability. Reduced a reporting platform from 6 GB to 2.8 GB while improving dashboard performance by 35%.
Tools:
SQL, Snowflake, PostgreSQL
Cloud Engineering
Building event-driven cloud architectures that scale automatically without server management. Developed serverless workflows using AWS services for ingestion, processing, and storage.
Tools:
AWS S3, Lambda, DynamoDB, Docker, Python
Professional Experience
Where the work actually happened
- Challenge: A 6 GB Power BI platform spanning 32 source tables and ~20 million records had become slow to load, costly to query, and dependent on manual refreshes by analysts.
- What I did: Restructured the underlying data model to remove redundant joins, rewrote inefficient SQL queries, and built automated refresh and validation workflows in Snowflake to catch data quality issues before they reached the dashboard.
- Impact: Reduced platform size from 6 GB to 2.8 GB (53%), improved dashboard load times by 35%, cut query costs by 20%, reduced analyst reporting effort by 40%, and lowered data inconsistency incidents by 50%.
Earlier roles
- Challenge: Data preparation for reporting across 626K+ records and 14 relational tables was handled manually and inconsistently, increasing the risk of errors reaching downstream dashboards.
- What I did: Built reusable SQL transformation models to standardise data preparation, added automated validation checks to catch inconsistencies before they reached reporting layers, and exposed the cleaned data through a query-based backend for non-technical users.
- Impact: Reduced data pre-processing errors by 25% and gave non-technical stakeholders direct, validated access to 626K+ records across 14 tables — without analyst involvement.
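As an illustration of the automated validation checks mentioned above, here is a small sketch in which each check is a SQL query that counts rule violations. The `orders` and `customers` tables, the check names, and the use of SQLite are all hypothetical stand-ins; the original schema and rules are not described here.

```python
import sqlite3

# Hypothetical checks: each query counts rows that violate one rule.
CHECKS = {
    "null_customer_id": (
        "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
    ),
    "duplicate_order_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)"
    ),
    "orphaned_customer": (
        "SELECT COUNT(*) FROM orders o "
        "LEFT JOIN customers c ON o.customer_id = c.customer_id "
        "WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL"
    ),
}


def run_checks(conn: sqlite3.Connection, checks: dict) -> dict:
    """Return {check_name: violation_count} for every check."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with deliberately dirty rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(10, 1), (11, 2), (11, 2), (12, None), (13, 99)],
    )
    return conn


def failed_checks(results: dict) -> list:
    """Names of checks with at least one violation, sorted for stable output."""
    return sorted(name for name, count in results.items() if count > 0)
```

A reporting job could call `run_checks` before each refresh and stop if `failed_checks` returns anything, so that bad rows never reach the dashboard layer.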
- Challenge: Candidate-to-job matching relied on manual review across a large volume of job descriptions, with no benchmarked method for evaluating match quality.
- What I did: Benchmarked vector search (Qdrant with MiniLM embeddings) against machine learning-based matching approaches across 50+ job descriptions, scoring each method's retrieval accuracy against a set of labelled relevant matches.
- Impact: Vector search achieved 92% retrieval accuracy, outperforming machine learning approaches by 4–8 percentage points, and was adopted as the basis for a semantic matching pipeline using ranked retrieval.
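One way to score a retrieval method against labelled relevant matches is a top-k hit rate, sketched below. This is an assumption about the metric, since the exact definition of "retrieval accuracy" is not given above, and plain NumPy cosine similarity stands in for Qdrant and MiniLM embeddings. The function names are hypothetical.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> list:
    """Indices of the k documents most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k].tolist()


def hit_rate_at_k(query_vecs, doc_matrix: np.ndarray, relevant, k: int) -> float:
    """Fraction of queries with at least one labelled match in the top k.

    `relevant` is a list of sets of document indices, one set per query.
    """
    hits = 0
    for query_vec, labelled in zip(query_vecs, relevant):
        if labelled & set(top_k(np.asarray(query_vec, dtype=float), doc_matrix, k)):
            hits += 1
    return hits / len(relevant)
```

Running the same function over each candidate method's rankings gives a like-for-like comparison, provided every method is scored against the same labelled set and the same k.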
Selected Projects
Things I built (and what I learned fixing them)
Serverless Data Pipeline — AWS
Event-driven cloud pipeline with zero infrastructure management
- Problem
- Building a scalable ingestion-to-storage pipeline without standing infrastructure — no servers to provision, patch, or babysit.
- Why it was difficult
- Event-driven design requires each stage to be independently reliable, with failures caught without a central coordinator. Getting IAM permissions, trigger chains, and DynamoDB writes to work together cleanly took architectural discipline.
- Solution
- Architected a fully serverless pipeline: S3 ingestion triggers Lambda for stateless processing, Lambda writes structured records to DynamoDB. Each stage is decoupled and scales automatically.
- Impact
- 25+ assets processed through fully automated trigger-based workflows, with zero standing infrastructure — demonstrating an event-driven architecture: S3 → Lambda → DynamoDB.
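The S3 → Lambda → DynamoDB flow can be sketched as a single handler. This is an illustrative outline under stated assumptions and not the deployed code: the table name `assets` and the item fields are hypothetical, and the optional `table` argument exists only so the handler can be exercised locally without AWS.

```python
from urllib.parse import unquote_plus


def records_from_s3_event(event: dict) -> list:
    """Turn an S3 notification event into flat, structured records."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        items.append(
            {
                # Object keys arrive URL-encoded in S3 notifications.
                "object_key": unquote_plus(s3["object"]["key"]),
                "bucket": s3["bucket"]["name"],
                "size_bytes": s3["object"].get("size", 0),
                "event_time": record.get("eventTime", ""),
            }
        )
    return items


def handler(event, context, table=None):
    """Lambda entry point: stateless, one DynamoDB write per S3 object."""
    if table is None:
        # Imported lazily so the parsing logic can be tested without boto3.
        import boto3

        table = boto3.resource("dynamodb").Table("assets")  # hypothetical name
    items = records_from_s3_event(event)
    for item in items:
        table.put_item(Item=item)
    return {"written": len(items)}
```

Keeping the event parsing in a pure function separates the part that can be unit-tested from the part that depends on IAM permissions and the trigger configuration.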
BI Reporting Using Chat Interface
Self-service analytics layer on a 14-table relational database
- Problem
- Business stakeholders needed access to 626K+ records across 14 tables, but had no SQL knowledge. Every report request went through an analyst, creating a bottleneck in decision-making.
- Why it was difficult
- Natural-language questions are often ambiguous, and translating them into correct SQL across 14 joined tables required careful query construction to avoid returning incorrect or misleading results.
- Solution
- Built a Flask backend with a structured query generation layer, input validation pipelines, and two export formats — so stakeholders could extract data without analyst involvement.
- Impact
- 626K+ records queryable by non-technical users. Pre-processing errors reduced by 25%. Analyst reporting bottleneck eliminated.
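A structured query generation layer with input validation might look something like the sketch below. The schema in `ALLOWED`, the operator names, and the function `build_query` are hypothetical; the idea shown is that table and column names are checked against an allow-list while user-supplied values are only ever passed as bound parameters.

```python
# Hypothetical allow-list: table name -> columns that may be queried.
ALLOWED = {"orders": {"order_id", "region", "amount"}}
OPERATORS = {"eq": "=", "gt": ">", "lt": "<"}


def build_query(table: str, columns: list, filters: list, limit: int = 100):
    """Build a parameterised SELECT from validated, structured input.

    `filters` is a list of (column, operator_name, value) tuples.
    Returns (sql, params); raises ValueError on anything not allow-listed.
    """
    if table not in ALLOWED:
        raise ValueError(f"unknown table: {table!r}")
    allowed_cols = ALLOWED[table]
    if not columns:
        raise ValueError("at least one column is required")
    for col in columns:
        if col not in allowed_cols:
            raise ValueError(f"unknown column: {col!r}")

    where, params = [], []
    for col, op, value in filters:
        if col not in allowed_cols:
            raise ValueError(f"unknown filter column: {col!r}")
        if op not in OPERATORS:
            raise ValueError(f"unknown operator: {op!r}")
        # Identifiers come from the allow-list; the value stays a parameter.
        where.append(f"{col} {OPERATORS[op]} ?")
        params.append(value)

    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " LIMIT ?"
    params.append(int(limit))
    return sql, params
```

A web endpoint would pass the returned `sql` and `params` to the database driver's execute call, which keeps free-text input out of the SQL string entirely.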
AI StudyMate — RAG Learning Assistant
Retrieval-augmented generation system for academic study support
- Problem
- Students spend significant time searching across multiple disconnected resources to find relevant study materials, with no single interface able to surface topic-relevant answers.
- Why it was difficult
- Combining semantic retrieval, external service APIs, and AI-generated responses into a cohesive system required careful orchestration, with tolerance for retrieval latency and occasional misses.
- Solution
- Built a 6-module RAG system integrating 4 external services for semantic retrieval and AI-generated study content — unified through a single interface.
- Impact
- Topic-relevant study assistance across multiple content types, automated resource generation, and single-interface access to semantically retrieved materials.
NewsSwarm — Agentic News Automation
Agentic AI workflow for news ingestion, summarisation, and automated publishing
- Problem
- Manually sourcing, summarising, and publishing news content is time-consuming and does not scale. Each stage of the workflow (fetch, summarise, format, publish) was typically handled independently.
- Why it was difficult
- Integrating multiple external APIs into a coordinated multi-step workflow — where each stage depends on the output of the previous — required robust error handling, orchestration logic, and state management.
- Solution
- Developed an agentic system using Python, Groq LLM, NewsAPI, and social media APIs to automate the full pipeline: ingestion → summarisation → formatting → publishing.
- Impact
- Fully automated news workflow covering ingestion, summarisation, and publishing, enabling real-time aggregation and distribution without manual intervention.
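The staged workflow (ingestion → summarisation → formatting → publishing) can be outlined as a chain of stages with per-stage retries and a small state record. This is a simplified sketch: the stage functions would wrap the real NewsAPI, Groq, and social media calls, which are not shown, and the names `run_stage` and `run_pipeline` are hypothetical.

```python
import time


def run_stage(name, func, payload, retries=2, delay=0.0):
    """Run one stage, retrying on any exception up to `retries` extra times."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return func(payload)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(
        f"stage '{name}' failed after {retries + 1} attempts"
    ) from last_error


def run_pipeline(stages, payload):
    """Feed each stage the previous stage's output and track progress.

    `stages` is an ordered list of (name, callable) pairs.
    """
    state = {"completed": [], "payload": payload}
    for name, func in stages:
        state["payload"] = run_stage(name, func, state["payload"])
        state["completed"].append(name)
    return state
```

Because `state["completed"]` records which stages finished, a failed run shows exactly where the chain stopped, which matters when each stage depends on the one before it.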
Let's Talk