Softellar

ETL Monitoring and Observability Platform for a Multi-Tenant Data Provider

Enterprise observability for hundreds of warehouse jobs: dependency-aware alerting, deduplication, root-cause analysis, and AI-assisted remediation with safe automation.

Client

B2B data platform provider (multi-tenant analytics and warehouse product)

Location

North America

Platform

Cloud-native microservices (.NET APIs and workers, Next.js admin UI, Azure Container Apps)

Engagement Model

Dedicated Team

Team Size

7 specialists

Duration

14 months

Industries

Data Analytics
Software Products

Technologies

.NET
C#
AI
Next.js
TypeScript
PostgreSQL
Valkey
Redis
RabbitMQ
Temporal
Terraform
Azure
Azure Container Apps
Slack
GitHub Actions
ETL

About The Customer

The Customer is a B2B data company that packages warehouse and analytics capabilities for many downstream clients. Their product sits on top of a large surface area: scheduled ETL and ELT pipelines, API-driven integrations, and mixed database estates. A meaningful share of workloads touches regulated or sensitive categories, so operational discipline, auditability, and tenant isolation were non-negotiable from day one.

Key Highlights

  • Unified ingestion for health checks, job lifecycle signals, and API contract probes into one event model
  • Time-windowed alert deduplication and incident aggregation to cut notification noise
  • Dependency graph across jobs and systems to explain blast radius and group correlated failures
  • Root-cause analysis and severity classification wired into Slack with calm, deduped notifications
  • AI-assisted solution suggestions and optional GitHub pull requests for well-scoped, reviewed fixes - not silent production changes
  • Admin experience for tenants, job registry, dependencies, and routing rules to speed up onboarding of new clients

The Challenge

The operations team was underwater. Every missed SLA or late batch produced another page, often duplicated across channels when the same upstream failure tripped dozens of dependent jobs. Engineers spent the first thirty minutes of every incident asking the same questions: what actually broke first, which tenants were impacted, and which alerts were symptoms versus cause.

Scaling the business made this worse. Each new tenant historically meant cloning brittle monitoring configuration, re-wiring dashboards, and hoping naming conventions matched reality. The Customer needed a single platform that could represent jobs and dependencies truthfully, enforce multi-tenant boundaries, and support both immediate response (late runs, missed windows, failed steps) and proactive signals (uptime, synthetic API checks, log-derived anomalies) - without drowning the team in noise.

Pain Points

  • High alert volume with weak correlation across pipelines and warehouses
  • Slow mean time to resolution when failures cascaded through dependency chains
  • Inconsistent visibility across SQL Server and PostgreSQL schedulers, scripts, and external APIs
  • Heavy manual effort to onboard new clients into monitoring and ownership rules
  • Pressure to automate remediation without bypassing change control or audit expectations

Challenges We Addressed

  • Multi-tenancy: Every table and API path was designed with tenant_id first, aligned with registry and routing so isolation mistakes did not leak events across customers.
  • Event scale and bursts: Ingestion spikes during batch windows had to be absorbed without dropping signals or overloading the primary database.
  • Heterogeneous sources: Monitors in databases, HTTP health checks, and ETL wrappers had to normalize into one schema the graph and classifiers could reason about.
  • Safe automation: Auto-remediation and AI-generated fixes were constrained to pull requests, approvals, and audit logs - not blind production mutation.

Project Team Composition

  • 2 Senior .NET engineers (collector and admin APIs, orchestrator worker, integrations)
  • 1 Frontend engineer (Next.js admin UI, tenant and job management)
  • 1 Data / platform engineer (Postgres model, Timescale-style metrics patterns, graph modeling)
  • 1 DevOps engineer (Terraform, Azure Container Apps, CI/CD, observability baselines)
  • 1 ML / automation engineer (RCA enrichment, solution generation, GitHub Actions wiring)
  • 1 Project manager / business analyst (prioritization, stakeholder alignment, compliance touchpoints)

Our Solution

Softellar delivered a two-plane architecture. The data plane owns ingestion, durable processing, deduplication, graph and RCA computation, and outbound notifications. The control plane provides the admin UI and configuration APIs for tenants, jobs, dependencies, severity rules, and access - so operations and customer success could evolve the system without redeploying collectors.

ETL monitoring target architecture: data sources, noise reduction, AI intelligence, smart decisions, and automated actions with KPI badges
Target to-be architecture diagram: five stages from data sources and noise reduction through AI intelligence (dependency graph, RCA, pattern learning, metrics), smart decisions (severity, solutions, remediation, audit), to automated actions (Slack, client dashboard, PR automation).

Data plane: ingest, orchestrate, decide

External systems and agents post normalized events to a lightweight .NET collector API (health-style payloads with tenant_id, job_id, status, and structured details). The collector persists an append-only event record to PostgreSQL for traceability, then publishes to RabbitMQ so spikes never block the HTTP path.

A worker service backed by Temporal consumes the queue, executes activities with retries, and coordinates longer-running steps. Before opening a new incident, the worker checks Valkey (Redis-compatible) with TTL-based keys for deduplication - collapsing repeated failures into one actionable thread. Job state and graph-friendly projections remain in PostgreSQL, which also backs the registry of jobs, schedules, and dependencies.

Operations overview dashboard with KPI cards, activity trends chart, jobs by tenant chart, and system resource meters
Operations overview: active incidents and running jobs, 24-hour activity trends, jobs-by-tenant distribution, and live resource indicators (CPU, memory, queue depth, throughput).

Control plane: admin UI and policy

The operations and customer-facing teams use a React stack implemented with Next.js and TypeScript to list tenants, register jobs, edit dependency edges, tune severity and routing, and review audit history. A separate admin API enforces RBAC and keeps configuration changes explicit and versioned.

Dependency graph, RCA, and incident shaping

A dependency graph engine maps upstream and downstream relationships so a single root failure surfaces as one primary incident with linked affected jobs, instead of dozens of unrelated pages. Root-cause analysis combines structural signals from the graph with temporal ordering of events and log excerpts where available. A severity classifier scores customer impact and breach of SLO classes before Slack delivery.

Incident Activity tab showing timeline of deduplication, severity change, AI analysis, comments, and PR remediation
Incident Activity tab: audit-style timeline from auto-created incident and merged alerts through severity updates, AI analysis, operator comments, and actions such as opening a fix pull request.

AI-assisted solutions, .NET integration, and pull requests

For suitable classes of defects - configuration drift, known flaky transforms, missing retries - the solution generator proposes concrete changes. Those proposals materialize as GitHub Actions-driven branches and pull requests, preserving code review and deployment pipelines. Human approval stays in the loop; the platform accelerates drafting and standardizes fix patterns rather than mutating production directly.

Incident overview tab with affected jobs, AI root-cause analysis, related incidents, deduplicated alerts chart, and Generate Fix PR
Incident overview (Overview tab): affected jobs and statuses, AI root-cause narrative with confidence, related incidents, deduplicated alert trend, correlated services and deployments, recent log lines, and one-click suggested-fix pull request.

How AI is wired into the .NET stack

Inference runs out-of-process from interactive users: Temporal activities and background workers in .NET call a small AI gateway service (also .NET) using HttpClient with strict timeouts, retry budgets, and circuit breakers. The gateway's only job is to serialize a versioned analysis contract (JSON schema) into model prompts and parse structured responses back into domain objects: ranked root-cause hypotheses, short operator narratives, and optional patch hints that reference internal rule IDs rather than raw repository trees.

Orchestration stays deterministic: the workflow decides when to call the model, what contract version to use, and whether to skip AI entirely on hot paths. Model latency never blocks ingestion or deduplication; failed or slow calls degrade to heuristic RCA and still open incidents with a clear "AI unavailable" flag in audit metadata.

Security: no public AI and no sharing of proprietary code

The Customer barred sending intellectual property or full source to consumer-grade or multi-tenant public endpoints. Production traffic targets a private, customer-controlled deployment (Azure OpenAI in the same cloud tenant, private endpoints, and network rules that deny egress to unapproved hosts). Prompt construction is allow-listed: only normalized incident fields, dependency subgraph slices, scrubbed error codes, and short log fragments that already passed redaction pipelines are eligible fields - never entire files, solution folders, or connection strings.

  • Policy enforcement in code: the gateway rejects payloads that exceed field length, contain path-like tokens, or match regexes for secrets before a request leaves the VPC.
  • PR content: suggested fixes are expressed as diffs against approved templates or generated through internal static analysis plus model-generated descriptions; reviewers still see normal Git diffs in GitHub, not opaque remote edits.
  • Telemetry: optional prompt/output logging is off by default for production; where enabled, retention and encryption match the Customer's data-classification rules.

API, uptime, and ETL lifecycle coverage

Beyond batch status, the same pipeline ingests synthetic API checks (request/response validation, latency thresholds) and uptime probes. ETL jobs report started, completed, late, and missed outcomes through thin client libraries and decorators in both Python and .NET environments so engineers could adopt monitoring with minimal boilerplate.

Why We Built It This Way

  • Split collector and admin API: High-volume ingestion stays isolated from configuration workloads, limiting blast radius if either tier is stressed or misconfigured.
  • Message broker: RabbitMQ smooths bursty batch windows and gives the team explicit back-pressure and operational tooling.
  • Temporal: Durable workflows and retries replace ad-hoc cron glue for multi-step incident handling.
  • Valkey for deduplication: Sub-millisecond checks with TTL fit the hot path better than relational contention for noise reduction keys.
  • PostgreSQL as system of record: Events, registry, and projections share one auditable store the Customer could backup, replicate, and query for analytics.
  • Terraform and Azure Container Apps: Repeatable environments per stage, aligned with the Customer's cloud direction and autoscaling needs.

Our Approach

Delivery followed four deliberate phases:

  1. Discovery and architecture
    Mapped job inventories, tenant model, and compliance constraints; chose the split data/control plane, broker, and orchestration stack; defined the canonical event schema and deduplication strategy.
  2. MVP: trustworthy ingestion and noise reduction
    Shipped collector API, PostgreSQL persistence, RabbitMQ fan-out, worker with Valkey deduplication, Slack notifier, and minimal registry APIs.
  3. Graph, RCA, and admin experience
    Introduced dependency modeling, incident aggregation keys (root cause, affected system family, severity band), Next.js admin UI for tenants and jobs, and Grafana-style read paths for historical analytics.
  4. Intelligent remediation and hardening
    Layered solution generation, GitHub PR automation, expanded synthetic checks, RBAC refinements, load testing, and runbooks for on-call.

Results and Impact

Outcomes are directional and depend on tenant mix, but the Customer reported materially faster triage, fewer duplicate escalations, and a repeatable playbook for onboarding new clients onto shared monitoring primitives.

Business outcomes

  • Strong reduction in alert noise after deduplication and dependency-aware grouping
  • Faster incident leadership because primary cause and blast radius were visible early
  • Shorter time-to-first-dashboard for new tenants via registry-driven configuration
  • Clearer audit story for who changed routing, severity, or automation policy

Technical outcomes

  • Single event backbone for health checks, API probes, and ETL lifecycle telemetry
  • Resilient processing under batch spikes using the broker and Temporal
  • PR-based automation path that respects enterprise change control
  • Infrastructure-as-code baselines for consistent Azure deployments

Tools and Technologies

.NET, C#, AI-assisted services, Next.js, TypeScript, PostgreSQL, Valkey, RabbitMQ, Temporal, Terraform, Azure Container Apps, Slack, GitHub Actions, ETL integrations across SQL Server and PostgreSQL estates.

For organizations facing similar complexity, our data architecture and cloud architecture practices align engineering, reliability, and scale-up goals end to end.

Strengthen Your Data Platform and Operations

From data architecture to reliable delivery - we help teams ship observable, scalable systems.

Ready to Scale Your Development Team?

Let's discuss how our expert developers can help accelerate your project and achieve your business goals with cutting-edge technology solutions.