Enterprise observability for hundreds of warehouse jobs: dependency-aware alerting, deduplication, root-cause analysis, and AI-assisted remediation with safe automation.
Client: B2B data platform provider (multi-tenant analytics and warehouse product)
Location: North America
Platform: Cloud-native microservices (.NET APIs and workers, Next.js admin UI, Azure Container Apps)
Engagement Model: Dedicated Team
Team Size: 7 specialists
Duration: 14 months
The Customer is a B2B data company that packages warehouse and analytics capabilities for many downstream clients. Their product sits on top of a large surface area: scheduled ETL and ELT pipelines, API-driven integrations, and mixed database estates. A meaningful share of workloads touches regulated or sensitive categories, so operational discipline, auditability, and tenant isolation were non-negotiable from day one.
The operations team was underwater. Every missed SLA or late batch produced another page, often duplicated across channels when the same upstream failure tripped dozens of dependent jobs. Engineers spent the first thirty minutes of every incident asking the same questions: what actually broke first, which tenants were impacted, and which alerts were symptoms versus cause.
Scaling the business made this worse. Each new tenant historically meant cloning brittle monitoring configuration, re-wiring dashboards, and hoping naming conventions matched reality. The Customer needed a single platform that could represent jobs and dependencies truthfully, enforce multi-tenant boundaries, and support both immediate response (late runs, missed windows, failed steps) and proactive signals (uptime, synthetic API checks, log-derived anomalies) - without drowning the team in noise.
tenant_id first, aligned with registry and routing so isolation mistakes did not leak events across customers.
Softellar delivered a two-plane architecture. The data plane owns ingestion, durable processing, deduplication, graph and RCA computation, and outbound notifications. The control plane provides the admin UI and configuration APIs for tenants, jobs, dependencies, severity rules, and access - so operations and customer success could evolve the system without redeploying collectors.

External systems and agents post normalized events to a lightweight .NET collector API (health-style payloads with tenant_id, job_id, status, and structured details). The collector persists an append-only event record to PostgreSQL for traceability, then publishes to RabbitMQ so spikes never block the HTTP path.
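The collector flow described above can be sketched as follows. This is a minimal illustration in Python rather than the production .NET code; the status vocabulary, the `JobEvent` type, and the `append_event` and `publish` callables are assumptions made for the example, standing in for the PostgreSQL write and the RabbitMQ publish.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical status vocabulary: the case study names started, completed,
# late, and missed; "failed" is an assumption added for illustration.
ALLOWED_STATUSES = {"started", "completed", "late", "missed", "failed"}


@dataclass(frozen=True)
class JobEvent:
    tenant_id: str
    job_id: str
    status: str
    details: dict[str, Any] = field(default_factory=dict)


def parse_event(payload: dict[str, Any]) -> JobEvent:
    """Validate a raw payload and return a normalized event, or raise ValueError."""
    for key in ("tenant_id", "job_id", "status"):
        value = payload.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    status = payload["status"].strip().lower()
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status}")
    details = payload.get("details") or {}
    if not isinstance(details, dict):
        raise ValueError("details must be an object")
    return JobEvent(payload["tenant_id"].strip(), payload["job_id"].strip(), status, details)


def ingest(
    payload: dict[str, Any],
    append_event: Callable[[JobEvent], None],  # stand-in for the append-only store
    publish: Callable[[JobEvent], None],       # stand-in for the queue publisher
) -> JobEvent:
    """Persist first for traceability, then hand off to the queue."""
    event = parse_event(payload)
    append_event(event)
    publish(event)
    return event
```

Persisting before publishing means a queue outage can delay processing but cannot lose the record of what was received.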
A worker service backed by Temporal consumes the queue, executes activities with retries, and coordinates longer-running steps. Before opening a new incident, the worker checks Valkey (Redis-compatible) with TTL-based keys for deduplication - collapsing repeated failures into one actionable thread. Job state and graph-friendly projections remain in PostgreSQL, which also backs the registry of jobs, schedules, and dependencies.
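The TTL-based deduplication check can be illustrated with a small in-memory stand-in. This sketch does not talk to Valkey; the `TtlDeduplicator` class and the key layout in `dedup_key` are hypothetical and only mimic "set if absent, with expiry" semantics.

```python
import time
from typing import Callable


class TtlDeduplicator:
    """In-memory stand-in for a TTL key store (set-if-absent with expiry)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def first_seen(self, key: str) -> bool:
        """Return True if the key is new or expired, and claim it for ttl_seconds."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # a live key exists: treat as a repeat of an open incident
        self._expires_at[key] = now + self._ttl
        return True


def dedup_key(tenant_id: str, job_id: str, failure_kind: str) -> str:
    # Hypothetical key layout; tenant_id leads so keys never collide across tenants.
    return f"dedup:{tenant_id}:{job_id}:{failure_kind}"
```

Against a Redis-compatible store, the same check is typically a single `SET key value NX EX ttl` command, which makes the claim atomic across workers.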

The operations and customer-facing teams use an admin UI built with React, Next.js, and TypeScript to list tenants, register jobs, edit dependency edges, tune severity and routing, and review audit history. A separate admin API enforces RBAC and keeps configuration changes explicit and versioned.
A dependency graph engine maps upstream and downstream relationships so a single root failure surfaces as one primary incident with linked affected jobs, instead of dozens of unrelated pages. Root-cause analysis combines structural signals from the graph with temporal ordering of events and log excerpts where available. A severity classifier scores customer impact and SLO-class breaches before alerts are delivered to Slack.
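The structural part of that grouping can be sketched as below. This is a simplified illustration, not the platform's engine: it assumes an acyclic graph, uses only direct dependency edges, and ignores the temporal and log signals mentioned above. The function name and the job names in the usage note are hypothetical.

```python
def group_by_root(upstream: dict[str, set[str]], failed: set[str]) -> dict[str, set[str]]:
    """Map each root failure to the failed jobs that depend on it.

    upstream[job] is the set of jobs that `job` directly depends on.
    A failed job counts as a root when none of its direct upstream jobs failed.
    """
    roots = {job for job in failed if not (upstream.get(job, set()) & failed)}
    groups: dict[str, set[str]] = {root: set() for root in roots}
    for job in failed - roots:
        # Walk upward through failed jobs only, until root failures are reached.
        stack, seen = [job], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in roots:
                groups[current].add(job)
                continue
            stack.extend(upstream.get(current, set()) & failed)
    return groups
```

For example, with `upstream = {"load": {"extract"}, "report": {"load"}}` and `failed = {"extract", "load", "report"}`, the result is one group keyed by `"extract"` containing `"load"` and `"report"`, which corresponds to one primary incident with two linked affected jobs.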

For suitable classes of defects - configuration drift, known flaky transforms, missing retries - the solution generator proposes concrete changes. Those proposals materialize as GitHub Actions-driven branches and pull requests, preserving code review and deployment pipelines. Human approval stays in the loop; the platform accelerates drafting and standardizes fix patterns rather than mutating production directly.

Inference runs out-of-process from interactive users: Temporal activities and background workers in .NET call a small AI gateway service (also .NET) using HttpClient with strict timeouts, retry budgets, and circuit breakers. The gateway's only job is to serialize a versioned analysis contract (JSON schema) into model prompts and parse structured responses back into domain objects: ranked root-cause hypotheses, short operator narratives, and optional patch hints that reference internal rule IDs rather than raw repository trees.
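The circuit-breaker behavior mentioned above follows a common pattern, sketched here in Python for illustration. The production gateway client is .NET and its breaker settings are not given in the source, so the class, thresholds, and method names below are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int,
        reset_after: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True when a call to the gateway may be attempted."""
        if self._opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial call through (half-open).
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

A caller would check `allow()` before each gateway request and report the outcome afterwards, so a struggling model endpoint is left alone instead of being retried on every incident.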
Orchestration stays deterministic: the workflow decides when to call the model, what contract version to use, and whether to skip AI entirely on hot paths. Model latency never blocks ingestion or deduplication; failed or slow calls degrade to heuristic RCA and still open incidents with a clear "AI unavailable" flag in audit metadata.
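The degrade-to-heuristics behavior can be sketched like this. The function and field names are hypothetical; the sketch assumes the gateway client enforces its own timeout and raises an exception when a call fails or runs too long.

```python
from typing import Callable


def analyze_incident(
    incident: dict,
    heuristic_rca: Callable[[dict], list[str]],
    call_model: Callable[[dict], list[str]],
    use_ai: bool = True,
) -> dict:
    """Return ranked hypotheses, falling back to heuristics if the model call fails."""
    audit = {"ai_unavailable": False}
    if use_ai:  # the workflow, not the model, decides whether AI is consulted
        try:
            return {"hypotheses": call_model(incident), "source": "ai", "audit": audit}
        except Exception:
            # Timeouts and gateway errors must never block incident creation.
            audit["ai_unavailable"] = True
    return {"hypotheses": heuristic_rca(incident), "source": "heuristic", "audit": audit}
```

Passing `use_ai=False` models the hot-path case where the workflow skips the model entirely; the flag is set only when a call was attempted and failed.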
The Customer barred sending intellectual property or full source to consumer-grade or multi-tenant public endpoints. Production traffic targets a private, customer-controlled deployment (Azure OpenAI in the same cloud tenant, private endpoints, and network rules that deny egress to unapproved hosts). Prompt construction is allow-listed: only normalized incident fields, dependency subgraph slices, scrubbed error codes, and short log fragments that already passed redaction pipelines are eligible fields - never entire files, solution folders, or connection strings.
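Allow-listed prompt construction amounts to copying only named fields and dropping everything else. The field names and the length cap below are hypothetical, and the sketch assumes redaction has already happened upstream, as the text describes.

```python
# Hypothetical allow-list; the real contract fields are not given in the source.
ALLOWED_FIELDS = ("job_id", "status", "error_code", "dependency_slice", "log_fragment")
MAX_LOG_CHARS = 500  # hypothetical cap that keeps log fragments short


def build_prompt_fields(incident: dict) -> dict:
    """Copy only allow-listed fields; anything else never reaches the prompt."""
    fields = {key: incident[key] for key in ALLOWED_FIELDS if key in incident}
    fragment = fields.get("log_fragment")
    if isinstance(fragment, str):
        fields["log_fragment"] = fragment[:MAX_LOG_CHARS]
    return fields
```

Because the filter is an allow-list rather than a block-list, a new field added to incident records stays out of prompts until someone explicitly approves it.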
Beyond batch status, the same pipeline ingests synthetic API checks (request/response validation, latency thresholds) and uptime probes. ETL jobs report started, completed, late, and missed outcomes through thin client libraries and decorators in both Python and .NET environments so engineers could adopt monitoring with minimal boilerplate.
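A decorator of the kind described might look like the sketch below. The `monitored` name, its parameters, and the `report` callable are assumptions; in practice `report` would post to the collector API. The sketch emits only started, completed, and failed, on the assumption that late and missed outcomes are derived from schedules rather than reported by the job itself.

```python
import functools
from typing import Any, Callable


def monitored(job_id: str, tenant_id: str, report: Callable[[dict], None]):
    """Wrap a job function so it reports its lifecycle with no extra boilerplate."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            base = {"tenant_id": tenant_id, "job_id": job_id}
            report({**base, "status": "started"})
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # Report the failure, then re-raise so the caller still sees it.
                report({**base, "status": "failed", "details": {"error": type(exc).__name__}})
                raise
            report({**base, "status": "completed"})
            return result

        return wrapper

    return decorator
```

Usage is a single line above the job function, for example `@monitored("nightly_load", "tenant-a", report=send_to_collector)`, where `send_to_collector` is a hypothetical function that posts the event.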
Delivery followed four deliberate phases:
Outcomes are directional and depend on tenant mix, but the Customer reported materially faster triage, fewer duplicate escalations, and a repeatable playbook for onboarding new clients onto shared monitoring primitives.
.NET, C#, AI-assisted services, Next.js, TypeScript, PostgreSQL, Valkey, RabbitMQ, Temporal, Terraform, Azure Container Apps, Slack, GitHub Actions, ETL integrations across SQL Server and PostgreSQL estates.
For organizations facing similar complexity, our data architecture and cloud architecture practices align engineering, reliability, and scale-up goals end to end.
From data architecture to reliable delivery - we help teams ship observable, scalable systems.