Enterprise AIOps — Full Agentic Solution
Complete reference · Azure AI Foundry · 18 agents · ReAct loops · Learning system
18 true agents · 60+ skills · 20 tools · 4 memory layers
Full agentic · Azure Foundry · Self-improving
Architecture principle: Orchestrator and Memory are platform infrastructure — not agents. They are the ground every agent runs on. The 18 agents below are the only components with autonomous ReAct loops.
Platform infrastructure
Always-on · not agents · all 18 agents depend on these
⚙️
Orchestrator runtime
State machine (not an agent). IDLE→INIT→PLANNING→DISPATCHING→EXECUTING→REPLANNING→COMPLETING→AUDITING→DONE. Inner LLM generates plans. Outer shell is deterministic.
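The deterministic outer shell can be sketched as a transition table: the inner LLM proposes plans, but any state move not in the table is rejected. A minimal Python sketch — state names follow this document, the exact transition set is an assumption:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto(); INIT = auto(); PLANNING = auto()
    DISPATCHING = auto(); EXECUTING = auto(); REPLANNING = auto()
    COMPLETING = auto(); AUDITING = auto(); DONE = auto()

# Deterministic outer shell: only these transitions are legal.
# (Assumed: REPLANNING loops back to DISPATCHING or gives up to COMPLETING.)
TRANSITIONS = {
    State.IDLE: {State.INIT},
    State.INIT: {State.PLANNING},
    State.PLANNING: {State.DISPATCHING},
    State.DISPATCHING: {State.EXECUTING},
    State.EXECUTING: {State.REPLANNING, State.COMPLETING},
    State.REPLANNING: {State.DISPATCHING, State.COMPLETING},
    State.COMPLETING: {State.AUDITING},
    State.AUDITING: {State.DONE},
    State.DONE: set(),
}

def transition(current: State, nxt: State) -> State:
    """The LLM proposes; the shell disposes. Illegal moves never execute."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

The point of the table is that crash recovery is trivial: reload the persisted state from PostgreSQL and the set of legal next moves is fully determined.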
🗄️
Memory fabric
4-layer infrastructure service. Working (Redis) · Episodic (AI Search vectors) · Semantic KB (AI Search hybrid) · Knowledge Graph (Cosmos Gremlin).
⚖️
Policy engine + MCP gateway
OPA/Rego evaluates every action before execution. MCP Gateway routes all tool calls. Deny by default. Audit log per invocation.
Tier 1
Reasoning services — always running
Shared inference utilities. Called by all other agents when judgment is needed.
📐
Planner Agent
Goal → executable DAG. HTN planning, dynamic re-planning on failure.
HTN · DAG · Rollback plans
🧠
Reasoning Agent
Chain-of-thought under uncertainty. Never acts. Called by all agents needing judgment.
CoT · Bayesian · Explainability
Tier 2
Perception agents — sense the environment
Eyes and ears. Continuous signal ingestion and correlation.
🌊
Event Correlation
10,000+ alerts/hr → actionable incidents via temporal clustering.
Temporal · Topology group · Dedup
📡
Anomaly Detection
Statistical baselines + ML. Z-score, Holt-Winters, LSTM, isolation forest.
LSTM · Isolation Forest · Dynamic thresholds
🔄
Change Detection
Tracks deploys, config drift, Git commits. 80% of incidents follow a change.
Deploy track · Config drift · Git corr.
🕸️
Topology Discovery
Live service dependency mapping via mesh, API traces, network flows.
Service mesh · API trace · CMDB sync
📋
Log Intelligence
Structured insights from log streams. Drain clustering, NLP, log-to-metric.
Drain cluster · NLP · Log2metric
🛡️
Security Posture
CVE scanning, SIEM integration, compliance drift, access anomaly detection.
CVE scan · SIEM · Compliance drift
Tier 3
Analysis agents — think + diagnose
Heavy reasoners. Most LLM calls per incident.
🔬
Root Cause Analysis
Graph traversal + causal inference. True root cause vs downstream symptoms.
Graph traversal · Bayesian nets · 5-why
💥
Impact Analysis
Blast radius: users, revenue at risk, SLA breach probability, fan-out.
User sessions · Revenue model · SLA countdown
🔭
Predictive Analytics
Forecasts failures and capacity exhaustion before users are affected.
ARIMA · Prophet · What-if sim
⚡
Performance Analysis
Latency decomposition, flame graph correlation, SLO burn rate.
P99 analysis · Flame graphs · SLO burn
💰
Cost Optimization
Cloud spend, right-sizing, idle resource detection, FinOps integration.
Right-sizing · RI optimize · Waste detect
📜
Compliance Audit
SOC2/HIPAA/PCI-DSS. Auto-collects evidence, tests controls, detects drift.
SOC2 · HIPAA · PCI-DSS
Tier 4
Action agents — execute + communicate
Highest risk. All require blast radius enforcement and approval gates.
🔧
Auto-Remediation
Selects and executes fixes with safety guardrails. Dry-run, blast radius, rollback.
Runbook match · Dry-run · Blast radius
📢
Communication
Audience-aware NL summaries. Right person, right detail, right time.
NL summaries · PagerDuty · Slack/Teams
📈
Capacity Management
Auto-scales infra. HPA/VPA tuning, pre-scaling for known events.
HPA/VPA · Node scaling · Pre-scale
📝
Postmortem Agent
Auto-generates blameless postmortems with timeline, 5-why, action items.
Timeline rebuild · 5-why · Action items
🚦
Deployment Gate
Canary analysis, progressive rollout (1→25→100%), auto-rollback on SLI degradation.
Canary score · Progressive · Auto-rollback
🧩
Knowledge Curator
Learns from every incident. Closes the learning loop. Makes platform smarter.
Pattern extract · Runbook score · KG enrich

ReAct loops — what makes each agent truly agentic

Every agent has a defined Perceive → Reason → Act → Observe → Learn cycle. This is the loop that separates an agent from a function call. Without this loop defined for each agent, you have a catalog, not a running system.

The test of full agentic behavior: a new type of incident arrives that has never been seen before. A full agentic system reasons over the symptoms, retrieves the closest past incidents, constructs a novel hypothesis, plans cautiously, and escalates with a structured brief. An automated pipeline simply fails. The ReAct loop is what makes this possible.
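The cycle reduces to a single budgeted loop. Everything below is a stand-in — each agent supplies its own perceive/reason/act/observe/learn callables — but the shape of the stop conditions (confidence target, LLM call budget, escalation) follows this document:

```python
def react_loop(goal, perceive, reason, act, observe, learn,
               max_llm_calls=3, confidence_target=0.8):
    """Run Perceive→Reason→Act→Observe until a stop condition fires:
    confidence reached, LLM budget exhausted, or escalation requested."""
    context = perceive(goal)                 # PERCEIVE: seed from memory
    calls = 0
    while calls < max_llm_calls:
        thought = reason(context)            # REASON: one budgeted LLM call
        calls += 1
        if thought.get("escalate"):
            return {"status": "escalated", "context": context}
        result = act(thought)                # ACT: tool calls via MCP gateway
        context = observe(context, result)   # OBSERVE: fold results back in
        if context.get("confidence", 0.0) >= confidence_target:
            learn(context)                   # LEARN: write outcome to memory
            return {"status": "done", "context": context}
    return {"status": "budget_exhausted", "context": context}
```

Note that every exit path is explicit: there is no branch on which the loop can run forever or terminate without a status.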
📐 Planner Agent — ReAct loop
ReAct loop definition
PERCEIVE
Goal + context + constraints from Orchestrator. Top-3 similar past plans from Episodic Memory.
REASON
LLM calls (max 3): goal decomposition → dependency ordering → resource estimation. Output: validated DAG JSON.
ACT
No tool calls. Pure reasoning. Writes candidate plan to Working Memory for Orchestrator review.
OBSERVE
Orchestrator returns policy verdict. If blocked: re-plan around constraints (max 2 attempts).
LEARN
Writes plan outcome to Working Memory. Successful plans stored in Episodic Memory as templates.
STOP
Valid plan approved · OR · max 3 re-plans reached → escalate with explanation.
Max 3 LLM calls · No tool calls · Pure reasoning
🔬 Root Cause Analysis Agent — ReAct loop
ReAct loop definition
PERCEIVE
Correlated incident set + topology graph + change signals from WM. Top-5 similar past incidents via vector search.
REASON
LLM walks dependency graph upstream from symptoms. Builds 5-why chain. Bayesian scoring per candidate root cause. Calls Reasoning Agent for uncertainty arbitration.
ACT
metric_query · log_search · graph_query · vector_search via MCP. Writes RCA conclusion to WM + Knowledge Graph "caused" edge.
OBSERVE
Checks if conclusion explains ALL correlated alerts. If residual unexplained: multi-root detection pass.
LEARN
Writes RCA fingerprint to Episodic Memory. Updates Knowledge Graph confirmed causal edge.
STOP
Confidence ≥ 0.8 · OR · max 5 LLM calls · OR · human escalation.
Max 5 LLM calls · 4 tool types · Highest complexity
🔧 Auto-Remediation Agent — ReAct loop
ReAct loop definition
PERCEIVE
RCA conclusion + blast radius from WM. Matching runbooks scored by historical success from Semantic KB. Human approval status from Policy Engine.
REASON
LLM selects best runbook, parameterises for this incident. LLM generates dry-run validation criteria.
ACT
[1] Dry-run → log expected outcome. [2] Request approval (block until received for L1/L2). [3] Execute runbook steps via Skills. [4] Post-execution validation. [5] If not recovered: rollback → escalate.
OBSERVE
Monitors SLIs for 5min post-remediation before declaring success.
LEARN
Writes outcome to Episodic Memory. Updates runbook confidence score in Semantic KB.
STOP
SLIs recovered · OR · rollback executed · OR · human escalation triggered.
Never acts without approval · k8s_exec · cloud_exec
🧩 Knowledge Curator Agent — ReAct loop
ReAct loop definition (runs after every resolved incident)
PERCEIVE
Completed postmortem from Postmortem Agent. Resolution actions from Auto-Remediation. All agent reasoning traces from Artifact Store.
REASON
LLM extracts reusable patterns. LLM updates runbook confidence based on outcome. LLM identifies new correlation rules.
ACT
Writes new embedding to Episodic Memory. Updates Knowledge Graph confirmed edges. Updates runbook confidence scores. Generates new correlation rules for Event Correlation Agent.
OBSERVE
Validates new knowledge is consistent with existing KB. Rejects contradictory patterns.
LEARN
THIS AGENT IS THE LEARNING LOOP. Every run improves the platform. Platform MTTR decreases over time.
STOP
All learning artifacts written + Knowledge Graph updated.
Closes learning loop · vector_search · graph_query
🌊 Event Correlation Agent — ReAct loop
ReAct loop definition
PERCEIVE
10,000+ raw alerts/hr from monitoring systems.
REASON
LLM classifies and groups by temporal + topological proximity. Applies learned correlation rules from Knowledge Curator.
ACT
Writes collapsed incident set to WM. Alert storm → 3 actionable incidents. Updates Knowledge Graph alert nodes.
OBSERVE
Validates grouped alerts have causal coherence. Rechecks ungrouped alerts.
LEARN
Stores new correlation fingerprints in Episodic Memory.
STOP
<50 ungrouped alerts remaining · OR · 5min time limit.
Max 2 LLM calls · metric_query · graph_query
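The temporal half of this correlation can be illustrated with a minimal clustering pass: an alert joins the current incident if it arrives within a fixed window of the cluster's last alert. The window value and the tuple shape are illustrative assumptions, not the production algorithm:

```python
def cluster_by_time(alerts, window_s=60):
    """Greedy temporal clustering. `alerts` is a list of
    (timestamp_s, alert_id) tuples; returns a list of clusters,
    each cluster a list of alerts that arrived close together."""
    clusters = []
    for ts, alert_id in sorted(alerts):
        # Join the open cluster if within `window_s` of its last alert,
        # otherwise start a new incident candidate.
        if clusters and ts - clusters[-1][-1][0] <= window_s:
            clusters[-1].append((ts, alert_id))
        else:
            clusters.append([(ts, alert_id)])
    return clusters
```

In the real agent this temporal grouping is combined with topological proximity from the knowledge graph before the LLM classifies the collapsed set.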
📢 Communication Agent — ReAct loop
ReAct loop definition
PERCEIVE
Current incident state from WM + impacted services + SLA status. Audience list from graph_query (on-call, management, customers).
REASON
LLM generates audience-appropriate NL summary per audience. Different depth, jargon, and urgency framing: SRE vs PM vs CTO vs customer.
ACT
Sends via PagerDuty, Slack, Teams, email. Creates/updates status page. Sets up war room channel for P1/P2.
OBSERVE
Validates messages delivered (read receipts where available).
LEARN
Tracks which communication patterns reduce escalation noise.
STOP
All required audiences notified + delivery confirmed.
Safe — no infra mutation · send_message · create_ticket
All 18 agents follow this same pattern. The 6 shown above are the most critical. Every agent has a maximum LLM call count, defined tool bindings, a validated output envelope, and an explicit stop condition. Without these defined, the agent loops indefinitely or terminates silently — neither of which is agentic.

Skills catalog — 8 categories, 60+ skills

Skills are the critical middle layer between agents and tools. Agents decide WHAT. Skills know HOW. Tools do atomic work. Without skills, every agent re-implements the same logic. With skills: one fix improves all agents. Versioned in Git, tested in sandbox, audited on every run.
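A minimal sketch of that layering, assuming a hypothetical registry where each skill declares the tools it may touch; agents invoke skills by name, and a skill cannot reach an undeclared tool. The skill name and queries below are illustrative:

```python
SKILLS = {}

def skill(name, tools, risk="low"):
    """Register a skill with its declared tool bindings and risk level."""
    def register(fn):
        SKILLS[name] = {"fn": fn, "tools": set(tools), "risk": risk}
        return fn
    return register

@skill("diagnose_db_pool", tools=["metric_query", "log_search", "llm_call"])
def diagnose_db_pool(call_tool, service):
    # The skill owns the HOW: which queries to run, in what order.
    pool = call_tool("metric_query", expr=f'db_pool_active{{svc="{service}"}}')
    errors = call_tool("log_search", query=f"svc:{service} pool exhausted")
    return {"pool": pool, "errors": errors}

def run_skill(name, call_tool, **kwargs):
    """Agents call this — never a tool directly. Undeclared tools are denied."""
    entry = SKILLS[name]
    def gated(tool, **params):
        if tool not in entry["tools"]:
            raise PermissionError(f"{name} may not call {tool}")
        return call_tool(tool, **params)
    return entry["fn"](gated, **kwargs)
```

One fix inside `diagnose_db_pool` improves every agent that uses it; the declared tool set is what the policy engine audits per execution.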

1
Diagnostic
12 skills · read-only · no approval
SAFE
diagnose_db_pool
Connection pool exhaustion, slow queries, replication lag
metric_query · log_search · llm_call
low
diagnose_k8s_pod
CrashLoopBackOff, OOMKill, image pull, readiness failures
k8s_exec · log_search
low
diagnose_network
Latency spikes, packet loss, DNS failures, cert issues
low
diagnose_api_latency
P99 decomposition, trace flame graph, bottleneck ID
low
diagnose_memory_leak
Heap growth, GC pressure, leak patterns
low
diagnose_disk_pressure
IOPS saturation, space exhaustion, log rotation
low
diagnose_queue_backlog
Consumer lag, dead letters, partition imbalance
low
diagnose_auth_failure
Token expiry, cert rotation, OIDC misconfig
low
diagnose_cpu_saturation
Throttling, noisy neighbours, runaway processes
low
diagnose_cache_miss
Hit ratio degradation, eviction pressure
low
diagnose_dependency
Upstream/downstream service health
low
diagnose_certificate
Expiry, chain issues, mismatch detection
low
2
Remediation
15 skills · mutating · policy-gated
MUTATING
restart_service
Graceful rolling restart with drain + health checks
medium
scale_horizontally
Add replicas, wait for ready, verify load distribution
low
reset_connection_pool
Kill idle connections, reset PgBouncer/HikariCP
medium
rollback_deployment
Revert to last known good, verify canary, shift traffic
high · L1 approval
flush_cache
Redis/Memcached invalidation + warm-up strategy
medium
failover_dns
Switch to DR region, validate routing
high · L1 approval
rotate_credentials
Rotate secrets, update dependents, verify
high · L1 approval
clear_disk_space
Archive logs, purge tmp, compress old data
low
drain_node
Cordon + drain K8s node before maintenance
medium
scale_vertically
Resize CPU/memory requests/limits
medium
enable_circuit_breaker
Trip circuit breaker on degraded upstream
low
patch_config
Apply config change from runbook
medium
increase_rate_limit
Adjust throttle thresholds temporarily
low
archive_old_data
Move cold data to cheaper tier
low
force_gc
Trigger garbage collection on JVM/Node
low
3
Analysis
10 skills · compute-heavy · ML-powered
ML
capacity_forecast
Predict resource limits: CPU, memory, disk, connections
low
cost_anomaly_detect
Spot unexpected cloud spend spikes by service
low
slo_burn_rate
Calculate error budget consumption rate per SLO
low
change_risk_score
Score deployment risk using history + blast radius
low
failure_probability
Score likelihood of imminent failure per service
low
blast_radius_estimate
Calculate scope of impact if action taken
low
latency_decompose
Break down P99 latency by service + operation
low
pattern_match
Match current symptoms to known failure patterns
low
trend_analysis
Detect slow-burn degradation over days/weeks
low
anomaly_score
Multi-metric composite anomaly score
low
4
Communication
8 skills · outbound · NLG-powered
OUTBOUND
incident_summary
NL summary tailored: SRE vs PM vs CTO
low
draft_postmortem
Blameless postmortem with timeline, 5-why, action items
low
status_page_update
Compose and post to Statuspage/Instatus
low
escalation_brief
Package context for L2/L3 handoff with evidence links
low
war_room_setup
Create incident channel, add responders, pin context
low
resolution_notify
Notify all stakeholders of incident resolution
low
customer_advisory
Draft external-facing advisory (sanitized)
low
sla_breach_alert
Trigger SLA breach notification with countdown
low
5
Discovery
6 skills · topology · read-only
SAFE
map_service_deps
Trace API calls to build live dependency graph
low
discover_cloud_assets
Scan AWS/Azure/GCP for untracked resources
low
detect_config_drift
Compare running config vs Git/IaC declared state
low
trace_blast_path
Walk graph from changed component to all dependents
low
inventory_scan
Full asset inventory for a service/namespace
low
ownership_lookup
Find team responsible for a service/resource
low
6
Security
6 skills · threat & compliance
COMPLIANCE
assess_vulnerability
CVE lookup, runtime exposure check, patch priority
low
audit_access_patterns
Detect anomalous IAM, SSH, and API access
low
compliance_check
Validate against SOC2/HIPAA/PCI controls
low
threat_intel_lookup
Check IOCs against threat feeds
low
secret_audit
Find exposed or expiring credentials
low
network_exposure_check
Identify unintended public surface area
low
7
Optimization
5 skills · FinOps & tuning
MUTATING
rightsize_compute
Recommend CPU/memory based on actual utilization
low
optimize_queries
Identify slow SQL, suggest indexes, rewrite plans
low
tune_autoscaler
Adjust HPA/VPA thresholds based on traffic patterns
medium
optimize_spot_usage
Maximise spot/preemptible instance savings safely
medium
eliminate_idle_resources
Find and decommission waste resources
medium
8
Workflow (composite)
4 skills · multi-skill chains
COMPOSITE
full_incident_response
diagnose → remediate → validate → communicate → learn. End-to-end P2/P3 chain.
medium
safe_deployment
risk_score → deploy → canary_analyze → promote_or_rollback
medium
proactive_maintenance
capacity_forecast → rightsize → drift_check → compliance_check
low
security_sweep
vulnerability → access_audit → compliance. Full security posture.
low

Tools catalog — 20 governed tools via MCP gateway

Every tool call from every agent flows through the MCP Gateway. No exceptions. Policy check before execution. Audit log per invocation. Deny by default — if a tool is not on the explicit allowlist for an agent, the call is rejected before any network request is made.

Core rule: Agents call Skills. Skills call Tools. Tools call infrastructure via MCP Gateway. An agent never calls infrastructure directly. This chain is enforced architecturally — not by convention.
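The deny-by-default check reduces to a small decision in front of every network call. Agent names and allowlists below are hypothetical; the real gateway enforces this via APIM + OPA, but the decision logic has this shape:

```python
# Hypothetical per-agent tool allowlists; anything absent is denied.
ALLOWLIST = {
    "rca_agent": {"metric_query", "log_search", "graph_query", "vector_search"},
    "auto_remediation_agent": {"k8s_exec", "cloud_exec"},
}

AUDIT_LOG = []  # stand-in for the immutable audit_logger sink

def invoke_tool(agent, tool, params):
    """Policy check before execution; every decision is audited,
    including denials. Unknown agents fall through to deny."""
    allowed = tool in ALLOWLIST.get(agent, set())
    AUDIT_LOG.append({"agent": agent, "tool": tool, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{agent} -> {tool} denied (not on allowlist)")
    return {"tool": tool, "params": params, "status": "executed"}
```

The denial happens before any network request, and the audit entry is written either way — so a misbehaving agent leaves evidence even when it achieves nothing.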
Data access tools
metric_query
query_metrics(expr, range, step)
Query time-series data with PromQL/MQL syntax against Prometheus/Azure Monitor.
read-only · no approval · prometheus
log_search
search_logs(query, timerange, filters)
Full-text + structured search across all log indices. Supports regex, field filters.
read-only · no approval · log analytics
trace_lookup
get_trace(trace_id) | search_spans()
Retrieve distributed traces by ID. Search spans. Build service maps from trace data.
read-only · no approval · app insights
graph_query
graph_query(cypher) | find_path(a,b)
Traverse the knowledge graph for topology + history. Gremlin/Cypher query support.
read-only · no approval · cosmos gremlin
Infrastructure execution tools
k8s_exec
k8s_exec(action, resource, ns)
Kubectl operations: scale, restart, rollout, drain, cordon. Full namespace scoping.
mutating · L2/L3 approval · AKS
cloud_exec
cloud_exec(provider, service, action)
AWS/Azure/GCP resource operations via unified abstraction layer.
mutating · L1/L2 approval · multi-cloud
remote_exec
remote_exec(host, command, sudo?)
Run commands on hosts via SSH with full audit trail. Restricted command allowlist.
mutating · L2 approval · ssh
dns_lb_control
dns_update() | lb_shift_traffic()
DNS failover, traffic shifting, health check management. DR activation support.
mutating · L1 approval · azure dns
Integration & ITSM tools
create_ticket
create_ticket() | update_ticket()
CRUD on ServiceNow, Jira, Freshdesk tickets with structured metadata.
write · no approval · ITSM
send_message
send_message(channel, body, urgency)
Slack, Teams, PagerDuty, email with templates. War room creation support.
outbound · no approval · multi-channel
cmdb_sync
cmdb_get(ci) | cmdb_update(ci, attrs)
Read/write configuration items and relationships. ServiceNow CMDB integration.
read/write · L3 · CMDB
ci_cd_pipeline
trigger_pipeline() | get_build_status()
Trigger builds, read pipeline status, gate and release deployments.
trigger · L2 approval · Azure DevOps
AI / ML tools
llm_call
llm_call(model, prompt, tools?)
Prompt any LLM with routing, caching, fallback. Cost tracked per call. Via Azure OpenAI.
inference · cost-tracked · Azure OpenAI
vector_search
vector_search(query, collection, top_k)
Semantic search over incidents, docs, runbooks. Retrieval quality logged per query.
read-only · no approval · AI Search
ml_model_serve
predict(model_id, features)
Run inference on custom anomaly/forecast models. ARIMA, Prophet, isolation forest.
inference · no approval · Azure ML
nlp_pipeline
nlp_process(text, tasks[])
Entity extraction, classification, summarization. Log parsing and error fingerprinting.
inference · no approval · Azure AI
Safety & governance tools
check_policy
check_policy(action, ctx) → allow|deny
Evaluate OPA/Rego rules before any action executes. Returns allow/deny + rule matched.
governance · always called · OPA/Rego
audit_logger
log_action(agent, action, evidence)
Immutable append-only log of every agent decision and action. Azure Log Analytics.
write · always called · Log Analytics
approval_gateway
request_approval(action, approvers)
Request human approval for high-risk actions. Teams adaptive cards + timeout escalation.
blocking · L1 actions · Logic Apps
secret_manager
get_secret(key, ttl) | rotate(key)
Credential retrieval with JIT access and rotation. Azure Key Vault integration.
JIT access · L1 for rotate · Key Vault

Runtime flow — how a P2 incident executes end-to-end

This is the exact sequence of events when an incident triggers. Every step maps to the architecture. Every arrow is a real API call. The Orchestrator never sleeps — it holds state in PostgreSQL so it survives crashes and restarts.

Participants: User/alert · Runtime/orchestrator · LLM (model plane) · MCP gateway · Knowledge fabric · Working memory · State store · Observability

Initialization
Alert/user → Runtime: Incident trigger + identity + SLA/priority context (example: P2 — payment-service)
Runtime → State store: Initialize execution context (corrId, state=INIT; idempotency key set)
Runtime → Observability: Start audit span (incident ID, timestamp, identity)
Runtime → Knowledge fabric: Seed request: top-5 similar incidents + relevant SOPs
Knowledge fabric → Runtime: Episodic memory results + runbook candidates returned
Runtime → Working memory: Seed working context (intent, identity, SLA, episodic results)

Planning (Orchestrator inner LLM shell)
Runtime → LLM: Plan request: context + available agents + policy constraints
LLM → Runtime: Execution plan DAG (JSON schema validated; LLM proposes only)
Runtime → MCP gateway: check_policy(plan) — validate all steps approved
Runtime → State store: Persist plan hash + transition to DISPATCHING

Execution loop — for each DAG step (parallel where depends_on allows)
Runtime → Working memory: Refresh agent context (latest facts from prior steps)
Runtime → LLM: Agent reasoning call (context + tool results + KB context)
LLM → Runtime: Next tool call OR stop condition (validated against schema)
Runtime → MCP gateway: Invoke governed tool (params, corrId, agent identity)
MCP gateway → MCP gateway: check_policy → allow/deny → execute → audit_logger (deny by default)
MCP gateway → Runtime: Tool result + logs + artifact ref
Runtime → Knowledge fabric: Retrieve SOPs/graph/KB relevant to current state
Runtime → Working memory: Upsert latest facts (tool results, agent findings)
Runtime → State store: Upsert step record (status, retryCount, idempotencyKey)
Runtime → Observability: Audit tool call (inputs/outputs, scope, latency, cost)

Completion + learning
Runtime → State store: Mark workflow COMPLETE — final status, cost total
Runtime → Knowledge fabric: Upsert final summary (resolution, RCA fingerprint, tags)
Runtime → Observability: Finalize audit (metrics, total cost, trace summary, accuracy signal)
Runtime → User/alert: Final output + explanation + provenance links + cost summary

Memory fabric — 4 layers, all infrastructure

Memory is not an agent. It is a shared infrastructure service with four distinct layers. Every agent reads from it at start, writes to it at close. The Memory Fabric is what makes the platform a learning system rather than just an automation system.

Layer 1 — Working memory
Azure Cache for Redis (Premium, zone-redundant) · TTL-scoped per incident lifetime · <1ms latency · expires when incident reaches DONE state
Holds
Active reasoning context · tool results (latest) · partial agent results · current plan state · identity + SLA
Seeded
At INIT state by Orchestrator from Episodic Memory (top-5 similar incidents)
Updated
Every agent writes its result when complete. Orchestrator refreshes before each dispatch.
Expires
When incident moves to DONE + configurable TTL (default 24h)
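The TTL-scoped contract can be sketched without Redis. This dict-backed stand-in only illustrates the behavior (upsert during the incident, lazy expiry after the TTL); the class and its interface are assumptions, not the production client:

```python
import time

class WorkingMemory:
    """Stand-in for the Redis working memory: every key is scoped to an
    incident and expires after a TTL (default 24h). The injectable clock
    exists purely so the expiry behavior can be tested deterministically."""
    def __init__(self, ttl_s=24 * 3600, clock=time.monotonic):
        self.ttl_s, self.clock, self.store = ttl_s, clock, {}

    def upsert(self, incident_id, key, value):
        self.store[(incident_id, key)] = (value, self.clock() + self.ttl_s)

    def get(self, incident_id, key):
        item = self.store.get((incident_id, key))
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:        # lazy expiry on read
            del self.store[(incident_id, key)]
            return None
        return value
```

In production this maps to Redis key TTLs (e.g. SETEX semantics) with the incident ID as the key prefix, so an entire incident's context vanishes together.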
Layer 2 — Episodic memory
Azure AI Search (vector index + semantic reranker) · All resolved incidents as dense embeddings · Native Foundry integration · 10-50ms similarity search
Holds
Every resolved incident as structured embedding: RCA fingerprint, resolution action, outcome, MTTR, cost, confidence scores
Used at
INIT: top-5 similar past incidents retrieved. RCA Agent uses as in-context examples. Auto-Remediation uses past resolution actions.
Updated
At DONE: Knowledge Curator writes new incident embedding with full resolution context.
Quality
Every retrieval logged: query, top-k, similarity scores, chunks used in context.
Layer 3 — Semantic knowledge base
Azure AI Search (hybrid full-text + vector) · SOPs, runbooks, architecture docs, postmortems · 500+ runbooks with confidence scores · 20-80ms latency
Holds
500+ runbooks (versioned, chunked), architecture docs, past postmortems, compliance control mappings, team ownership docs
Used by
Reasoning Agent (SOPs per step) · Auto-Remediation (runbooks) · Compliance Agent (control definitions) · Postmortem Agent (similar postmortems)
Updated
Knowledge Curator updates runbook confidence scores after every incident outcome. New docs indexed via Azure Blob trigger.
Layer 4 — Knowledge graph
Azure Cosmos DB (Gremlin API) · All infrastructure topology · Continuously updated · 5-30ms graph traversal · Global distribution
Nodes
Service · Host · Database · Queue · Team · Deployment · Config · Alert
Edges
depends_on · owned_by · deployed_at · calls · caused · resolved_by · has_config · monitors
Used by
Topology Discovery (writes) · RCA (traverses upstream) · Impact Analysis (fan-out downstream) · Change Detection (correlates change nodes)
Updated
Topology Discovery Agent continuously. Knowledge Curator confirms/adds causal edges after every incident close.
Why Memory is infrastructure, not an agent: Every agent reads from and writes to Memory. If Memory were an agent, who orchestrates it? It has no parent. It cannot have a ReAct loop because it has no goal — it responds to requests. It must be always-on, clustered, and highly available (HA) because if it goes down, ALL agents are blind. This is infrastructure behavior, not agent behavior.
Human-in-the-loop — 3 approval levels
Level 3 — full autonomy
Agent acts immediately. No human gate. Scope: read-only ops, low-blast-radius reversible actions. Examples: pulling metrics, querying logs, scaling dev by 1 replica.
Level 2 — notify + proceed
Agent acts + notifies simultaneously. Human can interrupt within 5min window. Examples: service restarts, cache flushes on non-critical services. Via Teams adaptive card.
Level 1 — approval required
Agent prepares action, presents evidence, blocks until approved. Timeout: 10min → escalate. Examples: rollback, DNS failover, credential rotation, any PCI-zone action.
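The three levels reduce to one gate in front of every action. `notify` and `request_approval` are stand-ins for the Teams adaptive card and the Logic Apps approval flow; the return strings are illustrative:

```python
def gate_action(action, level, notify, request_approval):
    """Route an action through the human-in-the-loop gate.
    L3 = full autonomy, L2 = notify + interrupt window, L1 = block."""
    if level == 3:                       # read-only / reversible: act now
        return "execute"
    if level == 2:                       # act + notify; human may interrupt
        notify(action, window_min=5)
        return "execute_with_interrupt_window"
    if level == 1:                       # block until approved (10min timeout)
        approved = request_approval(action, timeout_min=10)
        return "execute" if approved else "escalate"
    raise ValueError(f"unknown approval level: {level}")
```

The level itself is assigned by the policy engine per action per context, so an agent cannot choose its own gate.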

Azure AI Foundry mapping — what goes where

Azure AI Foundry covers ~40% of this solution natively. Adjacent Azure services that sit alongside Foundry cover another ~55%. The remaining ~5%, the MCP policy gateway stack, requires a custom build on Azure infrastructure.

Native Azure AI Foundry — 40% of the solution
Orchestrator runtime + 18 agents
AI Foundry Agent Service — 19 projects (1 orchestrator + 18 agents)
Primary LLM (heavy reasoning)
Azure OpenAI GPT-4o via Foundry model catalog
Fast LLM (routing, classification, summaries)
Azure OpenAI GPT-4o-mini via Foundry model catalog
Embeddings for vector memory
Azure OpenAI text-embedding-3-large + AI Search reranker
Episodic memory + semantic KB
Azure AI Search (vector + hybrid) — native Foundry knowledge store
System prompt versioning
Foundry prompt management — versioned per agent project
Offline evaluation + shadow mode
Foundry evaluations — eval datasets per agent, accuracy tracking
Reasoning traces + tool spans
Foundry tracing → Azure Monitor + App Insights
Content safety / responsible AI
Foundry content filters — built in, applied to all LLM calls
Managed credentials for tools
Foundry connections — secure credential management per project
Adjacent Azure services — 55% of the solution
Working memory (TTL-scoped, <1ms)
Azure Cache for Redis Premium (zone-redundant, clustered)
Knowledge graph (topology + causality)
Azure Cosmos DB — Gremlin API (global distribution)
State store (ACID, idempotency)
Azure Database for PostgreSQL Flexible Server
Artifact store (reasoning traces, tool logs)
Azure Blob Storage + Table Storage index
Skills runner (stateless, auto-scale)
Azure Container Apps (skill pods, scale to zero)
Agent pod hosting (independent scale)
Azure Kubernetes Service — Helm chart per tier
Human approval gates (L1 approval)
Azure Logic Apps + Teams adaptive cards + Entra ID authZ
Message bus (async agent coordination)
Azure Service Bus (queues + topics per tier)
Runbook library (Git-versioned YAML)
Azure Repos + Blob Storage (500+ runbooks)
Secrets + JIT credential access
Azure Key Vault (JIT access, automatic rotation)
Observability cost ledger
Azure Monitor custom metrics + Log Analytics workspace
Custom build on Azure — 5% — the one piece Foundry cannot do natively
MCP Gateway (deny by default, tool contracts)
Azure API Management + custom MCP spec layer
Policy engine (OPA/Rego, blast radius)
OPA sidecar on AKS — every tool call evaluated before execution
Inter-agent envelope validation
Custom FastAPI service — schema validation before Orchestrator reads
These three are custom because: Azure Foundry has no native OPA/Rego policy engine, no deny-by-default tool governance primitive, and no inter-agent typed envelope contract enforcement. These are the pieces that make the system safe to run in production. Without them, you have automation, not governed agentic behavior.
Build order on Azure: PostgreSQL + Blob Storage (state/artifacts) → Redis + AI Search (memory fabric) → Container Apps (skills) → AKS + Foundry agents (18 projects) → APIM + OPA (MCP gateway) → Logic Apps (human approval) → Observability pipeline. Never skip the policy engine — deploy it day one with empty rules rather than bypassing it.

Full agentic checklist — every item defined and mapped

This is the definitive test of whether a system is truly full agentic. Every item is mapped to the architecture. Nothing is aspirational — each has a specific implementation location.

Every agent has a defined ReAct loop
All 18 agents have Perceive/Reason/Act/Observe/Learn defined. Max LLM call counts set. Stop conditions explicit. Defined in Tab 02 of this document and implemented as Foundry Agent projects.
defined
Every agent has a typed input/output contract
Inter-agent envelope v1.0: status · confidence · result · evidence_refs · memory_writes · next_recommended_step · escalation · cost_usd. Orchestrator validates this schema before reading any result.
defined
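The envelope fields listed above can be sketched as a validated dataclass. Field names come from this document; the types, defaults, and validation checks are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentEnvelope:
    """Inter-agent envelope v1.0 — the Orchestrator validates this shape
    before reading any agent result."""
    status: str                      # assumed: "ok" | "failed" | "escalated"
    confidence: float                # 0.0-1.0
    result: dict
    evidence_refs: list = field(default_factory=list)
    memory_writes: list = field(default_factory=list)
    next_recommended_step: Optional[str] = None
    escalation: Optional[dict] = None
    cost_usd: float = 0.0

    def validate(self):
        assert self.status in {"ok", "failed", "escalated"}, "bad status"
        assert 0.0 <= self.confidence <= 1.0, "confidence out of range"
        assert self.cost_usd >= 0.0, "negative cost"
        return self
```

A result that fails validation never reaches the Orchestrator's planning context, which is what keeps one malformed agent from corrupting the run.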
Orchestrator is a state machine with all transitions defined
9 states: IDLE→INIT→PLANNING→DISPATCHING→EXECUTING→REPLANNING→COMPLETING→AUDITING→DONE. Policy-gated transitions. LLM inner shell for plan generation only. Outer shell fully deterministic. Hosted on Azure AI Foundry Agent Service.
defined
Policy engine governs every action type
OPA/Rego rules per action per context. 3 autonomy levels (L1/L2/L3). Blast radius caps. Change windows. RBAC per service. Deployed as OPA sidecar on AKS. Evaluates before every MCP tool call.
defined
Human-in-the-loop has 3 levels with explicit triggers
L3: full autonomy (read-only, reversible). L2: notify + 5min interrupt window (Teams adaptive card). L1: block until approved with 10min timeout + escalation. All wired through Azure Logic Apps + Entra ID.
defined
Memory fabric has 4 layers, all populated and queryable
Working (Redis <1ms) · Episodic (AI Search vectors, top-5 retrieval at incident start) · Semantic KB (500+ runbooks with confidence scores) · Knowledge Graph (Cosmos Gremlin, topology + causality edges). All 4 live and queryable.
defined
Learning loop closes: incident → postmortem → KB update → better next run
Postmortem Agent → Knowledge Curator Agent → Episodic Memory (new embedding) + Knowledge Graph (confirmed edges) + Semantic KB (runbook confidence scores) + new correlation rules fed to Event Correlation Agent. Measurable MTTR reduction over time.
defined
Circuit breakers defined per agent with failure modes
Per-agent timeout (30s perception, 120s analysis). Max 3 retries with exponential backoff. Circuit breaker: 5 failures in 10min → OPEN. Half-open after 2min cooldown. If RCA fails → Impact still runs. If Memory Fabric down → ALL agents degraded → immediate escalation.
defined
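The breaker parameters above (5 failures in a 10min window, 2min cooldown to half-open) can be sketched directly. The class shape and injectable clock are assumptions for illustration:

```python
import time

class CircuitBreaker:
    """Per-agent breaker: OPEN after `threshold` failures inside `window_s`,
    HALF_OPEN after `cooldown_s`, CLOSED again on the next success."""
    def __init__(self, threshold=5, window_s=600, cooldown_s=120,
                 clock=time.monotonic):
        self.threshold, self.window_s = threshold, window_s
        self.cooldown_s, self.clock = cooldown_s, clock
        self.failures, self.opened_at = [], None

    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "HALF_OPEN"
        return "OPEN"

    def record_failure(self):
        now = self.clock()
        # Keep only failures still inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self):
        self.failures, self.opened_at = [], None
```

While a breaker is OPEN the Orchestrator routes around the agent where the plan allows it (RCA down does not block Impact Analysis), which is the graceful-degradation behavior described above.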
Cost budgets enforced per incident tier
P1: $5 max LLM spend. P2: $2 max. P3: $0.50 max. P4: $0.20 max. Enforced by Policy Engine before each LLM call. Cost ledger in Azure Monitor. Per-agent, per-tool cost tracked in State Store.
defined
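Budget enforcement reduces to a pre-call check against a per-tier cap. The ledger below is a stand-in for the State Store and Azure Monitor cost ledger; the caps are the ones stated above:

```python
BUDGET_USD = {"P1": 5.00, "P2": 2.00, "P3": 0.50, "P4": 0.20}

class CostLedger:
    """Tracks LLM spend per incident; the policy engine consults
    can_spend() before every LLM call and denies calls that would
    push the incident over its tier cap."""
    def __init__(self):
        self.spent = {}

    def can_spend(self, incident_id, tier, estimated_usd):
        return self.spent.get(incident_id, 0.0) + estimated_usd <= BUDGET_USD[tier]

    def charge(self, incident_id, usd):
        self.spent[incident_id] = self.spent.get(incident_id, 0.0) + usd
```

Checking the estimate before the call (rather than reconciling after) is what makes the cap a hard limit instead of an alert.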
Shadow mode for new agent versions before live promotion
New agent versions run alongside existing logic. Results compared but not acted on. Promoted to active after accuracy threshold met on last 30 days of replayed incidents. Foundry evaluation pipeline runs on every agent code change.
defined
Evaluation harness to replay past incidents and test accuracy
State Store holds all incidents replayable. Offline eval pipeline in Azure AI Foundry evaluations. Per-agent eval datasets (min 20 labelled examples). RCA accuracy tracked against confirmed postmortems. Resolution success rate tracked. MTTR trending.
defined
Skills as a proper layer between agents and tools
8 categories, 60+ skills. Agents call Skills. Skills call Tools. Never direct agent → tool. Skills are versioned in Azure Repos (YAML), tested in sandbox, deployed to Container Apps. Skill ID + version logged on every execution.
defined
What makes this truly full agentic (not just automated): A new type of incident arrives that no runbook covers. The platform retrieves the 5 closest past incidents from Episodic Memory, traverses the Knowledge Graph for structural context, calls the Reasoning Agent to construct a novel hypothesis under uncertainty, plans a cautious diagnostic sequence, and escalates with a structured brief explaining exactly what it found and what it does not know. This generalisation to novel situations — not the number of agents or Azure services — is what makes it full agentic.

Observability & governance — cross-cutting

You cannot improve what you cannot measure. Observability is not a tab in the platform — it is a cross-cutting layer that every other component writes to. Reasoning traces, tool spans, RAG retrieval quality, cost per incident, model accuracy, and human approval audit all flow to Azure Monitor + App Insights.
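As one concrete example of what "every other component writes to" looks like, a reasoning-trace record might be shaped like this before being emitted as an App Insights custom event. The field names are illustrative assumptions, not a real SDK schema:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class ReasoningTrace:
    """One LLM call's trace: full I/O, confidence per conclusion,
    and evidence references backing every claim."""
    agent: str
    step: str
    prompt: str
    completion: str
    confidence: float
    evidence_refs: list = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

    def to_custom_event(self):
        # serialised payload for a custom telemetry event
        return json.dumps(asdict(self))
```

Requiring `evidence_refs` on every record is what makes hallucination detection possible downstream: a conclusion with no artifact behind it is flagged, not trusted.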

🧵 Reasoning traces
What's captured: every LLM call, full I/O
Stored in: App Insights custom events
Chain-of-thought steps: every reasoning step logged
Confidence score: per conclusion, per agent
Evidence refs: every claim backed by an artifact
Used for: hallucination detection, audit
🔧 Tool call spans
Logged at: MCP Gateway (before the agent)
Pre-execution: policy check result logged
Post-execution: full input/output + latency
Approval trace: who approved, when, evidence
Blast radius: estimated vs actual
Immutable: append-only audit log
🔍 RAG retrieval quality
Query logged: full query + collection
Top-k results: doc ID + similarity score
Chunks used in context: which chunks influenced the LLM
Retrieval latency: per query, trending
Why it matters: separates bad retrieval from bad reasoning
💰 Cost per incident
LLM costs tracked: per agent, per call
Tool costs tracked: per tool type, per invocation
Avg P2 incident cost: ~$0.08–0.15
Human time saved: ~47 min per incident
Cost vs manual: ~$1,200 saved per P2
Budget enforcement: P1 $5 · P2 $2 · P3 $0.50
🎯 Model accuracy
RCA accuracy: 87.3% (30-day rolling)
Resolution success: 61% of P3/P4 fully automated
Prediction MAPE: 9.2% (7-day horizon)
Anomaly precision: 84.7%
Anomaly recall: 91.2%
Runbook match accuracy: 79.4% (improving)
📊 Platform metrics
MTTD (mean time to detect): trending down
MTTR (mean time to resolve): 8.4 min automated
False alarm rate: 6.4%
Escalation rate: trending down as the KB grows
Novel incident handling: reasoned brief always produced
Learning signal quality: improving per 100 incidents

Learning loop — what makes this self-improving

This is the difference between an agentic system and a full agentic system. Every resolved incident makes the platform measurably better at the next one. The loop runs automatically after every incident close. No human intervention required.

The key question: Is your platform better at incident #500 than it was at incident #1? If yes — and you can measure it in MTTR, RCA accuracy, and runbook confidence scores — you have a full agentic, self-improving system. If not, you have sophisticated automation that plateaus.
1
Incident closes — all agents complete
Orchestrator → COMPLETING state
All agent envelopes collected. Final answer delivered to client. State Store marks workflow COMPLETE with final status, total cost, and MTTR. Artifact Store holds every reasoning trace and tool call log.
State Store · Artifact Store · MTTR recorded
2
Postmortem Agent generates structured postmortem
Postmortem Agent · Tier 4
Rebuilds narrative timeline from State Store + Artifact Store. Constructs 5-why chain from RCA Agent's reasoning trace. Generates action items with owners from Knowledge Graph. Creates tickets in ITSM. Publishes to team channel.
Timeline rebuild · 5-why chain · Action items · ITSM tickets
3
Knowledge Curator extracts reusable patterns
Knowledge Curator Agent · Tier 4 — runs after every incident
Reads postmortem + all agent reasoning traces + resolution actions. LLM extracts: RCA fingerprint (symptom pattern → root cause), resolution action that worked (runbook + parameters), runbook confidence delta (did it work? by how much?), new correlation rule candidates ("these 3 alerts always precede X"). Validates new knowledge for consistency before writing.
RCA fingerprint · Runbook confidence delta · New correlation rules · Consistency check
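The extracted knowledge in step 3 can be represented as a small record plus an update rule for the runbook confidence delta. Both the field names and the specific delta values below are hypothetical, shown only to make the idea concrete:

```python
from dataclasses import dataclass

@dataclass
class RCAFingerprint:
    """Reusable pattern extracted from a closed incident:
    symptom pattern -> root cause, tied to the runbook that resolved it."""
    symptom_pattern: tuple   # e.g. ordered alert types observed
    root_cause: str
    runbook_id: str

def confidence_delta(resolved, mttr_minutes, baseline_mttr):
    """Hypothetical update rule: reward a runbook that worked, more so
    when it beat the baseline MTTR; penalise one that failed."""
    if not resolved:
        return -0.10
    bonus = 0.05 if mttr_minutes < baseline_mttr else 0.0
    return 0.05 + bonus
```

The consistency check mentioned above would run before this record is written, rejecting fingerprints that contradict existing Knowledge Graph edges.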
4
Memory Fabric updated — 4 writes
Knowledge Curator → Memory Fabric API
Four parallel writes: [1] Episodic Memory: new incident embedding stored with RCA fingerprint, resolution action, outcome, MTTR, confidence. [2] Knowledge Graph: confirmed causal edges added (caused, resolved_by). [3] Semantic KB: runbook confidence scores updated (higher if it worked, lower if it failed). [4] Event Correlation Agent's correlation rules updated with new pattern.
Episodic Memory write · Graph edge update · Runbook confidence update · Correlation rule update
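The four parallel writes of step 4 can be sketched with a thread pool; the four store clients and the `curated` payload shape are stand-ins for the platform's real Memory Fabric API:

```python
from concurrent.futures import ThreadPoolExecutor

def commit_learning(curated, episodic, graph, semantic_kb, correlation_rules):
    """Issues the four learning-loop writes in parallel and surfaces
    any write failure by collecting every result."""
    writes = [
        (episodic.store, (curated["embedding"], curated["fingerprint"])),
        (graph.add_edges, (curated["causal_edges"],)),
        (semantic_kb.update_confidence, (curated["runbook_id"], curated["delta"])),
        (correlation_rules.add, (curated["new_rules"],)),
    ]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fn, *args) for fn, args in writes]
        return [f.result() for f in futures]
```

Collecting `f.result()` for every future means a failed write raises instead of being silently dropped, which matters when the Semantic KB and the Knowledge Graph must stay consistent.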
5
Next similar incident — platform performs measurably better
Orchestrator INIT state — future incident
Event Correlation Agent groups alerts faster using new correlation rule. Orchestrator seeds Working Memory with this incident as a top-5 Episodic Memory result. RCA Agent's first LLM call already has the resolution path as context — fewer reasoning steps needed. Auto-Remediation Agent selects the higher-confidence runbook. MTTR decreases. RCA accuracy increases. Cost per incident decreases.
Faster correlation · Better RCA context · Higher-confidence runbook · MTTR decreases
6
Platform measures its own improvement
Observability layer — continuous measurement
RCA accuracy tracked against confirmed postmortems (TP/FP/FN per period). Resolution success rate by incident tier. MTTR trending over incident volume. Runbook confidence score distribution (are scores improving?). Prediction accuracy (MAPE trending). Cost per incident trending. These metrics are the evidence that the learning loop is working. Without them, self-improvement is a claim, not a fact.
RCA accuracy trending · MTTR trending · Runbook confidence trending · Cost trending
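Step 6 reduces to computing accuracy against confirmed postmortems and checking the trend per period. A deliberately crude sketch, assuming per-period metric series:

```python
def rca_accuracy(outcomes):
    """outcomes: list of (predicted_cause, confirmed_cause) pairs from
    confirmed postmortems in the measurement window."""
    if not outcomes:
        return None
    hits = sum(1 for pred, truth in outcomes if pred == truth)
    return hits / len(outcomes)

def is_improving(metric_by_period, lower_is_better=False):
    """Crude trend check: compare the mean of the second half of the
    series against the first half. Real dashboards would use a proper
    regression over incident volume instead."""
    n = len(metric_by_period)
    first = sum(metric_by_period[: n // 2]) / (n // 2)
    second = sum(metric_by_period[n // 2:]) / (n - n // 2)
    return second < first if lower_is_better else second > first
```

This pair of functions is the measurable answer to the "better at incident #500 than at incident #1" question: accuracy should trend up, MTTR and cost per incident down.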
The learning loop needs incident volume to show measurable improvement. Plan for: 50 incidents to calibrate baselines, 200 incidents to see meaningful MTTR improvement, 500+ incidents to see accuracy metrics stabilize. This is why Stage 3 (full agentic, self-improving) takes 6-12 months post-production launch — not because the architecture is incomplete, but because the learning loop needs data to learn from.