Enterprise AIOps — Full Agentic Solution
Complete reference · Azure AI Foundry · 18 agents · ReAct loops · Learning system
18 true agents · 60+ skills · 20 tools · 4 memory layers
Full agentic · Azure Foundry · Self-improving
Architecture principle: Orchestrator and Memory are platform infrastructure — not agents. They are the ground every agent runs on. The 18 agents below are the only components with autonomous ReAct loops.
Platform infrastructure
Always-on · not agents · all 18 agents depend on these
⚙️
Orchestrator runtime
State machine (not an agent). IDLE→INIT→PLANNING→DISPATCHING→EXECUTING→REPLANNING→COMPLETING→AUDITING→DONE. Inner LLM generates plans. Outer shell is deterministic.
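The deterministic outer shell can be sketched as a transition table: the inner LLM proposes plans, but any state move not in the table is rejected. A minimal Python sketch — state names follow this document, the exact transition set is an assumption:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto(); INIT = auto(); PLANNING = auto()
    DISPATCHING = auto(); EXECUTING = auto(); REPLANNING = auto()
    COMPLETING = auto(); AUDITING = auto(); DONE = auto()

# Deterministic outer shell: only these transitions are legal.
# (Assumed: REPLANNING loops back to DISPATCHING or gives up to COMPLETING.)
TRANSITIONS = {
    State.IDLE: {State.INIT},
    State.INIT: {State.PLANNING},
    State.PLANNING: {State.DISPATCHING},
    State.DISPATCHING: {State.EXECUTING},
    State.EXECUTING: {State.REPLANNING, State.COMPLETING},
    State.REPLANNING: {State.DISPATCHING, State.COMPLETING},
    State.COMPLETING: {State.AUDITING},
    State.AUDITING: {State.DONE},
    State.DONE: set(),
}

def transition(current: State, nxt: State) -> State:
    """The LLM proposes; the shell disposes. Illegal moves never execute."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

The point of the table is that crash recovery is trivial: reload the persisted state from PostgreSQL and the set of legal next moves is fully determined.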
🗄️
Memory fabric
4-layer infrastructure service. Working (Redis) · Episodic (AI Search vectors) · Semantic KB (AI Search hybrid) · Knowledge Graph (Cosmos Gremlin).
⚖️
Policy engine + MCP gateway
OPA/Rego evaluates every action before execution. MCP Gateway routes all tool calls. Deny by default. Audit log per invocation.
Tier 1
Reasoning services — always running
Shared inference utilities. Called by all other agents when judgment is needed.
📐
Planner Agent
Goal → executable DAG. HTN planning, dynamic re-planning on failure.
HTN · DAG · Rollback plans
🧠
Reasoning Agent
Chain-of-thought under uncertainty. Never acts. Called by all agents needing judgment.
CoT · Bayesian · Explainability
Tier 2
Perception agents — sense the environment
Eyes and ears. Continuous signal ingestion and correlation.
🌊
Event Correlation
10,000+ alerts/hr → actionable incidents via temporal clustering.
Temporal · Topology group · Dedup
📡
Anomaly Detection
Statistical baselines + ML. Z-score, Holt-Winters, LSTM, isolation forest.
LSTM · Isolation Forest · Dynamic thresholds
🔄
Change Detection
Tracks deploys, config drift, Git commits. 80% of incidents follow a change.
Deploy track · Config drift · Git corr.
🕸️
Topology Discovery
Live service dependency mapping via mesh, API traces, network flows.
Service mesh · API trace · CMDB sync
📋
Log Intelligence
Structured insights from log streams. Drain clustering, NLP, log-to-metric.
Drain cluster · NLP · Log2metric
🛡️
Security Posture
CVE scanning, SIEM integration, compliance drift, access anomaly detection.
CVE scan · SIEM · Compliance drift
Tier 3
Analysis agents — think + diagnose
Heavy reasoners. Most LLM calls per incident.
🔬
Root Cause Analysis
Graph traversal + causal inference. True root cause vs downstream symptoms.
Graph traversal · Bayesian nets · 5-why
💥
Impact Analysis
Blast radius: users, revenue at risk, SLA breach probability, fan-out.
User sessions · Revenue model · SLA countdown
🔭
Predictive Analytics
Forecasts failures and capacity exhaustion before users are affected.
ARIMA · Prophet · What-if sim
⚡
Performance Analysis
Latency decomposition, flame graph correlation, SLO burn rate.
P99 analysis · Flame graphs · SLO burn
💰
Cost Optimization
Cloud spend, right-sizing, idle resource detection, FinOps integration.
Right-sizing · RI optimize · Waste detect
📜
Compliance Audit
SOC2/HIPAA/PCI-DSS. Auto-collects evidence, tests controls, detects drift.
SOC2 · HIPAA · PCI-DSS
Tier 4
Action agents — execute + communicate
Highest risk. All require blast radius enforcement and approval gates.
🔧
Auto-Remediation
Selects and executes fixes with safety guardrails. Dry-run, blast radius, rollback.
Runbook match · Dry-run · Blast radius
📢
Communication
Audience-aware NL summaries. Right person, right detail, right time.
NL summaries · PagerDuty · Slack/Teams
📈
Capacity Management
Auto-scales infra. HPA/VPA tuning, pre-scaling for known events.
HPA/VPA · Node scaling · Pre-scale
📝
Postmortem Agent
Auto-generates blameless postmortems with timeline, 5-why, action items.
Timeline rebuild · 5-why · Action items
🚦
Deployment Gate
Canary analysis, progressive rollout (1→25→100%), auto-rollback on SLI degradation.
Canary score · Progressive · Auto-rollback
🧩
Knowledge Curator
Learns from every incident. Closes the learning loop. Makes platform smarter.
Pattern extract · Runbook score · KG enrich

ReAct loops — what makes each agent truly agentic

Every agent has a defined Perceive → Reason → Act → Observe → Learn cycle. This is the loop that separates an agent from a function call. Without this loop defined for each agent, you have a catalog, not a running system.

The test of full agentic behavior: a new type of incident arrives that has never been seen before. A full agentic system reasons over the symptoms, retrieves the closest past incidents, constructs a novel hypothesis, plans cautiously, and escalates with a structured brief. An automated pipeline simply fails. The ReAct loop is what makes this possible.
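The cycle reduces to a single budgeted loop. Everything below is a stand-in — each agent supplies its own perceive/reason/act/observe/learn callables — but the shape of the stop conditions (confidence target, LLM call budget, escalation) follows this document:

```python
def react_loop(goal, perceive, reason, act, observe, learn,
               max_llm_calls=3, confidence_target=0.8):
    """Run Perceive→Reason→Act→Observe until a stop condition fires:
    confidence reached, LLM budget exhausted, or escalation requested."""
    context = perceive(goal)                 # PERCEIVE: seed from memory
    calls = 0
    while calls < max_llm_calls:
        thought = reason(context)            # REASON: one budgeted LLM call
        calls += 1
        if thought.get("escalate"):
            return {"status": "escalated", "context": context}
        result = act(thought)                # ACT: tool calls via MCP gateway
        context = observe(context, result)   # OBSERVE: fold results back in
        if context.get("confidence", 0.0) >= confidence_target:
            learn(context)                   # LEARN: write outcome to memory
            return {"status": "done", "context": context}
    return {"status": "budget_exhausted", "context": context}
```

Note that every exit path is explicit: there is no branch on which the loop can run forever or terminate without a status.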
📐 Planner Agent — ReAct loop
ReAct loop definition
PERCEIVE
Goal + context + constraints from Orchestrator. Top-3 similar past plans from Episodic Memory.
REASON
LLM calls (max 3): goal decomposition → dependency ordering → resource estimation. Output: validated DAG JSON.
ACT
No tool calls. Pure reasoning. Writes candidate plan to Working Memory for Orchestrator review.
OBSERVE
Orchestrator returns policy verdict. If blocked: re-plan around constraints (max 2 attempts).
LEARN
Writes plan outcome to Working Memory. Successful plans stored in Episodic Memory as templates.
STOP
Valid plan approved · OR · max 3 re-plans reached → escalate with explanation.
Max 3 LLM calls · No tool calls · Pure reasoning
🔬 Root Cause Analysis Agent — ReAct loop
ReAct loop definition
PERCEIVE
Correlated incident set + topology graph + change signals from WM. Top-5 similar past incidents via vector search.
REASON
LLM walks dependency graph upstream from symptoms. Builds 5-why chain. Bayesian scoring per candidate root cause. Calls Reasoning Agent for uncertainty arbitration.
ACT
metric_query · log_search · graph_query · vector_search via MCP. Writes RCA conclusion to WM + Knowledge Graph "caused" edge.
OBSERVE
Checks if conclusion explains ALL correlated alerts. If residual unexplained: multi-root detection pass.
LEARN
Writes RCA fingerprint to Episodic Memory. Updates Knowledge Graph confirmed causal edge.
STOP
Confidence ≥ 0.8 · OR · max 5 LLM calls · OR · human escalation.
Max 5 LLM calls · 4 tool types · Highest complexity
🔧 Auto-Remediation Agent — ReAct loop
ReAct loop definition
PERCEIVE
RCA conclusion + blast radius from WM. Matching runbooks scored by historical success from Semantic KB. Human approval status from Policy Engine.
REASON
LLM selects best runbook, parameterises for this incident. LLM generates dry-run validation criteria.
ACT
[1] Dry-run → log expected outcome. [2] Request approval (block until received for L1/L2). [3] Execute runbook steps via Skills. [4] Post-execution validation. [5] If not recovered: rollback → escalate.
OBSERVE
Monitors SLIs for 5min post-remediation before declaring success.
LEARN
Writes outcome to Episodic Memory. Updates runbook confidence score in Semantic KB.
STOP
SLIs recovered · OR · rollback executed · OR · human escalation triggered.
Never acts without approval · k8s_exec · cloud_exec
🧩 Knowledge Curator Agent — ReAct loop
ReAct loop definition (runs after every resolved incident)
PERCEIVE
Completed postmortem from Postmortem Agent. Resolution actions from Auto-Remediation. All agent reasoning traces from Artifact Store.
REASON
LLM extracts reusable patterns. LLM updates runbook confidence based on outcome. LLM identifies new correlation rules.
ACT
Writes new embedding to Episodic Memory. Updates Knowledge Graph confirmed edges. Updates runbook confidence scores. Generates new correlation rules for Event Correlation Agent.
OBSERVE
Validates new knowledge is consistent with existing KB. Rejects contradictory patterns.
LEARN
THIS AGENT IS THE LEARNING LOOP. Every run improves the platform. Platform MTTR decreases over time.
STOP
All learning artifacts written + Knowledge Graph updated.
Closes learning loop · vector_search · graph_query
🌊 Event Correlation Agent — ReAct loop
ReAct loop definition
PERCEIVE
10,000+ raw alerts/hr from monitoring systems.
REASON
LLM classifies and groups by temporal + topological proximity. Applies learned correlation rules from Knowledge Curator.
ACT
Writes collapsed incident set to WM. Alert storm → 3 actionable incidents. Updates Knowledge Graph alert nodes.
OBSERVE
Validates grouped alerts have causal coherence. Rechecks ungrouped alerts.
LEARN
Stores new correlation fingerprints in Episodic Memory.
STOP
<50 ungrouped alerts remaining · OR · 5min time limit.
Max 2 LLM calls · metric_query · graph_query
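The temporal half of this correlation can be illustrated with a minimal clustering pass: an alert joins the current incident if it arrives within a fixed window of the cluster's last alert. The window value and the tuple shape are illustrative assumptions, not the production algorithm:

```python
def cluster_by_time(alerts, window_s=60):
    """Greedy temporal clustering. `alerts` is a list of
    (timestamp_s, alert_id) tuples; returns a list of clusters,
    each cluster a list of alerts that arrived close together."""
    clusters = []
    for ts, alert_id in sorted(alerts):
        # Join the open cluster if within `window_s` of its last alert,
        # otherwise start a new incident candidate.
        if clusters and ts - clusters[-1][-1][0] <= window_s:
            clusters[-1].append((ts, alert_id))
        else:
            clusters.append([(ts, alert_id)])
    return clusters
```

In the real agent this temporal grouping is combined with topological proximity from the knowledge graph before the LLM classifies the collapsed set.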
📢 Communication Agent — ReAct loop
ReAct loop definition
PERCEIVE
Current incident state from WM + impacted services + SLA status. Audience list from graph_query (on-call, management, customers).
REASON
LLM generates audience-appropriate NL summary per audience. Different depth, jargon, and urgency framing: SRE vs PM vs CTO vs customer.
ACT
Sends via PagerDuty, Slack, Teams, email. Creates/updates status page. Sets up war room channel for P1/P2.
OBSERVE
Validates messages delivered (read receipts where available).
LEARN
Tracks which communication patterns reduce escalation noise.
STOP
All required audiences notified + delivery confirmed.
Safe — no infra mutation · send_message · create_ticket
All 18 agents follow this same pattern. The 6 shown above are the most critical. Every agent has a maximum LLM call count, defined tool bindings, a validated output envelope, and an explicit stop condition. Without these defined, the agent loops indefinitely or terminates silently — neither of which is agentic.

Skills catalog — 8 categories, 60+ skills

Skills are the critical middle layer between agents and tools. Agents decide WHAT. Skills know HOW. Tools do atomic work. Without skills, every agent re-implements the same logic. With skills: one fix improves all agents. Versioned in Git, tested in sandbox, audited on every run.
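A minimal sketch of that layering, assuming a hypothetical registry where each skill declares the tools it may touch; agents invoke skills by name, and a skill cannot reach an undeclared tool. The skill name and queries below are illustrative:

```python
SKILLS = {}

def skill(name, tools, risk="low"):
    """Register a skill with its declared tool bindings and risk level."""
    def register(fn):
        SKILLS[name] = {"fn": fn, "tools": set(tools), "risk": risk}
        return fn
    return register

@skill("diagnose_db_pool", tools=["metric_query", "log_search", "llm_call"])
def diagnose_db_pool(call_tool, service):
    # The skill owns the HOW: which queries to run, in what order.
    pool = call_tool("metric_query", expr=f'db_pool_active{{svc="{service}"}}')
    errors = call_tool("log_search", query=f"svc:{service} pool exhausted")
    return {"pool": pool, "errors": errors}

def run_skill(name, call_tool, **kwargs):
    """Agents call this — never a tool directly. Undeclared tools are denied."""
    entry = SKILLS[name]
    def gated(tool, **params):
        if tool not in entry["tools"]:
            raise PermissionError(f"{name} may not call {tool}")
        return call_tool(tool, **params)
    return entry["fn"](gated, **kwargs)
```

One fix inside `diagnose_db_pool` improves every agent that uses it; the declared tool set is what the policy engine audits per execution.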

1
Diagnostic
12 skills · read-only · no approval
SAFE
diagnose_db_pool
Connection pool exhaustion, slow queries, replication lag
metric_query · log_search · llm_call
low
diagnose_k8s_pod
CrashLoopBackOff, OOMKill, image pull, readiness failures
k8s_exec · log_search
low
diagnose_network
Latency spikes, packet loss, DNS failures, cert issues
low
diagnose_api_latency
P99 decomposition, trace flame graph, bottleneck ID
low
diagnose_memory_leak
Heap growth, GC pressure, leak patterns
low
diagnose_disk_pressure
IOPS saturation, space exhaustion, log rotation
low
diagnose_queue_backlog
Consumer lag, dead letters, partition imbalance
low
diagnose_auth_failure
Token expiry, cert rotation, OIDC misconfig
low
diagnose_cpu_saturation
Throttling, noisy neighbours, runaway processes
low
diagnose_cache_miss
Hit ratio degradation, eviction pressure
low
diagnose_dependency
Upstream/downstream service health
low
diagnose_certificate
Expiry, chain issues, mismatch detection
low
2
Remediation
15 skills · mutating · policy-gated
MUTATING
restart_service
Graceful rolling restart with drain + health checks
medium
scale_horizontally
Add replicas, wait for ready, verify load distribution
low
reset_connection_pool
Kill idle connections, reset PgBouncer/HikariCP
medium
rollback_deployment
Revert to last known good, verify canary, shift traffic
high · L1 approval
flush_cache
Redis/Memcached invalidation + warm-up strategy
medium
failover_dns
Switch to DR region, validate routing
high · L1 approval
rotate_credentials
Rotate secrets, update dependents, verify
high · L1 approval
clear_disk_space
Archive logs, purge tmp, compress old data
low
drain_node
Cordon + drain K8s node before maintenance
medium
scale_vertically
Resize CPU/memory requests/limits
medium
enable_circuit_breaker
Trip circuit breaker on degraded upstream
low
patch_config
Apply config change from runbook
medium
increase_rate_limit
Adjust throttle thresholds temporarily
low
archive_old_data
Move cold data to cheaper tier
low
force_gc
Trigger garbage collection on JVM/Node
low
3
Analysis
10 skills · compute-heavy · ML-powered
ML
capacity_forecast
Predict resource limits: CPU, memory, disk, connections
low
cost_anomaly_detect
Spot unexpected cloud spend spikes by service
low
slo_burn_rate
Calculate error budget consumption rate per SLO
low
change_risk_score
Score deployment risk using history + blast radius
low
failure_probability
Score likelihood of imminent failure per service
low
blast_radius_estimate
Calculate scope of impact if action taken
low
latency_decompose
Break down P99 latency by service + operation
low
pattern_match
Match current symptoms to known failure patterns
low
trend_analysis
Detect slow-burn degradation over days/weeks
low
anomaly_score
Multi-metric composite anomaly score
low
4
Communication
8 skills · outbound · NLG-powered
OUTBOUND
incident_summary
NL summary tailored: SRE vs PM vs CTO
low
draft_postmortem
Blameless postmortem with timeline, 5-why, action items
low
status_page_update
Compose and post to Statuspage/Instatus
low
escalation_brief
Package context for L2/L3 handoff with evidence links
low
war_room_setup
Create incident channel, add responders, pin context
low
resolution_notify
Notify all stakeholders of incident resolution
low
customer_advisory
Draft external-facing advisory (sanitized)
low
sla_breach_alert
Trigger SLA breach notification with countdown
low
5
Discovery
6 skills · topology · read-only
SAFE
map_service_deps
Trace API calls to build live dependency graph
low
discover_cloud_assets
Scan AWS/Azure/GCP for untracked resources
low
detect_config_drift
Compare running config vs Git/IaC declared state
low
trace_blast_path
Walk graph from changed component to all dependents
low
inventory_scan
Full asset inventory for a service/namespace
low
ownership_lookup
Find team responsible for a service/resource
low
6
Security
6 skills · threat & compliance
COMPLIANCE
assess_vulnerability
CVE lookup, runtime exposure check, patch priority
low
audit_access_patterns
Detect anomalous IAM, SSH, and API access
low
compliance_check
Validate against SOC2/HIPAA/PCI controls
low
threat_intel_lookup
Check IOCs against threat feeds
low
secret_audit
Find exposed or expiring credentials
low
network_exposure_check
Identify unintended public surface area
low
7
Optimization
5 skills · FinOps & tuning
MUTATING
rightsize_compute
Recommend CPU/memory based on actual utilization
low
optimize_queries
Identify slow SQL, suggest indexes, rewrite plans
low
tune_autoscaler
Adjust HPA/VPA thresholds based on traffic patterns
medium
optimize_spot_usage
Maximise spot/preemptible instance savings safely
medium
eliminate_idle_resources
Find and decommission waste resources
medium
8
Workflow (composite)
4 skills · multi-skill chains
COMPOSITE
full_incident_response
diagnose → remediate → validate → communicate → learn. End-to-end P2/P3 chain.
medium
safe_deployment
risk_score → deploy → canary_analyze → promote_or_rollback
medium
proactive_maintenance
capacity_forecast → rightsize → drift_check → compliance_check
low
security_sweep
vulnerability → access_audit → compliance. Full security posture.
low

Tools catalog — 20 governed tools via MCP gateway

Every tool call from every agent flows through the MCP Gateway. No exceptions. Policy check before execution. Audit log per invocation. Deny by default — if a tool is not on the explicit allowlist for an agent, the call is rejected before any network request is made.

Core rule: Agents call Skills. Skills call Tools. Tools call infrastructure via MCP Gateway. An agent never calls infrastructure directly. This chain is enforced architecturally — not by convention.
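The deny-by-default check reduces to a small decision in front of every network call. Agent names and allowlists below are hypothetical; the real gateway enforces this via APIM + OPA, but the decision logic has this shape:

```python
# Hypothetical per-agent tool allowlists; anything absent is denied.
ALLOWLIST = {
    "rca_agent": {"metric_query", "log_search", "graph_query", "vector_search"},
    "auto_remediation_agent": {"k8s_exec", "cloud_exec"},
}

AUDIT_LOG = []  # stand-in for the immutable audit_logger sink

def invoke_tool(agent, tool, params):
    """Policy check before execution; every decision is audited,
    including denials. Unknown agents fall through to deny."""
    allowed = tool in ALLOWLIST.get(agent, set())
    AUDIT_LOG.append({"agent": agent, "tool": tool, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{agent} -> {tool} denied (not on allowlist)")
    return {"tool": tool, "params": params, "status": "executed"}
```

The denial happens before any network request, and the audit entry is written either way — so a misbehaving agent leaves evidence even when it achieves nothing.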
Data access tools
metric_query
query_metrics(expr, range, step)
Query time-series data with PromQL/MQL syntax against Prometheus/Azure Monitor.
read-only · no approval · prometheus
log_search
search_logs(query, timerange, filters)
Full-text + structured search across all log indices. Supports regex, field filters.
read-only · no approval · log analytics
trace_lookup
get_trace(trace_id) | search_spans()
Retrieve distributed traces by ID. Search spans. Build service maps from trace data.
read-only · no approval · app insights
graph_query
graph_query(cypher) | find_path(a,b)
Traverse the knowledge graph for topology + history. Gremlin/Cypher query support.
read-only · no approval · cosmos gremlin
Infrastructure execution tools
k8s_exec
k8s_exec(action, resource, ns)
Kubectl operations: scale, restart, rollout, drain, cordon. Full namespace scoping.
mutating · L2/L3 approval · AKS
cloud_exec
cloud_exec(provider, service, action)
AWS/Azure/GCP resource operations via unified abstraction layer.
mutating · L1/L2 approval · multi-cloud
remote_exec
remote_exec(host, command, sudo?)
Run commands on hosts via SSH with full audit trail. Restricted command allowlist.
mutating · L2 approval · ssh
dns_lb_control
dns_update() | lb_shift_traffic()
DNS failover, traffic shifting, health check management. DR activation support.
mutating · L1 approval · azure dns
Integration & ITSM tools
create_ticket
create_ticket() | update_ticket()
CRUD on ServiceNow, Jira, Freshdesk tickets with structured metadata.
write · no approval · ITSM
send_message
send_message(channel, body, urgency)
Slack, Teams, PagerDuty, email with templates. War room creation support.
outbound · no approval · multi-channel
cmdb_sync
cmdb_get(ci) | cmdb_update(ci, attrs)
Read/write configuration items and relationships. ServiceNow CMDB integration.
read/write · L3 · CMDB
ci_cd_pipeline
trigger_pipeline() | get_build_status()
Trigger builds, read pipeline status, gate and release deployments.
trigger · L2 approval · Azure DevOps
AI / ML tools
llm_call
llm_call(model, prompt, tools?)
Prompt any LLM with routing, caching, fallback. Cost tracked per call. Via Azure OpenAI.
inference · cost-tracked · Azure OpenAI
vector_search
vector_search(query, collection, top_k)
Semantic search over incidents, docs, runbooks. Retrieval quality logged per query.
read-only · no approval · AI Search
ml_model_serve
predict(model_id, features)
Run inference on custom anomaly/forecast models. ARIMA, Prophet, isolation forest.
inference · no approval · Azure ML
nlp_pipeline
nlp_process(text, tasks[])
Entity extraction, classification, summarization. Log parsing and error fingerprinting.
inference · no approval · Azure AI
Safety & governance tools
check_policy
check_policy(action, ctx) → allow|deny
Evaluate OPA/Rego rules before any action executes. Returns allow/deny + rule matched.
governance · always called · OPA/Rego
audit_logger
log_action(agent, action, evidence)
Immutable append-only log of every agent decision and action. Azure Log Analytics.
write · always called · Log Analytics
approval_gateway
request_approval(action, approvers)
Request human approval for high-risk actions. Teams adaptive cards + timeout escalation.
blocking · L1 actions · Logic Apps
secret_manager
get_secret(key, ttl) | rotate(key)
Credential retrieval with JIT access and rotation. Azure Key Vault integration.
JIT access · L1 for rotate · Key Vault

Runtime flow — how a P2 incident executes end-to-end

This is the exact sequence of events when an incident triggers. Every step maps to the architecture. Every arrow is a real API call. The Orchestrator never sleeps — it holds state in PostgreSQL so it survives crashes and restarts.

Participants: User/alert · Runtime/orchestrator · LLM (model plane) · MCP gateway · Knowledge fabric · Working memory · State store · Observability

Initialization
Alert/user → Runtime: Incident trigger + identity + SLA/priority context (example: P2 — payment-service)
Runtime → State store: Initialize execution context (corrId, state=INIT; idempotency key set)
Runtime → Observability: Start audit span (incident ID, timestamp, identity)
Runtime → Knowledge fabric: Seed request: top-5 similar incidents + relevant SOPs
Knowledge fabric → Runtime: Episodic memory results + runbook candidates returned
Runtime → Working memory: Seed working context (intent, identity, SLA, episodic results)

Planning (Orchestrator inner LLM shell)
Runtime → LLM: Plan request: context + available agents + policy constraints
LLM → Runtime: Execution plan DAG (JSON schema validated; LLM proposes only)
Runtime → MCP gateway: check_policy(plan) — validate all steps approved
Runtime → State store: Persist plan hash + transition to DISPATCHING

Execution loop — for each DAG step (parallel where depends_on allows)
Runtime → Working memory: Refresh agent context (latest facts from prior steps)
Runtime → LLM: Agent reasoning call (context + tool results + KB context)
LLM → Runtime: Next tool call OR stop condition (validated against schema)
Runtime → MCP gateway: Invoke governed tool (params, corrId, agent identity)
MCP gateway → MCP gateway: check_policy → allow/deny → execute → audit_logger (deny by default)
MCP gateway → Runtime: Tool result + logs + artifact ref
Runtime → Knowledge fabric: Retrieve SOPs/graph/KB relevant to current state
Runtime → Working memory: Upsert latest facts (tool results, agent findings)
Runtime → State store: Upsert step record (status, retryCount, idempotencyKey)
Runtime → Observability: Audit tool call (inputs/outputs, scope, latency, cost)

Completion + learning
Runtime → State store: Mark workflow COMPLETE — final status, cost total
Runtime → Knowledge fabric: Upsert final summary (resolution, RCA fingerprint, tags)
Runtime → Observability: Finalize audit (metrics, total cost, trace summary, accuracy signal)
Runtime → User/alert: Final output + explanation + provenance links + cost summary

Memory fabric — 4 layers, all infrastructure

Memory is not an agent. It is a shared infrastructure service with four distinct layers. Every agent reads from it at start, writes to it at close. The Memory Fabric is what makes the platform a learning system rather than just an automation system.

Layer 1 — Working memory
Azure Cache for Redis (Premium, zone-redundant) · TTL-scoped per incident lifetime · <1ms latency · expires when incident reaches DONE state
Holds
Active reasoning context · tool results (latest) · partial agent results · current plan state · identity + SLA
Seeded
At INIT state by Orchestrator from Episodic Memory (top-5 similar incidents)
Updated
Every agent writes its result when complete. Orchestrator refreshes before each dispatch.
Expires
When incident moves to DONE + configurable TTL (default 24h)
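The TTL-scoped contract can be sketched without Redis. This dict-backed stand-in only illustrates the behavior (upsert during the incident, lazy expiry after the TTL); the class and its interface are assumptions, not the production client:

```python
import time

class WorkingMemory:
    """Stand-in for the Redis working memory: every key is scoped to an
    incident and expires after a TTL (default 24h). The injectable clock
    exists purely so the expiry behavior can be tested deterministically."""
    def __init__(self, ttl_s=24 * 3600, clock=time.monotonic):
        self.ttl_s, self.clock, self.store = ttl_s, clock, {}

    def upsert(self, incident_id, key, value):
        self.store[(incident_id, key)] = (value, self.clock() + self.ttl_s)

    def get(self, incident_id, key):
        item = self.store.get((incident_id, key))
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:        # lazy expiry on read
            del self.store[(incident_id, key)]
            return None
        return value
```

In production this maps to Redis key TTLs (e.g. SETEX semantics) with the incident ID as the key prefix, so an entire incident's context vanishes together.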
Layer 2 — Episodic memory
Azure AI Search (vector index + semantic reranker) · All resolved incidents as dense embeddings · Native Foundry integration · 10-50ms similarity search
Holds
Every resolved incident as structured embedding: RCA fingerprint, resolution action, outcome, MTTR, cost, confidence scores
Used at
INIT: top-5 similar past incidents retrieved. RCA Agent uses as in-context examples. Auto-Remediation uses past resolution actions.
Updated
At DONE: Knowledge Curator writes new incident embedding with full resolution context.
Quality
Every retrieval logged: query, top-k, similarity scores, chunks used in context.
Layer 3 — Semantic knowledge base
Azure AI Search (hybrid full-text + vector) · SOPs, runbooks, architecture docs, postmortems · 500+ runbooks with confidence scores · 20-80ms latency
Holds
500+ runbooks (versioned, chunked), architecture docs, past postmortems, compliance control mappings, team ownership docs
Used by
Reasoning Agent (SOPs per step) · Auto-Remediation (runbooks) · Compliance Agent (control definitions) · Postmortem Agent (similar postmortems)
Updated
Knowledge Curator updates runbook confidence scores after every incident outcome. New docs indexed via Azure Blob trigger.
Layer 4 — Knowledge graph
Azure Cosmos DB (Gremlin API) · All infrastructure topology · Continuously updated · 5-30ms graph traversal · Global distribution
Nodes
Service · Host · Database · Queue · Team · Deployment · Config · Alert
Edges
depends_on · owned_by · deployed_at · calls · caused · resolved_by · has_config · monitors
Used by
Topology Discovery (writes) · RCA (traverses upstream) · Impact Analysis (fan-out downstream) · Change Detection (correlates change nodes)
Updated
Topology Discovery Agent continuously. Knowledge Curator confirms/adds causal edges after every incident close.
Why Memory is infrastructure, not an agent: Every agent reads from and writes to Memory. If Memory were an agent, who orchestrates it? It has no parent. It cannot have a ReAct loop because it has no goal — it responds to requests. It must be always-on, clustered, and highly available (HA) because if it goes down, ALL agents are blind. This is infrastructure behavior, not agent behavior.
Human-in-the-loop — 3 approval levels
Level 3 — full autonomy
Agent acts immediately. No human gate. Scope: read-only ops, low-blast-radius reversible actions. Examples: pulling metrics, querying logs, scaling dev by 1 replica.
Level 2 — notify + proceed
Agent acts + notifies simultaneously. Human can interrupt within 5min window. Examples: service restarts, cache flushes on non-critical services. Via Teams adaptive card.
Level 1 — approval required
Agent prepares action, presents evidence, blocks until approved. Timeout: 10min → escalate. Examples: rollback, DNS failover, credential rotation, any PCI-zone action.
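The three levels reduce to one gate in front of every action. `notify` and `request_approval` are stand-ins for the Teams adaptive card and the Logic Apps approval flow; the return strings are illustrative:

```python
def gate_action(action, level, notify, request_approval):
    """Route an action through the human-in-the-loop gate.
    L3 = full autonomy, L2 = notify + interrupt window, L1 = block."""
    if level == 3:                       # read-only / reversible: act now
        return "execute"
    if level == 2:                       # act + notify; human may interrupt
        notify(action, window_min=5)
        return "execute_with_interrupt_window"
    if level == 1:                       # block until approved (10min timeout)
        approved = request_approval(action, timeout_min=10)
        return "execute" if approved else "escalate"
    raise ValueError(f"unknown approval level: {level}")
```

The level itself is assigned by the policy engine per action per context, so an agent cannot choose its own gate.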

Azure AI Foundry mapping — what goes where

Azure AI Foundry covers ~40% of this solution natively. Adjacent Azure services that sit alongside Foundry cover another ~55%. The remaining ~5%, the MCP policy gateway stack, requires a custom build on Azure infrastructure.

Native Azure AI Foundry — 40% of the solution
Orchestrator runtime + 18 agents
AI Foundry Agent Service — 19 projects (1 orchestrator + 18 agents)
Primary LLM (heavy reasoning)
Azure OpenAI GPT-4o via Foundry model catalog
Fast LLM (routing, classification, summaries)
Azure OpenAI GPT-4o-mini via Foundry model catalog
Embeddings for vector memory
Azure OpenAI text-embedding-3-large + AI Search reranker
Episodic memory + semantic KB
Azure AI Search (vector + hybrid) — native Foundry knowledge store
System prompt versioning
Foundry prompt management — versioned per agent project
Offline evaluation + shadow mode
Foundry evaluations — eval datasets per agent, accuracy tracking
Reasoning traces + tool spans
Foundry tracing → Azure Monitor + App Insights
Content safety / responsible AI
Foundry content filters — built in, applied to all LLM calls
Managed credentials for tools
Foundry connections — secure credential management per project
Adjacent Azure services — 55% of the solution
Working memory (TTL-scoped, <1ms)
Azure Cache for Redis Premium (zone-redundant, clustered)
Knowledge graph (topology + causality)
Azure Cosmos DB — Gremlin API (global distribution)
State store (ACID, idempotency)
Azure Database for PostgreSQL Flexible Server
Artifact store (reasoning traces, tool logs)
Azure Blob Storage + Table Storage index
Skills runner (stateless, auto-scale)
Azure Container Apps (skill pods, scale to zero)
Agent pod hosting (independent scale)
Azure Kubernetes Service — Helm chart per tier
Human approval gates (L1 approval)
Azure Logic Apps + Teams adaptive cards + Entra ID authZ
Message bus (async agent coordination)
Azure Service Bus (queues + topics per tier)
Runbook library (Git-versioned YAML)
Azure Repos + Blob Storage (500+ runbooks)
Secrets + JIT credential access
Azure Key Vault (JIT access, automatic rotation)
Observability cost ledger
Azure Monitor custom metrics + Log Analytics workspace
Custom build on Azure — 5% — the one piece Foundry cannot do natively
MCP Gateway (deny by default, tool contracts)
Azure API Management + custom MCP spec layer
Policy engine (OPA/Rego, blast radius)
OPA sidecar on AKS — every tool call evaluated before execution
Inter-agent envelope validation
Custom FastAPI service — schema validation before Orchestrator reads
These three are custom because: Azure Foundry has no native OPA/Rego policy engine, no deny-by-default tool governance primitive, and no inter-agent typed envelope contract enforcement. These are the pieces that make the system safe to run in production. Without them, you have automation, not governed agentic behavior.
Build order on Azure: PostgreSQL + Blob Storage (state/artifacts) → Redis + AI Search (memory fabric) → Container Apps (skills) → AKS + Foundry agents (18 projects) → APIM + OPA (MCP gateway) → Logic Apps (human approval) → Observability pipeline. Never skip the policy engine — deploy it day one with empty rules rather than bypassing it.

Full agentic checklist — every item defined and mapped

This is the definitive test of whether a system is truly full agentic. Every item is mapped to the architecture. Nothing is aspirational — each has a specific implementation location.

Every agent has a defined ReAct loop
All 18 agents have Perceive/Reason/Act/Observe/Learn defined. Max LLM call counts set. Stop conditions explicit. Defined in Tab 02 of this document and implemented as Foundry Agent projects.
defined
Every agent has a typed input/output contract
Inter-agent envelope v1.0: status · confidence · result · evidence_refs · memory_writes · next_recommended_step · escalation · cost_usd. Orchestrator validates this schema before reading any result.
defined
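The envelope fields listed above can be sketched as a validated dataclass. Field names come from this document; the types, defaults, and validation checks are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentEnvelope:
    """Inter-agent envelope v1.0 — the Orchestrator validates this shape
    before reading any agent result."""
    status: str                      # assumed: "ok" | "failed" | "escalated"
    confidence: float                # 0.0-1.0
    result: dict
    evidence_refs: list = field(default_factory=list)
    memory_writes: list = field(default_factory=list)
    next_recommended_step: Optional[str] = None
    escalation: Optional[dict] = None
    cost_usd: float = 0.0

    def validate(self):
        assert self.status in {"ok", "failed", "escalated"}, "bad status"
        assert 0.0 <= self.confidence <= 1.0, "confidence out of range"
        assert self.cost_usd >= 0.0, "negative cost"
        return self
```

A result that fails validation never reaches the Orchestrator's planning context, which is what keeps one malformed agent from corrupting the run.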
Orchestrator is a state machine with all transitions defined
9 states: IDLE→INIT→PLANNING→DISPATCHING→EXECUTING→REPLANNING→COMPLETING→AUDITING→DONE. Policy-gated transitions. LLM inner shell for plan generation only. Outer shell fully deterministic. Hosted on Azure AI Foundry Agent Service.
defined
Policy engine governs every action type
OPA/Rego rules per action per context. 3 autonomy levels (L1/L2/L3). Blast radius caps. Change windows. RBAC per service. Deployed as OPA sidecar on AKS. Evaluates before every MCP tool call.
defined
Human-in-the-loop has 3 levels with explicit triggers
L3: full autonomy (read-only, reversible). L2: notify + 5min interrupt window (Teams adaptive card). L1: block until approved with 10min timeout + escalation. All wired through Azure Logic Apps + Entra ID.
defined
Memory fabric has 4 layers, all populated and queryable
Working (Redis <1ms) · Episodic (AI Search vectors, top-5 retrieval at incident start) · Semantic KB (500+ runbooks with confidence scores) · Knowledge Graph (Cosmos Gremlin, topology + causality edges). All 4 live and queryable.
defined
Learning loop closes: incident → postmortem → KB update → better next run
Postmortem Agent → Knowledge Curator Agent → Episodic Memory (new embedding) + Knowledge Graph (confirmed edges) + Semantic KB (runbook confidence scores) + new correlation rules fed to Event Correlation Agent. Measurable MTTR reduction over time.
defined
Circuit breakers defined per agent with failure modes
Per-agent timeout (30s perception, 120s analysis). Max 3 retries with exponential backoff. Circuit breaker: 5 failures in 10min → OPEN. Half-open after 2min cooldown. If RCA fails → Impact still runs. If Memory Fabric down → ALL agents degraded → immediate escalation.
defined
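The breaker parameters above (5 failures in a 10min window, 2min cooldown to half-open) can be sketched directly. The class shape and injectable clock are assumptions for illustration:

```python
import time

class CircuitBreaker:
    """Per-agent breaker: OPEN after `threshold` failures inside `window_s`,
    HALF_OPEN after `cooldown_s`, CLOSED again on the next success."""
    def __init__(self, threshold=5, window_s=600, cooldown_s=120,
                 clock=time.monotonic):
        self.threshold, self.window_s = threshold, window_s
        self.cooldown_s, self.clock = cooldown_s, clock
        self.failures, self.opened_at = [], None

    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "HALF_OPEN"
        return "OPEN"

    def record_failure(self):
        now = self.clock()
        # Keep only failures still inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self):
        self.failures, self.opened_at = [], None
```

While a breaker is OPEN the Orchestrator routes around the agent where the plan allows it (RCA down does not block Impact Analysis), which is the graceful-degradation behavior described above.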
Cost budgets enforced per incident tier
P1: $5 max LLM spend. P2: $2 max. P3: $0.50 max. P4: $0.20 max. Enforced by Policy Engine before each LLM call. Cost ledger in Azure Monitor. Per-agent, per-tool cost tracked in State Store.
defined
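Budget enforcement reduces to a pre-call check against a per-tier cap. The ledger below is a stand-in for the State Store and Azure Monitor cost ledger; the caps are the ones stated above:

```python
BUDGET_USD = {"P1": 5.00, "P2": 2.00, "P3": 0.50, "P4": 0.20}

class CostLedger:
    """Tracks LLM spend per incident; the policy engine consults
    can_spend() before every LLM call and denies calls that would
    push the incident over its tier cap."""
    def __init__(self):
        self.spent = {}

    def can_spend(self, incident_id, tier, estimated_usd):
        return self.spent.get(incident_id, 0.0) + estimated_usd <= BUDGET_USD[tier]

    def charge(self, incident_id, usd):
        self.spent[incident_id] = self.spent.get(incident_id, 0.0) + usd
```

Checking the estimate before the call (rather than reconciling after) is what makes the cap a hard limit instead of an alert.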
Shadow mode for new agent versions before live promotion
New agent versions run alongside existing logic. Results compared but not acted on. Promoted to active after accuracy threshold met on last 30 days of replayed incidents. Foundry evaluation pipeline runs on every agent code change.
defined
Evaluation harness to replay past incidents and test accuracy
State Store holds all incidents replayable. Offline eval pipeline in Azure AI Foundry evaluations. Per-agent eval datasets (min 20 labelled examples). RCA accuracy tracked against confirmed postmortems. Resolution success rate tracked. MTTR trending.
defined
Skills as a proper layer between agents and tools
8 categories, 60+ skills. Agents call Skills. Skills call Tools. Never direct agent → tool. Skills are versioned in Azure Repos (YAML), tested in sandbox, deployed to Container Apps. Skill ID + version logged on every execution.
defined
What makes this truly full agentic (not just automated): A new type of incident arrives that no runbook covers. The platform retrieves the 5 closest past incidents from Episodic Memory, traverses the Knowledge Graph for structural context, calls the Reasoning Agent to construct a novel hypothesis under uncertainty, plans a cautious diagnostic sequence, and escalates with a structured brief explaining exactly what it found and what it does not know. This generalisation to novel situations — not the number of agents or Azure services — is what makes it full agentic.

Observability & governance — cross-cutting

You cannot improve what you cannot measure. Observability is not a tab in the platform — it is a cross-cutting layer that every other component writes to. Reasoning traces, tool spans, RAG retrieval quality, cost per incident, model accuracy, and human approval audit all flow to Azure Monitor + App Insights.
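As one concrete example of what "every other component writes to" looks like, a reasoning-trace record might be shaped like this before being emitted as an App Insights custom event. The field names are illustrative assumptions, not a real SDK schema:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class ReasoningTrace:
    """One LLM call's trace: full I/O, confidence per conclusion,
    and evidence references backing every claim."""
    agent: str
    step: str
    prompt: str
    completion: str
    confidence: float
    evidence_refs: list = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

    def to_custom_event(self):
        # serialised payload for a custom telemetry event
        return json.dumps(asdict(self))
```

Requiring `evidence_refs` on every record is what makes hallucination detection possible downstream: a conclusion with no artifact behind it is flagged, not trusted.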

🧵 Reasoning traces
What's captured: every LLM call, full I/O
Stored in: App Insights custom events
Chain-of-thought steps: every reasoning step logged
Confidence score: per conclusion, per agent
Evidence refs: every claim backed by an artifact
Used for: hallucination detection, audit
🔧 Tool call spans
Logged at: MCP Gateway (before the agent)
Pre-execution: policy check result logged
Post-execution: full input/output + latency
Approval trace: who approved, when, evidence
Blast radius: estimated vs actual
Immutable: append-only audit log
🔍 RAG retrieval quality
Query logged: full query + collection
Top-k results: doc ID + similarity score
Chunks used in context: which chunks influenced the LLM
Retrieval latency: per query, trending
Why it matters: separates bad retrieval from bad reasoning
💰 Cost per incident
LLM costs tracked: per agent, per call
Tool costs tracked: per tool type, per invocation
Avg P2 incident cost: ~$0.08–0.15
Human time saved: ~47 min per incident
Cost vs manual: ~$1,200 saved per P2
Budget enforcement: P1 $5 · P2 $2 · P3 $0.50
🎯 Model accuracy
RCA accuracy: 87.3% (30-day rolling)
Resolution success: 61% of P3/P4 fully automated
Prediction MAPE: 9.2% (7-day horizon)
Anomaly precision: 84.7%
Anomaly recall: 91.2%
Runbook match accuracy: 79.4% (improving)
📊 Platform metrics
MTTD (mean time to detect): trending down
MTTR (mean time to resolve): 8.4 min automated
False alarm rate: 6.4%
Escalation rate: trending down as the KB grows
Novel incident handling: reasoned brief always produced
Learning signal quality: improving per 100 incidents

Learning loop — what makes this self-improving

This is the difference between an agentic system and a full agentic system. Every resolved incident makes the platform measurably better at the next one. The loop runs automatically after every incident close. No human intervention required.

The key question: Is your platform better at incident #500 than it was at incident #1? If yes — and you can measure it in MTTR, RCA accuracy, and runbook confidence scores — you have a full agentic, self-improving system. If not, you have sophisticated automation that plateaus.
1
Incident closes — all agents complete
Orchestrator → COMPLETING state
All agent envelopes collected. Final answer delivered to client. State Store marks workflow COMPLETE with final status, total cost, and MTTR. Artifact Store holds every reasoning trace and tool call log.
State Store · Artifact Store · MTTR recorded
2
Postmortem Agent generates structured postmortem
Postmortem Agent · Tier 4
Rebuilds narrative timeline from State Store + Artifact Store. Constructs 5-why chain from RCA Agent's reasoning trace. Generates action items with owners from Knowledge Graph. Creates tickets in ITSM. Publishes to team channel.
Timeline rebuild · 5-why chain · Action items · ITSM tickets
3
Knowledge Curator extracts reusable patterns
Knowledge Curator Agent · Tier 4 — runs after every incident
Reads postmortem + all agent reasoning traces + resolution actions. LLM extracts: RCA fingerprint (symptom pattern → root cause), resolution action that worked (runbook + parameters), runbook confidence delta (did it work? by how much?), new correlation rule candidates ("these 3 alerts always precede X"). Validates new knowledge for consistency before writing.
RCA fingerprint · Runbook confidence delta · New correlation rules · Consistency check
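The extracted knowledge in step 3 can be represented as a small record plus an update rule for the runbook confidence delta. Both the field names and the specific delta values below are hypothetical, shown only to make the idea concrete:

```python
from dataclasses import dataclass

@dataclass
class RCAFingerprint:
    """Reusable pattern extracted from a closed incident:
    symptom pattern -> root cause, tied to the runbook that resolved it."""
    symptom_pattern: tuple   # e.g. ordered alert types observed
    root_cause: str
    runbook_id: str

def confidence_delta(resolved, mttr_minutes, baseline_mttr):
    """Hypothetical update rule: reward a runbook that worked, more so
    when it beat the baseline MTTR; penalise one that failed."""
    if not resolved:
        return -0.10
    bonus = 0.05 if mttr_minutes < baseline_mttr else 0.0
    return 0.05 + bonus
```

The consistency check mentioned above would run before this record is written, rejecting fingerprints that contradict existing Knowledge Graph edges.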
4
Memory Fabric updated — 4 writes
Knowledge Curator → Memory Fabric API
Four parallel writes: [1] Episodic Memory: new incident embedding stored with RCA fingerprint, resolution action, outcome, MTTR, confidence. [2] Knowledge Graph: confirmed causal edges added (caused, resolved_by). [3] Semantic KB: runbook confidence scores updated (higher if it worked, lower if it failed). [4] Event Correlation Agent's correlation rules updated with new pattern.
Episodic Memory write · Graph edge update · Runbook confidence update · Correlation rule update
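The four parallel writes of step 4 can be sketched with a thread pool; the four store clients and the `curated` payload shape are stand-ins for the platform's real Memory Fabric API:

```python
from concurrent.futures import ThreadPoolExecutor

def commit_learning(curated, episodic, graph, semantic_kb, correlation_rules):
    """Issues the four learning-loop writes in parallel and surfaces
    any write failure by collecting every result."""
    writes = [
        (episodic.store, (curated["embedding"], curated["fingerprint"])),
        (graph.add_edges, (curated["causal_edges"],)),
        (semantic_kb.update_confidence, (curated["runbook_id"], curated["delta"])),
        (correlation_rules.add, (curated["new_rules"],)),
    ]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fn, *args) for fn, args in writes]
        return [f.result() for f in futures]
```

Collecting `f.result()` for every future means a failed write raises instead of being silently dropped, which matters when the Semantic KB and the Knowledge Graph must stay consistent.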
5
Next similar incident — platform performs measurably better
Orchestrator INIT state — future incident
Event Correlation Agent groups alerts faster using new correlation rule. Orchestrator seeds Working Memory with this incident as a top-5 Episodic Memory result. RCA Agent's first LLM call already has the resolution path as context — fewer reasoning steps needed. Auto-Remediation Agent selects the higher-confidence runbook. MTTR decreases. RCA accuracy increases. Cost per incident decreases.
Faster correlation · Better RCA context · Higher-confidence runbook · MTTR decreases
6
Platform measures its own improvement
Observability layer — continuous measurement
RCA accuracy tracked against confirmed postmortems (TP/FP/FN per period). Resolution success rate by incident tier. MTTR trending over incident volume. Runbook confidence score distribution (are scores improving?). Prediction accuracy (MAPE trending). Cost per incident trending. These metrics are the evidence that the learning loop is working. Without them, self-improvement is a claim, not a fact.
RCA accuracy trending · MTTR trending · Runbook confidence trending · Cost trending
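Step 6 reduces to computing accuracy against confirmed postmortems and checking the trend per period. A deliberately crude sketch, assuming per-period metric series:

```python
def rca_accuracy(outcomes):
    """outcomes: list of (predicted_cause, confirmed_cause) pairs from
    confirmed postmortems in the measurement window."""
    if not outcomes:
        return None
    hits = sum(1 for pred, truth in outcomes if pred == truth)
    return hits / len(outcomes)

def is_improving(metric_by_period, lower_is_better=False):
    """Crude trend check: compare the mean of the second half of the
    series against the first half. Real dashboards would use a proper
    regression over incident volume instead."""
    n = len(metric_by_period)
    first = sum(metric_by_period[: n // 2]) / (n // 2)
    second = sum(metric_by_period[n // 2:]) / (n - n // 2)
    return second < first if lower_is_better else second > first
```

This pair of functions is the measurable answer to the "better at incident #500 than at incident #1" question: accuracy should trend up, MTTR and cost per incident down.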
The learning loop needs incident volume to show measurable improvement. Plan for: 50 incidents to calibrate baselines, 200 incidents to see meaningful MTTR improvement, 500+ incidents to see accuracy metrics stabilize. This is why Stage 3 (full agentic, self-improving) takes 6-12 months post-production launch — not because the architecture is incomplete, but because the learning loop needs data to learn from.