Cosmos Operations Center: realtime, predictive, AI-driven
The DBA dashboard that tells you what to fix before the customer calls. Realtime charts with anomaly detection; click any spike to drill into an AI-narrated root cause walk. Predictive ETAs warn before saturation, configurable policies fire automatically, and a full recordable timeline lets you post-mortem yesterday's incident in 30 seconds.
What it shows you, the moment you open it
4 realtime charts (1s tick)
RU/s consumed · Throttle 429 events · Latency p99 · Error count. All four update every second from the Query Cost ring buffer. The last hour is kept in memory.
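A one-hour window at a 1s tick works out to 3,600 samples per connection. As a minimal sketch of such a ring buffer (the names `Sample` and `MetricRing` and the field layout are illustrative assumptions, not the product's actual internals):

```python
from collections import deque
from dataclasses import dataclass

WINDOW_SECONDS = 3600  # one hour of samples at a 1s tick


@dataclass(frozen=True)
class Sample:
    ts: int           # Unix timestamp in seconds
    ru_per_s: float   # RU/s consumed during this tick
    throttles: int    # 429 events during this tick
    p99_ms: float     # latency p99 in milliseconds
    errors: int       # error count during this tick


class MetricRing:
    """Fixed-size buffer: once full, each new sample evicts the oldest."""

    def __init__(self, capacity: int = WINDOW_SECONDS) -> None:
        # deque(maxlen=...) discards from the left when the limit is reached.
        self._buf = deque(maxlen=capacity)

    def push(self, sample: Sample) -> None:
        self._buf.append(sample)

    def last(self, count: int) -> list:
        """Return the most recent `count` samples, oldest first."""
        if count <= 0:
            return []
        return list(self._buf)[-count:]

    def __len__(self) -> int:
        return len(self._buf)
```

Memory stays bounded regardless of uptime, which is what makes a per-second tick affordable.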
Red dots = anomalies, clickable
The anomaly detector flags throttle storms, RU saturation, latency spikes, and partition skew the moment they happen. Each anomaly is a red dot pinned to its exact timestamp.
Predictive alerts banner
Linear regression over the last 3 min tells you "RU will saturate in ~14 min if trend continues", actionable BEFORE the breach, not after.
Recording toggle (history-store)
One click and every snapshot is persisted to the same MongoDB management instance you configured for mongostat/mongotop. Job Manager shows it alongside the others.
Every red dot opens an AI-narrated Root Cause Walk
Click the anomaly and a side drawer slides in with three stops, each backed by deterministic context (Query Cost ring buffer + cross-scanner caches) + optional AI narration in your language.
WHAT happened
Exact timestamp, metric value, severity, top query shape running at that minute, partition state from the scanner cache. Deterministic: zero AI cost.
WHY (most likely cause)
AI looks at the structured context and writes the single most probable root cause + the evidence in the data that supports it. Speaks the DBA's language. Cached by signature.
HOW to fix
Quick mitigation (5 min) + permanent fix (this week), each linked to the scanner pane that owns the apply command. One click and you're in the right place.
The DBA picks the AI model per analysis
Routine cost incident? Use gpt-4o-mini ($0.001). High-severity production outage? Switch to Claude Sonnet ($0.018) for the same incident. Provider picker is inside the drawer, with cost shown upfront, no surprises.
OpenAI gpt-4o-mini ($0.001)
Anthropic Claude Sonnet ($0.018)
Google Gemini ($0.001)
Groq Llama ($0.002)
Ollama (local · free)
Results cached by incident signature × provider × locale (4h TTL): the same incident clicked twice means zero re-charge.
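The caching rule above can be sketched as a small TTL map keyed by the (signature, provider, locale) triple. This is an illustration only: `NarrationCache`, `narrate`, and the `call_model` callback are hypothetical names, and how the product computes an incident signature is not described here.

```python
import time
from typing import Callable, Optional

TTL_SECONDS = 4 * 60 * 60  # 4h TTL, as stated above


class NarrationCache:
    """In-memory cache keyed by (incident signature, provider, locale)."""

    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (stored_at, narration text)

    def get(self, signature: str, provider: str, locale: str) -> Optional[str]:
        key = (signature, provider, locale)
        hit = self._entries.get(key)
        if hit is None:
            return None
        stored_at, text = hit
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop and report a miss
            return None
        return text

    def put(self, signature: str, provider: str, locale: str, text: str) -> None:
        self._entries[(signature, provider, locale)] = (self._clock(), text)


def narrate(cache: NarrationCache, signature: str, provider: str, locale: str,
            call_model: Callable[[str, str, str], str]) -> str:
    """Return the cached narration, calling the (paid) model only on a miss."""
    cached = cache.get(signature, provider, locale)
    if cached is not None:
        return cached
    text = call_model(signature, provider, locale)
    cache.put(signature, provider, locale, text)
    return text
```

Because provider and locale are part of the key, switching the same incident from one model to another is a separate entry and a separate charge, while a repeat click within the TTL is served from memory.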
Forecasts that warn BEFORE the breach
Linear regression over the last 3 minutes of telemetry, R²-tagged for confidence, severity-scaled by ETA.
| Forecast | ETA | Detail |
|---|---|---|
| ru-saturation-eta | ~14 min | RU/s growing at 8.2/s; will saturate in ~14 min if trend continues |
| throttle-trending-up | in progress | Throttle events climbing: 23 in last 5 min, rate +0.04/s² |
| partition-skew-growing | ~3.2 hours | Top partition share growing (now 52%); will cross 70% in ~3.2 hours |
| storage-saturation-eta | 12 days | Storage at 78%; extrapolated to hit 90% in 12 days at current growth |
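To make the forecasting idea concrete, here is a minimal sketch of a least-squares fit with an R² score and a time-to-ceiling extrapolation. The function names and the handling of flat or falling trends are assumptions for illustration; the product's actual windowing and severity scaling are not shown.

```python
from typing import Optional, Sequence


def fit_line(ts: Sequence[float], ys: Sequence[float]) -> tuple:
    """Ordinary least squares. Returns (slope, intercept, r_squared)."""
    n = len(ts)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (t, y) pairs of equal length")
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    if sxx == 0:
        raise ValueError("timestamps must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(ts, ys))
    # A perfectly flat series has no variance to explain; treat the fit as exact.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r2


def eta_seconds(ts: Sequence[float], ys: Sequence[float],
                ceiling: float) -> Optional[float]:
    """Seconds after the last sample until the fitted line reaches `ceiling`.

    Returns 0.0 if the fitted value is already at or above the ceiling,
    and None if the trend is flat or falling.
    """
    slope, intercept, _ = fit_line(ts, ys)
    current = intercept + slope * ts[-1]
    if current >= ceiling:
        return 0.0
    if slope <= 0:
        return None
    return (ceiling - current) / slope
```

For instance, a series rising by 8.2 RU/s every second that currently sits 6,888 RU/s below its ceiling extrapolates to 6,888 / 8.2 = 840 s, i.e. the "~14 min" in the first row. A low R² on the same window would be a reason to trust that ETA less.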
Policies the DBA defines, fired automatically
Each Cosmos connection has its own policies. Triggers (threshold + duration + ns pattern) bound to actions (in-app notify, Slack webhook, generic webhook, pre-stage scale-up, pre-stage path exclusion, AI analyze).
Throttle storm: 5/min
When throttle events exceed 5/min, sustained for 2 min, fires an in-app notification and pre-stages a +50% scale-up (requires DBA approval).
RU saturation: 85%
When RU/s stays above 85% of the observed ceiling for 5 min, notifies with a suggested autoscale switch.
Latency p99: 500ms
When p99 exceeds 500ms for 1 min, notifies and triggers AI analysis on the slow query path.
Partition skew: top > 50%
When one partition holds > 50% of docs, pre-stages a re-partition plan (approval required).
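The "threshold + duration" shape shared by these triggers can be sketched as a check that the metric has stayed above the threshold, without a dip, for the required time up to the latest sample. `sustained_breach` is a hypothetical helper written for illustration, not the product's policy engine.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float, duration_s: float) -> bool:
    """True if the value has been above `threshold` continuously for at
    least `duration_s` seconds, ending at the most recent sample.

    `samples` are (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    last_ts = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # a new breach run begins here
        else:
            breach_start = None    # any dip resets the run
        last_ts = ts
    if breach_start is None or last_ts is None:
        return False
    return last_ts - breach_start >= duration_s
```

With the latency policy's numbers, `sustained_breach(p99_samples, threshold=500.0, duration_s=60.0)` would only turn true once p99 has stayed over 500ms for a full minute, so a single slow second does not fire anything.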
Every fire is deduped per (policy × ns × 5 min bucket), so no alert fatigue. Snooze 1h with one click. Audit log preserves every fire with the metric value, the policy that triggered, and what action was queued.
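The dedup key described here can be sketched by flooring the fire timestamp into a 5-minute bucket and remembering which (policy, ns, bucket) triples have already fired. `FireDeduper` and its method names are illustrative assumptions.

```python
BUCKET_SECONDS = 5 * 60  # 5 min bucket, as stated above


class FireDeduper:
    """Allows at most one fire per (policy, namespace, time bucket)."""

    def __init__(self, bucket_s: int = BUCKET_SECONDS) -> None:
        self._bucket_s = bucket_s
        self._seen = set()

    def should_fire(self, policy_id: str, ns: str, ts: float) -> bool:
        # Timestamps in the same 5-minute window map to the same bucket index.
        key = (policy_id, ns, int(ts // self._bucket_s))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A long-running process would also want to evict old buckets from `_seen`; that housekeeping is left out to keep the sketch short.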
Post-mortem yesterday's incident in 30 seconds
One click on "Start recording" and every snapshot persists to the same MongoDB management connection mongostat/mongotop already use. Open Job Manager → click any "Rec Cosmos Ops" job → the unified Timeline scrubber lets you replay the cluster's behavior alongside the MongoDB recordings from the same window.
Nobody else correlates Cosmos + MongoDB on one scrubber.
Why this is uncopyable by Datadog / Grafana / Azure Monitor
| Capability | Operations Center | Datadog | Grafana | Azure Monitor |
|---|---|---|---|---|
| Realtime cluster charts | โ | โ | โ | โ |
| Click spike → root cause walk | โ | โ link only | โ | โ link only |
| AI narration in your language | โ BYO LLM | โ closed AI | โ | โ closed AI |
| Pick AI model per incident | โ | โ | โ | โ |
| Apply command pre-staged + Shadow-validated | โ | โ | โ | โ |
| Cosmos $indexStats / GetPartitionStats correlated | โ | โ | โ | โ |
| Cross-correlate with MongoDB monitoring | โ | โ | โ if both ingested | โ |
| Price | $99-499/mo | $15+/host/mo | self-host | per-GB ingested |
Stop scavenging templates. Open the Operations Center and see.
NoSqlStudio for Cosmos DB is free to try: no card, no signup. Open any Cosmos connection, click Operations Center, watch the charts come alive.