🎯 Unique in the market
Killer feature

Cosmos Operations Center: realtime, predictive, AI-driven

The DBA dashboard that tells you what to fix before the customer calls. Realtime charts with anomaly detection; click any spike to drill into an AI-narrated root cause walk. Predictive ETAs warn before saturation, configurable policies fire automatically, and a full recordable timeline lets you post-mortem yesterday’s incident in 30 seconds.

What it shows you, the moment you open it

4 realtime charts (1s tick)

RU/s consumed · Throttle 429 events · Latency p99 · Error count. All four update every second from the Query Cost ring buffer. The last hour is kept in memory.

Red dots = anomalies, clickable

The anomaly detector flags throttle storms, RU saturation, latency spikes, and partition skew the moment they happen. Each anomaly is a red dot pinned to its exact timestamp.

Predictive alerts banner

Linear regression over the last 3 min tells you "RU will saturate in ~14 min if trend continues", which is actionable BEFORE the breach, not after.

Recording toggle (history-store)

One click and every snapshot is persisted to the same MongoDB management instance you configured for mongostat/mongotop. Job Manager shows it alongside the others.
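The "last hour in memory at a 1s tick" idea can be sketched as a fixed-size ring buffer. This is only an illustration under stated assumptions: the `Sample` fields mirror the four charts, while `MetricRing` and its method names are hypothetical and not the product's actual Query Cost ring buffer.

```python
from collections import deque
from dataclasses import dataclass

WINDOW_SECONDS = 3600  # one hour of samples at a 1 s tick


@dataclass(frozen=True)
class Sample:
    ts: int                 # Unix timestamp in seconds
    ru_per_s: float         # RU/s consumed
    throttles_429: int      # throttle (HTTP 429) events in this tick
    latency_p99_ms: float   # p99 latency in milliseconds
    errors: int             # error count in this tick


class MetricRing:
    """Hypothetical ring buffer: keeps only the newest `capacity` samples."""

    def __init__(self, capacity: int = WINDOW_SECONDS) -> None:
        # deque(maxlen=...) silently drops the oldest item when full.
        self._buf: deque = deque(maxlen=capacity)

    def push(self, sample: Sample) -> None:
        self._buf.append(sample)

    def last(self, count: int) -> list:
        """Return up to `count` most recent samples, oldest first."""
        if count <= 0:
            return []
        return list(self._buf)[-count:]

    def __len__(self) -> int:
        return len(self._buf)
```

With a 1s tick, `ring.last(180)` would return the 3-minute window that the predictive alerts regress over.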

Click → drill → fix

Every red dot opens an AI-narrated Root Cause Walk

Click the anomaly → a side drawer slides in with three stops, each backed by deterministic context (Query Cost ring buffer + cross-scanner caches) + optional AI narration in your language.

1. WHAT happened

Exact timestamp, metric value, severity, top query shape running at that minute, partition state from the scanner cache. Deterministic, zero AI cost.

2. WHY (most likely cause)

AI looks at the structured context and writes the single most probable root cause + the evidence in the data that supports it. Speaks the DBA’s language. Cached by signature.

3. HOW to fix

Quick mitigation (5min) + permanent fix (this week), each linked to the scanner pane that owns the apply command. One click and you’re in the right place.

BYO LLM: pick per incident

The DBA picks the AI model per analysis

Routine cost incident? Use gpt-4o-mini ($0.001). High-severity production outage? Switch to Claude Sonnet ($0.018) for the same incident. The provider picker is inside the drawer, with the cost shown upfront: no surprises.

// Inside the Incident Drawer
Analyze with: [ Anthropic Claude Sonnet ▼ ] $0.018
├ OpenAI gpt-4o ($0.012)
├ OpenAI gpt-4o-mini ($0.001)
├ Anthropic Claude Sonnet ($0.018) ✓
├ Google Gemini ($0.001)
├ Groq Llama ($0.002)
└ Ollama (local · free)
[ Run analysis ($0.018) → ]

Results cached by incident signature × provider × locale (4h TTL): same incident clicked twice = zero re-charge.
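A cache keyed this way might look like the sketch below. The three-part key and the 4h TTL come from the text above; how the incident signature is actually derived is not documented here, so `incident_signature` and its inputs are purely hypothetical, as is the `AnalysisCache` class.

```python
import hashlib
import time
from typing import Callable

TTL_SECONDS = 4 * 3600  # 4h TTL, as stated above


def incident_signature(kind: str, namespace: str, query_shape: str) -> str:
    """Hypothetical signature: a stable hash of what identifies an incident."""
    raw = "|".join((kind, namespace, query_shape))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class AnalysisCache:
    """Caches AI analyses per (signature, provider, locale) until the TTL expires."""

    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict = {}  # key -> (stored_at, result)

    def get_or_run(self, signature: str, provider: str, locale: str,
                   run: Callable[[], str]) -> str:
        key = (signature, provider, locale)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # cache hit: the paid call is skipped
        result = run()         # cache miss or expired: pay for one analysis
        self._entries[key] = (now, result)
        return result
```

Because the provider and locale are part of the key, switching from gpt-4o-mini to Claude Sonnet for the same incident triggers a fresh analysis, while re-clicking with the same selection does not.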

Predictive

Forecasts that warn BEFORE the breach

Linear regression over the last 3 minutes of telemetry, R²-tagged for confidence, severity-scaled by ETA.

ru-saturation-eta: ~14 min

RU/s growing at 8.2/s; will saturate in ~14 min if trend continues

throttle-trending-up: in progress

Throttle events climbing; 23 in last 5min, rate +0.04/s²

partition-skew-growing: ~3.2 hours

Top partition share growing (now 52%); will cross 70% in ~3.2 hours

storage-saturation-eta: 12 days

Storage at 78%; extrapolated to hit 90% in 12 days at current growth
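The forecasts above all follow the same pattern: fit a line to recent samples, report R² as confidence, and extrapolate to a limit. A minimal sketch using ordinary least squares, assuming evenly spaced samples; `forecast_eta` and the `Forecast` type are illustrative names, and the ceiling value is supplied by the caller rather than taken from any documented API.

```python
from typing import NamedTuple, Optional, Sequence


class Forecast(NamedTuple):
    slope: float                   # metric units per second
    r_squared: float               # goodness of fit, 0..1
    eta_seconds: Optional[float]   # None when the trend never reaches the ceiling


def forecast_eta(values: Sequence[float], ceiling: float,
                 tick_s: float = 1.0) -> Forecast:
    """Least-squares line through `values`, extrapolated to `ceiling`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    xs = [i * tick_s for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_tot = sum((y - mean_y) ** 2 for y in values)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, values))
    # R² is undefined for a perfectly flat series; report 0 (no trend).
    r_squared = 0.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot

    fitted_now = intercept + slope * xs[-1]
    if fitted_now >= ceiling:
        eta = 0.0        # already at or past the limit
    elif slope <= 0:
        eta = None       # flat or falling: no breach on this trend
    else:
        eta = (ceiling - fitted_now) / slope
    return Forecast(slope, r_squared, eta)
```

For the first example above, a series growing at 8.2 RU/s per second with 6,888 RU/s of headroom gives 6888 / 8.2 = 840 seconds, which is the ~14 min shown. A low R² would be a reason to downgrade or suppress the alert.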

Configurable per-instance

Policies the DBA defines, fired automatically

Each Cosmos connection has its own policies. Triggers (threshold + duration + namespace pattern) are bound to actions (in-app notify, Slack webhook, generic webhook, pre-stage scale-up, pre-stage path exclusion, AI analyze).

Throttle storm: 5/min

When throttle events exceed 5/min, sustained for 2min, fires an in-app notification + pre-stages a +50% scale-up (requires DBA approval).

RU saturation: 85%

When RU/s crosses 85% of observed ceiling for 5min, notifies with a suggested autoscale switch.

Latency p99: 500ms

When p99 exceeds 500ms for 1min, notifies + triggers AI analysis on the slow query path.

Partition skew: top > 50%

When one partition holds > 50% of docs, pre-stages a re-partition plan (approval required).
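Each policy above pairs a threshold with a duration. One way to evaluate that is sketched below, assuming "sustained" means every sample in the trailing window exceeds the threshold; that reading, along with the `Trigger` and `is_sustained` names, is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trigger:
    threshold: float   # e.g. 500.0 for "p99 exceeds 500ms"
    duration_s: int    # e.g. 60 for "for 1min"


def is_sustained(trigger: Trigger, samples: Sequence[float],
                 tick_s: int = 1) -> bool:
    """True when every sample in the trailing window is above the threshold.

    `samples` is ordered oldest to newest, one value per tick.
    """
    needed = trigger.duration_s // tick_s
    if needed <= 0 or len(samples) < needed:
        return False   # not enough history to call it sustained
    return all(value > trigger.threshold for value in samples[-needed:])
```

With this reading, a single sample dipping below the threshold resets the condition, which keeps brief spikes from firing a policy.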

Every fire is deduped per (policy × ns × 5min bucket), so there is no alert fatigue. Snooze 1h with one click. The audit log preserves every fire with the metric value, the policy that triggered, and what action was queued.
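The dedup rule can be sketched as a set of (policy, namespace, bucket) keys, where the bucket is the timestamp divided into 5-minute slots. `FireDeduper` is a hypothetical name, and the policy id and namespace strings in the usage note are made up for illustration.

```python
BUCKET_SECONDS = 5 * 60  # 5min bucket, as stated above


class FireDeduper:
    """Allows at most one fire per (policy, namespace, 5-minute bucket)."""

    def __init__(self) -> None:
        # A long-running process would also need to prune old buckets.
        self._seen: set = set()

    def should_fire(self, policy_id: str, ns: str, ts: float) -> bool:
        key = (policy_id, ns, int(ts) // BUCKET_SECONDS)
        if key in self._seen:
            return False   # already fired for this policy and namespace in this bucket
        self._seen.add(key)
        return True
```

For example, `should_fire("throttle-storm", "shop.orders", 1000)` and a second call at `1100` fall in the same bucket, so only the first returns `True`; a call at `1200` starts a new bucket and fires again.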

Recordable

Post-mortem yesterday’s incident in 30 seconds

One click on "Start recording" → every snapshot persists to the same MongoDB management connection mongostat/mongotop already use. Open Job Manager → click any "Rec Cosmos Ops" job → unified Timeline scrubber lets you replay the cluster’s behavior alongside the MongoDB recordings from the same window.

Timeline Replay · 2026-05-26 14:00 → 15:30
[|---------●---------------|] 14:23 ← scrubber
Cosmos RU/s ▁▂▃▄▅█▆▅▃▂▁▂ ← peak @ 14:23 ⚠
Cosmos 429s ▁▁▁▁▁█▁▁▁▁
MongoStat ops ▃▄▅▄▅▄█▅▄▃▄ ← spike @ 14:23 (correlated)
CurrentOp ▁▁▂▃▆█▇▆▃▂ ← long-running queries
AI insight @ 14:23: "Throttle storm coincided with MongoTop collScan spike on prod-mongo replicaset. Same query path running against both clusters during failover test."

Nobody else correlates Cosmos + MongoDB on one scrubber.

Why this is uncopyable by Datadog / Grafana / Azure Monitor

Capability | Operations Center | Datadog | Grafana | Azure Monitor
Realtime cluster charts | ✅ | ✅ | ✅ | ✅
Click spike → root cause walk | ✅ | ⚠ link only | ❌ | ⚠ link only
AI narration in your language | ✅ BYO LLM | ⚠ closed AI | ❌ | ⚠ closed AI
Pick AI model per incident | ✅ | ❌ | ❌ | ❌
Apply command pre-staged + Shadow-validated | ✅ | ❌ | ❌ | ❌
Cosmos $indexStats / GetPartitionStats correlated | ✅ | ❌ | ❌ | ⚠
Cross-correlate with MongoDB monitoring | ✅ | ❌ | ⚠ if both ingested | ❌
Price | $99-499/mo | $15+/host/mo | self-host | per-GB ingested

Stop scavenging templates. Open the Operations Center and see.

NoSqlStudio for Cosmos DB is free to try: no card, no signup. Open any Cosmos connection, click Operations Center, watch the charts come alive.