Roadmap#
The goal of materialize-monitoring is first-class, opt-in observability for self-managed Materialize — logs, metrics, events, and alerts — for customers who want a one-stop-shop, without forcing our stack on customers who already run their own.
This page is the current source of truth for what is built, what is in flight, and what is planned next.
It supersedes the original Linear project (internal), which captured an earlier architecture that has since diverged (see below).
How this maps to the original plan#
The May 2026 plan assumed a different shape than what was actually built. Where docs, tickets, or comments disagree with the repo, the repo wins and this page records why:
| Original plan (May 2026) | As built |
|---|---|
Grafana dashboards via Jsonnet/Grafonnet (sources/jsonnet/) | Python + grafana-foundation-sdk + py-mzmon-lib (packages/grafana-dashboards/) |
crates/ Rust workspace + sources/ input tree | packages/ monorepo where Rust (mzmon-lib) and Python coexist |
| Datadog dashboards via a Datadog Rust SDK | Not pursued; OTLP-forward is the export path (see Pipelines) |
Four fixed profiles, incl. a datadog-agent profile | Profile set deliberately left open; no datadog-agent profile |
Cadence and milestones#
Releases track a monthly cadence aligned to the 15th.
| Milestone | Date/Coverage | Deliverables |
|---|---|---|
| Milestone 1 (M1) | June 15 (baseline) | env-top overview dashboard (Summary, Kubernetes, Cluster, Connections, Compute, Storage — including Hydration / Freshness / Sources / Sinks summaries); cloud ↔ self-managed convergence via $sqlMetricPrefix; typed Alloy agent pipeline; ScrapeConfigs + ServiceMonitors for metric collection (synced to charts and docs); Hugo docsite; pre-commit suite; Grafana dashboard v1/v2 API support |
| Milestone 2 (M2) | July 15 — required | Native OTLP exporter support; productionalized stack (Thanos + Loki + Alloy); product observability documentation fully replaced; Logs & Events + Networking dashboards; Grafana 11 (dashboard v1) parity for publicly hosted dashboards (Grafana public dashboards gallery); Helm subchart bundling; renovate for dependency bumps |
| Milestone 3 (M3) | July 15 — stretch | Day 2 drilldown dashboards: Hydration, Freshness, Sources, Sinks (⛓️ gated on upstream Tier 2 instrumentation) |
| Milestone 4 (M4) | August 15+ | Day 1 dashboards (Dependencies, Sizing); Day 2 ops dashboards (upgrades, resizing, changing sources/destinations, managing users); Tier 2 upstream metric instrumentation; Helm completeness tail; profile-set finalization; Terraform wrapper; v2 items (BYOC, trace correlation, Polar Signals, formal deprecation policy) |
Item tables below reference milestones by number (M1–M4); the dates live only in the table above.
Status legend#
- ✅ Done · 🔨 In progress · ⬜ Planned
- ⛓️ Blocked on an upstream metric-contract dependency (see Metrics contract)
Workstreams#
Dashboards#
The env-top overview is shipped and carries the cloud ↔ self-managed convergence work.
Grafana 11 (dashboard v1) parity is a hard requirement for publicly hosted dashboards — the dashboard sources must continue to render against the v1 dashboard API, not only newer versions — so that the dashboards can be managed in the Grafana public dashboards gallery.
| Item | Milestone | Status |
|---|---|---|
env-top overview (6 tabs, incl. Hydration/Freshness/Sources/Sinks summaries) | M1 | ✅ |
Cloud ↔ self-managed convergence ($sqlMetricPrefix) | M1 | ✅ |
| Improved Grafana 11 (dashboard v1) support for the public dashboards gallery | M2 | 🔨 |
| Logs & Events (requires Loki + Alloy + logs) | M2 | ⬜ |
| Networking | M2 | ⬜ |
| Hydration Drilldown | M3 | ⛓️ |
| Freshness Drilldown | M3 | ⛓️ |
| Sources Drilldown | M3 | ⛓️ |
| Sinks Drilldown | M3 | ⛓️ |
| Dependencies (Day 1: are Materialize + o11y requirements satisfied?) | M4 | ⬜ |
| Sizing | M4 | ⬜ |
We weight Day 2 operations over Day 1: upgrades, resizing, changing sources, changing external destinations, and managing users are the operations that matter most for a running deployment. Day 2 ops dashboards covering these land in the M4 window.
Pipelines (Alloy)#
Alloy carries both metrics and logs. The agent pipeline is in place; the gateway pipeline and the OTLP export path are the near-term work.
| Item | Milestone | Status |
|---|---|---|
| Typed Alloy agent pipeline | M1 | ✅ |
| Native OTLP exporter (forwarding workflows evaluated for Honeycomb, Datadog, Google Cloud Observability) | M2 | ⬜ |
| Gateway pipeline (port the real processor from the Python reference) | M2 | 🔨 |
| Loki (logs) + Thanos (metrics) wiring | M2 | ⬜ |
Scraping (ScrapeConfigs & ServiceMonitors)#
Metric collection is configured through two surfaces: ScrapeConfigs (consumed manually, e.g. dropped into a Prometheus/Agent config) and ServiceMonitors (to be consumed by prometheus-operator, or by Alloy via prometheus.operator.servicemonitor).
Both are needed ASAP — they were targeted for the M1 baseline — and must be synced into the charts and the docs.
| Item | Milestone | Status |
|---|---|---|
| ScrapeConfigs (consumed manually) | M1 (ASAP) | ✅ |
ServiceMonitors (to be consumed by prometheus-operator or Alloy prometheus.operator.servicemonitor) | M1 (ASAP) | 🔨 |
| Sync ScrapeConfigs + ServiceMonitors into the charts and docs | M1 (ASAP) | 🔨 |
Move ServiceMonitors to the materialize-operator Helm chart | M4 | ⬜ (long-term) |
Long term, ServiceMonitors belong in the materialize-operator Helm chart rather than here.
This repo carries them now to fill the gap, with the intent to hand them off once the operator owns that surface.
Charts / Helm#
Helm is prioritized over Terraform. The umbrella chart loads pre-rendered artifacts and bundles the productionalized stack as subcharts.
| Item | Milestone | Status |
|---|---|---|
| Subchart bundling: Loki, Thanos, Alertmanager, Grafana, kube-state-metrics | M2 | ⬜ |
Distroless Alloy image + pre-install/pre-upgrade alloy fmt validation hook | M2 | ⬜ |
helm-readme-sync (values.yaml → generated README) | M2 | ⬜ |
| Terraform wrapper module (pins a chart version, own cadence) | M4 | ⬜ |
Rules & alerts#
| Item | Milestone | Status |
|---|---|---|
| Base alert set (severity profiles + runbook stubs) | M2 | ⬜ |
| Loki / Thanos rule sets (recording rules first-class) | M2 | ⬜ |
Profiles#
The profile set is deliberately not finalized — it is a stub until the common deployment shapes settle.
There is no datadog-agent profile; OTLP forwarding is the export path.
Profile finalization is an M4 activity.
Testing / CI & DevEx#
| Item | Milestone | Status |
|---|---|---|
| Pre-commit suite (ruff, pyright, shellcheck, yamllint, cargo fmt, helm-docs) | M1 | ✅ |
renovate for automated dependency bumps | M2 | ⬜ |
| Synthetic-data end-to-end smoke test (metrics flow through the chart) | M4 | ⬜ |
| kind / ArgoCD / FluxCD CI matrix | M4 | ⬜ (very low priority) |
Adoption / productionalization#
The M2 target is a productionalized deployment for Cloud, an internal team, and initial external adopters. (Specific adopter commitments are tracked out-of-band, not in this public roadmap.)
| Item | Milestone | Status |
|---|---|---|
| Productionalized for Cloud + internal + initial external adopters | M2 | 🔨 |
| Product observability documentation fully replaced (rewrite the recommended path; migration guide off the legacy SQL-exporter surface) | M2 | ⬜ |
Internal monitoring migrated to consume this repo via values.yaml | M4 | ⬜ |
| Fork source repo and archive the original | M4 | ⬜ |
Metrics contract (upstream dependency)#
Several dashboards depend on metric instrumentation that lives upstream in the materialize repo, not in this repository.
The metric/label contract is the public API for everything here, so this dependency shapes the dashboard roadmap directly.
The environmentd-native public metrics endpoint delivered Tier 1 (pre-aggregating clusterd counters into environmentd).
The carry-over is Tier 2: roughly 39 signal families that today exist only via the SQL-on-scrape sources slated for deletion (legacy /metrics/mz_* and the v2_mz_* exporter).
To retire those sources, environmentd must emit these natively.
High-leverage asks, in priority order:
mz_object_info(id → fully-qualified name → type) — the single highest-leverage item. It gives every other metric a stablegroup_leftjoin target for names.- A family of
_infometrics (mz_cluster_info,mz_replica_info,mz_source_info,mz_sink_info, …) carrying names and parent-id references. - Native source/sink status metrics (no genuine source exists today).
- Native hydration and frontier/freshness signals.
- Label-family harmonization (short vs long vs very-long forms).
The Hydration / Freshness / Sources / Sinks drilldowns above are ⛓️ gated on this work, which is why they are M3 (stretch) rather than M2 (required).
Release and changelog strategy#
Mechanics live in Releasing; the strategy is:
- Release unit: the umbrella Helm chart, on the monthly 15th cadence.
- Versioning keyed to the contract.
Chart SemVer reflects changes to the customer-facing surface — labels, metric names, profile semantics, alert names, dashboard JSON structure — not internal churn.
Subchart bumps flow through
Chart.lock(renovate-assisted). - Deprecation policy. Any breaking change to that surface gets at least one minor-release deprecation cycle. The release process check requires a PR touching the customer-facing surface to either follow the policy or carry an explicit justification. The label/metric contract is load-bearing for every customer dashboard built on top of it; the time to establish this discipline is before broad adoption, not after.
- Changelog.
A
changelog.mdin keep-a-changelog style, with a called-out “Customer-facing surface” subsection per release so consumers can scan for contract-affecting changes at a glance. - Downstream pinning. The Terraform wrapper pins a specific chart version with its own update cadence, so Terraform never tracks a moving chart target. Helm-first; the wrapper is an M4 item.
Follow-up documentation#
Tracked, not detailed here:
- Flesh out Releasing with the release mechanics.
- Create
changelog.md. - Write the contract
versioning.md(deprecation policy, in customer-facing terms). - Refresh Repo Layout as the layout evolves.