Architecture Overview
This section provides an overview of the architecture of materialize-monitoring and its components.
materialize-monitoring Helm Umbrella Chart
An Umbrella Helm Chart is a Helm Chart that orchestrates the installation of multiple dependent charts.
The materialize-monitoring Helm Chart is an Umbrella Chart that orchestrates the installation of the following dependent charts:
- alloy-agent (Grafana Alloy, Agent DaemonSet): o11y Pipelines
- alloy-gateway (Grafana Alloy, Gateway Deployment): o11y Pipelines
- metrics-server (metrics-server): cAdvisor/container runtime Metrics
- kube-state-metrics (kube-state-metrics): Kubernetes Metrics
- node-exporter (node-exporter): Node Metrics
- loki (Grafana Loki): Default Logging Infrastructure
- thanos (Thanos): Default Metrics Storage and Querying Infrastructure
- grafana (Grafana): Default Dashboarding and Visualization Infrastructure
- grafana-operator (Grafana Operator): Dashboards-as-Code Infrastructure
- alertmanager (Prometheus Alertmanager): Default Alerting Infrastructure
In addition to these dependent charts, materialize-monitoring also provides many opinionated configurations, such as o11y pipelines, Grafana dashboards, scrape configurations, and Prometheus recording and alerting rules.
alloy-agent: Grafana Alloy Agent DaemonSet
alloy-agent is a Grafana Alloy Agent DaemonSet that runs on every node in the cluster and is responsible for collecting logs from the node and forwarding them to the alloy-gateway.
alloy-gateway: Grafana Alloy Gateway Deployment
alloy-gateway is a Grafana Alloy Gateway Deployment that is responsible for the main observability pipeline processing and forwarding.
Logging responsibilities of alloy-gateway include:
- A loki.source.api component receives logs from alloy-agent and processes them as logs.
- A loki.source.kubernetes_events component collects Kubernetes events and processes them as logs.
- A loki.process pipeline performs log processing.
- A loki.write component forwards logs to log storage (e.g., Grafana Loki).
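To make the agent-to-gateway hand-off more concrete, here is a minimal sketch of a request body in the Loki push API format, which is the format loki.source.api accepts on its /loki/api/v1/push endpoint. The helper build_push_payload and the label names are hypothetical and are not part of materialize-monitoring.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON-serializable body in the Loki push API format.

    labels: dict of stream labels (hypothetical names in the example below).
    lines:  log lines that share those labels.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is a [timestamp, line] pair; the timestamp is a string of
    # nanoseconds since the Unix epoch.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"namespace": "materialize", "node": "node-a"},  # hypothetical labels
    ["starting up", "ready"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)  # this string would be the HTTP POST body
```

In practice alloy-agent produces and sends these requests itself; the sketch only illustrates the shape of the data that crosses the boundary.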
Metrics responsibilities of alloy-gateway include:
- prometheus.operator.servicemonitors and prometheus.operator.podmonitors components read ServiceMonitors and PodMonitors to determine which targets to scrape for metrics, and then scrape those targets.
- A prometheus.enrich pipeline performs metric processing and enrichment on scraped metrics.
- A prometheus.remote_write component forwards metrics to metric storage (e.g., Thanos).
- An otelcol.exporter.otlp component supports forwarding metrics and logs to an external OTLP endpoint (e.g., Honeycomb, Datadog, New Relic).
Alloy supports further customization to integrate with an existing observability infrastructure.
metrics-server: Container Metrics API
metrics-server is a Kubernetes Metrics API implementation that collects resource usage metrics from the kubelet on each node and exposes them via the Kubernetes Metrics API.
Note that metrics-server is primarily intended for components that make decisions based on current resource usage (such as the Horizontal Pod Autoscaler) and does not store historical metrics data.
Nonetheless, Materialize relies on cluster-local metrics about its containers, so metrics-server is required in order to avoid relying on external metrics sources for this data.
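As a rough illustration of the data metrics-server exposes, the sketch below parses a document shaped like the Metrics API's PodMetricsList (served under /apis/metrics.k8s.io/v1beta1). The sample document, the pod and container names, and the helper functions are hypothetical, and only a subset of Kubernetes quantity suffixes is handled.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; extend as needed.
_CPU_SUFFIXES = {"n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3")}
_MEM_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu(quantity):
    """Return CPU usage in cores, e.g. "250m" is a quarter of a core."""
    for suffix, factor in _CPU_SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)


def parse_memory(quantity):
    """Return memory usage in bytes, e.g. "64Mi" is 64 * 1024**2 bytes."""
    for suffix, factor in _MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def container_usage(pod_metrics_list):
    """Map (pod, container) to (cores, bytes) for a PodMetricsList dict."""
    usage = {}
    for item in pod_metrics_list["items"]:
        pod = item["metadata"]["name"]
        for container in item["containers"]:
            usage[(pod, container["name"])] = (
                parse_cpu(container["usage"]["cpu"]),
                parse_memory(container["usage"]["memory"]),
            )
    return usage


# Hypothetical sample response.
sample = {
    "kind": "PodMetricsList",
    "items": [
        {
            "metadata": {"name": "example-pod", "namespace": "materialize"},
            "containers": [
                {"name": "main", "usage": {"cpu": "250m", "memory": "64Mi"}}
            ],
        }
    ],
}
usage = container_usage(sample)
```

The values are point-in-time readings, which matches the note above that metrics-server does not keep history.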
kube-state-metrics: Kubernetes Metrics
kube-state-metrics is a service that listens to the Kubernetes API server and generates metrics about the state of the objects in the cluster (e.g., Deployments, Pods, Services, etc.).
This does not provide information about resource usage of individual containers.
node-exporter: Node Metrics
node-exporter is a Prometheus exporter that collects hardware and OS metrics from the nodes in the cluster.
loki: Grafana Loki
Grafana Loki is a fully functional log aggregation system.
Loki is included in materialize-monitoring as its default logging backend.
Refer to Loki Architecture for more details on the architecture of Loki.
The Loki Write path includes:
- A loki-write statefulset that receives logs.
  - The Distributor subcomponent receives logs and distributes them to the Ingester subcomponents.
  - The Ingester subcomponent processes incoming logs and writes them to storage. It can also serve recent logs for queries.
The Loki Read path includes:
- An optional loki-query-frontend deployment that runs the Query Frontend.
  - The Query Frontend subcomponent receives queries and performs query splitting and fan-out to the Querier subcomponents. It may consult the Index Gateway for query sharding.
- A loki-read scalable deployment that receives queries via the Loki API and reads the matching logs from storage.
  - The Querier subcomponent handles LogQL queries. It talks to Ingesters for recent logs and to the storage layer for historical logs.
- Additional cache components can be used for query performance (chunks-cache, results-cache).
The other parts of the Loki Backend include:
- A loki-gateway deployment serves metadata queries.
  - The Index Gateway subcomponent maintains an index of log metadata.
- A loki-backend statefulset runs backend components.
  - The Compactor subcomponent compacts log data in the storage layer to optimize for cost and performance. It also handles retention and deletion of older logs.
  - The Ruler subcomponent evaluates alerting and recording rules against incoming logs.
Loki writes its data to object storage (e.g., S3, GCS, Azure Blob Storage, etc.) for long-term storage and scalability.
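A request on the read path can be sketched as a call to Loki's /loki/api/v1/query_range HTTP endpoint. The example below only builds the URL and sends nothing; the base address, the label, and the filter string are hypothetical placeholders, not values defined by materialize-monitoring.

```python
from urllib.parse import urlencode


def loki_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": str(start_ns),  # nanosecond Unix timestamps
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


url = loki_query_range_url(
    "http://loki-read.example.svc:3100",  # hypothetical address
    '{namespace="materialize"} |= "error"',  # hypothetical label and filter
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_060_000_000_000,
)
```

Depending on the requested time range, a query like this is answered from the Ingesters, from object storage, or from both, as described in the read path above.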
thanos: Thanos
Thanos is a highly available Prometheus setup with long-term storage capabilities.
Thanos is included in materialize-monitoring as its default metrics storage and querying backend.
Refer to Thanos Design for more details on the architecture of Thanos.
The Thanos Receiver path includes:
- A thanos-receive statefulset that receives metrics in Prometheus Remote Write format.
  - The Shipper subcomponent writes metrics to the object storage layer.
  - The Store API subcomponent provides an API for querying recent metrics.
The Thanos Query path includes:
- An optional thanos-query-frontend deployment that provides a caching and fan-out layer for queries.
- A thanos-query scalable deployment that receives queries.
  - The Query API subcomponent handles PromQL queries.
  - The Store API component is used for internal gRPC communication between components.
- A thanos-storegateway deployment that serves metrics from the object storage layer.
Additional components include:
- A thanos-compactor singleton deployment that operates against the storage layer to compact, manage retention, and downsample metrics.
- A thanos-ruler deployment that runs the Ruler component for alerting and recording rules.
  - The Ruler subcomponent evaluates alerting and recording rules against incoming metrics.
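Thanos Query serves the Prometheus HTTP API, so a client reading from it receives Prometheus-style JSON. The following is a minimal sketch of flattening an instant-query response; the response body shown is a hypothetical sample, and vector_to_samples is an illustrative helper rather than part of any library.

```python
def vector_to_samples(response):
    """Flatten an instant-query response into (labels, timestamp, value) tuples."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    samples = []
    for series in data["result"]:
        timestamp, raw_value = series["value"]
        # Sample values arrive as strings, which also allows NaN and Inf.
        samples.append((series["metric"], float(timestamp), float(raw_value)))
    return samples


# Hypothetical response body for a query such as `up`.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "node-exporter"},
                "value": [1700000000.0, "1"],
            }
        ],
    },
}
samples = vector_to_samples(sample_response)
```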
grafana: Grafana
Grafana is a multi-platform open source analytics and interactive visualization web application.
Grafana is included in materialize-monitoring as its main dashboarding and visualization tool.
Grafana is typically deployed as a Deployment, and backing it with a compatible database is recommended for durability and scalability.
We use grafana-operator to manage resources on a Grafana deployment.
grafana-operator: Grafana Operator
Grafana Operator is a Kubernetes Operator that manages Grafana instances and their resources (e.g., Dashboards, Datasources, etc.) as Kubernetes Custom Resources.
The operator itself is just a simple Kubernetes Deployment named grafana-operator that watches for Grafana Custom Resources and applies them to the Grafana instance.
It manages these kinds of resources:
- A Grafana defines how to set up a Grafana instance or connect to an existing Grafana instance.
- A GrafanaManifest defines a k8s-style (12+) Grafana Dashboard that can be applied to the Grafana instance.
- A GrafanaDashboard defines an old-style (<12) Grafana Dashboard that can be applied to the Grafana instance.
- A GrafanaDatasource defines a Grafana Datasource that can be applied to the Grafana instance. We typically configure a datasource for Thanos and Loki.
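For illustration, a GrafanaDatasource custom resource can be sketched as a plain dictionary. This assumes the grafana-operator v5 API group (grafana.integreatly.org/v1beta1); the resource names, instance labels, and URLs are hypothetical placeholders and do not reflect the values materialize-monitoring actually ships.

```python
import json


def grafana_datasource(name, ds_type, url, instance_labels):
    """Return a GrafanaDatasource custom resource as a plain dict."""
    return {
        "apiVersion": "grafana.integreatly.org/v1beta1",
        "kind": "GrafanaDatasource",
        "metadata": {"name": name},
        "spec": {
            # Selects which Grafana instances receive this datasource.
            "instanceSelector": {"matchLabels": dict(instance_labels)},
            "datasource": {
                "name": name,
                "type": ds_type,
                "access": "proxy",
                "url": url,
            },
        },
    }


labels = {"dashboards": "grafana"}  # hypothetical instance label
resources = [
    # Thanos Query serves the Prometheus API, so it uses the prometheus type.
    grafana_datasource("thanos", "prometheus", "http://thanos-query.example.svc:9090", labels),
    grafana_datasource("loki", "loki", "http://loki-read.example.svc:3100", labels),
]
manifest = json.dumps(resources, indent=2)  # JSON is also valid YAML
```

The operator reconciles resources of this kind against every Grafana instance whose labels match the instanceSelector.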
alertmanager: Prometheus Alertmanager
Prometheus Alertmanager is a tool that handles alerts sent by Prometheus and other monitoring systems.
TODO: determine architecture and integration of Alertmanager in materialize-monitoring.