Observability Internals¶
This page covers magic's observability design. For setup and usage, see Observability.
Overview¶
This document defines a new observability module for the magic library that provides:
- Distributed tracing via OpenTelemetry (OTEL)
- Metrics via one of two modes:
  - Prometheus /metrics endpoint (default)
  - OTEL metrics over OTLP
- Automatic instrumentation of core magic packages:
  - HTTP / chi router
  - storage (operations that pass a context.Context)
  - pubsub (publish-only in v1)
- Custom metrics support for service authors
- Automatic trace/span correlation on slog *Context log calls (zero-touch once logger.Init runs)
- A backend-neutral metrics abstraction so services and magic packages do not need to care whether metrics are exported via Prometheus, OTEL, or both
The goal is to provide a near-zero-touch observability stack for services built on magic:
- initialize once
- add one chi middleware
- optionally expose /metrics
- automatically get tracing and metrics across HTTP, storage (contextual), and pubsub publish
- easily define additional business metrics
This design assumes most consumers of magic use chi as their HTTP router and want a simple way to enable tracing and metrics with minimal service-level code changes.
Compatibility Constraint¶
Breaking changes to existing magic packages are off the table. The design therefore:
- Preserves every existing StorageAdapter and Publisher method signature.
- Adds new context-aware methods on extension interfaces: ContextualStorageAdapter for storage, and ContextualPublisher (which adds PublishContext) for pubsub, following the database/sql …Context naming convention.
- Falls back gracefully when an adapter has not yet been migrated to the contextual interface.
Goals¶
- Provide easy observability enablement for services built on magic
- Support distributed tracing end-to-end across:
  - HTTP requests
  - storage operations (when called through ContextualStorageAdapter)
  - pubsub publish operations
- Support metrics for:
  - Go runtime
  - process
  - HTTP requests
  - storage operations
  - pubsub publish operations
  - custom business metrics
- Provide automatic instrumentation inside magic packages so consumers get value with minimal code changes
- Support Prometheus scrape and OTLP push as interchangeable metrics export modes without changing instrumentation code
- Automatically correlate application logs with the active trace/span by wrapping the slog handler in the logger package
- Provide first-class support for unit testing instrumented code
Non-Goals (v1)¶
- OTEL logs integration
- PubSub consume / process / ack / nack instrumentation (no Consumer interface exists in the repo yet; this is deferred to a follow-up design)
- Auto-instrumentation of arbitrary third-party libraries not wrapped by magic
- Built-in dashboards or Grafana assets
- Advanced sampling controls beyond parent-based ratio sampling
- Tenant-level or user-level metric labels by default
- Automatic instrumentation of application business code
Design Principles¶
1. Near-zero-touch for shared infrastructure¶
If a service uses magic packages like storage and pubsub, those packages should emit traces and metrics automatically once observability is enabled — provided callers use the contextual methods.
2. Explicit bootstrap, implicit package instrumentation¶
A service must explicitly initialize observability. Once initialized, magic packages instrument themselves automatically through package-level telemetry hooks.
3. Backend-neutral metrics instrumentation¶
Metrics instrumentation code should not depend on Prometheus or OTEL implementation details. Services and magic packages use a stable internal metrics abstraction. Tracing uses the OTEL API directly — there is only one viable tracing backend and wrapping OTEL's tracer adds surface without benefit.
4. Safe metric cardinality¶
Metric labels must be intentionally constrained. Route labels must use chi route patterns, not raw URL paths. Custom metrics require declared label keys, and runtime labels outside the declared set are rejected by default.
5. No hidden behavior change for non-migrated code¶
Services or adapters that have not adopted the contextual APIs must continue to work unchanged. They lose tracing (no parent context) but keep metrics wherever possible, and the system warns once about the missing coverage.
6. Keep the service developer experience simple¶
For most services, the happy path should look like this:
obs, err := observability.Init(ctx, cfg)
if err != nil {
log.Fatal(err)
}
defer obs.Shutdown(ctx)
r := chi.NewRouter()
r.Use(middlewares.Observability(obs))
r.Handle("/metrics", obs.MetricsHandler())
High-Level Design¶
The observability module consists of:
- A bootstrap package, observability, responsible for initializing tracing and metrics backends
- A neutral package, telemetry, that hosts the abstractions shared by magic packages
- Package-level instrumentation in magic core packages
- Configurable export backends for metrics
Instrumentation happens at three levels:
- HTTP level
  - request tracing
  - request metrics
  - propagation of trace context into handlers
- Shared infrastructure level
  - storage tracing (contextual adapters) and metrics (all adapters)
  - pubsub publish tracing and metrics (contextual publishers)
- Application business level
  - custom counters, histograms, gauges, and up-down counters defined by service authors
This gives consumers infrastructure telemetry by default and business telemetry when needed.
Package Layering & Imports¶
To avoid an import cycle between observability and instrumented packages (storage, pubsub), and to keep magic core packages free of Prometheus and OTEL dependencies, the design introduces a neutral telemetry package.
Dependency Direction¶
- The telemetry package is exported as github.com/<org>/magic/telemetry (not internal/), so third-party storage/pubsub adapters built outside this repo can implement ContextualStorageAdapter and emit through the same pipeline as in-repo adapters. Its API is part of magic's stability contract; new capabilities are added via extension interfaces rather than breaking changes.
- telemetry defines the interfaces (MetricsBackend, Counter, Histogram, Gauge, UpDownCounter, MetricDefinition, etc.) and a package-level Global() accessor.
- storage and pubsub import telemetry and use telemetry.Global() for their instrumentation.
- observability imports telemetry, installs concrete implementations via telemetry.SetGlobal(...) during Init, and owns the Prometheus/OTEL backend code.
Dependency Discipline¶
To keep go.mod lean for services that use storage/pubsub/logger without enabling observability, the following rules apply to magic core packages (storage, pubsub, logger, and any future instrumented package):
- Metrics must go through telemetry. Core packages may not import prometheus/client_golang, go.opentelemetry.io/otel/sdk/metric, or any OTEL metric exporter package.
- Tracing uses the OTEL trace API directly, not the SDK. Core packages may import only:
  - go.opentelemetry.io/otel/trace: the interface-level trace API (used by storage, pubsub, and logger for span-context extraction)
  - go.opentelemetry.io/otel/semconv/...: semantic convention constants
  - go.opentelemetry.io/otel/propagation: only where a package needs to inject/extract context (currently pubsub)

  These packages are small, interface-heavy, and pull in no exporters or SDK.
- Everything heavy lives in observability. The OTEL SDK (sdk/trace, sdk/metric), all exporters (OTLP, Prom), prometheus/client_golang, and sdk/resource are imported only by observability. A service that never calls observability.Init pays only the cost of the trace API (tens of KB of interface code) when it uses storage, pubsub, or logger.
This is enforced by a CI check (go list -deps ./storage/... ./pubsub/... ./logger/... ./telemetry/... must not contain any of the forbidden import paths).
Why a Neutral Package¶
- Removes the circular dependency between observability and storage/pubsub.
- Keeps the core packages' dependency graph lean: consumers who never enable observability don't pull in the OTEL SDK or Prometheus.
- Allows the telemetry interfaces to evolve independently of the bootstrap package, and lets third-party adapters implement them without importing observability.
Global vs. Scoped Telemetry¶
telemetry.Global() returns a process-wide default that is a no-op until observability.Init(...) replaces it. This matches the doc's bootstrap-once-use-everywhere model.
Known limitations, documented for users:
- Tests that need isolated telemetry must use telemetry.WithContext(ctx, t) to scope a Telemetry to a context, or use the test harness in observability/obstest, which resets global state between tests.
- Multiple calls to observability.Init replace the global. This is intentional but must be done carefully; concurrent Init calls are not supported.
- Adapters read the global on each operation, not at construction. Services that construct a storage adapter before calling Init still get instrumentation once Init runs.
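The bootstrap-once model and the "read the global on each operation" rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: the single-instrument Telemetry, the recordingCounter backend, and the use of atomic.Pointer are all hypothetical simplifications chosen so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is a pared-down stand-in for the telemetry.Counter interface.
type Counter interface {
	Add(value float64)
}

// noopCounter discards every observation.
type noopCounter struct{}

func (noopCounter) Add(float64) {}

// recordingCounter is a hypothetical backend used only for this demonstration.
type recordingCounter struct{ total float64 }

func (c *recordingCounter) Add(v float64) { c.total += v }

// Telemetry is reduced to one instrument for illustration.
type Telemetry struct {
	Requests Counter
}

var (
	noop   = &Telemetry{Requests: noopCounter{}}
	global atomic.Pointer[Telemetry]
)

// Global returns the installed Telemetry, or the no-op default before Init.
func Global() *Telemetry {
	if t := global.Load(); t != nil {
		return t
	}
	return noop
}

// SetGlobal installs t process-wide; passing nil restores the no-op default.
func SetGlobal(t *Telemetry) {
	global.Store(t)
}

// adapter reads Global() on every operation instead of caching it at construction.
type adapter struct{}

func (adapter) Get() { Global().Requests.Add(1) }

func main() {
	a := adapter{} // constructed before Init
	a.Get()        // lands on the no-op backend

	rec := &recordingCounter{}
	SetGlobal(&Telemetry{Requests: rec}) // roughly what Init would do
	a.Get()
	fmt.Println("recorded after Init:", rec.total)

	SetGlobal(nil) // roughly what Shutdown would do
	a.Get()
	fmt.Println("recorded after Shutdown:", rec.total)
}
```

Because the adapter never stores the Telemetry it saw at construction, the order in which a service builds adapters and calls Init stops mattering.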
Proposed Package Structure¶
telemetry/
    telemetry.go                // Telemetry struct + Global/SetGlobal
    tracer.go                   // re-exports or aliases OTEL trace types used by magic
    metrics.go                  // MetricsBackend, Counter, Histogram, Gauge, UpDownCounter, MetricDefinition
    labels.go                   // Label, Labels helper
    noop.go                     // no-op implementations (the zero-value default)
    context.go                  // WithContext / FromContext for scoped telemetry
observability/
    config.go
    init.go
    shutdown.go
    tracing.go
    metrics_backend_prom.go
    metrics_backend_otel.go
    chi.go
    response_writer.go
    custom_metrics.go
    defaults.go
    logger.go                   // LoggerFromContext escape-hatch helper for non-slog loggers
    obstest/
        observer.go             // NewTestObserver + assertions
storage/
    telemetry.go                // ContextualStorageAdapter interface + instrumented wrapper
    instrumented_adapter.go
pubsub/
    telemetry.go                // PublishContext extension + instrumented wrapper
    instrumented_publisher.go
logger/
    trace_handler.go            // slog handler wrap that auto-injects trace_id/span_id
Potential future additions (explicitly out of v1 scope):
pubsub/
    consumer.go                 // Consumer/Subscriber interface (future design)
    instrumented_consumer.go
health/
    telemetry.go
Storage Context Migration¶
Problem¶
The existing storage.StorageAdapter interface does not take context.Context. Without context, the instrumented wrapper cannot extract the parent span and storage operations would be orphan root spans, defeating end-to-end tracing.
Solution: ContextualStorageAdapter Extension Interface¶
Add a new extension interface in storage/telemetry.go. Every method on StorageAdapter that performs I/O gets a …Context sibling.
package storage

import "context"

type ContextualStorageAdapter interface {
    StorageAdapter

    ExecuteContext(ctx context.Context, statement string) error
    PingContext(ctx context.Context) error
    CreateContext(ctx context.Context, item any, params ...map[string]any) error
    GetContext(ctx context.Context, dest any, filter map[string]any, params ...map[string]any) error
    UpdateContext(ctx context.Context, item any, filter map[string]any, params ...map[string]any) error
    DeleteContext(ctx context.Context, item any, filter map[string]any, params ...map[string]any) error
    ListContext(ctx context.Context, dest any, sortKey string, filter map[string]any, limit int, cursor string, params ...map[string]any) (string, error)
    SearchContext(ctx context.Context, dest any, sortKey string, query string, limit int, cursor string, params ...map[string]any) (string, error)
    CountContext(ctx context.Context, dest any, filter map[string]any, params ...map[string]any) (int64, error)
    QueryContext(ctx context.Context, dest any, statement string, limit int, cursor string, params ...map[string]any) (string, error)
}
Schema/migration methods (CreateSchema, CreateMigrationTable, UpdateMigrationTable, GetLatestMigration) deliberately do not gain …Context variants. They are one-shot startup operations that run before any request-scoped context exists, they are not part of any distributed trace, and their failures already surface as fatal startup errors. The instrumented wrapper passes them straight through with no span and no metric — adding context or instrumentation would be pure surface-area churn without operational value.
Delegation Pattern¶
Every adapter implements the Context variants as the primary methods. The non-Context variants delegate with context.Background():
func (a *sqlAdapter) Get(dest any, filter map[string]any, params ...map[string]any) error {
    return a.GetContext(context.Background(), dest, filter, params...)
}

func (a *sqlAdapter) GetContext(ctx context.Context, dest any, filter map[string]any, params ...map[string]any) error {
    // real implementation lives here
}
This prevents drift: the non-ctx variants have no logic of their own.
Instrumented Wrapper Capability Check¶
StorageAdapterFactory.GetInstance(...) returns an instrumented wrapper when observability is active. The wrapper checks whether the underlying adapter implements ContextualStorageAdapter:
type instrumentedAdapter struct {
    inner         StorageAdapter
    ctxInner      ContextualStorageAdapter // nil if inner is legacy
    telemetry     telemetry.Telemetry
    providerLabel string
}

func wrap(inner StorageAdapter, t telemetry.Telemetry) StorageAdapter {
    ctxInner, _ := inner.(ContextualStorageAdapter)
    if ctxInner == nil {
        telemetry.WarnOnce("storage adapter %T does not implement ContextualStorageAdapter; traces will not be linked", inner)
    }
    return &instrumentedAdapter{inner: inner, ctxInner: ctxInner, telemetry: t, ...}
}
Fallback Behavior for Legacy Adapters (metrics-only)¶
When a caller uses the non-ctx method or the adapter is not contextual:
- Spans are skipped. No orphan root spans are created. This avoids polluting trace UIs with unparented storage spans.
- Metrics are still recorded. magic_storage_operations_total, magic_storage_operation_duration_seconds, and magic_storage_operation_errors_total are emitted normally.
- A warn-once log is issued at the first operation against a non-contextual adapter, naming the adapter type and recommending migration.
This preserves full metric coverage across the legacy path while keeping the trace UI clean.
Migration Plan¶
All in-repo adapters must implement ContextualStorageAdapter during Phase 2 of the implementation plan:
- storage/sql.go
- storage/dynamodb.go
- storage/cosmosdb.go
- storage/memory.go
- Any Cassandra adapter that lands before Phase 2 completes
No adapter in the repository will remain legacy after Phase 2 ships. The legacy path exists for third-party adapters outside the magic repo and for binary-compatibility with pre-observability versions of magic.
PubSub: Same Pattern¶
The pubsub.Publisher interface gets a PublishContext method via a sibling ContextualPublisher interface:
package pubsub

import "context"

type ContextualPublisher interface {
    Publisher

    PublishContext(ctx context.Context, topic, message string, params map[string]any) error
}
Publish delegates to PublishContext with context.Background(). The instrumented wrapper applies the same capability check and fallback behavior (metrics only, no span, warn-once).
Public API¶
Package¶
Config¶
type Config struct {
    ServiceName    string
    ServiceVersion string
    Environment    string
    EnableTracing  bool

    // Tracing
    TracesOTLPEndpoint string
    TracesOTLPInsecure bool

    // SamplingRatio is a pointer so that nil means "default" (1.0 when
    // EnableTracing is true). A zero value (0.0) explicitly disables sampling
    // even when tracing is enabled. Ignored when Sampler is non-nil.
    SamplingRatio *float64

    // Sampler is an escape hatch for advanced sampling strategies. When non-nil
    // it overrides SamplingRatio and is wrapped with a parent-based sampler
    // internally, so downstream sampling decisions are still respected.
    // Tail sampling and per-operation sampling belong in the OTEL Collector,
    // not here.
    Sampler sdktrace.Sampler

    // Propagator defaults to W3C tracecontext + baggage via otel.GetTextMapPropagator().
    // Setting Propagator overrides the global default for magic-initiated work.
    Propagator propagation.TextMapPropagator

    // Metrics
    MetricsMode      MetricsMode
    MetricsNamespace string // applied to custom metrics only; built-in metric names are not prefixed

    // AllowUndeclaredLabels inverts the sense of the previous StrictMetrics field
    // so the zero value (false) gives strict behavior, which is the safer default.
    AllowUndeclaredLabels bool
    EnableRuntimeMetrics  bool
    EnableProcessMetrics  bool
}
Metrics Mode¶
type MetricsMode string

const (
    MetricsModePrometheus MetricsMode = "prometheus"
    MetricsModeOTLP       MetricsMode = "otlp"
)
Observer¶
Initialization¶
Responsibilities:
- validate config
- initialize tracing provider if enabled
- configure the global propagator
- initialize the metrics backend if enabled
- register default runtime/process metrics if applicable
- register built-in HTTP, storage, and pubsub instruments
- install the telemetry global so storage and pubsub package hooks become active
- return an Observer
Middleware¶
Responsibilities:
- extract incoming trace context via the configured propagator
- start a server span when tracing is enabled
- wrap the response writer to capture status code
- record metrics and finalize the span in a deferred block after next.ServeHTTP, so the chi route pattern is populated
- on panic in the inner handler, record the panic on the span, re-raise, and still record metrics (see panic policy below)
Metrics Handler¶
Responsibilities:
- in prometheus mode, expose a standard /metrics endpoint backed by the Prometheus registry
- in otlp mode, return an http.Handler that serves 404 with a short JSON body ({"error":"metrics are exported via OTLP"}); never return nil. This makes r.Handle("/metrics", obs.MetricsHandler()) safe in either mode.
Shutdown¶
Responsibilities:
- flush exporters
- stop tracer and meter providers cleanly
- execute any registered shutdown callbacks
- reset telemetry.Global() to the no-op implementation
Default Behavior¶
Defaults are NOT applied by Init; Config.validate() does not mutate the config. DefaultConfig() sets EnableRuntimeMetrics = true, EnableProcessMetrics = true, and MetricsPushInterval = 30s. Compose your Config from DefaultConfig() and override as needed; every other field keeps its Go zero value. The resulting defaults are interpreted as:
- EnableTracing = false
- SamplingRatio = nil → treated as 1.0 when EnableTracing is true
- Sampler = nil → falls back to parent-based(ratio(SamplingRatio)); when set, wraps the caller's sampler in parent-based and ignores SamplingRatio
- MetricsMode: required; no default. DefaultConfig() does not set it and Init rejects an empty or unknown mode.
- EnableRuntimeMetrics = true (set by DefaultConfig())
- EnableProcessMetrics = true (set by DefaultConfig())
- AllowUndeclaredLabels = false (strict)
- Propagator = otel.GetTextMapPropagator() (W3C tracecontext + baggage)
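The reason SamplingRatio is a pointer is easiest to see in code. The helper below is a hypothetical sketch (effectiveRatio is not a name from this design) that resolves the ratio the way the rules above describe, ignoring the Sampler override.

```go
package main

import "fmt"

// effectiveRatio is a hypothetical helper. The boolean reports whether
// sampling applies at all; it is false when tracing is disabled.
func effectiveRatio(enableTracing bool, ratio *float64) (float64, bool) {
	if !enableTracing {
		return 0, false
	}
	if ratio == nil {
		return 1.0, true // unset: sample everything
	}
	return *ratio, true // explicit value, including an explicit 0.0
}

func main() {
	zero, tenth := 0.0, 0.1
	for _, r := range []*float64{nil, &zero, &tenth} {
		v, ok := effectiveRatio(true, r)
		fmt.Println(v, ok)
	}
}
```

With a plain float64 field, an unset ratio and an explicit 0.0 would be indistinguishable, so enabling tracing without touching the ratio would silently sample nothing.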
Reasoning:
- metrics are generally low-risk and high-value, so they default to enabled
- tracing usually depends on collector/exporter availability, so it is opt-in unless configured
- /metrics is the standard default path
- runtime and process metrics are enabled by default in scrape-based modes
- strict label validation is the safer default and prevents accidental cardinality blow-ups
Internal Telemetry Abstraction¶
Telemetry¶
Defined in the neutral telemetry package.
package telemetry

import "go.opentelemetry.io/otel/trace"

type Telemetry struct {
    Tracer  trace.Tracer   // use OTEL directly; nil means no-op
    Metrics MetricsBackend // nil means no-op
}

func Global() *Telemetry
func SetGlobal(t *Telemetry)
Tracing¶
Use OTEL trace APIs directly. Do not define custom Tracer, Span, or SpanStartOption interfaces. Callers in magic packages obtain a tracer as:
tracer := telemetry.Global().Tracer
ctx, span := tracer.Start(ctx, "storage.get", trace.WithAttributes(...))
defer span.End()
Rationale: there is only one viable tracing backend (OTEL). Wrapping the OTEL Tracer / Span types means either mirroring their full surface (events, links, baggage, typed attributes, RecordError) or forcing escape hatches. The marginal benefit does not justify the surface area.
Metrics Backend¶
type MetricsBackend interface {
    Counter(def MetricDefinition) (Counter, error)
    Histogram(def MetricDefinition) (Histogram, error)
    Gauge(def MetricDefinition) (Gauge, error)
    UpDownCounter(def MetricDefinition) (UpDownCounter, error)
}

type Counter interface {
    Add(value float64, labels ...Label)
}

type Histogram interface {
    Observe(value float64, labels ...Label)
}

// Gauge represents an instantaneous value that is observed, not accumulated.
// Backed by a Prometheus Gauge or an OTEL async Gauge (Float64ObservableGauge).
type Gauge interface {
    Set(value float64, labels ...Label)
}

// UpDownCounter represents an additive value that can go up or down.
// Backed by a Prometheus Gauge or an OTEL Float64UpDownCounter.
type UpDownCounter interface {
    Add(value float64, labels ...Label)
}
Why Gauge and UpDownCounter Are Split¶
In OTEL, Set maps to an async gauge (observed via callback) and Add maps to an UpDownCounter. Offering both operations on one type silently backs it by two unrelated instruments or makes one of them a no-op. Splitting keeps the mapping one-to-one and matches OTEL semantics, while Prometheus (which has a single Gauge supporting both) trivially implements both interfaces on top of one underlying prometheus.Gauge.
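The Prometheus side of that argument can be sketched as one concrete type satisfying both interfaces. The settableValue type below is a hypothetical stand-in for a Prometheus-style gauge, and labels are omitted for brevity; it is an illustration of the shape, not the backend's real code.

```go
package main

import (
	"fmt"
	"sync"
)

// Gauge and UpDownCounter mirror the split interfaces, without labels.
type Gauge interface {
	Set(value float64)
}

type UpDownCounter interface {
	Add(value float64)
}

// settableValue is a hypothetical stand-in for a Prometheus-style gauge,
// which supports both Set and Add on the same underlying value.
type settableValue struct {
	mu    sync.Mutex
	value float64
}

func (s *settableValue) Set(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.value = v
}

func (s *settableValue) Add(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.value += v
}

func (s *settableValue) Value() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.value
}

// Compile-time checks that one type satisfies both interfaces.
var (
	_ Gauge         = (*settableValue)(nil)
	_ UpDownCounter = (*settableValue)(nil)
)

func main() {
	conns := &settableValue{}
	var active UpDownCounter = conns // callers only see Add
	active.Add(1)
	active.Add(1)
	active.Add(-1)
	fmt.Println("active connections:", conns.Value())

	temp := &settableValue{}
	var reading Gauge = temp // callers only see Set
	reading.Set(21.5)
	reading.Set(19.0)
	fmt.Println("last reading:", temp.Value())
}
```

Handing callers the narrower interface is what keeps the OTEL mapping one-to-one: code holding a Gauge cannot call Add, so the backend never has to reconcile the two operations on a single instrument.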
Implementations¶
- prometheusMetricsBackend: uses prometheus/client_golang, served via the /metrics handler
- otelMetricsBackend: uses the OTEL metrics SDK with an OTLP exporter, pushed to the configured collector
HTTP Instrumentation¶
Goal¶
Provide request tracing and metrics for chi-based services with one middleware.
Middleware Ordering¶
The chi route pattern is populated during routing, which happens during next.ServeHTTP. A naive middleware that records metrics before calling next will see an empty route pattern and fall back to "unmatched" for every request.
The middleware therefore records on the trailing edge: it starts the span before next.ServeHTTP, then records metrics and finalizes the span in a deferred block after the handler returns. That trailing-edge record is what guarantees the chi route pattern is available.
Panic Policy¶
The middleware does not itself recover panics. If a downstream handler panics and no upstream middleware recovers:
- the deferred block still runs: it calls recover(), sets an error status on the span, calls RecordError with the recovered panic value, and emits metrics with status "500"
- the panic is then re-raised via panic(rec) so upstream middleware or the default net/http recovery can handle it
This means middlewares.Observability(obs) is safe to place either before or after a user-supplied recover middleware. It never swallows panics.
HTTP Tracing¶
Each incoming request creates a server span.
Span name format:
Attributes use OTEL semantic conventions (semconv package):
- http.request.method
- http.route
- http.response.status_code
- server.address
- url.path
Behavior:
- extract incoming trace context via the configured propagator (default W3C tracecontext + baggage)
- start a server span
- inject the updated context into the downstream request
- mark span status as error for 5xx responses
- record panics as span errors and re-raise
HTTP Metrics¶
Built-in HTTP metrics:
- http_requests_total (counter)
- http_request_duration_seconds (histogram), buckets: {0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
Labels:
- method
- route
- status_code
The HTTP histogram bucket set adds a 0.001 bucket below the Prometheus default, whose lowest bucket is 0.005, so p50/p90 for well-tuned services don't all land in the first 5 ms bucket.
These names are not prefixed by MetricsNamespace. Built-in metrics use stable, standard names so that off-the-shelf dashboards and alert rules work without modification.
Cardinality Rule¶
Route labels use the chi route pattern. Raw paths are never used.
Correct:
Incorrect:
Fallback when the route pattern is empty (unmatched routes, 404s, direct handlers outside chi): the route label is set to "unmatched".
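The cardinality rule reduces to a small function. The sketch below is illustrative: routeLabel is a hypothetical name, the /users/{id} pattern is an invented example, and the "unmatched" fallback value is the one named under Middleware Ordering. In a chi service the pattern argument would typically come from chi.RouteContext(r.Context()).RoutePattern(), read after the handler has run.

```go
package main

import "fmt"

// routeLabel returns the value for the route label. It takes the matched
// route pattern, never the raw request path.
func routeLabel(pattern string) string {
	if pattern == "" {
		return "unmatched" // 404s, unrouted requests, handlers outside chi
	}
	return pattern
}

func main() {
	// Three different raw paths that all match the same illustrative pattern
	// collapse into one label value.
	for _, rawPath := range []string{"/users/1", "/users/2", "/users/3"} {
		fmt.Println(rawPath, "->", routeLabel("/users/{id}"))
	}
	// A request that matched no route has an empty pattern.
	fmt.Println("/no/such/page ->", routeLabel(""))
}
```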
Response Writer Wrapper¶
Minimal wrapper to capture status:
type responseWriter struct {
    http.ResponseWriter
    status      int
    wroteHeader bool
}

func (rw *responseWriter) WriteHeader(code int) {
    if rw.wroteHeader {
        return
    }
    rw.status = code
    rw.wroteHeader = true
    rw.ResponseWriter.WriteHeader(code)
}

func (rw *responseWriter) Write(b []byte) (int, error) {
    if !rw.wroteHeader {
        rw.status = http.StatusOK
        rw.wroteHeader = true
    }
    return rw.ResponseWriter.Write(b)
}
Must also expose Flush, Hijack, and Push via interface assertions on the inner writer, to preserve chi/SSE/HTTP-2 behavior.
Storage Instrumentation¶
Storage Goal¶
Automatically instrument magic/storage so every service using storage gets tracing (on contextual calls) and metrics (on all calls) without adding code beyond the observability.Init call.
Storage Instrumentation Strategy¶
Instrumentation is implemented by wrapping the storage adapter internally in StorageAdapterFactory.GetInstance(...). The external storage API is unchanged for legacy callers. Callers that want tracing use the Context-suffixed methods from the ContextualStorageAdapter interface.
Storage Tracing¶
Each contextual storage operation creates a child span on the caller's context.
Span names:
- storage.create
- storage.get
- storage.list
- storage.search
- storage.update
- storage.delete
- storage.count
- storage.query
- storage.execute
- storage.ping

Schema/migration methods (CreateSchema, CreateMigrationTable, UpdateMigrationTable, GetLatestMigration) are not instrumented; see "Storage Context Migration".
Attributes:
- db.system: e.g. "postgresql", "dynamodb" (OTEL semconv)
- magic.storage.provider: the StorageProviders constant
- magic.storage.operation: the span name minus the storage. prefix
- magic.storage.model: the concrete type name of item/dest, reflected at the call site
- magic.storage.limit: for list/search/query only
- magic.storage.sort_field: for list/search only
Additional backend-specific attributes may be added if they are stable and low-cardinality.
Errors must:
- be recorded on the span via span.RecordError(err)
- set error status on the span via span.SetStatus(codes.Error, err.Error())
Storage Metrics¶
Built-in storage metrics:
- magic_storage_operations_total (counter)
- magic_storage_operation_duration_seconds (histogram), buckets: {0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}
- magic_storage_operation_errors_total (counter)
Labels:
- provider
- operation
- status: "ok" or "error"
Storage duration uses the same sub-10 ms extended low-end buckets as HTTP — point reads and cache-backed operations frequently complete in single-digit milliseconds — but omits the top 10 bucket that HTTP carries.
Examples:
magic_storage_operations_total{provider="dynamodb",operation="get",status="ok"}
magic_storage_operation_duration_seconds{provider="postgresql",operation="search"}
magic_storage_operation_errors_total{provider="mysql",operation="update",status="error"}
Legacy Adapter Fallback¶
When GetInstance wraps an adapter that does not implement ContextualStorageAdapter, or when the caller uses a non-context method on any adapter:
- No span is created (avoids orphan root spans).
- Metrics are still recorded, derived from wall-clock timing and the operation name.
- A telemetry.WarnOnce is issued at first use, naming the adapter type and linking to the migration docs.
Storage Design Constraints¶
- no API changes to the existing StorageAdapter interface
- new ContextualStorageAdapter is the extension point
- metric labels remain small and stable
- no labeling by raw query, record key, tenant ID, or user ID
PubSub Instrumentation (Publish-Only in v1)¶
PubSub Goal¶
Automatically instrument publish flows so that services emitting events participate in distributed tracing and emit useful metrics.
Consumer / subscribe / process / ack / nack instrumentation is explicitly deferred to a follow-up design once a Consumer interface exists in the pubsub package. The shape of this work is captured in Future Enhancements.
PubSub Instrumentation Strategy¶
Introduce a ContextualPublisher extension interface:
package pubsub

import "context"

type ContextualPublisher interface {
    Publisher

    PublishContext(ctx context.Context, topic, message string, params map[string]any) error
}
Publish delegates to PublishContext(context.Background(), ...). Every in-repo publisher (sns.go) implements both and has its real logic in PublishContext.
PublisherFactory.GetInstance(...) returns an instrumented wrapper when observability is active. The wrapper applies the same capability check as the storage wrapper and the same fallback behavior.
PubSub Tracing¶
Span name: pubsub.publish.
Behavior:
- start a client span on the caller's context
- inject trace context into outbound message metadata via the configured propagator
- record message publish errors on the span
Attributes (OTEL semconv):
- messaging.system: e.g. "aws_sns"
- messaging.destination.name: the topic
- messaging.operation: "publish"
- magic.pubsub.provider: the PublisherType constant

Optional attributes when safe:
- messaging.message.body.size
Context Propagation: SNS Specifics¶
SNS message attributes are limited to 10 per message. The propagator (default W3C) adds traceparent and, when present, tracestate and baggage. This consumes up to 3 attributes.
- The wrapper reads and writes back the shared params-map key pubsub.MessageAttributesParamKey (string value "MessageAttributes"), whose value is a map[string]string. In-repo publishers (currently just SNS) translate that map into their native per-system representation.
- User-supplied keys are authoritative and are never overwritten.
- traceparent is protected: it is never dropped in favor of another propagator key. When propagator keys plus user keys would exceed 10, the wrapper sheds optional propagator keys in the order baggage → tracestate, emitting a warn-once log event naming the dropped header.
- traceparent is dropped only in the degenerate case where caller-supplied attributes alone already fill all 10 slots, because user keys are authoritative and are never displaced. That case is a caller misuse and is logged warn-once.
- Teams using B3 or Jaeger propagation must set cfg.Propagator explicitly; the wrapper defers to whatever propagator is configured.
PubSub Metrics¶
Built-in publish metrics:
- magic_pubsub_messages_total (counter)
- magic_pubsub_publish_duration_seconds (histogram), buckets: {0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} (the Prometheus default)
- magic_pubsub_errors_total (counter)
Labels:
- provider
- destination
- operation: always "publish" in v1
- status: "ok" or "error"
Publish is network-bound and rarely faster than a few milliseconds, so the Prometheus default bucket set is sufficient.
Examples:
magic_pubsub_messages_total{provider="sns",destination="orders",operation="publish",status="ok"}
magic_pubsub_publish_duration_seconds{provider="sns",destination="orders",operation="publish"}
magic_pubsub_errors_total{provider="sns",destination="orders",operation="publish",status="error"}
Cardinality Note on destination¶
SNS destinations are topic ARNs. ARNs embed the AWS account ID, which is bounded per service but adds one label value per account. Teams operating with many accounts (multi-tenant) should be aware that destination label cardinality tracks the number of distinct topics across all accounts the service publishes to. The wrapper does not strip ARNs by default; if needed, this can be addressed in a future release with a normalization hook.
PubSub Design Constraints¶
- no API changes to the existing Publisher interface
- ContextualPublisher is the extension point
- context propagation is automatic when using PublishContext
- no labeling by message ID, tenant ID, or user ID by default
Custom Metrics¶
Custom Metrics Goal¶
Allow service authors to define business metrics using the same observability pipeline as built-in metrics.
Custom metrics must work in both supported metrics modes:
- Prometheus
- OTLP
Custom Metrics Design Principles¶
- One instrumentation API for consumers. Service authors should not need to import Prometheus or OTEL metric SDKs directly for normal usage.
- Backend-neutral metric definition. A custom counter, histogram, gauge, or up-down counter is defined once and emitted the same way regardless of export mode.
- Safe by default. Stable names, declared labels, reusable instruments, startup registration.
- Built-in and custom metrics share the pipeline. HTTP, storage, pubsub, runtime, and user-defined metrics all flow through the same backend abstraction.
Metric Kind¶
type MetricKind int
const (
KindCounter MetricKind = iota
KindHistogram
KindGauge
KindUpDownCounter
)
Metric Definition¶
type MetricDefinition struct {
Name string
Help string
Unit telemetry.Unit
Kind telemetry.MetricKind
Labels []string
Buckets []float64 // histogram only
}
Label¶
Usage:
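A minimal sketch of the label type and the telemetry.Labels helper, inferred from how both are used elsewhere on this page (telemetry.Label{Key: ..., Value: ...} and telemetry.Labels("k", "v", ...)). The handling of an odd number of arguments is an assumption, not documented behavior:

```go
package main

import "fmt"

// Label is a single key/value pair attached to an observation.
// The Key and Value field names follow the usage shown in this document.
type Label struct {
	Key   string
	Value string
}

// Labels builds a []Label from alternating key, value strings.
// Assumption: a trailing key without a value is paired with "".
func Labels(kv ...string) []Label {
	out := make([]Label, 0, (len(kv)+1)/2)
	for i := 0; i < len(kv); i += 2 {
		l := Label{Key: kv[i]}
		if i+1 < len(kv) {
			l.Value = kv[i+1]
		}
		out = append(out, l)
	}
	return out
}

func main() {
	fmt.Println(Labels("status", "success", "channel", "web"))
}
```

The helper form is shorter at call sites; the struct-literal form makes the key/value pairing explicit and cannot be mis-paired.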
Observer API¶
func (o *Observer) Counter(def telemetry.MetricDefinition) (telemetry.Counter, error)
func (o *Observer) Histogram(def telemetry.MetricDefinition) (telemetry.Histogram, error)
func (o *Observer) Gauge(def telemetry.MetricDefinition) (telemetry.Gauge, error)
func (o *Observer) UpDownCounter(def telemetry.MetricDefinition) (telemetry.UpDownCounter, error)
Example¶
ordersCreated, err := obs.Counter(telemetry.MetricDefinition{
Name: "orders_created_total",
Help: "Total number of orders created",
Kind: telemetry.KindCounter,
Labels: []string{"status", "channel"},
})
if err != nil {
log.Fatal(err)
}
checkoutLatency, err := obs.Histogram(telemetry.MetricDefinition{
Name: "checkout_duration_seconds",
Help: "Checkout duration in seconds",
Unit: telemetry.UnitSeconds,
Kind: telemetry.KindHistogram,
Labels: []string{"result"},
Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5},
})
if err != nil {
log.Fatal(err)
}
activeConnections, err := obs.UpDownCounter(telemetry.MetricDefinition{
Name: "active_connections",
Kind: telemetry.KindUpDownCounter,
Labels: []string{"protocol"},
})
if err != nil {
log.Fatal(err)
}
Recording:
ordersCreated.Add(1,
telemetry.Label{Key: "status", Value: "success"},
telemetry.Label{Key: "channel", Value: "web"},
)
checkoutLatency.Observe(duration.Seconds(),
telemetry.Label{Key: "result", Value: "success"},
)
Namespace Scope¶
Config.MetricsNamespace, when non-empty, is applied only to custom metrics registered through Observer.Counter/Histogram/Gauge/UpDownCounter. Built-in HTTP, storage, pubsub, Go runtime, and process metrics keep their canonical names so that shared dashboards remain portable.
Validation Rules¶
Metric registration validates:
Name¶
- non-empty
- matches ^[a-zA-Z_:][a-zA-Z0-9_:]*$
- must not collide with a built-in metric name
Labels¶
- label keys are declared up front in Labels
- duplicate keys within Labels are rejected at registration
- runtime labels must match declared keys exactly when AllowUndeclaredLabels is false (the default)
- label ordering at observation time is irrelevant; the backend matches keys by name before lookup
- label ordering at registration time is preserved verbatim; duplicate registrations must supply the same ordered Labels. This is required because the Prometheus backend stores labels positionally
Buckets¶
- only permitted when Kind == telemetry.KindHistogram
- defaults apply if omitted: custom histograms without explicit Buckets get the Prometheus default {0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
- built-in histograms (HTTP, storage, pubsub) ship with per-family bucket sets defined in their respective sections; these are not overridable in v1
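The name, label, and bucket rules above can be sketched as a single check. This is an illustration only: validate is a hypothetical helper, the struct is trimmed to the fields the rules touch, and the error messages are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

type MetricKind int

const (
	KindCounter MetricKind = iota
	KindHistogram
	KindGauge
	KindUpDownCounter
)

// MetricDefinition is reduced to the fields the validation rules inspect.
type MetricDefinition struct {
	Name    string
	Kind    MetricKind
	Labels  []string
	Buckets []float64
}

// nameRE is the pattern quoted in the Name rules.
var nameRE = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// validate is a hypothetical helper applying the rules listed above.
func validate(def MetricDefinition) error {
	// An empty name also fails the pattern, which needs at least one character.
	if !nameRE.MatchString(def.Name) {
		return fmt.Errorf("invalid metric name %q", def.Name)
	}
	seen := make(map[string]struct{}, len(def.Labels))
	for _, k := range def.Labels {
		if k == "" {
			return errors.New("empty label key")
		}
		if _, dup := seen[k]; dup {
			return fmt.Errorf("duplicate label key %q", k)
		}
		seen[k] = struct{}{}
	}
	if len(def.Buckets) > 0 && def.Kind != KindHistogram {
		return errors.New("buckets are only permitted on histograms")
	}
	return nil
}

func main() {
	fmt.Println(validate(MetricDefinition{Name: "orders_created_total", Kind: KindCounter, Labels: []string{"status", "channel"}}))
	fmt.Println(validate(MetricDefinition{Name: "9orders", Kind: KindCounter}))
}
```

The built-in name collision check is omitted here because it depends on the reserved-name list.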
Duplicate Registration¶
Two registrations are compatible when their canonical shape is equal:
- Name equal
- Kind equal
- Labels equal in order
- Unit equal
- for histograms, Buckets equal in order
Compatible re-registrations return the existing instrument. MetricDefinition.Help differences are tolerated and are not reflected in the already-registered instrument (first-writer-wins).
Incompatible re-registrations return an error.
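The canonical-shape comparison can be sketched as follows. The registry type and its register method are hypothetical stand-ins for whatever the backends do internally; Unit and Kind are simplified to plain types so the example is self-contained.

```go
package main

import (
	"fmt"
	"slices"
	"sync"
)

// MetricDefinition mirrors the fields that take part in the comparison.
type MetricDefinition struct {
	Name    string
	Help    string // deliberately ignored by compatible
	Unit    string
	Kind    int
	Labels  []string
	Buckets []float64
}

// compatible reports whether two definitions have the same canonical shape.
// Labels and Buckets are compared in order; Help is not compared.
func compatible(a, b MetricDefinition) bool {
	return a.Name == b.Name &&
		a.Kind == b.Kind &&
		a.Unit == b.Unit &&
		slices.Equal(a.Labels, b.Labels) &&
		slices.Equal(a.Buckets, b.Buckets)
}

// registry is a hypothetical in-memory store keyed by metric name.
type registry struct {
	mu   sync.Mutex
	defs map[string]MetricDefinition
}

// register stores the first definition for a name. A compatible repeat is
// reported as reused; an incompatible repeat is an error.
func (r *registry) register(def MetricDefinition) (reused bool, err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.defs == nil {
		r.defs = make(map[string]MetricDefinition)
	}
	prev, ok := r.defs[def.Name]
	if !ok {
		r.defs[def.Name] = def
		return false, nil
	}
	if !compatible(prev, def) {
		return false, fmt.Errorf("metric %q already registered with a different shape", def.Name)
	}
	return true, nil // first writer wins, including its Help text
}

func main() {
	r := &registry{}
	def := MetricDefinition{Name: "orders_created_total", Labels: []string{"status", "channel"}}
	fmt.Println(r.register(def))
	fmt.Println(r.register(def))
}
```

Note that swapping the order of Labels counts as a different shape, matching the positional-label constraint described under Validation Rules.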
Reserved Names¶
Custom registrations fail fast when the effective name (after MetricsNamespace application) would:
- exactly match any built-in metric name (http_*, magic_storage_*, magic_pubsub_*)
- start with the reserved go_ or process_ prefix owned by the runtime/process collectors
Use a namespace or rename the metric to avoid the collision.
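A sketch of the reserved-name check, applied to the effective name. checkReserved is a hypothetical helper, and the built-in set below lists only the metric names that appear on this page, so it may be incomplete.

```go
package main

import (
	"fmt"
	"strings"
)

// builtinNames holds the built-in metric names mentioned in this document.
var builtinNames = map[string]struct{}{
	"http_requests_total":                      {},
	"http_request_duration_seconds":            {},
	"magic_storage_operations_total":           {},
	"magic_storage_operation_duration_seconds": {},
	"magic_storage_operation_errors_total":     {},
	"magic_pubsub_messages_total":              {},
	"magic_pubsub_publish_duration_seconds":    {},
	"magic_pubsub_errors_total":                {},
}

// reservedPrefixes are owned by the runtime and process collectors.
var reservedPrefixes = []string{"go_", "process_"}

// checkReserved rejects an effective metric name that exactly matches a
// built-in name or starts with a reserved prefix.
func checkReserved(effectiveName string) error {
	if _, ok := builtinNames[effectiveName]; ok {
		return fmt.Errorf("%q collides with a built-in metric", effectiveName)
	}
	for _, p := range reservedPrefixes {
		if strings.HasPrefix(effectiveName, p) {
			return fmt.Errorf("%q uses the reserved prefix %q", effectiveName, p)
		}
	}
	return nil
}

func main() {
	for _, n := range []string{"orders_created_total", "http_requests_total", "go_custom_total"} {
		fmt.Println(n, checkReserved(n))
	}
}
```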
Best Practices¶
- register custom metrics at startup
- store metric handles and reuse them
- avoid registering metrics dynamically inside request handlers
- avoid high-cardinality label values
Good:
var ordersCreated telemetry.Counter
func initMetrics(obs *observability.Observer) error {
var err error
ordersCreated, err = obs.Counter(telemetry.MetricDefinition{
Name: "orders_created_total",
Kind: telemetry.KindCounter,
Labels: []string{"status"},
})
return err
}
// Hot-path observation site — no registration, no allocation beyond
// the variadic Label slice.
ordersCreated.Add(1, telemetry.Labels("status", "success")...)
Bad:
func handler(w http.ResponseWriter, r *http.Request) {
// re-registers on every request: allocates, validates,
// and in strict mode yields an error on the second call
// with a different shape.
c, _ := obs.Counter(telemetry.MetricDefinition{
Name: "orders_created_total", Kind: telemetry.KindCounter,
})
c.Add(1)
}
Testing Custom Metrics¶
Use obstest.NewTestObserver to install an in-memory backend and assert against observations without starting a Prometheus registry or OTLP collector:
func TestOrdersCreatedEmitsSuccessMetric(t *testing.T) {
obs := obstest.NewTestObserver(t)
// Register the same way production code would.
c, err := obs.Telemetry.Metrics.Counter(telemetry.MetricDefinition{
Name: "orders_created_total",
Kind: telemetry.KindCounter,
Labels: []string{"status"},
})
require.NoError(t, err)
c.Add(1, telemetry.Labels("status", "success")...)
obs.AssertCounter(t, "orders_created_total", 1,
telemetry.Label{Key: "status", Value: "success"})
}
The same harness records spans via obs.Spans.Ended(), which is how the storage and pubsub packages verify that their context-aware instrumentation produces the expected trace.
Concurrency¶
All Counter, Histogram, Gauge, and UpDownCounter implementations are safe for concurrent use. Registration (Observer.Counter / Observer.Histogram / …) is also safe for concurrent use and is idempotent for compatible shapes.
Metrics Export Modes¶
1. Prometheus Mode¶
Default. Simplest.
Behavior:
- tracing uses OTEL if enabled
- metrics are backed by native Prometheus collectors via prometheus/client_golang
- /metrics is exposed via MetricsHandler()
- Go runtime and process metrics are collected via the standard Prometheus collectors
Best for:
- teams already scraping Prometheus endpoints
- simplest adoption path
- lowest friction for magic consumers
2. OTLP Mode¶
Behavior:
- tracing uses OTEL and exports over OTLP
- metrics use OTEL meters and export over OTLP
- MetricsHandler() returns a 404 handler (never nil), so registering /metrics stays safe
Best for:
- teams with a central OTEL Collector pipeline
- environments where metrics are pushed, not scraped
- users who want OTEL-only for both tracing and metrics
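The never-nil handler contract can be illustrated with the standard library alone. metricsHandler below is a hypothetical stand-in for Observer.MetricsHandler, and the mode constants are redeclared locally for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type MetricsMode int

const (
	MetricsModePrometheus MetricsMode = iota
	MetricsModeOTLP
)

// metricsHandler returns the scrape handler in Prometheus mode and a
// 404 handler otherwise, so callers can always register the result.
func metricsHandler(mode MetricsMode, scrape http.Handler) http.Handler {
	if mode != MetricsModePrometheus || scrape == nil {
		return http.NotFoundHandler()
	}
	return scrape
}

// okHandler stands in for the real Prometheus scrape handler.
func okHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// demoStatus issues GET /metrics against the handler for the given mode
// and returns the response status code.
func demoStatus(mode MetricsMode) int {
	h := metricsHandler(mode, http.HandlerFunc(okHandler))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Code
}

func main() {
	fmt.Println("prometheus:", demoStatus(MetricsModePrometheus))
	fmt.Println("otlp:", demoStatus(MetricsModeOTLP))
}
```

Because the handler is never nil, router setup code does not need to branch on the metrics mode.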
Migration Between Modes¶
Teams moving between Prometheus and OTLP should do so at the collector, not in the library. The recommended path is:
- stand up the OTEL Collector with a Prometheus receiver scraping the existing /metrics endpoint
- have the Collector emit OTLP (or anything else) downstream
- once all services and dashboards consume the Collector's output, flip the library to MetricsModeOTLP
This avoids in-process dual-export and its double-counting, allocation, and bucket-alignment problems.
Runtime and Process Metrics¶
In scrape-based modes, the observability module registers:
- Go runtime metrics: collectors.NewGoCollector(collectors.WithGoCollectorRuntimeMetrics(collectors.MetricsAll))
- Process metrics: collectors.NewProcessCollector(collectors.ProcessCollectorOpts{})
Controlled by:
- EnableRuntimeMetrics
- EnableProcessMetrics
In OTLP-only mode, Go runtime metrics are emitted via go.opentelemetry.io/contrib/instrumentation/runtime.
These metrics are part of the near-zero-touch value proposition and require no consumer code.
Error Rate¶
The system does not define a separate stored metric called error_rate.
Error rate is derived from counters:
- http_requests_total{status_code=~"5.."}
- magic_storage_operation_errors_total
- magic_pubsub_errors_total
This keeps the metrics model simple and matches standard Prometheus / OTEL practice.
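As a worked illustration of the derivation, the error rate over a window is the increase of the error counter divided by the increase of the total counter over the same window. errorRate is a hypothetical helper; in practice this division is done in the query layer, not in the service.

```go
package main

import "fmt"

// errorRate returns the fraction of failed operations in a window, given
// the increase of the error counter and of the total counter over that
// window. A window with no traffic is reported as 0 rather than NaN.
func errorRate(errorsDelta, totalDelta float64) float64 {
	if totalDelta <= 0 {
		return 0
	}
	return errorsDelta / totalDelta
}

func main() {
	// 5 responses with a 5xx status out of 200 requests in the window.
	fmt.Printf("%.3f\n", errorRate(5, 200))
}
```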
Logger Correlation¶
Scope¶
v1 modifies the existing logger package to automatically inject trace_id and span_id into every log line that is produced with an active span in its context. No call-site changes are required beyond using the *Context variants of slog (slog.InfoContext, slog.ErrorContext, etc.) that the slog API already encourages.
Enrichment is always on once logger.Init runs. When no span is active, the handler performs a single SpanContext.IsValid() check and delegates unchanged. There is no coupling to observability.Init — if tracing is never enabled, no span ever lives in context, and the handler simply passes through.
Mechanism¶
logger.Init wraps the underlying slog.Handler with a trace-aware handler:
package logger
import (
"context"
"log/slog"
"go.opentelemetry.io/otel/trace"
)
type traceHandler struct{ slog.Handler }
func (h *traceHandler) Handle(ctx context.Context, r slog.Record) error {
if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
r.AddAttrs(
slog.String("trace_id", sc.TraceID().String()),
slog.String("span_id", sc.SpanID().String()),
)
}
return h.Handler.Handle(ctx, r)
}
// WithAttrs and WithGroup delegate to the inner handler, preserving slog semantics.
func (h *traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
return &traceHandler{Handler: h.Handler.WithAttrs(attrs)}
}
func (h *traceHandler) WithGroup(name string) slog.Handler {
return &traceHandler{Handler: h.Handler.WithGroup(name)}
}
logger.Init becomes:
func Init(config *Config) {
var handler slog.Handler
if config.JSON {
handler = slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: config.Level})
} else {
handler = slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{Level: config.Level})
}
handler = &traceHandler{Handler: handler}
slog.SetDefault(slog.New(handler))
}
Usage¶
No new API is required for the common path. Any caller that already uses slog.InfoContext(ctx, ...) / slog.ErrorContext(ctx, ...) automatically gets correlated logs when a span is active:
func getOrder(w http.ResponseWriter, r *http.Request) {
slog.InfoContext(r.Context(), "fetching order", "id", chi.URLParam(r, "id"))
// log line includes trace_id and span_id when the observability middleware has started a span
}
Non-Context Calls¶
slog.Info(...) without context passes context.Background() to the handler, so no span is found and no trace fields are emitted. This is expected and matches slog's design. The implication is:
- Code that wants correlated logs must use *Context variants.
- Pre-existing slog.Info/slog.Error call sites keep working; they just won't carry trace IDs.
- This is a gentle forcing function toward context-aware logging, which is already idiomatic in Go 1.21+.
Escape Hatch: LoggerFromContext¶
For call sites that cannot easily use *Context variants (for example, a non-slog logger passed through a third-party library), observability.LoggerFromContext(ctx, l) returns an *slog.Logger pre-populated with trace fields.
If the context has no active span, the input logger is returned unchanged. This is a secondary helper; the primary mechanism is the auto-wrapping handler.
Field Names¶
- trace_id: hex-encoded 16-byte ID
- span_id: hex-encoded 8-byte ID
The names were chosen to match the OTEL logs semantic conventions, so correlation works with downstream log pipelines that already understand OTEL.
Non-Goals for v1¶
- OTEL logs export (deferred; this section wires trace IDs into the existing slog stdout path only)
- Sampling-decision propagation into log records (deferred)
Testing Support¶
NewTestObserver¶
In observability/obstest:
package obstest
// TestObserver is the in-memory harness returned by NewTestObserver.
// It wraps an in-process MemoryBackend for metrics and an OTEL SDK
// TracerProvider backed by tracetest.SpanRecorder for spans.
type TestObserver struct {
Telemetry *telemetry.Telemetry
Metrics *MemoryBackend
Spans *tracetest.SpanRecorder
}
func NewTestObserver(tb interface{ Cleanup(func()) }) *TestObserver
// Direct access
func (b *MemoryBackend) CounterValue(name string, labels ...telemetry.Label) float64
func (b *MemoryBackend) HistogramObservations(name string, labels ...telemetry.Label) []float64
func (b *MemoryBackend) HistogramCount(name string, labels ...telemetry.Label) int
func (b *MemoryBackend) HistogramSum(name string, labels ...telemetry.Label) float64
func (b *MemoryBackend) GaugeValue(name string, labels ...telemetry.Label) float64
func (b *MemoryBackend) UpDownValue(name string, labels ...telemetry.Label) float64
// Assertion helpers (Fatalf on mismatch; *testing.T or any TestingTB)
func (o *TestObserver) AssertCounter(tb TestingTB, name string, want float64, labels ...telemetry.Label)
func (o *TestObserver) AssertHistogramObserved(tb TestingTB, name string, labels ...telemetry.Label) int
func (o *TestObserver) AssertHistogramCount(tb TestingTB, name string, want int, labels ...telemetry.Label)
func (o *TestObserver) AssertGauge(tb TestingTB, name string, want float64, labels ...telemetry.Label)
func (o *TestObserver) AssertUpDownCounter(tb TestingTB, name string, want float64, labels ...telemetry.Label)
func (o *TestObserver) AssertSpan(tb TestingTB, name string) sdktrace.ReadOnlySpan
func (o *TestObserver) AssertNoSpan(tb TestingTB, name string)
Label arguments follow the observation API: variadic telemetry.Label, constructed inline or via telemetry.Labels("k","v", ...). The MemoryBackend canonicalizes label order so assertions match regardless of the order the instrument recorded them.
Example:
func TestCreateOrderEmitsBusinessMetric(t *testing.T) {
obs := obstest.NewTestObserver(t)
ordersCreated, err := obs.Telemetry.Metrics.Counter(telemetry.MetricDefinition{
Name: "orders_created_total",
Kind: telemetry.KindCounter,
Labels: []string{"channel"},
})
require.NoError(t, err)
ordersCreated.Add(1, telemetry.Labels("channel", "web")...)
obs.AssertCounter(t, "orders_created_total", 1,
telemetry.Label{Key: "channel", Value: "web"})
}
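The label-order canonicalization mentioned above can be sketched as a series key built from sorted labels. seriesKey is a hypothetical helper; how MemoryBackend actually keys its observations is not specified here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Label struct {
	Key   string
	Value string
}

// seriesKey builds a lookup key from the metric name and its labels.
// Labels are sorted by key on a copy, so the caller's slice is untouched
// and the key is the same regardless of argument order.
func seriesKey(name string, labels ...Label) string {
	sorted := append([]Label(nil), labels...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })

	var b strings.Builder
	b.WriteString(name)
	for _, l := range sorted {
		// %q quotes the value so separators inside it cannot collide.
		fmt.Fprintf(&b, "|%s=%q", l.Key, l.Value)
	}
	return b.String()
}

func main() {
	a := seriesKey("orders_created_total", Label{"status", "success"}, Label{"channel", "web"})
	b := seriesKey("orders_created_total", Label{"channel", "web"}, Label{"status", "success"})
	fmt.Println(a == b)
}
```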
Test Observer Behavior¶
- Installs an in-memory MetricsBackend (not Prometheus, not OTEL) that records every operation.
- Installs an OTEL TracerProvider backed by the in-memory SDK tracetest.SpanRecorder.
- Automatically registers cleanup via t.Cleanup to restore telemetry.Global() to the no-op default.
- Safe to use in t.Parallel() tests because each TestObserver scopes its telemetry to a context via telemetry.WithContext, and the in-repo instrumented adapters resolve telemetry from context first, then global. (This is the one place where context-scoped telemetry matters.)
Initialization Flow¶
Init(ctx, cfg) performs:
- validate config (reject empty ServiceName or invalid MetricsMode)
- initialize the tracer provider if tracing is enabled; set the global propagator
- initialize the metrics backend if metrics are enabled
- register runtime and process metrics if applicable
- register built-in HTTP, storage, and pubsub instruments
- install telemetry.SetGlobal(...) so storage and pubsub wrappers see the telemetry
- return an initialized Observer
Pseudo-flow:
func Init(ctx context.Context, cfg Config) (*Observer, error) {
if err := cfg.validate(); err != nil {
return nil, err
}
// buildTracerProvider resolves the sampler:
// if cfg.Sampler != nil: sdktrace.ParentBased(cfg.Sampler)
// else: sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))
tp, tpShutdown, err := buildTracerProvider(ctx, cfg)
if err != nil {
return nil, err
}
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(cfg.Propagator)
backend, bShutdown, err := buildMetricsBackend(ctx, cfg)
if err != nil {
return nil, err
}
if err := registerBuiltinMetrics(backend, cfg); err != nil {
return nil, err
}
if err := registerRuntimeMetrics(backend, cfg); err != nil {
return nil, err
}
t := telemetry.Telemetry{
Tracer: tp.Tracer("github.com/tink3rlabs/magic"),
Metrics: backend,
}
telemetry.SetGlobal(&t)
return &Observer{
telemetry: t,
shutdownFns: []func(context.Context) error{tpShutdown, bShutdown},
}, nil
}
Package Integration Strategy¶
Storage Integration¶
The storage package defines ContextualStorageAdapter and storage/telemetry.go. The instrumented wrapper is applied at StorageAdapterFactory.GetInstance time when telemetry.Global().Metrics != nil.
The wrapper resolves telemetry per-call via telemetry.FromContextOrGlobal(ctx) so that:
- test harnesses can scope telemetry to a context
- the wrapper picks up telemetry.SetGlobal even if the adapter was constructed before Init ran
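The per-call resolution order (context first, then global, then no-op) can be sketched as below. The function names follow the ones used on this page, but the bodies, the Telemetry stand-in struct, and the atomic pointer are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// Telemetry is a stand-in for the real struct; Name only identifies
// which instance was resolved.
type Telemetry struct{ Name string }

var (
	noop   = &Telemetry{Name: "noop"}
	global atomic.Pointer[Telemetry]
)

type ctxKey struct{}

// SetGlobal installs the process-wide telemetry; nil restores the no-op.
func SetGlobal(t *Telemetry) { global.Store(t) }

// Global returns the process-wide telemetry, or the no-op if none is set.
func Global() *Telemetry {
	if t := global.Load(); t != nil {
		return t
	}
	return noop
}

// WithContext scopes a telemetry instance to a context.
func WithContext(ctx context.Context, t *Telemetry) context.Context {
	return context.WithValue(ctx, ctxKey{}, t)
}

// FromContextOrGlobal prefers context-scoped telemetry and falls back to
// the global. Resolving on every call lets a wrapper built before Init
// pick up the global once it is installed.
func FromContextOrGlobal(ctx context.Context) *Telemetry {
	if t, ok := ctx.Value(ctxKey{}).(*Telemetry); ok && t != nil {
		return t
	}
	return Global()
}

// resolveDemo shows the three resolution outcomes in order.
func resolveDemo() (before, afterInit, scoped string) {
	ctx := context.Background()
	SetGlobal(nil)
	before = FromContextOrGlobal(ctx).Name
	SetGlobal(&Telemetry{Name: "global"})
	afterInit = FromContextOrGlobal(ctx).Name
	scoped = FromContextOrGlobal(WithContext(ctx, &Telemetry{Name: "test"})).Name
	return
}

func main() {
	fmt.Println(resolveDemo())
}
```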
PubSub Integration¶
Same pattern: ContextualPublisher, pubsub/telemetry.go, wrapper applied at PublisherFactory.GetInstance time.
Behavior When Observability Is Not Enabled¶
If a service does not initialize observability:
- telemetry.Global() returns the no-op implementation
- storage and pubsub wrappers short-circuit to the underlying adapter without extra work
- no tracing or metrics are emitted
- package behavior is byte-for-byte identical to today
This keeps observability opt-in and prevents any regression for existing consumers.
Example Consumer Usage¶
A runnable version of this setup (including local Prometheus and OTLP collector instructions) lives directly in examples/main.go and demonstrates storage + observability together.
Default Prometheus Mode¶
obs, err := observability.Init(ctx, observability.Config{
ServiceName: "orders-api",
ServiceVersion: "1.0.0",
Environment: "prod",
EnableTracing: true,
MetricsMode: observability.MetricsModePrometheus,
TracesOTLPEndpoint: "otel-collector:4317",
TracesOTLPInsecure: true,
})
if err != nil {
log.Fatal(err)
}
defer obs.Shutdown(ctx)
r := chi.NewRouter()
r.Use(middlewares.Observability(obs))
r.Get("/orders/{id}", getOrder)
r.Post("/orders", createOrder)
r.Handle("/metrics", obs.MetricsHandler())
OTEL-Only Mode for Tracing and Metrics¶
obs, err := observability.Init(ctx, observability.Config{
ServiceName: "orders-api",
ServiceVersion: "1.0.0",
Environment: "prod",
EnableTracing: true,
MetricsMode: observability.MetricsModeOTLP,
TracesOTLPEndpoint: "otel-collector:4317",
TracesOTLPInsecure: true,
})
if err != nil {
log.Fatal(err)
}
defer obs.Shutdown(ctx)
r := chi.NewRouter()
r.Use(middlewares.Observability(obs))
// /metrics is still safe to register; it serves a 404 in this mode.
r.Handle("/metrics", obs.MetricsHandler())
Using the Contextual Storage API¶
func getOrder(w http.ResponseWriter, r *http.Request) {
var order Order
err := storageAdapter.(storage.ContextualStorageAdapter).
GetContext(r.Context(), &order, map[string]any{"id": chi.URLParam(r, "id")})
if err != nil {
// span is already marked error; metrics already recorded
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
// ...
}
Since every adapter shipped in the magic repo implements ContextualStorageAdapter starting in Phase 2, the type assertion always succeeds for in-repo adapters. For defensive code against third-party adapters, use the two-value assertion and fall back to the non-ctx method.
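A sketch of the defensive two-value assertion with a fallback. The interfaces are reduced to one method each, and the non-context Get signature is an assumption (GetContext without the context argument); check the real StorageAdapter before copying this.

```go
package main

import (
	"context"
	"fmt"
)

// StorageAdapter is a hypothetical one-method view of the legacy interface.
type StorageAdapter interface {
	Get(dest any, filter map[string]any) error
}

// ContextualStorageAdapter adds the context-aware variant.
type ContextualStorageAdapter interface {
	StorageAdapter
	GetContext(ctx context.Context, dest any, filter map[string]any) error
}

// getWithContext uses GetContext when the adapter supports it and falls
// back to Get otherwise, so a third-party adapter never causes a panic.
func getWithContext(ctx context.Context, a StorageAdapter, dest any, filter map[string]any) error {
	if ca, ok := a.(ContextualStorageAdapter); ok {
		return ca.GetContext(ctx, dest, filter)
	}
	return a.Get(dest, filter)
}

// legacyAdapter implements only the non-context method.
type legacyAdapter struct{ called string }

func (l *legacyAdapter) Get(dest any, filter map[string]any) error {
	l.called = "Get"
	return nil
}

// ctxAdapter implements both methods.
type ctxAdapter struct{ legacyAdapter }

func (c *ctxAdapter) GetContext(ctx context.Context, dest any, filter map[string]any) error {
	c.called = "GetContext"
	return nil
}

// demo reports which method each adapter kind ends up serving.
func demo() (legacy, contextual string) {
	l, c := &legacyAdapter{}, &ctxAdapter{}
	filter := map[string]any{"id": "42"}
	_ = getWithContext(context.Background(), l, nil, filter)
	_ = getWithContext(context.Background(), c, nil, filter)
	return l.called, c.called
}

func main() {
	fmt.Println(demo())
}
```

On the fallback path no span is attached to the call, which matches the metrics-only treatment of legacy adapters described earlier.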
Recommended Built-In Metric Names¶
HTTP¶
- http_requests_total: counter
- http_request_duration_seconds: histogram, buckets {0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
Storage¶
- magic_storage_operations_total: counter
- magic_storage_operation_duration_seconds: histogram, buckets {0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}
- magic_storage_operation_errors_total: counter
PubSub¶
- magic_pubsub_messages_total: counter
- magic_pubsub_publish_duration_seconds: histogram, buckets {0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} (Prometheus default)
- magic_pubsub_errors_total: counter
Runtime / Process¶
Standard collectors for Go runtime and process metrics in scrape-based modes; go.opentelemetry.io/contrib/instrumentation/runtime in OTLP-only mode.
Implementation Plan¶
Phase 1: Neutral Telemetry Package + Core Bootstrap¶
- create telemetry package with interfaces, no-op implementations, and Global/SetGlobal
- create observability package
- add config, init, shutdown
- add metrics mode support (Prometheus, OTLP)
- add tracer provider initialization and propagator configuration
- implement prometheusMetricsBackend and otelMetricsBackend
- add chi middleware (with trailing-edge recording)
- add metrics handler (with 404 fallback for OTLP mode)
- register runtime and process metrics
Phase 2: Storage Instrumentation + Contextual Migration¶
- add ContextualStorageAdapter interface in storage/telemetry.go
- migrate all in-repo adapters (sql.go, dynamodb.go, cosmosdb.go, memory.go, and the pending Cassandra adapter) to implement both interfaces, with real logic in the Context variants and non-ctx methods as one-line delegates
- implement instrumented adapter wrapper with capability check and warn-once on legacy adapters
- wire wrapper into StorageAdapterFactory.GetInstance
- add built-in storage spans and metrics
- add unit tests against NewTestObserver
Phase 3: PubSub Publish Instrumentation¶
- add ContextualPublisher interface in pubsub/telemetry.go
- migrate sns.go to implement both interfaces
- implement instrumented publisher wrapper with capability check and SNS attribute-limit handling
- wire wrapper into PublisherFactory.GetInstance
- add built-in pubsub publish spans and metrics
- expose pubsub.MessageAttributesParamKey so callers and other in-repo publishers agree on the trace-context carrier key
- add unit tests (span on success/error, propagator injection, user-key preservation, attribute-limit drop, legacy-publisher metrics-only)
Phase 4: Custom Metrics + Logger Correlation¶
- add MetricDefinition validation: empty/invalid name, empty/invalid/duplicate label keys, buckets-on-non-histograms, built-in name collisions, go_/process_ reserved prefixes
- add duplicate-registration normalization in both Prometheus and OTEL backends (same shape returns the cached instrument; mismatched shape returns an error)
- add Counter/Histogram/Gauge/UpDownCounter implementations across all backends, all safe for concurrent use
- add telemetry.Labels(kv ...string) []Label helper for ergonomic observation call sites
- add the traceHandler wrap inside logger.Init so every log line with a valid span context carries trace_id and span_id
- add observability.LoggerFromContext(ctx, l) *slog.Logger escape hatch for call sites that cannot easily use *Context variants
- document custom metrics usage patterns, reserved names, and registration-vs-observation label ordering semantics
Phase 5: Testing Harness + Hardening¶
- implement observability/obstest.NewTestObserver with an in-memory MetricsBackend and an OTEL SDK TracerProvider backed by tracetest.SpanRecorder
- add assertion helpers: AssertCounter, AssertHistogramObserved, AssertHistogramCount, AssertGauge, AssertUpDownCounter, AssertSpan, AssertNoSpan
- add tests across both metrics modes (init_test.go for Prometheus, init_otlp_test.go for OTLP with a lazily-dialed endpoint)
- add benchmarks for chi middleware (baseline, no-tracing, with-tracing, parallel) and metric record hot paths (counter Add 0/3 labels, histogram Observe, no-op counter)
- document the obstest API shape, benchmark numbers, cardinality rules, and migration examples
See Performance below for the benchmark numbers and how to interpret the middleware overhead vs. the chi-baseline subtraction.
Testing Strategy¶
Bootstrap¶
- defaults are applied correctly
- invalid config is rejected
- tracing-only, metrics-only, and both modes initialize correctly
- SamplingRatio pointer semantics are respected (nil vs 0 vs > 0)
- when Sampler is set, SamplingRatio is ignored and the sampler is wrapped in ParentBased
- MetricsMode = OTLP produces a non-nil 404-serving MetricsHandler
HTTP Middleware¶
- chi route pattern is used instead of raw path
- unmatched routes emit route="unmatched"
- status code is captured correctly
- spans are created when tracing is enabled
- panics in downstream handlers are re-raised with error-tagged span and 500-labeled metrics
Storage Instrumentation Tests¶
- all supported operations emit spans when called through the Context variants
- duration and error metrics are recorded for both Context and non-Context calls
- legacy (non-contextual) adapter emits metrics only, no spans, and logs a warn-once
- provider, operation, and status labels are applied correctly
PubSub Instrumentation Tests¶
- PublishContext injects the configured propagator's fields
- SNS attribute-limit handling drops baggage first, then tracestate, never traceparent
- publish metrics are recorded with correct provider/destination/status
- legacy publisher gets metrics-only treatment
Custom Metrics Tests¶
- registration succeeds for valid definitions
- duplicate compatible registrations return the same instrument
- conflicting registrations (kind, labels, or buckets) fail
- undeclared labels fail when AllowUndeclaredLabels is false
- namespace is applied only to custom metrics
Metrics Modes¶
- Prometheus mode exposes scrape endpoint with built-in and custom metrics
- OTLP mode pushes to the configured collector and MetricsHandler() serves 404
Logger Correlation Tests¶
- slog.InfoContext(ctx, ...) with an active span produces log lines containing trace_id and span_id
- slog.InfoContext(ctx, ...) with no active span produces log lines without those fields
- slog.Info(...) (no-context variant) produces log lines without those fields even when a span is active in a surrounding goroutine
- WithAttrs and WithGroup on the wrapped handler preserve trace injection after chaining
- JSON and text handlers both emit the fields correctly
- observability.LoggerFromContext returns a logger with the same fields pre-populated for non-*Context call sites
Performance¶
Benchmarks live in middlewares/observability_bench_test.go and observability/metrics_bench_test.go. The numbers below were captured on linux/arm64 in a devcontainer; treat them as an order-of-magnitude reference rather than an SLA. Run go test -run '^$' -bench=. -benchtime=500ms ./middlewares ./observability/ to reproduce on your hardware.
Metric record hot path¶
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
| BenchmarkNoopCounterAdd | ~16 | 32 | 1 |
| BenchmarkPrometheusCounterAddNoLabels | ~25 | 0 | 0 |
| BenchmarkPrometheusCounterAddThreeLabels | ~150 | 144 | 2 |
| BenchmarkPrometheusHistogramObserveThreeLabels | ~140 | 144 | 2 |
Interpretation:
- Callers that never call observability.Init pay ~16 ns per observation. That one allocation is the variadic ...Label slice the Go compiler places on the heap; eliminate it by calling Add(1) with no labels where possible.
- The two allocations on the labeled Prometheus paths come from prometheus.CounterVec.WithLabelValues building a fresh []string label-values slice. They are the dominant cost above the lock-free atomic increment.
- If a metric is on a truly hot path (> 1M ops/sec per core), pre-allocate its Counter/Histogram once at registration time (already what the built-in instrumentation does) and batch-record rather than observing per item.
HTTP middleware¶
| Benchmark | ns/op | B/op | allocs/op |
|---|---|---|---|
| BenchmarkChiRouterBaseline | ~680 | 1384 | 12 |
| BenchmarkChiMiddlewareNoTracing | ~2250 | 2953 | 36 |
| BenchmarkChiMiddlewareWithTracing | ~3130 | 4620 | 41 |
| BenchmarkChiMiddlewareParallel | ~1050 | 2952 | 36 |
The Chi* benchmarks include the full chi.Router.ServeHTTP dispatch and a httptest.NewRecorder() allocation. Subtract BenchmarkChiRouterBaseline (~680 ns) to get the pure middleware overhead:
- Without tracing: ~1.5 µs / ~1.5 KB / ~24 allocs per request. Most allocs are the response-writer wrapper, the three label-value slices (method/route/status), and the pre/post requestsInFlight projections.
- With tracing: ~2.5 µs / ~3.2 KB / ~29 allocs per request. The extra cost is span start/end plus WithAttributes for method, route, and status.
The design targets were < 1 µs without tracing and < 5 µs with tracing on a typical dev laptop. Tracing is comfortably under target at ~2.5 µs. The no-tracing path sits above its 1 µs target at ~1.5 µs on arm64 and is expected to be under target on x86_64 laptops. Shrinking the allocation count further (e.g. pooling the response-writer wrapper, collapsing in-flight projections into the request span) is an open optimization tracked in Future Enhancements.
Parallel scaling¶
BenchmarkChiMiddlewareParallel is faster per op than the serial variant because the shared state on the request path, chi's router mutex and the Prometheus vector RLocks, is contended but not serialized, so the work spreads across cores. A sharp regression here (e.g. the parallel per-op cost rising above the serial number) would indicate accidental contention, typically from taking a write lock in the observation path.
Risks and Mitigations¶
Risk: High cardinality metrics¶
Mitigation:
- chi route patterns, not raw paths
- declared label keys enforced by default
- no tenant IDs, user IDs, resource IDs, or message IDs
Risk: Import cycle / heavy dependency footprint¶
Mitigation:
- neutral telemetry package hosts interfaces only
- storage and pubsub import telemetry only (no Prometheus or OTEL deps)
- consumers who never call observability.Init never pay the Prometheus/OTEL cost
Risk: Adapters not migrated to ContextualStorageAdapter¶
Mitigation:
- all in-repo adapters are migrated in Phase 2
- legacy third-party adapters fall back to metrics-only with a warn-once
- documentation and the warn-once message link to the migration recipe
Risk: Hidden behavior after Init¶
Mitigation:
- observability is explicitly initialized by the service
- package instrumentation activates only after telemetry.SetGlobal
- no-op behavior before Init, after Shutdown, and in tests that omit Init
Risk: Breaking existing consumers¶
Mitigation:
- existing StorageAdapter and Publisher interfaces are unchanged
- ContextualStorageAdapter and ContextualPublisher are extension interfaces
- default behavior without Init is byte-for-byte identical to today
Risk: SNS message-attribute limit¶
Mitigation:
- propagator fields merged without overwriting user keys
- drop order defined (baggage, then tracestate, never traceparent)
- warn-once when truncation occurs
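The merge and drop order can be sketched as below. mergeTraceAttrs is a hypothetical helper, attributes are simplified to string maps, and the limit is passed in by the caller (10 is used purely as an illustrative number). What happens when user attributes alone already fill the limit is not specified above; this sketch keeps traceparent regardless.

```go
package main

import "fmt"

// dropOrder lists propagator fields from most to least expendable.
// traceparent is intentionally absent: it is never dropped.
var dropOrder = []string{"baggage", "tracestate"}

// mergeTraceAttrs copies propagator fields from carrier into a copy of the
// user attributes without overwriting user keys, then drops expendable
// propagator fields in dropOrder until the result fits within limit.
func mergeTraceAttrs(user, carrier map[string]string, limit int) (merged map[string]string, dropped []string) {
	merged = make(map[string]string, len(user)+len(carrier))
	for k, v := range user {
		merged[k] = v
	}
	added := make(map[string]bool, len(carrier))
	for k, v := range carrier {
		if _, exists := merged[k]; exists {
			continue // the user's key wins
		}
		merged[k] = v
		added[k] = true
	}
	for _, k := range dropOrder {
		if len(merged) <= limit {
			break
		}
		if added[k] { // only drop fields this function added
			delete(merged, k)
			dropped = append(dropped, k)
		}
	}
	return merged, dropped
}

// userAttrs builds n placeholder user attributes for the demo.
func userAttrs(n int) map[string]string {
	m := make(map[string]string, n)
	for i := 0; i < n; i++ {
		m[fmt.Sprintf("attr%d", i)] = "v"
	}
	return m
}

func main() {
	carrier := map[string]string{"traceparent": "00-trace-span-01", "tracestate": "vendor=1", "baggage": "tenant=a"}
	merged, dropped := mergeTraceAttrs(userAttrs(9), carrier, 10)
	fmt.Println(len(merged), dropped)
}
```

A real wrapper would also emit the warn-once log when dropped is non-empty.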
Risk: Middleware ordering and panic handling¶
Mitigation:
- metrics and span finalization happen in a deferred block after next.ServeHTTP, so chi's route pattern is populated
- middleware does not recover(); panics are recorded, re-raised, and safe to handle with any upstream recover middleware
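One way to record a panic in a deferred block without calling recover() is a completion flag, sketched below with the standard library only. observe and its record callback are hypothetical stand-ins for the real middleware and its metric and span finalization; the actual implementation may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusWriter captures the status code written by the handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// observe reports the final status to record in a deferred block, after
// next.ServeHTTP has run. It never calls recover(), so a panic keeps
// unwinding to whatever recover middleware sits upstream.
func observe(record func(status int, panicked bool)) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
			completed := false
			defer func() {
				if !completed {
					// next did not return normally: label the request as a 500.
					record(http.StatusInternalServerError, true)
					return
				}
				record(sw.status, false)
			}()
			next.ServeHTTP(sw, r)
			completed = true
		})
	}
}

var (
	createdHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(http.StatusCreated) })
	panicHandler   = http.HandlerFunc(func(http.ResponseWriter, *http.Request) { panic("boom") })
)

// runOnce serves one request and reports what was recorded and whether
// the panic reached the caller.
func runOnce(h http.Handler) (status int, panicked, propagated bool) {
	mw := observe(func(s int, p bool) { status, panicked = s, p })
	func() {
		defer func() { propagated = recover() != nil }()
		mw(h).ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/orders/42", nil))
	}()
	return
}

func main() {
	fmt.Println(runOnce(createdHandler))
	fmt.Println(runOnce(panicHandler))
}
```

The recover() in runOnce plays the role of an upstream recover middleware; the observing middleware itself only notices that the handler did not complete.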
Future Enhancements¶
Potential additions after v1, in rough priority order:
- PubSub consumer instrumentation. Once a Consumer/Subscriber interface lands in the pubsub package, add pubsub.consume/pubsub.process/pubsub.ack spans, context extraction from inbound messages, and corresponding metrics.
- Health-check instrumentation. Spans and metrics for health.Check calls.
- OTEL logs integration. Once OTEL Go logs SDK is stable for the exporters we need.
- Exemplars. Link metrics to traces for backends that support it (Prometheus exemplars, OTEL exemplars over OTLP).
- Route-level metric suppression helpers. E.g. obs.SuppressRoute("/internal/*") at registration time rather than per-config.
- Scoped Meter("orders") helpers for optional metric prefix scoping.
- ARN/topic normalization hook for pubsub destination label cardinality control.
Explicitly Deferred Metrics Export Modes¶
Two modes were considered for v1 and deliberately cut:
- OTEL Prometheus exporter mode (OTEL SDK emitting to /metrics). The same operational outcome is achievable by running the OTEL Collector with a Prometheus exporter in front of services configured in MetricsModeOTLP. Adding it in-process was not worth the second code path, testing surface, and subtle output-format differences from native client_golang.
- Dual export mode (simultaneous Prometheus + OTLP from the same process). Migration is better handled at the Collector: scrape the existing /metrics with a Prometheus receiver, emit OTLP downstream, then flip the library to MetricsModeOTLP once dashboards consume the Collector's output. In-process dual export's double-counting risk and per-record overhead were not worth supporting.
Both modes can be added later as new MetricsMode values without breaking existing consumers.
Summary¶
This design introduces a first-class observability layer for magic that provides:
- OTEL-based distributed tracing with W3C context propagation
- two metrics export modes: Prometheus (scrape) and OTLP (push)
- automatic instrumentation for HTTP, storage (via ContextualStorageAdapter), and pubsub publish (via ContextualPublisher)
- near-zero-touch infrastructure telemetry without breaking existing consumers
- first-class custom business metrics with safe-by-default label validation
- a backend-neutral metrics API that works across Prometheus or OTEL
- automatic trace/span correlation in application logs via the logger package's slog handler wrap
- an in-memory testing harness for unit tests
The result is a complete observability foundation for services built on magic:
- easy to adopt
- safe by default
- non-breaking for current consumers
- powerful without requiring every service to reinvent instrumentation
- flexible enough to support both Prometheus-first and OTEL-first teams