Observability
Inference Gateway ships first-class observability for production deployments: a Prometheus-compatible metrics endpoint with named gateway and tool-call metrics, OpenTelemetry-based instrumentation, structured JSON logs, and reference Grafana / Loki / Jaeger stacks you can copy directly into your cluster.
This page covers:
- Exported Prometheus metrics (names, labels, units)
- Enabling telemetry and scraping with Prometheus / kube-prometheus-stack
- A reference Grafana dashboard
- Distributed tracing with OpenTelemetry Collector and Jaeger
- Log aggregation with Loki + Promtail
- The official Kubernetes monitoring stack examples
Enabling Telemetry
Telemetry is opt-in. Set the following variables to expose metrics on a dedicated port:
TELEMETRY_ENABLE=true
TELEMETRY_METRICS_PORT=9464 # defaultWhen enabled, the gateway exposes a Prometheus exposition endpoint at http://<gateway>:9464/metrics. Metrics are exported by the OpenTelemetry SDK via the Prometheus exporter, with service attributes (service.name, service.version, deployment.environment) attached from the gateway's runtime.
The metrics port is separate from the main API port. Do not expose 9464 to the public internet; scope it to your cluster network or behind your monitoring ingress.
Exported Metrics
The following metrics are exported when TELEMETRY_ENABLE=true. All metric values are observed in milliseconds for histograms; counters are monotonic.
Token usage
| Metric | Type | Labels | Description |
|---|---|---|---|
llm_usage_prompt_tokens | Counter | provider, model | Number of prompt tokens consumed per provider/model. |
llm_usage_completion_tokens | Counter | provider, model | Number of completion tokens generated per provider/model. |
llm_usage_total_tokens | Counter | provider, model | Total tokens (prompt + completion) per provider/model. |
Requests and responses
| Metric | Type | Labels | Description |
|---|---|---|---|
llm_requests_total | Counter | provider, request_type | Total number of requests processed by the gateway. |
llm_responses_total | Counter | provider, request_method, request_path, status_code | Total responses bucketed by HTTP status code. |
llm_request_duration | Histogram (ms) | provider, request_method, request_path | End-to-end request duration in milliseconds. |
Histogram bucket boundaries (ms): 1, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000.
Tool / function calls
| Metric | Type | Labels | Description |
|---|---|---|---|
llm_tool_calls_total | Counter | provider, model, tool_type, tool_name | Total number of tool/function calls executed. |
llm_tool_calls_success_total | Counter | provider, model, tool_type, tool_name | Successful tool/function calls. |
llm_tool_calls_failure_total | Counter | provider, model, tool_type, tool_name, error_type | Failed tool/function calls bucketed by error reason. |
llm_tool_call_duration | Histogram (ms) | provider, model, tool_type, tool_name | Per-call duration of tool/function invocation. |
tool_type distinguishes MCP-backed tools from A2A tools and inline function calls (mcp, a2a, function). tool_name is the fully qualified tool identifier as registered with the gateway.
Example queries (PromQL)
# Requests per second by provider
sum by (provider) (rate(llm_requests_total[5m]))
# p95 end-to-end request latency
histogram_quantile(0.95, sum by (le, provider) (rate(llm_request_duration_bucket[5m])))
# Error rate (5xx) per provider
sum by (provider) (rate(llm_responses_total{status_code=~"5.."}[5m]))
/ sum by (provider) (rate(llm_responses_total[5m]))
# Tool-call failure ratio per tool
sum by (tool_name) (rate(llm_tool_calls_failure_total[5m]))
/ sum by (tool_name) (rate(llm_tool_calls_total[5m]))
# Tokens per minute (prompt + completion) per model
sum by (model) (rate(llm_usage_total_tokens[1m])) * 60Prometheus Scrape Configuration
Plain Prometheus
scrape_configs:
- job_name: inference-gateway
metrics_path: /metrics
static_configs:
- targets: ['inference-gateway:9464']
labels:
service: inference-gatewayKubernetes (kube-prometheus-stack)
Expose the metrics port on the gateway Service:
apiVersion: v1
kind: Service
metadata:
name: inference-gateway
namespace: inference-gateway
labels:
app: inference-gateway
spec:
selector:
app: inference-gateway
ports:
- name: http
port: 8080
targetPort: 8080
- name: metrics
port: 9464
targetPort: 9464Then have the Prometheus Operator pick it up via a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: inference-gateway
namespace: monitoring
labels:
release: kube-prometheus-stack
spec:
selector:
matchLabels:
app: inference-gateway
namespaceSelector:
matchNames:
- inference-gateway
endpoints:
- port: metrics
path: /metrics
interval: 15s
scrapeTimeout: 10sMake sure the release: label matches your kube-prometheus-stack release name so the Prometheus Operator selects the ServiceMonitor.
Recommended alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: inference-gateway
namespace: monitoring
spec:
groups:
- name: inference-gateway
rules:
- alert: InferenceGatewayHighErrorRate
expr: |
sum by (provider) (rate(llm_responses_total{status_code=~"5.."}[5m]))
/ sum by (provider) (rate(llm_responses_total[5m])) > 0.05
for: 10m
labels:
severity: warning
annotations:
summary: 'Inference Gateway 5xx rate above 5% for {{ $labels.provider }}'
- alert: InferenceGatewayP95LatencyHigh
expr: |
histogram_quantile(0.95,
sum by (le, provider) (rate(llm_request_duration_bucket[5m]))
) > 5000
for: 10m
labels:
severity: warning
annotations:
summary: 'p95 request duration > 5s for {{ $labels.provider }}'
- alert: InferenceGatewayToolCallFailures
expr: |
sum by (tool_name) (rate(llm_tool_calls_failure_total[10m]))
/ sum by (tool_name) (rate(llm_tool_calls_total[10m])) > 0.1
for: 15m
labels:
severity: warning
annotations:
summary: 'Tool {{ $labels.tool_name }} failing > 10% of calls'Grafana Dashboard
A reference Grafana dashboard for these metrics is maintained in the main gateway repo:
- Docker Compose example:
examples/docker-compose/monitoring/grafana/dashboards/inference-gateway.json - Kubernetes example with provisioned dashboards:
examples/kubernetes/monitoring
Import it into Grafana via Dashboards -> New -> Import, paste the JSON, and bind it to your Prometheus data source.
A minimal hand-rolled panel set you can build in Grafana:
| Panel | Query |
|---|---|
| Requests/sec by provider | sum by (provider) (rate(llm_requests_total[1m])) |
| Error rate by provider | sum by (provider) (rate(llm_responses_total{status_code=~"5.."}[5m])) / sum by (provider) (rate(llm_responses_total[5m])) |
| p50 / p95 / p99 latency | histogram_quantile(0.95, sum by (le, provider) (rate(llm_request_duration_bucket[5m]))) |
| Tokens/min by model | sum by (model) (rate(llm_usage_total_tokens[1m])) * 60 |
| Tool-call success rate | sum by (tool_name) (rate(llm_tool_calls_success_total[5m])) / sum by (tool_name) (rate(llm_tool_calls_total[5m])) |
Distributed Tracing
The gateway initializes the global OpenTelemetry tracer provider when TELEMETRY_ENABLE=true. Spans are emitted for inbound requests and downstream provider calls, with service.name, service.version, and deployment.environment populated as resource attributes.
Use the standard OpenTelemetry environment variables to export to a collector via OTLP:
TELEMETRY_ENABLE=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_SERVICE_NAME=inference-gateway
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=productionOpenTelemetry Collector configuration
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
send_batch_size: 1024
timeout: 5s
memory_limiter:
check_interval: 2s
limit_mib: 512
exporters:
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: true
debug:
verbosity: basic
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/jaeger]Jaeger backend (Kubernetes)
Install Jaeger via Helm:
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm install jaeger jaegertracing/jaeger \
--namespace observability --create-namespace \
--set allInOne.enabled=true \
--set provisionDataStore.cassandra=false \
--set storage.type=memoryThen point the gateway's OTEL_EXPORTER_OTLP_ENDPOINT at the collector service exposed by your Jaeger or OTel Collector deployment (port 4317 for gRPC or 4318 for HTTP). Traces will be visible in the Jaeger UI under the inference-gateway service.
Log Aggregation
The gateway emits structured JSON logs on stdout. Each entry includes a timestamp, level, message, and contextual fields such as provider, model, request_id, and status_code. Set log verbosity with:
ENVIRONMENT=production # info-level logs
# or
ENVIRONMENT=development # debug logs (verbose)Loki + Promtail (Kubernetes)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace observability --create-namespace \
--set grafana.enabled=false \
--set promtail.enabled=truePromtail will discover the gateway pods automatically. To filter to gateway logs in Grafana / LogQL:
{namespace="inference-gateway", app="inference-gateway"} | jsonUseful queries:
# Error logs only
{app="inference-gateway"} | json | level="error"
# 5xx responses with their request_id
{app="inference-gateway"} | json | status_code >= 500
# All MCP-related logs
{app="inference-gateway"} | json | logger=~"mcp.*"
# Trace a single request across providers
{app="inference-gateway"} | json | request_id="<uuid>"If you also export traces to Jaeger, the request_id field doubles as a correlation key between logs and traces.
Reference Monitoring Stack
For a runnable end-to-end setup (Prometheus + Grafana + Loki + dashboards already wired), see the official examples:
- Docker Compose:
examples/docker-compose/monitoring - Kubernetes (kube-prometheus-stack):
examples/kubernetes/monitoring
Both bring up the gateway with TELEMETRY_ENABLE=true, scrape /metrics, provision the reference Grafana dashboards, and (in the Kubernetes example) install a ServiceMonitor for kube-prometheus-stack.
Troubleshooting Observability
- No data in
/metrics- confirmTELEMETRY_ENABLE=trueand that you are scraping the telemetry port (TELEMETRY_METRICS_PORT, default9464), not the main API port. - ServiceMonitor not picked up - the
release:label on theServiceMonitormust match your kube-prometheus-stack Helm release name (oftenkube-prometheus-stack). Check the Prometheus Operator'sserviceMonitorSelector. - Traces missing - verify
OTEL_EXPORTER_OTLP_ENDPOINTresolves from inside the gateway pod and that the collector is listening on the expected protocol (grpcon 4317 vshttpon 4318). - Logs not appearing in Loki - check that Promtail is running as a DaemonSet on the gateway's node and that its scrape config matches the gateway's namespace/labels.
For broader operational issues (auth, MCP, vision, provider 4xx debugging), see Troubleshooting.
