
Service Catalog

The service catalog is YipYap’s internal-facing map of your system: a directory of services, each with its owners, monitors, runbooks, dashboards, and dependencies. When an alert fires for a monitor that’s been linked to a service, the alert surface picks up the service’s context (runbooks, links, owner team) automatically.

This is the difference between “Public API monitor is down” and “Public API (tier-1, owned by platform-oncall, runbook: rollback the latest deploy, depends on user-db and feature-flags).” On-call quality is the cumulative product of small contextual nudges; the catalog is the systemic version.

[Figure: Service catalog overview]

A service has:

  • Name: short, system-recognisable. (“checkout-api”, not “the checkout API service we built last quarter”.)
  • Description: what the service does, in one or two sentences. Plain text, used in alert context.
  • Tier: tier-0 (mission-critical), tier-1 (production-customer-facing), tier-2 (internal-but-important), tier-3 (experimental). Drives default alert severity and affects status-page prominence (for teams that publish service tiers).
  • Owner team: references a Team. Used as a fallback escalation target when the monitor’s escalation policy hits a “team” step that doesn’t otherwise resolve.
  • Linked monitors: the monitors whose health represents this service’s health. A service can have multiple monitors (HTTP + heartbeat + dependency-probe is typical for a real service).
  • Links: runbooks, dashboards, repository, design docs. Each is (label, url, kind). Surfaced both in the catalog UI AND on alert detail pages.
  • Dependencies: directed edges to other services. Used for blast-radius visualisation and for alert-context enrichment (“this alert affects 4 downstream services”).
  • Labels (key/value): free-form tagging used for filtering and grouping.
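As a rough illustration, the fields above could be modelled like this. This is a minimal sketch, not YipYap's actual schema: the class names, attribute names, monitor IDs, and the example URL are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Link:
    label: str
    url: str
    kind: str  # one of: runbook, dashboard, repo, docs, other


@dataclass
class Service:
    # Hypothetical field names; the real API may name these differently.
    name: str
    description: str = ""
    tier: str = "tier-2"  # tier-0 .. tier-3; tier-2 is the UI default
    owner_team: Optional[str] = None  # optional in the API
    monitor_ids: List[str] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)
    depends_on: List[str] = field(default_factory=list)  # upstream service names
    labels: Dict[str, str] = field(default_factory=dict)


checkout = Service(
    name="checkout-api",
    description="Handles cart checkout.",
    tier="tier-1",
    owner_team="platform",
    monitor_ids=["checkout-http", "checkout-heartbeat"],
    links=[Link("Rollback the latest checkout deploy",
                "https://example.com/runbooks/checkout-rollback", "runbook")],
    depends_on=["user-db", "feature-flags"],
    labels={"lang": "go", "region": "us-east-1"},
)
```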

To set expectations correctly, here is what the catalog is not:

  • Not a public-facing surface. Services don’t render on status pages. The catalog is a console-internal aid for operators. (Status pages display monitors, organised into operator-defined groups; they’re a curated public face, not a mirror of the service graph.)
  • Not a monitoring substitute. The catalog doesn’t run health checks of its own; it derives a status badge from the linked monitors. If you don’t have monitors on a service, its catalog status will be “Unknown”.
  • Not a service mesh. “Dependencies” are documentation, not runtime. Declaring checkout depends on user-db doesn’t intercept network traffic or enforce policy. It enriches alerts and visualises blast radius; it doesn’t reroute requests.
  • Not a CMDB. It’s intentionally minimal: the goal is “context for on-call,” not “complete inventory of every running process.” Add what helps; resist the urge to add what doesn’t.
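The derived status badge mentioned above can be pictured as a small reduction over the linked monitors. The sketch below assumes a "worst status wins" rule and a three-value monitor status set; both are illustrative assumptions. Only the "Unknown" result for a service with no monitors comes from the behaviour described here.

```python
# Assumed ordering from healthiest to worst; not YipYap's actual status set.
SEVERITY_ORDER = ["up", "degraded", "down"]


def derive_service_status(monitor_statuses):
    """Return the worst status among linked monitors, or "Unknown" if none."""
    if not monitor_statuses:
        return "Unknown"
    return max(monitor_statuses, key=SEVERITY_ORDER.index)
```

For example, a service whose monitors report `["up", "down"]` would show as `"down"` under this assumed rule.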

A useful rule of thumb: if an operator paged at 03:00 would benefit from knowing it exists, add it; otherwise, skip.

Cataloguing every Lambda function and Kafka topic is a path to a directory that no one maintains. Cataloguing your top 10-20 customer-impact-relevant services with sharp names, real owners, and current runbooks pays back the first time an alert fires.

  1. Go to Features → Service Catalog → New Service.
  2. Enter a Name, Description, and Tier (defaults to tier-2).
  3. Owner Team: pick from existing teams. (Settings → Teams to create one.)
  4. Linked Monitors: multi-select. The monitor’s current status drives the service’s status; the latest alert on each monitor surfaces in the service detail page.
  5. Links: add runbook URLs, repository URLs, dashboard URLs. Each link picks a kind: runbook, dashboard, repo, docs, other. The kind drives which icon renders.
  6. Dependencies: declare what this service depends on. The dependency graph enforces a DAG (no cycles); the UI surfaces a blast-radius preview when you save.
  7. Save.

Once saved, every alert from a linked monitor will carry the service’s name, tier, owner team, and the FIRST runbook link in its surface; the on-call sees this on every page.

Tiers exist to make “this is mission-critical, that is experimental” legible at a glance. The mapping:

| Tier | Default downtime severity | Status page prominence | Default escalation behaviour |
| --- | --- | --- | --- |
| tier-0 | Critical | Prominent | Page primary on-call immediately, fallback to manager team within 5 min. |
| tier-1 | Major | Prominent | Page primary on-call, fallback within 10 min. |
| tier-2 | Minor | Standard | Page primary on-call, fallback within 30 min. |
| tier-3 | Info | Collapsed by default | Best-effort during business hours. |

The escalation behaviour is a default; your actual escalation policy on the linked monitor wins. The tier just sets a sensible starting point for new policies and a presentation hint for the catalog UI.
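The "default, unless the policy says otherwise" rule can be sketched as a lookup with an override. The dictionary keys and the function name below are hypothetical; the severity and fallback values are taken from the tier mapping above.

```python
# Tier defaults from the mapping above; key names are assumptions.
TIER_DEFAULTS = {
    "tier-0": {"severity": "Critical", "fallback_minutes": 5},
    "tier-1": {"severity": "Major", "fallback_minutes": 10},
    "tier-2": {"severity": "Minor", "fallback_minutes": 30},
    "tier-3": {"severity": "Info", "fallback_minutes": None},  # best-effort
}


def effective_fallback_minutes(tier, policy_fallback_minutes=None):
    """The escalation policy on the linked monitor wins over the tier default."""
    if policy_fallback_minutes is not None:
        return policy_fallback_minutes
    return TIER_DEFAULTS[tier]["fallback_minutes"]
```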

A runbook link attached to a service shows up on every alert from a linked monitor. The convention is:

  • Title is short and actionable: “Rollback the latest checkout deploy”, not “Checkout incident response procedures v3 (Q2 2025)”.
  • URL points to wherever the runbook lives: Notion, Confluence, an internal wiki, or GitHub Markdown. The platform doesn’t matter.
  • Kind: runbook for actual playbooks; docs for background reading; repo for source.

The first runbook-kind link on a service is highlighted in alert notifications; additional runbooks are listed below it. If you have one runbook for a service, that’s plenty; if you have ten, the on-call won’t read any of them. Split them by symptom and put the most likely relevant one first, since that is the link that gets highlighted.
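To make the ordering rule concrete, here is a minimal sketch of picking the highlighted runbook from a service's links. Links are represented as plain dictionaries with `label`, `url`, and `kind` keys, which is an assumed shape rather than the real payload.

```python
def split_runbooks(links):
    """Return (highlighted, others): the first runbook-kind link and the rest."""
    runbooks = [link for link in links if link["kind"] == "runbook"]
    if not runbooks:
        return None, []
    return runbooks[0], runbooks[1:]


links = [
    {"label": "Checkout dashboard", "url": "https://example.com/d/checkout", "kind": "dashboard"},
    {"label": "Rollback the latest checkout deploy", "url": "https://example.com/rb/rollback", "kind": "runbook"},
    {"label": "Drain the payment queue", "url": "https://example.com/rb/drain", "kind": "runbook"},
]
highlighted, others = split_runbooks(links)
```

Non-runbook links such as dashboards are skipped, so link order only matters among the runbook-kind entries.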

Declared dependencies form a directed acyclic graph. The catalog UI shows:

  • Upstream: services this service depends on. If checkout is down, look at user-db and feature-flags first.
  • Downstream: services that depend on this one. If user-db is down, expect alerts from checkout, login, profile.
[Diagram: checkout-api (tier-1 · platform) depends on user-db (tier-0) and feature-flags (tier-2).]

Cycles are rejected at create time: if you try to declare a circular dependency, the API returns 422.
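One common way to enforce this kind of rule is a reachability check before adding an edge: a new edge from A to B closes a cycle exactly when B can already reach A. The sketch below illustrates that idea on an in-memory graph; it is not YipYap's implementation, and the function name and graph shape are assumptions.

```python
def would_create_cycle(depends_on, source, target):
    """True if adding the edge source -> target would close a cycle.

    depends_on maps each service name to the set of services it depends on.
    """
    stack, seen = [target], set()
    while stack:
        node = stack.pop()
        if node == source:
            return True  # target already reaches source, so the edge loops back
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, ()))
    return False


graph = {"checkout-api": {"user-db", "feature-flags"}}
```

With this graph, declaring that user-db depends on checkout-api would be flagged, which is the situation the API answers with a 422.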

Dependencies are documentation, not runtime (see the “Not a service mesh” point above). They don’t intercept traffic or enforce policy.

When declaring a dependency, choose a relationship type:

| Relationship | Meaning |
| --- | --- |
| required | Outage of the target degrades this service. Surfaced as “Critical dependency” in alert context. |
| optional | Useful but not required; outage causes degradation only. |
| informational | No health implication; just documenting the edge. |

The relationship type drives blast-radius visualisation (required edges propagate severity; informational ones don’t).
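A blast-radius computation along these lines can be sketched as a breadth-first walk over the edges that propagate. The edge-tuple shape, the function name, and the services "reports" and "wiki" are hypothetical. Treating only required edges as propagating by default is also an assumption, since how optional edges are handled is not spelled out here.

```python
from collections import deque

# (dependent, target, relationship)
EDGES = [
    ("checkout-api", "user-db", "required"),
    ("login", "user-db", "required"),
    ("profile", "user-db", "optional"),
    ("reports", "checkout-api", "required"),
    ("wiki", "user-db", "informational"),
]


def blast_radius(edges, failed, propagating=("required",)):
    """Services transitively affected by `failed` via propagating edges."""
    dependents = {}
    for dependent, target, relationship in edges:
        if relationship in propagating:
            dependents.setdefault(target, set()).add(dependent)
    affected, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return sorted(affected)
```

Calling `blast_radius(EDGES, "user-db")` walks through checkout-api to reach reports, while the informational edge from wiki is ignored.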

Labels are a free-form string → string map. Common patterns:

  • lang: go, lang: rust, lang: typescript (language ownership).
  • runtime: k8s, runtime: lambda, runtime: vm (deploy substrate).
  • region: us-east-1, region: eu-west-1 (locality).
  • pii: high / pii: low (compliance scoping).

Labels can be filtered in the catalog list view and queried via the Services API. They’re metadata; they don’t drive behaviour by themselves, but they let you build queries that do.
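As an illustration of label-driven queries, the sketch below filters a locally held list of services by exact label match. It does not call the Services API, whose query syntax is not covered here; the function name and the dictionary shape are assumptions.

```python
def filter_by_labels(services, **wanted):
    """Names of services whose labels match every wanted key/value pair."""
    return [
        service["name"]
        for service in services
        if all(service.get("labels", {}).get(k) == v for k, v in wanted.items())
    ]


services = [
    {"name": "checkout-api", "labels": {"lang": "go", "pii": "high"}},
    {"name": "feature-flags", "labels": {"lang": "go", "pii": "low"}},
    {"name": "user-db", "labels": {"pii": "high"}},
]
```

A compliance review might start from `filter_by_labels(services, pii="high")` and then narrow by language or region.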

When an alert fires for a monitor linked to a service, the alert surface (web dashboard, Slack, email, the reply audit) carries:

  • Service name, tier, owner team
  • The service’s first runbook-kind link (highlighted)
  • A direct link to the service detail page in the console
  • Names of services that depend on this one (downstream blast radius preview)

This is automatic; there’s no per-monitor checkbox to opt in. Linking the monitor to the service is the opt-in.
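The enrichment described above amounts to a lookup from monitor to service, followed by collecting a few fields. A minimal sketch follows, assuming services are plain dictionaries and a monitor is linked to at most one service; both are assumptions. The console link is left out because its URL scheme is not documented here, and only direct dependents are listed.

```python
def build_alert_context(monitor_id, services):
    """Return service context for an alert, or None if the monitor is unlinked."""
    service = next((s for s in services if monitor_id in s["monitor_ids"]), None)
    if service is None:
        return None
    runbook = next((l for l in service["links"] if l["kind"] == "runbook"), None)
    downstream = sorted(
        s["name"] for s in services if service["name"] in s["depends_on"]
    )
    return {
        "service": service["name"],
        "tier": service["tier"],
        "owner_team": service["owner_team"],
        "runbook": runbook,        # first runbook-kind link, if any
        "downstream": downstream,  # direct dependents only
    }


SERVICES = [
    {"name": "user-db", "tier": "tier-0", "owner_team": "data",
     "monitor_ids": ["user-db-tcp"], "depends_on": [],
     "links": [{"label": "Fail over user-db", "kind": "runbook",
                "url": "https://example.com/runbooks/user-db-failover"}]},
    {"name": "checkout-api", "tier": "tier-1", "owner_team": "platform",
     "monitor_ids": ["checkout-http"], "depends_on": ["user-db"], "links": []},
]

context = build_alert_context("user-db-tcp", SERVICES)
```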

Programmatic management lives behind the Services API, useful for terraform-style infra-as-code or for keeping the catalog in sync with your deployment pipeline.
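For pipeline-driven sync, a common pattern is to diff the desired catalog (from version control) against the current one and apply only the changes. The sketch below shows the diff step alone. It makes no API calls, because the Services API endpoints are not described here, and the function name and data shapes are assumptions.

```python
def plan_sync(desired, current):
    """Compare two {service_name: definition} maps and list what must change."""
    shared = desired.keys() & current.keys()
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(name for name in shared if desired[name] != current[name]),
        "delete": sorted(current.keys() - desired.keys()),
    }


desired = {
    "checkout-api": {"tier": "tier-1", "owner_team": "platform"},
    "user-db": {"tier": "tier-0", "owner_team": "data"},
}
current = {
    "checkout-api": {"tier": "tier-2", "owner_team": "platform"},
    "legacy-cron": {"tier": "tier-3", "owner_team": None},
}
plan = plan_sync(desired, current)
```

Here the plan would create user-db, update checkout-api (its tier changed), and delete legacy-cron. Whether deletions should be applied automatically is a policy choice for your pipeline.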

  • Start with your tier-0 and tier-1 services. Coverage of the long tail can come later. Five well-curated services beat fifty stale ones.
  • Owner team is mandatory in spirit, even though optional in the API. A service without an owner is a service no one fixes.
  • Keep runbooks short and current. A runbook from 2024 that no one’s verified in 2026 is worse than no runbook; it’ll send the on-call down a dead path.
  • Don’t catalogue what’s already in your monitors’ descriptions. If a monitor’s description field already says “Public API: customer-facing checkout endpoint”, duplicating it on the linked service adds maintenance, not value.
  • Use dependencies sparingly. A graph with every plausible edge is unreadable. A graph with the 5-10 most important edges is the one people will actually read.