Stop firefighting: practical observability for Python APIs, workers & jobs
2026-07-16, Conference Hall Complex (S4)

Production has a special talent for turning “seems fine” into “why is everything on fire?” — usually because we’re missing signals. A service restarts and never becomes ready, a background worker silently stops consuming tasks, or a database gets overloaded and latency creeps up until downstream services (or customers) notice it first. These situations aren’t unsolvable — they’re preventable with the right signals in place.

I’ve seen how stressful this gets when a system is already in production but there’s no clear guidance or shared “where to look first” playbook, so every incident starts with guesswork. Over time, we turned those lessons into a lightweight standard that replaces panic mode with a predictable investigation flow.

In this talk, I’ll share a practical, vendor-agnostic observability checklist for a Python setup with three cooperating workloads: an HTTP API, an event-driven worker, and a scheduled daily job. Each workload fails differently, so each requires a different set of signals to stay observable.

We’ll cover what “good enough” looks like for logging, metrics, tracing, and alerting: what to instrument first, what pitfalls to avoid, and how to design alerts that catch problems early without creating noise. You’ll leave with a concrete checklist and a phased rollout order you can apply to your own Python services — without rewriting your system or committing to a specific monitoring vendor.

Takeaways

  • A baseline observability checklist every service should have: health/readiness, logging with consistent context, core metrics, and alert routing
  • Workload-specific signals: what to watch in APIs vs background workers vs scheduled jobs, and why one size doesn’t fit all
  • Structured logging that works in production: a minimal event schema + contextual fields that speed up debugging
  • Must-have alerts that prevent silent failures: service never becomes ready, worker stalls, scheduled job misses its run, sustained latency increase
  • Where tracing adds value: when it’s worth the effort, what “minimal viable tracing” looks like, and what’s optional at the beginning
  • A rollout sequence you can apply incrementally: what to do first, what to add later
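
To make the structured-logging takeaway concrete, here is a minimal sketch of what a "minimal event schema + contextual fields" could look like using only the standard library. It is an illustration under assumptions, not the schema presented in the talk: the field names (ts, level, logger, event, request_id), the logger name orders_api, and the "fields" key passed via extra are all hypothetical.

```python
import contextvars
import json
import logging
import sys

# Hypothetical contextual field: set once per request/task, read by every log call.
request_id_var = contextvars.ContextVar("request_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a small, fixed schema."""

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Per-call fields arrive via extra={"fields": {...}} (an assumed convention).
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def build_logger(stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders_api")  # hypothetical service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    request_id_var.set("req-123")
    log.info("order_created", extra={"fields": {"order_id": 42}})
```

The design choice here is to keep the event name stable and machine-searchable while the context variable carries correlation data, so the same pattern could apply to an API request, a worker task, or a scheduled job run.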

Expected audience expertise: Intermediate

Python backend engineer with an SDET background. I love systems that are both robust and boring to run, and I’m a fan of pragmatic engineering: clear APIs, solid testing, and production-friendly observability.