Agent debugging has a reputation for feeling mystical. One run succeeds, the next fails, and the team debates whether the problem was prompt phrasing, model drift, tool latency, or user ambiguity. In reality, many of these debates persist only because the system does not emit the right evidence. With ordinary web software, tracing requests across services is already table stakes. Agent products need the same discipline, plus a layer that explains how intent, tool choice, and final user-visible outcomes relate to each other.
The challenge is not simply to log more. Over-collection creates its own failure mode: massive volumes of raw conversation content, sensitive data scattered across systems, and operators who still cannot answer the simplest postmortem question. A good observability strategy captures the narrative of a run while keeping secrets out of the data plane wherever possible. That means choosing structured events carefully, redacting early, and resisting the temptation to store every raw artifact forever.
The purpose of agent observability is not to remember everything. It is to reconstruct enough of the execution story to debug, measure, and improve the system responsibly.
Log the execution story, not just the API calls
Traditional service logs tend to emphasize infrastructure events: request received, dependency called, response returned. Those events still matter, but agents add a semantic layer. You need to know what user request kicked off the run, what intent the system appeared to resolve, which tool it chose, what validated arguments were sent, what the tool returned, and how the final answer was framed. Logging only the transport layer leaves a gap precisely where most debugging questions live.
A practical event model often includes conversation_turn_started, model_output_received, tool_call_requested, tool_call_completed, user_response_rendered, and run_closed. Each event should carry a stable run identifier and enough metadata to connect the dots later. When a user says, “the agent told me it updated the ticket but nothing changed,” your logs should let you determine whether the tool was never called, called with the wrong identifier, rejected, retried, or succeeded with a stale response path in the UI.
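The event model above can be sketched as a thin emit helper where every event carries the same stable run identifier. This is a minimal illustration, not a production logger; the field names (`ticket_id`, `latency_ms`, and so on) are hypothetical examples.

```python
import json
import time
import uuid


def new_run_id() -> str:
    """Stable identifier shared by every event in a single run."""
    return f"run_{uuid.uuid4().hex[:12]}"


def emit(event: str, run_id: str, **fields) -> dict:
    """Emit one structured event as a JSON line.

    Printing to stdout stands in for a real pipeline that would ship
    these records to a log collector.
    """
    record = {"event": event, "run_id": run_id, "ts": time.time(), **fields}
    print(json.dumps(record))
    return record


run_id = new_run_id()
emit("conversation_turn_started", run_id, user_message_chars=64)
emit("tool_call_requested", run_id, tool="update_ticket", ticket_id="T-123")
emit("tool_call_completed", run_id, tool="update_ticket",
     status="success", latency_ms=412)
emit("run_closed", run_id, outcome="answered")
```

Because every record shares `run_id`, the "ticket was never updated" complaint becomes a simple query: filter the run, then check which of these events is missing or carries an error status.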
Core fields worth standardizing
- Run and conversation identifiers that span model and tool events.
- Normalized tool names rather than ad hoc strings from each integration.
- Latency and status for every tool execution, including cancellation and timeout states.
- Compact result summaries that make dashboards readable without exposing raw payloads.
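The fields above lend themselves to a small fixed schema rather than free-form dictionaries. The sketch below assumes a hypothetical alias table and type; the names are illustrative, not a prescribed standard.

```python
from dataclasses import dataclass

# Hypothetical mapping from each integration's raw tool strings to
# normalized names used everywhere in logging and dashboards.
TOOL_NAME_ALIASES = {"SendEmailV2": "send_email", "crm.update": "crm_update"}


def normalize_tool_name(raw: str) -> str:
    """Fall back to a lowercased, dot-free form for unknown tools."""
    return TOOL_NAME_ALIASES.get(raw, raw.lower().replace(".", "_"))


@dataclass(frozen=True)
class ToolExecutionEvent:
    run_id: str
    conversation_id: str
    tool: str            # normalized name, never the integration's raw string
    status: str          # "success" | "error" | "timeout" | "cancelled"
    latency_ms: int
    result_summary: str  # compact, payload-free summary for dashboards


evt = ToolExecutionEvent(
    run_id="run_1",
    conversation_id="conv_9",
    tool=normalize_tool_name("SendEmailV2"),
    status="timeout",
    latency_ms=30000,
    result_summary="no response within 30s",
)
```

A frozen dataclass keeps events immutable after creation, which makes it harder for downstream code to quietly mutate records between emission and storage.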
Redact at ingestion, not at query time
Teams often say they will store raw content and rely on downstream query filters to protect secrets. That pattern does not hold up well in practice. By the time data reaches your warehouse, metrics pipeline, or third-party observability tool, it may already have spread beyond the boundaries where you intended to keep it. A better rule is to apply redaction or summarization as close to event creation as possible. Treat raw secrets as things to avoid collecting, not things to clean up later.
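As a sketch, redaction at ingestion can be a transform applied to each event before it leaves the process. The salt, field names, and email pattern below are assumptions for illustration; a real system would govern the salt and patterns centrally.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


def correlation_hash(value: str, salt: str = "per-deployment-salt") -> str:
    """Stable hash that lets events be joined without storing the raw value."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]


def redact_event(event: dict, sensitive_fields=("body", "recipient")) -> dict:
    """Run at event creation, before anything reaches a warehouse or vendor."""
    clean = {}
    for key, value in event.items():
        if key in sensitive_fields and isinstance(value, str):
            clean[f"{key}_hash"] = correlation_hash(value)
            clean[f"{key}_chars"] = len(value)  # shape info, not content
        elif isinstance(value, str) and EMAIL_RE.search(value):
            clean[key] = EMAIL_RE.sub("<email>", value)  # scrub stray addresses
        else:
            clean[key] = value
    return clean


raw = {"event": "tool_call_completed", "tool": "send_email",
       "recipient": "pat@example.com", "body": "Hi Pat, ...",
       "status": "success"}
safe = redact_event(raw)
# safe carries hashes and lengths; the address and body text never persist
```

The hash still supports questions like "did these two runs touch the same recipient?" without any system downstream of ingestion ever holding the address itself.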
This does not mean logs must become useless. You can keep hashed identifiers for correlation, preserve shape information, store field presence, and capture deterministic summaries. For example, it may be enough to know that an email send targeted one external recipient with an approved draft ID, or that a CRM update touched a particular account record, without keeping the full freeform body text in your primary logs.
```
event=tool_call_completed
tool=send_email
recipient_count=1
external_recipient=true
approved_draft_id=dr_98a
status=success
```

Use observability to drive product metrics
Observability is not only for incident response. The same event model can inform product quality metrics if designed well. Once you can trace a run end to end, you can ask better questions: which workflows have the highest retry rates, which tools drive the most abandoned sessions, which approval prompts convert poorly, which categories of tool errors correlate with negative feedback, and where latency spikes sharply enough to change user behavior. These questions matter because agent quality is experienced through workflow outcomes, not isolated model outputs.
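To make the retry-rate question concrete, here is a sketch that derives it from the same event stream used for debugging. The definition of a retry (a repeated `tool_call_requested` for the same tool within one run) is an assumption for illustration; a real pipeline would likely carry an explicit retry flag.

```python
from collections import defaultdict


def retry_rate_by_tool(events):
    """Share of tool call requests that are retries, per tool."""
    attempts = defaultdict(int)
    retries = defaultdict(int)
    seen = set()  # (run_id, tool) pairs already requested once
    for e in events:
        if e["event"] != "tool_call_requested":
            continue
        key = (e["run_id"], e["tool"])
        attempts[e["tool"]] += 1
        if key in seen:
            retries[e["tool"]] += 1
        seen.add(key)
    return {tool: retries[tool] / attempts[tool] for tool in attempts}


events = [
    {"event": "tool_call_requested", "run_id": "r1", "tool": "update_ticket"},
    {"event": "tool_call_requested", "run_id": "r1", "tool": "update_ticket"},
    {"event": "tool_call_requested", "run_id": "r2", "tool": "send_email"},
]
rates = retry_rate_by_tool(events)
# → {"update_ticket": 0.5, "send_email": 0.0}
```

The same pass over events could just as easily compute abandonment after a tool error or latency percentiles per workflow; the point is that one schema feeds both operations and product analytics.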
There is also a strategic benefit here. Teams that separate operational logging from product analytics often discover they instrumented neither sufficiently. If you choose a compact but expressive schema early, you can power dashboards, postmortems, and evaluation loops from the same foundation. That reduces duplication and gives everyone a shared language for discussing reliability.
Know what not to log
Restraint is a feature of good observability. Not every prompt, tool response, attachment, or intermediate chain-of-thought artifact belongs in a durable log. Storing unnecessary raw material increases cost, risk, and confusion. More importantly, it can create organizational habits where people debug by spelunking sensitive content rather than by improving structured signals. That is the observability equivalent of relying on production access because your abstractions failed you.
A useful litmus test is whether a field is necessary to answer a foreseeable operational or product question. If not, it probably does not need to be durable. Keep the narrative, keep the identifiers, keep the outcomes. Drop the raw artifacts unless you have a clear and well-governed reason to retain them. In agent systems, the easiest sensitive data leak is the one you logged because it seemed helpful at the time.
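That litmus test can be enforced mechanically with an allowlist applied before durable storage, so restraint is the default rather than a per-engineer judgment call. The field names below are hypothetical.

```python
# Hypothetical allowlist: fields that answer a foreseeable operational
# or product question. Everything else is dropped before durable storage.
DURABLE_FIELDS = {
    "event", "run_id", "conversation_id", "tool",
    "status", "latency_ms", "result_summary",
}


def to_durable(event: dict) -> dict:
    """Keep the narrative, identifiers, and outcomes; drop raw artifacts."""
    return {k: v for k, v in event.items() if k in DURABLE_FIELDS}


event = {
    "event": "tool_call_completed",
    "run_id": "r1",
    "tool": "send_email",
    "status": "success",
    "latency_ms": 412,
    "raw_response_body": "<large payload>",  # never reaches durable storage
}
durable = to_durable(event)
```

An allowlist fails safe: a newly added field stays out of durable logs until someone argues it answers a real question, which is the opposite of the "log it just in case" default.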
You do not earn observability maturity by collecting the most data. You earn it by collecting the most useful data with the least avoidable exposure.
The best logs make ambiguity smaller
Teams adopt observability because they want certainty. In practice, the real win is smaller ambiguity. A good trace will not answer every question automatically, but it will narrow the space of possible explanations fast enough that humans can act. It tells you whether the problem is in intent resolution, validation, tool execution, latency, UI messaging, or policy. That clarity is what makes the system operable at scale.
If your current logs leave you arguing about what probably happened, that is a design problem, not an unavoidable property of AI. The right schema will not remove complexity, but it will make complexity visible. That is the foundation for every serious reliability program.