Agentic System Governance: What the Frameworks Don't Tell You

This deep dive covers how we approach compliance as an artefact package — a structured, queryable, tamper-evident record of every AI action.


When a stakeholder asks you to demonstrate that your AI system behaved correctly on a specific transaction on a specific date — what do you hand them?

Some talk about logs. Some mention dashboards. Often, nobody has a precise answer.

At QualitaX, our answer is: a compliance artefact package. A structured, queryable, tamper-evident record of every input the system received, every call made to the AI model, every output produced, and every decision logged — retrieved from a relational database in under five minutes and exportable to PDF.

That is what governance means in practice. Not a policy document. Not a responsible AI framework pinned to a website. A retrievable record of system behaviour that withstands stakeholder (and even regulatory) scrutiny.

This article covers our approach to producing that record: the four governance questions every stakeholder should ask, the three-layer audit trail architecture that answers them, how model safety refusals are handled as structured auditable outcomes rather than errors, and how model versioning is treated as a model risk management event.

The Gap Between Compliance Requirements and Operational Governance

When a stakeholder asks "show me what your AI system did on the 15th of March", none of the following provides an appropriate answer:

  • An AI Ethics Policy PDF on the intranet
  • A responsible AI framework with principles like "fairness" and "transparency"
  • A governance committee that meets quarterly
  • Model cards for the AI systems in use
  • An AI risk register that is not regularly updated

Operational governance requires that:

  • Every AI system action is recorded at the time it occurs
  • Records are structured, queryable, and tamper-evident
  • Every AI model call captures exact inputs and outputs
  • Failure modes are defined and handled — including AI refusals
  • Model versions are pinned and logged with every result
  • A compliance artefact can be produced for any event in under five minutes

The distinction is between governance as documentation and governance as system behaviour. One is a paper exercise. The other is an architectural decision.

This distinction is most acute in regulated industries because the regulatory scrutiny is most intense and the consequences of non-compliance are most severe. But the same principle applies to businesses in any industry with any AI system that influences a consequential decision. The question is always the same: can you reconstruct what happened, why it happened, and what the system knew at the time?

AI systems must be designed with operational governance as an architectural requirement from day one — not a layer added later. Retrofitting audit capability into a running system is extremely challenging: it means rebuilding the very parts that generate auditable events. The data must be captured at the moment it is generated; you usually cannot go back and reconstruct it.

The goal is to produce intelligence reports, not decisions: reports with risk flags and talking points that a human then acts on. Human oversight is built in as an architectural assumption. That boundary between AI-produced information and human decision is not just ethical — it is a governance positioning decision.

The Four Governance Questions Every Stakeholder Should Ask

In AI system evaluations, four governance questions come up every time. The formulations vary, but the underlying questions are constant.

Question 1: "What did the AI system do, and when?"

This is the audit trail question. The answer must be a structured record in a relational database of every job submitted, every task executed, every agent result produced, and every status transition — with millisecond timestamps, indexed for fast retrieval, and retained for a defined period.

QualitaX's Approach: An audit_log table records every state transition as an append-only ledger entry, queryable by job_id, timestamp range, or status. Retention timeframes are configurable (for financial services contexts, seven years).

Question 2: "What information did the AI model use to make its determination?"

This is the inputs question. The answer must include the exact payload sent to the AI model — not a summary, not a log line. The exact JSON structure — every field, every value — that was passed to the model.

QualitaX's Approach: agent_results.data (a JSONB column) stores the complete structured output of every agent in the system, including the full input payload it operated on. reports.raw_synthesis stores the complete, verbatim model response for every synthesis call.

Question 3: "What happens when the AI model refuses or fails to process something?"

This is the failure mode question. Most vendors answer vaguely — "we have retry logic" — because they haven't actually designed the answer. The correct answer names every failure category, the system's response to each, and where the record of that failure lives.

QualitaX's Approach: Safety refusals → non-retryable, recorded in audit_log with the refusal reason, task status set to DEAD_LETTERED. Timeouts → retryable up to three attempts with exponential backoff, then DEAD_LETTERED. Validation failures → retryable with adjusted context, then DEAD_LETTERED on final failure. Every outcome is structured and retrievable.
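This failure taxonomy can be sketched as a small policy table. The names and values below are illustrative, not QualitaX's actual code — the point is that every category has an explicit, machine-readable policy rather than ad-hoc retry logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailurePolicy:
    retryable: bool
    max_attempts: int
    terminal_status: str = "DEAD_LETTERED"

# Illustrative mapping: failure category -> handling policy.
FAILURE_POLICIES = {
    "SAFETY_REFUSAL":     FailurePolicy(retryable=False, max_attempts=1),
    "TIMEOUT":            FailurePolicy(retryable=True,  max_attempts=3),
    "VALIDATION_FAILURE": FailurePolicy(retryable=True,  max_attempts=3),
}

def backoff_seconds(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff schedule: 1s, 2s, 4s, ... per retry attempt."""
    return base * (2 ** (attempt - 1))

def should_retry(category: str, attempt: int) -> bool:
    """A refusal is never retried; other categories retry up to their cap."""
    policy = FAILURE_POLICIES[category]
    return policy.retryable and attempt < policy.max_attempts
```

Because the policy is data rather than scattered `if` statements, it can itself be audited and reviewed in a pull request.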

Question 4: "Which version of the AI model processed this, and how do you manage model changes?"

This is the model risk management question. In traditional quantitative finance, models are subject to validation, change control, and performance monitoring. We apply the same rigour to AI models.

QualitaX's Approach: agent_results.model captures the exact model string at runtime for every result. Model changes require an explicit configuration update, which in the CI/CD pipeline means a pull request, code review, and documented deployment. A model change triggers pre-deployment schema validation against the new model's behaviour.

The Audit Trail: Three Layers of Proof

An audit trail is built across three database layers, each serving a distinct purpose. Together they produce a complete account of what happened, what was found, and what the model said.

Layer 1: The State Machine Record — audit_log

audit_log
─────────────────────────────────────────────
id            UUID          primary key
entity_type   VARCHAR(50)   'job' | 'task'
entity_id     UUID          references jobs.id or tasks.id
from_status   VARCHAR(50)   null on creation
to_status     VARCHAR(50)   the new status
actor         VARCHAR(100)  the service that made the change
metadata      JSONB         additional context — error message, attempt number
created_at    TIMESTAMPTZ   recorded at the moment of transition

Every time a job or task changes status, the service responsible appends a row. The row is never updated. Never deleted. It is an immutable ledger of system behaviour.
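As a sketch, the application-side contract looks like this: the ledger exposes append and read operations and nothing else. This in-memory Python stand-in is illustrative — the real enforcement lives in the database role permissions, not in application code:

```python
import json
from datetime import datetime, timezone

class AuditLog:
    """Append-only stand-in for the audit_log table. The real guarantee
    comes from an INSERT-only database role; this sketch mirrors the
    same contract at the application layer."""

    def __init__(self):
        self._rows = []

    def append(self, entity_type, entity_id, from_status, to_status,
               actor, metadata=None):
        # Rows are only ever appended, never updated or deleted.
        self._rows.append({
            "entity_type": entity_type,
            "entity_id": entity_id,
            "from_status": from_status,   # None on creation
            "to_status": to_status,
            "actor": actor,
            "metadata": json.dumps(metadata or {}),
            "created_at": datetime.now(timezone.utc).isoformat(),
        })

    def for_entity(self, entity_id):
        """All transitions for one job or task, in insertion order."""
        return [r for r in self._rows if r["entity_id"] == entity_id]
```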

A complete job run produces an audit trail that reads as a precise narrative. Example below:

09:14:02.341  job created        → PENDING         actor: gateway
09:14:02.891  job                → DECOMPOSED       actor: orchestrator
09:14:03.012  task created       → PENDING          actor: orchestrator  (×4)
09:14:03.045  tasks              → RUNNING          actor: web-research, news, tech, regulatory
09:14:34.221  task web_research  → COMPLETE         actor: web-research
09:14:41.887  task news          → COMPLETE         actor: news-intelligence
09:14:52.003  task tech          → COMPLETE         actor: tech-fingerprint
09:15:01.441  task regulatory    → COMPLETE         actor: regulatory
09:15:01.512  job                → SYNTHESISING     actor: orchestrator
09:15:23.774  job                → COMPLETE         actor: synthesis

From this record alone — without touching application logs — it's possible to answer: what happened, in what order, at what time, and which service was responsible for each transition.

Tamper-evidence. Three mechanisms prevent retrospective modification. First, the database role used by the application has INSERT permission on audit_log — not UPDATE, not DELETE. The application literally cannot overwrite a record. Second, database (e.g. PostgreSQL) row-level security can further restrict even UPDATE to the table owner if the threat model requires it. Third, in high-assurance environments, the database's commit timestamps can be combined with a hash chain — each row's hash includes the previous row's hash — making retrospective modification detectable. For most businesses, the INSERT-only permission should be sufficient.
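The hash-chain mechanism can be sketched in a few lines. This is a minimal illustration of the idea — each row's hash incorporates the previous row's hash, so any retrospective edit invalidates every later hash:

```python
import hashlib
import json

def row_hash(row: dict, prev_hash: str) -> str:
    """Hash of this row's content, chained to the previous row's hash."""
    payload = json.dumps(row, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def build_chain(rows):
    """Compute the running hash chain over an ordered list of ledger rows."""
    prev = "0" * 64  # genesis value for the first row
    hashes = []
    for row in rows:
        prev = row_hash(row, prev)
        hashes.append(prev)
    return hashes

def verify_chain(rows, hashes) -> bool:
    """Recompute the chain from the rows; a mismatch anywhere means a
    row was modified after the fact."""
    return build_chain(rows) == hashes
```

Verification is cheap enough to run on every artefact export, so tampering is detected at exactly the moment the record is relied upon.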

Layer 2: The Agent Result Record — agent_results

agent_results
─────────────────────────────────────────────
id            UUID
task_id       UUID          FK → tasks.id
agent_type    VARCHAR(50)
success       BOOLEAN
data          JSONB         the full structured output of the agent
tokens_used   INTEGER       input + output tokens consumed
latency_ms    FLOAT         wall-clock time for the agent run
model         VARCHAR(100)  exact model string: "claude-sonnet-4-6"
created_at    TIMESTAMPTZ

The data field contains the entire structured output of the agent — every signal extracted, every URL fetched, every finding recorded. This is not a summary. It's the complete agent output, stored as queryable JSONB.

The model field is captured at runtime, not from configuration. Every agent result is permanently linked to the exact model version that produced it.
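A sketch of runtime capture, assuming an SDK response object that carries `model` and `usage` attributes (as the Anthropic Messages API response does); the `AgentResult` shape mirrors the schema above and is illustrative:

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    task_id: str
    success: bool
    data: dict
    tokens_used: int
    latency_ms: float
    model: str  # taken from the live response, never from settings

def record_result(task_id, response, data, latency_ms):
    """Build the agent_results row from the API response itself.
    response.model reflects what actually served the call, even if
    configuration pointed at an alias the provider resolved."""
    return AgentResult(
        task_id=task_id,
        success=True,
        data=data,
        tokens_used=response.usage.input_tokens + response.usage.output_tokens,
        latency_ms=latency_ms,
        model=response.model,
    )
```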

Layer 3: The Synthesis Record — reports

reports
─────────────────────────────────────────────
id               UUID
job_id           UUID          FK → jobs.id
subject          VARCHAR(500)
report_json      JSONB         structured report (all signal fields)
report_markdown  TEXT          human-readable formatted report
raw_synthesis    TEXT          the exact Claude response verbatim
tokens_total     INTEGER       cumulative across all agents + synthesis
generated_at     TIMESTAMPTZ

raw_synthesis is the complete, unprocessed text that the model returned for the synthesis call — not the structured fields extracted from it, but the full response. This is the AI's reasoning chain, preserved permanently. If the structured output is ever questioned — if a risk flag seems wrong, if a talking point seems misattributed — the raw reasoning that produced it is retrievable.

Layer 1 answers: what happened. Layer 2 answers: what was found. Layer 3 answers: what the model said.

Handling Model Safety Refusals in a Production Pipeline

This is one of the most important failure modes to design explicitly, and one that is often overlooked.

Consider the scenario: a pipeline is running, a task is in progress, and the model returns a safety refusal. The pipeline has no designed handling. What happens next?

Without designed handling: the agent throws an exception, the exception is logged as an unstructured error, the job eventually produces an incomplete report, and the missing component is only noticed when someone reads the output and asks why a section is empty. If a stakeholder later asks why a section is absent from a specific report — there is no structured record. Only raw application logs, if they were retained, and a best-effort reconstruction.

This is not acceptable.

The Critical Design Decision: Refusals Are Not Errors

A safety refusal is not an error. It is a deterministic, non-retryable outcome.

If you treat a refusal as a retryable error and submit the same request three times, you will get three refusals. You will have used tokens on three failed attempts. And you will have lost the distinction between "the system was unavailable" and "the system made a deliberate decision not to process this."

As an example, the correct handling approach could be:

Claude returns safety refusal
    ↓
Catch stop_reason == "end_turn" with no tool_use block
    + verify refusal language in response
    ↓
Classify as: SAFETY_REFUSAL
    ↓
Write to audit_log:
    entity_type: task
    entity_id:   task.id
    from_status: RUNNING
    to_status:   DEAD_LETTERED
    actor:       [agent-service-name]
    metadata: {
        "reason":        "SAFETY_REFUSAL",
        "refusal_text":  "[first 500 chars of Claude's response]",
        "attempt":       1
    }
    ↓
Update task.status = DEAD_LETTERED
    ↓
Do NOT retry
    ↓
Publish to qdap:dlq with reason: SAFETY_REFUSAL
    ↓
Job continues with available results
    (synthesis proceeds with N-1 agent results, flagging the gap)
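The flow above can be sketched as a handler. Everything here is illustrative: the marker-phrase check is a crude stand-in for real refusal detection, and `audit_append` stands in for the append-only ledger write:

```python
def handle_model_response(task, response_text, stop_reason, audit_append):
    """Classify a model response and record the outcome. A refusal is a
    deliberate decision, not a transient fault: it is dead-lettered
    immediately and never retried."""
    # Crude illustrative detection; production systems need more than
    # keyword matching.
    refusal_markers = ("i can't help", "i cannot assist", "i'm not able to")
    is_refusal = stop_reason == "end_turn" and any(
        marker in response_text.lower() for marker in refusal_markers
    )
    if is_refusal:
        audit_append(
            entity_type="task",
            entity_id=task["id"],
            from_status="RUNNING",
            to_status="DEAD_LETTERED",
            actor=task["agent"],
            metadata={
                "reason": "SAFETY_REFUSAL",
                "refusal_text": response_text[:500],
                "attempt": task["attempt"],
            },
        )
        task["status"] = "DEAD_LETTERED"
        return "SAFETY_REFUSAL"
    return "OK"
```

The essential property is that both branches end in a structured, queryable record — there is no path through this function that produces only an unstructured log line.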

If a stakeholder later asks why a particular report is missing a component — the answer is in audit_log. Pull the record for that task: to_status of DEAD_LETTERED, metadata containing reason: SAFETY_REFUSAL and the first 500 characters of Claude's refusal explanation. Exact timestamp, exact responsible service, exact reason.

The contrast with undesigned handling:

Designed handling: DEAD_LETTERED with reason in audit_log. Synthesis proceeds, gap flagged in report. Retrievable in under 5 minutes. Answer: precise.

Undesigned handling: Exception logged, no structured record. Silent incomplete report. Reconstructible only from raw logs, if at all. Answer: approximate.

Validation Failures: A Different Category

A validation failure — where the model's output passes safety but does not conform to the expected Pydantic schema — is a retryable condition, but with a specific retry strategy.

On a validation failure, the same prompt is not resubmitted unchanged. The prompt is modified: the validation error is appended as a correction signal — "your previous response did not match the required schema, specifically: [error message]. Please respond with a valid [schema name]." This is prompt self-correction at the system level.

If validation fails on attempt three, the task is dead-lettered with reason VALIDATION_FAILURE and the full validation error recorded in the metadata. Not an error that disappears into a log — a structured, retrievable record.
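The self-correction loop can be sketched as follows. This is stdlib-only and illustrative — a real system would validate against a Pydantic model rather than this minimal required-keys check:

```python
def validate(output, required_keys):
    """Return an error message, or None if the output conforms.
    Stands in for Pydantic schema validation."""
    missing = set(required_keys) - set(output.keys())
    return f"missing fields: {sorted(missing)}" if missing else None

def run_with_self_correction(call_model, base_prompt, required_keys,
                             max_attempts=3):
    """Retry validation failures with the error appended as a
    correction signal; dead-letter after the final attempt."""
    prompt = base_prompt
    for attempt in range(1, max_attempts + 1):
        output = call_model(prompt)
        error = validate(output, required_keys)
        if error is None:
            return {"status": "COMPLETE", "output": output,
                    "attempt": attempt}
        # Resubmit with the validation error as an explicit correction.
        prompt = (base_prompt +
                  f"\nYour previous response did not match the required "
                  f"schema, specifically: {error}. Please respond with a "
                  f"valid object containing {sorted(required_keys)}.")
    return {"status": "DEAD_LETTERED", "reason": "VALIDATION_FAILURE",
            "error": error, "attempt": max_attempts}
```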

Model Versioning and the Explainability Problem

This is where AI governance diverges most sharply from traditional software governance.

Unlike a code change, which is visible, diffable, and reversible, an AI model migration introduces behavioural changes that cannot be fully characterised by reading a changelog. When a provider deprecates a model version and a firm migrates to a successor, the 'change' is not a line of code — it's a distribution shift in outputs that must be empirically assessed. Organisations should treat model migration with great rigour: test against historical inputs, compare outputs, document the assessment, and obtain sign-off before deployment.

One approach addresses this with three mechanisms.

Mechanism 1: Model string captured in every agent result. For instance, agent_results.model = "claude-sonnet-4-6" is captured at runtime, not from configuration. Every report is permanently linked to the model version that produced it. In five years, if someone asks which version of the model produced a specific intelligence report — the answer is in the database.

Mechanism 2: Model version is a configuration change, not a code change. The model string lives in a dedicated settings entry. Changing it requires an explicit configuration update — a pull request, a code review, and a documented deployment. The model change is auditable through Git history.

Mechanism 3: Model change triggers output schema validation. When the model configuration is changed, a pre-deployment validation step runs a suite of fixture inputs through the new model and validates that all outputs conform to the expected Pydantic schemas. If a new model version changes its tool_use response format — which does occasionally happen — this validation catches it before it reaches production.
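A sketch of that pre-deployment gate, assuming a fixture set and a per-agent output validator (all names here are illustrative):

```python
def validate_model_change(call_new_model, fixtures, validate_output):
    """Run every fixture input through the candidate model and collect
    schema violations. An empty failure list is the deployment gate."""
    failures = []
    for name, payload in fixtures.items():
        output = call_new_model(payload)
        error = validate_output(output)  # None means the output conforms
        if error:
            failures.append({"fixture": name, "error": error})
    return failures

def gate_passes(failures) -> bool:
    return len(failures) == 0
```

Wired into CI, the gate turns "the new model behaves differently" from a production surprise into a failed build.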

Prompt Versioning: The Other Half of the Reproducibility Problem

Model versioning solves one half of the reproducibility problem. Prompt versioning solves the other.

Consider the scenario: a stakeholder asks why two intelligence reports on the same subject, produced three months apart, seem so different. The model string in agent_results.model is identical in both cases — the same model version processed both. But the system prompt for the agent was updated between the two runs. The input changed. The model did not.

Without prompt versioning, there is no structured answer to that question. With it, the answer is precise.

The core problem is that a prompt is not just configuration — it is a material input to every AI model call. Changing a prompt is functionally equivalent to changing a model: it shifts the distribution of outputs for identical inputs. In a production AI system, prompts evolve continuously — refinements to extraction logic, adjustments to output schema instructions, corrections to tone or scope. Each of these changes affects reproducibility, and each should be treated as a governed event.

At runtime, each agent resolves its active prompt version, records the prompt_version_id against its result, and proceeds. The prompt text itself is never reconstructed from memory — it is retrieved from prompt_versions by ID.

The prompt hash is the critical tamper-evidence mechanism. Before an agent constructs its model call, it computes a SHA-256 hash of the prompt template it is about to use and compares it against prompt_versions.prompt_hash for the active version. A mismatch — indicating the in-memory prompt has diverged from the registered version — halts the agent and raises an alert. This prevents silent prompt drift: the condition where a prompt is modified in a running system without a corresponding version record.
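The check itself is a few lines. In this illustrative sketch, `registered_hash` stands in for the value read from the prompt_versions row for the active version:

```python
import hashlib

class PromptDriftError(RuntimeError):
    """The in-memory prompt diverged from the registered version."""

def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode("utf-8")).hexdigest()

def verify_prompt(template: str, registered_hash: str) -> str:
    """Halt before the model call if the prompt has silently drifted."""
    actual = prompt_hash(template)
    if actual != registered_hash:
        raise PromptDriftError(
            f"prompt hash {actual[:12]} does not match registered "
            f"{registered_hash[:12]}"
        )
    return actual
```

Note that even a single appended whitespace character changes the hash, which is exactly the sensitivity the drift check requires.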

Prompt changes as governed events. Consistent with the model versioning approach, a prompt change should require an explicit, auditable deployment. The mechanism is the same: the prompt text lives in a versioned configuration store, changing it requires a pull request and code review, and the deployment writes a new row to prompt_versions with a change_summary authored by the engineer making the change. The Git history of the prompt configuration and the prompt_versions table together constitute the change record.

What this enables. When a stakeholder asks why two reports on the same subject differ — and the model version is identical — the answer is now precise: the agent's prompt was updated from version 1.2.0 to 1.3.0 between the two runs, the change introduced tighter scope constraints, and the deployment was reviewed and merged on a specific date. That is an auditable answer. Without prompt versioning, the same question has no structured response.

The Explainability Boundary

There is an important distinction between what can be explained and what cannot, and precision on this boundary is itself a governance competency.

Can explain the inputs: what data each agent gathered, what payload was sent to the model, what schema it was asked to conform to.

Can explain the outputs: what the model returned, verbatim.

Cannot fully explain the internal reasoning of the transformer model that produced the output — that is the fundamental opacity of large language models, and no vendor claiming otherwise is being honest.

What an AI system can do — and does — is ensure that the system's decision surface is narrow. The model is not making free-form decisions. It is filling in a structured schema from a defined set of audited inputs. The explainability problem is most severe when an AI system has wide latitude. When constrained to produce structured outputs from audited inputs, the explainability gap narrows considerably.

The honest answer to a stakeholder, regulator or compliance officer: "Here is what the model was given. Here is what it returned. Here is the schema it was asked to conform to. The internal weights that produced the output are proprietary to the model provider." That is precise and credible. It is more credible than a vague claim of full explainability.

Producing a Compliance Artefact: The SQL

A compliance artefact package for a specific job — what you hand to a stakeholder, compliance officer or regulator — is the output of three SQL queries, each runnable in seconds against the schema described above.

Example - Query 1: The job record and complete audit trail.

-- The job itself
SELECT
    j.id,
    j.status,
    j.input,
    j.created_at,
    j.updated_at
FROM jobs j
WHERE j.id = '[job-uuid]';

-- Complete audit trail for this job and all its tasks
SELECT
    al.entity_type,
    al.entity_id,
    al.from_status,
    al.to_status,
    al.actor,
    al.metadata,
    al.created_at
FROM audit_log al
WHERE al.entity_id = '[job-uuid]'
   OR al.entity_id IN (
       SELECT id FROM tasks WHERE job_id = '[job-uuid]'
   )
ORDER BY al.created_at ASC;

This returns every state transition for the job and all its tasks, in chronological order. It is a complete narrative of what happened.

Example - Query 2: What each agent found.

SELECT
    t.task_type,
    t.status,
    t.attempt,
    ar.success,
    ar.data,
    ar.tokens_used,
    ar.latency_ms,
    ar.model,
    ar.created_at
FROM tasks t
LEFT JOIN agent_results ar ON ar.task_id = t.id
WHERE t.job_id = '[job-uuid]'
ORDER BY t.task_type;

For each agent: what status it reached, how many attempts it took, what it found, how many tokens it consumed, how long it took, and which model version processed it.

Example - Query 3: The final report and raw AI reasoning.

SELECT
    r.subject,
    r.report_json,
    r.raw_synthesis,
    r.tokens_total,
    r.generated_at
FROM reports r
WHERE r.job_id = '[job-uuid]';

report_json contains the structured fields. raw_synthesis contains the complete model response, verbatim.

These three queries together produce the complete record of what the system did for a specific job: what was requested, what each agent found, which model produced each result, whether anything was refused or failed, what the model said in synthesis, and the full timeline of every state transition. That is the compliance artefact package. Not a dashboard. Not a log file. A structured, queryable, retrievable record.
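Assembling the three query results into a package is then mechanical. A sketch, with field names following the schemas above and the PDF export step left abstract:

```python
import json
from datetime import datetime, timezone

def assemble_artefact(job_row, audit_rows, agent_rows, report_row):
    """Bundle the outputs of the three queries into one exportable
    package. audit_rows must already be chronological (Query 1's
    ORDER BY created_at ASC)."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "job": job_row,
        "timeline": [
            {"at": r["created_at"],
             "entity": r["entity_type"],
             "transition": f'{r["from_status"]} -> {r["to_status"]}',
             "actor": r["actor"]}
            for r in audit_rows
        ],
        "agent_results": agent_rows,
        "report": report_row,
    }

def export_json(package) -> str:
    # PDF rendering would sit on top of this canonical JSON form.
    return json.dumps(package, indent=2, default=str)
```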

The Principle That Runs Through All of This

Looking across the governance architecture — the three-layer audit trail, the refusal handling, the model versioning, the explainability boundary — a single principle connects them.

Governance is a property of how the AI system is built. Every auditable event must be captured at the moment it occurs. Every failure mode must have a named, structured outcome — not an unhandled exception. Every AI model call must be linked, permanently, to the exact model version that produced it. The compliance artefact package is only possible because these decisions were made at the architecture stage.

That is what operational AI governance looks like. An architectural decision.

QualitaX builds production-grade agentic systems for B2B businesses. If your AI system cannot yet produce a compliance artefact in under five minutes — or you want to build one that can — get in touch.