What to hash — choosing canonical bytes for each artifact shape

The single most important decision in a Satsignal integration is not which endpoint you call. It's which exact bytes you choose to hash. This page is the cross-cutting reference: for each common artifact shape (files, JSON events, webhook bodies, agent decisions, manifest rows), what counts as the canonical bytes, what to persist alongside the proof, and how to reproduce those bytes at verification time. Read this once before your first production anchor — most verification-time failures trace back here.

Companion docs: API reference · OpenAPI spec · Bundle spec — canonical schemes · Production checklist · Compatibility map

1. Why canonical bytes matter

A proof binds an on-chain commitment to a set of bytes — not to a logical object. The chain anchors sha256(canonical_bytes) (or in sealed mode, HMAC-SHA256(salt, canonical_bytes)). At verification time, the verifier re-derives the canonical bytes from whatever they hold and compares hashes.

If the bytes can't be reproduced bit-for-bit later, the proof fails verification. This is the most common integration mistake: treating the canonical artifact as a logical object ("the event JSON") rather than as the specific byte sequence that was hashed.
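The two commitment forms named above can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the function names commit_standard and commit_sealed are hypothetical and are not part of any Satsignal SDK.

```python
import hashlib
import hmac

def commit_standard(canonical_bytes: bytes) -> str:
    """Standard mode: plain SHA-256 over the canonical bytes."""
    return hashlib.sha256(canonical_bytes).hexdigest()

def commit_sealed(salt: bytes, canonical_bytes: bytes) -> str:
    """Sealed mode: HMAC-SHA256 keyed by a private salt."""
    return hmac.new(salt, canonical_bytes, hashlib.sha256).hexdigest()

canonical_bytes = b'{"vote":"yes"}'

# Anyone holding the bytes can recompute the standard commitment.
standard = commit_standard(canonical_bytes)

# The sealed commitment can only be recomputed by someone who also holds
# the salt; a different salt yields a different value.
sealed = commit_sealed(b"example-salt-not-for-production", canonical_bytes)
other = commit_sealed(b"a-different-salt", canonical_bytes)
print(sealed == other)  # False
```

In both modes the input is a byte string, which is why the rest of this page is about pinning down exactly which bytes go in.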

Three properties matter for a good canonicalization choice:

  1. Determinism. Given the same logical object, the canonicalization always produces the same bytes. JCS over JSON, RFC 4180 over CSV, and NFC normalization over text are all deterministic schemes.
  2. Reproducibility. Anyone with the logical object and the scheme name can reproduce the bytes. The scheme tag in the canonical doc (text-norm-v1, json-jcs-v1, etc.) names which canonicalizer the verifier must run.
  3. Resistance to brute-force on small payload spaces. For low-entropy payloads (a yes/no vote, a small integer bid), plain SHA-256 can be brute-forced; you want a nonce envelope (standard mode) or an HMAC under a private salt (sealed mode).
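To make the third property concrete, here is a small sketch of how an adversary could recover a low-entropy payload from a plain SHA-256 commitment, and why the same search fails once a random nonce is part of the hashed bytes. The payloads and variable names are illustrative only.

```python
import hashlib
import secrets

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# A commitment published without a nonce.
published = sha256_hex(b'{"vote":"yes"}')

# The adversary enumerates the entire (tiny) payload space.
candidates = [b'{"vote":"yes"}', b'{"vote":"no"}']
recovered = [c for c in candidates if sha256_hex(c) == published]
print(recovered)  # [b'{"vote":"yes"}']

# With a fresh 16-byte nonce inside the hashed bytes, the same candidate
# list no longer matches; the adversary would also have to guess the nonce.
nonce = secrets.token_hex(16)
wrapped = ('{"nonce":"%s","payload":{"vote":"yes"}}' % nonce).encode("utf-8")
published_wrapped = sha256_hex(wrapped)
recovered_wrapped = [c for c in candidates if sha256_hex(c) == published_wrapped]
print(recovered_wrapped)  # []
```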

The Satsignal scheme tags are listed in bundle-v1 §4.3. Each tag is a documented canonicalization. The reference implementation in the live verifier is the byte-level authority.

2. Files

The simplest case. The file's raw bytes — exactly as they exist on disk — are the canonical bytes.

Rule

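import hashlib
with open(artifact_path, "rb") as fh:  # binary mode, no newline translation
    canonical_bytes = fh.read()
sha256_hex = hashlib.sha256(canonical_bytes).hexdigest()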

No transformation. No normalization. Whatever the file's bytes are, those are what get hashed.

Store alongside the proof

Reproduce at verification time

import hashlib

with open(artifact_path, "rb") as fh:
    bytes_now = fh.read()
expected = canonical_doc["subject"]["proofs"]["byte_exact"]["hash"]
assert hashlib.sha256(bytes_now).hexdigest() == expected

That's it. No normalization needed; the file IS the canonical artifact.
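For large artifacts, reading the whole file into memory is unnecessary. The sketch below hashes a file in chunks and produces the same digest as hashing the full byte string at once. The helper name file_sha256_hex and the chunk size are illustrative choices, not part of the Satsignal tooling.

```python
import hashlib

def file_sha256_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes, read in chunks."""
    digest = hashlib.sha256()
    # Binary mode matters: text mode may translate newlines or decode
    # characters, which would change the bytes being hashed.
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because SHA-256 is computed incrementally, the chunk size affects only memory use, never the resulting hash.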

Edge cases

Tier-1 content schemes (optional)

For text-like content, you can additionally anchor a normalized view alongside the byte-exact hash:

scheme | applies to | normalization
text-norm-v1 | text/plain, text/markdown, source code, PDF text | NFC + strip BOM + LF + trim trailing whitespace
json-jcs-v1 | application/json | RFC 8785 JCS
csv-norm-v1 | text/csv | RFC 4180 + LF + canonical quoting
pdf-text-v1 | application/pdf | per-page text extraction + text-norm-v1

The verifier re-runs the named canonicalizer over the supplied file and compares. This is useful when "the same content" should verify across format variations (different line endings, different trailing whitespace, etc.) — the byte-exact hash still binds, but the content-canonical hash gives you "logically-equivalent" proof.
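As a rough illustration of what a text normalization in the spirit of text-norm-v1 does, the sketch below applies the four steps from the table. It is an approximation under stated assumptions: the input is assumed to be UTF-8, "trailing whitespace" is assumed to mean spaces and tabs at the end of each line, and the step order is a guess. The bundle spec and the live verifier define the real rules; do not use this sketch to produce hashes you intend to anchor.

```python
import hashlib
import unicodedata

def text_norm_sketch(raw: bytes) -> bytes:
    """Approximate text normalization: BOM strip, LF, NFC, trailing trim."""
    text = raw.decode("utf-8-sig")  # utf-8-sig drops one leading BOM if present
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # LF line endings
    text = unicodedata.normalize("NFC", text)  # composed Unicode form
    # Assumption: trim spaces and tabs at the end of every line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    return "\n".join(lines).encode("utf-8")

# Same content, different formatting: BOM + CRLF + decomposed "e" + accent
# on one side, plain LF + composed character on the other.
windows_style = b"\xef\xbb\xbfcafe\xcc\x81\r\nline two  \r\n"
unix_style = b"caf\xc3\xa9 \nline two\n"

same_content = text_norm_sketch(windows_style) == text_norm_sketch(unix_style)
same_bytes = hashlib.sha256(windows_style).digest() == hashlib.sha256(unix_style).digest()
print(same_content, same_bytes)  # True False
```

The byte-exact hashes differ while the normalized views agree, which is the distinction between the two proofs.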

3. JSON events

Events are the trickiest shape because there are infinitely many byte sequences for "the same" logical JSON object. Whitespace, key order, number encoding, Unicode normalization all vary across serializers.

Rule — use JCS (RFC 8785) wrapped in a nonce envelope

envelope = {"nonce": rand16hex(), "payload": <your event>}
canonical_bytes = jcs(envelope)
sha256_hex = sha256(canonical_bytes).hex()

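JCS (JSON Canonicalization Scheme, RFC 8785) is deterministic: it sorts object keys, emits no insignificant whitespace, and fixes a single serialization for numbers and strings, so the same logical object always yields the same bytes.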

The nonce is fresh per envelope — 16 random bytes, hex-encoded (32 hex chars). It makes the hash unguessable from small payload spaces. Without a nonce, an event with only a few possible payloads (e.g. {"vote": "yes"} vs {"vote": "no"}) can be brute-forced by an adversary holding the hash.
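A minimal sketch of building the nonce envelope in Python follows. The helper jcs_approx is a stand-in built on json.dumps; it matches RFC 8785 only for simple payloads (strings, integers, booleans, nested objects and arrays) and is not a conformant JCS implementation. The names build_envelope and jcs_approx are hypothetical.

```python
import hashlib
import json
import secrets

def jcs_approx(obj) -> bytes:
    # Approximation of RFC 8785 for simple payloads only; use a real JCS
    # library in production (floats and some escapes differ).
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def build_envelope(payload):
    """Wrap a payload with a fresh nonce; return (canonical_bytes, sha256_hex)."""
    envelope = {"nonce": secrets.token_hex(16),  # 16 random bytes, 32 hex chars
                "payload": payload}
    canonical_bytes = jcs_approx(envelope)
    return canonical_bytes, hashlib.sha256(canonical_bytes).hexdigest()

canonical_bytes, sha256_hex = build_envelope({"vote": "yes"})
```

The nonce cannot be regenerated later, so the canonical bytes (or at least the nonce) have to be persisted at production time; rebuilding the envelope from the payload alone would produce a different hash.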

Store alongside the proof

Reproduce at verification time

import hashlib, json

with open("event_canonical.json", "rb") as fh:
    canonical_bytes_now = fh.read()
assert hashlib.sha256(canonical_bytes_now).hexdigest() == proof_sha256

# Optional: JCS-decode and sanity-check the payload.
envelope = json.loads(canonical_bytes_now)
assert "nonce" in envelope
assert "payload" in envelope

Edge cases

When NOT to nonce-wrap

If your payload is naturally high-entropy (UUID-keyed, long free-text, large structured data), the nonce isn't needed for brute-force resistance, because the payload itself is unguessable. You usually want it anyway as defense in depth; the cost is 16 random bytes (32 hex characters) per envelope. Default to wrapping.

4. Webhook bodies

Webhooks have two reasonable canonicalization choices, with a sharp trade-off.

Rule A — raw request body (the default)

canonical_bytes = raw_body
sha256_hex = sha256(canonical_bytes).hex()

The raw bytes that arrived on the wire, in source-byte order, with the signature header that proves the source produced them. No re-serialization, no parsing, no transformation. This is what the Satsignal webhook handler does for source_type: stripe, github, langfuse, and none.

Pros. Trivial to reproduce — the source's outgoing payload is the canonical bytes. The signature header binds these exact bytes; if the body matches the signature, the source produced them.

Cons. Bytes-fragile. JSON-parse-then-JSON-stringify breaks the hash. Any intermediate proxy that "helpfully" reformats the body — adds whitespace, changes key order, re-encodes Unicode — breaks the hash. The body must reach the notary byte-exact from the source.
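The fragility is easy to demonstrate. In this sketch a body is parsed and re-serialized, as a well-meaning middleware might do, and the resulting hash no longer matches the hash of the bytes that arrived. The sample body is made up for illustration.

```python
import hashlib
import json

# Bytes as they arrived on the wire (note the spaces after ":" and ",").
raw_body = b'{"id": "evt_xyz", "amount": 1.0}'

# A parse-then-serialize round trip keeps the logical content but not the bytes.
reserialized = json.dumps(json.loads(raw_body), separators=(",", ":")).encode("utf-8")

raw_hash = hashlib.sha256(raw_body).hexdigest()
reserialized_hash = hashlib.sha256(reserialized).hexdigest()

print(json.loads(raw_body) == json.loads(reserialized))  # True: same logical event
print(raw_hash == reserialized_hash)                     # False: different bytes
```

Under Rule A, capture the body as bytes before any framework parses it, and hash and store that buffer unchanged.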

Rule B — canonical normalized envelope (opt-in)

envelope = {"source": "stripe",
            "delivery_id": "evt_xyz",
            "received_at_utc": "2026-05-26T14:30:01Z",
            "payload": <parsed JSON of the body>}
canonical_bytes = jcs(envelope)
sha256_hex = sha256(canonical_bytes).hex()

Re-emit the body's parsed form into a canonical envelope. Pros and cons invert: the proof verifies as long as the logical event content matches, regardless of how the source's serializer formatted the wire bytes. Cons: the signature header doesn't bind to your re-emitted bytes — you've broken the source's non-repudiation chain unless you separately persist the original raw body alongside.

Trade-off — which to use

concern | Rule A (raw body) | Rule B (canonical envelope)
signature non-repudiation | preserved | broken (unless raw body separately persisted)
robustness to proxies | fragile | robust
reproducibility at verify time | needs exact bytes | needs parsed object + scheme
Satsignal webhook handler | this is what it does | not the default; opt-in

Default to Rule A (raw body). The Satsignal webhook handler records the raw bytes that arrived after signature verification; that's the canonical artifact. If you want the canonical-envelope shape additionally, use the Files path with a client-built envelope, not the webhook path.

Store alongside the proof

For Rule A:

For Rule B:

Reproduce at verification time

For Rule A: fetch the raw body bytes from your store; sha256; compare. The verifier might additionally re-verify the source's signature header against the body (this is the chain of trust to the source, separate from the chain anchor).

For Rule B: rebuild the canonical envelope from the persisted payload + the persisted metadata; JCS-encode; sha256; compare.
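A sketch of the Rule B rebuild, assuming the envelope fields shown earlier in this section and a json.dumps-based approximation of JCS (adequate for the integer and string values used here, not a conformant RFC 8785 implementation). The function names are hypothetical.

```python
import hashlib
import json

def jcs_approx(obj) -> bytes:
    # Approximates RFC 8785 for simple values; use a real JCS library in production.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def rule_b_hash(source: str, delivery_id: str, received_at_utc: str,
                raw_body: bytes) -> str:
    """Rebuild the canonical envelope from persisted metadata and hash it."""
    envelope = {
        "source": source,
        "delivery_id": delivery_id,
        "received_at_utc": received_at_utc,
        "payload": json.loads(raw_body),
    }
    return hashlib.sha256(jcs_approx(envelope)).hexdigest()

# Two wire formattings of the same logical event.
wire_a = b'{"id": "evt_xyz", "amount": 100}'
wire_b = b'{"amount":100,"id":"evt_xyz"}'
meta = ("stripe", "evt_xyz", "2026-05-26T14:30:01Z")

hash_a = rule_b_hash(*meta, wire_a)
hash_b = rule_b_hash(*meta, wire_b)
print(hash_a == hash_b)  # True
```

The metadata fields are part of the hashed bytes, so received_at_utc and delivery_id must be read back from storage exactly as recorded; a timestamp recomputed at verification time would not match.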

5. Agent decisions

The agent-session pattern produces three categories of artifact: policy snapshots, decision commitments, and the final evidence-bundle manifest. Each has its own canonicalization recipe.

Policy snapshot

Anchored at session start. category: "policy_snapshot".

policy = {
    "schema": "satsignal-policy-snapshot-v1",
    "session_id": session_id,
    "model": "claude-opus-4-7",
    "system_policy_text": SYSTEM_PROMPT,
    "tools": [...],
    "budget": {"max_turns": 50, "max_tokens": 100000},
    "permissions": [...],
    "session_started_at_utc": "2026-05-26T14:30:01Z",
}
canonical_bytes = jcs(policy)

The policy snapshot is high-entropy by nature (system prompts are usually long); no nonce needed.

Decision commitment

Anchored per decision (strong-timing default). category: "commitment".

decision = {
    "nonce": rand16hex(),
    "session_id": session_id,
    "step_index": 3,
    "label": "tool_call: search('foo')",
    "payload": <the decision's content — chosen tool, output, etc.>,
}
canonical_bytes = jcs(decision)

Decisions are often low-entropy (a tool name from a small set, a boolean, a small integer choice). The nonce is required to resist brute-force on the on-chain hash.

Evidence-bundle manifest

Anchored at session end. category: "evidence_bundle", mode: "manifest". The items are the policy snapshot + each decision:

items = [
    {"label": "policy", "sha256_hex": policy_proof_sha},
    {"label": "decision-0", "sha256_hex": decision_0_proof_sha},
    {"label": "decision-1", "sha256_hex": decision_1_proof_sha},
    # ...
]

See Manifest-backed proofs for the per-leaf canonical-bytes rule (sha256(jcs({label, sha256_hex}))).

Store alongside the proof

For each policy snapshot:

For each decision:

For the manifest:

agent_anchor.py produces a handoff.json that consolidates all of the above per session. That's the file you ship to the auditor.

Reproduce at verification time

For each artifact, re-JCS-encode the stored object, sha256, compare to the proof's hash. For the manifest, rebuild the Merkle tree from the items list, compare the root.

Cross-link: Agents covers the operational shape; this section is the byte-level rule.

6. Manifest rows / tables

Tabular data — eval results, ledger rows, scoreboard entries, bid tables, survey responses. Each row is one leaf in a Merkle manifest.

Rule

canonical_row_bytes = canonicalize(row)
sha256_hex = sha256(canonical_row_bytes).hex()
leaf_sha256 = sha256(jcs({label, sha256_hex}))

The leaf is sha256(jcs({label, sha256_hex})) where sha256_hex is the hash of the canonical row bytes. The labels are preserved exactly — character-for-character, byte-for-byte.

Row canonicalization by source shape

source | canonicalization | scheme tag
CSV row | RFC 4180 + LF + canonical quoting | csv-row-v1
JSON row | JCS (RFC 8785) | merkle-row-v1 (JCS-based)
Sealed JSON row (low-entropy) | JCS + HMAC under per-leaf salt | merkle-row-sealed-v1
Free-text row | text-norm-v1 (NFC + LF + trim) | text-line-v1

Store alongside the proof

Reproduce at verification time

The verifier rebuilds the Merkle tree:

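import hashlib, json

def jcs_bytes(obj):
    # Approximates RFC 8785 for simple items (string labels, hex digests).
    # ensure_ascii=False keeps non-ASCII labels as raw UTF-8, as JCS does.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def leaf_hash(label, canonical_row_bytes):
    # Leaf = sha256(jcs({label, sha256_hex})), per the rule above.
    item = {"label": label,
            "sha256_hex": hashlib.sha256(canonical_row_bytes).hexdigest()}
    return hashlib.sha256(jcs_bytes(item)).digest()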

For per-row inclusion proofs, only the target row + the sibling hashes are needed. See Manifest §5 Case B.
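To show the general shape of a per-row inclusion check, here is a sketch that walks from one leaf up to a root using sibling hashes. The interior-node rule used here, parent = sha256(left || right) with no domain separation, and the proof encoding as (sibling, sibling_is_left) pairs are assumptions made for illustration; the Manifest spec defines the actual tree construction, and this sketch should not be expected to reproduce real Satsignal roots.

```python
import hashlib
import json

def jcs_bytes(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def leaf_hash(label: str, canonical_row_bytes: bytes) -> bytes:
    item = {"label": label,
            "sha256_hex": hashlib.sha256(canonical_row_bytes).hexdigest()}
    return hashlib.sha256(jcs_bytes(item)).digest()

def root_from_proof(leaf: bytes, proof) -> bytes:
    """Fold sibling hashes into the leaf. ASSUMED rule: sha256(left || right)."""
    node = leaf
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node

# Two-leaf tree built under the same assumed rule.
leaf_a = leaf_hash("row-0", b'{"score":1}')
leaf_b = leaf_hash("row-1", b'{"score":2}')
root = hashlib.sha256(leaf_a + leaf_b).digest()

print(root_from_proof(leaf_a, [(leaf_b, False)]) == root)  # True
print(root_from_proof(leaf_b, [(leaf_a, True)]) == root)   # True
```

The verifier needs only the target row's canonical bytes, its label, and the sibling hashes; the other rows' contents stay undisclosed.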

Labels can leak metadata

A bid table with labels alice-bid, bob-bid leaks bidder identities even if the bid amounts stay sealed. Use neutral labels (row-7, item-12) where label privacy matters. Threat-model details: /spec-merkle-row §4.

7. Anti-patterns

The most common mistakes. Each of these has shown up in real integration failures.

Don't hash a stringified Python dict

# WRONG: str() follows dict insertion order (guaranteed since
# Python 3.7), so equal dicts built in a different order give
# different bytes. The output is also Python repr syntax, not JSON.
sha256_hex = hashlib.sha256(str(my_dict).encode()).hexdigest()

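Use JCS. json.dumps(my_dict, sort_keys=True, separators=(",", ":"), ensure_ascii=False) is a usable approximation for simple payloads; a real JCS library adds RFC 8785's number serialization, string escaping, and UTF-16 code-unit key ordering on top. Note that RFC 8785 preserves string data as-is and does not apply Unicode normalization.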
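A short demonstration of the failure mode: two dicts that compare equal produce different str() output when their keys were inserted in a different order, while a sorted-key serialization agrees. The canon helper is a json.dumps-based approximation, not a full JCS implementation.

```python
import hashlib
import json

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # same content, different insertion order

def canon(obj) -> bytes:
    # Sorted keys, no insignificant whitespace; approximates JCS for simple values.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

print(a == b)                # True: logically equal
print(str(a) == str(b))      # False: str() follows insertion order
print(canon(a) == canon(b))  # True: canonical bytes agree

hash_a = hashlib.sha256(canon(a)).hexdigest()
hash_b = hashlib.sha256(canon(b)).hexdigest()
```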

Don't hash a serialized timestamp without canonicalizing format

# WRONG — datetime.now().isoformat() varies across:
#   - timezone-aware vs naive
#   - microsecond precision (3 digits vs 6 vs absent)
#   - "Z" vs "+00:00" suffix
ts = datetime.datetime.now().isoformat()
sha256_hex = hashlib.sha256(ts.encode()).hexdigest()

Canonicalize to a fixed format: ISO 8601, UTC, second precision, Z suffix, for example datetime.now(UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z"). Or hash a structured envelope that includes the timestamp as a JCS-encoded string with a known format.
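One way to enforce the fixed format is a small helper that refuses naive datetimes and always emits second-precision UTC with a Z suffix. The name canonical_utc is hypothetical, and truncating (rather than rounding) sub-second precision is a choice made for this sketch.

```python
from datetime import datetime, timedelta, timezone

def canonical_utc(ts: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    if ts.tzinfo is None:
        # A naive datetime has no defined offset, so its UTC value is ambiguous.
        raise ValueError("naive datetime: attach a timezone before canonicalizing")
    utc = ts.astimezone(timezone.utc).replace(microsecond=0)  # truncate sub-seconds
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ")

# The same instant expressed in two offsets canonicalizes to one string.
local = datetime(2026, 5, 26, 16, 30, 1, 123456, tzinfo=timezone(timedelta(hours=2)))
as_utc = datetime(2026, 5, 26, 14, 30, 1, tzinfo=timezone.utc)
print(canonical_utc(local))   # 2026-05-26T14:30:01Z
print(canonical_utc(as_utc))  # 2026-05-26T14:30:01Z
```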

Don't trust upstream "give me your canonical bytes" answers without re-running canonicalization yourself

If a third-party library or service hands you "the canonical bytes of X", verify by re-canonicalizing X yourself and byte-comparing. Canonicalization bugs are subtle: a library that sorts keys at top-level but not recursively, or that normalizes Unicode to a form other than NFC, produces bytes that look right but don't match a strict-conformant verifier.

Don't hash file metadata into the file's canonical bytes

# WRONG: mixes logical metadata (mtime, owner) into what should be
# content bytes. Verifiers won't have the metadata.
with open(path, "rb") as fh:
    canonical = fh.read() + str(os.stat(path).st_mtime).encode()

The file's content bytes are the canonical bytes. Metadata (filename, mtime, owner, size) lives in the canonical doc's top-level fields, NOT in the hash input. The verifier doesn't re-derive metadata from your filesystem.

Don't re-serialize a JSON object through JSON.parse → JSON.stringify

// WRONG — the round trip changes the bytes (whitespace, key
// order, number encoding). Hash mismatch on the way back.
const re = JSON.stringify(JSON.parse(body));
const hash = sha256(re);

For webhooks: hash the raw body bytes that arrived. For client-emitted JSON: JCS-encode at production time and persist the canonical bytes; don't re-serialize.

Don't use a "human-readable" timestamp as the only freshness signal

# WRONG — "yesterday" is ambiguous, "Tuesday" is too, "5:00 AM"
# without a timezone is too.
envelope = {"when": "yesterday", "payload": ...}

Use a fresh nonce + a structured UTC timestamp. The nonce is what makes the envelope unique-per-event; the timestamp is human metadata.

Don't hash logs without nonce-wrapping

# WRONG — log lines are often low-entropy ("INFO: request OK").
# An adversary can grind candidate log lines against the hash.
sha256_hex = hashlib.sha256(log_line.encode()).hexdigest()

Wrap in {nonce, log_line} and JCS-encode. The nonce makes each commitment unguessable.

Don't rely on repr() for canonical bytes

# WRONG: repr() is not a stable format. It can change across Python
# versions and library releases, and the default object repr embeds
# a memory address.
canonical = repr(my_object).encode()

repr() is for debugging, not canonical encoding. Always pick a well-specified scheme (JCS for JSON, RFC 4180 for CSV, text-norm-v1 for text).

Don't share the master salt over an unauthenticated channel

For sealed-mode proofs: the salt is the bearer secret. Sharing it with an auditor over unencrypted email or in a screenshot is equivalent to sharing the proof's verifiability with anyone who intercepts.

Don't anchor PII as plaintext "to be safe"

If the canonical bytes contain PII (card numbers, full names, addresses), the on-chain anchor leaks nothing (only a hash is public), BUT the bundle itself carries the canonical bytes, and anyone you share the bundle with sees the PII. Use sealed mode to keep the hash itself private, or canonicalize to a redacted form (hash a structured object with only the non-PII fields).
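A sketch of the redacted-form option: select an allow-list of non-PII fields, canonicalize only those, and hash the result. The record, field names, and helper names are invented for illustration, and jcs_approx is a json.dumps-based stand-in for a real JCS library.

```python
import hashlib
import json

def jcs_approx(obj) -> bytes:
    # Approximates RFC 8785 for simple values; use a real JCS library in production.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def redacted_view(record: dict, allowed_fields: set) -> dict:
    """Keep only allow-listed fields; everything else never enters the bundle."""
    return {key: record[key] for key in sorted(allowed_fields) if key in record}

record = {
    "order_id": "ord_123",
    "amount_cents": 4200,
    "card_number": "4242424242424242",  # PII: must not reach the canonical bytes
    "full_name": "Example Person",      # PII
}

public_view = redacted_view(record, {"order_id", "amount_cents"})
canonical_bytes = jcs_approx(public_view)
sha256_hex = hashlib.sha256(canonical_bytes).hexdigest()
print(canonical_bytes)  # b'{"amount_cents":4200,"order_id":"ord_123"}'
```

An allow-list is safer than a deny-list here, since a field added to the record later stays out of the canonical bytes by default. A redacted view can also be low-entropy, in which case the nonce-envelope guidance from the JSON events section still applies.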

Don't change canonicalization mid-stream

If your integration has been hashing rows under json-jcs-v1, don't switch to csv-row-v1 mid-batch. Old proofs verify under their original scheme; the canonical doc records which scheme was used. Pick one and stay there for a given dataset.

8. Custody around the hash

Canonicalizing and anchoring gets you a fingerprint committed to the chain at a block time. That is necessary but not sufficient for an evidence claim: the anchor establishes when an exact byte string existed and that it has not changed since — it says nothing about where those bytes came from, who controlled them, or whether anyone authored them. A hash binds to knowledge of the fingerprint, not to possession or provenance: anyone who can compute (or is handed) the SHA-256 can anchor it. Custody — the chain of control around the bytes — lives in your records, not in the anchor. (Canonical scope: what-it-proves.html.)

So the anchor timestamps; you attribute. Make the timestamp evidentially useful by keeping a custody record alongside the proof:

The honest one-liner for a recipient: Satsignal timestamps the fingerprint; custody and authenticity are claims the issuer makes with their own records, supported by — not replaced by — the anchor.

9. Recap — the checklist

For every artifact you anchor, answer these four:

  1. What's the canonical scheme? (byte_exact raw bytes; text-norm-v1; json-jcs-v1; csv-norm-v1; etc.)
  2. What bytes does the scheme produce? (Run it once; record the bytes; sha256 them.)
  3. What do I persist? (The canonical bytes; the original artifact; the scheme tag; the proof's proof_id / txid.)
  4. How does a verifier reproduce the bytes? (Tooling + scheme name + the original artifact.)

If you can answer all four for every artifact in your integration, you have a verifiable proof system. If any answer is shaky, the integration will fail verification at some point in the future — usually exactly when someone needs the proof to hold up.
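One way to keep the four answers together is a small record per artifact that stores the scheme tag, the canonical bytes, the hash, and the proof reference, and can re-check itself. The AnchorRecord class and its field names are hypothetical; they do not correspond to a Satsignal data structure, and the proof_id value below is a placeholder.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class AnchorRecord:
    scheme: str             # answer 1: which canonical scheme was used
    canonical_bytes: bytes  # answer 2: the exact bytes the scheme produced
    sha256_hex: str         # the hash that was anchored
    proof_id: str           # answer 3: reference to the proof (placeholder value)

    def verify(self) -> bool:
        """Answer 4: reproduce the hash from the persisted bytes and compare."""
        return hashlib.sha256(self.canonical_bytes).hexdigest() == self.sha256_hex

data = b'{"nonce":"00112233445566778899aabbccddeeff","payload":{"vote":"yes"}}'
record = AnchorRecord(
    scheme="json-jcs-v1",
    canonical_bytes=data,
    sha256_hex=hashlib.sha256(data).hexdigest(),
    proof_id="example-proof-id",
)
print(record.verify())  # True
```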

Where this fits

Questions about this specification? Email hello@satsignal.cloud.