What to hash — choosing canonical bytes for each artifact shape

The single most important decision in a Satsignal integration is not which endpoint you call. It's which exact bytes you choose to hash. This page is the cross-cutting reference: for each common artifact shape (files, JSON events, webhook bodies, agent decisions, manifest rows), what counts as the canonical bytes, what to persist alongside the proof, and how to reproduce those bytes at verification time. Read this once before your first production anchor — most verification-time failures trace back here.

Companion docs: API reference · OpenAPI spec · Bundle spec — canonical schemes · Production checklist · Compatibility map

1. Why canonical bytes matter

A proof binds an on-chain commitment to a set of bytes — not to a logical object. The chain anchors sha256(canonical_bytes) (or in sealed mode, HMAC-SHA256(salt, canonical_bytes)). At verification time, the verifier re-derives the canonical bytes from whatever they hold and compares hashes.

If the bytes can't be reproduced bit-for-bit later, the proof fails verification. This is the most common integration mistake: treating the canonical artifact as a logical object ("the event JSON") rather than as the specific byte sequence that was hashed.

Three properties matter for a good canonicalization choice:

Determinism. Given the same logical object, the canonicalization always produces the same bytes. JCS over JSON, RFC 4180 over CSV, and NFC normalization over text are all deterministic schemes.
Reproducibility. Anyone with the logical object and the scheme name can reproduce the bytes. The scheme tag in the canonical doc (text-norm-v1, json-jcs-v1, etc.) names which canonicalizer the verifier must run.
Resistance to brute-force on small payload spaces. For low-entropy payloads (a yes/no vote, a small integer bid), plain SHA-256 can be brute-forced; you want a nonce envelope (standard mode) or an HMAC under a private salt (sealed mode).
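The third property can be made concrete with a short sketch. It assumes a two-value payload space and uses only Python's standard library; `guess_payload`, `CANDIDATES`, and the 32-byte salt length are illustrative choices, not part of any Satsignal API.

```python
import hashlib
import hmac
import secrets

# Hypothetical low-entropy payload space: the attacker knows every candidate.
CANDIDATES = [b'{"vote":"no"}', b'{"vote":"yes"}']

def plain_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def guess_payload(target_hex: str, digest_fn):
    """Return the candidate whose digest matches target_hex, or None."""
    for candidate in CANDIDATES:
        if digest_fn(candidate) == target_hex:
            return candidate
    return None

# Plain SHA-256: the published hash gives the vote away after two guesses.
published = plain_digest(b'{"vote":"yes"}')
recovered = guess_payload(published, plain_digest)

# Sealed-style commitment: HMAC-SHA256 under a private salt (length assumed).
salt = secrets.token_bytes(32)
sealed = hmac.new(salt, b'{"vote":"yes"}', hashlib.sha256).hexdigest()

# Without the salt, hashing the candidates no longer finds a match.
sealed_recovered = guess_payload(sealed, plain_digest)
```

The same enumeration succeeds again for anyone who holds the salt, which is why the salt has to be handled as a secret.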

The Satsignal scheme tags are listed in bundle-v1 §4.3. Each tag is a documented canonicalization. The reference implementation in the live verifier is the byte-level authority.

2. Files

The simplest case. The file's raw bytes — exactly as they exist on disk — are the canonical bytes.

Rule

canonical_bytes = file.read()
sha256_hex = sha256(canonical_bytes).hex()

No transformation. No normalization. Whatever the file's bytes are, those are what get hashed.
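For large files, reading everything into memory is unnecessary. A minimal sketch, assuming only Python's standard library (`sha256_file_hex` and its chunk size are illustrative names and values), that produces the same digest as hashing `file.read()` in one call:

```python
import hashlib

def sha256_file_hex(path, chunk_size=1024 * 1024):
    """Hash a file's raw bytes in chunks; same result as sha256(file.read())."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:  # binary mode: no newline translation
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Opening in binary mode matters: text mode can translate line endings and would hash bytes that are not on disk.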

Store alongside the proof

The original file, bit-identical. If you also store a derived view (e.g. extracted text from a PDF), keep the source file too. The verifier hashes the original.
The file's logical name — useful for human-readable proof pages. Display only; does not enter the hash.
The category — commitment, evidence_bundle, etc. (closed enum; see API reference for the full list).

Reproduce at verification time

import hashlib

with open(artifact_path, "rb") as fh:
    bytes_now = fh.read()
expected = canonical_doc["subject"]["proofs"]["byte_exact"]["hash"]
assert hashlib.sha256(bytes_now).hexdigest() == expected

That's it. No normalization needed; the file IS the canonical artifact.

Edge cases

Trailing newlines on text files. printf "hello" and echo "hello" produce different bytes (the latter has a trailing \n). Whichever your tooling produced is what got hashed; preserve it.
Line endings on cross-platform stores. Git's core.autocrlf silently rewrites line endings on Windows checkout. A file hashed on Linux (LF) won't match its Windows checkout (CRLF). Disable autocrlf for any file that's part of a proof, or apply text-norm-v1 canonicalization at anchor time so line endings are normalized into the canonical form.
Filesystem metadata. xattr, owner, mtime: NOT part of the hash. Only the file's content bytes matter.
Symlinks vs file content. open(symlink, "rb") follows the symlink and reads the target file. The symlink itself isn't hashed.

Tier-1 content schemes (optional)

For text-like content, you can additionally anchor a normalized view alongside the byte-exact hash:

scheme	applies to	normalization
`text-norm-v1`	text/plain, text/markdown, source code, PDF text	NFC + strip BOM + LF + trim trailing whitespace
`json-jcs-v1`	application/json	RFC 8785 JCS
`csv-norm-v1`	text/csv	RFC 4180 + LF + canonical quoting
`pdf-text-v1`	application/pdf	per-page text extraction + text-norm-v1

The verifier re-runs the named canonicalizer over the supplied file and compares. This is useful when "the same content" should verify across format variations (different line endings, different trailing whitespace, etc.) — the byte-exact hash still binds, but the content-canonical hash gives you "logically-equivalent" proof.
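To illustrate why a content-canonical hash can match where the byte-exact hash does not, here is a rough sketch of the normalization steps listed for text-norm-v1. It is not the reference implementation: the step order, UTF-8 input, and per-line trimming of trailing whitespace are assumptions, and `text_norm_sketch` is a hypothetical name. The live verifier remains the byte-level authority.

```python
import hashlib
import unicodedata

def text_norm_sketch(raw: bytes) -> bytes:
    """Illustrative normalizer; NOT the reference text-norm-v1 implementation."""
    text = raw.decode("utf-8")
    if text.startswith("\ufeff"):  # strip a leading BOM
        text = text[1:]
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # LF line endings
    # Assumption: "trim trailing whitespace" is applied per line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    return "\n".join(lines).encode("utf-8")

unix_copy = "caf\u00e9 menu\nline two\n".encode("utf-8")
windows_copy = b"\xef\xbb\xbf" + "cafe\u0301 menu  \r\nline two\r\n".encode("utf-8")

# Same logical content, different raw bytes.
same_content = text_norm_sketch(unix_copy) == text_norm_sketch(windows_copy)
same_bytes = hashlib.sha256(unix_copy).digest() == hashlib.sha256(windows_copy).digest()
```

Here the two copies differ in BOM, Unicode composition, trailing spaces, and line endings, so only the normalized view agrees.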

3. JSON events

Events are the trickiest shape because there are infinitely many byte sequences for "the same" logical JSON object. Whitespace, key order, number encoding, Unicode normalization all vary across serializers.

Rule — use JCS (RFC 8785) wrapped in a nonce envelope

envelope = {"nonce": rand16hex(), "payload": <your event>}
canonical_bytes = jcs(envelope)
sha256_hex = sha256(canonical_bytes).hex()

JCS (JSON Canonicalization Scheme, RFC 8785) is deterministic:

UTF-8 output
NFC-normalize all string values (applied before encoding; RFC 8785 itself leaves string contents unchanged)
Sort all object keys lexicographically (RFC 8785 compares keys as UTF-16 code units)
No insignificant whitespace
Reject NaN, +Inf, -Inf
Canonical number encoding per RFC 8785 §3.2.2

The nonce is fresh per envelope — 16 random bytes, hex-encoded (32 hex chars). It makes the hash unguessable from small payload spaces. Without a nonce, an event with only a few possible payloads (e.g. {"vote": "yes"} vs {"vote": "no"}) can be brute-forced by an adversary holding the hash.
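A minimal sketch of the envelope rule, assuming Python's standard library. `jcs_approx` is a hypothetical helper that approximates JCS with `json.dumps`; it is not a strict RFC 8785 encoder (no NFC step, Python's own number formatting), so treat it as illustrative only.

```python
import hashlib
import json
import secrets

def jcs_approx(obj) -> bytes:
    """Approximate JCS with the standard library (not strict RFC 8785)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False).encode("utf-8")

def make_envelope_bytes(payload) -> bytes:
    envelope = {"nonce": secrets.token_hex(16),  # 16 random bytes -> 32 hex chars
                "payload": payload}
    return jcs_approx(envelope)

canonical_bytes = make_envelope_bytes({"vote": "yes"})
sha256_hex = hashlib.sha256(canonical_bytes).hexdigest()

# Same payload, fresh nonce: a different commitment every time.
other_bytes = make_envelope_bytes({"vote": "yes"})
```

Persist `canonical_bytes` exactly as produced; verification is then a plain re-hash of the stored bytes rather than a re-encoding.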

Store alongside the proof

The full canonical envelope bytes (the JCS-encoded {nonce, payload}). Not just the payload — the verifier needs the nonce + payload + JCS encoding to reproduce.
The original payload (separately if you want; it's derivable from the canonical envelope by JCS-decode-and-strip-nonce).
The scheme tag if you applied one — e.g. content_canonical: json-jcs-v1.

Reproduce at verification time

import hashlib, json

with open("event_canonical.json", "rb") as fh:
    canonical_bytes_now = fh.read()
assert hashlib.sha256(canonical_bytes_now).hexdigest() == proof_sha256

# Optional: JCS-decode and sanity-check the payload.
envelope = json.loads(canonical_bytes_now)
assert "nonce" in envelope
assert "payload" in envelope

Edge cases

json.dumps(d) is NOT JCS. Python's default json.dumps emits insignificant whitespace, doesn't sort keys, and uses Python's default number encoding. Use a JCS library, or apply json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False) and NFC-normalize string values.
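Floats vs ints. Strict RFC 8785 serializes numbers the way ECMAScript does, so 1 and 1.0 both encode as 1. The json.dumps approximation does not: Python emits 1.0 for a float and 1 for an int. Pick one shape and stick to it so the two paths agree.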
null vs missing key. JCS preserves both: {"x": null} and {} have different canonical forms. Your application layer must be consistent about which it produces.
Sorting nested structures. JCS sorts at every level, recursively. Verifiers that only sort top-level produce a different canonical form.
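The edge cases above are easy to observe directly. The sketch below uses a hypothetical `jcs_approx` helper built on `json.dumps`; it approximates JCS and shows Python's behaviour, not that of a strict RFC 8785 library.

```python
import json

def jcs_approx(obj) -> bytes:
    """Approximate JCS with the standard library (not strict RFC 8785)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False).encode("utf-8")

event = {"b": 1, "a": {"z": None, "y": "ok"}}

# Default json.dumps: insertion order and padding after "," and ":".
default_bytes = json.dumps(event).encode("utf-8")
# Sorted at every nesting level, no insignificant whitespace.
canonical_bytes = jcs_approx(event)

# null vs missing key: two distinct canonical forms.
with_null = jcs_approx({"x": None})
without_key = jcs_approx({})

# Python's json module keeps int and float apart in its output.
int_form = jcs_approx({"n": 1})
float_form = jcs_approx({"n": 1.0})
```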

When NOT to nonce-wrap

If your payload is naturally high-entropy (UUID-keyed, long free-text, large structured data), the nonce isn't needed for brute-force resistance — the payload itself is unguessable. But you usually want it anyway as defense in depth; the cost is 16 bytes. Default to wrapping.

4. Webhook bodies

Webhooks have two reasonable canonicalization choices, with a sharp trade-off.

Rule A — raw request body (the default)

canonical_bytes = raw_body
sha256_hex = sha256(canonical_bytes).hex()

The raw bytes that arrived on the wire, in source-byte order, with the signature header that proves the source produced them. No re-serialization, no parsing, no transformation. This is what the Satsignal webhook handler does for source_type: stripe, github, langfuse, and none.

Pros. Trivial to reproduce — the source's outgoing payload is the canonical bytes. The signature header binds these exact bytes; if the body matches the signature, the source produced them.

Cons. Bytes-fragile. JSON-parse-then-JSON-stringify breaks the hash. Any intermediate proxy that "helpfully" reformats the body — adds whitespace, changes key order, re-encodes Unicode — breaks the hash. The body must reach the notary byte-exact from the source.
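The fragility is simple to reproduce. In this sketch the body is a made-up example, and the "proxy" is just a parse followed by a re-serialize with Python's defaults:

```python
import hashlib
import json

raw_body = b'{"id":"evt_xyz","amount":1000}'  # hypothetical bytes as received

anchored_hex = hashlib.sha256(raw_body).hexdigest()

# A "helpful" hop that parses and re-serializes changes the bytes.
reserialized = json.dumps(json.loads(raw_body)).encode("utf-8")
reserialized_hex = hashlib.sha256(reserialized).hexdigest()

# Logically the same event, but the hash no longer matches.
same_event = json.loads(reserialized) == json.loads(raw_body)
hash_survived = reserialized_hex == anchored_hex
```

In practice this means capturing the request body as bytes before any framework parses it.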

Rule B — canonical normalized envelope (opt-in)

envelope = {"source": "stripe",
            "delivery_id": "evt_xyz",
            "received_at_utc": "2026-05-26T14:30:01Z",
            "payload": <parsed JSON of the body>}
canonical_bytes = jcs(envelope)
sha256_hex = sha256(canonical_bytes).hex()

Re-emit the body's parsed form into a canonical envelope. The pros and cons invert. Pro: the proof verifies as long as the logical event content matches, regardless of how the source's serializer formatted the wire bytes. Con: the signature header doesn't bind to your re-emitted bytes, so you've broken the source's non-repudiation chain unless you separately persist the original raw body alongside.
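A sketch of Rule B, assuming the envelope fields shown above and a hypothetical `jcs_approx` helper that approximates JCS with `json.dumps` (not a strict RFC 8785 encoder). It shows two differently formatted wire bodies producing the same envelope hash:

```python
import hashlib
import json

def jcs_approx(obj) -> bytes:
    """Approximate JCS with the standard library (not strict RFC 8785)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False).encode("utf-8")

def rule_b_hash(raw_body: bytes, source: str, delivery_id: str,
                received_at_utc: str) -> str:
    envelope = {"source": source,
                "delivery_id": delivery_id,
                "received_at_utc": received_at_utc,
                "payload": json.loads(raw_body)}
    return hashlib.sha256(jcs_approx(envelope)).hexdigest()

compact = b'{"id":"evt_xyz","amount":1000}'
spaced = b'{ "amount": 1000, "id": "evt_xyz" }'
meta = ("stripe", "evt_xyz", "2026-05-26T14:30:01Z")

hash_compact = rule_b_hash(compact, *meta)
hash_spaced = rule_b_hash(spaced, *meta)
```

Note that `received_at_utc` is part of the hashed envelope, so it has to be persisted; it cannot be regenerated at verification time.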

Trade-off — which to use

concern	Rule A (raw body)	Rule B (canonical envelope)
signature non-repudiation	preserved	broken (unless raw body separately persisted)
robustness to proxies	fragile	robust
reproducibility at verify time	needs exact bytes	needs parsed object + scheme
Satsignal webhook handler	this is what it does	not the default; opt-in

Default to Rule A (raw body). The Satsignal webhook handler records the raw bytes that arrived after signature verification; that's the canonical artifact. If you want the canonical-envelope shape additionally, use the Files path with a client-built envelope, not the webhook path.

Store alongside the proof

For Rule A:

The raw body bytes. This is the load-bearing one. Source-side log retention is usually sufficient; if not, persist on your side.
The signature header (e.g. Stripe-Signature, X-Hub-Signature-256). Proves the source produced the body.
The source's delivery_id if present, useful for cross-referencing with the source's audit log.

For Rule B:

The canonical envelope bytes (the JCS-encoded {source, delivery_id, received_at_utc, payload}).
The original raw body separately, with its signature header.

Reproduce at verification time

For Rule A: fetch the raw body bytes from your store; sha256; compare. The verifier might additionally re-verify the source's signature header against the body (this is the chain of trust to the source, separate from the chain anchor).
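As an illustration of the two checks for Rule A, the sketch below assumes a GitHub-style X-Hub-Signature-256 header (the string "sha256=" followed by the hex HMAC-SHA256 of the raw body under the webhook secret). Other sources, such as Stripe, sign a different string and need their own check. Function names are hypothetical.

```python
import hashlib
import hmac

def github_signature_ok(raw_body: bytes, header_value: str, secret: bytes) -> bool:
    """Check an X-Hub-Signature-256 header against the exact raw body."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value)

def verify_rule_a(raw_body: bytes, header_value: str, secret: bytes,
                  anchored_hex: str) -> bool:
    # Check 1: the stored bytes are the bytes that were anchored.
    hash_ok = hashlib.sha256(raw_body).hexdigest() == anchored_hex
    # Check 2: the source signed those same bytes (trust chain to the source).
    return hash_ok and github_signature_ok(raw_body, header_value, secret)
```

Both checks run over the same stored bytes, which is what keeps the two chains of trust tied together.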

For Rule B: rebuild the canonical envelope from the persisted payload + the persisted metadata; JCS-encode; sha256; compare.

5. Agent decisions

The agent-session pattern produces three categories of artifact: policy snapshots, decision commitments, and the final evidence-bundle manifest. Each has its own canonicalization recipe.

Policy snapshot

Anchored at session start. category: "policy_snapshot".

policy = {
    "schema": "satsignal-policy-snapshot-v1",
    "session_id": session_id,
    "model": "claude-opus-4-7",
    "system_policy_text": SYSTEM_PROMPT,
    "tools": [...],
    "budget": {"max_turns": 50, "max_tokens": 100000},
    "permissions": [...],
    "session_started_at_utc": "2026-05-26T14:30:01Z",
}
canonical_bytes = jcs(policy)

The policy snapshot is high-entropy by nature (system prompts are usually long); no nonce needed.

Decision commitment

Anchored per decision (strong-timing default). category: "commitment".

decision = {
    "nonce": rand16hex(),
    "session_id": session_id,
    "step_index": 3,
    "label": "tool_call: search('foo')",
    "payload": <the decision's content — chosen tool, output, etc.>,
}
canonical_bytes = jcs(decision)

Decisions are often low-entropy (a tool name from a small set, a boolean, a small integer choice). The nonce is required to resist brute-force on the on-chain hash.

Evidence-bundle manifest

Anchored at session end. category: "evidence_bundle", mode: "manifest". The items are the policy snapshot + each decision:

items = [
    {"label": "policy", "sha256_hex": policy_proof_sha},
    {"label": "decision-0", "sha256_hex": decision_0_proof_sha},
    {"label": "decision-1", "sha256_hex": decision_1_proof_sha},
    # ...
]

See Manifest-backed proofs for the per-leaf canonical-bytes rule (sha256(jcs({label, sha256_hex}))).

Store alongside the proof

For each policy snapshot:

The canonical bytes of the snapshot (the JCS-encoded object).
The original logical object if you want — derivable from the canonical bytes.

For each decision:

The canonical envelope bytes (with the nonce).
The payload separately if useful for human reading; the canonical bytes are what the verifier needs.

For the manifest:

The items list, in submission order.
The Merkle root (returned in the anchor response).

agent_anchor.py produces a handoff.json that consolidates all of the above per session. That's the file you ship to the auditor.

Reproduce at verification time

For each artifact, re-JCS-encode the stored object, sha256, compare to the proof's hash. For the manifest, rebuild the Merkle tree from the items list, compare the root.
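The per-artifact check could look like the sketch below. The tuple layout, `failed_labels`, and the fixed nonce are hypothetical and for illustration only; `jcs_approx` approximates JCS with `json.dumps` and is not a strict RFC 8785 encoder. Merkle-root reconstruction is left out because its pairing rules are defined in the manifest spec, not here.

```python
import hashlib
import json

def jcs_approx(obj) -> bytes:
    """Approximate JCS with the standard library (not strict RFC 8785)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False).encode("utf-8")

def failed_labels(artifacts):
    """artifacts: iterable of (label, stored_object, expected_sha256_hex)."""
    failures = []
    for label, stored_object, expected_hex in artifacts:
        actual_hex = hashlib.sha256(jcs_approx(stored_object)).hexdigest()
        if actual_hex != expected_hex:
            failures.append(label)
    return failures

policy = {"schema": "satsignal-policy-snapshot-v1", "session_id": "s-1"}
decision = {"nonce": "ab" * 16, "session_id": "s-1", "step_index": 0,
            "label": "tool_call", "payload": {"tool": "search"}}
stored = [("policy", policy, hashlib.sha256(jcs_approx(policy)).hexdigest()),
          ("decision-0", decision, hashlib.sha256(jcs_approx(decision)).hexdigest())]

clean_run = failed_labels(stored)

# Simulate a stored object drifting after it was anchored.
tampered = dict(decision, step_index=1)
drifted_run = failed_labels([stored[0], ("decision-0", tampered, stored[1][2])])
```

Hashing the persisted canonical bytes directly, where you have them, avoids depending on the encoder behaving identically at verification time.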

Cross-link: Agents covers the operational shape; this section is the byte-level rule.

6. Manifest rows / tables

Tabular data — eval results, ledger rows, scoreboard entries, bid tables, survey responses. Each row is one leaf in a Merkle manifest.

Rule

canonical_row_bytes = canonicalize(row)
sha256_hex = sha256(canonical_row_bytes).hex()
leaf_sha256 = sha256(jcs({label, sha256_hex}))

The leaf is sha256(jcs({label, sha256_hex})) where sha256_hex is the hash of the canonical row bytes. The labels are preserved exactly — character-for-character, byte-for-byte.

Row canonicalization by source shape

source	canonicalization	scheme tag
CSV row	RFC 4180 + LF + canonical quoting	`csv-row-v1`
JSON row	JCS (RFC 8785)	`merkle-row-v1` (JCS-based)
Sealed JSON row (low-entropy)	JCS + HMAC under per-leaf salt	`merkle-row-sealed-v1`
Free-text row	text-norm-v1 (NFC + LF + trim)	`text-line-v1`
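For the CSV case, a rough sketch using Python's `csv` module follows. It assumes that "canonical quoting" means quoting only fields that need it, with doubled inner quotes, and that the row bytes end with a single LF. Both are assumptions; the scheme definition and the live verifier decide the real rules. `csv_row_bytes_sketch` is a hypothetical name.

```python
import csv
import hashlib
import io

def csv_row_bytes_sketch(fields) -> bytes:
    """Encode one row RFC 4180-style with LF; NOT the reference csv-row-v1."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue().encode("utf-8")

row_bytes = csv_row_bytes_sketch(["alice", "12.50", 'said "hi", left'])
row_sha256_hex = hashlib.sha256(row_bytes).hexdigest()
```

Whatever rules apply, persist the encoded row bytes rather than the parsed fields, so the verifier never has to guess the quoting.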

Store alongside the proof

The full items list, in submission order. The Merkle root commits to ORDER.
The canonical bytes of each row — what's actually hashed. For CSV rows: the canonical CSV-encoded row bytes. For JSON rows: the JCS-encoded row bytes. For text rows: the text-norm-v1 canonical bytes.
The labels exactly as submitted.

Reproduce at verification time

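The verifier rebuilds the Merkle tree, starting from the leaf hashes:

import hashlib, json

def jcs_bytes(obj):
    # Approximates JCS for simple {label, sha256_hex} items.
    # ensure_ascii=False keeps non-ASCII labels as UTF-8, not \uXXXX escapes.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def leaf_hash(label, canonical_row_bytes):
    # Leaf = sha256(jcs({label, sha256_hex})), where sha256_hex is the
    # hash of the row's canonical bytes.
    item = {"label": label,
            "sha256_hex": hashlib.sha256(canonical_row_bytes).hexdigest()}
    return hashlib.sha256(jcs_bytes(item)).digest()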

For per-row inclusion proofs, only the target row + the sibling hashes are needed. See Manifest §5 Case B.

Labels can leak metadata

A bid table with labels alice-bid, bob-bid leaks bidder identities even if the bid amounts stay sealed. Use neutral labels (row-7, item-12) where label privacy matters. Threat-model details: /spec-merkle-row §4.

7. Anti-patterns

The most common mistakes. Each of these has shown up in real integration failures.

Don't hash a stringified Python dict

# WRONG — key order is insertion order in CPython 3.7+, which is
# fine for one process but breaks across implementations and
# across calls that rebuild the dict in a different order.
sha256_hex = hashlib.sha256(str(my_dict).encode()).hexdigest()

Use JCS. json.dumps(my_dict, sort_keys=True, separators=(",", ":"), ensure_ascii=False) is a usable approximation; a full canonicalizer handles NFC normalization + number canonicalization on top.

Don't hash a serialized timestamp without canonicalizing format

# WRONG — datetime.now().isoformat() varies across:
#   - timezone-aware vs naive
#   - microsecond precision (3 digits vs 6 vs absent)
#   - "Z" vs "+00:00" suffix
ts = datetime.datetime.now().isoformat()
sha256_hex = hashlib.sha256(ts.encode()).hexdigest()

Canonicalize to a fixed format: ISO 8601, UTC, second-precision, Z suffix. For example: datetime.datetime.now(datetime.timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z"). Or hash a structured envelope that includes the timestamp as a JCS-encoded string with a known format.
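One way to pin the format down is a single helper that every code path uses. A minimal sketch, assuming Python's standard library; `canonical_utc_timestamp` is a hypothetical name, and rejecting naive datetimes is a design choice rather than a documented requirement:

```python
from datetime import datetime, timezone

def canonical_utc_timestamp(moment: datetime) -> str:
    """Format as ISO 8601, UTC, second precision, Z suffix."""
    if moment.tzinfo is None:
        # A naive datetime has no defined offset, so refuse to guess.
        raise ValueError("naive datetime: timezone is ambiguous")
    as_utc = moment.astimezone(timezone.utc).replace(microsecond=0)
    return as_utc.strftime("%Y-%m-%dT%H:%M:%SZ")

now_utc = canonical_utc_timestamp(datetime.now(timezone.utc))
```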

Don't trust upstream "give me your canonical bytes" answers without re-running canonicalization yourself

If a third-party library or service hands you "the canonical bytes of X", verify by re-canonicalizing X yourself and byte-comparing. Canonicalization bugs are subtle: a library that sorts keys at top-level but not recursively, or normalizes most Unicode but not NFC, produces bytes that look right but don't match a strict-conformant verifier.
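The top-level-only sorting bug is a good example of why a byte comparison on realistic inputs matters. In this sketch, `shallow_canonicalize` is a deliberately buggy stand-in for an upstream library, and `recursive_canonicalize` is a `json.dumps`-based approximation of recursive key sorting:

```python
import json

def shallow_canonicalize(obj: dict) -> bytes:
    """Buggy on purpose: sorts only the top-level keys."""
    top_sorted = dict(sorted(obj.items()))
    return json.dumps(top_sorted, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def recursive_canonicalize(obj) -> bytes:
    """Sorts keys at every nesting level."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

flat = {"b": 1, "a": 2}
nested = {"b": 1, "a": {"z": 1, "y": 2}}

# A flat test object hides the bug; a nested one exposes it.
flat_agrees = shallow_canonicalize(flat) == recursive_canonicalize(flat)
nested_agrees = shallow_canonicalize(nested) == recursive_canonicalize(nested)
```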

Don't hash file metadata into the file's canonical bytes

# WRONG — mixing logical metadata (mtime, owner) into what should be
# content bytes. Verifiers won't have the metadata.
with open(path, "rb") as fh:
    canonical = fh.read() + str(os.stat(path).st_mtime).encode()

The file's content bytes are the canonical bytes. Metadata (filename, mtime, owner, size) lives in the canonical doc's top-level fields, NOT in the hash input. The verifier doesn't re-derive metadata from your filesystem.

Don't re-serialize a JSON object through `JSON.parse → JSON.stringify`

// WRONG — the round trip changes the bytes (whitespace, key
// order, number encoding). Hash mismatch on the way back.
const re = JSON.stringify(JSON.parse(body));
const hash = sha256(re);

For webhooks: hash the raw body bytes that arrived. For client-emitted JSON: JCS-encode at production time and persist the canonical bytes; don't re-serialize.

Don't use a "human-readable" timestamp as the only freshness signal

# WRONG — "yesterday" is ambiguous, "Tuesday" is too, "5:00 AM"
# without a timezone is too.
envelope = {"when": "yesterday", "payload": ...}

Use a fresh nonce + a structured UTC timestamp. The nonce is what makes the envelope unique-per-event; the timestamp is human metadata.

Don't hash logs without nonce-wrapping

# WRONG — log lines are often low-entropy ("INFO: request OK").
# An adversary can grind candidate log lines against the hash.
sha256_hex = hashlib.sha256(log_line.encode()).hexdigest()

Wrap in {nonce, log_line} and JCS-encode. The nonce makes each commitment unguessable.

Don't rely on `repr()` for canonical bytes

# WRONG — repr() output varies across Python versions, types,
# and locale.
canonical = repr(my_object).encode()

repr() is for debugging, not canonical encoding. Always pick a well-specified scheme (JCS for JSON, RFC 4180 for CSV, text-norm-v1 for text).

Don't share the master salt over an unauthenticated channel

For sealed-mode proofs: the salt is the bearer secret. Sharing it with an auditor over unencrypted email or in a screenshot is equivalent to handing the proof's verifiability to anyone who intercepts it.

Don't anchor PII as plaintext "to be safe"

If the canonical bytes contain PII (card numbers, full names, addresses), the on-chain anchor leaks nothing (only a hash is public), BUT the bundle itself carries the canonical bytes, and anyone you share the bundle with sees the PII. Use sealed mode to keep the hash itself private, or canonicalize to a redacted form (hash a structured object with only the non-PII fields).
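The redacted-form option can be sketched with an allowlist. The field names, `ALLOWED_FIELDS`, and `redacted_view` are hypothetical; `jcs_approx` approximates JCS with `json.dumps` and is not a strict RFC 8785 encoder.

```python
import hashlib
import json

def jcs_approx(obj) -> bytes:
    """Approximate JCS with the standard library (not strict RFC 8785)."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False).encode("utf-8")

# Hypothetical allowlist: only these fields ever enter the canonical bytes.
ALLOWED_FIELDS = ("order_id", "amount_cents", "currency")

def redacted_view(record: dict) -> dict:
    return {key: record[key] for key in ALLOWED_FIELDS if key in record}

record = {"order_id": "ord_1", "amount_cents": 1250, "currency": "USD",
          "card_number": "4242424242424242", "full_name": "Example Person"}

canonical_bytes = jcs_approx(redacted_view(record))
sha256_hex = hashlib.sha256(canonical_bytes).hexdigest()
```

An allowlist fails closed when new fields appear upstream. A redacted object is often low-entropy, so the nonce envelope from §3 still applies before anchoring.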

Don't change canonicalization mid-stream

If your integration has been hashing rows under json-jcs-v1, don't switch to csv-row-v1 mid-batch. Old proofs verify under their original scheme; the canonical doc records which scheme was used. Pick one and stay there for a given dataset.

8. Custody around the hash

Canonicalizing and anchoring gets you a fingerprint committed to the chain at a block time. That is necessary but not sufficient for an evidence claim: the anchor establishes when an exact byte string existed and that it has not changed since — it says nothing about where those bytes came from, who controlled them, or whether anyone authored them. A hash binds to knowledge of the fingerprint, not to possession or provenance: anyone who can compute (or is handed) the SHA-256 can anchor it. Custody — the chain of control around the bytes — lives in your records, not in the anchor. (Canonical scope: what-it-proves.html.)

So the anchor timestamps; you attribute. Make the timestamp evidentially useful by keeping a custody record alongside the proof:

Provenance metadata next to every proof. Source system, who or what produced the bytes, ingestion time, and the SHA-256 you submitted. The proof pins the fingerprint to a time; this record is what ties that time back to a real-world origin.
Anchor at each custody transition, not just once. If the steps — intake, transformation, review, hand-off — each need to stand on their own, anchor the canonical bytes at each one. The chain then carries a timestamped fingerprint at every transition — a gap in anchoring is a gap in the custody story you can later prove.
Keep the canonical bytes under your control. The chain carries only the 32-byte hash; you carry the bytes (§2–§6). Lose them and you have a timestamp with nothing to compare against — the proof still verifies, but it can no longer be tied to your specific record.
When the origin itself is sensitive, anchor in sealed mode. Sealed (guide-sealed) commits to an HMAC, so the fingerprint — and therefore the existence of this document — stays private while the timestamp stays provable.
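A custody record does not need to be elaborate. The sketch below shows one possible shape, assuming Python's standard library; `CustodyRecord`, its field names, and the sample values are all hypothetical and not a Satsignal format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CustodyRecord:
    source_system: str    # where the bytes came from
    produced_by: str      # who or what produced them
    ingested_at_utc: str  # when they entered your control
    custody_step: str     # intake, transformation, review, hand-off
    sha256_hex: str       # the fingerprint that was submitted

def custody_record(canonical_bytes: bytes, source_system: str, produced_by: str,
                   ingested_at_utc: str, custody_step: str) -> CustodyRecord:
    return CustodyRecord(source_system, produced_by, ingested_at_utc, custody_step,
                         hashlib.sha256(canonical_bytes).hexdigest())

# One record per custody transition, each with its own fingerprint.
intake = custody_record(b"raw export bytes", "billing-db", "nightly-export-job",
                        "2026-05-26T14:30:01Z", "intake")
review = custody_record(b"reviewed export bytes", "billing-db", "reviewer-queue",
                        "2026-05-26T15:02:44Z", "review")
custody_log = [json.dumps(asdict(step), sort_keys=True) for step in (intake, review)]
```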

The honest one-liner for a recipient: Satsignal timestamps the fingerprint; custody and authenticity are claims the issuer makes with their own records, supported by — not replaced by — the anchor.

9. Recap — the checklist

For every artifact you anchor, answer these four:

What's the canonical scheme? (byte_exact raw bytes; text-norm-v1; json-jcs-v1; csv-norm-v1; etc.)
What bytes does the scheme produce? (Run it once; record the bytes; sha256 them.)
What do I persist? (The canonical bytes; the original artifact; the scheme tag; the proof's proof_id / txid.)
How does a verifier reproduce the bytes? (Tooling + scheme name + the original artifact.)

If you can answer all four for every artifact in your integration, you have a verifiable proof system. If any answer is shaky, the integration will fail verification at some point in the future — usually exactly when someone needs the proof to hold up.

Where this fits

For the per-integration recipes, see: Files, Webhooks, Agents, CI/CD, Sealed, Manifest, Headless redaction.
For the canonical-scheme registry + byte-level rules, see bundle-v1 §4.3.
For the full pre-flight checklist before going live (key rotation, broadcast failure recovery, support flow), see Production checklist.
For the canonical/legacy field mapping across endpoints, fields, scopes, CLI flags, and error codes, see Compatibility map.
For the live reference implementation of every scheme above, view source of the production verifier at proof.satsignal.cloud/verify. Where this guide is ambiguous or silent, the verifier is authoritative.

Questions about this specification? Email hello@satsignal.cloud.
