csv-column-v1 — selective-disclosure profile for CSV column leaves

Authority. This profile defines a NEW forever-contract leaf rule — one Merkle leaf per CSV column. Unlike its sibling csv-row-v1 (which merely writes down a rule on-chain anchors already commit), csv-column-v1 is a net-new scheme: an anchor must be built to emit it, a verifier to recompute it, and a redact tool to disclose from it. Profile literals are forever-contracts; disclosure binds to the anchor's committed chunk_merkle.

Freeze status. The standard rule (§§2–4, §6–8) is the freeze-ready scope of this spec: every byte-level decision below is locked and every fixture value is computed by the reference implementation (validated against the frozen csv-row-v1 corpus — see §8). The sealed rule (§5b) is now FROZEN: its per-leaf HKDF info is the bare "chunk/" || u32_be(j) shared with the three shipped sealed profiles, with its own frozen fixture corpus csv_column_v1_native_sealed/. The literal is csv-column-v1; it is in _VALID_MERKLE_SCHEMES (standard) and _NATIVE_SEALED_LITERALS (sealed).

Versioning. The profile literal is the hyphenated "csv-column-v1" — the exact string a CSV column anchor stamps into subject.proofs.chunk_merkle.scheme. The shape evolves additively as v1.x (new fixtures, clarifying prose); the segmentation / canonicalization / leaf-hash / merkle rules below are fixed forever for this literal once an anchor commits under it. A bug in any of these rules can never be patched in place; the only remedy is a new sibling literal (csv-column-v2) that compatible verifiers support in parallel. This profile covers both modes that share this literal: standard (algo: "sha256", unsalted; §§2–4) and sealed (algo: "merkle-hmac-sha256", per-leaf HKDF salts; §5b, frozen). The two modes share the §2 canonicalization, §3 column segmentation, and §6 duplicate-last merkle byte-for-byte — they differ only in the per-leaf hash. The mode a verifier applies is selected by the carrier chunk_merkle.algo, never by the literal alone.

Status: frozen (standard + sealed).

Audience: anchorers who anchor a CSV by column at time T1 and later produce a validated redacted copy revealing specific columns under disclosure-v1.md; verifier authors who must recompute a column leaf from (value) alone and walk it into the merkle root the original anchor committed.

Goal: pin one canonical byte-level rule for "given a CSV file, what is column N, and what bytes does its leaf hash cover?", to the byte, with adversarial fixtures, forever.

1. Why this exists

csv-row-v1 answers "reveal these rows, withhold the rest" — the right shape for invoices, event logs, ledgers (one row = one record). It cannot answer the orthogonal question: "reveal these columns, withhold the rest." That is the shape for column-projection disclosures — prove the date and amount columns of a ledger while withholding counterparty; prove a status column across every row while withholding the identifiers; share a dataset's non-PII columns while sealing the PII ones. A column anchor at time T1 commits a csv-column-v1 chunk_merkle root over per-column leaves; selective column disclosure at any later time T2 reveals a subset of those committed leaves and proves them into that existing root — no re-anchoring, and (standard mode) no salt keyfile.

The scope is narrow on purpose: a profile this small canonicalizes to the byte and is exhaustively fixture-tested.

2. Inputs and canonicalization

The anchorer feeds raw bytes — the source CSV as it exists on disk. The cell-level canonicalization is byte-for-byte identical to csv-row-v1 §2 and is reproduced here so this profile is self-contained. It MUST match the anchor path's CSV parser exactly; any divergence forks the leaf rule.

Decision (forever): Encoding is UTF-8, decoded leniently (an invalid byte sequence becomes U+FFFD, matching the anchor's file.text()). The canonical cell strings are re-encoded to UTF-8 for hashing.

Decision (forever): Strip ONE leading BOM. If the first decoded code point is U+FEFF, remove it (only the single leading one) before parsing.

Decision (forever): RFC-4180 quote-aware parse. A field opens a quoted region on a double quote ("); inside the region, "" is a literal " and a lone " closes the region. A comma outside quotes ends a field; an unquoted LF, CR, or CRLF ends a row (a CRLF is consumed as one break). A comma, LF, or CR inside a quoted field is content.

Decision (forever): No trailing-newline empty row. After the parse loop a final row is appended only if the last field or row buffer is non-empty. A file ending …,3\n and one ending …,3 produce the same rows.

Decision (forever): Minimal re-quote per cell (csvField). A parsed cell is re-emitted wrapped in " with internal " doubled to "" iff it contains any of " , LF CR (the predicate /[",\n\r]/); otherwise it is emitted bare. This is the load-bearing primitive for §3's column join: because a cell containing LF is quoted, the \n that joins a column's cells (§3) can never be confused with an LF inside a cell value. A cell quoted in the source only to wrap an empty string ("") parses to the empty string and re-emits bare.

Consequence carried over from csv-row-v1: a source cell "" (quoted empty) and a source cell that is bare-empty canonicalize to the same bytes (the empty string). Column leaves built from them collide. This is the documented zero-entropy property (§5), not a defect.
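The §2 decisions above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the names parse_csv and csv_field are chosen here for readability, Python's errors="replace" decoding is assumed (not verified) to match the anchor's file.text() on every malformed sequence, and the handling of blank lines and of a quote appearing mid-field is this sketch's own guess where §2 is silent.

```python
import re
from typing import List

# The re-quote predicate from the csvField decision: " , LF CR
_NEEDS_QUOTE = re.compile(r'[",\n\r]')


def csv_field(cell: str) -> str:
    """Minimal re-quote: wrap in quotes (doubling inner quotes) only when needed."""
    if _NEEDS_QUOTE.search(cell):
        return '"' + cell.replace('"', '""') + '"'
    return cell


def parse_csv(raw: bytes) -> List[List[str]]:
    """Lenient UTF-8 decode, strip one leading BOM, quote-aware parse."""
    text = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    if text.startswith("\ufeff"):
        text = text[1:]  # only the single leading BOM is removed
    rows: List[List[str]] = []
    row: List[str] = []
    field: List[str] = []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # "" inside quotes is a literal "
                    i += 1
                else:
                    in_quotes = False  # a lone " closes the region
            else:
                field.append(ch)  # comma, LF, CR inside quotes are content
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            row.append("".join(field))
            field = []
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # CRLF is consumed as one row break
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
        else:
            field.append(ch)
        i += 1
    # No trailing-newline empty row: append only if a buffer is non-empty.
    if field or row:
        row.append("".join(field))
        rows.append(row)
    return rows
```

With this sketch, a file ending in a newline and the same file without it parse to the same rows, and a quoted cell such as "Smith, John" parses to one cell that csv_field re-emits with its quotes.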

3. Leaf extraction — ONE LEAF PER COLUMN, by header INDEX, header EXCLUDED from values

THE COLUMN RULE — read carefully. A leaf is a column, identified by its 0-based position in the header row, not by its header name. The header row defines how many columns exist and supplies each column's display name, but the header cells are excluded from the leaf values (only the data rows, rows 1..N, contribute bytes to a column leaf). A verifier or redaction tool that identifies columns by name, re-orders them, or folds the header cell into a leaf value computes a different leaf-set and a different root and binds to no real anchor.

Decision (forever): The column set and count are pinned by the HEADER row (row 0). ncols = len(header_row) after §2 parsing. leaf_count = ncols. Column index j runs 0 .. ncols-1, left to right.

Decision (forever): A column is identified by header INDEX, never by header name. The header name (header_row[j]) is a display/ordering handle surfaced to humans; it is NOT part of any leaf preimage (§4). Identifying by index survives a header rename and is total under duplicate or empty header names. (A by-name sibling, csv-col-name-v1, is explicitly out of scope — §9.)

Decision (forever): A column leaf's VALUE is the column's DATA cells, header excluded, each csvField-re-quoted, joined by LF (0x0A), with no trailing newline. For column j:

canonical_column_j = "\n".join( csvField(cell(row_i, j))  for i in 1 .. N )

where cell(row_i, j) is data row i's field at position j (see the ragged rule below), csvField is the §2 minimal re-quote, and N is the data-row count. The header cell cell(row_0, j) is excluded. No trailing \n is appended. The \n join is unambiguous precisely because §2 quotes any cell containing \n (so an in-cell newline is wrapped in quotes and is byte-distinct from the join separator — see fixture CC4).

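Decision (forever): Ragged rows are reconciled against the header width. A data row with fewer fields than the header is padded on the right with empty-string cells up to ncols (fixture CC5). A data row with more fields than the header is invalid input and is rejected with invalid_csv_ragged_over; no leaf-set is produced (fixture CC6).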

Decision (forever): leaf_id is c<NNN> with NNN = the 0-based column index zero-padded to three decimal digits. Examples: c000 (first column), c001, c019. leaf_id is a display/ordering handle only and is NOT part of any hash preimage (§4). Three digits cover indices 000 through 999, i.e. up to 1000 columns.

Decision (forever): Leaf ordering is header index order, zero-indexed. Leaf 0 is the first (leftmost) column; the merkle leaf-set order is this order, unchanged. The verifier does NOT re-sort. Header index order is the only column ordering derivable from the raw bytes without anchorer intent.

Decision (forever): Validity caps. Empty input (zero rows after BOM strip) is invalid (invalid_csv_empty). A header-only file (zero data rows) is invalid (invalid_csv_header_only) — every column leaf would be empty and carry no commitment, so a valid csv-column-v1 source has ≥ 1 data row (≥ 2 file rows). A file with > 1000 columns is invalid (invalid_csv_too_many_columns). leaf_count = ncols always satisfies 1 ≤ leaf_count ≤ 1000.
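The §3 decisions can be sketched as one function over already-parsed rows. This is an illustrative sketch under stated assumptions: canonical_columns and its use of ValueError are names invented here, the input is assumed to be the row list produced by a §2-conformant parser, and the ragged-row behavior follows fixtures CC5 (short rows padded with empty strings) and CC6 (over-wide rows rejected).

```python
import re
from typing import List, Tuple

_NEEDS_QUOTE = re.compile(r'[",\n\r]')


def csv_field(cell: str) -> str:
    """The minimal re-quote from section 2 (repeated so this sketch is self-contained)."""
    if _NEEDS_QUOTE.search(cell):
        return '"' + cell.replace('"', '""') + '"'
    return cell


def canonical_columns(rows: List[List[str]]) -> List[Tuple[str, str]]:
    """Return [(leaf_id, canonical_column_j)] in header index order."""
    if not rows:
        raise ValueError("invalid_csv_empty")
    header, data = rows[0], rows[1:]
    if not data:
        raise ValueError("invalid_csv_header_only")
    ncols = len(header)  # the header row pins the column count
    if ncols > 1000:
        raise ValueError("invalid_csv_too_many_columns")
    for r in data:
        if len(r) > ncols:
            raise ValueError("invalid_csv_ragged_over")  # CC6: over-wide row
    leaves = []
    for j in range(ncols):
        # CC5: a short row contributes an empty string for a missing cell.
        cells = [r[j] if j < len(r) else "" for r in data]
        # Header cell excluded; LF join; no trailing newline.
        value = "\n".join(csv_field(c) for c in cells)
        leaves.append(("c%03d" % j, value))
    return leaves
```

Note that the header names never enter the returned values: renaming a header changes nothing in the output, while re-ordering columns changes every leaf_id-to-value pairing.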

4. Leaf hash — bare sha256 of the canonical column (standard mode)

Decision (forever): The standard leaf hash is the BARE sha256 of the canonical column string's UTF-8 bytes:

leaf_hash_j = SHA-256( utf8( canonical_column_j ) )

There is no profile literal in the preimage, no leaf_id, no header name, no salt, and no 0x00 separators — exactly the canonical column bytes (§3), nothing else. (Identical construction to csv-row-v1 §4; only the unit differs: a column-join instead of a row-join.) The value a disclosure carries for a revealed standard leaf is the canonical column string itself; the verifier hashes utf8(value) and compares to leaf_hash — it does not re-canonicalize.

Worked example (NOT placeholders — computed by the reference impl)

Input CSV bytes (string view, \n = LF byte 0x0A) — the same matrix as csv-row-v1 §4, hashed by column:

name,age,role\nAlice,42,Engineer\nBob,35,Designer\nCarol,29,Writer\n

Header name,age,role defines 3 columns; the header cells are excluded from the leaf values, leaving three column leaves over the data rows:

leaf_id | header name | value (canonical column) | leaf_hash = sha256(utf8(value))
--------|-------------|--------------------------|----------------------------------
c000 | name | Alice⏎Bob⏎Carol | 4861638f64a6f5f3f82c117a61821b78da7e2b3fa81e65d3c018095148fa7435
c001 | age | 42⏎35⏎29 | 39472358bdad9300f20becbe3e18b8e311fc62bb338689b1849dda8c12f58a5a
c002 | role | Engineer⏎Designer⏎Writer | b0e228611fd461ffc53d28967779901871d78531ce3d82e70141121a47d40087

(⏎ denotes a literal LF 0x0A in the value; the header names are shown for orientation and are NOT in any preimage.) leaf_count = 3. A verifier MUST reproduce these three leaf_hash values exactly from the listed value bytes; if it does not, a step in §2/§3/§4 is wrong; debug against §8 first.
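The standard leaf hash is small enough to show in full. The sketch below uses only Python's hashlib; the function names standard_leaf_hash and verify_revealed_leaf are illustrative, not taken from the reference implementation.

```python
import hashlib


def standard_leaf_hash(canonical_column: str) -> str:
    """Bare SHA-256 over the UTF-8 bytes of the canonical column, hex-encoded."""
    return hashlib.sha256(canonical_column.encode("utf-8")).hexdigest()


def verify_revealed_leaf(value: str, leaf_hash: str) -> bool:
    """Hash the disclosed value as-is; the verifier does not re-canonicalize."""
    return standard_leaf_hash(value) == leaf_hash.lower()


if __name__ == "__main__":
    # Print the digests for the three worked-example columns so they can be
    # compared by eye against the leaf_hash values listed above.
    for value in ("Alice\nBob\nCarol", "42\n35\n29", "Engineer\nDesigner\nWriter"):
        print(standard_leaf_hash(value))
```

Because nothing but the column bytes enters the preimage, two columns with identical content always hash to the same leaf, which is the property §5 discusses.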

5. Salts — standard mode is UNSALTED (privacy posture is first-class)

Decision (forever): Standard csv-column-v1 is UNSALTED. No salt in the leaf (§4); salt_b64 is ABSENT from standard revealed-leaf entries. Do not synthesize an empty/zero salt.

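Privacy posture (restated for columns; understand, do not "fix"). A standard column anchor's redacted columns are protected only against a party who cannot enumerate the unknown column content, and the column shape changes the entropy calculus relative to rows. A withheld column whose content comes from a small or guessable space can be brute-forced by hashing candidate columns against its leaf, and columns with identical content produce identical leaves (fixture CC8).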

This is the documented, anchor-time-chosen cost of standard mode — not a defect to patch with salts. Do NOT use standard mode to withhold a low-cardinality or small-space sensitive column (a flag, an enum, a column of dates from a narrow range): route it to sealed mode (§5b), where redacted columns are unguessable. The UI MUST warn when a selected column is low-cardinality and recommend sealed; the user chooses standard-vs-sealed at anchor time, not at disclosure time. Sealed is the privacy answer; standard's brute-forceability is the documented cost of the no-keyfile path — not a defect to re-litigate.

5b. Sealed mode — HMAC leaf under a per-leaf HKDF salt (algo: "merkle-hmac-sha256")

FROZEN. Sealed csv-column-v1 shares the literal with algo: "merkle-hmac-sha256", the same §2 canonicalization, §3 column segmentation, and §6 duplicate-last merkle as standard, replacing the bare sha256 leaf (§4) with an HMAC under a per-leaf HKDF salt. The per-leaf salt uses the same bare info = "chunk/" || u32_be(j) the three shipped sealed profiles (csv-row-v1 / text-line-v1 / json-keypath-v1) use — NOT scheme-prefixed. The frozen fixture corpus is tests/vectors/disclosure-v1/csv_column_v1_native_sealed/.

Per-leaf salt: bare "chunk/" info. An earlier draft floated a scheme-prefixed info ("csv-column-v1/chunk/") for forward multi-axis domain separation. That proposal was dropped once its implementation cost was corrected: the JS file-anchor verifier re-derives the salt from the master salt, which makes the info a 4-site forever wire-contract (two anchor twins + redact + the JS verifier), not a producer-only detail. The frozen rule therefore keeps the same bare info as the shipped profiles. This is provably collision-free for a single-axis anchor: there is one chunk_merkle tree and a fresh master salt per anchor, so a column leaf and a row leaf only ever share a per-leaf info string across different anchors, whose master salts differ. A future multi-axis container (rows+cols+cells co-anchored under one anchor) will keep its trees salt-disjoint with a per-tree master salt rather than a per-scheme info prefix. For column leaf index j (0-based):

salt_j = HKDF-SHA256(
            ikm    = master_salt,                            # 32-byte bearer secret
            salt   = utf8("satsignal-sealed-v1/per-leaf"),   # shared namespace
            info   = utf8("chunk/") || u32_be(j),            # BARE — same as the 3 shipped profiles
            L      = 32 )
leaf_hash_j = HMAC-SHA256( key = salt_j, msg = utf8( canonical_column_j ) )

The carrier pins algo: "merkle-hmac-sha256", salt_version: "salt_v1". A revealed sealed leaf carries salt_b64 = base64(salt_j) — the PER-LEAF salt, NEVER the master salt; the redact tool reads the master salt from the source .mbnt manifest.json, derives the per-leaf salts of the revealed columns only, and strips the master salt from all output (the §5b.1 master-salt-strip rule of csv-row-v1, applied identically). Revealing per-leaf HKDF salts of revealed columns leaks nothing about the master salt or other columns (HKDF-Expand is a PRF). salt_b64 is REQUIRED for a sealed leaf; a sealed carrier missing it fails closed (sealed_leaf_missing_salt).
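The sealed derivation can be sketched with the Python standard library alone. HKDF is written out per RFC 5869 (extract, then expand) because the standard library has no HKDF helper; the function names are illustrative, and the 32-byte master salt in the usage lines is a throwaway random value, not a fixture.

```python
import base64
import hashlib
import hmac
import os
import struct


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF-SHA256 per RFC 5869: Extract with salt, then Expand with info."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def per_leaf_salt(master_salt: bytes, j: int) -> bytes:
    """salt_j with the bare info "chunk/" || u32_be(j)."""
    info = b"chunk/" + struct.pack(">I", j)  # big-endian unsigned 32-bit index
    return hkdf_sha256(master_salt, b"satsignal-sealed-v1/per-leaf", info, 32)


def sealed_leaf_hash(salt_j: bytes, canonical_column: str) -> str:
    """HMAC-SHA256 keyed by the per-leaf salt over the canonical column bytes."""
    return hmac.new(salt_j, canonical_column.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    master = os.urandom(32)  # stand-in for the anchor's master salt
    salt_0 = per_leaf_salt(master, 0)
    entry = {
        "value": "Alice\nBob\nCarol",
        "salt_b64": base64.b64encode(salt_0).decode("ascii"),  # per-leaf, never master
        "leaf_hash": sealed_leaf_hash(salt_0, "Alice\nBob\nCarol"),
    }
    print(entry)
```

A verifier holding only salt_b64 and the value can recompute the sealed leaf; it never needs, and must never receive, the master salt.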

6. Merkle behavior — DUPLICATE-LAST on odd

Decision (forever): The merkle is DUPLICATE-LAST on odd nodes, identical to csv-row-v1 §6 and the native anchor builder merkleRootFromHexLeaves (web/static/verifier/merkle.mjs, customer/_anchor_canon_js.py). At each level, nodes are paired left-to-right; an unpaired last node's right sibling is itself (right = (i+1 < len) ? level[i+1] : level[i]); the parent is SHA-256(raw(left) || raw(right)) over raw 32-byte concatenation, no domain tag. A single-leaf tree's root is that leaf (proof_path = []).

Implementer warning. The codebase contains a second odd-node primitive with promote-unchanged semantics: disclosure/merkle.py merkle_root (the generic disclosure-v1.md §3.4 helper / retired dotted corpus). It produces a different root on any odd-count level. csv-column-v1 roots and vectors MUST be computed with the duplicate-last builder. The disclosure verifier never rebuilds the root (it only walks proof_path structure-agnostically), so the only place the odd-node rule matters is the root/proof-path builder: build duplicate-last and emit a self-sibling entry for the odd last column.

Worked example (the §4 three-column tree)

Leaves (L0[j] = column leaf j):

L0[0] = 4861638f…7435   (name:   Alice⏎Bob⏎Carol)
L0[1] = 39472358…8a5a   (age:    42⏎35⏎29)
L0[2] = b0e22861…0087   (role:   Engineer⏎Designer⏎Writer)

Level 1 (3 leaves → odd; the last self-pairs under duplicate-last):

L1[0] = SHA-256( raw(L0[0]) || raw(L0[1]) ) = 831415da13203883b490fb302150f5409a776136428b8db7d2b28d7f21016ab7
L1[1] = SHA-256( raw(L0[2]) || raw(L0[2]) ) = d4d506f43209b0b23acbeeefa4645d2ace08d82fe9c42fa33d0c09572c9a6144   ← DUPLICATE-LAST (role self-pairs)
ROOT  = SHA-256( raw(L1[0]) || raw(L1[1]) ) = eff33d555c0ad3fc4b030f6431052daa79206c3f3c961e8229df0e75c1c3925a

Proof paths a disclosure carries to reveal each column (all walk to ROOT):

reveal c000 (name):
  proof_path = [
    { "side": "R", "hash": "39472358bdad9300f20becbe3e18b8e311fc62bb338689b1849dda8c12f58a5a" },  // L0[1] (age)
    { "side": "R", "hash": "d4d506f43209b0b23acbeeefa4645d2ace08d82fe9c42fa33d0c09572c9a6144" }   // L1[1]
  ]

reveal c002 (role):  ← the odd last node, TWO-entry self-sibling path
  proof_path = [
    { "side": "R", "hash": "b0e228611fd461ffc53d28967779901871d78531ce3d82e70141121a47d40087" },  // L0[2] ITSELF — self-sibling
    { "side": "L", "hash": "831415da13203883b490fb302150f5409a776136428b8db7d2b28d7f21016ab7" }   // L1[0]
  ]

The c002 two-entry self-sibling path is the duplicate-last signature, exactly as csv-row-v1 §6's Carol path — the tree shape is identical; only the leaf bytes differ (columns, not rows).
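The duplicate-last rule and the proof-path shape above can be sketched as a builder plus a structure-agnostic walker. This is not merkleRootFromHexLeaves itself; the function names are illustrative, and "side" is read as the side on which the sibling sits, matching the worked paths above.

```python
import hashlib
from typing import Dict, List, Tuple


def _parent(left: bytes, right: bytes) -> bytes:
    """SHA-256 over the raw 32-byte concatenation, no domain tag."""
    return hashlib.sha256(left + right).digest()


def merkle_root_and_path(hex_leaves: List[str], index: int) -> Tuple[str, List[Dict[str, str]]]:
    """Duplicate-last root plus the proof path for the leaf at `index`."""
    if not hex_leaves:
        raise ValueError("at least one leaf is required")
    level = [bytes.fromhex(h) for h in hex_leaves]
    path: List[Dict[str, str]] = []
    idx = index
    while len(level) > 1:
        if idx % 2 == 0:
            # Sibling sits to the right; an unpaired last node is its own sibling.
            sibling = level[idx + 1] if idx + 1 < len(level) else level[idx]
            path.append({"side": "R", "hash": sibling.hex()})
        else:
            path.append({"side": "L", "hash": level[idx - 1].hex()})
        level = [
            _parent(level[i], level[i + 1] if i + 1 < len(level) else level[i])
            for i in range(0, len(level), 2)
        ]
        idx //= 2
    return level[0].hex(), path


def walk_proof(leaf_hex: str, path: List[Dict[str, str]]) -> str:
    """Fold the path into a root without knowing the tree's odd-node rule."""
    node = bytes.fromhex(leaf_hex)
    for step in path:
        sibling = bytes.fromhex(step["hash"])
        node = _parent(node, sibling) if step["side"] == "R" else _parent(sibling, node)
    return node.hex()


if __name__ == "__main__":
    leaves = [
        "4861638f64a6f5f3f82c117a61821b78da7e2b3fa81e65d3c018095148fa7435",
        "39472358bdad9300f20becbe3e18b8e311fc62bb338689b1849dda8c12f58a5a",
        "b0e228611fd461ffc53d28967779901871d78531ce3d82e70141121a47d40087",
    ]
    root, path = merkle_root_and_path(leaves, 2)
    print(root)  # compare against ROOT in the worked example
    print(path)  # the first entry is the leaf's own hash (self-sibling)
```

The walker never needs to know whether the tree was built duplicate-last or promote-unchanged; that choice is fully encoded by which entries the builder emits.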

7. Original anchor binding

A CSV column anchor commits the csv-column-v1 leaf-set under the .mbnt canonical document's subject.proofs.chunk_merkle:

canonical field | required value under this profile
----------------|----------------------------------
subject.proofs.chunk_merkle.scheme | exactly "csv-column-v1"
subject.proofs.chunk_merkle.algo | "sha256" (standard); "merkle-hmac-sha256" (sealed, §5b)
subject.proofs.chunk_merkle.leaf_count | ncols (the header width, §3)
subject.proofs.chunk_merkle.root | duplicate-last merkle root over the column leaves (§6)

Because the server rides the client-supplied root verbatim and never recomputes from leaves (0015; customer/routes.py "leaf recompute is structurally impossible"), the anchor client computes the column leaves + root locally and submits them via proof_set.chunk_merkle + the off-chain proof_leaves companion — the same wire path csv-row-v1 uses. A disclosure under disclosure-v1.md carries this literal in disclosure.linked_anchor.subject_profile; each revealed leaf's profile field MUST equal "csv-column-v1". The binding chain walks revealed[i].value → leaf_hash → linked_anchor.root → original canonical-doc chunk_merkle.root → on-chain document_hash. The whole-file byte_exact (mandatory) and any content_canonical (e.g. csv-norm-v1) proofs are unchanged and still cover the entire file — column granularity governs the redaction tree only, never what the file commitment covers.

Forbidden-variant note. A verifier MUST apply this rule (bare-sha256 column-join leaf, header-excluded, by-index, duplicate-last merkle) to an anchor whose chunk_merkle.scheme == "csv-column-v1" and algo == "sha256". The (subject_profile, chunk_merkle.algo) pair selects the mode: ("csv-column-v1", "sha256") is §4; ("csv-column-v1", "merkle-hmac-sha256") is §5b. An unknown scheme MUST fail closed (unsupported_linked_*), never silently pass.
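The mode selection can be sketched as a small fail-closed lookup on the (scheme, algo) pair. The exception class and function name are hypothetical, the exact unsupported_linked_* code is left unspecified because the text above elides it, and treating an unknown algo as fail-closed (like an unknown scheme) is this sketch's assumption.

```python
class UnsupportedLinkedAnchor(Exception):
    """Fail-closed signal; stands in for the spec's unsupported_linked_* family."""


SCHEME = "csv-column-v1"
_MODE_BY_ALGO = {
    "sha256": "standard",            # bare sha256 column leaf
    "merkle-hmac-sha256": "sealed",  # HMAC leaf under a per-leaf HKDF salt
}


def select_mode(scheme: str, algo: str) -> str:
    """Pick the leaf rule from the pair; never from the literal alone."""
    if scheme != SCHEME:
        raise UnsupportedLinkedAnchor("unknown scheme: %r" % scheme)
    if algo not in _MODE_BY_ALGO:
        # Assumption: an unrecognized algo is rejected rather than defaulted.
        raise UnsupportedLinkedAnchor("unknown algo: %r" % algo)
    return _MODE_BY_ALGO[algo]
```

A verifier built this way cannot silently apply the standard rule to a sealed carrier, or accept a scheme it does not implement.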

8. Fixtures (test vectors)

All values below were computed by the csv-column-v1 reference implementation, which is validated against the frozen csv-row-v1 corpus (it reproduces csv-row-v1's Alice leaf 3147617d…, row root 19d82f…, and N1 on-chain hash 3b1fdc5c…) before computing any column value — so the shared §2/§6 primitives are proven byte-correct. They are NOT placeholders. A frozen native corpus lands at tests/vectors/disclosure-v1/csv_column_v1_native/ (positive CC1 + negatives); the inline vectors below keep this spec self-contained.

CC1: primary — header + 3 data rows, 3 columns (odd → duplicate-last)

Input: name,age,role\nAlice,42,Engineer\nBob,35,Designer\nCarol,29,Writer\n. Three column leaves (§4), root = eff33d555c0ad3fc4b030f6431052daa79206c3f3c961e8229df0e75c1c3925a. The committed carrier (scheme:"csv-column-v1", algo:"sha256", leaf_count:3, that root) has on-chain document hash 9de524ad9bfc10ec06aa2c4e7b394459e959652f1cb4e90096c593aef72595ef. A disclosure revealing c000 (name) + c002 (role) and redacting c001 (age) is the committed positive fixture; revealed entries carry no salt_b64.

CC2: even leaf count — 2 columns (clean pairing, no self-pair)

Input: h1,h2\nAlice,42\nBob,35. leaf_count = 2; c000 value Alice⏎Bob (19f31aa7…1f07), c001 value 42⏎35 (c31c5f18…3475); root = SHA-256(raw(c000) || raw(c001)) = 976edbe56aaa841e4b853b7b6877ba664396dac9c9c584a8cac18421cf2d3b0d. Proof paths: c000 → [{R, c31c5f18…3475}]; c001 → [{L, 19f31aa7…1f07}].

CC3: quoted comma in a cell — "Smith, John" is one cell

Input: name,note\n"Smith, John",hi. The comma is inside a quoted field, so the field still contains a , and minimal re-quote preserves the quotes: c000 value is "Smith, John" (with the quotes, 90a3195e…23a7); c001 value is hi (8f434346…7aa4). Pins quote-aware parsing + quote preservation inside a column leaf.

CC4: embedded LF in a quoted cell — the load-bearing \n-escape

Input: a,b\n"x\ny",z where the \n (0x0A) is inside the quoted first cell. Single data row → each column has one cell. c000 value is "x\ny" (the quotes preserved because the cell contains \n; leaf 29b8826c…1e61). This pins WHY the \n column-join is unambiguous: an in-cell newline is quoted and is byte-distinct from the join separator. (Contrast a column whose two cells were x and y: its value would be the bare x⏎y — a different leaf.)

CC5: ragged SHORT row pads trailing cell with empty-string

Input: a,b,c\n1,2,3\n4,5. The second data row has 2 fields; the header has 3, so column c pads the missing cell with "". Columns: c000 = 1⏎4 (6fe7f60f…c9d7), c001 = 2⏎5 (05b6a35f…1e6d), c002 = 3⏎ (the trailing member is the padded empty string; leaf 1121cfcc…02a2). root = 54231fc3c045b76392757cdbf1deec02e080168e8fc1bef55897cccbc6d3a919. Pins the short-row pad rule.

CC6: ragged LONG row REJECTED

Input: a,b\n1,2,3. The data row has 3 fields; the header has 2. Invalid input — REJECTED with invalid_csv_ragged_over. Pins the over-wide-row rejection (no leaf-set is produced).

CC7: single column (leaf_count = 1 → root == leaf, empty path)

Input: only\nx\ny\nz. One column; c000 value x⏎y⏎z (6d421ec4b623af3bdd47ad1d61a629eab8c11f7bf19a1e59576b5f2eede7befc). leaf_count = 1; the single-leaf tree's root == leaf and a reveal carries proof_path = [].

CC8: empty columns collide (zero-entropy cross-equality leak)

Input: a,b,c\n,,foo\n,,bar. Columns a and b are both two empty cells, so both have the canonical value ⏎ (a single LF joining two empty strings) and both leaves are 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b (= sha256("\n")). Column c value is foo⏎bar (807eff62…a776). Pins the §5 zero-entropy property: identical column content (here, two empty columns) produces identical leaf hashes, so an observer can tell the columns match without recovering their values. Use sealed mode (§5b) to withhold such columns.

CC-NEG: header-cell-included mistake → merkle_path_mismatch

A buggy builder that folds the header cell into each column leaf (CC1 source, c000 value name⏎Alice⏎Bob⏎Carol instead of Alice⏎Bob⏎Carol) keeps leaf_count = 3 but produces a different leaf-set and a different root (ce7466ec…b17b). The header-included leaf's own recompute is self-consistent (value→hash matches), so it is not a leaf_hash_mismatch; instead the header-included proof_path walks to the wrong root while the carrier commits the correct header-EXCLUDED root (eff33d…925a), failing the §7-step-4 merkle walk → merkle_path_mismatch. This pins §3's header-exclusion of leaf VALUES.

9. Out of scope / deprecation pointers

Every future granularity gets its own literal. This profile's literal is the hyphenated csv-column-v1, fixed forever once an anchor commits under it: chunk_merkle.algo == "sha256" selects the standard rule (§§2–4, 6), "merkle-hmac-sha256" selects the sealed rule (§5b, frozen). A verifier distinguishes the two by the (subject_profile, chunk_merkle.algo) pair.

Questions about this specification? Email hello@satsignal.cloud.