satsignal.text.paragraph_sentence.v1 — sentence-level selective disclosure of UTF-8 plaintext

DEPRECATED / INERT (read this first). This satsignal.text.paragraph_sentence.v1 dotted profile is deprecated and inert. It defined a salted rule with one leaf per sentence, addressed by paragraph and sentence index (p<N>/s<M>). No production flow emits or consumes it. Its allowlist literal is retained forever (an allowlist literal is never removed) and its frozen corpus is kept solely as a regression-guard record — it is never produced or verified by any live path. The rules below stay frozen for that regression corpus; do not implement them for new work. Live successor: text-line-v1. New text disclosures use the native text-line-v1 literal, which binds to the chunk_merkle a .txt / .md anchor already commits. Note the granularity difference: this deprecated profile segments by sentence (salted, paragraph/sentence indexed); the live text-line-v1 segments by line (text-norm-v1 canon, drop empty lines, bare/sealed leaf, native merkle binding — no re-anchor, no salt keyfile in standard mode). The two cannot interbind. Authority for the deprecation: disclosure-v1 §11.

Versioning (2026-05-27). This is satsignal.text.paragraph_sentence.v1. The profile literal is fixed at "satsignal.text.paragraph_sentence.v1" and appears verbatim inside every leaf-hash preimage (see §7). Profile literals are forever-contracts: once any client has anchored under this literal, every segmentation, canonicalization, normalization, salting, and leaf_id construction rule below is fixed for that literal forever. A bug in any of these rules cannot be patched in place; the only remedy is a new satsignal.text.paragraph_sentence.v2 profile that compatible verifiers must support in parallel. Breaking shape changes ship as v2, never as a quiet v1 mutation. This spec is the authoritative source for v1 rules; the master disclosure spec at disclosure-v1.md is the home for cross-profile plumbing (manifest shape, merkle invariants, verifier contract).

Status: draft 1, 2026-05-27. Audience: integrators anchoring UTF-8 plaintext documents (contracts, transcripts, articles, policy texts) who anticipate later disclosing specific sentences without revealing the rest; verifier authors who must reproduce this profile's segmentation byte-for-byte against an anchored leaf-set. Goal: define exactly one segmentation, canonicalization, leaf-id, and leaf-hash construction for sentence-level disclosure of UTF-8 plaintext — pinned to the byte, with a fixture set covering the adversarial cases that motivated the forever-contract discipline.

1. Why this exists

satsignal.disclosure.v1 lets an anchorer publish a redacted view of an already-anchored document with cryptographic proof that the revealed fragments were members of the original leaf-set. That construction needs a segmentation rule: somebody has to decide where a sentence starts and ends, because the revealed-leaf membership check recomputes the leaf hash from the exact sentence bytes the original anchor committed to. If two implementations disagree on where a sentence ends, they disagree on its leaf hash, and the disclosure fails to verify against an honest anchor.

Plaintext contracts, transcripts, news articles, policy texts, and similar UTF-8 documents are the headline driver. The disclosure builder ingests the document, runs this profile's canonicalization and segmentation, hashes each sentence into a leaf, builds the merkle tree, and anchors the root through the standard path. Later — at any time — the anchorer picks a subset of sentences, ships a disclosure bundle with their values, salts, and merkle proofs, and a verifier walks each proof to the on-chain-bound root. Sentences that are not disclosed never leave the anchorer's machine.

Sentence-level granularity is the operative leaf-type in v1. Paragraph-level disclosure is not a distinct leaf-type; it is achieved by disclosing every sentence in the chosen paragraph. Word-level and token-level disclosure are out of scope for v1 (§12).

2. Scope and prerequisites

This profile applies to a single UTF-8 plaintext input — a document without structural markup, treated as a sequence of paragraphs of sentences. Markdown, HTML, RTF, PDF, DOCX, and any other shape with encoded structure are not in scope: an anchorer who wants to disclose from such a source MUST first export it to plaintext under their own rules, then anchor that plaintext under this profile. The exported plaintext is the document this profile operates on; the original markup-bearing source is not.

Per disclosure-v1.md §2, selective disclosure is only possible for original anchors that were committed under a leaf-set scheme. The original canonical doc MUST carry subject.proofs.chunk_merkle.{scheme,algo,leaf_count,root} with scheme == "satsignal.text.paragraph_sentence.v1" and algo == "sha256". A pure-byte_exact anchor cannot be selectively disclosed under this profile; an anchorer planning a future disclosure MUST anchor the document under this profile from the start.

3. Inputs and canonicalization

Canonicalization is applied to the raw input bytes before segmentation, leaf extraction, or hashing. Every step below is forever-pinned for v1.

3.1 Encoding

Decision (forever): the input is UTF-8. Invalid UTF-8 fails closed — the disclosure builder MUST refuse to anchor a non-UTF-8 input under this profile. Rationale: pinning one encoding eliminates "which decoder did we use" as a cross-implementation drift surface.

3.2 Byte Order Mark

Decision (forever): a single leading UTF-8 BOM (0xEF 0xBB 0xBF) is stripped if present. No BOM elsewhere is recognized as a BOM. Rationale: many editors emit a BOM that is not part of the author's intended content; treating it as content would change every leaf hash for files saved through such an editor.

3.3 Empty input

Decision (forever): an input that canonicalizes to zero sentences is invalid input. The builder MUST refuse to anchor it. An empty disclosure has no leaves, no root, and no useful semantics; admitting it would force leaf_count = 0 corner cases into every verifier. Rationale: failing closed at the builder is simpler than carrying a zero-leaves edge case forever.

3.4 Line-ending normalization

Decision (forever): any of CR (0x0D), LF (0x0A), or CRLF (0x0D 0x0A) is normalized to a single LF (0x0A) before any further processing. Rationale: a contract saved on Windows and the same contract saved on Linux must produce the same leaves.

3.5 Unicode normalization

Decision (forever): the input is run through NFC (Normalization Form C, canonical decomposition followed by canonical composition, as defined in Unicode Standard Annex #15) before any further processing. Rationale: a composed é (U+00E9) and a decomposed é (U+0065 U+0301) display identically and must hash identically; NFC is the single defensible choice and is documented as the canonical form by Unicode. See fixture C9 for the byte-level demonstration.

3.6 Trailing whitespace per line

Decision (forever): trailing space (0x20) and tab (0x09) characters are stripped from each line before paragraph splitting. Other whitespace categories (vertical tab, form feed, NBSP, etc.) are not stripped. Rationale: editors and CI tools routinely add or remove trailing spaces; treating them as content invites silent drift.

3.7 Trailing newline at EOF

Decision (forever): a single trailing LF (0x0A) at end-of-input is stripped if present, so an input ending in "foo\n" and one ending in "foo" produce identical canonical bytes. Interior empty lines are preserved exactly. Rationale: many editors append a final newline as a POSIX convention; treating it as content would change leaves depending on editor save behavior.

3.8 Multiple internal whitespace

Decision (forever): internal whitespace inside a sentence is not collapsed. "foo  bar" (two spaces) and "foo bar" (one space) are different sentences with different leaf hashes. Rationale: collapsing would hide deliberate spacing differences (poetry, code blocks copied into prose, intentional emphasis) that the anchorer may need to disclose as written.
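
The steps of §3.1 to §3.7 can be sketched as a single function. This is an illustrative sketch, not the reference algorithm of §11.16: the function name canonicalize is hypothetical, and the relative order of the steps is an assumption, since the prose pins each step but not their sequence. The empty-input check of §3.3 is left out because it depends on the sentence count after segmentation.

```python
import unicodedata

UTF8_BOM = b"\xef\xbb\xbf"


def canonicalize(raw: bytes) -> str:
    """Hypothetical sketch of the section 3 canonicalization steps."""
    # 3.2: strip a single leading BOM; a BOM anywhere else stays as content.
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # 3.1: strict decoding; invalid UTF-8 raises UnicodeDecodeError (fail closed).
    text = raw.decode("utf-8", errors="strict")
    # 3.4: CRLF first, then any remaining lone CR, both become a single LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3.5: Unicode NFC.
    text = unicodedata.normalize("NFC", text)
    # 3.6: strip trailing space (0x20) and tab (0x09) only, line by line.
    text = "\n".join(line.rstrip(" \t") for line in text.split("\n"))
    # 3.7: strip one trailing LF at end of input, if present.
    if text.endswith("\n"):
        text = text[:-1]
    # 3.8: internal whitespace is deliberately left untouched.
    return text


print(repr(canonicalize(b"\xef\xbb\xbfHello world. \t\r\nGoodbye world.\r\n")))
```

Interior empty lines survive this function unchanged, so paragraph splitting can still see them.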

4. Paragraph segmentation

4.1 Paragraph splitting

Decision (forever): after canonicalization, a paragraph is a maximal run of non-empty lines separated by one or more empty lines. Empty lines are NOT paragraphs and produce no leaves. So "a\n\nb\n\n\nc" produces three paragraphs (a, b, c); the two empty lines between b and c collapse into a single boundary. Rationale: the empty-line-as-paragraph-break convention is universal in plaintext sources; this rule pins it.

4.2 Paragraph numbering

Decision (forever): paragraphs are zero-indexed in document order. The first paragraph is p0, the second p1, and so on. Rationale: zero-indexing matches the prevailing programming-language convention and leaves no ambiguity about which paragraph an index names.

4.3 Within-paragraph line joining

Decision (forever): when a paragraph spans multiple non-empty lines (no empty line between them), those lines are joined with a single ASCII space (0x20) before sentence segmentation runs. So "a\nb" becomes the segmentation-input string "a b". Rationale: in plaintext sources a mid-paragraph \n is overwhelmingly a soft wrap, not a sentence break; joining with a space matches author intent. The \n characters do not appear in any sentence's bytes.
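
The splitting and joining rules of §4.1 to §4.3 can be sketched as follows. The function name split_paragraphs is hypothetical, and the input is assumed to be already canonicalized per §3, so a line that held only spaces or tabs is already empty.

```python
def split_paragraphs(canonical: str) -> list[str]:
    """Hypothetical sketch: return each paragraph's joined string, in order (index 0 is p0)."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line in canonical.split("\n"):
        if line == "":
            # One or more empty lines close the current paragraph and emit nothing.
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


print(split_paragraphs("a\n\nb\n\n\nc"))
print(split_paragraphs("a\nb"))
```

The list index of each returned string is the paragraph number N used later in p<N>/s<M>.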

5. Sentence segmentation — the hard part

Sentence segmentation runs over each paragraph's joined string (per §4.3) independently. The rules below are forever-pinned and apply in the order written.

5.1 Terminator codepoints

Decision (forever): the following code points are the only sentence-ending characters in v1:

Codepoint  Hex     Name
.          U+002E  FULL STOP
?          U+003F  QUESTION MARK
!          U+0021  EXCLAMATION MARK
…          U+2026  HORIZONTAL ELLIPSIS
。         U+3002  IDEOGRAPHIC FULL STOP
？         U+FF1F  FULLWIDTH QUESTION MARK
！         U+FF01  FULLWIDTH EXCLAMATION MARK

Any future addition is a v2 profile. Other punctuation (semicolons, colons, em-dashes, single dashes) does not terminate.

Rationale: this set covers Latin-script, CJK, and fullwidth Latin without admitting language-specific heuristics that would not survive contact with a new corpus.
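
As a small illustration, the terminator set of §5.1 can be held as a frozen constant and cross-checked against the Unicode character names. The constant name TERMINATORS is an assumption made for this sketch.

```python
import unicodedata

# The seven terminator code points of section 5.1, written as escapes
# so the set does not depend on how this file is rendered.
TERMINATORS = frozenset("\u002e\u003f\u0021\u2026\u3002\uff1f\uff01")

for ch in sorted(TERMINATORS):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```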

5.2 Abbreviation list (DO-NOT-TERMINATE)

Decision (forever): a . (U+002E) is not a terminator when it matches the forever-frozen abbreviation list below under the matching algorithm specified in §5.11 (Decision 25 — multi-dot abbreviation handling). Match is case-sensitive. The list itself is frozen:

Mr   Mrs  Ms   Mx   Dr   Jr   Sr   St   Mt   Ft
vs   etc  e.g  i.e  cf   viz  Inc  Ltd  Co   Corp
LLC  LLP  PLC  GmbH S.A  Pty  No   Vol  pp   p
ch   sec  fig  eq   Prof Rev  Hon  Capt Gen  Lt
Sgt  Maj  Col  Ave  Blvd Rd   Ln   Ct

The carve-out applies only to .; other terminators (?, !, …, 。, ？, ！) are never affected by this list. Adding to the list or changing the matching algorithm in §5.11 requires a v2 profile.

Rationale: every plausible abbreviation must be enumerable; relying on capitalization heuristics or trained models would put the segmentation under a black-box dependency that cannot be reproduced byte-for-byte by an unaffiliated verifier. The list is small and tractable; any addition mints a new profile.

Multi-dot entries note. Three entries in the list — e.g, i.e, S.A — contain an internal . codepoint. The matching algorithm in §5.11 is the only authoritative reading of how these entries shield candidate . characters in the text being segmented; the plain-language phrasing "the word immediately preceding" used in earlier drafts of this spec is insufficient for multi-dot entries and is superseded by §5.11. See fixtures C2 / C14 (single-dot abbreviations) and C15 (multi-dot abbreviations) for byte-level demonstrations.

5.3 Decimal-number rule

Decision (forever): a . (U+002E) is not a terminator when the character immediately before it is [0-9] and the character immediately after it is [0-9]. So 3.14 does not terminate; 5. followed by a space (no digit after) does terminate. Rationale: decimal numbers in prose are unambiguous and trivially recognizable; treating their dots as terminators would split "$3.14 yesterday" into two sentences.
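
A minimal sketch of the decimal check, assuming a hypothetical helper name is_decimal_dot. It tests against the ASCII digits [0-9] explicitly, because str.isdigit() would also accept non-ASCII digits that the rule does not mention.

```python
ASCII_DIGITS = frozenset("0123456789")


def is_decimal_dot(buf: str, idx: int) -> bool:
    """Hypothetical sketch: True when buf[idx] is a '.' sitting between two ASCII digits."""
    if buf[idx] != ".":
        return False
    before = buf[idx - 1] if idx > 0 else ""
    after = buf[idx + 1] if idx + 1 < len(buf) else ""
    return before in ASCII_DIGITS and after in ASCII_DIGITS


price = "The price was $3.14 yesterday."
print(is_decimal_dot(price, price.index(".")))    # dot inside 3.14
print(is_decimal_dot(price, price.rindex(".")))   # final dot
```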

5.4 Ellipsis-with-space rule

Decision (forever): a run of three or more . (U+002E) characters terminates if the run is followed by whitespace or by end-of-paragraph. A run of three or more . followed by a non-whitespace, non-. character (a "mid-token" ellipsis) does not terminate. Only the last . of a terminating run is the terminator; intermediate dots are part of the sentence bytes ("foo..." includes all three dots).

Note that this rule and §5.3 (decimal) do not collide: a run of three or more . cannot be a decimal because the immediately-following character of the first dot would itself be ., not a digit.

Rationale: ellipsis is the only multi-codepoint terminator allowed in v1; pinning the rule to "three or more, terminator on whitespace boundary" handles both "She paused... Then she spoke." (terminating) and "She was...uncertain." (mid-token, single sentence) without heuristics.
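
The ellipsis rule can be sketched as a function that, given any dot inside a run, reports where the run terminates (if it does). The helper name ellipsis_terminator is hypothetical, and str.isspace() is used here as an assumed reading of "whitespace".

```python
from typing import Optional


def ellipsis_terminator(buf: str, idx: int) -> Optional[int]:
    """Hypothetical sketch: index of the terminating (last) dot of a 3+ dot run, else None."""
    if buf[idx] != ".":
        return None
    # Expand left and right to cover the whole run of '.' around idx.
    start = idx
    while start > 0 and buf[start - 1] == ".":
        start -= 1
    end = idx
    while end + 1 < len(buf) and buf[end + 1] == ".":
        end += 1
    if end - start + 1 < 3:
        return None  # not an ellipsis run; other rules of section 5 apply
    # The run terminates only at end-of-paragraph or before whitespace.
    if end + 1 == len(buf) or buf[end + 1].isspace():
        return end
    return None  # mid-token ellipsis


for text in ("She paused... Then she spoke.", "She was...uncertain."):
    print(text, "->", ellipsis_terminator(text, text.index("...")))
```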

5.5 Quoted-speech rule

Decision (forever): a terminator that sits inside a quoted span does NOT end the sentence. The sentence ends at the next terminator after the closing quote (under all other rules in this section). So "\"Did you go?\" she asked." is ONE sentence ending at the . after asked, not two.

Recognized quote pairs (forever list):

Open  Close  Codepoints
"     "      U+0022 (self-paired)
'     '      U+0027 (self-paired)
“     ”      U+201C / U+201D
‘     ’      U+2018 / U+2019
«     »      U+00AB / U+00BB
「    」     U+300C / U+300D

Quote depth is tracked with a simple paired-quote counter: incrementing on an opener and decrementing on the matching closer. While depth > 0, terminator codepoints encountered are passed through as sentence bytes without terminating. The two ASCII straight quotes (U+0022, U+0027) are self-paired — they alternate open/close based on current depth at that level (depth 0 → opener; depth 1 of the same self-paired character → closer). A typographic-pair quote (U+201C/U+201D, etc.) only closes a span opened by its mate.

Mismatched / unclosed quotes (e.g. "Hello.) leave depth > 0 to end-of-paragraph; the paragraph then terminates per §5.10 with its trailing content as a single sentence. Rationale: this is a deterministic fallback that does not require backtracking and does not introduce heuristics.

Rationale: prose with embedded direct speech is the single most common adversarial case; the simple paired-depth tracker handles it without invoking grammar.
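
One possible reading of the paired-depth tracker is sketched below. It is an interpretation, not the reference code: the stack of pending closers, the function name inside_quotes, and the choice to ignore a typographic closer that does not match the innermost open span are all assumptions. How the reference treats an apostrophe inside a word is not shown in this section, so the sketch makes no claim about it.

```python
SELF_PAIRED = frozenset("\"'")
# Typographic openers mapped to their mates (section 5.5 forever list).
TYPOGRAPHIC = {"\u201c": "\u201d", "\u2018": "\u2019", "\u00ab": "\u00bb", "\u300c": "\u300d"}
CLOSERS = frozenset(TYPOGRAPHIC.values())


def inside_quotes(buf: str) -> list[bool]:
    """Hypothetical sketch: mask[i] is True when position i lies at quote depth > 0."""
    pending: list[str] = []  # closers still awaited, innermost last
    mask: list[bool] = []
    for ch in buf:
        if ch in SELF_PAIRED:
            # A straight quote closes a span it opened itself, otherwise it opens one.
            if pending and pending[-1] == ch:
                pending.pop()
            else:
                pending.append(ch)
        elif ch in TYPOGRAPHIC:
            pending.append(TYPOGRAPHIC[ch])
        elif ch in CLOSERS and pending and pending[-1] == ch:
            pending.pop()
        mask.append(bool(pending))
    return mask


speech = '"Did you go?" she asked.'
mask = inside_quotes(speech)
print(mask[speech.index("?")], mask[-1])
```

A terminator at a position where the mask is True would be passed through as sentence bytes; the final dot after asked sits at depth 0 and is free to terminate.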

5.6 Trailing whitespace after a terminator

Decision (forever): a terminator that is followed by zero, one, or many whitespace characters and then either end-of-paragraph or the start of the next sentence terminates a sentence under the rules of §§5.1–5.5. Capitalization of the next word is NOT used as a heuristic. A terminator followed by whitespace and a lowercase letter still terminates; the abbreviation list of §5.2 is the only carve-out for ambiguous cases. Rationale: capitalization heuristics are language-specific and produce non-reproducible behavior on mixed-script and informal text; ruling them out keeps the segmentation byte-deterministic.

5.7 Sentence-value definition

Decision (forever): a sentence's canonical bytes are the UTF-8 bytes from the first non-whitespace character after the previous terminator (or paragraph start) through and INCLUDING the terminator character itself. Trailing whitespace after the terminator is in no sentence's bytes.

So "Hello. World!" produces two sentences:

  "Hello."   (terminator . included)
  "World!"   (terminator ! included)

The single space between them is in neither sentence's bytes.

Rationale: this rule makes the leaf bytes a contiguous substring of the post-canonicalization paragraph; it is reproducible without any state beyond the terminator position.

5.8 Sentence numbering

Decision (forever): sentences are zero-indexed within their paragraph. The first sentence of paragraph p3 is p3/s0. Rationale: matches §4.2.

5.9 Paragraph with no terminators

Decision (forever): if canonicalization produces a paragraph that contains no terminator codepoints AND is non-empty (a fragment without final punctuation), the paragraph is treated as ONE sentence whose value is the full paragraph bytes (no synthetic terminator is added). See fixture C13.

So "This has no terminator" produces:

  one sentence whose value is "This has no terminator" (no terminator appended)

Rationale: refusing to anchor unterminated fragments would block real documents (headers, titles, bullet lists); synthesizing a terminator would create bytes the anchorer never wrote. Accepting the fragment as-is avoids both bad options.

5.10 Trailing content after the last terminator

Decision (forever): if a paragraph contains terminators and is then followed by non-empty content after the final terminator (e.g. "First. Second" — no trailing dot), the trailing content is emitted as an additional sentence with no terminator in its bytes. The rule of §5.9 generalizes: any contiguous post-terminator non-empty content is a sentence whose value is exactly those bytes. Rationale: identical reasoning to §5.9; refusing to emit it loses content the anchorer wrote.
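
The value rules of §5.7, §5.9, and §5.10 can be sketched together, assuming the terminator positions have already been decided by §§5.1 to 5.6. The function name sentence_values is hypothetical, and str.lstrip() is used as an assumed reading of "first non-whitespace character".

```python
def sentence_values(buf: str, terminator_idxs: list[int]) -> list[str]:
    """Hypothetical sketch: slice a paragraph string into sentence values.

    terminator_idxs holds the ascending positions of the characters that
    actually terminate a sentence (after all carve-outs have been applied).
    """
    values: list[str] = []
    start = 0
    for t in terminator_idxs:
        # 5.7: skip leading whitespace, keep the terminator itself.
        value = buf[start : t + 1].lstrip()
        if value:
            values.append(value)
        start = t + 1
    # 5.9 / 5.10: leftover content becomes one more sentence, no terminator added.
    tail = buf[start:].lstrip()
    if tail:
        values.append(tail)
    return values


print(sentence_values("Hello. World!", [5, 12]))
print(sentence_values("First. Second", [5]))
print(sentence_values("This has no terminator", []))
```

The list index of each value is the sentence number M within its paragraph.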

5.11 Decision 25 — Multi-dot abbreviation handling

Decision (forever). A candidate . (U+002E) at position idx within a paragraph string is shielded from being a sentence terminator (per §5.2) iff there exists an entry a in the abbreviation list of §5.2 such that

a == buf[start : start + len(a)] AND start <= idx <= start + len(a)

where start is the left boundary of the dotted-token that contains idx. The dotted-token is the maximal run of non-(whitespace | quote | bracket) characters around idx, defined precisely as follows:

The bracket characters ()[]{} are admitted as dotted-token boundaries to keep the rule sensible for prose patterns such as "(e.g. red ones)" (see EX4 below). None of the forever-frozen abbreviation entries in §5.2 contains a bracket, a quote, or a whitespace character, so widening the boundary set this way cannot make a previously-recognized abbreviation stop matching.

Coverage of the two failure modes the rule corrects. The algorithm above handles both internal and trailing dots of multi-dot abbreviations:

Single-dot abbreviations (Mr, vs, …) are a degenerate case of the same rule: the entry contains no internal dot, so the only candidate it can shield is the dot immediately after the abbreviation, at idx == start + len(a) (for a two-letter entry such as Mr, len(a) == 2 and idx == start + 2). This matches the prior behavior of §5.2 on single-dot entries (fixtures C2, C14).

Helper name (normative). The reference Python in §11.16 names this helper _dotted_token_match. It supersedes the prior _preceding_word helper, which examined only buf[start : idx] (the prefix ending strictly at idx) and therefore could not recognize an abbreviation whose internal . lies at the candidate position. _preceding_word is removed from v1.

Invariant. _dotted_token_match consults only:

  1. the position idx,
  2. the paragraph buffer buf (post-canonicalization NFC bytes),
  3. the dotted-token boundary character set defined above, and
  4. the forever-frozen abbreviation list of §5.2.

A future profile version (v2) is required to change the abbreviation list, the dotted-token boundary character set, or the matching algorithm. None of those three can be changed under v1.

Worked examples (forever-pinned). The corrected algorithm produces, for the inputs below, exactly the segmentation shown. These four examples are normative test cases for any verifier implementation; they are demonstrated in §11.15 (fixture C15) and in the prose worked-out below:

Ex   Input                                 Result
EX1  He said e.g. before. And i.e. after.  2 sentences: He said e.g. before. · And i.e. after.
EX2  It's e.g. a test.                     1 sentence: It's e.g. a test.
EX3  S.A. de C.V. holdings expanded.       3 sentences: S.A. de C. · V. · holdings expanded.
EX4  Apples (e.g. red ones) are good.      1 sentence: Apples (e.g. red ones) are good.

The S.A. de C.V. case is non-obvious and is deliberately pinned here as a forever contract. The reasoning, walked through:

Result: three sentences. This is the segmentation v1 produces; documents that need a different segmentation around C.V must wait for a future profile that extends the list or the matching algorithm.

For Apples (e.g. red ones) are good., the open and close parens are dotted-token boundaries per the rule above; the e.g. dotted-token is bounded by ( on the left and a space on the right. The matching algorithm shields both dots of e.g. as in EX1, so the whole input is one sentence.
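
The shielding condition can be sketched directly from the formula above. This is not the reference _dotted_token_match of §11.16: the name is_shielded_dot is hypothetical, and the quote characters in the boundary set are assumed to be the recognized quote characters of §5.5.

```python
ABBREVIATIONS = frozenset(
    "Mr Mrs Ms Mx Dr Jr Sr St Mt Ft vs etc e.g i.e cf viz Inc Ltd Co Corp "
    "LLC LLP PLC GmbH S.A Pty No Vol pp p ch sec fig eq Prof Rev Hon Capt Gen Lt "
    "Sgt Maj Col Ave Blvd Rd Ln Ct".split()
)
# Assumed boundary set: brackets plus the quote characters of section 5.5.
BOUNDARY = frozenset("()[]{}\"'\u201c\u201d\u2018\u2019\u00ab\u00bb\u300c\u300d")


def is_shielded_dot(buf: str, idx: int) -> bool:
    """Hypothetical sketch: True when the '.' at idx is covered by an abbreviation entry."""
    # Walk left to the start of the dotted-token containing idx.
    start = idx
    while start > 0 and not (buf[start - 1].isspace() or buf[start - 1] in BOUNDARY):
        start -= 1
    # a == buf[start : start + len(a)] AND start <= idx <= start + len(a)
    return any(
        buf.startswith(a, start) and start <= idx <= start + len(a)
        for a in ABBREVIATIONS
    )


text = "S.A. de C.V. holdings expanded."
print([(i, is_shielded_dot(text, i)) for i, ch in enumerate(text) if ch == "."])
```

Under this sketch both dots of S.A. are shielded and neither dot of C.V. is, which is consistent with the three-sentence result pinned above.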

6. Leaf id construction

Decision (forever): each sentence's leaf_id is the literal string p<N>/s<M> where N is the zero-indexed paragraph number and M is the zero-indexed sentence number within that paragraph. Neither number is zero-padded.

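Examples:

  p0/s0    first sentence of the first paragraph
  p3/s0    first sentence of the fourth paragraph
  p12/s7   eighth sentence of the thirteenth paragraph (no zero-padding)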

Rationale: this is the address every fixture uses; the bare integers keep the leaf_id short and human-readable.
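
A minimal sketch of the construction, with a hypothetical function name leaf_id. Plain integer formatting already yields unpadded decimal digits.

```python
def leaf_id(paragraph_index: int, sentence_index: int) -> str:
    """Hypothetical sketch: build p<N>/s<M> from zero-based indices, without padding."""
    if paragraph_index < 0 or sentence_index < 0:
        raise ValueError("indices are zero-based and cannot be negative")
    return f"p{paragraph_index}/s{sentence_index}"


print(leaf_id(0, 0), leaf_id(3, 0), leaf_id(12, 7))
```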

7. Leaf-hash preimage

Decision (forever): the leaf-hash preimage is built from four byte strings separated by single 0x00 bytes, in this exact order:

leaf_hash = SHA-256(
      profile_literal_utf8        // "satsignal.text.paragraph_sentence.v1" as UTF-8
   || 0x00                        // separator
   || leaf_id_utf8                // e.g. "p3/s0" as UTF-8
   || 0x00                        // separator
   || sentence_value_utf8         // NFC-normalized sentence bytes,
                                  //   including the terminator
                                  //   character itself
   || 0x00                        // separator
   || salt_bytes                  // base64-decoded RAW BYTES from salt_b64
                                  //   (NOT the base64 ASCII)
)

The output is the raw 32-byte SHA-256 digest; on the wire it appears as 64 lowercase hex characters per disclosure-v1.md §3.4.

7.1 Worked example

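Take p0/s0 of fixture C1 (full fixture in §11):

  profile literal   "satsignal.text.paragraph_sentence.v1"        36 bytes
  separator         0x00                                           1 byte
  leaf_id           "p0/s0"                                        5 bytes
  separator         0x00                                           1 byte
  sentence value    "Hello world."                                12 bytes
  separator         0x00                                           1 byte
  salt              base64-decoded QzEtc2FsdC1hYWFhYWFhYQ==       16 bytes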

Total preimage length: 72 bytes. SHA-256 of those bytes is the C1 p0/s0 leaf_hash:

a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae
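
The preimage layout of §7 can be sketched as below, using the C1 p0/s0 inputs. The function names leaf_preimage and leaf_hash are hypothetical, and the value passed in is assumed to be already NFC-normalized. The sketch prints the preimage length and the digest so the result can be compared by hand with the value pinned above.

```python
import base64
import hashlib

PROFILE = "satsignal.text.paragraph_sentence.v1"


def leaf_preimage(leaf_id: str, value: str, salt_b64: str) -> bytes:
    """Hypothetical sketch: profile || 0x00 || leaf_id || 0x00 || value || 0x00 || salt."""
    salt = base64.b64decode(salt_b64, validate=True)  # raw bytes, not the base64 ASCII
    return b"\x00".join(
        [PROFILE.encode("utf-8"), leaf_id.encode("utf-8"), value.encode("utf-8"), salt]
    )


def leaf_hash(leaf_id: str, value: str, salt_b64: str) -> str:
    """SHA-256 of the preimage as 64 lowercase hex characters."""
    return hashlib.sha256(leaf_preimage(leaf_id, value, salt_b64)).hexdigest()


c1 = ("p0/s0", "Hello world.", "QzEtc2FsdC1hYWFhYWFhYQ==")
print(len(leaf_preimage(*c1)))
print(leaf_hash(*c1))
```

Joining four parts with b"\x00" places exactly three separators, one between each adjacent pair, which is the layout the preimage definition calls for.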

7.2 Pin-points

Cross-profile consistency: the preimage shape above (profile || 0x00 || leaf_id || 0x00 || value || 0x00 || salt) is the same layout used by sibling profiles satsignal.csv.row.v1 and satsignal.json.field.v1. A verifier implementing all three shares the same outer hashing routine; only the value-canonicalization rule differs per profile.

8. Salts

8.1 Salt size

Decision (forever): 16 raw bytes per leaf. Rationale: 128 bits of entropy is sufficient to make per-leaf candidate-value attacks infeasible while keeping the per-leaf overhead small.

8.2 Salt uniqueness and persistence

Decision (forever): each leaf's salt is generated by a CSPRNG and is unique per leaf (no two leaves in a single document share a salt). The anchorer MUST persist the salt off-chain alongside the sentence value: without the salt, the disclosure cannot be regenerated and the merkle proof cannot be reconstructed. Rationale: the salt is what prevents an attacker who has the anchored root and a candidate value from confirming whether the candidate matches an undisclosed leaf — without the salt, they cannot recompute the leaf hash. Salts are the privacy primitive of selective disclosure under this profile.

The salt is transported on the wire (when a leaf is disclosed) as salt_b64 per disclosure-v1.md §3.4: base64 of the raw 16 bytes. Undisclosed leaves' salts are NEVER transmitted.
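
Salt generation under §8.1 and §8.2 might look like the sketch below. The function names are hypothetical; secrets.token_bytes is Python's standard CSPRNG interface, and the explicit uniqueness check is a belt-and-braces assumption rather than something this section prescribes.

```python
import base64
import secrets

SALT_BYTES = 16  # section 8.1: 16 raw bytes per leaf


def new_salt_b64() -> str:
    """Hypothetical sketch: draw a fresh 16-byte salt and encode it as salt_b64."""
    return base64.b64encode(secrets.token_bytes(SALT_BYTES)).decode("ascii")


def salts_for(leaf_ids: list[str]) -> dict[str, str]:
    """Assign one salt per leaf and refuse to return a set containing a duplicate."""
    salts = {lid: new_salt_b64() for lid in leaf_ids}
    if len(set(salts.values())) != len(salts):
        raise RuntimeError("duplicate salt drawn; regenerate")
    return salts


print(salts_for(["p0/s0", "p0/s1"]))
```

The returned mapping is what the anchorer would persist off-chain next to the sentence values.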

9. Merkle behavior (cross-reference)

The merkle-tree construction, proof-path encoding, single-leaf-tree rule, odd-node promote-unchanged rule, and proof-walk algorithm are defined once in disclosure-v1.md §3.4. This profile does not re-derive them.

Profile-specific notes:

10. Original anchor binding

When the original document is anchored under this profile, the canonical doc's subject.proofs.chunk_merkle block MUST carry:

Field       v1 value
scheme      "satsignal.text.paragraph_sentence.v1"
algo        "sha256"
leaf_count  (positive integer; number of sentence leaves)
root        (64-char lowercase hex; merkle root of leaf-set)
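
A shape check for the four fields above can be sketched as follows. The function name chunk_merkle_problems and the choice to return a list of messages are assumptions made for illustration; the verifier contract itself lives in disclosure-v1.md.

```python
import re

SCHEME = "satsignal.text.paragraph_sentence.v1"
_ROOT_RE = re.compile(r"[0-9a-f]{64}")


def chunk_merkle_problems(block: dict) -> list[str]:
    """Hypothetical sketch: list every way a chunk_merkle block departs from the v1 shape."""
    problems: list[str] = []
    if block.get("scheme") != SCHEME:
        problems.append("scheme is not the v1 profile literal")
    if block.get("algo") != "sha256":
        problems.append("algo is not sha256")
    count = block.get("leaf_count")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not (isinstance(count, int) and not isinstance(count, bool) and count > 0):
        problems.append("leaf_count is not a positive integer")
    root = block.get("root")
    if not (isinstance(root, str) and _ROOT_RE.fullmatch(root)):
        problems.append("root is not 64 lowercase hex characters")
    return problems


print(chunk_merkle_problems({"scheme": SCHEME, "algo": "sha256", "leaf_count": 0, "root": "AB" * 32}))
```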

A disclosure bundle's manifest.disclosure.linked_anchor.subject_profile MUST equal "satsignal.text.paragraph_sentence.v1", and every revealed[i].profile in the same block MUST equal that literal. A verifier that finds a mismatch fails closed per disclosure-v1.md §7 with profile_mismatch.

Sealed-mode anchors (algo: "merkle-hmac-sha256") are NOT supported by satsignal.disclosure.v1 (per disclosure-v1.md §4 step 5); this profile inherits that restriction. Sealed-mode sentence disclosure is deferred to a future minor of the disclosure spec or to a per-profile sealed addendum.

11. Fixtures

Each fixture below shows: the input bytes (hex), the canonical form after §3, the paragraph and sentence count, every (leaf_id, value) pair, the per-leaf salt_b64 and computed leaf_hash, and (where indicated) the merkle root and proof paths. All hashes are real SHA-256 outputs computed against the algorithm described in §§3–8; they are not placeholders. The exact reference Python that produced them is in §11.16.

11.1 C1 — minimal

leaf_id  value           salt_b64                  leaf_hash
p0/s0    Hello world.    QzEtc2FsdC1hYWFhYWFhYQ==  a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae
p0/s1    Goodbye world.  QzEtc2FsdC1iYmJiYmJiYg==  1bff79cb195d06b2602dc44ce1f7f8e4fc854e621d20954f7cf4fc47ad8e91e5

Merkle root: ec0f6274cddd13fa394c0ba8f024b8b23184ccc946bc4cd01130d72cb5659094

Full merkle tree. Two leaves, one inner level (the root).

              root  = ec0f6274cddd13fa394c0ba8f024b8b23184ccc946bc4cd01130d72cb5659094
                              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )
                                      /                                \
   leaf[p0/s0] = a0a9...08aae                                  leaf[p0/s1] = 1bff...91e5

proof_path entries:

11.2 C2 — abbreviation (Mr.)

leaf_id  value                 salt_b64                  leaf_hash
p0/s0    Mr. Smith went home.  QzItc2FsdC1jY2NjY2NjYw==  57ade39c221efb0830664a597df9ad7a1ac2f2d23859f79d15eb8bd127219419
p0/s1    He left at 5.         QzItc2FsdC1kZGRkZGRkZA==  0f014458d7fe1eb748812f820306f2333de1c29f75d7a388bda4279cfb386588

Merkle root: 8a8a7caa7ff601d2b063ea19b151c96f842730271dbbadb1f22af29de9a86591

Full merkle tree.

              root  = 8a8a7caa7ff601d2b063ea19b151c96f842730271dbbadb1f22af29de9a86591
                              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )
                                      /                                \
   leaf[p0/s0] = 57ad...9419                                  leaf[p0/s1] = 0f01...6588

proof_path entries:

11.3 C3 — decimal ($3.14)

leaf_id  value                           salt_b64                  leaf_hash
p0/s0    The price was $3.14 yesterday.  QzMtc2FsdC1lZWVlZWVlZQ==  24734ad67b1bb468d8ab550afd8f85c8a95acdc5dbbbde387485568a39ad6dbe
p0/s1    It changed today.               QzMtc2FsdC1mZmZmZmZmZg==  776c91d98ebf0202c4562bca2e72ed44aed8b5e357eb6773d7eed188678b0b48

11.4 C4 — quoted speech ("Did you go?" she asked.)

leaf_id  value                     salt_b64                  leaf_hash
p0/s0    "Did you go?" she asked.  QzQtc2FsdC1nZ2dnZ2dnZw==  7a4f31c93ad048e5be7e8f65ac67e4a351b6fbbf30e99b789b3305999906420c
p0/s1    He nodded.                QzQtc2FsdC1oaGhoaGhoaA==  c900b0ef421c8c97af2ee820cb9154945d34b1ef25328dd3291bed496c63ef0f

11.5 C5 — ellipsis with space

leaf_id  value            salt_b64                  leaf_hash
p0/s0    She paused...    QzUtc2FsdC1paWlpaWlpaQ==  f66863a8bbde2e3c1d67e540ce6174d4a074fd2cd4bb4cfed9f13ee33f0429e4
p0/s1    Then she spoke.  QzUtc2FsdC1qampqampqag==  31737452546fa568667abce83ea34cbfd6d64900bf3541ac5f40a44b6106cd06

11.6 C6 — mid-token ellipsis (deviation flag)

The brief's literal example was "She was... uncertain.\n" (with a space after the ellipsis) labelled as one sentence, but §5.4 makes ... followed by whitespace a terminator (consistent with C5). To keep the spec rules pure and to exercise the mid-token carve-out as written, this fixture uses "She was...uncertain.\n" (no space after the ellipsis). The deviation is noted in the worker report.

leaf_id  value                 salt_b64                  leaf_hash
p0/s0    She was...uncertain.  QzYtc2FsdC1ra2tra2traw==  436d353f3f204360122841244c4a4ec3ea9443b9288d3ec0fc868eb194361d25

11.7 C7 — CJK (こんにちは。さようなら。)

leaf_id  value         salt_b64                  leaf_hash
p0/s0    こんにちは。  Qzctc2FsdC1sbGxsbGxsbA==  b09ff7dfda7aff38ec1e1e37ec1fe6cef2f2316bc679b15223a406cb9f61998d
p0/s1    さようなら。  Qzctc2FsdC1tbW1tbW1tbQ==  61c4071b343a0c45366f3eb84a83d8a37ffc8ad213e3b42ea825ea0d0dfd5960

11.8 C8 — smart quotes (“Hello!” she said.)

leaf_id  value               salt_b64                  leaf_hash
p0/s0    “Hello!” she said.  Qzgtc2FsdC1ubm5ubm5ubg==  24137c38c612d038244f25d7434e464a9f68b4c79ffa4825e943e47036c8ae78

11.9 C9 — NFC equivalence (café.)

This fixture demonstrates that NFC normalization (§3.5) collapses the two byte-distinct representations of café into the same canonical form, yielding identical leaf_hash values.

Variant  Input bytes (hex)                                   Canonical bytes (hex)
C9a      63 61 66 c3 a9 2e 0a (precomposed é = U+00E9)       63 61 66 c3 a9 2e
C9b      63 61 66 65 cc 81 2e 0a (decomposed e + U+0301)     63 61 66 c3 a9 2e

Both variants canonicalize to the same 6 bytes (café.). With the same salt_b64 (Qzktc2FsdC1vb29vb29vbw== — raw bytes 43 39 2d 73 61 6c 74 2d 6f 6f 6f 6f 6f 6f 6f 6f), both variants produce the identical leaf_hash:

e604850e3138f48df7d5f1858d316500904b89f4f7949446bd42d8faa4b054b4

This is the byte-level demonstration that §3.5's NFC requirement makes the profile editor-agnostic for accented Latin text.
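
The C9 equivalence can be reproduced with the standard library alone. The helper name nfc_bytes is hypothetical; the byte strings are the C9a and C9b inputs without their trailing LF.

```python
import unicodedata


def nfc_bytes(raw: bytes) -> bytes:
    """Hypothetical helper: decode UTF-8, apply NFC, re-encode as UTF-8."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


c9a = bytes.fromhex("636166c3a92e")      # precomposed U+00E9
c9b = bytes.fromhex("63616665cc812e")    # e followed by U+0301
print(nfc_bytes(c9a).hex(" "), "|", nfc_bytes(c9b).hex(" "))
```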

11.10 C10 — multi-paragraph (full merkle tree)

leaf_id  value                  salt_b64                  leaf_hash
p0/s0    First para.            QzEwLXNhbHQtYWFhYWFhYQ==  1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f
p0/s1    Second sentence here.  QzEwLXNhbHQtYmJiYmJiYg==  d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999
p1/s0    Second para.           QzEwLXNhbHQtY2NjY2NjYw==  63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc
p1/s1    With two sentences.    QzEwLXNhbHQtZGRkZGRkZA==  ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e

Merkle root: b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2

Full merkle tree. Four leaves, two inner levels.

                            root  =  b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2
                                  =   SHA-256( I_L || I_R )
                       /                                                     \
        I_L = a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc      I_R = dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734
              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )                                  =  SHA-256( leaf[p1/s0] || leaf[p1/s1] )
            /                       \                                                /                          \
   leaf[p0/s0] = 1801...947f   leaf[p0/s1] = d3d3...8999            leaf[p1/s0] = 63ac...5dfc   leaf[p1/s1] = ff14...c05e
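
The tree diagram above can be read as pairwise hashing of adjacent nodes. The sketch below follows that reading under several assumptions that this profile does not itself define (the normative rules are in disclosure-v1.md §3.4): child digests are concatenated as raw 32-byte values with no prefix, a lone leaf is its own root, and an odd node is promoted unchanged. The function name merkle_root is hypothetical.

```python
import hashlib


def merkle_root(leaf_hashes_hex: list[str]) -> str:
    """Hypothetical sketch: fold leaf hashes pairwise into a root (assumptions in lead-in)."""
    if not leaf_hashes_hex:
        raise ValueError("a leaf-set must contain at least one leaf")
    level = [bytes.fromhex(h) for h in leaf_hashes_hex]
    while len(level) > 1:
        nxt = []
        # Hash each adjacent pair as SHA-256(left || right).
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(level[i] + level[i + 1]).digest())
        # Assumed odd-node rule: carry the unpaired last node up unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()


c10_leaves = [
    "1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f",
    "d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999",
    "63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc",
    "ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e",
]
print(merkle_root(c10_leaves))
```

If the assumptions hold, the printed value should equal the C10 root pinned above; a mismatch would point at the assumed concatenation rule rather than at the fixture.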

proof_path entries:

11.11 C11 — CRLF mixed

leaf_id  value         salt_b64                  leaf_hash
p0/s0    First.        QzExLXNhbHQtYWFhYWFhYQ==  b04b6704c41bd91295ab9c02fca0c5d69324c92609f4d3750077036f7055be61
p0/s1    Still p0.     QzExLXNhbHQtYmJiYmJiYg==  6e35b2ae0bfee6fe8ff5b2cd88e48a289b3e38970558c940dc849b08f7d2df9b
p1/s0    Second para.  QzExLXNhbHQtY2NjY2NjYw==  1b2237c8e8c8fac65b7af016cb6005a79b735eb25a600e3a652eba0921dfbefb

Note that the within-paragraph CRLF between First. and Still p0. is normalized to LF (§3.4), and the two lines are then joined with a single space per §4.3 — yielding the segmentation-input string "First. Still p0." for p0. The LF is not present in any sentence's bytes.

11.12 C12 — BOM

leaf_id  value         salt_b64                  leaf_hash
p0/s0    Hello world.  QzEyLXNhbHQtYWFhYWFhYQ==  2bb0fd5264ba70973a636c65419dfdf47be4956a2cc992b0c2d3d690547356c2

Note: this leaf_hash differs from C1's p0/s0 because the salt_b64 differs. The point of the fixture is that the canonical bytes of the sentence are identical to a BOM-less variant — not that the leaf hash is identical (which it would be only if the salt and leaf_id were also identical).

11.13 C13 — incomplete-sentence paragraph

leaf_id  value                   salt_b64                  leaf_hash
p0/s0    This has no terminator  QzEzLXNhbHQtYWFhYWFhYQ==  3efe3081be5681139d706aa809e132cace91a96d31683a1a2842514c836c0c9d

11.14 C14 — forever-list abbreviation collision (vs.)

leaf_id  value          salt_b64                  leaf_hash
p0/s0    vs. the rest.  QzE0LXNhbHQtYWFhYWFhYQ==  8a0016e15703dedd89d512d3f54f2b8d55173ec1e2827503650051c37bb74821
p0/s1    End.           QzE0LXNhbHQtYmJiYmJiYg==  f9a6380a2ffb925556f6247683576697c9660f7adfcdc78b76a8bc4d4cf812e9

11.15 C15 — multi-dot abbreviations (e.g., i.e.)

This fixture exercises §5.11 (Decision 25) directly: both e.g and i.e are multi-dot entries in the abbreviation list, and v1's matching algorithm shields all four of the internal-and-trailing dots they introduce. The bug being fixed was that the prior _preceding_word helper split this input into four sentence fragments; the patched _dotted_token_match in §11.16 produces the intended two.

| leaf_id | value | salt_b64 | leaf_hash |
| --- | --- | --- | --- |
| p0/s0 | He said e.g. before lunch. | Dw8PDw8PDw8PDw8PDw8PDw== | be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882 |
| p0/s1 | Then i.e. after. | 8PDw8PDw8PDw8PDw8PDw8A== | 1a455dea1a4456b1bb6af79022bb3bdd77900367a4a91184904d3c395cb9dcb4 |

The two salts are deliberately constant byte patterns (16 × 0x0F and 16 × 0xF0) so an unaffiliated verifier can reproduce these hashes without a salt-generation step: the raw-bytes preimage is fully determined by the spec.
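The correspondence between the byte patterns and the salt_b64 strings can be checked with the standard library alone. The variable names below are illustrative.

```python
import base64

salt_s0 = bytes([0x0F]) * 16  # p0/s0: sixteen 0x0F bytes
salt_s1 = bytes([0xF0]) * 16  # p0/s1: sixteen 0xF0 bytes

b64_s0 = base64.b64encode(salt_s0).decode("ascii")
b64_s1 = base64.b64encode(salt_s1).decode("ascii")
print(b64_s0)  # Dw8PDw8PDw8PDw8PDw8PDw==
print(b64_s1)  # 8PDw8PDw8PDw8PDw8PDw8A==
```

It is the raw 16 bytes, not the base64 text, that enter the leaf-hash preimage.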

Full merkle tree. Two leaves, one inner level (the root).

              root  = 65bc540ee29ee81834460758069e4ed48c2bc05ca8958f11db98ea5b08a060ba
                              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )
                                      /                                \
   leaf[p0/s0] = be17...9882                                  leaf[p0/s1] = 1a45...dcb4

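proof_path entries: with two leaves, each leaf's proof consists of a single sibling. For p0/s0 the sibling is leaf[p0/s1] (1a45...dcb4), concatenated on the right; for p0/s1 the sibling is leaf[p0/s0] (be17...9882), concatenated on the left.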
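A minimal sketch of how a verifier might fold a one-step proof for this tree. The `(side, sibling)` tuple shape and the name `fold_proof` are assumptions made for illustration; the actual proof_path entry format is defined by the master disclosure spec, not here. Node hashing follows the diagram above: SHA-256 over the two raw 32-byte digests, left then right.

```python
import hashlib

leaf_s0 = bytes.fromhex(
    "be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882"
)
leaf_s1 = bytes.fromhex(
    "1a455dea1a4456b1bb6af79022bb3bdd77900367a4a91184904d3c395cb9dcb4"
)


def fold_proof(leaf: bytes, path: list[tuple[str, bytes]]) -> bytes:
    # Hypothetical entry shape: ("left" | "right", sibling_digest).
    node = leaf
    for side, sibling in path:
        if side == "left":
            node = hashlib.sha256(sibling + node).digest()
        else:
            node = hashlib.sha256(node + sibling).digest()
    return node


root_from_s0 = fold_proof(leaf_s0, [("right", leaf_s1)])
root_from_s1 = fold_proof(leaf_s1, [("left", leaf_s0)])

# Both proofs must land on the same root; compare the printed value
# with the root pinned in the diagram.
print(root_from_s0.hex())
print(root_from_s0 == root_from_s1)
```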

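Full preimage breakdown for p0/s0. Reproducing the leaf-hash by hand, using the §7 layout (profile, leaf_id, and value, each followed by a 0x00 separator, then the raw salt):

| component | bytes | length |
| --- | --- | --- |
| profile | "satsignal.text.paragraph_sentence.v1" (UTF-8) | 36 |
| separator | 0x00 | 1 |
| leaf_id | "p0/s0" (UTF-8) | 5 |
| separator | 0x00 | 1 |
| value | "He said e.g. before lunch." (UTF-8) | 26 |
| separator | 0x00 | 1 |
| salt | 16 × 0x0F (raw bytes, not base64) | 16 |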

Total preimage length: 86 bytes. SHA-256 of those bytes: be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882.
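The 86-byte total can be checked mechanically. The sketch below assembles the preimage in the §7 order and prints the digest so it can be compared with the value pinned for p0/s0; the variable names are illustrative.

```python
import hashlib

profile = "satsignal.text.paragraph_sentence.v1".encode("utf-8")
leaf_id = "p0/s0".encode("utf-8")
value = "He said e.g. before lunch.".encode("utf-8")
salt = bytes([0x0F]) * 16  # raw salt bytes, not the base64 text

# §7 order: profile, leaf_id, value, each followed by 0x00, then the salt.
preimage = profile + b"\x00" + leaf_id + b"\x00" + value + b"\x00" + salt

print(len(profile), len(leaf_id), len(value), len(salt))  # 36 5 26 16
print(len(preimage))  # 86
# Compare this digest with the leaf_hash pinned for p0/s0.
print(hashlib.sha256(preimage).hexdigest())
```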

Counterfactual. Under the pre-patch _preceding_word helper (now removed from v1), this input would have produced four sentence fragments: "He said e.", "g. before lunch.", "Then i.", "e. after.". The hashes for those four fragments would not match the two hashes pinned here. A verifier that runs the patched §11.16 algorithm reproduces the two hashes above exactly; a verifier that runs the pre-patch algorithm cannot verify a C15 disclosure. This is the single bug §5.11 corrects.

11.16 Reference algorithm (Python)

The hashes in §§11.1–11.15 were produced by the algorithm below. It is normative for the segmentation and hashing rules in §§3–8: any divergence between this code and the prose above is a bug in the prose, not in the code (subject to the spec author retaining final say on rule shape).

import hashlib, unicodedata

PROFILE = "satsignal.text.paragraph_sentence.v1"

ABBREVIATIONS = {
    "Mr","Mrs","Ms","Mx","Dr","Jr","Sr","St","Mt","Ft",
    "vs","etc","e.g","i.e","cf","viz","Inc","Ltd","Co","Corp",
    "LLC","LLP","PLC","GmbH","S.A","Pty","No","Vol","pp","p",
    "ch","sec","fig","eq","Prof","Rev","Hon","Capt","Gen","Lt",
    "Sgt","Maj","Col","Ave","Blvd","Rd","Ln","Ct",
}

TERMINATORS = {".", "?", "!", "…", "。", "？", "！"}

QUOTE_OPEN_TO_CLOSE = {
    '"': '"', "'": "'",
    "“": "”", "‘": "’",
    "«": "»", "「": "」",
}
QUOTE_OPENERS = set(QUOTE_OPEN_TO_CLOSE.keys())
QUOTE_CLOSERS = set(QUOTE_OPEN_TO_CLOSE.values())

# §5.11 (Decision 25): dotted-token boundary set. The forever-frozen
# abbreviation list in §5.2 contains no bracket / quote / whitespace
# character, so widening the boundary set with brackets cannot make a
# previously-recognized abbreviation stop matching.
DOTTED_TOKEN_BREAKS = {"(", ")", "[", "]", "{", "}"}


def canonicalize(raw: bytes) -> str:
    text = raw.decode("utf-8")            # §3.1
    if text.startswith("\ufeff"):   # §3.2 (leading U+FEFF BOM)
        text = text[1:]
    text = unicodedata.normalize("NFC", text)  # §3.5
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # §3.4
    lines = [ln.rstrip(" \t") for ln in text.split("\n")]  # §3.6
    text = "\n".join(lines)
    if text.endswith("\n"):               # §3.7
        text = text[:-1]
    return text


def split_paragraphs(canonical: str) -> list[str]:
    paragraphs, cur = [], []
    for line in canonical.split("\n"):
        if line == "":
            if cur:
                paragraphs.append(" ".join(cur))   # §4.3
                cur = []
        else:
            cur.append(line)
    if cur:
        paragraphs.append(" ".join(cur))
    return paragraphs


def _is_dotted_token_boundary(ch: str) -> bool:
    """§5.11: whitespace, recognized quote, or ASCII bracket."""
    return (
        ch.isspace()
        or ch in QUOTE_OPENERS
        or ch in QUOTE_CLOSERS
        or ch in DOTTED_TOKEN_BREAKS
    )


def _dotted_token_match(buf: str, idx: int) -> bool:
    """
    §5.11 (Decision 25). Returns True iff the candidate '.' at
    `buf[idx]` is shielded from being a sentence terminator by a
    forever-frozen abbreviation entry.

    The 'dotted-token' surrounding `idx` is the maximal run of
    non-(whitespace | quote | bracket) characters around `idx`,
    delimited by `_is_dotted_token_boundary`. The candidate '.' is
    shielded iff there exists an entry `a` in `ABBREVIATIONS` with

        buf[start : start + len(a)] == a
        AND start <= idx <= start + len(a)

    where `start` is the dotted-token's left boundary. This handles
    BOTH the internal '.' of a multi-dot abbreviation (idx strictly
    inside the entry's span) AND the trailing '.' immediately after
    the entry (idx == start + len(a)).
    """
    # Left boundary of the dotted-token.
    start = idx
    while start > 0 and not _is_dotted_token_boundary(buf[start - 1]):
        start -= 1
    # Right boundary (used only to cap candidate entry length).
    n = len(buf)
    end = idx
    while end < n and not _is_dotted_token_boundary(buf[end]):
        end += 1
    span_left  = idx - start
    span_right = end - start
    for a in ABBREVIATIONS:
        L = len(a)
        # Entry must cover idx (L >= span_left) and fit in the
        # dotted-token (L <= span_right).
        if L < span_left or L > span_right:
            continue
        if buf[start:start + L] == a:
            return True
    return False


def segment_sentences(p: str) -> list[str]:
    sents, n = [], len(p)
    i = 0
    while i < n and p[i].isspace():
        i += 1
    start, stack = i, []
    while i < n:
        ch = p[i]
        # quote depth (§5.5)
        if ch in QUOTE_OPENERS and (
            ch not in QUOTE_CLOSERS or not stack or stack[-1] != ch
        ):
            stack.append(QUOTE_OPEN_TO_CLOSE[ch]); i += 1; continue
        if ch in QUOTE_CLOSERS and stack and stack[-1] == ch:
            stack.pop(); i += 1; continue
        if stack:
            i += 1; continue
        if ch not in TERMINATORS:
            i += 1; continue
        terminates = True
        if ch == ".":
            prev_c = p[i - 1] if i > 0 else ""
            next_c = p[i + 1] if i + 1 < n else ""
            if prev_c.isdigit() and next_c.isdigit():    # §5.3 decimal
                terminates = False
            else:
                # §5.4 ellipsis-with-space
                rs = i
                while rs > 0 and p[rs - 1] == ".":
                    rs -= 1
                re = i
                while re + 1 < n and p[re + 1] == ".":
                    re += 1
                if re - rs + 1 >= 3:
                    if i != re:
                        terminates = False
                    else:
                        after = p[re + 1] if re + 1 < n else ""
                        terminates = (after == "" or after.isspace())
                else:
                    # §5.2 / §5.11 (Decision 25) — multi-dot aware.
                    if _dotted_token_match(p, i):
                        terminates = False
        if not terminates:
            i += 1; continue
        sents.append(p[start:i + 1])                     # §5.7
        i += 1
        while i < n and p[i].isspace():
            i += 1
        start = i
    if start < n and p[start:n].strip() != "":           # §5.9 / §5.10
        sents.append(p[start:n])
    return sents


def leaf_hash(profile, leaf_id, value, salt_bytes):      # §7
    return hashlib.sha256(
        profile.encode("utf-8") + b"\x00"
        + leaf_id.encode("utf-8") + b"\x00"
        + value.encode("utf-8") + b"\x00"
        + salt_bytes
    ).digest()


def merkle_root_and_levels(leaves):
    levels = [list(leaves)]
    cur = list(leaves)
    while len(cur) > 1:
        nxt = []
        for i in range(0, len(cur), 2):
            if i + 1 < len(cur):
                nxt.append(hashlib.sha256(cur[i] + cur[i + 1]).digest())
            else:
                nxt.append(cur[i])                       # odd: promote
        levels.append(nxt)
        cur = nxt
    return cur[0], levels

12. Out of scope for v1

The following are explicitly out of scope for v1 of this profile. Any of them may motivate a future vN+1 profile under a separate decision record; none can be retrofitted into v1.

Questions about this specification? Email hello@satsignal.cloud.