satsignal.text.paragraph_sentence.v1 — sentence-level selective disclosure of UTF-8 plaintext

DEPRECATED / INERT (read this first). This satsignal.text.paragraph_sentence.v1 dotted profile is deprecated and inert. It defined a salted rule with one leaf per sentence, addressed by paragraph and sentence index (p<N>/s<M>). No production flow emits or consumes it. Its allowlist literal is retained forever (an allowlist literal is never removed) and its frozen corpus is kept solely as a regression-guard record — it is never produced or verified by any live path. The rules below stay frozen for that regression corpus; do not implement them for new work. Live successor: text-line-v1. New text disclosures use the native text-line-v1 literal, which binds to the chunk_merkle a .txt / .md anchor already commits. Note the granularity difference: this deprecated profile segments by sentence (salted, paragraph/sentence indexed); the live text-line-v1 segments by line (text-norm-v1 canon, drop empty lines, bare/sealed leaf, native merkle binding — no re-anchor, no salt keyfile in standard mode). The two cannot interbind. Authority for the deprecation: disclosure-v1 §11.

Versioning (2026-05-27). This is satsignal.text.paragraph_sentence.v1. The profile literal is fixed at "satsignal.text.paragraph_sentence.v1" and appears verbatim inside every leaf-hash preimage (see §7). Profile literals are forever-contracts: once any client has anchored under this literal, every segmentation, canonicalization, normalization, salting, and leaf_id construction rule below is fixed for that literal forever. A bug in any of these rules cannot be patched in place; the only remedy is a new satsignal.text.paragraph_sentence.v2 profile that compatible verifiers must support in parallel. Breaking shape changes ship as v2, never as a quiet v1 mutation. This spec is the authoritative source for v1 rules; the master disclosure spec at disclosure-v1.md is the home for cross-profile plumbing (manifest shape, merkle invariants, verifier contract).

Status: draft 1, 2026-05-27. Audience: integrators anchoring UTF-8 plaintext documents (contracts, transcripts, articles, policy texts) who anticipate later disclosing specific sentences without revealing the rest; verifier authors who must reproduce this profile's segmentation byte-for-byte against an anchored leaf-set. Goal: define exactly one segmentation, canonicalization, leaf-id, and leaf-hash construction for sentence-level disclosure of UTF-8 plaintext — pinned to the byte, with a fixture set covering the adversarial cases that motivated the forever-contract discipline.

1. Why this exists

satsignal.disclosure.v1 lets an anchorer publish a redacted view of an already-anchored document with cryptographic proof that the revealed fragments were members of the original leaf-set. That construction needs a segmentation rule: somebody has to decide where a sentence starts and ends, because the revealed-leaf membership check recomputes the leaf hash from the exact sentence bytes the original anchor committed to. If two implementations disagree on where a sentence ends, they disagree on its leaf hash, and the disclosure fails to verify against an honest anchor.

Plaintext contracts, transcripts, news articles, policy texts, and similar UTF-8 documents are the headline driver. The disclosure builder ingests the document, runs this profile's canonicalization and segmentation, hashes each sentence into a leaf, builds the merkle tree, and anchors the root through the standard path. Later — at any time — the anchorer picks a subset of sentences, ships a disclosure bundle with their values, salts, and merkle proofs, and a verifier walks each proof to the on-chain-bound root. Sentences that are not disclosed never leave the anchorer's machine.

Sentence-level granularity is the operative leaf-type in v1. Paragraph-level disclosure is not a distinct leaf-type; it is achieved by disclosing every sentence in the chosen paragraph. Word-level and token-level disclosure are out of scope for v1 (§12).

2. Scope and prerequisites

This profile applies to a single UTF-8 plaintext input — a document without structural markup, treated as a sequence of paragraphs of sentences. Markdown, HTML, RTF, PDF, DOCX, and any other shape with encoded structure are not in scope: an anchorer who wants to disclose from such a source MUST first export it to plaintext under their own rules, then anchor that plaintext under this profile. The exported plaintext is the document this profile operates on; the original markup-bearing source is not.

Per disclosure-v1.md §2, selective disclosure is only possible for original anchors that were committed under a leaf-set scheme. The original canonical doc MUST carry subject.proofs.chunk_merkle.{scheme,algo,leaf_count,root} with scheme == "satsignal.text.paragraph_sentence.v1" and algo == "sha256". A pure-byte_exact anchor cannot be selectively disclosed under this profile; an anchorer planning a future disclosure MUST anchor the document under this profile from the start.

3. Inputs and canonicalization

Canonicalization is applied to the raw input bytes before segmentation, leaf extraction, or hashing. Every step below is forever-pinned for v1.

3.1 Encoding

Decision (forever): the input is UTF-8. Invalid UTF-8 fails closed — the disclosure builder MUST refuse to anchor a non-UTF-8 input under this profile. Rationale: pinning one encoding eliminates "which decoder did we use" as a cross-implementation drift surface.

3.2 Byte Order Mark

Decision (forever): a single leading UTF-8 BOM (0xEF 0xBB 0xBF) is stripped if present. No BOM elsewhere is recognized as a BOM. Rationale: many editors emit a BOM that is not part of the author's intended content; treating it as content would change every leaf hash for files saved through such an editor.

3.3 Empty input

Decision (forever): an input that canonicalizes to zero sentences is invalid input. The builder MUST refuse to anchor it. An empty disclosure has no leaves, no root, and no useful semantics; admitting it would force leaf_count = 0 corner cases into every verifier. Rationale: failing closed at the builder is simpler than carrying a zero-leaves edge case forever.

3.4 Line-ending normalization

Decision (forever): any of CR (0x0D), LF (0x0A), or CRLF (0x0D 0x0A) is normalized to a single LF (0x0A) before any further processing. Rationale: a contract saved on Windows and the same contract saved on Linux must produce the same leaves.

3.5 Unicode normalization

Decision (forever): the input is normalized to NFC (Normalization Form C, defined in Unicode Standard Annex #15) before any further processing. Rationale: a composed é (U+00E9) and a decomposed é (U+0065 U+0301) display identically and must hash identically; NFC is the normalization form most widely recommended for text interchange. See fixture C9 for the byte-level demonstration.

3.6 Trailing whitespace per line

Decision (forever): trailing space (0x20) and tab (0x09) characters are stripped from each line before paragraph splitting. Other whitespace categories (vertical tab, form feed, NBSP, etc.) are not stripped. Rationale: editors and CI tools routinely add or remove trailing spaces; treating them as content invites silent drift.

3.7 Trailing newline at EOF

Decision (forever): a single trailing LF (0x0A) at end-of-input is stripped if present, so an input ending in "foo\n" and one ending in "foo" produce identical canonical bytes. Interior empty lines are preserved exactly. Rationale: many editors append a final newline as a POSIX convention; treating it as content would change leaves depending on editor save behavior.

3.8 Multiple internal whitespace

Decision (forever): internal whitespace inside a sentence is not collapsed. "foo  bar" (two spaces) and "foo bar" (one space) are different sentences with different leaf hashes. Rationale: collapsing would hide deliberate spacing differences (poetry, code blocks copied into prose, intentional emphasis) that the anchorer may need to disclose as written.
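The canonicalization steps of §3.1 through §3.7 can be sketched as a single function. This is a minimal illustration, not the reference implementation: the function name canonicalize is hypothetical, and applying the steps in section order is an assumption of this sketch. The zero-sentence check of §3.3 is left to the caller because it depends on segmentation.

```python
import unicodedata

UTF8_BOM = b"\xef\xbb\xbf"


def canonicalize(raw: bytes) -> str:
    """Apply §3.1-§3.7 in section order (ordering is an assumption here)."""
    # §3.2: strip one leading BOM if present; later BOMs stay as content.
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # §3.1: strict decoding, so invalid UTF-8 raises UnicodeDecodeError.
    text = raw.decode("utf-8", errors="strict")
    # §3.4: replace CRLF first so it becomes one LF, then any lone CR.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # §3.5: Unicode NFC.
    text = unicodedata.normalize("NFC", text)
    # §3.6: strip only trailing space and tab from each line.
    text = "\n".join(line.rstrip(" \t") for line in text.split("\n"))
    # §3.7: strip a single trailing LF at end of input.
    if text.endswith("\n"):
        text = text[:-1]
    return text
```

Internal whitespace is untouched, so the two-space and one-space variants of §3.8 remain distinct after this function runs.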

4. Paragraph segmentation

4.1 Paragraph splitting

Decision (forever): after canonicalization, a paragraph is a maximal run of non-empty lines separated by one or more empty lines. Empty lines are NOT paragraphs and produce no leaves. So "a\n\nb\n\n\nc" produces three paragraphs (a, b, c); the two empty lines between b and c collapse into a single boundary. Rationale: the empty-line-as-paragraph-break convention is universal in plaintext sources; this rule pins it.

4.2 Paragraph numbering

Decision (forever): paragraphs are zero-indexed in document order. The first paragraph is p0, the second p1, and so on. Rationale: zero-indexing matches the convention of most programming languages and leaves no ambiguity about the first index.

4.3 Within-paragraph line joining

Decision (forever): when a paragraph spans multiple non-empty lines (no empty line between them), those lines are joined with a single ASCII space (0x20) before sentence segmentation runs. So "a\nb" becomes the segmentation-input string "a b". Rationale: in plaintext sources a mid-paragraph \n is overwhelmingly a soft wrap, not a sentence break; joining with a space matches author intent. The \n characters do not appear in any sentence's bytes.
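A short sketch of §4.1 and §4.3 together, assuming the input has already been canonicalized per §3. The name split_paragraphs is hypothetical, and treating only zero-length lines as empty is this sketch's reading of "empty line" after the trailing space and tab strip of §3.6.

```python
def split_paragraphs(canonical: str) -> list[str]:
    """Return one joined string per paragraph, in document order."""
    paragraphs: list[str] = []
    current: list[str] = []
    for line in canonical.split("\n"):
        if line == "":
            # One or more empty lines close the current paragraph (§4.1).
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        # Lines of one paragraph are joined with a single 0x20 (§4.3).
        paragraphs.append(" ".join(current))
    return paragraphs
```

The list index of each returned string is the paragraph number N used in p<N> (§4.2).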

5. Sentence segmentation — the hard part

Sentence segmentation runs over each paragraph's joined string (per §4.3) independently. The rules below are forever-pinned and apply in the order written.

5.1 Terminator codepoints

Decision (forever): the following code points are the only sentence-ending characters in v1:

Codepoint	Hex	Name
`.`	U+002E	FULL STOP
`?`	U+003F	QUESTION MARK
`!`	U+0021	EXCLAMATION MARK
`…`	U+2026	HORIZONTAL ELLIPSIS
`。`	U+3002	IDEOGRAPHIC FULL STOP
`？`	U+FF1F	FULLWIDTH QUESTION MARK
`！`	U+FF01	FULLWIDTH EXCLAMATION MARK

Any future addition is a v2 profile. Other punctuation (semicolons, colons, em-dashes, single dashes) does not terminate.

Rationale: this set covers Latin-script, CJK, and fullwidth Latin without admitting language-specific heuristics that would not survive contact with a new corpus.

5.2 Abbreviation list (DO-NOT-TERMINATE)

Decision (forever): a . (U+002E) is not a terminator when it matches the forever-frozen abbreviation list below under the matching algorithm specified in §5.11 (Decision 25 — multi-dot abbreviation handling). Match is case-sensitive. The list itself is frozen:

Mr   Mrs  Ms   Mx   Dr   Jr   Sr   St   Mt   Ft
vs   etc  e.g  i.e  cf   viz  Inc  Ltd  Co   Corp
LLC  LLP  PLC  GmbH S.A  Pty  No   Vol  pp   p
ch   sec  fig  eq   Prof Rev  Hon  Capt Gen  Lt
Sgt  Maj  Col  Ave  Blvd Rd   Ln   Ct

The carve-out applies only to .; other terminators (?, !, …, 。, ？, ！) are never affected by this list. Adding to the list or changing the matching algorithm in §5.11 requires a v2 profile.

Rationale: every plausible abbreviation must be enumerable; relying on capitalization heuristics or trained models would put the segmentation under a black-box dependency that cannot be reproduced byte-for-byte by an unaffiliated verifier. The list is small and tractable; any addition mints a new profile.

Multi-dot entries note. Three entries in the list — e.g, i.e, S.A — contain an internal . codepoint. The matching algorithm in §5.11 is the only authoritative reading of how these entries shield candidate . characters in the text being segmented; the plain-language phrasing "the word immediately preceding" used in earlier drafts of this spec is insufficient for multi-dot entries and is superseded by §5.11. See fixtures C2 / C14 (single-dot abbreviations) and C15 (multi-dot abbreviations) for byte-level demonstrations.

5.3 Decimal-number rule

Decision (forever): a . (U+002E) is not a terminator when the character immediately before it is [0-9] and the character immediately after it is [0-9]. So 3.14 does not terminate; 5. followed by a space (no digit after) does terminate. Rationale: decimal numbers in prose are unambiguous and trivially recognizable; treating their . as a terminator would split $3.14 yesterday into two sentences.
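The decimal check is small enough to show directly. This is an illustrative sketch with a hypothetical helper name, is_decimal_dot. It compares against the ASCII digits only, because the rule names [0-9]; Python's str.isdigit would also accept non-ASCII digits and is deliberately avoided.

```python
ASCII_DIGITS = "0123456789"


def is_decimal_dot(buf: str, idx: int) -> bool:
    """True when buf[idx] is a '.' with an ASCII digit on both sides."""
    if buf[idx] != ".":
        return False
    if idx == 0 or idx + 1 >= len(buf):
        # No character before or after the dot, so it cannot be a decimal.
        return False
    return buf[idx - 1] in ASCII_DIGITS and buf[idx + 1] in ASCII_DIGITS
```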

5.4 Ellipsis-with-space rule

Decision (forever): a run of three or more . (U+002E) characters terminates if the run is followed by whitespace or by end-of-paragraph. A run of three or more . followed by a non-whitespace, non-. character (a "mid-token" ellipsis) does not terminate. Only the last . of a terminating run is the terminator; intermediate dots are part of the sentence bytes ("foo..." includes all three dots).

Note that this rule and §5.3 (decimal) do not collide: a run of three or more . cannot be a decimal because the immediately-following character of the first dot would itself be ., not a digit.

Rationale: ellipsis is the only multi-codepoint terminator allowed in v1; pinning the rule to "three or more, terminator on whitespace boundary" handles both "She paused... Then she spoke." (terminating) and "She was...uncertain." (mid-token, single sentence) without heuristics.
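The ellipsis rule can be sketched as a function over the start of a dot run. The name ellipsis_terminator is hypothetical, and using str.isspace as the whitespace test is an assumption of this sketch. A None result means the run is not a terminating ellipsis, so the other rules of this section decide.

```python
from typing import Optional


def ellipsis_terminator(buf: str, run_start: int) -> Optional[int]:
    """Return the index of the terminating dot of a run of 3+ dots, or None."""
    run_end = run_start
    while run_end < len(buf) and buf[run_end] == ".":
        run_end += 1  # run_end ends one past the last dot of the run
    if run_end - run_start < 3:
        return None  # fewer than three dots: not an ellipsis run
    if run_end == len(buf) or buf[run_end].isspace():
        return run_end - 1  # only the last dot is the terminator
    return None  # mid-token ellipsis does not terminate
```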

5.5 Quoted-speech rule

Decision (forever): a terminator that sits inside a quoted span does NOT end the sentence. The sentence ends at the next terminator after the closing quote (under all other rules in this section). So "\"Did you go?\" she asked." is ONE sentence ending at the . after asked, not two.

Recognized quote pairs (forever list):

Open	Close	Codepoints
`"`	`"`	U+0022 (self-paired)
`'`	`'`	U+0027 (self-paired)
`“`	`”`	U+201C / U+201D
`‘`	`’`	U+2018 / U+2019
`«`	`»`	U+00AB / U+00BB
`「`	`」`	U+300C / U+300D

Quote depth is tracked with a simple paired-quote counter: incrementing on an opener and decrementing on the matching closer. While depth > 0, terminator codepoints encountered are passed through as sentence bytes without terminating. The two ASCII straight quotes (U+0022, U+0027) are self-paired — they alternate open/close based on current depth at that level (depth 0 → opener; depth 1 of the same self-paired character → closer). A typographic-pair quote (U+201C/U+201D, etc.) only closes a span opened by its mate.

Mismatched / unclosed quotes (e.g. "Hello.) leave depth > 0 to end-of-paragraph; the paragraph then terminates per §5.10 with its trailing content as a single sentence. Rationale: this is a deterministic fallback that does not require backtracking and does not introduce heuristics.

Rationale: prose with embedded direct speech is the single most common adversarial case; the simple paired-depth tracker handles it without invoking grammar.
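One way to read the depth tracker is as a stack of expected closers. The sketch below is that reading, not the reference implementation: the stack representation and the function name terminators_outside_quotes are assumptions. It reports only the terminator codepoints of §5.1 that sit at depth 0, and it ignores the abbreviation, decimal, and ellipsis rules.

```python
# Opener -> closer, per the forever list in §5.5.
QUOTE_PAIRS = {
    '"': '"', "'": "'",
    "\u201c": "\u201d", "\u2018": "\u2019",
    "\u00ab": "\u00bb", "\u300c": "\u300d",
}
TERMINATORS = frozenset(".?!\u2026\u3002\uff1f\uff01")


def terminators_outside_quotes(paragraph: str) -> list[int]:
    """Indices of terminator codepoints encountered while quote depth is 0."""
    expected_closers: list[str] = []  # assumption: depth modelled as a stack
    found: list[int] = []
    for i, c in enumerate(paragraph):
        if expected_closers and c == expected_closers[-1]:
            expected_closers.pop()  # matching closer: depth decreases
        elif c in QUOTE_PAIRS:
            expected_closers.append(QUOTE_PAIRS[c])  # opener: depth increases
        elif c in TERMINATORS and not expected_closers:
            found.append(i)
    return found
```

With an unclosed quote the stack never empties, so no terminator is reported and the whole remainder falls through to the trailing-content handling.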

5.6 Trailing whitespace after a terminator

Decision (forever): a terminator that is followed by zero, one, or many whitespace characters and then either end-of-paragraph or the start of the next sentence terminates a sentence under the rules of §§5.1–5.5. Capitalization of the next word is NOT used as a heuristic. A terminator followed by whitespace and a lowercase letter still terminates; the abbreviation list of §5.2 is the only carve-out for ambiguous cases. Rationale: capitalization heuristics are language-specific and produce non-reproducible behavior on mixed-script and informal text; ruling them out keeps the segmentation byte-deterministic.

5.7 Sentence-value definition

Decision (forever): a sentence's canonical bytes are the UTF-8 bytes from the first non-whitespace character after the previous terminator (or paragraph start) through and INCLUDING the terminator character itself. Trailing whitespace after the terminator is in no sentence's bytes.

So "Hello. World!" produces two sentences:

p0/s0 = "Hello." (6 bytes)
p0/s1 = "World!" (6 bytes)

The single space between them is in neither sentence's bytes.

Rationale: this rule makes the leaf bytes a contiguous substring of the post-canonicalization paragraph; it is reproducible without any state beyond the terminator position.

5.8 Sentence numbering

Decision (forever): sentences are zero-indexed within their paragraph. The first sentence of paragraph p3 is p3/s0. Rationale: matches §4.2.

5.9 Paragraph with no terminators

Decision (forever): if canonicalization produces a paragraph that contains no terminator codepoints AND is non-empty (a fragment without final punctuation), the paragraph is treated as ONE sentence whose value is the full paragraph bytes (no synthetic terminator is added). See fixture C13.

So "This has no terminator" produces:

p0/s0 = "This has no terminator" (22 bytes, no trailing dot)

Rationale: refusing to anchor unterminated fragments would block real documents (headers, titles, bullet lists); synthesizing a terminator would create bytes the anchorer never wrote. Accepting the fragment as-is avoids both bad outcomes.

5.10 Trailing content after the last terminator

Decision (forever): if a paragraph contains terminators and is then followed by non-empty content after the final terminator (e.g. "First. Second" — no trailing dot), the trailing content is emitted as an additional sentence with no terminator in its bytes. The rule of §5.9 generalizes: any contiguous post-terminator non-empty content is a sentence whose value is exactly those bytes. Rationale: identical reasoning to §5.9; refusing to emit it loses content the anchorer wrote.
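Given the indices of the accepted terminators in a paragraph, the value rules of §5.7, §5.9, and §5.10 reduce to slicing. The sketch below assumes the terminator indices were already decided by the other rules; the name sentence_values is hypothetical, and str.lstrip is used as the whitespace test by assumption.

```python
def sentence_values(paragraph: str, terminator_indices: list[int]) -> list[str]:
    """Slice a joined paragraph into sentence values, in order."""
    values: list[str] = []
    start = 0
    for idx in terminator_indices:
        # From the first non-whitespace character through the terminator (§5.7).
        value = paragraph[start:idx + 1].lstrip()
        if value:
            values.append(value)
        start = idx + 1
    # Content after the last terminator, or a paragraph with no terminator
    # at all, is emitted as-is with no synthetic terminator (§5.9, §5.10).
    tail = paragraph[start:].lstrip()
    if tail:
        values.append(tail)
    return values
```

The list index of each value is the sentence number M used in s<M>.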

5.11 Decision 25 — Multi-dot abbreviation handling

Decision (forever). A candidate . (U+002E) at position idx within a paragraph string is shielded from being a sentence terminator (per §5.2) iff there exists an entry a in the abbreviation list of §5.2 such that

a == buf[start : start + len(a)] AND start <= idx <= start + len(a)

where start is the left boundary of the dotted-token that contains idx. The dotted-token is the maximal run of non-(whitespace | quote | bracket) characters around idx, defined precisely as follows:

A character c is a dotted-token boundary iff
- c.isspace() is true (Python str.isspace semantics on the post-canonicalization NFC string — covers U+0020, U+0009, U+00A0, etc.), OR
- c is a recognized quote opener or closer per §5.5 (one of ", ', “, ”, ‘, ’, «, », 「, 」), OR
- c is one of (, ), [, ], {, } (ASCII brackets).
start is the smallest index s ≤ idx such that every character in buf[s : idx] is not a dotted-token boundary.

The bracket characters ()[]{} are admitted as dotted-token boundaries to keep the rule sensible for prose patterns such as "(e.g. red ones)" (see EX4 below). None of the forever-frozen abbreviation entries in §5.2 contains a bracket, a quote, or a whitespace character, so widening the boundary set this way cannot make a previously-recognized abbreviation stop matching.

Coverage of the two failure modes the rule corrects. The algorithm above handles both internal and trailing dots of multi-dot abbreviations:

Internal . of a multi-dot abbreviation (e.g. the first . of e.g., between e and g): the entry e.g satisfies buf[start : start + 3] == "e.g" and start <= idx < start + 3, so the candidate . is shielded.
Trailing . of a multi-dot abbreviation (e.g. the second . of e.g., after g): the entry e.g again satisfies buf[start : start + 3] == "e.g", and the candidate . sits at idx == start + 3, so the candidate . is shielded.

Single-dot abbreviations (Mr, vs, …) are a degenerate case of the same rule: the entry contains no internal dot, the candidate . is the dot immediately after the abbreviation, and idx == start + len(a) (for Mr and vs, len(a) == 2 and idx == start + 2). This matches the prior behavior of §5.2 on single-dot entries (fixtures C2, C14).

Helper name (normative). The reference Python in §11.16 names this helper _dotted_token_match. It supersedes the prior _preceding_word helper, which examined only buf[start : idx] (the prefix ending strictly at idx) and therefore could not recognize an abbreviation whose internal . lies at the candidate position. _preceding_word is removed from v1.

Invariant. _dotted_token_match consults only:

the position idx,
the paragraph buffer buf (the post-canonicalization NFC string, indexed by character),
the dotted-token boundary character set defined above, and
the forever-frozen abbreviation list of §5.2.

A future profile version (v2) is required to change the abbreviation list, the dotted-token boundary character set, or the matching algorithm. None of those three can be changed under v1.

Worked examples (forever-pinned). The corrected algorithm produces, for the inputs below, exactly the segmentation shown. These four examples (EX1 through EX4) are normative test cases for any verifier implementation; they are demonstrated in §11.15 (fixture C15) and in the prose walkthrough below:

Example	Input	Result
EX1	`He said e.g. before. And i.e. after.`	2 sentences: `He said e.g. before.` · `And i.e. after.`
EX2	`It's e.g. a test.`	1 sentence: `It's e.g. a test.`
EX3	`S.A. de C.V. holdings expanded.`	3 sentences: `S.A. de C.` · `V.` · `holdings expanded.`
EX4	`Apples (e.g. red ones) are good.`	1 sentence: `Apples (e.g. red ones) are good.`

The S.A. de C.V. case is non-obvious and is deliberately pinned here as a forever contract. The reasoning, walked through:

S.A is on the abbreviation list; C.V is not (no addition is permitted under v1 per §5.2). Per §5.6 (Decision 17), next-word capitalization is not used as a heuristic — the lowercase next word de after S.A. does not influence the decision.
The . immediately after S (the first dot of S.A.) is shielded because the dotted-token starting at S matches the entry S.A and the candidate . lies inside that match's span.
The . immediately after A (the trailing dot of S.A.) is shielded because the dotted-token still matches S.A and the candidate . lies at idx == start + 3 == start + len("S.A").
The . immediately after C is not shielded: the dotted-token starting at C is C.V., and no abbreviation entry in §5.2 is a prefix of C.V. (in particular, neither C nor C.V is in the list). So this . does terminate.
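The . immediately after V is not shielded: under the maximal-run definition above, the dotted-token containing it is still C.V. (its start is at C, because . is not a dotted-token boundary), and no abbreviation entry in §5.2 matches at that start. Neither V nor C.V is in the list. So this . also terminates.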

Result: three sentences. This is the segmentation v1 produces; documents that need a different segmentation around C.V must wait for a future profile that extends the list or the matching algorithm.

For Apples (e.g. red ones) are good., the open and close parens are dotted-token boundaries per the rule above; the e.g. dotted-token is bounded by ( on the left and a space on the right. The matching algorithm shields both dots of e.g. as in EX1, so the whole input is one sentence.
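The shielding test defined in this section can be sketched in a few lines. This is an illustration written from the prose above, not the reference Python of §11.16; the names is_boundary, dotted_token_start, and is_shielded are hypothetical. It answers only whether a given . is shielded by the abbreviation list and does not apply the decimal, ellipsis, or quote rules.

```python
# The frozen list of §5.2, entries separated by whitespace.
ABBREVIATIONS = frozenset("""
Mr Mrs Ms Mx Dr Jr Sr St Mt Ft vs etc e.g i.e cf viz Inc Ltd Co Corp
LLC LLP PLC GmbH S.A Pty No Vol pp p ch sec fig eq Prof Rev Hon Capt
Gen Lt Sgt Maj Col Ave Blvd Rd Ln Ct
""".split())

# Quote characters of §5.5 plus the ASCII brackets.
BOUNDARY_CHARS = frozenset("\"'\u201c\u201d\u2018\u2019\u00ab\u00bb\u300c\u300d()[]{}")


def is_boundary(c: str) -> bool:
    return c.isspace() or c in BOUNDARY_CHARS


def dotted_token_start(buf: str, idx: int) -> int:
    """Smallest s <= idx with no boundary character in buf[s:idx]."""
    s = idx
    while s > 0 and not is_boundary(buf[s - 1]):
        s -= 1
    return s


def is_shielded(buf: str, idx: int) -> bool:
    """True when some entry matches at the token start and its span covers idx."""
    start = dotted_token_start(buf, idx)
    return any(
        buf.startswith(a, start) and start <= idx <= start + len(a)
        for a in ABBREVIATIONS
    )
```

Listing the unshielded dots of each worked example is a quick sanity check: under this sketch, the S.A. de C.V. input leaves exactly three of them, after C, after V, and at the end.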

6. Leaf id construction

Decision (forever): each sentence's leaf_id is the literal string p<N>/s<M> where N is the zero-indexed paragraph number and M is the zero-indexed sentence number within that paragraph. Neither number is zero-padded.

Examples:

p0/s0 — first sentence of the first paragraph
p3/s4 — fifth sentence of the fourth paragraph
p12/s0 — first sentence of the thirteenth paragraph

Rationale: this is the address every fixture uses; the bare integers keep the leaf_id short and human-readable.
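As a small illustration, leaf ids can be generated from a nested list of sentence values (one inner list per paragraph). The function name leaf_ids and the nested-list input shape are assumptions of this sketch.

```python
def leaf_ids(paragraphs: list[list[str]]) -> list[str]:
    """Return p<N>/s<M> ids in document order, with no zero-padding."""
    return [
        f"p{n}/s{m}"
        for n, sentences in enumerate(paragraphs)
        for m, _ in enumerate(sentences)
    ]
```

For example, leaf_ids([["a.", "b."], ["c."]]) returns ["p0/s0", "p0/s1", "p1/s0"].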

7. Leaf-hash preimage

Decision (forever): the leaf-hash preimage is built from four byte strings separated by single 0x00 bytes, in this exact order:

leaf_hash = SHA-256(
      profile_literal_utf8        // "satsignal.text.paragraph_sentence.v1" as UTF-8
   || 0x00                        // separator
   || leaf_id_utf8                // e.g. "p3/s0" as UTF-8
   || 0x00                        // separator
   || sentence_value_utf8         // NFC-normalized sentence bytes,
                                  //   including the terminator
                                  //   character itself
   || 0x00                        // separator
   || salt_bytes                  // base64-decoded RAW BYTES from salt_b64
                                  //   (NOT the base64 ASCII)
)

The output is the raw 32-byte SHA-256 digest; on the wire it appears as 64 lowercase hex characters per disclosure-v1.md §3.4.

7.1 Worked example

Take p0/s0 of fixture C1 (full fixture in §11):

profile_literal_utf8 = 73 61 74 73 69 67 6e 61 6c 2e 74 65 78 74 2e 70 61 72 61 67 72 61 70 68 5f 73 65 6e 74 65 6e 63 65 2e 76 31 (36 bytes)
separator = 00 (1 byte)
leaf_id_utf8 = 70 30 2f 73 30 (5 bytes; "p0/s0")
separator = 00 (1 byte)
sentence_value_utf8 = 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e (12 bytes; "Hello world.")
separator = 00 (1 byte)
salt_bytes = 43 31 2d 73 61 6c 74 2d 61 61 61 61 61 61 61 61 (16 bytes; the raw bytes whose base64 is QzEtc2FsdC1hYWFhYWFhYQ==)

Total preimage length: 72 bytes. SHA-256 of those bytes is the C1 p0/s0 leaf_hash:

a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae
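The preimage layout can be reproduced with the Python standard library. This is a minimal sketch with hypothetical function names (leaf_preimage, leaf_hash); it assumes the sentence value passed in is already the NFC-normalized canonical string.

```python
import base64
import hashlib

PROFILE_LITERAL = "satsignal.text.paragraph_sentence.v1"


def leaf_preimage(leaf_id: str, sentence_value: str, salt_b64: str) -> bytes:
    """profile || 0x00 || leaf_id || 0x00 || value || 0x00 || raw salt bytes."""
    # The salt enters the preimage as decoded raw bytes, not base64 text.
    salt = base64.b64decode(salt_b64, validate=True)
    return b"\x00".join([
        PROFILE_LITERAL.encode("utf-8"),
        leaf_id.encode("utf-8"),
        sentence_value.encode("utf-8"),
        salt,
    ])


def leaf_hash(leaf_id: str, sentence_value: str, salt_b64: str) -> str:
    """64 lowercase hex characters of the SHA-256 digest."""
    return hashlib.sha256(leaf_preimage(leaf_id, sentence_value, salt_b64)).hexdigest()


preimage = leaf_preimage("p0/s0", "Hello world.", "QzEtc2FsdC1hYWFhYWFhYQ==")
print(len(preimage))  # 72, matching the byte count in the worked example
print(leaf_hash("p0/s0", "Hello world.", "QzEtc2FsdC1hYWFhYWFhYQ=="))
```

The printed digest can be compared against the C1 p0/s0 value listed above.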

7.2 Pin-points

The separators are single 0x00 bytes, not strings of nulls or any other delimiter. The preimage contains exactly three of them, one between each adjacent pair of fields; a doubled or missing separator is malformed.
The salt_bytes are the raw decoded salt, not the base64 text. A verifier that hashes the base64 ASCII will produce a different (wrong) digest.
profile_literal_utf8 is the literal string satsignal.text.paragraph_sentence.v1 encoded as UTF-8 — 36 ASCII bytes. It is inside every leaf hash; this is what makes the profile literal a forever-contract.
sentence_value_utf8 is the NFC-normalized canonical bytes per §3.5, including the trailing terminator codepoint (per §5.7).

Cross-profile consistency: the preimage shape above (profile || 0x00 || leaf_id || 0x00 || value || 0x00 || salt) is the same layout used by sibling profiles satsignal.csv.row.v1 and satsignal.json.field.v1. A verifier implementing all three shares the same outer hashing routine; only the value-canonicalization rule differs per profile.

8. Salts

8.1 Salt size

Decision (forever): 16 raw bytes per leaf. Rationale: 128 bits of entropy is sufficient to make per-leaf candidate-value attacks infeasible while keeping the per-leaf overhead small.

8.2 Salt uniqueness and persistence

Decision (forever): each leaf's salt is generated by a CSPRNG and is unique per leaf (no two leaves in a single document share a salt). The anchorer MUST persist the salt off-chain alongside the sentence value: without the salt, the disclosure cannot be regenerated and the merkle proof cannot be reconstructed. Rationale: the salt is what prevents an attacker who has the anchored root and a candidate value from confirming whether the candidate matches an undisclosed leaf — without the salt, they cannot recompute the leaf hash. Salts are the privacy primitive of selective disclosure under this profile.

The salt is transported on the wire (when a leaf is disclosed) as salt_b64 per disclosure-v1.md §3.4: base64 of the raw 16 bytes. Undisclosed leaves' salts are NEVER transmitted.
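Salt generation as described in this section can be sketched with the standard secrets module, which draws from the operating system's CSPRNG. The function name new_salt_b64 is hypothetical, and how the anchorer persists the result is outside this sketch.

```python
import base64
import secrets

SALT_LEN = 16  # raw bytes per leaf (§8.1)


def new_salt_b64() -> str:
    """Generate one fresh 16-byte salt and return its base64 wire form."""
    return base64.b64encode(secrets.token_bytes(SALT_LEN)).decode("ascii")


# One independent salt per leaf; each must be stored next to its sentence value.
salts = {leaf_id: new_salt_b64() for leaf_id in ["p0/s0", "p0/s1"]}
```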

9. Merkle behavior (cross-reference)

The merkle-tree construction, proof-path encoding, single-leaf-tree rule, odd-node promote-unchanged rule, and proof-walk algorithm are defined once in disclosure-v1.md §3.4. This profile does not re-derive them.

Profile-specific notes:

Leaf ordering (per disclosure-v1.md §3.4 invariant 4): leaves appear in document order — p0/s0, p0/s1, …, p0/sK, p1/s0, p1/s1, …, pN/sM. The verifier does NOT re-sort; the disclosure carries leaves in document order.
Hash algorithm: SHA-256, 64-character lowercase hex on the wire, raw-bytes concatenation at proof-walk time (the disclosure spec's hex-vs-bytes invariant 1 applies verbatim).

10. Original anchor binding

When the original document is anchored under this profile, the canonical doc's subject.proofs.chunk_merkle block MUST carry:

Field	v1 value
`scheme`	`"satsignal.text.paragraph_sentence.v1"`
`algo`	`"sha256"`
`leaf_count`	(positive integer; number of sentence leaves)
`root`	(64-char lowercase hex; merkle root of leaf-set)

A disclosure bundle's manifest.disclosure.linked_anchor.subject_profile MUST equal "satsignal.text.paragraph_sentence.v1", and every revealed[i].profile in the same block MUST equal that literal. A verifier that finds a mismatch fails closed per disclosure-v1.md §7 with profile_mismatch.
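A shape check for the chunk_merkle block in the table above might look like the following. It is a sketch only: the function name chunk_merkle_errors and the list-of-field-names return shape are assumptions, and the verifier's actual failure codes are defined in disclosure-v1.md, not here.

```python
import re

PROFILE_LITERAL = "satsignal.text.paragraph_sentence.v1"
HEX64 = re.compile(r"[0-9a-f]{64}")


def chunk_merkle_errors(block: dict) -> list[str]:
    """Return the names of chunk_merkle fields that fail the v1 requirements."""
    errors: list[str] = []
    if block.get("scheme") != PROFILE_LITERAL:
        errors.append("scheme")
    if block.get("algo") != "sha256":
        errors.append("algo")
    count = block.get("leaf_count")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not (isinstance(count, int) and not isinstance(count, bool) and count > 0):
        errors.append("leaf_count")
    root = block.get("root")
    if not (isinstance(root, str) and HEX64.fullmatch(root)):
        errors.append("root")
    return errors
```

An empty list means the block has the required shape; it says nothing about whether the root actually matches the document's leaf-set.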

Sealed-mode anchors (algo: "merkle-hmac-sha256") are NOT supported by satsignal.disclosure.v1 (per disclosure-v1.md §4 step 5); this profile inherits that restriction. Sealed-mode sentence disclosure is deferred to a future minor of the disclosure spec or to a per-profile sealed addendum.

11. Fixtures

Each fixture below shows: the input bytes (hex), the canonical form after §3, the paragraph and sentence count, every (leaf_id, value) pair, the per-leaf salt_b64 and computed leaf_hash, and (where indicated) the merkle root and proof paths. All hashes are real SHA-256 outputs computed against the algorithm described in §§3–8; they are not placeholders. The exact reference Python that produced them is in §11.16.

11.1 C1 — minimal

Input bytes (hex): 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 20 47 6f 6f 64 62 79 65 20 77 6f 72 6c 64 2e 0a
Canonical form: "Hello world. Goodbye world."
Paragraphs: 1 · Sentences: 2

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`Hello world.`	`QzEtc2FsdC1hYWFhYWFhYQ==`	`a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae`
`p0/s1`	`Goodbye world.`	`QzEtc2FsdC1iYmJiYmJiYg==`	`1bff79cb195d06b2602dc44ce1f7f8e4fc854e621d20954f7cf4fc47ad8e91e5`

Merkle root: ec0f6274cddd13fa394c0ba8f024b8b23184ccc946bc4cd01130d72cb5659094

Full merkle tree. Two leaves, one inner level (the root).

              root  = ec0f6274cddd13fa394c0ba8f024b8b23184ccc946bc4cd01130d72cb5659094
                              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )
                                      /                                \
   leaf[p0/s0] = a0a9...8aae                                   leaf[p0/s1] = 1bff...91e5

proof_path entries:

For p0/s0: [{"side":"R","hash":"1bff79cb195d06b2602dc44ce1f7f8e4fc854e621d20954f7cf4fc47ad8e91e5"}]
For p0/s1: [{"side":"L","hash":"a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae"}]
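The proof_path entries above can be walked with a short loop. This sketch infers the side semantics from the C1 tree (a sibling marked "R" is concatenated on the right, "L" on the left) and concatenates raw digest bytes rather than hex text; the authoritative proof-walk rules live in disclosure-v1.md §3.4, and the function name walk_proof is hypothetical.

```python
import hashlib


def walk_proof(leaf_hash_hex: str, proof_path: list[dict]) -> str:
    """Fold a leaf hash up through its siblings and return the root as hex."""
    current = bytes.fromhex(leaf_hash_hex)
    for step in proof_path:
        sibling = bytes.fromhex(step["hash"])
        if step["side"] == "R":
            current = hashlib.sha256(current + sibling).digest()
        elif step["side"] == "L":
            current = hashlib.sha256(sibling + current).digest()
        else:
            raise ValueError(f"unknown side: {step['side']!r}")
    return current.hex()
```

A verifier would compare the returned value with the anchored root, for instance by walking the p0/s0 leaf hash with its single "R" entry and checking the result against the merkle root listed for this fixture.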

11.2 C2 — abbreviation (`Mr.`)

Input bytes (hex): 4d 72 2e 20 53 6d 69 74 68 20 77 65 6e 74 20 68 6f 6d 65 2e 20 48 65 20 6c 65 66 74 20 61 74 20 35 2e 0a
Canonical form: "Mr. Smith went home. He left at 5."
Paragraphs: 1 · Sentences: 2 (NOT 3 — the Mr. does not terminate per §5.2)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`Mr. Smith went home.`	`QzItc2FsdC1jY2NjY2NjYw==`	`57ade39c221efb0830664a597df9ad7a1ac2f2d23859f79d15eb8bd127219419`
`p0/s1`	`He left at 5.`	`QzItc2FsdC1kZGRkZGRkZA==`	`0f014458d7fe1eb748812f820306f2333de1c29f75d7a388bda4279cfb386588`

Merkle root: 8a8a7caa7ff601d2b063ea19b151c96f842730271dbbadb1f22af29de9a86591

Full merkle tree.

              root  = 8a8a7caa7ff601d2b063ea19b151c96f842730271dbbadb1f22af29de9a86591
                              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )
                                      /                                \
   leaf[p0/s0] = 57ad...9419                                  leaf[p0/s1] = 0f01...6588

proof_path entries:

For p0/s0: [{"side":"R","hash":"0f014458d7fe1eb748812f820306f2333de1c29f75d7a388bda4279cfb386588"}]
For p0/s1: [{"side":"L","hash":"57ade39c221efb0830664a597df9ad7a1ac2f2d23859f79d15eb8bd127219419"}]

11.3 C3 — decimal (`$3.14`)

Input bytes (hex): 54 68 65 20 70 72 69 63 65 20 77 61 73 20 24 33 2e 31 34 20 79 65 73 74 65 72 64 61 79 2e 20 49 74 20 63 68 61 6e 67 65 64 20 74 6f 64 61 79 2e 0a
Canonical form: "The price was $3.14 yesterday. It changed today."
Paragraphs: 1 · Sentences: 2 (NOT 3 — 3.14 does not split per §5.3)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`The price was $3.14 yesterday.`	`QzMtc2FsdC1lZWVlZWVlZQ==`	`24734ad67b1bb468d8ab550afd8f85c8a95acdc5dbbbde387485568a39ad6dbe`
`p0/s1`	`It changed today.`	`QzMtc2FsdC1mZmZmZmZmZg==`	`776c91d98ebf0202c4562bca2e72ed44aed8b5e357eb6773d7eed188678b0b48`

11.4 C4 — quoted speech (`"Did you go?" she asked.`)

Input bytes (hex): 22 44 69 64 20 79 6f 75 20 67 6f 3f 22 20 73 68 65 20 61 73 6b 65 64 2e 20 48 65 20 6e 6f 64 64 65 64 2e 0a
Canonical form: "\"Did you go?\" she asked. He nodded."
Paragraphs: 1 · Sentences: 2 (NOT 3 — the ? inside quotes does not terminate per §5.5)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`"Did you go?" she asked.`	`QzQtc2FsdC1nZ2dnZ2dnZw==`	`7a4f31c93ad048e5be7e8f65ac67e4a351b6fbbf30e99b789b3305999906420c`
`p0/s1`	`He nodded.`	`QzQtc2FsdC1oaGhoaGhoaA==`	`c900b0ef421c8c97af2ee820cb9154945d34b1ef25328dd3291bed496c63ef0f`

11.5 C5 — ellipsis with space

Input bytes (hex): 53 68 65 20 70 61 75 73 65 64 2e 2e 2e 20 54 68 65 6e 20 73 68 65 20 73 70 6f 6b 65 2e 0a
Canonical form: "She paused... Then she spoke."
Paragraphs: 1 · Sentences: 2 (the ... followed by space terminates per §5.4)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`She paused...`	`QzUtc2FsdC1paWlpaWlpaQ==`	`f66863a8bbde2e3c1d67e540ce6174d4a074fd2cd4bb4cfed9f13ee33f0429e4`
`p0/s1`	`Then she spoke.`	`QzUtc2FsdC1qampqampqag==`	`31737452546fa568667abce83ea34cbfd6d64900bf3541ac5f40a44b6106cd06`

11.6 C6 — mid-token ellipsis (deviation flag)

The brief's literal example was "She was... uncertain.\n" (with a space after the ellipsis) labelled as one sentence, but §5.4 makes ... followed by whitespace a terminator (consistent with C5). To keep the spec rules pure and to exercise the mid-token carve-out as written, this fixture uses "She was...uncertain.\n" (no space after the ellipsis). The deviation is noted in the worker report.

Input bytes (hex): 53 68 65 20 77 61 73 2e 2e 2e 75 6e 63 65 72 74 61 69 6e 2e 0a
Canonical form: "She was...uncertain."
Paragraphs: 1 · Sentences: 1 (the ... is mid-token, no whitespace immediately after — does not terminate per §5.4)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`She was...uncertain.`	`QzYtc2FsdC1ra2tra2traw==`	`436d353f3f204360122841244c4a4ec3ea9443b9288d3ec0fc868eb194361d25`
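
The difference between C5 and C6 comes down to the character immediately after the last dot of the run. A minimal sketch of that check follows, restating the §5.4 branch of the reference algorithm in §11.16 as a standalone helper. The name ellipsis_terminates is illustrative and not part of the spec.

```python
def ellipsis_terminates(p: str, i: int) -> bool:
    """For the '.' at p[i], assumed to sit inside a run of three or more
    dots, report whether that dot ends a sentence."""
    n = len(p)
    # Expand left and right to find the full run of consecutive dots.
    run_start = i
    while run_start > 0 and p[run_start - 1] == ".":
        run_start -= 1
    run_end = i
    while run_end + 1 < n and p[run_end + 1] == ".":
        run_end += 1
    if run_end - run_start + 1 < 3:
        raise ValueError("not an ellipsis run")
    if i != run_end:
        return False  # only the last dot of the run can terminate
    after = p[run_end + 1] if run_end + 1 < n else ""
    return after == "" or after.isspace()


c5 = "She paused... Then she spoke."
c6 = "She was...uncertain."
print(ellipsis_terminates(c5, c5.index("...") + 2))  # C5: last dot is followed by a space
print(ellipsis_terminates(c6, c6.index("...") + 2))  # C6: last dot is followed by 'u'
```

This prints True for C5 and False for C6, which is why C5 yields two sentences and C6 yields one.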

11.7 C7 — CJK (`こんにちは。さようなら。`)

Input bytes (hex): e3 81 93 e3 82 93 e3 81 ab e3 81 a1 e3 81 af e3 80 82 e3 81 95 e3 82 88 e3 81 86 e3 81 aa e3 82 89 e3 80 82 0a
Canonical form: "こんにちは。さようなら。"
Paragraphs: 1 · Sentences: 2 (U+3002 IDEOGRAPHIC FULL STOP terminates per §5.1)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`こんにちは。`	`Qzctc2FsdC1sbGxsbGxsbA==`	`b09ff7dfda7aff38ec1e1e37ec1fe6cef2f2316bc679b15223a406cb9f61998d`
`p0/s1`	`さようなら。`	`Qzctc2FsdC1tbW1tbW1tbQ==`	`61c4071b343a0c45366f3eb84a83d8a37ffc8ad213e3b42ea825ea0d0dfd5960`

11.8 C8 — smart quotes (`“Hello!” she said.`)

Input bytes (hex): e2 80 9c 48 65 6c 6c 6f 21 e2 80 9d 20 73 68 65 20 73 61 69 64 2e 0a
Canonical form: "“Hello!” she said." (using U+201C and U+201D)
Paragraphs: 1 · Sentences: 1 (the ! inside the typographic-pair quotes does not terminate per §5.5)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`“Hello!” she said.`	`Qzgtc2FsdC1ubm5ubm5ubg==`	`24137c38c612d038244f25d7434e464a9f68b4c79ffa4825e943e47036c8ae78`

11.9 C9 — NFC equivalence (`café.`)

This fixture demonstrates that NFC normalization (§3.5) collapses the two byte-distinct representations of café into the same canonical form, yielding identical leaf_hash values.

Variant	Input bytes (hex)	Canonical bytes (hex)
C9a	`63 61 66 c3 a9 2e 0a` (precomposed `é` = U+00E9)	`63 61 66 c3 a9 2e`
C9b	`63 61 66 65 cc 81 2e 0a` (decomposed `e` + U+0301)	`63 61 66 c3 a9 2e`

Both variants canonicalize to the same 6 bytes (café.). With the same salt_b64 (Qzktc2FsdC1vb29vb29vbw== — raw bytes 43 39 2d 73 61 6c 74 2d 6f 6f 6f 6f 6f 6f 6f 6f), both variants produce the identical leaf_hash:

e604850e3138f48df7d5f1858d316500904b89f4f7949446bd42d8faa4b054b4

This is the byte-level demonstration that §3.5's NFC requirement makes the profile editor-agnostic for accented Latin text.
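
The collapse can be checked directly with the standard library. The sketch below applies only the NFC step and the trailing-LF strip to the two C9 inputs; the helper name nfc_bytes is illustrative, and the remaining §3 steps are omitted because they do not affect this input.

```python
import unicodedata

c9a = bytes.fromhex("63 61 66 c3 a9 2e 0a")      # precomposed U+00E9
c9b = bytes.fromhex("63 61 66 65 cc 81 2e 0a")   # 'e' followed by combining U+0301


def nfc_bytes(raw: bytes) -> bytes:
    """Decode, NFC-normalize, drop the single trailing LF, re-encode."""
    text = unicodedata.normalize("NFC", raw.decode("utf-8"))
    if text.endswith("\n"):
        text = text[:-1]
    return text.encode("utf-8")


print(nfc_bytes(c9a).hex(" "))
print(nfc_bytes(c9b).hex(" "))
print(nfc_bytes(c9a) == nfc_bytes(c9b))
```

Both hex lines read 63 61 66 c3 a9 2e, and the comparison prints True.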

11.10 C10 — multi-paragraph (full merkle tree)

Input bytes (hex): 46 69 72 73 74 20 70 61 72 61 2e 20 53 65 63 6f 6e 64 20 73 65 6e 74 65 6e 63 65 20 68 65 72 65 2e 0a 0a 53 65 63 6f 6e 64 20 70 61 72 61 2e 20 57 69 74 68 20 74 77 6f 20 73 65 6e 74 65 6e 63 65 73 2e 0a
Canonical form: "First para. Second sentence here.\n\nSecond para. With two sentences."
Paragraphs: 2 · Sentences: 4

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`First para.`	`QzEwLXNhbHQtYWFhYWFhYQ==`	`1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f`
`p0/s1`	`Second sentence here.`	`QzEwLXNhbHQtYmJiYmJiYg==`	`d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999`
`p1/s0`	`Second para.`	`QzEwLXNhbHQtY2NjY2NjYw==`	`63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc`
`p1/s1`	`With two sentences.`	`QzEwLXNhbHQtZGRkZGRkZA==`	`ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e`

Merkle root: b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2

Full merkle tree. Four leaves, two inner levels.

                            root  =  b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2
                                  =   SHA-256( I_L || I_R )
                       /                                                     \
        I_L = a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc      I_R = dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734
              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )                                  =  SHA-256( leaf[p1/s0] || leaf[p1/s1] )
            /                       \                                                /                          \
   leaf[p0/s0] = 1801...947f   leaf[p0/s1] = d3d3...8999            leaf[p1/s0] = 63ac...5dfc   leaf[p1/s1] = ff14...c05e

proof_path entries:

For p0/s0: [{"side":"R","hash":"d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999"}, {"side":"R","hash":"dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734"}]
For p0/s1: [{"side":"L","hash":"1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f"}, {"side":"R","hash":"dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734"}]
For p1/s0: [{"side":"R","hash":"ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e"}, {"side":"L","hash":"a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc"}]
For p1/s1: [{"side":"L","hash":"63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc"}, {"side":"L","hash":"a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc"}]
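
The proof_path entries above can be derived mechanically from the tree levels. The sketch below assumes two things read off the C10 entries rather than quoted from disclosure-v1.md: entries run from the leaf level upward, and "side" names the side on which the sibling hash sits. It further assumes that a promoted node (no sibling at that level) contributes no entry. build_levels mirrors merkle_root_and_levels in §11.16; proof_path and fold are illustrative names.

```python
import hashlib


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Pairwise SHA-256 up to the root; an unpaired last node is promoted."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        nxt = []
        for i in range(0, len(cur), 2):
            if i + 1 < len(cur):
                nxt.append(hashlib.sha256(cur[i] + cur[i + 1]).digest())
            else:
                nxt.append(cur[i])
        levels.append(nxt)
    return levels


def proof_path(levels: list[list[bytes]], index: int) -> list[dict]:
    """Sibling hashes from the leaf level to just below the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            side = "R" if sibling > index else "L"
            path.append({"side": side, "hash": level[sibling].hex()})
        # Assumption: a promoted node (no sibling) adds no entry.
        index //= 2
    return path


def fold(leaf: bytes, path: list[dict]) -> bytes:
    """Recompute the root from one leaf hash and its proof path."""
    node = leaf
    for step in path:
        sib = bytes.fromhex(step["hash"])
        node = hashlib.sha256(sib + node if step["side"] == "L" else node + sib).digest()
    return node


# C10 leaf hashes, copied from the table above.
leaves = [bytes.fromhex(h) for h in (
    "1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f",
    "d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999",
    "63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc",
    "ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e",
)]
levels = build_levels(leaves)
root = levels[-1][0]
for i, leaf in enumerate(leaves):
    path = proof_path(levels, i)
    print(i, [s["side"] for s in path], fold(leaf, path) == root)
print(root.hex())  # compare against the pinned C10 merkle root
```

Each printed row gives the leaf index, the sides of its path, and whether folding the path reproduces the computed root. The final line is the computed root, for comparison with the value pinned above.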

11.11 C11 — CRLF mixed

Input bytes (hex): 46 69 72 73 74 2e 0d 0a 53 74 69 6c 6c 20 70 30 2e 0a 0d 0a 53 65 63 6f 6e 64 20 70 61 72 61 2e 0d 0a
Canonical form: "First.\nStill p0.\n\nSecond para." (matches the same input encoded with only \n)
Paragraphs: 2 · Sentences: 3 (p0 has 2; p1 has 1)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`First.`	`QzExLXNhbHQtYWFhYWFhYQ==`	`b04b6704c41bd91295ab9c02fca0c5d69324c92609f4d3750077036f7055be61`
`p0/s1`	`Still p0.`	`QzExLXNhbHQtYmJiYmJiYg==`	`6e35b2ae0bfee6fe8ff5b2cd88e48a289b3e38970558c940dc849b08f7d2df9b`
`p1/s0`	`Second para.`	`QzExLXNhbHQtY2NjY2NjYw==`	`1b2237c8e8c8fac65b7af016cb6005a79b735eb25a600e3a652eba0921dfbefb`

Note that the within-paragraph CRLF between First. and Still p0. is normalized to LF (§3.4), and the two lines are then joined with a single space per §4.3 — yielding the segmentation-input string "First. Still p0." for p0. The LF is not present in any sentence's bytes.
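
A minimal sketch of that path from input bytes to segmentation-input strings, covering only the steps C11 exercises (§3.4, §3.7, §4.3). BOM stripping, NFC, and per-line trailing-whitespace trimming are left out because they are no-ops for this input; the variable names are illustrative.

```python
raw = bytes.fromhex(
    "46 69 72 73 74 2e 0d 0a 53 74 69 6c 6c 20 70 30 2e 0a "
    "0d 0a 53 65 63 6f 6e 64 20 70 61 72 61 2e 0d 0a"
)

text = raw.decode("utf-8")
text = text.replace("\r\n", "\n").replace("\r", "\n")   # §3.4: CRLF and lone CR become LF
if text.endswith("\n"):                                 # §3.7: drop one trailing LF
    text = text[:-1]

paragraphs, current = [], []
for line in text.split("\n"):
    if line == "":                                      # a blank line closes the open paragraph
        if current:
            paragraphs.append(" ".join(current))        # §4.3: join lines with one space
            current = []
    else:
        current.append(line)
if current:
    paragraphs.append(" ".join(current))

print(repr(text))
print(paragraphs)
```

The second line printed is ['First. Still p0.', 'Second para.'], so neither paragraph string contains an LF.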

11.12 C12 — BOM

Input bytes (hex): ef bb bf 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a
Canonical form (after §3.2 BOM strip): "Hello world." (identical to the canonical form of the BOM-less input 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a)
Paragraphs: 1 · Sentences: 1

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`Hello world.`	`QzEyLXNhbHQtYWFhYWFhYQ==`	`2bb0fd5264ba70973a636c65419dfdf47be4956a2cc992b0c2d3d690547356c2`

Note: this leaf_hash differs from C1's p0/s0 because the salt_b64 differs. The point of the fixture is that the canonical bytes of the sentence are identical to a BOM-less variant — not that the leaf hash is identical (which it would be only if the salt and leaf_id were also identical).
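
The byte-level claim can be checked in a few lines. This sketch applies only the §3.2 BOM strip and the §3.7 trailing-LF strip, which are the only steps that change this input; the helper name is illustrative.

```python
with_bom = bytes.fromhex("ef bb bf 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a")
without_bom = bytes.fromhex("48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a")


def strip_bom_and_eof_newline(raw: bytes) -> bytes:
    text = raw.decode("utf-8")          # the plain "utf-8" codec keeps U+FEFF in the string
    if text.startswith("\ufeff"):       # §3.2: drop a leading BOM
        text = text[1:]
    if text.endswith("\n"):             # §3.7: drop one trailing LF
        text = text[:-1]
    return text.encode("utf-8")


print(strip_bom_and_eof_newline(with_bom) == strip_bom_and_eof_newline(without_bom))
```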

11.13 C13 — incomplete-sentence paragraph

Input bytes (hex): 54 68 69 73 20 68 61 73 20 6e 6f 20 74 65 72 6d 69 6e 61 74 6f 72 0a
Canonical form: "This has no terminator"
Paragraphs: 1 · Sentences: 1 (paragraph has no terminator codepoints → §5.9 applies; the full paragraph bytes are one sentence with NO synthetic terminator added)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`This has no terminator`	`QzEzLXNhbHQtYWFhYWFhYQ==`	`3efe3081be5681139d706aa809e132cace91a96d31683a1a2842514c836c0c9d`

11.14 C14 — forever-list abbreviation collision (`vs.`)

Input bytes (hex): 76 73 2e 20 74 68 65 20 72 65 73 74 2e 20 45 6e 64 2e 0a
Canonical form: "vs. the rest. End."
Paragraphs: 1 · Sentences: 2 (the vs. is on the abbreviation list per §5.2 and does not terminate; the next . after rest does; the final . after End terminates the second sentence)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`vs. the rest.`	`QzE0LXNhbHQtYWFhYWFhYQ==`	`8a0016e15703dedd89d512d3f54f2b8d55173ec1e2827503650051c37bb74821`
`p0/s1`	`End.`	`QzE0LXNhbHQtYmJiYmJiYg==`	`f9a6380a2ffb925556f6247683576697c9660f7adfcdc78b76a8bc4d4cf812e9`

11.15 C15 — multi-dot abbreviations (`e.g.`, `i.e.`)

This fixture exercises §5.11 (Decision 25) directly: both e.g and i.e are multi-dot entries in the abbreviation list, and v1's matching algorithm shields all four of the internal-and-trailing dots they introduce. The bug being fixed was that the prior _preceding_word helper split this input into four sentence fragments; the patched _dotted_token_match in §11.16 produces the intended two.

Input bytes (hex): 48 65 20 73 61 69 64 20 65 2e 67 2e 20 62 65 66 6f 72 65 20 6c 75 6e 63 68 2e 20 54 68 65 6e 20 69 2e 65 2e 20 61 66 74 65 72 2e 0a
Canonical form: "He said e.g. before lunch. Then i.e. after."
Paragraphs: 1 · Sentences: 2 (the e.g. and i.e. dotted-tokens are both shielded by §5.11)

`leaf_id`	`value`	`salt_b64`	`leaf_hash`
`p0/s0`	`He said e.g. before lunch.`	`Dw8PDw8PDw8PDw8PDw8PDw==`	`be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882`
`p0/s1`	`Then i.e. after.`	`8PDw8PDw8PDw8PDw8PDw8A==`	`1a455dea1a4456b1bb6af79022bb3bdd77900367a4a91184904d3c395cb9dcb4`

The two salts are deliberately constant byte patterns (16 × 0x0F and 16 × 0xF0) so an unaffiliated verifier can reproduce these hashes without a salt-generation step: the raw-bytes preimage is fully determined by the spec.

Full merkle tree. Two leaves, one inner level (the root).

              root  = 65bc540ee29ee81834460758069e4ed48c2bc05ca8958f11db98ea5b08a060ba
                              =  SHA-256( leaf[p0/s0] || leaf[p0/s1] )
                                      /                                \
   leaf[p0/s0] = be17...9882                                  leaf[p0/s1] = 1a45...dcb4

proof_path entries:

For p0/s0: [{"side":"R","hash":"1a455dea1a4456b1bb6af79022bb3bdd77900367a4a91184904d3c395cb9dcb4"}]
For p0/s1: [{"side":"L","hash":"be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882"}]

Full preimage breakdown for p0/s0. Reproducing the leaf-hash by hand:

profile_literal_utf8 (36 B): 73 61 74 73 69 67 6e 61 6c 2e 74 65 78 74 2e 70 61 72 61 67 72 61 70 68 5f 73 65 6e 74 65 6e 63 65 2e 76 31
separator (1 B): 00
leaf_id_utf8 (5 B): 70 30 2f 73 30 ("p0/s0")
separator (1 B): 00
sentence_value_utf8 (26 B): 48 65 20 73 61 69 64 20 65 2e 67 2e 20 62 65 66 6f 72 65 20 6c 75 6e 63 68 2e ("He said e.g. before lunch.")
separator (1 B): 00
salt_bytes (16 B): 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f

Total preimage length: 86 bytes. SHA-256 of those bytes: be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882.
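
The breakdown can be assembled mechanically. The sketch below builds the preimage from the four fields listed above, decodes the salt from the salt_b64 cell of the C15 table, and prints the SHA-256 digest so it can be compared against the pinned leaf_hash. Variable names are illustrative.

```python
import base64
import hashlib

profile = "satsignal.text.paragraph_sentence.v1"
leaf_id = "p0/s0"
value = "He said e.g. before lunch."
salt = base64.b64decode("Dw8PDw8PDw8PDw8PDw8PDw==")   # salt_b64 from the C15 table

parts = [profile.encode("utf-8"), leaf_id.encode("utf-8"), value.encode("utf-8"), salt]
preimage = b"\x00".join(parts)       # three 0x00 separators between the four fields

print([len(p) for p in parts], len(preimage))
print(salt == b"\x0f" * 16)
print(hashlib.sha256(preimage).hexdigest())   # compare with the pinned leaf_hash
```

The first line printed is [36, 5, 26, 16] 86, matching the field sizes and total length stated above; the second confirms the salt is 16 bytes of 0x0F.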

Counterfactual. Under the pre-patch _preceding_word helper (now removed from v1), this input would have produced four sentence fragments: "He said e.", "g. before lunch.", "Then i.", "e. after.". The hashes for those four fragments would not match the two hashes pinned here. A verifier that runs the patched §11.16 algorithm reproduces the two hashes above exactly; a verifier that runs the pre-patch algorithm cannot verify a C15 disclosure. This is the single bug §5.11 corrects.

11.16 Reference algorithm (Python)

The hashes in §§11.1–11.15 were produced by the algorithm below. It is normative for the segmentation and hashing rules in §§3–8: any divergence between this code and the prose above is a bug in the prose, not in the code (subject to the spec author retaining final say on rule shape).

import hashlib, unicodedata

PROFILE = "satsignal.text.paragraph_sentence.v1"

ABBREVIATIONS = {
    "Mr","Mrs","Ms","Mx","Dr","Jr","Sr","St","Mt","Ft",
    "vs","etc","e.g","i.e","cf","viz","Inc","Ltd","Co","Corp",
    "LLC","LLP","PLC","GmbH","S.A","Pty","No","Vol","pp","p",
    "ch","sec","fig","eq","Prof","Rev","Hon","Capt","Gen","Lt",
    "Sgt","Maj","Col","Ave","Blvd","Rd","Ln","Ct",
}

TERMINATORS = {".", "?", "!", "…", "。", "？", "！"}

QUOTE_OPEN_TO_CLOSE = {
    '"': '"', "'": "'",
    "“": "”", "‘": "’",
    "«": "»", "「": "」",
}
QUOTE_OPENERS = set(QUOTE_OPEN_TO_CLOSE.keys())
QUOTE_CLOSERS = set(QUOTE_OPEN_TO_CLOSE.values())

# §5.11 (Decision 25): dotted-token boundary set. The forever-frozen
# abbreviation list in §5.2 contains no bracket / quote / whitespace
# character, so widening the boundary set with brackets cannot make a
# previously-recognized abbreviation stop matching.
DOTTED_TOKEN_BREAKS = {"(", ")", "[", "]", "{", "}"}


def canonicalize(raw: bytes) -> str:
    text = raw.decode("utf-8")            # §3.1
    if text.startswith("\ufeff"):         # §3.2: strip a leading BOM (U+FEFF)
        text = text[1:]
    text = unicodedata.normalize("NFC", text)  # §3.5
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # §3.4
    lines = [ln.rstrip(" \t") for ln in text.split("\n")]  # §3.6
    text = "\n".join(lines)
    if text.endswith("\n"):               # §3.7
        text = text[:-1]
    return text


def split_paragraphs(canonical: str) -> list[str]:
    paragraphs, cur = [], []
    for line in canonical.split("\n"):
        if line == "":
            if cur:
                paragraphs.append(" ".join(cur))   # §4.3
                cur = []
        else:
            cur.append(line)
    if cur:
        paragraphs.append(" ".join(cur))
    return paragraphs


def _is_dotted_token_boundary(ch: str) -> bool:
    """§5.11: whitespace, recognized quote, or ASCII bracket."""
    return (
        ch.isspace()
        or ch in QUOTE_OPENERS
        or ch in QUOTE_CLOSERS
        or ch in DOTTED_TOKEN_BREAKS
    )


def _dotted_token_match(buf: str, idx: int) -> bool:
    """
    §5.11 (Decision 25). Returns True iff the candidate '.' at
    `buf[idx]` is shielded from being a sentence terminator by a
    forever-frozen abbreviation entry.

    The 'dotted-token' surrounding `idx` is the maximal run of
    non-(whitespace | quote | bracket) characters around `idx`,
    delimited by `_is_dotted_token_boundary`. The candidate '.' is
    shielded iff there exists an entry `a` in `ABBREVIATIONS` with

        buf[start : start + len(a)] == a
        AND start <= idx <= start + len(a)

    where `start` is the dotted-token's left boundary. This handles
    BOTH the internal '.' of a multi-dot abbreviation (idx strictly
    inside the entry's span) AND the trailing '.' immediately after
    the entry (idx == start + len(a)).
    """
    # Left boundary of the dotted-token.
    start = idx
    while start > 0 and not _is_dotted_token_boundary(buf[start - 1]):
        start -= 1
    # Right boundary (used only to cap candidate entry length).
    n = len(buf)
    end = idx
    while end < n and not _is_dotted_token_boundary(buf[end]):
        end += 1
    span_left  = idx - start
    span_right = end - start
    for a in ABBREVIATIONS:
        L = len(a)
        # Entry must cover idx (L >= span_left) and fit in the
        # dotted-token (L <= span_right).
        if L < span_left or L > span_right:
            continue
        if buf[start:start + L] == a:
            return True
    return False


def segment_sentences(p: str) -> list[str]:
    sents, n = [], len(p)
    i = 0
    while i < n and p[i].isspace():
        i += 1
    start, stack = i, []
    while i < n:
        ch = p[i]
        # quote depth (§5.5)
        if ch in QUOTE_OPENERS and (
            ch not in QUOTE_CLOSERS or not stack or stack[-1] != ch
        ):
            stack.append(QUOTE_OPEN_TO_CLOSE[ch]); i += 1; continue
        if ch in QUOTE_CLOSERS and stack and stack[-1] == ch:
            stack.pop(); i += 1; continue
        if stack:
            i += 1; continue
        if ch not in TERMINATORS:
            i += 1; continue
        terminates = True
        if ch == ".":
            prev_c = p[i - 1] if i > 0 else ""
            next_c = p[i + 1] if i + 1 < n else ""
            if prev_c.isdigit() and next_c.isdigit():    # §5.3 decimal
                terminates = False
            else:
                # §5.4 ellipsis-with-space
                rs = i
                while rs > 0 and p[rs - 1] == ".":
                    rs -= 1
                re = i
                while re + 1 < n and p[re + 1] == ".":
                    re += 1
                if re - rs + 1 >= 3:
                    if i != re:
                        terminates = False
                    else:
                        after = p[re + 1] if re + 1 < n else ""
                        terminates = (after == "" or after.isspace())
                else:
                    # §5.2 / §5.11 (Decision 25) — multi-dot aware.
                    if _dotted_token_match(p, i):
                        terminates = False
        if not terminates:
            i += 1; continue
        sents.append(p[start:i + 1])                     # §5.7
        i += 1
        while i < n and p[i].isspace():
            i += 1
        start = i
    if start < n and p[start:n].strip() != "":           # §5.9 / §5.10
        sents.append(p[start:n])
    return sents


def leaf_hash(profile, leaf_id, value, salt_bytes):      # §7
    return hashlib.sha256(
        profile.encode("utf-8") + b"\x00"
        + leaf_id.encode("utf-8") + b"\x00"
        + value.encode("utf-8") + b"\x00"
        + salt_bytes
    ).digest()


def merkle_root_and_levels(leaves):
    levels = [list(leaves)]
    cur = list(leaves)
    while len(cur) > 1:
        nxt = []
        for i in range(0, len(cur), 2):
            if i + 1 < len(cur):
                nxt.append(hashlib.sha256(cur[i] + cur[i + 1]).digest())
            else:
                nxt.append(cur[i])                       # odd: promote
        levels.append(nxt)
        cur = nxt
    return cur[0], levels

12. Out of scope for v1

The following are explicitly out of scope for v1 of this profile. Any of them may motivate a future vN+1 profile under a separate decision record; none can be retrofitted into v1.

Paragraph-level leaves as a distinct type. Paragraph-level disclosure is achieved by disclosing every sentence in the chosen paragraph. No p<N> leaf type exists in v1.
Word-level or token-level disclosure. No text.word.v1 or text.token.v1. The leaf granularity of v1 is the sentence.
Language-specific tokenization. No language_hint field, no ICU dependency, no Burmese / Khmer / Lao word breakers, no Japanese morphological splitter. The terminator set in §5.1 plus the abbreviation list in §5.2 is the whole grammar.
Markdown / HTML / RTF / PDF / DOCX rendering inside sentences. The profile operates over plaintext only; structural markup is the caller's responsibility to strip before anchoring.
Footnote / citation extraction. Inline footnote markers ([1], superscripts, ^note) are treated as ordinary sentence bytes; the profile does not split them out, link them across paragraphs, or emit a separate footnote leaf type.
Sealed-mode disclosure. Per disclosure-v1.md §4 step 5, satsignal.disclosure.v1 covers standard-mode original anchors only (algo: "sha256"). Sealed-mode sentence disclosure is deferred to a future minor of the disclosure spec or a per-profile sealed addendum.
Capitalization heuristics for an ambiguous period (.). §5.6 explicitly rules out using next-word capitalization to decide whether a . is a terminator. The abbreviation list of §5.2 is the only carve-out.
Adding to the abbreviation list. Even if a missing abbreviation is identified (e.g. Univ, Eng, Sen), it cannot be added to v1. The remedy is v2.

Questions about this specification? Email hello@satsignal.cloud.

