satsignal.text.paragraph_sentence.v1 — sentence-level selective disclosure of UTF-8 plaintext
DEPRECATED / INERT (read this first). This satsignal.text.paragraph_sentence.v1 dotted profile is deprecated and inert. It defined a salted rule with one leaf per sentence, addressed by paragraph and sentence index (p<N>/s<M>). No production flow emits or consumes it. Its allowlist literal is retained forever (an allowlist literal is never removed) and its frozen corpus is kept solely as a regression-guard record; it is never produced or verified by any live path. The rules below stay frozen for that regression corpus; do not implement them for new work. Live successor: text-line-v1. New text disclosures use the native text-line-v1 literal, which binds to the chunk_merkle a .txt/.md anchor already commits. Note the granularity difference: this deprecated profile segments by sentence (salted, paragraph/sentence indexed); the live text-line-v1 segments by line (text-norm-v1 canon, drop empty lines, bare/sealed leaf, native merkle binding; no re-anchor, no salt keyfile in standard mode). The two cannot interbind. Authority for the deprecation: disclosure-v1 §11.
Versioning (2026-05-27). This is satsignal.text.paragraph_sentence.v1. The profile literal is fixed at "satsignal.text.paragraph_sentence.v1" and appears verbatim inside every leaf-hash preimage (see §7). Profile literals are forever-contracts: once any client has anchored under this literal, every segmentation, canonicalization, normalization, salting, and leaf_id construction rule below is fixed for that literal forever. A bug in any of these rules cannot be patched in place; the only remedy is a new satsignal.text.paragraph_sentence.v2 profile that compatible verifiers must support in parallel. Breaking shape changes ship as v2, never as a quiet v1 mutation. This spec is the authoritative source for v1 rules; the master disclosure spec at disclosure-v1.md is the home for cross-profile plumbing (manifest shape, merkle invariants, verifier contract).
Status: draft 1, 2026-05-27. Audience: integrators anchoring UTF-8 plaintext documents (contracts, transcripts, articles, policy texts) who anticipate later disclosing specific sentences without revealing the rest; verifier authors who must reproduce this profile's segmentation byte-for-byte against an anchored leaf-set. Goal: define exactly one segmentation, canonicalization, leaf-id, and leaf-hash construction for sentence-level disclosure of UTF-8 plaintext — pinned to the byte, with a fixture set covering the adversarial cases that motivated the forever-contract discipline.
1. Why this exists
satsignal.disclosure.v1 lets an anchorer publish a redacted view of an already-anchored document with cryptographic proof that the revealed fragments were members of the original leaf-set. That construction needs a segmentation rule: somebody has to decide where a sentence starts and ends, because the revealed-leaf membership check recomputes the leaf hash from the exact sentence bytes the original anchor committed to. If two implementations disagree on where a sentence ends, they disagree on its leaf hash, and the disclosure fails to verify against an honest anchor.
Plaintext contracts, transcripts, news articles, policy texts, and similar UTF-8 documents are the headline driver. The disclosure builder ingests the document, runs this profile's canonicalization and segmentation, hashes each sentence into a leaf, builds the merkle tree, and anchors the root through the standard path. Later — at any time — the anchorer picks a subset of sentences, ships a disclosure bundle with their values, salts, and merkle proofs, and a verifier walks each proof to the on-chain-bound root. Sentences that are not disclosed never leave the anchorer's machine.
Sentence-level granularity is the operative leaf-type in v1. Paragraph-level disclosure is not a distinct leaf-type; it is achieved by disclosing every sentence in the chosen paragraph. Word-level and token-level disclosure are out of scope for v1 (§12).
2. Scope and prerequisites
This profile applies to a single UTF-8 plaintext input — a document without structural markup, treated as a sequence of paragraphs of sentences. Markdown, HTML, RTF, PDF, DOCX, and any other shape with encoded structure are not in scope: an anchorer who wants to disclose from such a source MUST first export it to plaintext under their own rules, then anchor that plaintext under this profile. The exported plaintext is the document this profile operates on; the original markup-bearing source is not.
Per disclosure-v1.md §2, selective disclosure is only possible for original anchors that were committed under a leaf-set scheme. The original canonical doc MUST carry subject.proofs.chunk_merkle.{scheme,algo,leaf_count,root} with scheme == "satsignal.text.paragraph_sentence.v1" and algo == "sha256". A pure-byte_exact anchor cannot be selectively disclosed under this profile; an anchorer planning a future disclosure MUST anchor the document under this profile from the start.
3. Inputs and canonicalization
Canonicalization is applied to the raw input bytes before segmentation, leaf extraction, or hashing. Every step below is forever-pinned for v1.
3.1 Encoding
Decision (forever): the input is UTF-8. Invalid UTF-8 fails closed — the disclosure builder MUST refuse to anchor a non-UTF-8 input under this profile. Rationale: pinning one encoding eliminates "which decoder did we use" as a cross-implementation drift surface.
3.2 Byte Order Mark
Decision (forever): a single leading UTF-8 BOM (0xEF 0xBB 0xBF) is stripped if present. No BOM elsewhere is recognized as a BOM. Rationale: many editors emit a BOM that is not part of the author's intended content; treating it as content would change every leaf hash for files saved through such an editor.
3.3 Empty input
Decision (forever): an input that canonicalizes to zero sentences is invalid input. The builder MUST refuse to anchor it. An empty disclosure has no leaves, no root, and no useful semantics; admitting it would force leaf_count = 0 corner cases into every verifier. Rationale: failing closed at the builder is simpler than carrying a zero-leaves edge case forever.
3.4 Line-ending normalization
Decision (forever): any of CR (0x0D), LF (0x0A), or CRLF (0x0D 0x0A) is normalized to a single LF (0x0A) before any further processing. Rationale: a contract saved on Windows and the same contract saved on Linux must produce the same leaves.
3.5 Unicode normalization
Decision (forever): the input is run through NFC (Unicode TR15 Canonical Composition) before any further processing. Rationale: a composed é (U+00E9) and a decomposed é (U+0065 U+0301) display identically and must hash identically; NFC is the single defensible choice and is documented as the canonical form by Unicode. See fixture C9 for the byte-level demonstration.
3.6 Trailing whitespace per line
Decision (forever): trailing space (0x20) and tab (0x09) characters are stripped from each line before paragraph splitting. Other whitespace categories (vertical tab, form feed, NBSP, etc.) are not stripped. Rationale: editors and CI tools routinely add or remove trailing spaces; treating them as content invites silent drift.
3.7 Trailing newline at EOF
Decision (forever): a single trailing LF (0x0A) at end-of-input is stripped if present, so an input ending in "foo\n" and one ending in "foo" produce identical canonical bytes. Interior empty lines are preserved exactly. Rationale: many editors append a final newline as a POSIX convention; treating it as content would change leaves depending on editor save behavior.
3.8 Multiple internal whitespace
Decision (forever): internal whitespace inside a sentence is not collapsed. "foo  bar" (two spaces) and "foo bar" (one space) are different sentences with different leaf hashes. Rationale: collapsing would hide deliberate spacing differences (poetry, code blocks copied into prose, intentional emphasis) that the anchorer may need to disclose as written.
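The canonicalization steps of §3 can be sketched as one function. This is an illustrative reading, not the reference code: the name canonicalize is hypothetical, and the relative order of the steps (BOM strip, strict decode, line endings, NFC, per-line trailing space/tab strip, single trailing LF strip) is an assumption, since §3 pins each step but not every pairwise ordering. The zero-sentence check of §3.3 is left to the caller because it depends on segmentation.

```python
import unicodedata

UTF8_BOM = b"\xef\xbb\xbf"


def canonicalize(raw: bytes) -> str:
    """Hypothetical sketch of section 3; step order is an assumption."""
    # 3.2: strip a single leading BOM; a BOM anywhere else stays as content.
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # 3.1: strict decode, so invalid UTF-8 raises UnicodeDecodeError (fail closed).
    text = raw.decode("utf-8", errors="strict")
    # 3.4: CRLF first, then any remaining lone CR, both become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 3.5: NFC normalization.
    text = unicodedata.normalize("NFC", text)
    # 3.6: strip only trailing U+0020 and U+0009 from each line.
    text = "\n".join(line.rstrip(" \t") for line in text.split("\n"))
    # 3.7: strip exactly one trailing LF at end of input, if present.
    if text.endswith("\n"):
        text = text[:-1]
    # 3.8: internal whitespace is deliberately left untouched.
    return text
```

Under these assumptions, the decomposed and precomposed spellings of café. canonicalize to the same six bytes, and a Windows-saved file with a BOM and trailing spaces canonicalizes to the same string as its Linux-saved counterpart.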
4. Paragraph segmentation
4.1 Paragraph splitting
Decision (forever): after canonicalization, a paragraph is a maximal run of non-empty lines separated by one or more empty lines. Empty lines are NOT paragraphs and produce no leaves. So "a\n\nb\n\n\nc" produces three paragraphs (a, b, c); the two empty lines between b and c collapse into a single boundary. Rationale: the empty-line-as-paragraph-break convention is universal in plaintext sources; this rule pins it.
4.2 Paragraph numbering
Decision (forever): paragraphs are zero-indexed in document order. The first paragraph is p0, the second p1, and so on. Rationale: zero-indexing matches the dominant programming-language convention and is ambiguity-free.
4.3 Within-paragraph line joining
Decision (forever): when a paragraph spans multiple non-empty lines (no empty line between them), those lines are joined with a single ASCII space (0x20) before sentence segmentation runs. So "a\nb" becomes the segmentation-input string "a b". Rationale: in plaintext sources a mid-paragraph \n is overwhelmingly a soft wrap, not a sentence break; joining with a space matches author intent. The \n characters do not appear in any sentence's bytes.
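A minimal sketch of §4.1 and §4.3 together, assuming the input is already canonical per §3 and that an "empty line" is a line of zero characters after the trailing space/tab strip. The function name split_paragraphs is hypothetical.

```python
def split_paragraphs(canonical: str) -> list[str]:
    """Return the joined string of each paragraph, in document order."""
    paragraphs = []
    current = []  # non-empty lines of the paragraph being collected
    for line in canonical.split("\n"):
        if line == "":
            # One or more empty lines close the current paragraph (4.1).
            if current:
                paragraphs.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

The list index of each returned string is the paragraph number N of §4.2, so the first element is p0.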
5. Sentence segmentation — the hard part
Sentence segmentation runs over each paragraph's joined string (per §4.3) independently. The rules below are forever-pinned and apply in the order written.
5.1 Terminator codepoints
Decision (forever): the following code points are the only sentence-ending characters in v1:
| Character | Code point | Name |
|---|---|---|
| . | U+002E | FULL STOP |
| ? | U+003F | QUESTION MARK |
| ! | U+0021 | EXCLAMATION MARK |
| … | U+2026 | HORIZONTAL ELLIPSIS |
| 。 | U+3002 | IDEOGRAPHIC FULL STOP |
| ？ | U+FF1F | FULLWIDTH QUESTION MARK |
| ！ | U+FF01 | FULLWIDTH EXCLAMATION MARK |
Any future addition is a v2 profile. Other punctuation (semicolons, colons, em-dashes, single dashes) does not terminate.
Rationale: this set covers Latin-script, CJK, and fullwidth Latin without admitting language-specific heuristics that would not survive contact with a new corpus.
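As a small illustration, the terminator set can be written with explicit escapes so that the fullwidth forms cannot be confused with their ASCII look-alikes in source code. The names TERMINATORS and is_terminator_codepoint are hypothetical; membership here only says a character is a candidate, since §§5.2 to 5.5 can still shield it.

```python
# The seven terminator code points of section 5.1, written as escapes.
TERMINATORS = frozenset(
    "\u002e"  # FULL STOP
    "\u003f"  # QUESTION MARK
    "\u0021"  # EXCLAMATION MARK
    "\u2026"  # HORIZONTAL ELLIPSIS
    "\u3002"  # IDEOGRAPHIC FULL STOP
    "\uff1f"  # FULLWIDTH QUESTION MARK
    "\uff01"  # FULLWIDTH EXCLAMATION MARK
)


def is_terminator_codepoint(ch: str) -> bool:
    """True if ch is one of the v1 terminator candidates."""
    return ch in TERMINATORS
```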
5.2 Abbreviation list (DO-NOT-TERMINATE)
Decision (forever): a . (U+002E) is not a terminator when it matches the forever-frozen abbreviation list below under the matching algorithm specified in §5.11 (Decision 25 — multi-dot abbreviation handling). Match is case-sensitive. The list itself is frozen:
Mr Mrs Ms Mx Dr Jr Sr St Mt Ft
vs etc e.g i.e cf viz Inc Ltd Co Corp
LLC LLP PLC GmbH S.A Pty No Vol pp p
ch sec fig eq Prof Rev Hon Capt Gen Lt
Sgt Maj Col Ave Blvd Rd Ln Ct
The carve-out applies only to .; other terminators (?, !, …, 。, ？, ！) are never affected by this list. Adding to the list or changing the matching algorithm in §5.11 requires a v2 profile.
Rationale: every plausible abbreviation must be enumerable; relying on capitalization heuristics or trained models would put the segmentation under a black-box dependency that cannot be reproduced byte-for-byte by an unaffiliated verifier. The list is small and tractable; any addition mints a new profile.
Multi-dot entries note. Three entries in the list — e.g, i.e, S.A — contain an internal . codepoint. The matching algorithm in §5.11 is the only authoritative reading of how these entries shield candidate . characters in the text being segmented; the plain-language phrasing "the word immediately preceding" used in earlier drafts of this spec is insufficient for multi-dot entries and is superseded by §5.11. See fixtures C2 / C14 (single-dot abbreviations) and C15 (multi-dot abbreviations) for byte-level demonstrations.
5.3 Decimal-number rule
Decision (forever): a . (U+002E) is not a terminator when the character immediately before it is [0-9] and the character immediately after it is [0-9]. So 3.14 does not terminate; 5. followed by a space (no digit after) does terminate. Rationale: decimal numbers in prose are unambiguous and trivially recognizable; treating the decimal point as a terminator would split $3.14 yesterday into two sentences.
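A sketch of the decimal check, assuming [0-9] means the ten ASCII digits only (so Python's str.isdigit, which also accepts other Unicode digits, is deliberately not used). The helper name is_decimal_dot is hypothetical.

```python
ASCII_DIGITS = "0123456789"


def is_decimal_dot(buf: str, idx: int) -> bool:
    """True if buf[idx] is a '.' with an ASCII digit on both sides."""
    if buf[idx] != ".":
        return False
    # A dot at either end of the buffer cannot have digits on both sides.
    if idx == 0 or idx + 1 >= len(buf):
        return False
    return buf[idx - 1] in ASCII_DIGITS and buf[idx + 1] in ASCII_DIGITS
```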
5.4 Ellipsis-with-space rule
Decision (forever): a run of three or more . (U+002E) characters terminates if the run is followed by whitespace or by end-of-paragraph. A run of three or more . followed by a non-whitespace, non-. character (a "mid-token" ellipsis) does not terminate. Only the last . of a terminating run is the terminator; intermediate dots are part of the sentence bytes ("foo..." includes all three dots).
Note that this rule and §5.3 (decimal) do not collide: a run of three or more . cannot be a decimal because the immediately-following character of the first dot would itself be ., not a digit.
Rationale: ellipsis is the only multi-codepoint terminator allowed in v1; pinning the rule to "three or more, terminator on whitespace boundary" handles both "She paused... Then she spoke." (terminating) and "She was...uncertain." (mid-token, single sentence) without heuristics.
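The run rule can be sketched as a helper that, given any dot inside a run, reports which dot (if any) is the terminator. The name ellipsis_terminator is hypothetical, and "whitespace" is assumed to mean Python str.isspace, which §5.4 does not spell out.

```python
from typing import Optional


def ellipsis_terminator(buf: str, idx: int) -> Optional[int]:
    """Index of the terminating dot of the run containing idx, else None.

    None means section 5.4 does not make this run terminate: either the
    run is shorter than three dots (other rules decide) or it is mid-token.
    """
    if buf[idx] != ".":
        return None
    # Expand to the maximal run of '.' characters around idx.
    first = idx
    while first > 0 and buf[first - 1] == ".":
        first -= 1
    last = idx
    while last + 1 < len(buf) and buf[last + 1] == ".":
        last += 1
    if last - first + 1 < 3:
        return None
    # The run terminates only at end-of-paragraph or before whitespace.
    if last + 1 == len(buf) or buf[last + 1].isspace():
        return last
    return None
```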
5.5 Quoted-speech rule
Decision (forever): a terminator that sits inside a quoted span does NOT end the sentence. The sentence ends at the next terminator after the closing quote (under all other rules in this section). So "\"Did you go?\" she asked." is ONE sentence ending at the . after asked, not two.
Recognized quote pairs (forever list):
| Open | Close | Codepoints |
|---|---|---|
| " | " | U+0022 (self-paired) |
| ' | ' | U+0027 (self-paired) |
| “ | ” | U+201C / U+201D |
| ‘ | ’ | U+2018 / U+2019 |
| « | » | U+00AB / U+00BB |
| 「 | 」 | U+300C / U+300D |
Quote depth is tracked with a simple paired-quote counter: incrementing on an opener and decrementing on the matching closer. While depth > 0, terminator codepoints encountered are passed through as sentence bytes without terminating. The two ASCII straight quotes (U+0022, U+0027) are self-paired — they alternate open/close based on current depth at that level (depth 0 → opener; depth 1 of the same self-paired character → closer). A typographic-pair quote (U+201C/U+201D, etc.) only closes a span opened by its mate.
Mismatched / unclosed quotes (e.g. "Hello.) leave depth > 0 to end-of-paragraph; the paragraph then terminates per §5.10 with its trailing content as a single sentence. Rationale: this is a deterministic fallback that does not require backtracking and does not introduce heuristics.
Rationale: prose with embedded direct speech is the single most common adversarial case; the simple paired-depth tracker handles it without invoking grammar.
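One possible reading of the depth tracker, offered as a sketch rather than the pinned algorithm: each self-paired quote toggles its own open/closed state, each typographic pair keeps its own counter, and the overall depth is the sum. Per-type tracking (instead of a single nested stack) is an assumption, and the name quote_depths is hypothetical.

```python
SELF_PAIRED = ('"', "'")
PAIRS = {"\u201c": "\u201d", "\u2018": "\u2019", "\u00ab": "\u00bb", "\u300c": "\u300d"}
OPENER_OF = {close: opener for opener, close in PAIRS.items()}


def quote_depths(buf: str) -> list[int]:
    """depths[i] is the quote depth in effect after processing buf[i]."""
    self_open = {q: False for q in SELF_PAIRED}
    pair_open = {opener: 0 for opener in PAIRS}
    depths = []
    for ch in buf:
        if ch in self_open:
            # Straight quotes alternate between opener and closer.
            self_open[ch] = not self_open[ch]
        elif ch in pair_open:
            pair_open[ch] += 1
        elif ch in OPENER_OF and pair_open[OPENER_OF[ch]] > 0:
            # A closer only closes a span opened by its own mate.
            pair_open[OPENER_OF[ch]] -= 1
        depths.append(sum(self_open.values()) + sum(pair_open.values()))
    return depths
```

A terminator at position i passes through as sentence bytes whenever depths[i] > 0. With an unclosed quote the depth never returns to zero, which yields the end-of-paragraph fallback described above.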
5.6 Trailing whitespace after a terminator
Decision (forever): a terminator that is followed by zero, one, or many whitespace characters and then either end-of-paragraph or the start of the next sentence terminates a sentence under the rules of §§5.1–5.5. Capitalization of the next word is NOT used as a heuristic. A terminator followed by whitespace and a lowercase letter still terminates; the abbreviation list of §5.2 is the only carve-out for ambiguous cases. Rationale: capitalization heuristics are language-specific and produce non-reproducible behavior on mixed-script and informal text; ruling them out keeps the segmentation byte-deterministic.
5.7 Sentence-value definition
Decision (forever): a sentence's canonical bytes are the UTF-8 bytes from the first non-whitespace character after the previous terminator (or paragraph start) through and INCLUDING the terminator character itself. Trailing whitespace after the terminator is in no sentence's bytes.
So "Hello. World!" produces two sentences:
- p0/s0 = "Hello." (6 bytes)
- p0/s1 = "World!" (6 bytes)
The single space between them is in neither sentence's bytes.
Rationale: this rule makes the leaf bytes a contiguous substring of the post-canonicalization paragraph; it is reproducible without any state beyond the terminator position.
5.8 Sentence numbering
Decision (forever): sentences are zero-indexed within their paragraph. The first sentence of paragraph p3 is p3/s0. Rationale: matches §4.2.
5.9 Paragraph with no terminators
Decision (forever): if canonicalization produces a paragraph that contains no terminator codepoints AND is non-empty (a fragment without final punctuation), the paragraph is treated as ONE sentence whose value is the full paragraph bytes (no synthetic terminator is added). See fixture C13.
So "This has no terminator" produces:
- p0/s0 = "This has no terminator" (22 bytes, no trailing dot)
Rationale: refusing to anchor unterminated fragments would block real documents (headers, titles, bullet lists); synthesizing a terminator would create bytes the anchorer never wrote. Accepting the fragment as-is avoids both bad options.
5.10 Trailing content after the last terminator
Decision (forever): if a paragraph contains terminators and is then followed by non-empty content after the final terminator (e.g. "First. Second" — no trailing dot), the trailing content is emitted as an additional sentence with no terminator in its bytes. The rule of §5.9 generalizes: any contiguous post-terminator non-empty content is a sentence whose value is exactly those bytes. Rationale: identical reasoning to §5.9; refusing to emit it loses content the anchorer wrote.
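Given the indices of the terminators that survive §§5.2 to 5.5, the value rules of §5.7, §5.9 and §5.10 reduce to slicing. The sketch below assumes those indices are computed elsewhere and passed in sorted order. It also assumes leading whitespace means str.isspace characters and that a whitespace-only remainder emits no sentence. The name slice_sentences is hypothetical.

```python
def slice_sentences(buf: str, terminators: list[int]) -> list[str]:
    """Cut a joined paragraph string into sentence values."""
    sentences = []
    pos = 0
    for t in terminators:
        # 5.7: from the first non-whitespace character through the terminator.
        value = buf[pos : t + 1].lstrip()
        if value:
            sentences.append(value)
        pos = t + 1
    # 5.9 / 5.10: trailing content without a terminator is its own sentence.
    tail = buf[pos:].lstrip()
    if tail:
        sentences.append(tail)
    return sentences
```

The list index of each value is the sentence number M of §5.8.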
5.11 Decision 25 — Multi-dot abbreviation handling
Decision (forever). A candidate . (U+002E) at position idx within a paragraph string is shielded from being a sentence terminator (per §5.2) iff there exists an entry a in the abbreviation list of §5.2 such that
a == buf[start : start + len(a)] AND start <= idx <= start + len(a)
where start is the left boundary of the dotted-token that contains idx. The dotted-token is the maximal run of non-(whitespace | quote | bracket) characters around idx, defined precisely as follows:
- A character c is a dotted-token boundary iff c.isspace() is true (Python str.isspace semantics on the post-canonicalization NFC string; covers U+0020, U+0009, U+00A0, etc.), OR c is a recognized quote opener or closer per §5.5 (one of ", ', “, ”, ‘, ’, «, », 「, 」), OR c is one of (, ), [, ], {, } (ASCII brackets).
- start is the smallest index s ≤ idx such that every character in buf[s : idx] is not a dotted-token boundary.
The bracket characters ()[]{} are admitted as dotted-token boundaries to keep the rule sensible for prose patterns such as "(e.g. red ones)" (see EX4 below). None of the forever-frozen abbreviation entries in §5.2 contains a bracket, a quote, or a whitespace character, so widening the boundary set this way cannot make a previously-recognized abbreviation stop matching.
Coverage of the two failure modes the rule corrects. The algorithm above handles both internal and trailing dots of multi-dot abbreviations:
- Internal . of a multi-dot abbreviation (e.g. the first . of e.g., between e and g): the entry e.g satisfies buf[start : start + 3] == "e.g" and start <= idx < start + 3, so the candidate . is shielded.
- Trailing . of a multi-dot abbreviation (e.g. the second . of e.g., after g): the entry e.g again satisfies buf[start : start + 3] == "e.g", and the candidate . sits at idx == start + 3, so the candidate . is shielded.
Single-dot abbreviations (Mr, vs, …) are a degenerate case of the same rule: len(a) == 2, idx == start + 2, and the candidate . is the dot immediately after the abbreviation. This matches the prior behavior of §5.2 on single-dot entries (fixtures C2, C14).
Helper name (normative). The reference Python in §11.16 names this helper _dotted_token_match. It supersedes the prior _preceding_word helper, which examined only buf[start : idx] (the prefix ending strictly at idx) and therefore could not recognize an abbreviation whose internal . lies at the candidate position. _preceding_word is removed from v1.
Invariant. _dotted_token_match consults only:
- the position idx,
- the paragraph buffer buf (post-canonicalization NFC bytes),
- the dotted-token boundary character set defined above, and
- the forever-frozen abbreviation list of §5.2.
A future profile version (v2) is required to change the abbreviation list, the dotted-token boundary character set, or the matching algorithm. None of those three can be changed under v1.
Worked examples (forever-pinned). The corrected algorithm produces, for the inputs below, exactly the segmentation shown. These four examples are normative test cases for any verifier implementation; they are demonstrated in §11.15 (fixture C15) and in the prose worked-out below:
| Example | Input | Result |
|---|---|---|
| EX1 | He said e.g. before. And i.e. after. | 2 sentences: He said e.g. before. · And i.e. after. |
| EX2 | It's e.g. a test. | 1 sentence: It's e.g. a test. |
| EX3 | S.A. de C.V. holdings expanded. | 3 sentences: S.A. de C. · V. · holdings expanded. |
| EX4 | Apples (e.g. red ones) are good. | 1 sentence: Apples (e.g. red ones) are good. |
The S.A. de C.V. case is non-obvious and is deliberately pinned here as a forever contract. The reasoning, walked through:
- S.A is on the abbreviation list; C.V is not (no addition is permitted under v1 per §5.2). Per §5.6 (Decision 17), next-word capitalization is not used as a heuristic; the lowercase next word de after S.A. does not influence the decision.
- The . immediately after S (the first dot of S.A.) is shielded because the dotted-token starting at S matches the entry S.A and the candidate . lies inside that match's span.
- The . immediately after A (the trailing dot of S.A.) is shielded because the dotted-token still matches S.A and the candidate . lies at idx == start + 3 == start + len("S.A").
- The . immediately after C is not shielded: the dotted-token starting at C is C.V., and no abbreviation entry in §5.2 is a prefix of C.V. (in particular, neither C nor C.V is in the list). So this . does terminate.
- The . immediately after V is not shielded: the dotted-token starting at V is V., and V is not in the list. So this . also terminates.
Result: three sentences. This is the segmentation v1 produces; documents that need a different segmentation around C.V must wait for a future profile that extends the list or the matching algorithm.
For Apples (e.g. red ones) are good., the open and close parens are dotted-token boundaries per the rule above; the e.g. dotted-token is bounded by ( on the left and a space on the right. The matching algorithm shields both dots of e.g. as in EX1, so the whole input is one sentence.
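The matching rule of this section can be sketched directly from its definition. This is an independent illustration written from the prose, not the reference helper of §11.16; the function name dotted_token_match and the constant names are hypothetical, and the abbreviation set is transcribed from §5.2.

```python
ABBREVIATIONS = frozenset(
    "Mr Mrs Ms Mx Dr Jr Sr St Mt Ft vs etc e.g i.e cf viz Inc Ltd Co Corp "
    "LLC LLP PLC GmbH S.A Pty No Vol pp p ch sec fig eq Prof Rev Hon Capt "
    "Gen Lt Sgt Maj Col Ave Blvd Rd Ln Ct".split()
)
# Quote characters of section 5.5 plus the ASCII brackets.
BOUNDARY_CHARS = frozenset("\"'\u201c\u201d\u2018\u2019\u00ab\u00bb\u300c\u300d()[]{}")


def is_boundary(c: str) -> bool:
    return c.isspace() or c in BOUNDARY_CHARS


def dotted_token_match(buf: str, idx: int) -> bool:
    """True if the '.' at buf[idx] is shielded by an abbreviation entry."""
    if buf[idx] != ".":
        return False
    # Walk left to the start of the dotted-token containing idx.
    start = idx
    while start > 0 and not is_boundary(buf[start - 1]):
        start -= 1
    # Shielded iff some entry is a prefix of the token and its span,
    # extended by one position for the trailing dot, covers idx.
    return any(
        buf.startswith(a, start) and start <= idx <= start + len(a)
        for a in ABBREVIATIONS
    )
```

Applied to the S.A. de C.V. input, this sketch shields only the two dots of S.A. and leaves the dots after C and V as terminators, in line with the walk-through above.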
6. Leaf id construction
Decision (forever): each sentence's leaf_id is the literal string p<N>/s<M> where N is the zero-indexed paragraph number and M is the zero-indexed sentence number within that paragraph. Neither number is zero-padded.
Examples:
- p0/s0: first sentence of the first paragraph
- p3/s4: fifth sentence of the fourth paragraph
- p12/s0: first sentence of the thirteenth paragraph
Rationale: this is the address every fixture uses; the bare integers keep the leaf_id short and human-readable.
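As a brief illustration, leaf ids fall out of plain enumeration over the segmented document. The sketch assumes the document is already segmented into a list of paragraphs, each a list of sentence values; the name enumerate_leaves is hypothetical.

```python
def enumerate_leaves(paragraphs: list[list[str]]) -> list[tuple[str, str]]:
    """Pair every sentence value with its p<N>/s<M> leaf_id, in document order."""
    leaves = []
    for n, sentences in enumerate(paragraphs):
        for m, value in enumerate(sentences):
            # Bare decimal integers, no zero padding.
            leaves.append((f"p{n}/s{m}", value))
    return leaves
```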
7. Leaf-hash preimage
Decision (forever): the leaf-hash preimage is built from four byte strings separated by single 0x00 bytes, in this exact order:
leaf_hash = SHA-256(
profile_literal_utf8 // "satsignal.text.paragraph_sentence.v1" as UTF-8
|| 0x00 // separator
|| leaf_id_utf8 // e.g. "p3/s0" as UTF-8
|| 0x00 // separator
|| sentence_value_utf8 // NFC-normalized sentence bytes,
// including the terminator
// character itself
|| 0x00 // separator
|| salt_bytes // base64-decoded RAW BYTES from salt_b64
// (NOT the base64 ASCII)
)
The output is the raw 32-byte SHA-256 digest; on the wire it appears as 64 lowercase hex characters per disclosure-v1.md §3.4.
7.1 Worked example
Take p0/s0 of fixture C1 (full fixture in §11):
- profile_literal_utf8 = 73 61 74 73 69 67 6e 61 6c 2e 74 65 78 74 2e 70 61 72 61 67 72 61 70 68 5f 73 65 6e 74 65 6e 63 65 2e 76 31 (36 bytes)
- separator = 00 (1 byte)
- leaf_id_utf8 = 70 30 2f 73 30 (5 bytes; "p0/s0")
- separator = 00 (1 byte)
- sentence_value_utf8 = 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e (12 bytes; "Hello world.")
- separator = 00 (1 byte)
- salt_bytes = 43 31 2d 73 61 6c 74 2d 61 61 61 61 61 61 61 61 (16 bytes; the raw bytes whose base64 is QzEtc2FsdC1hYWFhYWFhYQ==)
Total preimage length: 72 bytes. SHA-256 of those bytes is the C1 p0/s0 leaf_hash:
a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae
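The preimage layout can be sketched with the standard library. The function names leaf_preimage and leaf_hash are hypothetical, and the sketch assumes the sentence value passed in is already canonical (NFC, terminator included).

```python
import base64
import hashlib

PROFILE_LITERAL = "satsignal.text.paragraph_sentence.v1"


def leaf_preimage(leaf_id: str, value: str, salt_b64: str) -> bytes:
    """profile || 0x00 || leaf_id || 0x00 || value || 0x00 || raw salt bytes."""
    # The salt is hashed as decoded raw bytes, never as base64 ASCII.
    salt = base64.b64decode(salt_b64, validate=True)
    return b"\x00".join(
        [
            PROFILE_LITERAL.encode("utf-8"),
            leaf_id.encode("utf-8"),
            value.encode("utf-8"),
            salt,
        ]
    )


def leaf_hash(leaf_id: str, value: str, salt_b64: str) -> str:
    """Lowercase hex SHA-256 of the preimage."""
    return hashlib.sha256(leaf_preimage(leaf_id, value, salt_b64)).hexdigest()
```

For the C1 values, leaf_preimage("p0/s0", "Hello world.", "QzEtc2FsdC1hYWFhYWFhYQ==") is 72 bytes long, matching the total above; an implementation that agrees with the reference code should then reproduce the listed digest.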
7.2 Pin-points
- The separators are single 0x00 bytes, not strings of nulls or any other delimiter. Triple or zero separators are malformed.
- The salt_bytes are the raw decoded salt, not the base64 text. A verifier that hashes the base64 ASCII will produce a different (wrong) digest.
- profile_literal_utf8 is the literal string satsignal.text.paragraph_sentence.v1 encoded as UTF-8: 36 ASCII bytes. It is inside every leaf hash; this is what makes the profile literal a forever-contract.
- sentence_value_utf8 is the NFC-normalized canonical bytes per §3.5, including the trailing terminator codepoint (per §5.7).
Cross-profile consistency: the preimage shape above (profile || 0x00 || leaf_id || 0x00 || value || 0x00 || salt) is the same layout used by sibling profiles satsignal.csv.row.v1 and satsignal.json.field.v1. A verifier implementing all three shares the same outer hashing routine; only the value-canonicalization rule differs per profile.
8. Salts
8.1 Salt size
Decision (forever): 16 raw bytes per leaf. Rationale: 128 bits of entropy is sufficient to make per-leaf candidate-value attacks infeasible while keeping the per-leaf overhead small.
8.2 Salt uniqueness and persistence
Decision (forever): each leaf's salt is generated by a CSPRNG and is unique per leaf (no two leaves in a single document share a salt). The anchorer MUST persist the salt off-chain alongside the sentence value: without the salt, the disclosure cannot be regenerated and the merkle proof cannot be reconstructed. Rationale: the salt is what prevents an attacker who has the anchored root and a candidate value from confirming whether the candidate matches an undisclosed leaf — without the salt, they cannot recompute the leaf hash. Salts are the privacy primitive of selective disclosure under this profile.
The salt is transported on the wire (when a leaf is disclosed) as salt_b64 per disclosure-v1.md §3.4: base64 of the raw 16 bytes. Undisclosed leaves' salts are NEVER transmitted.
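A sketch of per-leaf salt generation using Python's secrets module as the CSPRNG. The function name new_salts and the returned mapping shape are hypothetical; how the anchorer persists the result off-chain is outside this sketch.

```python
import base64
import secrets

SALT_LEN = 16  # raw bytes per leaf (section 8.1)


def new_salts(leaf_ids: list[str]) -> dict[str, str]:
    """Map each leaf_id to a fresh salt, base64-encoded as salt_b64."""
    seen = set()
    salts = {}
    for leaf_id in leaf_ids:
        salt = secrets.token_bytes(SALT_LEN)
        # A collision is astronomically unlikely; the loop makes the
        # per-document uniqueness requirement explicit anyway.
        while salt in seen:
            salt = secrets.token_bytes(SALT_LEN)
        seen.add(salt)
        salts[leaf_id] = base64.b64encode(salt).decode("ascii")
    return salts
```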
9. Merkle behavior (cross-reference)
The merkle-tree construction, proof-path encoding, single-leaf-tree rule, odd-node promote-unchanged rule, and proof-walk algorithm are defined once in disclosure-v1.md §3.4. This profile does not re-derive them.
Profile-specific notes:
- Leaf ordering (per disclosure-v1.md §3.4 invariant 4): leaves appear in document order (p0/s0, p0/s1, …, p0/sK, p1/s0, p1/s1, …, pN/sM). The verifier does NOT re-sort; the disclosure carries leaves in document order.
- Hash algorithm: SHA-256, 64-character lowercase hex on the wire, raw-bytes concatenation at proof-walk time (the disclosure spec's hex-vs-bytes invariant 1 applies verbatim).
10. Original anchor binding
When the original document is anchored under this profile, the canonical doc's subject.proofs.chunk_merkle block MUST carry:
| Field | v1 value |
|---|---|
| scheme | "satsignal.text.paragraph_sentence.v1" |
| algo | "sha256" |
| leaf_count | (positive integer; number of sentence leaves) |
| root | (64-char lowercase hex; merkle root of leaf-set) |
A disclosure bundle's manifest.disclosure.linked_anchor.subject_profile MUST equal "satsignal.text.paragraph_sentence.v1", and every revealed[i].profile in the same block MUST equal that literal. A verifier that finds a mismatch fails closed per disclosure-v1.md §7 with profile_mismatch.
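The binding checks above can be sketched as two small predicates. The function names are hypothetical, and the profile values are passed in directly rather than read from a manifest, because the full manifest shape lives in disclosure-v1.md and is not reproduced here. Only the profile_mismatch code comes from the text.

```python
import re

PROFILE_LITERAL = "satsignal.text.paragraph_sentence.v1"
HEX64 = re.compile(r"[0-9a-f]{64}")


def chunk_merkle_ok(block: dict) -> bool:
    """Check the four chunk_merkle fields against their v1 values."""
    leaf_count = block.get("leaf_count")
    root = block.get("root")
    return (
        block.get("scheme") == PROFILE_LITERAL
        and block.get("algo") == "sha256"
        # bool is a subclass of int in Python, so exclude it explicitly.
        and isinstance(leaf_count, int)
        and not isinstance(leaf_count, bool)
        and leaf_count > 0
        and isinstance(root, str)
        and HEX64.fullmatch(root) is not None
    )


def profile_error(subject_profile: str, revealed_profiles: list[str]):
    """Return "profile_mismatch" if any profile differs from the literal, else None."""
    if subject_profile != PROFILE_LITERAL:
        return "profile_mismatch"
    if any(p != PROFILE_LITERAL for p in revealed_profiles):
        return "profile_mismatch"
    return None
```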
Sealed-mode anchors (algo: "merkle-hmac-sha256") are NOT supported by satsignal.disclosure.v1 (per disclosure-v1.md §4 step 5); this profile inherits that restriction. Sealed-mode sentence disclosure is deferred to a future minor of the disclosure spec or to a per-profile sealed addendum.
11. Fixtures
Each fixture below shows: the input bytes (hex), the canonical form after §3, the paragraph and sentence count, every (leaf_id, value) pair, the per-leaf salt_b64 and computed leaf_hash, and (where indicated) the merkle root and proof paths. All hashes are real SHA-256 outputs computed against the algorithm described in §§3–8; they are not placeholders. The exact reference Python that produced them is in §11.16.
11.1 C1 — minimal
- Input bytes (hex): 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 20 47 6f 6f 64 62 79 65 20 77 6f 72 6c 64 2e 0a
- Canonical form: "Hello world. Goodbye world."
- Paragraphs: 1 · Sentences: 2

| leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
| p0/s0 | Hello world. | QzEtc2FsdC1hYWFhYWFhYQ== | a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae |
| p0/s1 | Goodbye world. | QzEtc2FsdC1iYmJiYmJiYg== | 1bff79cb195d06b2602dc44ce1f7f8e4fc854e621d20954f7cf4fc47ad8e91e5 |
Merkle root: ec0f6274cddd13fa394c0ba8f024b8b23184ccc946bc4cd01130d72cb5659094
Full merkle tree. Two leaves, one inner level (the root).
root = ec0f6274cddd13fa394c0ba8f024b8b23184ccc946bc4cd01130d72cb5659094
= SHA-256( leaf[p0/s0] || leaf[p0/s1] )
/ \
leaf[p0/s0] = a0a9...08aae leaf[p0/s1] = 1bff...91e5
proof_path entries:
- For p0/s0: [{"side":"R","hash":"1bff79cb195d06b2602dc44ce1f7f8e4fc854e621d20954f7cf4fc47ad8e91e5"}]
- For p0/s1: [{"side":"L","hash":"a0a9a9edaa1638eeb4122f3a295afa8138b8577364e8d921092b9c9f89a08aae"}]
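A proof walk over entries of this shape can be sketched as follows. The authoritative algorithm is in disclosure-v1.md §3.4; this sketch only assumes what the C1 tree shows, namely that side "R" means the sibling is concatenated on the right, side "L" on the left, and that hashes are concatenated as raw bytes. The name walk_proof is hypothetical.

```python
import hashlib


def walk_proof(leaf_hash_hex: str, proof_path: list[dict]) -> str:
    """Fold a leaf hash up through its proof_path and return the root as hex."""
    current = bytes.fromhex(leaf_hash_hex)
    for step in proof_path:
        sibling = bytes.fromhex(step["hash"])
        if step["side"] == "R":
            # Sibling sits to the right of the running hash.
            current = hashlib.sha256(current + sibling).digest()
        elif step["side"] == "L":
            # Sibling sits to the left of the running hash.
            current = hashlib.sha256(sibling + current).digest()
        else:
            raise ValueError(f"unknown side: {step['side']!r}")
    return current.hex()
```

Walking either C1 leaf with its proof_path entry yields the same value, SHA-256(leaf[p0/s0] || leaf[p0/s1]); a verifier would compare that result against the anchored root.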
11.2 C2 — abbreviation (Mr.)
- Input bytes (hex): 4d 72 2e 20 53 6d 69 74 68 20 77 65 6e 74 20 68 6f 6d 65 2e 20 48 65 20 6c 65 66 74 20 61 74 20 35 2e 0a
- Canonical form: "Mr. Smith went home. He left at 5."
- Paragraphs: 1 · Sentences: 2 (NOT 3: the Mr. does not terminate per §5.2)

| leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
| p0/s0 | Mr. Smith went home. | QzItc2FsdC1jY2NjY2NjYw== | 57ade39c221efb0830664a597df9ad7a1ac2f2d23859f79d15eb8bd127219419 |
| p0/s1 | He left at 5. | QzItc2FsdC1kZGRkZGRkZA== | 0f014458d7fe1eb748812f820306f2333de1c29f75d7a388bda4279cfb386588 |
Merkle root: 8a8a7caa7ff601d2b063ea19b151c96f842730271dbbadb1f22af29de9a86591
Full merkle tree.
root = 8a8a7caa7ff601d2b063ea19b151c96f842730271dbbadb1f22af29de9a86591
= SHA-256( leaf[p0/s0] || leaf[p0/s1] )
/ \
leaf[p0/s0] = 57ad...9419 leaf[p0/s1] = 0f01...6588
proof_path entries:
- For p0/s0: [{"side":"R","hash":"0f014458d7fe1eb748812f820306f2333de1c29f75d7a388bda4279cfb386588"}]
- For p0/s1: [{"side":"L","hash":"57ade39c221efb0830664a597df9ad7a1ac2f2d23859f79d15eb8bd127219419"}]
11.3 C3 — decimal ($3.14)
- Input bytes (hex): 54 68 65 20 70 72 69 63 65 20 77 61 73 20 24 33 2e 31 34 20 79 65 73 74 65 72 64 61 79 2e 20 49 74 20 63 68 61 6e 67 65 64 20 74 6f 64 61 79 2e 0a
- Canonical form: "The price was $3.14 yesterday. It changed today."
- Paragraphs: 1 · Sentences: 2 (NOT 3: 3.14 does not split per §5.3)

| leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
| p0/s0 | The price was $3.14 yesterday. | QzMtc2FsdC1lZWVlZWVlZQ== | 24734ad67b1bb468d8ab550afd8f85c8a95acdc5dbbbde387485568a39ad6dbe |
| p0/s1 | It changed today. | QzMtc2FsdC1mZmZmZmZmZg== | 776c91d98ebf0202c4562bca2e72ed44aed8b5e357eb6773d7eed188678b0b48 |
11.4 C4 — quoted speech ("Did you go?" she asked.)
- Input bytes (hex): 22 44 69 64 20 79 6f 75 20 67 6f 3f 22 20 73 68 65 20 61 73 6b 65 64 2e 20 48 65 20 6e 6f 64 64 65 64 2e 0a
- Canonical form: "\"Did you go?\" she asked. He nodded."
- Paragraphs: 1 · Sentences: 2 (NOT 3: the ? inside quotes does not terminate per §5.5)

| leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
| p0/s0 | "Did you go?" she asked. | QzQtc2FsdC1nZ2dnZ2dnZw== | 7a4f31c93ad048e5be7e8f65ac67e4a351b6fbbf30e99b789b3305999906420c |
| p0/s1 | He nodded. | QzQtc2FsdC1oaGhoaGhoaA== | c900b0ef421c8c97af2ee820cb9154945d34b1ef25328dd3291bed496c63ef0f |
11.5 C5 — ellipsis with space
- Input bytes (hex): 53 68 65 20 70 61 75 73 65 64 2e 2e 2e 20 54 68 65 6e 20 73 68 65 20 73 70 6f 6b 65 2e 0a
- Canonical form: "She paused... Then she spoke."
- Paragraphs: 1 · Sentences: 2 (the ... followed by space terminates per §5.4)

| leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
| p0/s0 | She paused... | QzUtc2FsdC1paWlpaWlpaQ== | f66863a8bbde2e3c1d67e540ce6174d4a074fd2cd4bb4cfed9f13ee33f0429e4 |
| p0/s1 | Then she spoke. | QzUtc2FsdC1qampqampqag== | 31737452546fa568667abce83ea34cbfd6d64900bf3541ac5f40a44b6106cd06 |
11.6 C6 — mid-token ellipsis (deviation flag)
The brief's literal example was "She was... uncertain.\n" (with a space after the ellipsis) labelled as one sentence, but §5.4 makes ... followed by whitespace a terminator (consistent with C5). To keep the spec rules pure and to exercise the mid-token carve-out as written, this fixture uses "She was...uncertain.\n" (no space after the ellipsis). The deviation is noted in the worker report.
- Input bytes (hex): 53 68 65 20 77 61 73 2e 2e 2e 75 6e 63 65 72 74 61 69 6e 2e 0a
- Canonical form: "She was...uncertain."
- Paragraphs: 1 · Sentences: 1 (the ... is mid-token, with no whitespace immediately after, so it does not terminate per §5.4)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | She was...uncertain. | QzYtc2FsdC1ra2tra2traw== | 436d353f3f204360122841244c4a4ec3ea9443b9288d3ec0fc868eb194361d25 |
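The contrast between C5 and C6 comes down to what follows the last dot of the run. The sketch below isolates that check; `ellipsis_terminates` is a hypothetical name, and the logic is a restatement of the ellipsis branch of the §11.16 reference code under the assumption that a run of three or more ASCII dots counts as an ellipsis.

```python
def ellipsis_terminates(p: str, i: int) -> bool:
    """Decide whether the '.' at p[i], inside a run of 3+ dots, ends a sentence."""
    if p[i] != ".":
        raise ValueError("p[i] is not a '.'")
    lo = i
    while lo > 0 and p[lo - 1] == ".":
        lo -= 1                                # walk to the first dot of the run
    hi = i
    while hi + 1 < len(p) and p[hi + 1] == ".":
        hi += 1                                # walk to the last dot of the run
    if hi - lo + 1 < 3:
        raise ValueError("p[i] is not part of an ellipsis run")
    if i != hi:
        return False                           # only the last dot may terminate
    after = p[hi + 1] if hi + 1 < len(p) else ""
    return after == "" or after.isspace()      # end of paragraph or whitespace


c5 = "She paused... Then she spoke."
c6 = "She was...uncertain."
last5 = c5.index("...") + 2
last6 = c6.index("...") + 2
print(ellipsis_terminates(c5, last5), ellipsis_terminates(c6, last6))
```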
11.7 C7 — CJK (こんにちは。さようなら。)
- Input bytes (hex): e3 81 93 e3 82 93 e3 81 ab e3 81 a1 e3 81 af e3 80 82 e3 81 95 e3 82 88 e3 81 86 e3 81 aa e3 82 89 e3 80 82 0a
- Canonical form: "こんにちは。さようなら。"
- Paragraphs: 1 · Sentences: 2 (U+3002 IDEOGRAPHIC FULL STOP terminates per §5.1)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | こんにちは。 | Qzctc2FsdC1sbGxsbGxsbA== | b09ff7dfda7aff38ec1e1e37ec1fe6cef2f2316bc679b15223a406cb9f61998d |
p0/s1 | さようなら。 | Qzctc2FsdC1tbW1tbW1tbQ== | 61c4071b343a0c45366f3eb84a83d8a37ffc8ad213e3b42ea825ea0d0dfd5960 |
11.8 C8 — smart quotes (“Hello!” she said.)
- Input bytes (hex): e2 80 9c 48 65 6c 6c 6f 21 e2 80 9d 20 73 68 65 20 73 61 69 64 2e 0a
- Canonical form: "“Hello!” she said." (using U+201C and U+201D)
- Paragraphs: 1 · Sentences: 1 (the ! inside the typographic-pair quotes does not terminate per §5.5)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | “Hello!” she said. | Qzgtc2FsdC1ubm5ubm5ubg== | 24137c38c612d038244f25d7434e464a9f68b4c79ffa4825e943e47036c8ae78 |
11.9 C9 — NFC equivalence (café.)
This fixture demonstrates that NFC normalization (§3.5) collapses the two byte-distinct representations of café into the same canonical form, yielding identical leaf_hash values.
| Variant | Input bytes (hex) | Canonical bytes (hex) |
|---|---|---|
| C9a | 63 61 66 c3 a9 2e 0a (precomposed é = U+00E9) | 63 61 66 c3 a9 2e |
| C9b | 63 61 66 65 cc 81 2e 0a (decomposed e + U+0301) | 63 61 66 c3 a9 2e |
Both variants canonicalize to the same 6 bytes (café.). With the same salt_b64 (Qzktc2FsdC1vb29vb29vbw== — raw bytes 43 39 2d 73 61 6c 74 2d 6f 6f 6f 6f 6f 6f 6f 6f), both variants produce the identical leaf_hash:
e604850e3138f48df7d5f1858d316500904b89f4f7949446bd42d8faa4b054b4
This is the byte-level demonstration that §3.5's NFC requirement makes the profile editor-agnostic for accented Latin text.
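The equivalence can be checked directly with the standard library. This is a minimal sketch that applies only NFC and the trailing-LF strip, which are the only canonicalization steps these two inputs exercise; `canonical_bytes` is an illustrative name, not a spec-defined function.

```python
import unicodedata

# The two input byte sequences from the C9 table.
c9a = bytes.fromhex("63 61 66 c3 a9 2e 0a")       # precomposed U+00E9
c9b = bytes.fromhex("63 61 66 65 cc 81 2e 0a")    # 'e' + combining U+0301


def canonical_bytes(raw: bytes) -> bytes:
    """Decode, NFC-normalize, drop one trailing LF, re-encode as UTF-8."""
    text = unicodedata.normalize("NFC", raw.decode("utf-8"))
    if text.endswith("\n"):
        text = text[:-1]
    return text.encode("utf-8")


print(canonical_bytes(c9a).hex(" "))
print(canonical_bytes(c9b).hex(" "))
```

Because the canonical bytes are equal, any leaf hash computed over them with the same leaf_id and salt is necessarily equal too.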
11.10 C10 — multi-paragraph (full merkle tree)
- Input bytes (hex): 46 69 72 73 74 20 70 61 72 61 2e 20 53 65 63 6f 6e 64 20 73 65 6e 74 65 6e 63 65 20 68 65 72 65 2e 0a 0a 53 65 63 6f 6e 64 20 70 61 72 61 2e 20 57 69 74 68 20 74 77 6f 20 73 65 6e 74 65 6e 63 65 73 2e 0a
- Canonical form: "First para. Second sentence here.\n\nSecond para. With two sentences."
- Paragraphs: 2 · Sentences: 4
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | First para. | QzEwLXNhbHQtYWFhYWFhYQ== | 1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f |
p0/s1 | Second sentence here. | QzEwLXNhbHQtYmJiYmJiYg== | d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999 |
p1/s0 | Second para. | QzEwLXNhbHQtY2NjY2NjYw== | 63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc |
p1/s1 | With two sentences. | QzEwLXNhbHQtZGRkZGRkZA== | ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e |
Merkle root: b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2
Full merkle tree. Four leaves, two inner levels.
root = b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2 = SHA-256( I_L || I_R )
- I_L = a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc = SHA-256( leaf[p0/s0] || leaf[p0/s1] )
  - leaf[p0/s0] = 1801...947f
  - leaf[p0/s1] = d3d3...8999
- I_R = dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734 = SHA-256( leaf[p1/s0] || leaf[p1/s1] )
  - leaf[p1/s0] = 63ac...5dfc
  - leaf[p1/s1] = ff14...c05e
proof_path entries:
- For p0/s0: [{"side":"R","hash":"d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999"}, {"side":"R","hash":"dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734"}]
- For p0/s1: [{"side":"L","hash":"1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f"}, {"side":"R","hash":"dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734"}]
- For p1/s0: [{"side":"R","hash":"ff147d6f08197312d28f2450f7fe4d8ce22600ba95a9ae26e207b09bd915c05e"}, {"side":"L","hash":"a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc"}]
- For p1/s1: [{"side":"L","hash":"63ac8e0c1c571079530066a4e09abc63dfaa435385997c610080173b22755dfc"}, {"side":"L","hash":"a9dda11246c772a016a8789f45bfd594b9913ff0721c2d1f53ec9a3ac6ffe0fc"}]
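A verifier folds a proof_path back up to the root. The sketch below assumes, based on the C10 entries, that "side" names the position of the sibling relative to the running hash ("R" means the sibling is concatenated on the right). The function name `fold_proof` is hypothetical; the authoritative verifier contract lives in disclosure-v1.md.

```python
import hashlib


def fold_proof(leaf_hash_hex: str, proof_path: list[dict]) -> str:
    """Fold a leaf hash through its proof_path and return the implied root (hex)."""
    cur = bytes.fromhex(leaf_hash_hex)
    for step in proof_path:
        sibling = bytes.fromhex(step["hash"])
        if step["side"] == "R":
            cur = hashlib.sha256(cur + sibling).digest()   # sibling on the right
        elif step["side"] == "L":
            cur = hashlib.sha256(sibling + cur).digest()   # sibling on the left
        else:
            raise ValueError(f"unknown side: {step['side']!r}")
    return cur.hex()


# C10 values for p0/s0, copied from the fixture.
C10_ROOT = "b3b4f5deb5304fbb510865d4cb54ac3adc9e0241b28aae06681f7a10f71cb5c2"
p0s0 = "1801777a023ee4acb739a39a5c360fd6b5e6a50a4cc59b727db759246323947f"
p0s0_path = [
    {"side": "R", "hash": "d3d3b641f18937b8c1178a0d43964b3f878687cdc62d014a48a6b92614558999"},
    {"side": "R", "hash": "dbdf1e2438ac7d996ad5d361f472a1ef4626d1e2bb5d51058b597354a6ef0734"},
]
print("p0/s0 proof folds to pinned root:", fold_proof(p0s0, p0s0_path) == C10_ROOT)
```

Hashing is done over the raw 32-byte digests, not their hex strings, matching the concatenation used by merkle_root_and_levels in §11.16.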
11.11 C11 — CRLF mixed
- Input bytes (hex): 46 69 72 73 74 2e 0d 0a 53 74 69 6c 6c 20 70 30 2e 0a 0d 0a 53 65 63 6f 6e 64 20 70 61 72 61 2e 0d 0a
- Canonical form: "First.\nStill p0.\n\nSecond para." (matches the same input encoded with only \n)
- Paragraphs: 2 · Sentences: 3 (p0 has 2; p1 has 1)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | First. | QzExLXNhbHQtYWFhYWFhYQ== | b04b6704c41bd91295ab9c02fca0c5d69324c92609f4d3750077036f7055be61 |
p0/s1 | Still p0. | QzExLXNhbHQtYmJiYmJiYg== | 6e35b2ae0bfee6fe8ff5b2cd88e48a289b3e38970558c940dc849b08f7d2df9b |
p1/s0 | Second para. | QzExLXNhbHQtY2NjY2NjYw== | 1b2237c8e8c8fac65b7af016cb6005a79b735eb25a600e3a652eba0921dfbefb |
Note that the within-paragraph CRLF between First. and Still p0. is normalized to LF (§3.4), and the two lines are then joined with a single space per §4.3 — yielding the segmentation-input string "First. Still p0." for p0. The LF is not present in any sentence's bytes.
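The two steps described in this note can be reproduced on the fixture bytes. This is a minimal sketch covering only line-ending normalization, the trailing-LF strip, and paragraph joining; the function names are illustrative, and the other canonicalization steps (BOM strip, NFC, trailing-whitespace trim) are omitted because this input does not exercise them.

```python
def normalize_newlines(raw: bytes) -> str:
    """CRLF and lone CR become LF; a single trailing LF is dropped."""
    text = raw.decode("utf-8")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text[:-1] if text.endswith("\n") else text


def join_paragraph_lines(canonical: str) -> list[str]:
    """Split on empty lines; join each paragraph's lines with one space."""
    paragraphs, cur = [], []
    for line in canonical.split("\n"):
        if line == "":
            if cur:
                paragraphs.append(" ".join(cur))
            cur = []
        else:
            cur.append(line)
    if cur:
        paragraphs.append(" ".join(cur))
    return paragraphs


crlf_mixed = bytes.fromhex(
    "46 69 72 73 74 2e 0d 0a 53 74 69 6c 6c 20 70 30 2e 0a 0d 0a "
    "53 65 63 6f 6e 64 20 70 61 72 61 2e 0d 0a"
)
lf_only = b"First.\nStill p0.\n\nSecond para.\n"

canonical = normalize_newlines(crlf_mixed)
print(repr(canonical))
print(join_paragraph_lines(canonical))
```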
11.12 C12 — BOM
- Input bytes (hex): ef bb bf 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a
- Canonical form (after §3.2 BOM strip): "Hello world." (identical to the canonical form of the BOM-less input 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a)
- Paragraphs: 1 · Sentences: 1
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | Hello world. | QzEyLXNhbHQtYWFhYWFhYQ== | 2bb0fd5264ba70973a636c65419dfdf47be4956a2cc992b0c2d3d690547356c2 |
Note: this leaf_hash differs from C1's p0/s0 because the salt_b64 differs. The point of the fixture is that the canonical bytes of the sentence are identical to a BOM-less variant — not that the leaf hash is identical (which it would be only if the salt and leaf_id were also identical).
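A short sketch of the BOM strip on the fixture bytes, assuming the input is decoded with the plain "utf-8" codec so that the BOM survives decoding as U+FEFF and must be removed explicitly. `strip_bom` is a hypothetical helper name.

```python
def strip_bom(raw: bytes) -> str:
    """Decode UTF-8 and remove one leading U+FEFF if present."""
    text = raw.decode("utf-8")          # plain "utf-8" keeps the BOM as U+FEFF
    return text[1:] if text.startswith("\ufeff") else text


with_bom = bytes.fromhex("ef bb bf 48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a")
without_bom = bytes.fromhex("48 65 6c 6c 6f 20 77 6f 72 6c 64 2e 0a")

print(strip_bom(with_bom) == strip_bom(without_bom))
```

Only a leading U+FEFF is removed; the sketch leaves any later occurrence in place.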
11.13 C13 — incomplete-sentence paragraph
- Input bytes (hex): 54 68 69 73 20 68 61 73 20 6e 6f 20 74 65 72 6d 69 6e 61 74 6f 72 0a
- Canonical form: "This has no terminator"
- Paragraphs: 1 · Sentences: 1 (paragraph has no terminator codepoints → §5.9 applies; the full paragraph bytes are one sentence with NO synthetic terminator added)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | This has no terminator | QzEzLXNhbHQtYWFhYWFhYQ== | 3efe3081be5681139d706aa809e132cace91a96d31683a1a2842514c836c0c9d |
11.14 C14 — forever-list abbreviation collision (vs.)
- Input bytes (hex): 76 73 2e 20 74 68 65 20 72 65 73 74 2e 20 45 6e 64 2e 0a
- Canonical form: "vs. the rest. End."
- Paragraphs: 1 · Sentences: 2 (the vs. is on the abbreviation list per §5.2 and does not terminate; the next . after rest does; the final . after End terminates the second sentence)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | vs. the rest. | QzE0LXNhbHQtYWFhYWFhYQ== | 8a0016e15703dedd89d512d3f54f2b8d55173ec1e2827503650051c37bb74821 |
p0/s1 | End. | QzE0LXNhbHQtYmJiYmJiYg== | f9a6380a2ffb925556f6247683576697c9660f7adfcdc78b76a8bc4d4cf812e9 |
11.15 C15 — multi-dot abbreviations (e.g., i.e.)
This fixture exercises §5.11 (Decision 25) directly: both e.g and i.e are multi-dot entries in the abbreviation list, and v1's matching algorithm shields all four of the internal-and-trailing dots they introduce. The bug being fixed was that the prior _preceding_word helper split this input into four sentence fragments; the patched _dotted_token_match in §11.16 produces the intended two.
- Input bytes (hex): 48 65 20 73 61 69 64 20 65 2e 67 2e 20 62 65 66 6f 72 65 20 6c 75 6e 63 68 2e 20 54 68 65 6e 20 69 2e 65 2e 20 61 66 74 65 72 2e 0a
- Canonical form: "He said e.g. before lunch. Then i.e. after."
- Paragraphs: 1 · Sentences: 2 (the e.g. and i.e. dotted-tokens are both shielded by §5.11)
leaf_id | value | salt_b64 | leaf_hash |
|---|---|---|---|
p0/s0 | He said e.g. before lunch. | Dw8PDw8PDw8PDw8PDw8PDw== | be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882 |
p0/s1 | Then i.e. after. | 8PDw8PDw8PDw8PDw8PDw8A== | 1a455dea1a4456b1bb6af79022bb3bdd77900367a4a91184904d3c395cb9dcb4 |
The two salts are deliberately constant byte patterns (16 × 0x0F and 16 × 0xF0) so an unaffiliated verifier can reproduce these hashes without a salt-generation step: the raw-bytes preimage is fully determined by the spec.
Full merkle tree. Two leaves, one inner level (the root).
root = 65bc540ee29ee81834460758069e4ed48c2bc05ca8958f11db98ea5b08a060ba = SHA-256( leaf[p0/s0] || leaf[p0/s1] )
- leaf[p0/s0] = be17...9882
- leaf[p0/s1] = 1a45...dcb4
proof_path entries:
- For p0/s0: [{"side":"R","hash":"1a455dea1a4456b1bb6af79022bb3bdd77900367a4a91184904d3c395cb9dcb4"}]
- For p0/s1: [{"side":"L","hash":"be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882"}]
Full preimage breakdown for p0/s0. Reproducing the leaf-hash by hand:
- profile_literal_utf8 (36 B): 73 61 74 73 69 67 6e 61 6c 2e 74 65 78 74 2e 70 61 72 61 67 72 61 70 68 5f 73 65 6e 74 65 6e 63 65 2e 76 31
- separator (1 B): 00
- leaf_id_utf8 (5 B): 70 30 2f 73 30 ("p0/s0")
- separator (1 B): 00
- sentence_value_utf8 (26 B): 48 65 20 73 61 69 64 20 65 2e 67 2e 20 62 65 66 6f 72 65 20 6c 75 6e 63 68 2e ("He said e.g. before lunch.")
- separator (1 B): 00
- salt_bytes (16 B): 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f 0f
Total preimage length: 86 bytes. SHA-256 of those bytes: be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882.
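The hand computation can be scripted. The sketch below assembles the preimage in the field order listed for this fixture and compares the result with the pinned hash; `leaf_preimage` is an illustrative helper name, and the comparison is printed rather than assumed.

```python
import base64
import hashlib

PROFILE = "satsignal.text.paragraph_sentence.v1"
PINNED = "be173bfa3463ca7be346c197d5271764a3320c5ecf32408ee0b3aa1fce279882"


def leaf_preimage(profile: str, leaf_id: str, value: str, salt: bytes) -> bytes:
    """profile, leaf_id and value as UTF-8, each followed by 0x00, then the raw salt."""
    return (
        profile.encode("utf-8") + b"\x00"
        + leaf_id.encode("utf-8") + b"\x00"
        + value.encode("utf-8") + b"\x00"
        + salt
    )


salt = base64.b64decode("Dw8PDw8PDw8PDw8PDw8PDw==")   # decodes to 16 x 0x0F
preimage = leaf_preimage(PROFILE, "p0/s0", "He said e.g. before lunch.", salt)
digest = hashlib.sha256(preimage).hexdigest()

print("preimage length:", len(preimage))
print("matches pinned leaf_hash:", digest == PINNED)
```

Swapping in leaf_id "p0/s1", the second sentence, and the 16 x 0xF0 salt gives the same check for the other leaf.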
Counterfactual. Under the pre-patch _preceding_word helper (now removed from v1), this input would have produced four sentence fragments: "He said e.", "g. before lunch.", "Then i.", "e. after.". The hashes for those four fragments would not match the two hashes pinned here. A verifier that runs the patched §11.16 algorithm reproduces the two hashes above exactly; a verifier that runs the pre-patch algorithm cannot verify a C15 disclosure. This is the single bug §5.11 corrects.
11.16 Reference algorithm (Python)
The hashes in §§11.1–11.15 were produced by the algorithm below. It is normative for the segmentation and hashing rules in §§3–8: any divergence between this code and the prose above is a bug in the prose, not in the code (subject to the spec author retaining final say on rule shape).
import hashlib, unicodedata

PROFILE = "satsignal.text.paragraph_sentence.v1"

ABBREVIATIONS = {
    "Mr","Mrs","Ms","Mx","Dr","Jr","Sr","St","Mt","Ft",
    "vs","etc","e.g","i.e","cf","viz","Inc","Ltd","Co","Corp",
    "LLC","LLP","PLC","GmbH","S.A","Pty","No","Vol","pp","p",
    "ch","sec","fig","eq","Prof","Rev","Hon","Capt","Gen","Lt",
    "Sgt","Maj","Col","Ave","Blvd","Rd","Ln","Ct",
}

TERMINATORS = {".", "?", "!", "…", "。", "?", "!"}

QUOTE_OPEN_TO_CLOSE = {
    '"': '"', "'": "'",
    "“": "”", "‘": "’",
    "«": "»", "「": "」",
}
QUOTE_OPENERS = set(QUOTE_OPEN_TO_CLOSE.keys())
QUOTE_CLOSERS = set(QUOTE_OPEN_TO_CLOSE.values())

# §5.11 (Decision 25): dotted-token boundary set. The forever-frozen
# abbreviation list in §5.2 contains no bracket / quote / whitespace
# character, so widening the boundary set with brackets cannot make a
# previously-recognized abbreviation stop matching.
DOTTED_TOKEN_BREAKS = {"(", ")", "[", "]", "{", "}"}


def canonicalize(raw: bytes) -> str:
    text = raw.decode("utf-8")                              # §3.1
    if text.startswith("\ufeff"):                           # §3.2 (strip BOM)
        text = text[1:]
    text = unicodedata.normalize("NFC", text)               # §3.5
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # §3.4
    lines = [ln.rstrip(" \t") for ln in text.split("\n")]   # §3.6
    text = "\n".join(lines)
    if text.endswith("\n"):                                 # §3.7
        text = text[:-1]
    return text


def split_paragraphs(canonical: str) -> list[str]:
    paragraphs, cur = [], []
    for line in canonical.split("\n"):
        if line == "":
            if cur:
                paragraphs.append(" ".join(cur))            # §4.3
            cur = []
        else:
            cur.append(line)
    if cur:
        paragraphs.append(" ".join(cur))
    return paragraphs


def _is_dotted_token_boundary(ch: str) -> bool:
    """§5.11: whitespace, recognized quote, or ASCII bracket."""
    return (
        ch.isspace()
        or ch in QUOTE_OPENERS
        or ch in QUOTE_CLOSERS
        or ch in DOTTED_TOKEN_BREAKS
    )


def _dotted_token_match(buf: str, idx: int) -> bool:
    """
    §5.11 (Decision 25). Returns True iff the candidate '.' at
    `buf[idx]` is shielded from being a sentence terminator by a
    forever-frozen abbreviation entry.

    The 'dotted-token' surrounding `idx` is the maximal run of
    non-(whitespace | quote | bracket) characters around `idx`,
    delimited by `_is_dotted_token_boundary`. The candidate '.' is
    shielded iff there exists an entry `a` in `ABBREVIATIONS` with

        buf[start : start + len(a)] == a
        AND start <= idx <= start + len(a)

    where `start` is the dotted-token's left boundary. This handles
    BOTH the internal '.' of a multi-dot abbreviation (idx strictly
    inside the entry's span) AND the trailing '.' immediately after
    the entry (idx == start + len(a)).
    """
    # Left boundary of the dotted-token.
    start = idx
    while start > 0 and not _is_dotted_token_boundary(buf[start - 1]):
        start -= 1
    # Right boundary (used only to cap candidate entry length).
    n = len(buf)
    end = idx
    while end < n and not _is_dotted_token_boundary(buf[end]):
        end += 1
    span_left = idx - start
    span_right = end - start
    for a in ABBREVIATIONS:
        L = len(a)
        # Entry must cover idx (L >= span_left) and fit in the
        # dotted-token (L <= span_right).
        if L < span_left or L > span_right:
            continue
        if buf[start:start + L] == a:
            return True
    return False


def segment_sentences(p: str) -> list[str]:
    sents, n = [], len(p)
    i = 0
    while i < n and p[i].isspace():
        i += 1
    start, stack = i, []
    while i < n:
        ch = p[i]
        # quote depth (§5.5)
        if ch in QUOTE_OPENERS and (
            ch not in QUOTE_CLOSERS or not stack or stack[-1] != ch
        ):
            stack.append(QUOTE_OPEN_TO_CLOSE[ch]); i += 1; continue
        if ch in QUOTE_CLOSERS and stack and stack[-1] == ch:
            stack.pop(); i += 1; continue
        if stack:
            i += 1; continue
        if ch not in TERMINATORS:
            i += 1; continue
        terminates = True
        if ch == ".":
            prev_c = p[i - 1] if i > 0 else ""
            next_c = p[i + 1] if i + 1 < n else ""
            if prev_c.isdigit() and next_c.isdigit():       # §5.3 decimal
                terminates = False
            else:
                # §5.4 ellipsis-with-space
                rs = i
                while rs > 0 and p[rs - 1] == ".":
                    rs -= 1
                re = i
                while re + 1 < n and p[re + 1] == ".":
                    re += 1
                if re - rs + 1 >= 3:
                    if i != re:
                        terminates = False
                    else:
                        after = p[re + 1] if re + 1 < n else ""
                        terminates = (after == "" or after.isspace())
                else:
                    # §5.2 / §5.11 (Decision 25): multi-dot aware.
                    if _dotted_token_match(p, i):
                        terminates = False
        if not terminates:
            i += 1; continue
        sents.append(p[start:i + 1])                        # §5.7
        i += 1
        while i < n and p[i].isspace():
            i += 1
        start = i
    if start < n and p[start:n].strip() != "":              # §5.9 / §5.10
        sents.append(p[start:n])
    return sents


def leaf_hash(profile, leaf_id, value, salt_bytes):         # §7
    return hashlib.sha256(
        profile.encode("utf-8") + b"\x00"
        + leaf_id.encode("utf-8") + b"\x00"
        + value.encode("utf-8") + b"\x00"
        + salt_bytes
    ).digest()


def merkle_root_and_levels(leaves):
    levels = [list(leaves)]
    cur = list(leaves)
    while len(cur) > 1:
        nxt = []
        for i in range(0, len(cur), 2):
            if i + 1 < len(cur):
                nxt.append(hashlib.sha256(cur[i] + cur[i + 1]).digest())
            else:
                nxt.append(cur[i])                          # odd: promote
        levels.append(nxt)
        cur = nxt
    return cur[0], levels
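The odd-node promotion in merkle_root_and_levels is easy to miss: an unpaired node is carried up unchanged rather than hashed with a duplicate of itself. The standalone sketch below restates the same loop under the illustrative name `merkle_root` and checks the three-leaf shape (the leaf count C11 produces) using placeholder leaves, not fixture hashes.

```python
import hashlib


def merkle_root(leaves: list[bytes]) -> bytes:
    """Pairwise SHA-256 over digests; an unpaired last node is promoted as-is."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    cur = list(leaves)
    while len(cur) > 1:
        nxt = []
        for i in range(0, len(cur), 2):
            if i + 1 < len(cur):
                nxt.append(hashlib.sha256(cur[i] + cur[i + 1]).digest())
            else:
                nxt.append(cur[i])          # odd count: promote, do not duplicate
        cur = nxt
    return cur[0]


# Placeholder leaves for illustration only; these are not spec fixtures.
a, b, c = (hashlib.sha256(x).digest() for x in (b"leaf-a", b"leaf-b", b"leaf-c"))
expected = hashlib.sha256(hashlib.sha256(a + b).digest() + c).digest()
print(merkle_root([a, b, c]) == expected)
```

Under this rule the third leaf of a three-leaf tree has a one-entry proof path (the hash of the first pair, on the left), while the first two leaves each need two entries.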
12. Out of scope for v1
The following are explicitly out of scope for v1 of this profile. Any of them may motivate a future vN+1 profile under a separate decision record; none can be retrofitted into v1.
- Paragraph-level leaves as a distinct type. Paragraph-level disclosure is achieved by disclosing every sentence in the chosen paragraph. No p<N> leaf type exists in v1.
- Word-level or token-level disclosure. No text.word.v1 or text.token.v1. The leaf granularity of v1 is the sentence.
- Language-specific tokenization. No language_hint field, no ICU dependency, no Burmese / Khmer / Lao word breakers, no Japanese morphological splitter. The terminator set in §5.1 plus the abbreviation list in §5.2 is the whole grammar.
- Markdown / HTML / RTF / PDF / DOCX rendering inside sentences. The profile operates over plaintext only; structural markup is the caller's responsibility to strip before anchoring.
- Footnote / citation extraction. Inline footnote markers ([1], superscripts, ^note) are treated as ordinary sentence bytes; the profile does not split them out, link them across paragraphs, or emit a separate footnote leaf type.
- Sealed-mode disclosure. Per disclosure-v1.md §4 step 5, satsignal.disclosure.v1 covers standard-mode original anchors only (algo: "sha256"). Sealed-mode sentence disclosure is deferred to a future minor of the disclosure spec or a per-profile sealed addendum.
- Capitalization heuristics for an ambiguous ".". §5.6 explicitly rules out using next-word capitalization to decide whether a "." is a terminator. The abbreviation list of §5.2 is the only carve-out.
- Adding to the abbreviation list. Even if a missing abbreviation is identified (e.g. Univ, Eng, Sen), it cannot be added to v1. The remedy is v2.
Questions about this specification? Email hello@satsignal.cloud.