EA-WHITESPACE-01 §4 — Demonstration Appendix
Code and Full Output for Reproducibility
Executed: 2026-07-02
Tokenizer: GPT-2 byte-level BPE (via HuggingFace transformers, model id gpt2)
Companion to: EA-WHITESPACE-01 v0.2 §4 — AXN:03BB.GENERATIVE.∮🪨💎△➕⚖️ (deposit #943, https://alexanarch.org/s/records/943/). Incorporated as Appendix A of the canonical deposit text.
Code
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
# Fragment approximating a naive text-layer extraction of a calligram
# region: ten phrases from Snub-Poemed (AXN:0246), each carrying the
# horizontal indentation of its position in the spatial arrangement.
spatial_fragment = """ do they know beauty?
is it madness
to feel and to know?
the same poet composing
both tragedy and comedy
satyr
I press my scruff-weary beard
to your lips
awkward with longing
snub-nosed"""
# The same phrases linearized — typical PDF/HTML extraction output.
linearized = ("do they know beauty? is it madness to feel and to know? "
"the same poet composing both tragedy and comedy satyr "
"I press my scruff-weary beard to your lips awkward with "
"longing snub-nosed")
for name, text in [("spatial", spatial_fragment), ("linearized", linearized)]:
    ids = tok.encode(text)
    # Report the token count and whether decoding reproduces the input exactly.
    print(name, len(ids), tok.decode(ids) == text)
Results
| representation | input chars | tokens | round-trip identical |
|---|---|---|---|
| spatial (indented) | 343 | 202 | True |
| linearized | — | 44 | True |
Whitespace token count (spatial version): 158 of 202 tokens (78%) are
pure-whitespace tokens (Ġ = space, Ċ = newline).
First 30 token strings of the spatial version:
0–18 'Ġ' × 19 (nineteen consecutive single-space tokens)
19 'Ġdo'
20 'Ġthey'
21 'Ġknow'
22 'Ġbeauty'
23 '?'
24 'Ċ'
25–29 'Ġ' × 5 ... (the next line's indentation begins)
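The whitespace share reported above can be recomputed from token strings alone. What follows is a minimal sketch, assuming the token strings come from `tok.convert_ids_to_tokens(ids)` on the HuggingFace tokenizer and use GPT-2's byte-level markers. The helper names `is_pure_whitespace` and `whitespace_share` are illustrative, and the sample list is only the first 25 tokens listed above, not the full 202-token sequence.

```python
# Sketch: count pure-whitespace tokens in a list of GPT-2 token strings.
# GPT-2's byte-level vocabulary renders a space as 'Ġ' and a newline as 'Ċ'.
WHITESPACE_MARKERS = {"Ġ", "Ċ"}

def is_pure_whitespace(token: str) -> bool:
    """True if the token consists only of space/newline markers."""
    return len(token) > 0 and all(ch in WHITESPACE_MARKERS for ch in token)

def whitespace_share(tokens: list[str]) -> tuple[int, int, float]:
    """Return (whitespace tokens, total tokens, share) for a token list."""
    n_ws = sum(is_pure_whitespace(t) for t in tokens)
    total = len(tokens)
    return n_ws, total, (n_ws / total if total else 0.0)

# Tokens 0-24 as listed above (assumed input; normally obtained via
# tok.convert_ids_to_tokens(tok.encode(spatial_fragment))).
first_tokens = ["Ġ"] * 19 + ["Ġdo", "Ġthey", "Ġknow", "Ġbeauty", "?", "Ċ"]

n_ws, total, share = whitespace_share(first_tokens)
print(n_ws, total, round(share, 2))
```

Applied to the full token list, the same function should return the 158 of 202 figure reported above, provided the fragment's original indentation is intact.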
The three findings (as stated in §4)
- Both representations round-trip perfectly. By the criterion of character preservation, nothing was lost in either case.
- Neither representation contains the calligram. The spatial argument was lost at serialization, before tokenization was applied. The indentation is a one-dimensional shadow (horizontal offset per line) of a two-dimensional arrangement.
- Preserved whitespace is preserved as noise. 78% of the spatial version's token budget is individual whitespace tokens that no training objective attends to. The survival of the characters is what makes the loss invisible: the pipeline's own audit — does it round-trip? — reports success.
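The first two findings can be illustrated without a tokenizer. What follows is a minimal sketch, assuming a lossless UTF-8 encode/decode as a stand-in for the round-trip audit. The three-line sample is hypothetical: its indentation is invented for illustration and is not the layout of Snub-Poemed. `indent_profile`, `linearize`, and `round_trips` are illustrative helper names.

```python
def indent_profile(text: str) -> list[int]:
    """Leading-space count per line: the one-dimensional record of layout."""
    return [len(line) - len(line.lstrip(" ")) for line in text.split("\n")]

def linearize(text: str) -> str:
    """Collapse all whitespace runs to single spaces, as a naive extractor might."""
    return " ".join(text.split())

def round_trips(text: str) -> bool:
    """Character-preservation audit: encode to bytes and decode back."""
    return text.encode("utf-8").decode("utf-8") == text

# Hypothetical sample; the indentation is invented for illustration.
sample = "    to feel\nand\n        to know?"
flat = linearize(sample)

print(indent_profile(sample))  # per-line horizontal offsets
print(indent_profile(flat))    # a single line with no offset
print(round_trips(sample), round_trips(flat))  # the audit passes for both
```

Both strings pass the audit, yet only one retains any offsets, and those offsets record horizontal position per line rather than the two-dimensional arrangement.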
Note on tokenizer choice
cl100k_base (GPT-4 family) was the intended scheme, but its vocabulary file is hosted at an endpoint outside the execution environment's network allowlist. GPT-2's byte-level BPE is the direct ancestor of the cl100k family and is representative of its whitespace behavior for the purposes of this demonstration. cl100k and its successors add multi-space merge tokens (e.g. a single token for a run of spaces), which would reduce the token count of the spatial version without changing any of the three findings: the merged-space tokens are treated as noise just as the single-space tokens are, and the serialization loss precedes the tokenizer entirely. Re-running under cl100k_base is queued for the Whitespace-Provenance Registry pilot (EA-WHITESPACE-01 §9.1).
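The effect of multi-space merge tokens on the count can be approximated from the GPT-2 token strings. What follows is a rough sketch, assuming each run of consecutive single-space tokens collapses to exactly one token. This is a simplifying model for illustration, not cl100k_base's actual merge table, and `collapse_space_runs` is a hypothetical helper. The sample is the first 25 tokens listed in the Results section.

```python
from itertools import groupby

SPACE = "Ġ"  # GPT-2's rendering of a single space

def collapse_space_runs(tokens: list[str]) -> list[str]:
    """Replace each run of consecutive single-space tokens with one merged token."""
    merged = []
    for is_space, run in groupby(tokens, key=lambda t: t == SPACE):
        run = list(run)
        if is_space:
            # One token stands in for the whole run (assumed merge behavior).
            merged.append(SPACE * len(run))
        else:
            merged.extend(run)
    return merged

first_tokens = ["Ġ"] * 19 + ["Ġdo", "Ġthey", "Ġknow", "Ġbeauty", "?", "Ċ"]
merged = collapse_space_runs(first_tokens)
print(len(first_tokens), len(merged))
```

Under this model the token count falls, but each merged token still encodes only the length of a run of spaces, so the point made in the note is unaffected.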