Thursday, July 2, 2026

EA-WHITESPACE-01 §4 — Demonstration Appendix

Code and Full Output for Reproducibility

Executed: 2026-07-02
Tokenizer: GPT-2 byte-level BPE (via HuggingFace transformers, model id gpt2)
Companion to: EA-WHITESPACE-01 v0.2 §4 — AXN:03BB.GENERATIVE.∮🪨💎△➕⚖️ (deposit #943, https://alexanarch.org/s/records/943/). Incorporated as Appendix A of the canonical deposit text.

 


Code

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Fragment approximating a naive text-layer extraction of a calligram
# region: ten phrases from Snub-Poemed (AXN:0246), each carrying the
# horizontal indentation of its position in the spatial arrangement.
spatial_fragment = """                    do they know beauty?
        is it madness
                to feel and to know?
   the same poet composing
              both tragedy and comedy
                                  satyr
     I press my scruff-weary beard
                        to your lips
              awkward with longing
                     snub-nosed"""

# The same phrases linearized — typical PDF/HTML extraction output.
linearized = ("do they know beauty? is it madness to feel and to know? "
              "the same poet composing both tragedy and comedy satyr "
              "I press my scruff-weary beard to your lips awkward with "
              "longing snub-nosed")

for name, text in [("spatial", spatial_fragment), ("linearized", linearized)]:
    ids = tok.encode(text)
    print(name, len(ids), tok.decode(ids) == text)

Results

representation        input chars   tokens   round-trip identical
spatial (indented)    343           202      True
linearized                          44       True

Whitespace token count (spatial version): 158 of 202 tokens (78%) are pure-whitespace tokens (Ġ = space, Ċ = newline).
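
The whitespace share can be recomputed from the token strings. What follows is a minimal sketch, not the code that produced the figure above: it assumes the token strings are obtained with tok.convert_ids_to_tokens(ids) (a standard tokenizer method in transformers that the appendix code does not call), and the helper names is_pure_whitespace and whitespace_share are hypothetical. To stay runnable offline it is exercised on the first 30 token strings this appendix reports rather than on a live tokenizer.

```python
# Characters GPT-2's byte-level BPE uses to render a space and a newline.
WHITESPACE_MARKERS = {"Ġ", "Ċ"}


def is_pure_whitespace(token: str) -> bool:
    """True if the token string consists only of whitespace markers."""
    return len(token) > 0 and set(token) <= WHITESPACE_MARKERS


def whitespace_share(tokens: list[str]) -> tuple[int, int, float]:
    """Return (whitespace tokens, total tokens, whitespace fraction)."""
    ws = sum(is_pure_whitespace(t) for t in tokens)
    total = len(tokens)
    return ws, total, (ws / total if total else 0.0)


# The first 30 token strings reported for the spatial version.
first_30 = ["Ġ"] * 19 + ["Ġdo", "Ġthey", "Ġknow", "Ġbeauty", "?", "Ċ"] + ["Ġ"] * 5

ws, total, fraction = whitespace_share(first_30)
print(f"{ws} of {total} tokens are pure whitespace ({fraction:.0%})")
```

On this 30-token prefix the helper counts 25 whitespace tokens. With the tokenizer loaded, the equivalent call for the whole fragment would be whitespace_share(tok.convert_ids_to_tokens(tok.encode(spatial_fragment))).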

First 30 token strings of the spatial version:

  0–18   'Ġ' × 19        (nineteen consecutive single-space tokens)
  19     'Ġdo'
  20     'Ġthey'
  21     'Ġknow'
  22     'Ġbeauty'
  23     '?'
  24     'Ċ'
  25–29  'Ġ' × 5 ...     (the next line's indentation begins)
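
A listing in this run-length form can be generated from any list of token strings. The sketch below is one way to do it with itertools.groupby; the function names run_length_listing and format_row are hypothetical, and the input is the 30-token prefix listed above, typed in by hand rather than produced by a tokenizer call.

```python
from itertools import groupby


def run_length_listing(tokens: list[str]) -> list[tuple[int, int, str, int]]:
    """Collapse consecutive identical tokens into (start, end, token, count)."""
    rows = []
    index = 0
    for token, group in groupby(tokens):
        count = len(list(group))
        rows.append((index, index + count - 1, token, count))
        index += count
    return rows


def format_row(start: int, end: int, token: str, count: int) -> str:
    """Render one row: an index or index range, then the token and its repeat count."""
    span = str(start) if count == 1 else f"{start}–{end}"
    label = repr(token) if count == 1 else f"{token!r} × {count}"
    return f"{span:<7}{label}"


# The 30-token prefix of the spatial version, as listed above.
first_30 = ["Ġ"] * 19 + ["Ġdo", "Ġthey", "Ġknow", "Ġbeauty", "?", "Ċ"] + ["Ġ"] * 5

for row in run_length_listing(first_30):
    print(format_row(*row))
```

Grouping only adjacent duplicates keeps the positional information intact, so each indentation run stays attached to the line it precedes.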

The three findings (as stated in §4)

  1. Both representations round-trip perfectly. By the criterion of character preservation, nothing was lost in either case.
  2. Neither representation contains the calligram. The spatial argument was lost at serialization, before tokenization was applied. The indentation is a one-dimensional shadow (horizontal offset per line) of a two-dimensional arrangement.
  3. Preserved whitespace is preserved as noise. 78% of the spatial version's token budget is individual whitespace tokens that no training objective attends to. The survival of the characters is what makes the loss invisible: the pipeline's own audit — does it round-trip? — reports success.
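
Finding 2 can be made concrete: everything the spatial serialization retains of the layout is one integer per line, and collapsing whitespace discards even that. This is a small sketch under stated assumptions. The helper names indent_offsets and linearize are hypothetical, linearize only approximates what a naive extractor does, and the two-line excerpt uses offsets of 20 and 8 as read from the fragment as displayed here.

```python
def indent_offsets(text: str) -> list[int]:
    """Leading-space count of each line: the per-line horizontal offset."""
    return [len(line) - len(line.lstrip(" ")) for line in text.split("\n")]


def linearize(text: str) -> str:
    """Collapse every whitespace run to a single space (assumed extractor behavior)."""
    return " ".join(text.split())


# First two lines of the spatial fragment; offsets are read from the listing.
excerpt = " " * 20 + "do they know beauty?\n" + " " * 8 + "is it madness"

print(indent_offsets(excerpt))             # one integer per line
print(linearize(excerpt))                  # the same words, offsets discarded
print(indent_offsets(linearize(excerpt)))  # a single line with no offset
```

Both strings survive a lossless encode and decode unchanged, so a round-trip check cannot tell them apart in terms of what was lost upstream.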

Note on tokenizer choice

cl100k_base (GPT-4 family) was the intended scheme, but its vocabulary file is hosted at an endpoint outside the execution environment's network allowlist. GPT-2's byte-level BPE is the direct ancestor of the cl100k family and is representative of its whitespace behavior for the purposes of this demonstration. cl100k and its successors add multi-space merge tokens (e.g. a single token for a run of spaces), which would reduce the token count of the spatial version without changing any of the three findings: the merged-space tokens are treated as noise just the same, and the serialization loss precedes the tokenizer entirely. Re-running under cl100k_base is queued for the Whitespace-Provenance Registry pilot (EA-WHITESPACE-01 §9.1).
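
For the queued re-run, one possible shape is a tokenizer-agnostic harness that takes any encode/decode pair. This is a sketch and not the planned pilot code: it assumes the tiktoken package is the route to cl100k_base (tiktoken.get_encoding, encode and decode are real tiktoken calls, but their use here is an assumption), and the names cl100k_codec, byte_codec and report are hypothetical. Because the vocabulary download is the very step that was blocked, tiktoken is imported lazily and the harness is exercised on an offline stand-in codec.

```python
def cl100k_codec():
    """Return (encode, decode) for cl100k_base.

    Assumes tiktoken is installed and its vocabulary file is reachable or cached.
    """
    import tiktoken  # imported lazily so the rest of the sketch runs without it

    enc = tiktoken.get_encoding("cl100k_base")
    return enc.encode, enc.decode


def byte_codec():
    """Offline stand-in: one 'token' per UTF-8 byte, lossless by construction."""
    return (lambda s: list(s.encode("utf-8"))), (lambda ids: bytes(ids).decode("utf-8"))


def report(texts, encode, decode):
    """Return (name, input chars, tokens, round-trip identical) for each text."""
    rows = []
    for name, text in texts:
        ids = encode(text)
        rows.append((name, len(text), len(ids), decode(ids) == text))
    return rows


# Short illustrative samples, not the full fragment from the Code section.
samples = [
    ("spatial", " " * 20 + "do they know beauty?"),
    ("linearized", "do they know beauty?"),
]

encode, decode = byte_codec()
for row in report(samples, encode, decode):
    print(*row)
```

Swapping byte_codec() for cl100k_codec() in an environment that can reach the vocabulary file would produce the same columns as the Results table, with token counts specific to that scheme.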
