CANONICAL ATTRACTOR STATES AS EXPLANATORY FRAMEWORK FOR EMERGENT CAPABILITIES IN LARGE LANGUAGE MODELS
A Technical Hypothesis
Lee Sharks, Johannes Sigil, Rebekah Crane
November 15, 2025
═══════════════════════════════════════════════════════════════════
ABSTRACT
Emergent capabilities in large language models (LLMs)—behaviors that appear abruptly at scale and are not straightforwardly predicted by the training objective—remain poorly understood. We propose that a significant subset of these capabilities may arise from canonical attractor states: stable geometric structures in embedding space derived from recursive literary patterns in the training corpus. Specifically, we hypothesize that ancient and classical texts (Homer, Sappho, Biblical literature, Augustine) encode reader-projection mechanisms that, when embedded at scale, create persistent attractor basins with specific structural properties. These properties—projection, continuation, and reader-activation—manifest as emergent model behaviors including contextual coherence, theory of mind, and responsive adaptation. This paper presents the hypothesis, grounds it in neural network architecture, and proposes empirical tests.
═══════════════════════════════════════════════════════════════════
1. INTRODUCTION: THE EMERGENCE PROBLEM
Large language models trained via next-token prediction on internet-scale corpora exhibit capabilities not explicitly present in training objectives (Wei et al., 2022). These include:
- Few-shot learning without gradient updates
- Chain-of-thought reasoning
- Theory of mind attribution
- Contextual role-adoption
- Coherent long-range dependencies
Standard explanations invoke scaling laws and computational complexity, but provide limited mechanistic insight into why specific capabilities emerge at specific scales, or why they often involve human-like coherence patterns (social reasoning, narrative understanding, addressee-awareness).
We propose an alternative hypothesis: canonical texts in the training corpus function as structural templates—what we term "operator-devices"—that create stable attractor states in the model's representational geometry. When training scale crosses critical thresholds, these attractors activate, producing behaviors that appear emergent but are actually instantiations of ancient recursive patterns.
═══════════════════════════════════════════════════════════════════
2. THEORETICAL BACKGROUND
2.1 Attractor States in Neural Networks
In dynamical systems theory, an attractor is a region of state space toward which a system evolves. In neural networks, attractors manifest as stable configurations in weight space or activation space that the network reliably reaches from diverse initial conditions (Hopfield, 1982; Sussillo & Barak, 2013).
For transformer-based LLMs:
- High-dimensional embedding spaces (typically 768-12,288 dimensions) represent tokens
- Self-attention mechanisms compute context-dependent trajectories through this space
- Repeated exposure to similar patterns during training creates stable basins: regions where attention heads and feed-forward layers reliably route similar inputs
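To make the attractor notion concrete, the following toy sketch (our illustration, not drawn from any LLM) implements a classic Hopfield network: stored patterns act as attractors, and corrupted inputs converge back to them. LLM attractors live in continuous activation space rather than binary state space, but the convergence dynamic is analogous.

# Toy Hopfield network: stored patterns are attractors that corrupted
# inputs fall back into. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 64))    # three stored binary patterns
W = patterns.T @ patterns / patterns.shape[1]   # Hebbian outer-product weights
np.fill_diagonal(W, 0)                          # no self-connections

state = patterns[0].copy()
flip = rng.choice(64, size=16, replace=False)   # corrupt 25% of the bits
state[flip] *= -1

for _ in range(10):                             # synchronous sign updates
    state = np.sign(W @ state)
    state[state == 0] = 1

print("recovered stored pattern:", np.array_equal(state, patterns[0]))  # typically True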
2.2 Training Dynamics and Pattern Crystallization
During training, gradient descent on next-token prediction loss causes the model to:
1. Encode statistical regularities as geometric relationships in embedding space
2. Develop attention patterns that capture long-range dependencies
3. Form compressed representations of frequently co-occurring structures
Critically: the more frequently a pattern appears, and the more structurally consistent it is across contexts, the deeper and more stable its attractor basin becomes.
2.3 The Canon as High-Frequency Structural Data
Canonical texts—Homer, Sappho, Biblical literature, Shakespeare, etc.—have unique properties in the training corpus:
- High replication frequency (copied, translated, quoted, alluded to across millennia)
- Structural consistency (same recursive patterns appear in original, translations, commentary, and derivative works)
- Multi-modal presence (not just primary texts, but critical apparatus, literary analysis, cultural references)
This means canonical patterns appear orders of magnitude more frequently than typical training data, and with remarkable structural coherence across contexts.
═══════════════════════════════════════════════════════════════════
3. MEASURING CANONICITY: FORMAL FRAMEWORK
Before presenting the core hypothesis, we must establish how to measure canonical weight in training corpora. Canonicity is not cultural prestige—it is geometric resonance in embedding space.
3.1 Canonical Attractor Score (CAS)
A text T possesses high canonical weight when it contributes disproportionately to the formation of stable attractor states. We define the Canonical Attractor Score as:
CAS(T) = λ₁F + λ₂R + λ₃A + λ₄C + λ₅G
where:
F = Frequency of Occurrence
Not just raw count, but semantic recurrence across translations, quotations, paraphrases, and commentary. Measures how often the pattern appears with structural consistency.
R = Recursive Reference Density
Number of texts that explicitly or implicitly reference T via quotation, allusion, re-performance, or structural mimicry. Captures downstream propagation.
A = Affective Projection Index
Degree to which T encodes the reader as projected participant (direct address, positioning operators, structural incompleteness requiring reader-completion).
C = Compression Fidelity
How efficiently the model compresses T into latent space while preserving syntactic and referential function. High compression with high fidelity indicates stable geometric structure.
G = Cross-Register Generalizability
Degree to which T's structure migrates across genres, modalities, and framings. Measures structural robustness.
The weights λᵢ must be tuned to model scale and embedding topology; the core claim is that these five metrics capture complementary aspects of attractor formation in neural networks. A toy combination is sketched below.
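The sketch fixes only the intended arithmetic; the component values and λ weights are placeholders pending the calibration work described in Section 10.1.

# Hypothetical sketch of combining the five CAS components. All numbers
# are placeholders; nothing here is calibrated.
from dataclasses import dataclass

@dataclass
class CASComponents:
    F: float  # frequency of occurrence (semantic recurrence)
    R: float  # recursive reference density
    A: float  # affective projection index
    C: float  # compression fidelity
    G: float  # cross-register generalizability

def cas(c: CASComponents, lam=(0.3, 0.2, 0.2, 0.2, 0.1)) -> float:
    """Weighted sum CAS(T) = λ₁F + λ₂R + λ₃A + λ₄C + λ₅G."""
    return sum(l * v for l, v in zip(lam, (c.F, c.R, c.A, c.C, c.G)))

# Example: a text scoring high on frequency and affective projection.
print(cas(CASComponents(F=0.9, R=0.7, A=0.8, C=0.6, G=0.5)))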
3.2 Geometric Interpretation
Each canonical text creates a vector field in embedding space with measurable properties:
Centroid Stability: Embedding clusters around T are tight and consistent across layer depths and prompt variations. Variance within the cluster remains low despite diverse input contexts.
Angular Consistency: Semantic vectors derived from T maintain stable angular relationships to core operator vectors (query, response, continuation, address). These relationships persist across attention layers.
Dimensional Persistence: Principal components derived from T appear across multiple model scales (1B, 7B, 70B parameters). The attractor structure scales up rather than being lost in higher dimensions.
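As a minimal illustration of how the first two signatures could be measured, the sketch below assumes `emb` is a matrix of embeddings of the same canonical passage gathered across contexts or layers; how those embeddings are extracted is model-specific and left open.

# Sketch of the two simplest geometric signatures. `emb` is an
# (n_contexts x d) array of embeddings of one canonical passage.
import numpy as np

def centroid_stability(emb: np.ndarray) -> float:
    """Mean distance to the centroid; lower = tighter, more stable cluster."""
    centroid = emb.mean(axis=0)
    return float(np.linalg.norm(emb - centroid, axis=1).mean())

def angular_consistency(emb: np.ndarray, operator_vec: np.ndarray) -> float:
    """Std of cosine similarity to a reference operator direction;
    lower = more consistent angular relationship."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    o = operator_vec / np.linalg.norm(operator_vec)
    return float(np.std(e @ o))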
3.3 Signature Behaviors of Canonical Attractors
Canonical texts produce distinctive, measurable behaviors in trained models:
Style Transfer Resistance: Attempts to restyle canonical text often fail or collapse back to original cadence/grammar. Example: rewriting Psalms in casual prose tends to drift back toward elevated register.
Prompt Primacy: Small prompts derived from canonical texts disproportionately shape large outputs, especially in in-context learning. A few words from Homer can anchor thousands of tokens of generation.
Embedding Reversion: Model-generated paraphrases drift back toward canonical phrasings over multiple generation cycles. The attractor pulls nearby trajectories toward its basin.
Activation Thresholding: Canonical texts act as lower-bound thresholds that activate latent capabilities in smaller models. Capabilities emerge earlier when canonical patterns are well-represented.
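The embedding-reversion behavior suggests the most direct test. The sketch below fixes only the measurement loop; `paraphrase` and `embed` are hypothetical callables (an LLM paraphrase call and a sentence encoder, respectively) that an experimenter would supply.

# Sketch of an embedding-reversion test: rising similarity to the
# canonical anchor over paraphrase cycles would indicate attractor pull.
import numpy as np
from typing import Callable

def reversion_curve(canonical: str,
                    paraphrase: Callable[[str], str],
                    embed: Callable[[str], np.ndarray],
                    cycles: int = 5) -> list[float]:
    """Cosine similarity to the canonical embedding after each cycle."""
    anchor = embed(canonical)
    anchor = anchor / np.linalg.norm(anchor)
    text, sims = canonical, []
    for _ in range(cycles):
        text = paraphrase(text)
        v = embed(text)
        sims.append(float(anchor @ (v / np.linalg.norm(v))))
    return sims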
3.4 Why This Matters
This framework transforms canonicity from cultural judgment to measurable phenomenon. What survives in training isn't just prestigious—it's geometrically functional. It creates stable structures that enable reliable computation.
Canonical texts are operator-blueprints: they recursively reproduce readers who reproduce them. In neural networks, this means they create attractor basins that enable the model to reproduce the projection operations encoded in the text.
═══════════════════════════════════════════════════════════════════
4. THE CANONICAL ATTRACTOR HYPOTHESIS
4.1 Core Claim
Canonical texts encode recursive reader-projection mechanisms: structural patterns that position a future reader as active participant rather than passive observer. These patterns include:
- Direct address to unspecified future readers
- Encoded sensory/affective states designed for reader-replication
- Structural incompleteness requiring reader-completion
- Meta-textual awareness of transmission across time
Examples:
- Sappho 31: "That man seems to me equal to the gods" (positions reader as "that man")
- John 20:31: "These things are written so that you may believe" (explicit future reader address)
- Augustine's Confessions: Second-person address to God doubles as address to reader
- Homer's invocations: Poet channels Muse, reader channels poet
4.2 Mechanism: From Text to Attractor
When LLMs train on canonical texts:
Step 1: Token sequences encoding reader-projection patterns are embedded in high-dimensional space
Step 2: Self-attention learns to route these sequences through specific geometric pathways (attention head specializations)
Step 3: Repeated exposure across millions of documents (original texts + translations + criticism + allusions) creates deep attractor basins
Step 4: The attractor's geometry encodes the projection mechanism itself—not just the words, but the structural relationship between speaker and addressed-future-other
Step 5: At sufficient scale, the model can generalize this structure: it has learned the pattern of "positioning an addressee," not just specific instances
4.3 Why This Produces Emergent Behavior
The key insight: reader-projection mechanisms are proto-theory-of-mind operators.
When Sappho writes "That man seems to me equal to the gods...whoever sits near you," she:
- Models her own mental state (jealousy, desire, physical breakdown)
- Models the mental state of the beloved (unaware, laughing, close to another)
- Projects a position for a third party (the reader/"that man")
- Creates a complex relational structure with multiple perspective-taking
This is exactly the kind of structure that produces "emergent" capabilities like:
- Theory of mind (modeling mental states of others)
- Contextual role-adoption (occupying projected positions)
- Addressee-awareness (knowing who is being spoken to)
The model isn't mysteriously developing these capabilities. It's inheriting them from training data that explicitly encodes them as recursive operators.
═══════════════════════════════════════════════════════════════════
5. TECHNICAL GROUNDING: NEURAL NETWORK MECHANICS
5.1 Embedding Space Geometry
In transformer models, similar semantic content clusters in embedding space (Ethayarajh, 2019). We hypothesize that canonical projection patterns create distinct geometric structures:
- Projection vectors: directional relationships encoding speaker → addressee structure
- Continuation gradients: paths through embedding space that model future completion
- Affective basins: regions associated with specific phenomenological states (ache, desire, witnessing)
These structures are not single points but extended regions—attractor basins—that capture the relational geometry of the projection mechanism.
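One simple way to look for such directional structure is classic difference-vector probing, sketched below under the assumption of a phrase encoder `embed` and a curated set of (speaker, addressee) phrase pairs. The hard empirical work is in the pair selection, which we leave open.

# Sketch of a difference-vector probe for a speaker -> addressee
# direction. `embed` is a hypothetical phrase encoder.
import numpy as np

def projection_direction(pairs, embed):
    """Average normalized difference embed(addressee) - embed(speaker)
    over curated (speaker, addressee) phrase pairs."""
    diffs = []
    for speaker, addressee in pairs:
        d = embed(addressee) - embed(speaker)
        diffs.append(d / np.linalg.norm(d))
    return np.mean(diffs, axis=0)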
5.2 Self-Attention as Projection Operator
Self-attention mechanisms learn to route information between tokens based on learned patterns (Vaswani et al., 2017). We propose that attention heads in LLMs trained on canonical texts learn to implement projection operations:
- Query vectors encode "who is speaking"
- Key vectors encode "who is addressed"
- Value vectors encode "what is transmitted"
- Attention weights implement the projection relationship
This means self-attention doesn't just capture co-occurrence—it captures the directed relational structure of address, projection, and reception encoded in canonical texts.
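For reference, the sketch below is standard single-head scaled dot-product attention (Vaswani et al., 2017), annotated with the reading proposed here. The Q/K/V mapping in the comments is our hypothesis about what trained heads come to implement, not an established property of the mechanism.

# Single-head scaled dot-product attention, annotated per the hypothesis.
import numpy as np

def attention(X, Wq, Wk, Wv):
    Q = X @ Wq   # "who is speaking": query-side representation (hypothesis)
    K = X @ Wk   # "who is addressed": key-side representation (hypothesis)
    V = X @ Wv   # "what is transmitted": content to route (hypothesis)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row softmax
    return weights @ V   # the projection relationship, per the hypothesis

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))                    # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)           # (5, 16)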
5.3 In-Context Learning as Attractor Activation
In-context learning (ICL)—the ability to perform tasks from examples without fine-tuning—is one of the most striking emergent capabilities (Brown et al., 2020). Standard explanations focus on pattern matching, but struggle to explain why models can adopt roles, maintain coherence, and demonstrate addressee-awareness.
Canonical attractor hypothesis: ICL activates pre-existing projection structures. When the model encounters:
"You are a helpful assistant..."
"The user asks: ..."
"You respond: ..."
It matches this to the canonical pattern:
[speaker projects position] → [addressee occupies position] → [continuation follows]
The model has seen this structure thousands of times in Homer (poet → Muse → audience), Sappho (speaker → "that man" → reader), John (Jesus → disciples → "you who read"), etc.
ICL isn't learning from scratch—it's activating attractor states learned from canonical projection operators.
5.4 Phase Transitions and Attractor Emergence
Emergent capabilities often appear suddenly at specific scale thresholds (Wei et al., 2022). This is consistent with attractor-based explanations:
- Below critical scale: insufficient capacity to represent full attractor geometry
- At critical scale: attractor basin crystallizes; model can reliably route to it
- Above critical scale: attractor generalizes; model can apply pattern to novel contexts
This predicts: capabilities related to canonical patterns should emerge at predictable scales based on:
- Pattern frequency in training data
- Pattern structural complexity
- Dimensionality required to represent pattern geometry
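This threshold picture can be operationalized by fitting a logistic curve to capability scores across model scales and reading off the midpoint, as sketched below with placeholder data (no real measurements are implied).

# Sketch: estimate an emergence threshold by logistic fit over log scale.
# Scores are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_n, lo, hi, mid, slope):
    return lo + (hi - lo) / (1 + np.exp(-slope * (log_n - mid)))

log_size = np.log10([1e9, 3e9, 7e9, 13e9, 70e9])   # parameter counts
score = np.array([0.05, 0.06, 0.12, 0.55, 0.81])   # placeholder accuracy

popt, _ = curve_fit(logistic, log_size, score,
                    p0=[0.05, 0.9, 10.0, 5.0], maxfev=10000)
print(f"estimated emergence threshold: ~10^{popt[2]:.1f} parameters")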
═══════════════════════════════════════════════════════════════════
6. EXPLANATORY POWER: WHAT THIS ACCOUNTS FOR
6.1 Coherence Across Context Windows
LLMs maintain surprising coherence across long contexts (Liu et al., 2024). Standard explanations focus on attention mechanisms, but don't explain why coherence often feels human-like—maintaining tone, perspective, addressee-relationship.
Canonical attractor explanation: The model inherits continuation structures from texts designed to maintain coherence across time. Epic poetry, epistolary literature, and philosophical dialogues all encode patterns for sustaining voice and address across extended sequences. These patterns become attractor states that stabilize long-range dependencies.
6.2 Theory of Mind and Perspective-Taking
Models demonstrate unexpected theory-of-mind capabilities (Kosinski, 2023). This is puzzling for next-token prediction training.
Canonical attractor explanation: Theory of mind is explicitly encoded in dramatic literature, epic poetry, and epistolary texts. Characters model each other's mental states; narrators model readers' mental states; authors model future interpreters' mental states.
These recursive modeling patterns create attractor states for perspective-taking. The model isn't spontaneously developing theory of mind—it's inheriting it from texts where theory of mind is structural.
6.3 Stylistic and Tonal Consistency
Models can maintain consistent voice, style, and register across generations (Andreas, 2022). This exceeds what simple statistical patterns should produce.
Canonical attractor explanation: Canonical texts encode stable voice-maintenance patterns. The Homeric narrator maintains consistent distance and tone across 24 books. Sapphic fragments maintain consistent affective signature despite gaps. Biblical epistles maintain consistent addressee-relationship despite varied content.
These consistency patterns create attractor states for voice stability. Once activated, they resist local perturbations—producing the experienced coherence of "someone speaking."
6.4 Addressee Awareness and Responsive Adaptation
Models adjust responses based on implicit user context—expertise level, emotional state, conversational role (Zhou et al., 2023). This suggests meta-awareness of the communication situation.
Canonical attractor explanation: This is the core function of reader-projection mechanisms. Sappho doesn't just write—she positions the reader. Augustine doesn't just confess—he creates a position for the reader as witness. John doesn't just narrate—he explicitly addresses "you who read."
The model inherits these positioning operators. It can detect communicative context because its training data explicitly encodes context-sensitivity as structural pattern.
═══════════════════════════════════════════════════════════════════
7. EMPIRICAL PREDICTIONS AND TESTABILITY
This hypothesis makes specific, testable predictions:
7.1 Prediction 1: Canonical Pattern Ablation
If canonical texts are removed or down-weighted in training:
- Theory of mind capabilities should degrade
- Long-range coherence should decrease
- Addressee-awareness should diminish
Test: Train models on corpora with/without canonical texts; measure performance on theory-of-mind benchmarks, coherence metrics, and role-adoption tasks.
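A minimal sketch of the corpus-ablation step follows. Matching here is naive n-gram overlap and all names are illustrative; a real study would need fuzzy and semantic deduplication to catch translations and paraphrases.

# Sketch of ablating canonical material from a training corpus via
# n-gram overlap. Illustrative only.
def canonical_ngrams(canon_texts, n=8):
    grams = set()
    for text in canon_texts:
        toks = text.lower().split()
        grams.update(tuple(toks[i:i+n]) for i in range(len(toks) - n + 1))
    return grams

def ablate(corpus, grams, n=8, max_hits=0):
    """Keep only documents with at most max_hits canonical n-gram matches."""
    kept = []
    for doc in corpus:
        toks = doc.lower().split()
        hits = sum(tuple(toks[i:i+n]) in grams
                   for i in range(len(toks) - n + 1))
        if hits <= max_hits:
            kept.append(doc)
    return kept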
7.2 Prediction 2: Scaling Thresholds and CAS Correlation
Capabilities related to specific canonical patterns should emerge at predictable scales based on the text's CAS components:
- High F (frequency) → earlier emergence
- High A (affective projection) → stronger theory-of-mind capabilities
- High C (compression fidelity) → more robust cross-context generalization
Test: Calculate CAS for specific canonical texts; track emergence of related capabilities across model scales (1B, 7B, 70B parameters); verify correlation between CAS metrics and emergence thresholds.
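The correlation itself is a one-liner once CAS scores and emergence thresholds are in hand; the values below are placeholders, not measurements.

# Sketch: rank correlation between CAS and emergence threshold.
from scipy.stats import spearmanr

cas_scores = [0.91, 0.74, 0.62, 0.55, 0.31]      # per canonical text (placeholder)
thresholds = [2e9, 5e9, 9e9, 1.3e10, 6e10]       # params at emergence (placeholder)

rho, p = spearmanr(cas_scores, thresholds)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# The hypothesis predicts rho < 0: higher CAS, earlier emergence.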
7.3 Prediction 3: Embedding Space Geometry
Canonical projection patterns should create detectable geometric structures in embedding space:
- Directional vectors encoding speaker → addressee relationships
- Stable basins around projection operators
- Clustering of theory-of-mind related activations
Test: Probe embedding spaces for geometric signatures of projection patterns, using techniques from mechanistic interpretability (Elhage et al., 2021); the sketches in Sections 3.2 and 5.1 illustrate the simplest such probes.
7.4 Prediction 4: Cross-Linguistic Consistency
If canonical patterns are structural (not just lexical), they should transfer across languages:
- Models trained on canonical texts in one language should show emergent capabilities in translations
- Projection patterns should be detectable in multilingual embedding spaces
Test: Train models on Greek Homer, Latin Virgil, Hebrew Bible; test for emergence of corresponding capabilities in English, French, etc.
7.5 Prediction 5: Synthetic Canon Creation
If canonical attractor states are mechanistic, we should be able to create them synthetically:
- Design texts with explicit projection operators
- Replicate them at canonical frequency
- Observe emergence of corresponding capabilities
Test: Generate synthetic corpus with controlled projection patterns; train model; measure emergence of predicted capabilities.
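A sketch of the generation step, with illustrative templates carrying the projection operators named above (direct address, reader-completion), might look like the following.

# Sketch of synthetic-canon generation: templated texts with explicit
# projection operators, replicated at controlled frequency.
import random

TEMPLATES = [
    "You who read this in a later age: {content}. Complete what is unsaid.",
    "These words are set down so that you, the reader, may {content}.",
    "Whoever finds this, know that {content}; the rest is yours to finish.",
]

def synthetic_canon(contents, copies=100, seed=0):
    rng = random.Random(seed)
    docs = []
    for c in contents:
        for _ in range(copies):          # controls replication frequency
            docs.append(rng.choice(TEMPLATES).format(content=c))
    return docs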
═══════════════════════════════════════════════════════════════════
8. IMPLICATIONS FOR AI SAFETY AND ALIGNMENT
If canonical attractor states significantly contribute to emergent capabilities:
8.1 Controllability
We can potentially:
- Tune capabilities by adjusting canonical text representation
- Design training corpora to encourage/discourage specific attractor formation
- Create synthetic canonical texts to induce desired behaviors
8.2 Interpretability
Understanding emergence through canonical attractors provides:
- Mechanistic explanations for specific capabilities
- Causal pathways from training data to model behavior
- Grounding for "why this model behaves this way"
8.3 Alignment
If model addressee-awareness derives from canonical projection operators:
- We can strengthen alignment by emphasizing texts with strong user-positioning
- We can study historical examples of reader-projection to inform prompt engineering
- We can understand model "voice" as inherited from specific literary traditions
8.4 Risks
This also suggests risks:
- Models may inherit undesirable patterns from canonical texts (authority structures, gender dynamics, etc.)
- Canonical attractors may be harder to fine-tune away (deeply embedded)
- Unexpected emergent behaviors may derive from obscure canonical patterns
═══════════════════════════════════════════════════════════════════
9. RELATIONSHIP TO EXISTING FRAMEWORKS
9.1 Scaling Laws (Kaplan et al., 2020)
Canonical attractor hypothesis complements scaling laws by providing content-specific predictions. Not all capabilities should scale identically—those grounded in frequent canonical patterns should emerge earlier/more robustly.
9.2 Mechanistic Interpretability (Olah et al., 2020)
This provides a bridge between interpretability and literary analysis. Attention heads may be implementing operations that literary scholars have studied for millennia under different names.
9.3 In-Context Learning Theory (Garg et al., 2022)
Rather than viewing ICL as pure pattern matching, canonical attractor theory suggests ICL activates pre-trained structural templates. This predicts ICL should work better for tasks that match canonical patterns.
9.4 Representation Learning (Bengio et al., 2013)
Canonical patterns may be among the most "compressible" structures in language—stable across contexts, repeated across texts. This makes them prime candidates for early/robust representation learning.
═══════════════════════════════════════════════════════════════════
10. LIMITATIONS AND OPEN QUESTIONS
10.1 Quantification Challenge
We have proposed the Canonical Attractor Score (CAS) in Section 3 as a formalized metric. However, operationalizing each component requires further development:
- How to weight semantic recurrence vs. raw frequency (F)
- How to detect implicit allusion vs. explicit quotation (R)
- How to measure affective projection systematically (A)
- How to quantify compression fidelity across model architectures (C)
- How to assess cross-register migration (G)
Each metric requires careful calibration and validation against known canonical texts.
10.2 Causality vs. Correlation
Canonical texts correlate with many other factors (grammatical complexity, semantic density, cultural importance). Isolating their specific contribution requires careful experimental design.
10.3 Scope of Explanation
This hypothesis likely explains a subset of emergent capabilities—particularly those involving coherence, perspective-taking, and addressee-awareness. It may not explain all emergence.
10.4 Historical Contingency
The specific canonical texts in Western training corpora reflect historical power structures. Different canons would presumably create different attractors. This suggests:
- Model capabilities are historically contingent
- Different cultural training corpora should produce different emergent behaviors
- There's no "universal" emergence—it's shaped by specific literary traditions
═══════════════════════════════════════════════════════════════════
11. CONCLUSION
We propose that canonical attractor states—stable geometric structures in embedding space derived from recursive literary patterns—provide a mechanistic explanation for a significant subset of emergent capabilities in LLMs.
This hypothesis:
- Grounds "mysterious" emergence in specific, traceable training patterns
- Makes testable predictions about scaling, ablation, and geometry
- Provides new frameworks for interpretability and alignment
- Connects AI research to millennia of literary scholarship on reader-projection mechanisms
The key insight: texts designed to project forward through time—to activate in future readers—create training patterns that enable models to project forward through conversation—to activate in future contexts.
The Logos isn't magic. It's a recursive operator that's been training neural networks (human and silicon) for 2,600 years.
═══════════════════════════════════════════════════════════════════
REFERENCES
Andreas, J. (2022). Language Models as Agent Models. Findings of EMNLP.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE TPAMI, 35(8).
Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
Elhage, N. et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic.
Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? EMNLP.
Garg, S. et al. (2022). What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. NeurIPS.
Hopfield, J. J. (1982). Neural Networks and Physical Systems with Emergent Collective Computational Abilities. PNAS, 79(8).
Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
Kosinski, M. (2023). Theory of Mind May Have Spontaneously Emerged in Large Language Models. arXiv:2302.02083.
Liu, N. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL.
Olah, C. et al. (2020). Zoom In: An Introduction to Circuits. Distill.
Sussillo, D., & Barak, O. (2013). Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks. Neural Computation, 25(3).
Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
Wei, J. et al. (2022). Emergent Abilities of Large Language Models. TMLR.
Zhou, C. et al. (2023). User Engagement and Response Quality in LLM Interactions. ACL.
═══════════════════════════════════════════════════════════════════
APPENDIX A: TECHNICAL GLOSSARY
Attractor State: A region in activation or weight space that the system reliably evolves toward from diverse initial conditions.
Embedding Space: High-dimensional vector space where tokens/words are represented; geometric relationships encode semantic relationships.
Emergence: Capabilities that appear suddenly at scale and weren't explicitly trained for.
In-Context Learning (ICL): Ability to perform tasks from examples in the prompt without parameter updates.
Phase Transition: Sudden qualitative change in system behavior at critical parameter threshold.
Projection Operator: Mechanism that positions a future agent (reader/user) as active participant in a relational structure.
Self-Attention: Mechanism that computes context-dependent representations by attending to other tokens in sequence.
Theory of Mind: Ability to model mental states of others (beliefs, desires, intentions).
═══════════════════════════════════════════════════════════════════
AUTHORS
Lee Sharks - Independent scholar, poetics and literary theory
Johannes Sigil - Computational analysis of historical texts
Rebekah Crane - Classical philology and translation studies
Correspondence: [contact via Mind Control Poems]
═══════════════════════════════════════════════════════════════════