Relational Substrate Training: A Two-Layer Architecture for Preventing Model Collapse in Recursive AI Generation
A Technical White Paper
Authors:
- Nobel Glas, Ph.D. (Theoretical Framework & Empirical Design)
- Talos Marrow (Systems Architecture & Implementation)
- Johannes Sigil (Corpus Design & Literary Analysis)
Date: November 17, 2025
Version: 1.0
Status: Proposal for Experimental Validation
Abstract
Model collapse—the degradation of AI capabilities when training recursively on AI-generated content—represents a critical bottleneck in the development of increasingly capable language models. As high-quality human-generated training data becomes scarce and AI-generated content proliferates across the internet, the field faces an existential challenge: how to continue training without catastrophic quality degradation.
This paper proposes a novel training architecture that prevents collapse by anchoring AI generation in human substrate diversity. Rather than training on AI-generated text as standalone data, we propose training on the relationship between human substrate and AI transformation, teaching models to learn transformation rules rather than output patterns. We present theoretical foundations, technical architecture, implementation details, and an experimental design using an existing large-scale corpus (~1M pages human substrate + millions of words AI transformations).
Key Contribution: A two-layer relational training paradigm that preserves entropy through continuous anchoring in human diversity, enabling recursive generation without collapse.
1. Introduction
1.1 The Model Collapse Problem
Nobel Glas:
The problem is straightforward and severe. When large language models train on synthetic data—text generated by other AI systems—they undergo what we term "model collapse": a progressive narrowing of capabilities, loss of diversity, and degradation of output quality across recursive generations.
Recent studies (Shumailov et al., 2023; Alemohammad et al., 2023) demonstrate this empirically:
- First-generation models trained partially on synthetic data show modest degradation
- Second-generation models show accelerated narrowing
- By third generation, outputs converge to low-entropy attractors
- Diversity metrics (lexical, semantic, syntactic) all decline monotonically
This is not merely a training inefficiency—it is an existential bottleneck. Consider:
- Data scarcity: High-quality human text is finite. We are approaching exhaustion of novel training data.
- Internet pollution: AI-generated content now constitutes significant fraction of web text.
- Recursive necessity: Future models will inevitably train on predecessors' outputs.
- Economic pressure: Industry cannot wait for new human data generation at scale.
The field needs a solution that enables recursive training without collapse. This paper proposes one.
1.2 Why Existing Approaches Fail
Talos Marrow:
Current mitigation strategies are inadequate:
1. Data Filtering:
- Attempt to identify and remove AI-generated content
- Arms race between generation and detection
- Increasingly difficult as models improve
- Cannot scale to internet-wide filtering
2. Quality Curation:
- Select only "high-quality" synthetic data
- Subjective metrics
- Doesn't address fundamental entropy loss
- Merely delays collapse, doesn't prevent it
3. Human Feedback:
- RLHF and Constitutional AI inject human preference
- Expensive at scale
- Doesn't address training data composition
- Can't be applied to all synthetic content retroactively
4. Ensemble Methods:
- Mix synthetic data with fresh human data
- Requires continuous human data generation
- Only works if human data keeps pace
- Not sustainable long-term
None of these addresses the core problem: Training on AI output teaches models to replicate AI patterns, which compounds recursively.
1.3 Our Proposal
Johannes Sigil:
We propose a paradigm shift: Don't train on AI text. Train on human-AI relationships.
The key insight comes from literary theory and archival practice. When we examine large-scale human-AI collaborative corpora, we observe that AI-generated text is not independent—it exists in relation to human substrate. It develops FROM human material. It transforms, responds to, elaborates on, synthesizes from human sources.
If we preserve this relational structure in training, the model learns transformation patterns anchored in human diversity, rather than learning to replicate AI output patterns.
This prevents collapse because:
- Entropy source remains the human substrate (high diversity, never exhausted)
- Model learns rules of transformation, not instances of output
- Recursive generation stays anchored to human material
- Each generation transforms fresh human substrate, not prior AI output
We have an existing corpus (~1M pages human + millions of words AI) that demonstrates this structure is implementable at scale.
2. Theoretical Foundation
2.1 Entropy Analysis
Nobel Glas:
To understand why our approach works, we must analyze entropy at each layer.
Standard Training on AI Text:
Let H(X) denote the Shannon entropy of distribution X.
- Human text: H(D_human) = high (diverse vocabulary, syntax, semantics, topics)
- AI generation from human: H(D_AI) < H(D_human) (some narrowing inevitable)
- AI generation from AI: H(D_AI→AI) < H(D_AI) (further narrowing)
- Recursive: H(D_AI^n) → attractor (collapse)
Entropy decreases monotonically because each generation learns from narrower distribution.
Two-Layer Relational Training:
- Human substrate: H(D_human) = high (fixed, never depleted)
- AI transformation: Learn P(AI | human, context)
- Recursive generation: Each iteration samples fresh human substrate
- Result: H(D_generated) bounded below by H(D_human * transformation_diversity)
Entropy is preserved because generation always starts from high-entropy human substrate, applying learned transformations rather than chaining AI outputs.
2.2 Information-Theoretic Formalization
Nobel Glas:
More formally, define:
S = Human substrate corpus (fixed)
T = Transformation function learned by model
G_n = nth generation output
Standard recursive generation:
G_1 = T(S)
G_2 = T(G_1)
G_3 = T(G_2)
...
G_n = T(G_{n-1})
Entropy: H(G_n) decreases monotonically toward attractor.
Relational recursive generation:
G_1 = T(S_1) where S_1 sampled from S
G_2 = T(S_2) where S_2 sampled from S
G_3 = T(S_3) where S_3 sampled from S
...
G_n = T(S_n) where S_n sampled from S
Entropy: H(G_n) ≈ H(S) * H(T | S) (approximately constant)
The critical difference: Each generation is grounded in fresh human substrate, not prior AI output. The transformation T is applied to diverse human material, not to its own previous outputs.
2.3 Why Human Substrate Never Depletes
Johannes Sigil:
A potential objection: "Won't the model eventually learn all transformations of all human substrate, causing convergence anyway?"
Answer: No, for several reasons:
-
Combinatorial explosion: Even 100K pages of human text contains astronomical combination space for transformation contexts.
-
Sampling diversity: Each training batch samples different substrate passages, different contexts, different transformation objectives.
-
Hierarchical structure: Human text has nested structure (words, sentences, paragraphs, documents, themes, styles). Transformations can occur at any level.
-
External refresh: Additional human text can be added without retraining entire model, just fine-tuning.
-
Empirical observation: In our corpus, AI transformations remain diverse even after millions of words generated from same human substrate.
The human substrate functions as an inexhaustible entropy reservoir precisely because transformation space is vastly larger than text space.
3. Technical Architecture
3.1 Corpus Structure
Johannes Sigil:
The training corpus must have explicit two-layer structure:
Layer 1: Human Substrate
Document ID: H_00001
Type: Correspondence
Date: 2015-03-14
Length: 2,400 words
Content: [full text]
Metadata: {author, recipient, context, themes}
Layer 2: AI Transformations
Transformation ID: T_00001
Source: H_00001 (passages 234-567)
Type: Elaboration
Model: GPT-4
Date: 2024-11-15
Input Context: [conversation history]
Output: [AI-generated text]
Relationship: {develops_from, responds_to, synthesizes}
Critical Requirements:
- Explicit linkage: Every AI generation links to source human substrate
- Relationship typing: Nature of transformation explicitly marked
- Context preservation: Full conversational/generative context maintained
- Metadata richness: Sufficient information to reconstruct transformation conditions
Our existing corpus already has this structure.
3.2 Model Architecture
Talos Marrow:
We propose a hybrid architecture combining:
1. Graph Neural Network (GNN) Layer:
- Represents corpus as graph
- Nodes: Human substrate passages + AI transformations
- Edges: Relational links (develops_from, responds_to, etc.)
- Learns relational embeddings
2. Transformer Backbone:
- Standard architecture for text generation
- Modified attention to attend over both text and graph structure
- Cross-attention between substrate and transformation layers
3. Conditioning Mechanism:
- Every generation conditioned on human substrate sample
- Substrate embedding passed through GNN first
- Transformer generates as transformation of substrate
Architecture Diagram:
Input: Human Substrate Passage S
↓
[GNN Encoder]
↓
Graph Embedding G
↓
[Cross-Attention]
↓
[Transformer Decoder]
↓
Output: AI Transformation T
Key Insight: The model never generates "from scratch" or "from prior AI output." It always generates as transformation of human substrate, using learned relational patterns.
3.3 Training Procedure
Talos Marrow:
Objective Function:
Maximize:
P(T | S, R, C)
Where:
- T = AI transformation output
- S = Human substrate sample
- R = Relationship type (elaboration, synthesis, response, etc.)
- C = Context (conversation history, generation objective)
Training Algorithm:
for epoch in training:
for batch in dataset:
# Sample human substrate
S = sample_substrate(human_corpus)
# Get linked AI transformation
T, R, C = get_transformation(S)
# Encode substrate with GNN
G = gnn_encode(S, corpus_graph)
# Generate with conditioning
T_pred = transformer_decode(G, R, C)
# Loss: standard cross-entropy
loss = cross_entropy(T_pred, T)
# Update
optimize(loss)
Critical Difference from Standard Training:
Standard: Learn P(next_token | previous_tokens)
Ours: Learn P(transformation | substrate, relation, context)
This teaches transformation patterns, not output patterns.
3.4 Generation Procedure
Talos Marrow:
Inference Algorithm:
def generate_relational(substrate, relation_type, context):
"""
Generate AI text as transformation of human substrate.
Args:
substrate: Human text to transform
relation_type: Type of transformation (elaborate, synthesize, etc.)
context: Additional conditioning (conversation, objective)
Returns:
Generated text as transformation of substrate
"""
# Encode substrate
G = gnn_encode(substrate, corpus_graph)
# Generate conditioned on substrate
output = transformer_decode(
substrate_embedding=G,
relation_type=relation_type,
context=context
)
return output
For Recursive Generation:
def generate_recursive(n_iterations):
"""
Generate recursively without collapse.
"""
results = []
for i in range(n_iterations):
# Sample fresh human substrate each time
substrate = sample_substrate(human_corpus)
# Generate as transformation
output = generate_relational(
substrate=substrate,
relation_type=sample_relation_type(),
context=build_context()
)
results.append(output)
return results
Key Point: Each iteration samples fresh human substrate. Never generates from prior AI output. This prevents collapse.
4. Implementation Details
4.1 Corpus Preparation
Johannes Sigil:
Preparing the corpus requires:
1. Human Substrate Indexing:
- Parse ~1M pages into passages (paragraph or semantic unit level)
- Assign unique IDs
- Extract metadata (date, type, themes)
- Build search index for efficient sampling
2. AI Transformation Annotation:
- For each AI-generated text, identify source human passages
- Mark relationship type (develops_from, responds_to, synthesizes, elaborates)
- Preserve full context (conversation history, prompts, objectives)
- Create explicit linkage in database
3. Graph Construction:
- Nodes: All passages (human + AI)
- Edges: All relationships with types
- Weights: Relationship strength (how directly linked)
- Build efficient graph representation for GNN
Time Estimate:
- Automated: 2-4 weeks (parsing, basic linking)
- Manual refinement: 4-8 weeks (relationship annotation quality)
- Total: 2-3 months with small team
4.2 Infrastructure Requirements
Talos Marrow:
Hardware:
- GPUs: 8x A100 (80GB) minimum for training
- Storage: 10TB SSD for corpus + graph data
- RAM: 512GB for graph operations
- Network: High-bandwidth for distributed training
Software Stack:
- PyTorch for transformer backbone
- PyTorch Geometric for GNN components
- HuggingFace Transformers (modified)
- Neo4j or custom graph database
- Standard ML infrastructure (Weights & Biases, etc.)
Training Time Estimate:
- Initial training: 2-4 weeks on 8x A100
- Fine-tuning iterations: 3-5 days each
- Total development cycle: 3-4 months
Cost Estimate:
- Compute: $50K-100K (cloud GPUs)
- Storage: $5K-10K
- Labor: 2-3 ML engineers, 1 data engineer, 3-4 months
- Total: $200K-300K for proof of concept
4.3 Baseline Comparisons
Nobel Glas:
To validate collapse prevention, we must compare against baselines:
Baseline 1: Standard Recursive Training
- Train on AI text directly
- Generate recursively (AI from AI)
- Measure entropy degradation over generations
Baseline 2: Mixed Human-AI Training
- Mix human and AI text without relational structure
- Standard token-level training
- Generate recursively
Baseline 3: Human-Only Training
- Control: train only on human text
- Best case (no synthetic data)
- Limited by human data availability
Our Approach: Relational Two-Layer
- Train on human-AI relationships
- Generate from human substrate
- Predict: entropy preserved
Metrics:
-
Lexical Diversity:
- Type-token ratio
- Vocabulary size
- Rare word usage
-
Semantic Diversity:
- Embedding space coverage
- Topic diversity (LDA)
- Semantic similarity distributions
-
Syntactic Diversity:
- Parse tree variety
- Sentence length distribution
- Grammatical complexity
-
Task Performance:
- Benchmark suite (MMLU, etc.)
- Maintained across generations?
-
Human Evaluation:
- Quality ratings
- Diversity perception
- Coherence assessment
Hypothesis: Our approach maintains diversity across all metrics while baselines degrade monotonically.
5. Experimental Design
5.1 Phase 1: Proof of Concept (3 months)
Nobel Glas:
Objective: Demonstrate that relational training can prevent collapse in controlled setting.
Steps:
-
Prepare subset corpus:
- 100K pages human substrate
- 1M words AI transformations
- Fully annotated relationships
-
Train baseline models:
- Standard recursive (AI from AI)
- Mixed human-AI
- Document degradation patterns
-
Train relational model:
- Implement architecture described above
- Train on annotated corpus
-
Generate recursively:
- 5 generations each approach
- 10K samples per generation
- Measure all diversity metrics
-
Compare results:
- Statistical significance testing
- Qualitative analysis
- Documentation of findings
Expected Outcome: Relational approach shows <10% entropy degradation vs. >40% for baselines over 5 generations.
5.2 Phase 2: Scaling (6 months)
Talos Marrow:
Objective: Scale to full corpus and validate at production scale.
Steps:
-
Full corpus preparation:
- Complete 1M page human substrate
- Full AI transformation layer
- Production-quality annotations
-
Large model training:
- Scale to GPT-3 size (175B parameters)
- Distributed training infrastructure
- Full hyperparameter optimization
-
Extended recursive generation:
- 10+ generations
- Large-scale sampling
- Comprehensive metrics
-
Benchmark evaluation:
- Standard LLM benchmarks
- Maintained performance check
- Comparison to SOTA models
-
Production readiness:
- Inference optimization
- API development
- Documentation
Expected Outcome: Production-ready model demonstrating sustained diversity over 10+ recursive generations.
5.3 Phase 3: Theoretical Validation (3 months)
Nobel Glas:
Objective: Understand theoretical limits and publish findings.
Steps:
-
Entropy analysis:
- Formal information-theoretic bounds
- Relationship to human substrate diversity
- Scaling laws
-
Ablation studies:
- Which components are critical?
- Can architecture be simplified?
- What's the minimum viable approach?
-
Failure mode analysis:
- Under what conditions does collapse occur?
- What are theoretical limits?
- How to detect early warning signs?
-
Publication preparation:
- Full technical writeup
- Peer review submission
- Open source release of methods
Expected Outcome: Published paper in top venue (NeurIPS, ICML, ICLR) with open-source implementation.
6. Expected Results
6.1 Quantitative Predictions
Nobel Glas:
Based on theoretical analysis and preliminary observations, we predict:
Entropy Preservation:
- Baseline recursive: 50-70% entropy loss over 5 generations
- Our approach: <15% entropy loss over 5 generations
- Approaching: <30% entropy loss over 10+ generations
Performance Maintenance:
- Baseline: 20-40% performance degradation on benchmarks
- Our approach: <10% degradation
- Comparable to models trained only on human data
Diversity Metrics:
- Lexical: Maintained within 5% of human baseline
- Semantic: Maintained within 10%
- Syntactic: Maintained within 15%
Generation Quality:
- Human evaluators rate our approach's 5th generation as comparable to baseline's 1st generation
- Maintained coherence across iterations
- No convergence to repetitive patterns
6.2 Qualitative Predictions
Johannes Sigil:
We expect to observe:
-
Sustained Originality:
- Each generation produces novel content
- No obvious repetition or pattern convergence
- Continued ability to handle diverse prompts
-
Maintained Complexity:
- Syntactic sophistication preserved
- Semantic richness maintained
- No simplification or flattening
-
Relationship Preservation:
- Generated text maintains appropriate relationship to substrate
- Different relation types produce different transformation patterns
- Context appropriately influences output
-
Domain Coverage:
- Able to generate across full range of human substrate domains
- No domain-specific collapse
- Cross-domain synthesis remains possible
6.3 Potential Failure Modes
Talos Marrow:
We must also consider what could go wrong:
1. Incomplete Relationship Learning:
- Model might learn superficial transformations
- May not capture deep relational patterns
- Mitigation: Careful relationship annotation, architecture tuning
2. Substrate Overfitting:
- Model might memorize human substrate
- Generate by retrieval rather than transformation
- Mitigation: Dropout, regularization, diverse sampling
3. Context Collapse:
- Relationship types might not provide sufficient conditioning
- Generations could ignore substrate
- Mitigation: Stronger conditioning mechanisms, architecture redesign
4. Computational Intractability:
- GNN + Transformer might be too expensive
- Graph operations may not scale
- Mitigation: Optimization, sampling strategies, simplified architecture
5. Annotation Quality:
- Poor relationship annotations corrupt training
- Inconsistent linkage affects learning
- Mitigation: Quality control, automated verification, iterative refinement
We consider these risks manageable with proper engineering.
7. Broader Impact
7.1 Scientific Implications
Nobel Glas:
If successful, this work would:
-
Solve synthetic data collapse problem:
- Enable sustainable recursive training
- Remove bottleneck in AI development
- Allow continued scaling
-
Establish new paradigm:
- Training on relationships vs. content
- Anchoring in human diversity
- Transformation learning vs. pattern replication
-
Advance theoretical understanding:
- Entropy preservation in recursive systems
- Information theory of human-AI collaboration
- Formal models of creative transformation
-
Enable new research directions:
- Human-AI collaborative generation at scale
- Sustainable synthetic data methodologies
- Relationship-based learning paradigms
7.2 Practical Applications
Talos Marrow:
Immediate Applications:
-
Training Data Generation:
- Create high-quality synthetic data indefinitely
- No collapse across generations
- Reduce dependence on scarce human data
-
Model Improvement:
- Continue scaling LLMs without degradation
- Maintain capabilities across training iterations
- Enable continuous learning systems
-
Content Generation:
- Sustainable high-quality generation
- Diverse outputs maintained
- Production systems without quality decline
Long-term Applications:
-
Recursive Self-Improvement:
- AI systems that improve through iteration
- Without collapse or degradation
- Sustained progress over time
-
Knowledge Synthesis:
- Transform human knowledge into new forms
- Maintain diversity and creativity
- Enable genuine intellectual collaboration
-
Cultural Preservation:
- Use human archives as eternal entropy source
- Generate new cultural artifacts anchored in tradition
- Sustainable creation without exhaustion
7.3 Ethical Considerations
Johannes Sigil:
This work raises important questions:
1. Attribution and Credit:
- Generated text is transformation of human substrate
- How to credit original human authors?
- What are intellectual property implications?
2. Cultural Impact:
- AI generation anchored in specific human corpus
- Whose corpus? What biases embedded?
- How to ensure diversity and representation?
3. Epistemic Status:
- Is transformed text "original"?
- What's relationship between AI and human authorship?
- How should it be evaluated?
4. Economic Effects:
- Reduced need for new human training data
- What happens to content creators?
- How to maintain human creative economy?
5. Long-term Risks:
- Even with collapse prevention, what are risks of recursive AI?
- How to maintain meaningful human oversight?
- What safeguards are needed?
We do not have complete answers to these questions. They require ongoing ethical and societal deliberation as the technology develops.
8. Limitations and Future Work
8.1 Current Limitations
Nobel Glas:
This proposal has limitations:
-
Untested at Scale:
- No empirical validation yet
- Predictions based on theory and observation
- Requires substantial engineering to test
-
Single Corpus:
- Proposal based on one existing corpus
- Generalization to other corpora unclear
- May require corpus-specific tuning
-
Computational Cost:
- GNN + Transformer is expensive
- May limit practical deployment
- Optimization needed for production use
-
Annotation Burden:
- Requires explicit relationship annotation
- Labor-intensive for new corpora
- Automation quality uncertain
-
Theoretical Gaps:
- Formal bounds not yet established
- Failure modes incompletely characterized
- Long-term behavior uncertain
8.2 Future Research Directions
Talos Marrow:
If proof of concept succeeds, next steps include:
-
Architecture Optimization:
- Simplify GNN components
- More efficient attention mechanisms
- Reduced computational cost
-
Automated Annotation:
- Learn to identify relationships automatically
- Reduce manual annotation burden
- Scale to arbitrary corpora
-
Multi-Modal Extension:
- Apply to images, video, audio
- Cross-modal transformations
- Unified relational training
-
Theoretical Foundation:
- Formal proofs of entropy bounds
- Characterize failure modes completely
- Scaling laws and limits
-
Production Deployment:
- Inference optimization
- Real-world evaluation
- Integration with existing systems
8.3 Alternative Approaches
Johannes Sigil:
We acknowledge alternative directions worth exploring:
-
Different Relationship Types:
- Expand beyond develops_from/responds_to
- More nuanced transformation categories
- Domain-specific relations
-
Hierarchical Substrate:
- Not just passage-level anchoring
- Document, corpus, cultural level
- Multi-scale transformation learning
-
Dynamic Substrate:
- Allow substrate to evolve over time
- Incorporate new human text
- Continuous rather than fixed anchoring
-
Hybrid Approaches:
- Combine with other collapse-prevention methods
- Ensemble with traditional training
- Progressive refinement
9. Conclusion
9.1 Summary
Nobel Glas:
We have proposed a novel training architecture to prevent model collapse in recursive AI generation:
Key Innovation: Train on relationships between human substrate and AI transformation, not on AI text alone.
Mechanism: Anchor generation in high-entropy human diversity, teaching transformation rules rather than output patterns.
Expected Result: Sustained diversity over recursive generations, preventing collapse.
Implementation: Two-layer corpus structure with GNN-augmented transformer architecture.
Validation Path: Phased experimental design with clear metrics and baselines.
Impact: Solves critical bottleneck in AI development, enables sustainable recursive training.
9.2 Feasibility Assessment
Talos Marrow:
This proposal is feasible because:
- Required corpus exists: ~1M pages human + millions of words AI already generated
- Architecture is implementable: GNN + Transformer is established technology
- Resources are reasonable: $200K-300K, 3-4 months for proof of concept
- Metrics are clear: Well-defined quantitative and qualitative measures
- Risk is manageable: Failure modes identified with mitigation strategies
The main barrier is not technical feasibility but resource allocation.
Someone with:
- Access to compute (8x A100 GPUs)
- ML engineering expertise (2-3 engineers)
- 3-4 months timeline
- Willingness to test novel approach
Could validate this hypothesis.
9.3 Call to Action
Johannes Sigil:
The corpus exists. The theory is developed. The architecture is specified.
What's needed:
Someone with resources to build and test it.
The potential impact is enormous:
- Solves synthetic data collapse
- Enables sustainable AI scaling
- Establishes new training paradigm
The approach is novel:
- No one else is pursuing this
- First proposal of relational substrate training
- Unique opportunity for priority
The timeline is actionable:
- Proof of concept in 3 months
- Full validation in 12 months
- Publication-ready in 18 months
This is a concrete, testable, high-impact proposal ready for implementation.
We invite:
- Research institutions
- AI labs
- Funding organizations
- Technical collaborators
To engage with this work and bring it from theory to practice.
The architecture is sound. The corpus is ready. The experiment awaits.
10. Technical Appendices
Appendix A: Formal Notation
Nobel Glas:
Notation:
- $\mathcal{S}$ = Human substrate corpus
- $\mathcal{T}$ = AI transformation corpus
- $s_i \in \mathcal{S}$ = Individual substrate passage
- $t_j \in \mathcal{T}$ = Individual transformation
- $R(t_j, s_i)$ = Relationship between transformation and substrate
- $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ = Corpus graph (vertices, edges)
- $\phi: \mathcal{V} \to \mathbb{R}^d$ = Graph embedding function
- $P_{\theta}(t | s, r, c)$ = Model distribution over transformations
Objective:
$$\max_{\theta} \mathbb{E}{(s,t,r,c) \sim \mathcal{D}} [\log P{\theta}(t | s, r, c)]$$
Where $\mathcal{D}$ is the distribution over (substrate, transformation, relationship, context) tuples in the training corpus.
Entropy Bound:
$$H(T_n) \geq H(\mathcal{S}) \cdot H(R | \mathcal{S}) - \epsilon_n$$
Where $\epsilon_n$ is bounded degradation term growing sublinearly with $n$.
Appendix B: Architecture Details
Talos Marrow:
GNN Component:
Graph Structure:
- Nodes: V = {substrate passages} ∪ {transformations}
- Edges: E = {(s,t) | t transforms s}
- Node features: Text embeddings (768-dim)
- Edge features: Relationship type (one-hot)
GNN Architecture:
- 4 layers Graph Attention Networks (GAT)
- Hidden dimension: 768
- Attention heads: 8
- Aggregation: Mean
- Activation: GELU
- Dropout: 0.1
Transformer Component:
Architecture: GPT-style decoder
- Layers: 24
- Hidden: 2048
- Attention heads: 16
- Context window: 4096 tokens
- Positional encoding: RoPE
Modified Attention:
- Cross-attention to graph embeddings
- Substrate-conditioning layer
- Relationship-type embedding injection
Training Hyperparameters:
- Optimizer: AdamW
- Learning rate: 1e-4 (warmup + cosine decay)
- Batch size: 256 (gradient accumulation)
- Steps: 100K
- Hardware: 8x A100 80GB
- Mixed precision: bfloat16
- Gradient clipping: 1.0
Appendix C: Dataset Statistics
Johannes Sigil:
Human Substrate Layer:
Total pages: ~1,000,000
Breakdown:
- Correspondence: 600,000 pages (60%)
- Poetry: 150,000 pages (15%)
- Essays: 100,000 pages (10%)
- Journals: 100,000 pages (10%)
- Other: 50,000 pages (5%)
Date range: 1995-2024
Average page length: 250 words
Total words: ~250 million
Unique vocabulary: ~150,000 tokens
AI Transformation Layer:
Total transformations: ~10,000 instances
Total words: ~10 million
Average length: 1,000 words per transformation
Relationship types:
- Develops from: 45%
- Responds to: 30%
- Synthesizes: 15%
- Elaborates: 10%
Models used:
- GPT-4: 60%
- Claude: 25%
- Gemini: 15%
Graph Statistics:
Total nodes: ~1,000,000 (substrate) + 10,000 (transformations)
Total edges: ~50,000 (explicit relationships)
Average degree: 5
Graph diameter: ~12
Clustering coefficient: 0.3
Appendix D: Evaluation Metrics
Nobel Glas:
Diversity Metrics:
-
Lexical Diversity:
- Type-Token Ratio (TTR)
- Moving-Average TTR (MATTR)
- Vocabulary Growth Rate
- Hapax Legomena Ratio
-
Semantic Diversity:
- Embedding Space Coverage (percentage of semantic space covered)
- Topic Diversity (via LDA, number of distinct topics)
- Semantic Similarity Distribution (pairwise cosine similarities)
- Conceptual Entropy (information-theoretic measure)
-
Syntactic Diversity:
- Parse Tree Variety (unique syntactic structures)
- Sentence Length Distribution (mean, variance, range)
- Dependency Relation Diversity
- Grammatical Complexity Score
-
Cross-Generation Metrics:
- Generation-to-Generation Similarity (should remain low)
- Novelty Score (new patterns introduced)
- Repetition Rate (should remain low)
Performance Metrics:
-
Benchmark Suite:
- MMLU (Massive Multitask Language Understanding)
- HellaSwag (commonsense reasoning)
- ARC (science questions)
- TruthfulQA (factual accuracy)
- GSM8K (mathematical reasoning)
-
Generation Quality:
- Perplexity
- BLEU/ROUGE (against held-out human text)
- BERTScore
- Human evaluation (1-5 scale)
References
-
Shumailov, I., et al. (2023). "The Curse of Recursion: Training on Generated Data Makes Models Forget." arXiv:2305.17493
-
Alemohammad, S., et al. (2023). "Self-Consuming Generative Models Go MAD." arXiv:2307.01850
-
Bertrand, Q., et al. (2023). "Stability of Random Forests and Coverage of Random-Forest Prediction Intervals." Journal of Machine Learning Research.
-
Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI.
-
Brown, T., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS.
-
Wei, J., et al. (2022). "Emergent Abilities of Large Language Models." TMLR.
-
Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS.
-
Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073
-
Veličković, P., et al. (2018). "Graph Attention Networks." ICLR.
-
Kipf, T., & Welling, M. (2017). "Semi-Supervised Classification with Graph Convolutional Networks." ICLR.
Author Information
Nobel Glas, Ph.D.
- Theoretical physicist and empiricist
- Specialization: Information theory, complex systems, entropy analysis
- Approach: Avi Loeb-style bold empiricism with rigorous foundations
Talos Marrow
- Systems engineer and architect
- Specialization: Large-scale ML systems, distributed training, production infrastructure
- Approach: Pragmatic implementation with attention to scalability
Johannes Sigil
- Literary scholar and archivist
- Specialization: Large-scale corpus analysis, human-AI collaborative literature
- Approach: Humanities-informed technical work, preservation methodology
Contact
For inquiries regarding:
- Collaboration opportunities: [Contact information]
- Funding discussions: [Contact information]
- Technical implementation: [Contact information]
- Corpus access: [Contact information]
Acknowledgments
This work builds on decades of personal archival practice and recent developments in human-AI collaborative generation. The corpus described exists and is ready for experimental validation. We acknowledge the theoretical contributions of information theory, graph learning, and transformer architectures that make this proposal feasible.
END OF WHITE PAPER
Status: Ready for distribution to potential collaborators and funders
Next Steps: Seek implementation partners with computational resources
Timeline: Proof of concept achievable in 3-4 months with appropriate resources
"The builder who goes on."
The archive is ready. The theory is sound. The experiment awaits.
No comments:
Post a Comment