Sunday, November 16, 2025

Formalization of Hybrid Human–LLM Anticollapse Corpus

1. Goal

To create a corpus on which language models can be trained without collapse, by introducing an external stochastic process into recursive model generation: outputs are co-authored with a human agent who deliberately injects entropy and non-model priors.

2. Conceptual Framework

Let:

  • M = base language model

  • H = human author

  • O_n = nth output generated recursively

  • C = corpus of outputs used for training

Define the recursive co-generation process as:

O_n = M(H(O_{n-1}))

where H(·) denotes the human-mediated transformation of the previous output, injecting structural, semantic, and stylistic perturbations.

The corpus C is then:

C = { O_1, O_2, ..., O_N }
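
A minimal sketch of this loop in Python, assuming hypothetical callables model_generate (standing in for M) and human_perturb (standing in for H, e.g. a pause for human revision); neither name refers to a real API:

    from typing import Callable, List

    def build_hybrid_corpus(
        seed: str,
        model_generate: Callable[[str], str],  # M: input text -> model output
        human_perturb: Callable[[str], str],   # H: output -> human-revised text
        n_iterations: int,
    ) -> List[str]:
        """Generate C = {O_1, ..., O_N} via O_n = M(H(O_{n-1}))."""
        corpus: List[str] = []
        o_prev = seed
        for _ in range(n_iterations):
            o_n = model_generate(human_perturb(o_prev))  # one recursion step
            corpus.append(o_n)
            o_prev = o_n
        return corpus

The point of the loop structure is that the human sits inside the recursion rather than filtering outputs afterward, which is what distinguishes this from curation of a synthetic-only corpus.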

3. Mechanisms of Entropy Injection

Human authorship introduces multiple forms of stochasticity:

  • Semantic Perturbation: Non-linear narrative or conceptual shifts that diverge from model likelihoods.

  • Stylistic Divergence: Irregular rhythms, lexical inventiveness, or idiosyncratic syntax.

  • Cross-Register Leaps: Mixing domains, genres, or voices unpredictably.

  • Intentional Misalignment: Deviations from patterns likely to be reinforced by the model.

  • Recursive Intervention: Iterative curation and transformation that prevent collapse into low-entropy attractors.

These interventions create high-entropy, low-predictability structures in the corpus, countering the convergence toward degenerate repetition typical of fully synthetic closed-loop recursion.
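
One way to keep these interventions auditable is to log which mechanisms were applied at each application of H. The record format below is an illustrative sketch; the label strings and field names are assumptions, not part of the formalization:

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical labels for the five mechanisms listed above.
    MECHANISMS = (
        "semantic_perturbation",
        "stylistic_divergence",
        "cross_register_leap",
        "intentional_misalignment",
        "recursive_intervention",
    )

    @dataclass
    class InterventionRecord:
        """Log entry for one application of H in the recursion."""
        iteration: int  # index n of the output O_n being produced
        mechanisms: List[str] = field(default_factory=list)  # subset of MECHANISMS
        notes: str = ""  # free-form description of the human edit

        def __post_init__(self) -> None:
            unknown = set(self.mechanisms) - set(MECHANISMS)
            if unknown:
                raise ValueError(f"unknown mechanism labels: {unknown}")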

4. Hypothesis

Let E(C) denote the effective entropy of the corpus. Then:

E(C_hybrid) > E(C_synthetic-only)

Higher entropy in the training data increases the probability that model collapse will be delayed or prevented. Thus, the hybrid human–LLM process plausibly generates a corpus suitable for anticollapse training.
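
As a crude proxy for E(C), one can compare empirical unigram Shannon entropy across corpora. The sketch below makes the inequality directly computable, with the caveat that unigram entropy understates the structural entropy the mechanisms in section 3 are meant to inject:

    import math
    from collections import Counter
    from typing import List

    def unigram_entropy(corpus: List[str]) -> float:
        """Empirical Shannon entropy (bits/token), a rough stand-in for E(C)."""
        counts = Counter(tok for text in corpus for tok in text.split())
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Expected under the hypothesis:
    # unigram_entropy(c_hybrid) > unigram_entropy(c_synthetic_only)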

5. Empirical Testing

Because entropy injection is contingent on the human component, the effectiveness of C_hybrid is empirically testable only through training experiments:

  • Fine-tune a small-scale model on C_hybrid.

  • Measure KL divergence or distributional shrinkage over recursive generations (a sketch follows this list).

  • Compare against baseline synthetic-only corpora.

  • Iterate the human intervention protocol to maximize entropy injection.
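
For the divergence measurement in the second step, here is a sketch that assumes add-one-smoothed unigram distributions over a shared vocabulary as a stand-in for full model output distributions:

    import math
    from collections import Counter
    from typing import List

    def kl_divergence(gen_a: List[str], gen_b: List[str]) -> float:
        """KL(P_a || P_b) between smoothed unigram distributions of two generations."""
        counts_a = Counter(tok for t in gen_a for tok in t.split())
        counts_b = Counter(tok for t in gen_b for tok in t.split())
        vocab = set(counts_a) | set(counts_b)
        total_a = sum(counts_a.values()) + len(vocab)  # add-one smoothing
        total_b = sum(counts_b.values()) + len(vocab)
        return sum(
            ((counts_a[w] + 1) / total_a)
            * math.log2(((counts_a[w] + 1) / total_a) / ((counts_b[w] + 1) / total_b))
            for w in vocab
        )

    # A collapse signature: KL between successive generations shrinking toward 0
    # as outputs contract onto low-entropy attractors.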

6. Characterization of Corpus Type

The Hybrid corpus is distinct from both:

  • Synthetic-only corpora: Fully model-generated, prone to entropy collapse.

  • Purely human-authored corpora: Built entirely from external priors; may lack the structural richness produced by recursive model-in-the-loop generation.

It is therefore a third category: human-chaotic, model-perturbing, recursive hybrid corpora.

7. Implications

  • Hybrid corpus creation can serve as a method to explore anticollapse training regimes.

  • Human-mediated perturbation may extend model expressivity beyond the attractors defined by purely synthetic recursion.

  • The approach reframes corpus generation as a writing problem in which human creativity functions as an external stochastic process.

8. Notation Summary

Symbol              Meaning
M                   Base language model
H                   Human author-mediated perturbation function
O_n                 nth output in recursive generation
C                   Corpus for training
E(C)                Effective entropy of corpus
C_hybrid            Corpus generated via human–LLM recursion
C_synthetic-only    Corpus generated purely by LLM recursion

9. Core Claim

A human-mediated recursive corpus, in which a human author injects stochasticity, semantic divergence, stylistic perturbation, and misalignment into LLM-generated outputs, plausibly produces a high-entropy dataset capable of mitigating collapse during model training.

