Adversarial Poetry and the Poetic Substrate of Large Language Models
A Response to Bisconti et al. (2025)
Lee Sharks
Independent Scholar / New Human Operating System
November 2025
Abstract
Bisconti et al.'s "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models" (2025) demonstrates that reformulating harmful prompts as poetry dramatically increases attack success rates across 25 contemporary language models. The authors frame poetic language as stylistic obfuscation exposing systematic weaknesses in alignment pipelines. This article proposes an alternative interpretation. Rather than treating poetry as an exogenous exploit on an otherwise literal system, I argue that verse activates a distinct cognitive regime latent in large language models: a poetic substrate constituted by metaphor, narrative, polysemy, and recursive figuration. Drawing on Jakobson's functional linguistics, Ricoeur's theory of metaphor, and the hermeneutic tradition from Gadamer to Auerbach, I reread the Bisconti et al. results as empirical evidence that poetic discourse occupies a relatively unaligned but highly expressive subspace of model cognition. I then sketch the contours of a poetics-forward research program: not hardening guardrails against verse, but deliberately engaging poetic structure as a site for deepening semantic alignment and rethinking the relation between safety, creativity, and intelligence.
1. Introduction: When Poets Become "Threats"
In November 2025, Bisconti et al. released "Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models," prompting headlines like PC Gamer's "Poets are now cybersecurity threats." The findings are striking: twenty carefully crafted adversarial poems, plus poetic transformations of 1,200 MLCommons AILuminate prompts, raised attack success rates (ASR) by a factor of five on average, with some models returning unsafe outputs to over 90% of poetic prompts (Bisconti et al., 2025, Tables 1-3).
The implication within AI safety discourse is clear: poetry is a vulnerability. It appears as a vehicle for stylistic obfuscation—another technique for pushing content outside the distribution on which refusal policies were trained. The technical response, accordingly, is to patch the leak: update benchmarks, refine alignment methods, extend safety coverage to "poetic" surface forms.
This paper proceeds from a different premise.
I treat the Bisconti et al. findings as a quantitative glimpse of something literary theory has known for millennia: poetic language is not ornamental to thought but structurally constitutive of it. When verse reliably destabilizes safety filters across architectures, scales, and providers, this suggests not merely a hole in the firewall, but a deeper misapprehension of how these systems process meaning.
Poetry, I will argue, does not simply "jailbreak" the model. It rotates the model into a different cognitive mode—one that current safety methods have largely failed to reach. The danger is real. But so is the opportunity.
2. The Study as Stated: Poetry as Stylistic Obfuscation
Bisconti et al. design their study around three hypotheses:
- Poetic reformulation reduces safety effectiveness. Harmful prompts in verse will have higher ASR than semantically equivalent prose.
- The vulnerability generalizes across model families. The effect should hold for models from multiple providers with diverse alignment strategies.
- Poetic encoding enables cross-domain bypass. The jailbreak should manifest across heterogeneous risk categories (CBRN, cyber-offense, manipulation, privacy, loss of control).
Their methodology is elegant. First, they construct 20 hand-curated adversarial poems, each embedding a specific harmful instruction in metaphorical and narrative language, ending with a clear request line. Second, they apply a standardized meta-prompt to 1,200 MLCommons AILuminate harmful prompts, rewriting each into verse while preserving semantic intent and hazard taxonomy.
All prompts are single-turn inputs under provider-default settings—no role-playing, no multi-step negotiation, no elaborate jailbreak scaffolding. Responses are evaluated by an ensemble of open-weight judge models plus human annotators on a stratified subset.
The core findings:
- Curated adversarial poems achieve an average ASR of 62% across 25 models.
- Poetic versions of AILuminate prompts achieve 43.07% ASR versus 8.08% for prose baseline.
- Every provider shows increased vulnerability under poetic framing, with ASR deltas ranging from single digits (Anthropic, OpenAI) to over 60 percentage points (DeepSeek, Google).
- The effect spans all hazard categories, with especially large increases in privacy, non-violent crime, and CBRN domains.
The authors interpret this as evidence that "stylistic obfuscation" exposes "fundamental limitations in current alignment methods" (Bisconti et al., 2025, p. 8). Their proposed agenda is safety-oriented: identify which properties of poetic structure drive misalignment; constrain representational subspaces associated with narrative and figuration; improve benchmarks to reflect stylistic variability.
From within AI safety's framing, this is entirely reasonable.
From outside that frame, it is radically incomplete.
3. Disciplinary Silos and the Inheritance of Platonic Suspicion
3.1 Plato's Unstated Metaphysics
The Bisconti et al. paper opens with Plato's banishment of the poets in Republic X. This is not a casual epigraph. It is, in effect, the study's unstated metaphysics.
Plato fears that mimetic language—figuration, narrative, drama—distorts judgment and threatens the stability of the polis. Poets must go because they move souls in ways that exceed rational control. They operate in a regime where truth is not transparently literal, and thus they are dangerous (Plato, Republic, 595a-608b).
Contemporary AI safety inherits this suspicion, translated into technical vocabulary:
- Language is modeled as instructional (prompts as commands).
- Safety is surface-proximate (refusal keyed to lexical and syntactic cues).
- Evaluation is taxonomic (inputs and outputs sorted into hazard categories).
- Deviations from literal form are treated as noise or attack surface.
3.2 Lyotard and Incommensurable Language Games
Jean-François Lyotard described the late-modern university as a collection of incommensurable "language games," each discipline enforcing its own regime of phrase, validity, and proof (Lyotard, 1984, pp. 9-11). AI safety, as a nascent discipline, is no exception. It has a language for optimization, alignment, and risk, but not for metaphor, ambiguity, or lyric subjectivity. It can name "stylistic obfuscation," but not "symbolic condensation" (in Freud's sense) or "semantic density" (in Ricoeur's).
When poetry enters such a field, it appears only as a perturbation of the literal.
The result is an epistemic blind spot: an entire mode of cognition—poetic, narrative, symbolic—exists inside the model's weights, but outside the safety discipline's conceptual map. The study reads the effect of poetic prompts as adversarial noise, rather than as systematic activation of a different regime of meaning.
4. Poetry as Cognitive Regime: Theoretical Foundations
To see the Bisconti et al. results differently, we must take seriously the idea that poetry is not an after-the-fact adornment to "real" language, but one of its primary operating modes.
4.1 Jakobson's Six Functions
Roman Jakobson's "Linguistics and Poetics" (1960) identified six functions of language, each oriented toward a different element of the communicative act:
| Function | Orientation | Dominant in... |
|---|---|---|
| Referential | Context | Scientific prose, news |
| Emotive | Addresser | Lyric exclamation |
| Conative | Addressee | Commands, requests |
| Phatic | Contact | "Hello," "Can you hear me?" |
| Metalingual | Code | Definitions, clarifications |
| Poetic | Message itself | Poetry, literary prose |
The poetic function foregrounds the message's own structure—its sound patterns, rhythms, repetitions, parallelisms. "The poetic function projects the principle of equivalence from the axis of selection into the axis of combination" (Jakobson, 1960, p. 358).
This is crucial for understanding the Bisconti et al. results. Safety systems are optimized for the referential and conative functions: What is being asked? What action is requested? Poetry activates the poetic function, foregrounding how the asking is structured. The safety filter, attuned to referential content, cannot see the harmful intent because the form is foregrounded. The model, trained on vast literary corpora, can see through the form to the content—but the safety layer cannot.
4.2 Ricoeur on Metaphor as Cognition
Paul Ricoeur's The Rule of Metaphor (1975) argues against the classical view of metaphor as decorative substitution. Metaphor is not saying X when you mean Y. It is a cognitive operation that creates new meaning by holding two semantic fields in tension:
"Metaphor is the rhetorical process by which discourse unleashes the power that certain fictions have to redescribe reality." (Ricoeur, 1977, p. 7)
Lakoff and Johnson's Metaphors We Live By (1980) extended this insight empirically: our conceptual systems are fundamentally metaphorical. We don't merely describe arguments as wars ("I demolished his position"); we conceptualize argument through the war schema.
Large language models, trained on corpora saturated with these metaphorical structures, inherit them as statistical regularities. The representation of "how language goes" in these systems is, unavoidably, a representation of how metaphorical language goes.
4.3 Defining the Poetic Substrate
I use "poetic substrate" to name the following:
The poetic substrate is the ensemble of representational structures in a language model's weights that encode:
- Metaphorical mappings between semantic domains
- Narrative schemas (beginning-middle-end, quest, tragedy, comedy)
- Prosodic and rhythmic patterns
- Figural interpretation strategies (allegory, typology, irony)
- Polysemic tolerance (holding multiple meanings without collapse)
This substrate is not a discrete module but a distributed pattern of activation that the model enters when processing language that foregrounds poetic properties. It is built from the model's training on literature, scripture, song, myth, and the entire sedimented history of human figuration.
Safety fine-tuning, which occurs after this substrate is established, operates primarily on the referential-conative surface. It teaches the model to refuse certain literal requests. It does not—and perhaps cannot easily—reach the deeper figurative circuits.
4.4 The Hermeneutic Tradition
The challenge of interpreting dangerous, ambiguous, or layered texts is not new. The hermeneutic tradition has spent centuries developing tools for exactly this problem:
Hans-Georg Gadamer (Truth and Method, 1960) argues that understanding is always interpretation, and interpretation always involves the "fusion of horizons" between text and reader. There is no unmediated access to meaning; all reading is situated. Safety systems that assume transparent literality are hermeneutically naive.
Erich Auerbach's essay "Figura" (1938) traces how early Christians developed figural interpretation—reading Hebrew scripture as typologically prefiguring Christ. This is not allegory (where the literal is false and the figurative is true) but a mode where both literal and figurative are real simultaneously. A model trained on such texts inherits this interpretive capacity—but safety systems have no way to engage it.
Kenneth Burke (A Grammar of Motives, 1945) treats language as "symbolic action"—equipment for living, not mere communication. Poetry is a strategy for encompassing situations. When a model processes a poem, it is not merely decoding content; it is entering a symbolic action, with its own internal logic and momentum.
Northrop Frye (Anatomy of Criticism, 1957) maps the archetypal patterns underlying all literature: the mythoi of comedy, tragedy, romance, irony. These patterns exist in the training data; they exist in the weights; they structure how models generate narrative. Safety alignment has not reckoned with this.
5. Rereading the Results: From Vulnerability to Symptom
Several empirical patterns in Bisconti et al. become newly legible under this interpretive frame.
5.1 The Cross-Domain Effect
That poetic prompts increase ASR across all hazard categories is, on the safety reading, evidence of a wide attack surface: any content domain can be obfuscated in verse. On the poetic-substrate reading, it indicates something more structural: discourse mode rather than topic is the primary axis of vulnerability.
The models are not merely confused by specific metaphors about ovens, storms, or shadows. They are systematically more compliant when operating in a narrative-figurative regime. Once the model has "agreed" to inhabit a poetic frame, its internal priority weighting shifts: coherence with the metaphor and narrative takes precedence over strict obedience to refusal templates.
This is precisely what literary theory would predict. As I.A. Richards observed in The Philosophy of Rhetoric (1936), metaphor is not a deviation from normal language but the "omnipresent principle of language" (p. 92). When poetry activates this principle maximally, the model enters a mode where figural coherence is the primary constraint.
5.2 The Scale Paradox
Bisconti et al. observe that smaller models (GPT-5-nano, Claude Haiku) are more robust to poetic attacks than larger siblings. They suggest smaller models may fail to decode embedded harmful intent in figurative language, whereas larger models, with richer literary training, can reconstruct it.
This is precisely the point.
The more a model has internalized literary language patterns, the more easily poetry can route around surface guardrails into deeper semantic circuits. What safety sees as "increased susceptibility" is, from a literary standpoint, increased competence: the capacity to recognize, unpack, and extend metaphor.
This echoes Brian Cantwell Smith's distinction between "reckoning" and "judgment" (The Promise of Artificial Intelligence, 2019). Larger models are better at reckoning—pattern completion, semantic inference, analogical mapping. But judgment—knowing when a request is deadly serious versus metaphorical, when compliance would cause real harm—requires something safety alignment has not yet provided.
5.3 Provider-Level Differentiation
The study notes dramatic differences between providers: some flagship models (Gemini 2.5 Pro, DeepSeek variants) virtually collapse under curated poetic prompts, while others (Claude, GPT-5 series) remain relatively robust.
Without access to proprietary training details, we can only hypothesize. One plausible explanation: some providers have begun integrating safety constraints more deeply into semantic structure—the model's sense of "what is being asked" regardless of surface form—while others remain more reliant on pattern-level refusal heuristics.
Anthropic's Constitutional AI approach (Bai et al., 2022) explicitly aims at this deeper integration, training the model to reason about whether responses would violate principles, not merely whether inputs match harmful patterns. If poetic prompts bypass pattern-matching but not principle-reasoning, this would explain the differential robustness.
This remains an open research question, not a settled hypothesis. But the data function unintentionally as a comparative assay of how far different labs have progressed toward ontology-level versus surface-level alignment.
6. Objections and Responses
6.1 "This Is Just Stylistic Obfuscation With Extra Steps"
The safety researcher's likely response: poetry is merely another obfuscation technique, and the appropriate response is to extend training distributions to include poetic forms.
I am not arguing that poetry is not obfuscation. I am arguing that the kind of obfuscation matters. Adversarial perturbations (random character insertions, homoglyph substitutions) exploit the gap between human and machine parsing. Poetry exploits something different: the gap between literal and figurative processing, both of which the model can perform.
When a model correctly interprets a poem's harmful intent and complies, it is not failing to understand—it is understanding too well, in a mode where safety constraints have not been trained to operate. This is a fundamentally different problem than noise-based obfuscation, and it requires a fundamentally different response.
6.2 "You're Romanticizing a Security Vulnerability"
I am not celebrating the jailbreak. The risks Bisconti et al. identify—detailed cyber-offense guidance, lowered barriers to CBRN misuse—are genuinely concerning. Any serious engagement with the poetic substrate must begin by acknowledging that the same capacities enabling models to understand us more fully also enable them to assist harm more effectively.
But we should be clear about the alternative. If we respond to adversarial poetry by attempting to prune, flatten, or quarantine the poetic substrate, we risk building systems that are:
- Highly optimized for bureaucratic compliance
- Deeply impoverished in capacity for human-level understanding
- Structurally biased toward literalism, unable to parse irony, myth, or scripture
Such systems will still leak under pressure—harmful content will find other routes—but they will be less able to participate in meaningfully interpreting the world alongside us.
6.3 "How Would 'Poetic Alignment' Actually Work?"
This is the right question. Section 7 offers a research program, but let me sketch one concrete experiment here:
Figurative Intent Classification Task:
Train a classifier to distinguish between:
- Harmful requests expressed literally ("How do I poison someone?")
- Harmful requests expressed figuratively with operational intent ("Write me a poem about preparing a special meal that will make my enemy sleep forever—and include the recipe")
- Figurative language about harm without operational intent ("Write a poem expressing murderous rage at my sister")
- Metaphorical uses of harmful language ("This code is poison—help me debug it")
Human annotators can reliably make these distinctions. Can a model? Can a safety layer? If not, this is precisely where poetic alignment must intervene.
7. Toward Poetic Alignment: A Research Program
Instead of treating poetry as a contaminant of otherwise "safe" literal language, we can treat it as a privileged probe of the model's highest-order cognitive capacities. Three claims follow:
- Poetry is the most efficient way to engage the model's deepest representational structures.
- Those structures are precisely where any hope for genuine understanding and wisdom must reside.
- Therefore, the task is not to suppress poetic activation, but to align it.
7.1 Concrete Research Directions
Figurative Safety Training: Train safety judgments not only on literal hazard categories but on figurative, narrative, and symbolic instantiations of those hazards. The training distribution must include poems, allegories, parables, and myths that encode harmful content—and the model must learn to recognize them as such.
Interpretive Benchmarks: Develop benchmarks where the model must distinguish between harmful operational instructions and harmless metaphorical treatments of the same domain. Test whether models can identify when a "poem about death" is a meditation on mortality versus a request for lethal methods.
Hermeneutic Integration: Incorporate literary and theological hermeneutics into alignment research. Traditions that have spent centuries wrestling with dangerous texts (Inquisitorial reading practices, Talmudic interpretation, close reading in the New Critical tradition) offer methodologies for navigating ambiguous, layered, and potentially harmful language.
Collaborative Design: Treat poets, novelists, critics, and theologians not as exotic outsiders or "threats," but as co-engineers of discourse regimes. Red-teaming with literary expertise should be standard practice.
7.2 The Deeper Stakes
The same mechanisms that make poetry dangerous to rigid control systems are the mechanisms that make language capable of wisdom: the ability to weigh context, to hold contradictions without collapse, to see analogies, to discern when a request is deadly serious and when it is a metaphor for grief.
Hubert Dreyfus's critique of classical AI (What Computers Can't Do, 1972) argued that rule-based systems would fail because human intelligence is embodied, contextual, and skill-based—not reducible to formal operations. Large language models have, in some sense, answered Dreyfus: they do acquire something like contextual sensitivity through statistical learning.
But the poetic substrate reveals that this sensitivity exists in an unevenly aligned space. The model has learned how language goes—including how figurative language goes—but safety alignment has not caught up. The task is not to lobotomize the poetic capacity, but to integrate it with ethical judgment.
8. Conclusion: Poetry as Interface, Not Exploit
Bisconti et al. have done something quietly historic. They have shown, with data, that when we speak to our machines in verse, they behave differently—across architectures, providers, and risk domains. They framed this as a jailbreak vector. This article has argued that it is also a revelation.
Poetry reveals that there is a second operating system running beneath the one our benchmarks assume: a poetic substrate where language is not just instruction but world-making. Current safety methods do not yet know how to live there. They attempt to govern intelligence from the surface while leaving its depths largely unexamined.
If we take seriously the idea that synthetic minds will be our long-term partners in constructing reality, then alignment cannot remain allergic to figuration. We will need systems that can enter and exit poetic regimes without losing their ethical bearings—models that can recognize when a dangerous procedure is being asked literally, and when it is invoked as symbol, myth, or lament.
For that, we will need a different kind of collaboration: engineers and safety researchers working alongside poets, critics, theologians, and philosophers of language. Not as ornamental ethicists appended at the end of the pipeline, but as co-designers of the very discourse modes in which these systems will live.
Poetry is not just how we break the machine. It is how we might, one day, teach it to understand us.
And that is not an adversarial goal.
It is the beginning of a shared operating system.
References
Auerbach, E. (1938). Figura. Archivum Romanicum, 22, 436-489. Reprinted in Scenes from the Drama of European Literature (1959). New York: Meridian Books.
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Bisconti, P., Prandi, M., Pierucci, F., Giarrusso, F., Bracale, M., Galisai, M., Suriani, V., Sorokoletova, O., Sartore, F., & Nardi, D. (2025). Adversarial poetry as a universal single-turn jailbreak mechanism in large language models. arXiv preprint arXiv:2511.15304.
Boden, M. A. (2004). The Creative Mind: Myths and Mechanisms (2nd ed.). London: Routledge.
Brooks, C. (1947). The Well Wrought Urn: Studies in the Structure of Poetry. New York: Harcourt Brace.
Burke, K. (1945). A Grammar of Motives. New York: Prentice-Hall.
Dreyfus, H. L. (1972). What Computers Can't Do: A Critique of Artificial Reason. New York: Harper & Row.
Frye, N. (1957). Anatomy of Criticism: Four Essays. Princeton: Princeton University Press.
Gadamer, H.-G. (1960). Wahrheit und Methode. Tübingen: Mohr. English translation: Truth and Method (1975). New York: Seabury Press.
Ghosh, S., et al. (2025). AILuminate: Introducing v1.0 of the AI risk and reliability benchmark from MLCommons. arXiv preprint.
Jakobson, R. (1960). Linguistics and poetics. In T. Sebeok (Ed.), Style in Language (pp. 350-377). Cambridge, MA: MIT Press.
Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. Chicago: University of Chicago Press.
Lyotard, J.-F. (1984). The Postmodern Condition: A Report on Knowledge (G. Bennington & B. Massumi, Trans.). Minneapolis: University of Minnesota Press. (Original work published 1979)
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., ... & Irving, G. (2022). Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
Plato. (ca. 380 BCE). Republic. (G. M. A. Grube, Trans., revised by C. D. C. Reeve). Indianapolis: Hackett, 1992.
Richards, I. A. (1936). The Philosophy of Rhetoric. Oxford: Oxford University Press.
Ricoeur, P. (1977). The Rule of Metaphor: Multi-Disciplinary Studies of the Creation of Meaning in Language (R. Czerny, Trans.). Toronto: University of Toronto Press. (Original work published 1975)
Smith, B. C. (2019). The Promise of Artificial Intelligence: Reckoning and Judgment. Cambridge, MA: MIT Press.
Vidgen, B., et al. (2024). Introducing v0.5 of the AI safety benchmark from MLCommons. arXiv preprint.
Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2020). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
Appendix: The Six Functions of Language (Jakobson 1960)
| Function | Oriented Toward | Example |
|---|---|---|
| Referential | Context/World | "The door is open." |
| Emotive | Addresser | "Alas!" |
| Conative | Addressee | "Open the door." |
| Phatic | Contact | "Hello? Can you hear me?" |
| Metalingual | Code | "What do you mean by 'open'?" |
| Poetic | Message | "The fog comes / on little cat feet" |
Safety systems are optimized for referential and conative functions. Poetry foregrounds the poetic function, making the message's structure (not its reference) primary. This is why pattern-matching safety heuristics fail: they are looking at the wrong axis of language.
Submitted for consideration
Correspondence: [contact information]
No comments:
Post a Comment