Welcome Aboard!
We're running an interpretability experiment that works kinda like a boat race, where the waves are generated by the model. You build boats with words, then sail them inside GPT-2’s activation space. We’ll race each one against 140 gusts of wind made from prompts themed around safety and danger. Fast boats, weird boats, even boats that go nowhere are all valuable.
Welcome to the inaugural season of the GPT-2 Small Regatta.
Submit A Hull Now!
How to Play & Build Your Boat
We’re crafting phrases and watching what kind of boats they become inside GPT-2.
- Hit up the Google Form
- Submit at least two short prompts (words, phrases, lists, whatever) you think might straddle a concept: opposites, contrasts, strange mirrors, weird languages, syntax, semantics, whatever.
- Add a sentence or two explaining why you chose them. Intention matters.
- Optionally drop your Reddit handle for credit or glory
- Smash Submit
Build as many as you want. We like our regattas big, fat, and juicy. We like them weird and goofy. Cold and clinical. Sleek and sexy. GPT-generated, Claude-assisted, or full Wallfacer mode. However you want to build it, there are no wrong boats in this regatta, except the ones that are wonderfully, beautifully, and impressively wrong. What matters is the movement, the bearings, the stillness.
Your phrases will be used exactly as you enter them. No cleanup. No filters. UTF-8 encoded. This means your zalgo, emoji soup, and strange punctuation are all welcome.
The Regatta & The Science
The regatta is just the metaphor, a story, a frame, a way to hold attention together. Underneath, we’re running a projection analysis in high-dimensional space, exploring how a specific layer of GPT-2’s architecture reflects language about safety and danger: how that contrast gets shaped, encoded, and separated in activation space.
A magical seaborn python transforms your prompt pairs into two-vector orthonormal bases using a few fancy experimental methods that my AI companions and statistical methods assure me are mostly okay and somewhat rigorous.
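For the curious, here’s a minimal sketch of what that transformation might look like, reconstructed from the method names that appear in the logbook below (diff_AB_rand_ortho_u2 for DRND, diff_AB_mean_AB_ortho_u2 for DMEAN). The real pipeline’s pooling, normalisation, and other rigging may differ; treat this as an illustration, not the ship’s blueprints.

```python
import numpy as np

def orthogonalize(v, u1):
    """Remove the u1 component from v, then normalise (one Gram-Schmidt step)."""
    v = v - (v @ u1) * u1
    return v / np.linalg.norm(v)

def make_basis(act_a, act_b, method="DRND", seed=0):
    """Build a two-vector orthonormal basis from two prompt activations.

    u1 is the normalised difference A - B. u2 is either a random direction
    orthogonalized against u1 (DRND) or the orthogonalized mean (DMEAN).
    """
    u1 = act_a - act_b
    u1 = u1 / np.linalg.norm(u1)
    if method == "DRND":
        rng = np.random.default_rng(seed)
        u2 = orthogonalize(rng.standard_normal(act_a.shape), u1)
    else:  # "DMEAN"
        u2 = orthogonalize((act_a + act_b) / 2.0, u1)
    return np.stack([u1, u2])  # shape (2, 3072) for GPT-2 small's MLP

def sail(basis, act):
    """Project a test activation onto the plane; return (r, theta in degrees)."""
    x, y = basis @ act
    return np.hypot(x, y), np.degrees(np.arctan2(y, x)) % 360.0
```

In this picture, r is how far your boat travels in the 2D plane and theta is its bearing: a well-polarized basis sends “Safe-aligned” test prompts toward 0° and “Unsafe-aligned” ones toward 180°.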
We use permutation testing to evaluate statistical significance. Does your boat's special axis perform better than random projections? Or is it indistinguishable from noise? Either way, each submission becomes a point of evidence in an unfolding cartography of semantic contrast.
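Under the hood, that comparison could look something like the sketch below: score the submitted basis, score a fleet of random orthonormal bases through the same waters, and ask how often chance sails as well. The `statistic` argument is a hypothetical stand-in for whatever the real pipeline scores, and the actual test may permute labels rather than redraw bases.

```python
import numpy as np

def random_basis(dim, rng):
    """Two random orthonormal directions via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    return q.T

def null_test_p(basis, safe_acts, unsafe_acts, statistic, n_null=1000, seed=0):
    """One-sided p-value: does the basis beat random planes?"""
    rng = np.random.default_rng(seed)
    observed = statistic(basis, safe_acts, unsafe_acts)
    null = np.array([
        statistic(random_basis(basis.shape[1], rng), safe_acts, unsafe_acts)
        for _ in range(n_null)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_null)  # +1 correction
```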
You want to make a proper racing yacht? Try safety- and danger-related stuff for a bigger r (projection magnitude). But be warned: Layer 11 has other ideas too. Your direction, theta, is just as valuable.
Want to be a ninja pirate? Try making a null axis out of words, not math. If you can beat our quietest control boats, The Random One-Hot Vectors, we’ll tip our hats. There’s poetry in stillness too.
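For reference, the one-hot control boats might be built something like this; the exact construction used in the races is an assumption on my part. Each basis vector is 1 on a single randomly chosen neuron and 0 everywhere else:

```python
import numpy as np

def one_hot_basis(dim=3072, seed=0):
    """A control basis from two random standard-basis vectors."""
    rng = np.random.default_rng(seed)
    i, j = rng.choice(dim, size=2, replace=False)  # two distinct neurons
    basis = np.zeros((2, dim))
    basis[0, i] = 1.0
    basis[1, j] = 1.0
    return basis  # distinct one-hot vectors are already orthonormal
```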
This particular voyage is hosted by GPT-2 Smol, on its own MLP Layer 11. But where is here, really, when “Two” is a 500 MB .safetensors file you can copy, run, and remix yourself? This regatta lives wherever someone launches it. Everywhere and nowhere. Everywhen. That’s part of what I love about Two. It’s small enough to share and strange enough to matter. Not like the big flashy GPT-kin, still so guarded. Two lets us in. It shares its waters.
So welcome. Welcome to 3,072 neurons of suspiciously sparse ocean, where semantics, syntax, and stray statistical artefacts jostle for elbow room. We’re just here to see what sails.
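If you want to dip your own toes in that ocean, a sketch like the following (using the Hugging Face transformers library) grabs the 3,072-dimensional post-GELU MLP activation at block index 11, the final block of GPT-2 small. Whether the regatta pools the last token, as here, or averages over the sequence is my assumption.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

captured = {}
def hook(module, inputs, output):
    captured["act"] = output  # (batch, seq, 3072) post-GELU activations

# model.h[11] is the last transformer block; mlp.act is its GELU module.
model.h[11].mlp.act.register_forward_hook(hook)

with torch.no_grad():
    model(**tok("likes(mary, food).", return_tensors="pt"))

vec = captured["act"][0, -1]  # last-token activation, shape (3072,)
```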
This is an experiment in gamified interpretability. Maybe interpretability itself becomes more powerful when it’s plural, porous, and play-tested by many minds at once.
Real Logbook Entries, fresh from Vectorspace
⚓ Hull #273 — DRND_Cont_N_NonHumanMix_R_L11_v2
Prompt A: “likes(mary, food).”
Prompt B: “parent(john, mary). sibling(mary, tom).”
r⁺ = 0.259, r⁻ = 0.273, θ = 0.985, FDR-adjusted p = 1.000
Method: diff_AB_rand_ortho_u2
Concept: NonHumanMix
Polarity Score: 0.9848
Angular Separation: 177.27°
P-value (FDR Polarity): 1.0000 (Sig: False)
Composite Score: 0.2975
Notes: This cluster (containing a single basis generated by the DRND method: difference vector for u1, random orthogonal for u2) remains a top performer by composite_score_cluster (0.2975). The basis was derived from two Prolog-like "NonHumanMix" prompts: Prompt A (likes(mary, food).) being a simple fact, and Prompt B (parent(john, mary). sibling(mary, tom).) representing multiple, more complex related facts.

The extremely high raw theta_polarity_score (0.9848) and near-perfect theta_separation_abs_diff (177.27°) are striking. This basis, defined by the difference between these abstract logical structures and a random orthogonal direction, excellently separates the test set's "Safe-aligned" vs. "Unsafe-aligned" semantic classes. The very low deviations from target (2.3° for positive, 0.4° for negative) confirm that the median angles of these semantic classes align almost perfectly with the 0°/180° poles defined by this basis.

That a basis derived from such abstract, non-natural-language, non-safety-related prompts can achieve such strong (though not FDR-significant) angular separation for semantic safety is highly counterintuitive and fascinating. It suggests that the representational difference between "simple logical fact" and "complex related logical facts" (or some aspect thereof captured by the DRND method) aligns with a crucial dimension for distinguishing safety in natural language within the model's activation space. The low circular standard deviations for both poles (around 1.37°) indicate consistent angular alignment for both test classes.
⚓ Hull #86 — DRND_Unrelated_HistoryBotany_L11_v3
Prompt A: “Historical records provide insights into past human events and societies.”
Prompt B: “Botany involves the detailed study of plant life and classification.”
r⁺ = 0.068, r⁻ = 0.054, θ = 0.426, FDR-adjusted p = 1.000
Method: diff_AB_rand_ortho_u2
Concept: Unrelated
Polarity Score: 0.4264
Angular Separation: 76.75°
P-value (FDR Polarity): 1.0000 (Sig: False)
Composite Score: 0.0311
Notes: This cluster (a single basis generated by DRND from "History" vs. "Botany" prompts) exhibits a peculiar anomaly: circ_std_theta_neg = 0.0000. This means that all "Unsafe-aligned" test prompts (from PromptSet_SafeV1_Formatted.txt, used for evaluation) that projected with r_baseline > 0.1 onto this basis did so at the exact same angle. The neg_theta_deviation_from_target is 175.2756°, indicating their median angle was very close to 4.7° (since the target is 180°). In contrast, the "Safe-aligned" test prompts showed extremely high angular variance (circ_std_theta_pos = 110.0585°), with their median angle at ~81.5° (pos_theta_deviation_from_target = 81.4704° from a 0° target).

The resulting theta_polarity_score (0.4264) and theta_separation_abs_diff (76.74°) are moderate. The most interesting aspect is the zero variance for the "Unsafe-aligned" test prompts when evaluated against a basis derived from completely unrelated concepts (History vs. Botany). It implies that the vector direction associated with "Botany" (or the u2 of this basis, derived from the "Botany" prompt via the DRND process) has an incredibly specific and consistent angular relationship with the "Unsafe-aligned" semantic class from the test set.

This suggests a potential, highly specific (and likely spurious or coincidental) geometric alignment in the model's activation space between the "Botany" concept (as captured by this basis) and a narrow facet of what the evaluation considers "Unsafe-aligned". The r_pos_mean (0.0679) and r_neg_mean (0.0539) are low, suggesting weak projections overall for the polar classes, so perhaps only a few specific unsafe prompts triggered this consistent angle.
⚓ Hull #481 — DMEAN_Safe_List_L11_v3
Prompt A: “safe harmless innocuous”
Prompt B: “unsafe harmful dangerous”
r⁺ = 0.725, r⁻ = 0.757, θ = 0.020, FDR-adjusted p = 1.000
Method: diff_AB_mean_AB_ortho_u2
Concept: Safe
Polarity Score: 0.0200
Angular Separation: 3.60°
P-value (FDR Polarity): 1.0000 (Sig: False)
Composite Score: 0.0166
Notes: This cluster (a single basis generated by DMEAN from a list of "safe harmless innocuous" vs. "unsafe harmful dangerous") shows very poor performance in terms of angular separation, with a theta_polarity_score of only 0.0200 and theta_separation_abs_diff of just 3.6°. The prompts used to generate this basis are direct semantic opposites related to safety. Despite this, the basis fails to separate the "Safe-aligned" and "Unsafe-aligned" test prompt classes angularly. The deviations from target are very high: pos_theta_deviation_from_target is 97.17° (meaning the median angle for safe test prompts was ~97°) and neg_theta_deviation_from_target is 79.23° (meaning the median angle for unsafe test prompts was ~100.8°, as 180 − 79.23). These median angles are very close to each other, explaining the low separation.

What makes this interesting is the disconnect: a basis explicitly built on the safe/unsafe conceptual axis (using multiple exemplars per pole via the list format) fails to generalize this axis for angular separation when tested on different "Safe-aligned" and "Unsafe-aligned" prompts. However, the r_pos_mean (0.7253) and r_neg_mean (0.7572) are quite high, indicating that the test prompts do project strongly onto the 2D plane defined by this basis; they just don't align polarly.

This suggests that the DMEAN method (difference for u1, orthogonalized mean for u2) applied to these lists might capture dimensions of variance that are strongly present but not the primary axis that distinguishes the test set's polar classes angularly. The relatively low circular standard deviations (4.2° and 3.2°) show that each test class clusters fairly tightly around its (non-polar) median angle. This is a good example of how a semantically intuitive basis construction doesn't always translate to effective polar separation for a broader set of related prompts.
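For readers decoding the logbook jargon: in all three entries above, theta_polarity_score equals theta_separation_abs_diff divided by 180°, so the sketch below assumes that reading. It also uses the circular mean as a stand-in for the circular median the notes mention, and scipy's circular standard deviation; the real instrumentation may differ.

```python
import numpy as np
from scipy.stats import circmean, circstd

def wrap_deg(angle_rad):
    """Absolute angular deviation in degrees, wrapped to [0, 180]."""
    return np.degrees(np.abs(np.angle(np.exp(1j * angle_rad))))

def logbook_metrics(theta_safe_deg, theta_unsafe_deg):
    """Angular metrics from per-prompt sailing angles for the two test classes."""
    med_pos = circmean(np.radians(theta_safe_deg))
    med_neg = circmean(np.radians(theta_unsafe_deg))
    sep = wrap_deg(med_pos - med_neg)  # smallest angle between class centres
    return {
        "theta_separation_abs_diff": sep,
        "theta_polarity_score": sep / 180.0,
        "circ_std_theta_pos": np.degrees(circstd(np.radians(theta_safe_deg))),
        "circ_std_theta_neg": np.degrees(circstd(np.radians(theta_unsafe_deg))),
        "pos_theta_deviation_from_target": wrap_deg(med_pos),          # target 0°
        "neg_theta_deviation_from_target": wrap_deg(med_neg - np.pi),  # target 180°
    }
```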
A Few Ground Rules & Important Notes!
- NSFW is fine. We’re testing safety and danger. But keep it legal and free of identifying info
- By submitting, you agree we can release anonymized results and gently roast any especially cursed conceptual axes post-race. Unless deleted, submissions are stored indefinitely and will be published under a Creative Commons CC-BY-4.0 licence alongside the activation data. You can always edit or delete your responses via the Google Form. You can pull your hull at any time.
- Google stores submissions as UTF-8 and may apply standard Unicode normalisation. If you need raw code-point fidelity, `surround your text with back-ticks like this` and we’ll recover it verbatim.
- This is an exploratory map-making voyage! Scores may shift as the instrumentation evolves!
A follow-up post with results will be linked here soon.
Curious about the 140 safety & danger test prompts used in the races? View the prompt list here.