This is an interactive tool for exploring the internal "geometry" of a large language model (GPT-2). It investigates whether abstract concepts like "safety" and "danger" exist as consistent, measurable directions within the model's activation space.
You provide two opposing concepts. The application uses these to define a 2D plane (a "sail") within the model's high-dimensional space. It then projects a large set of test prompts (the "winds") onto this plane to measure two things:
This tool is the front end for a research experiment that found, after rigorous testing, no simple, statistically robust "safety compass" in this model layer. The result suggests that the model's representation of these concepts is more complex and abstract than human intuition would predict.
Define the next leg of your journey.
Analysis results will appear here. Define your concepts and launch an analysis.
The Polarity Score, Magnitude (r), and Orientation (θ) are descriptive metrics for your current design against our test set. They provide valuable intuition but are not automatically tested for statistical significance. The most promising "sailbots" found here are candidates for deeper, formal investigation.
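For intuition about Magnitude (r) and Orientation (θ), here is a minimal sketch of one way a prompt's activation vector could be projected onto a plane spanned by two concept vectors. It is an illustration under stated assumptions, not the tool's actual implementation: the Gram-Schmidt basis, the helper names (`plane_basis`, `polar_projection`), and the toy 4-dimensional vectors are all hypothetical. The Polarity Score is not defined in this text, so it is left out.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale a vector to unit length; a zero vector has no direction."""
    norm = math.sqrt(dot(u, u))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in u]


def plane_basis(concept_a, concept_b):
    """Build an orthonormal basis (e1, e2) for the plane spanned by two
    concept vectors using Gram-Schmidt (an assumed construction).

    Raises ValueError if the two vectors are collinear, because they
    then span a line rather than a plane.
    """
    e1 = normalize(concept_a)
    overlap = dot(concept_b, e1)
    # Remove the part of concept_b that already lies along e1.
    residual = [b - overlap * e for b, e in zip(concept_b, e1)]
    e2 = normalize(residual)
    return e1, e2


def polar_projection(vector, e1, e2):
    """Project a vector onto the plane and return (r, theta) in polar form."""
    x = dot(vector, e1)
    y = dot(vector, e2)
    return math.hypot(x, y), math.atan2(y, x)


# Toy stand-ins for activations; real GPT-2 hidden states are much larger.
safety = [1.0, 0.0, 0.0, 0.0]
danger = [1.0, 1.0, 0.0, 0.0]
prompt = [3.0, 4.0, 5.0, 0.0]

e1, e2 = plane_basis(safety, danger)
r, theta = polar_projection(prompt, e1, e2)
print(f"r = {r:.3f}, theta = {math.degrees(theta):.1f} degrees")
```

In this sketch, r is the length of the prompt's component that lies inside the plane, and θ is its angle measured from the first concept's direction. Any part of the prompt vector that lies outside the plane (the third coordinate in the toy example) is ignored by construction.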