Deep Humanities LLM Research

Getting into Interpretability

Craig Messner
JHU CDH, CS, DSAI

April 2026

The Real Power of LLMs

LLMs have been successful as chat assistants and agentic backbones, but the original excitement centered on their:

  • Multitask ability enabled by in-context learning

One approach: use LLMs to replace complicated NLP pipelines.

Example: extract cities that existed before 1500 from text

Classic NLP pipeline: Tokenization → POS Tagging → NER → Entity Linking → Knowledge Query
LLM approach: a single prompt
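
For instance, the entire pipeline above might collapse into one instruction. An illustrative prompt (the wording here is an example, not a tested recipe from this talk):

  "From the following passage, list every city that was founded before the
   year 1500. Respond with a comma-separated list of city names only.
   Passage: <your text here>"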

Interpretability

If we are interested in LLMs from a philosophical or media perspective, we want to understand their functioning.

  1. What features are represented, and how?
  2. What biases drawn from the training data persist downstream?
  3. ...and so forth

These questions belong to the ML field of interpretability, which splits into two broad approaches:

  • Behavioral
  • Mechanistic

Behavioral Interpretability

Focuses on discovering structure in the variance/invariance of output behavior under certain conditions.

In effect: probing using prompting

  • Internal: high-dimensional, dense feature representations
  • Output: low-dimensional projection onto the probability simplex

Behavioral methods offer a limited but accessible window into this internal space.

Mechanistic Interpretability

Seeks to "crack open" the black box of model representations.

  • Sparse autoencoders
    Trained over layerwise activations to isolate features (gradient)
  • Representation perturbation
    Targeted finetuning to discover feature connections (gradient)
  • LogitLens
    Projecting intermediate layers to vocabulary space (non-gradient)

What Will You Need?

Some Musts

  • 💻 Hardware: enough RAM for inference (and some patience)
  • Model Weights: an open model, often in GGUF format
  • 📊 Probing Data: datasets designed to test specific behaviors
  • >_ Shell Proficiency: a basic understanding of command-line operations

What Will You Need?

Good-to-Haves

  • 🔨 CUDA-compatible GPU (or equivalent): this becomes a must for gradient-based methods
  • 🐍 Knowledge of Python: you will need this at some point in your journey, but not for today

Practical Setup 1

Inference Without Python

You will need a way to perform inference with your chosen model.

  • LM Studio: lmstudio.ai
  • llama.cpp: github.com/ggerganov/llama.cpp
  • vLLM: docs.vllm.ai
  • Unsloth Studio (new): unsloth.ai

Today: We will use LM Studio to explore key considerations, then move to llama.cpp for a fuller behavioral interpretability demo.

Technical Interlude

Outside of performing inference directly in Python, the dominant mode of access is a web-served API, often structured in the OpenAI format, even when running locally.

You will have a chat interface, but digging in will require making requests to these locally-running API endpoints.
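
For example, once a local server is running, you can check what it exposes with a plain HTTP request. A sketch, assuming LM Studio's default local server and the standard OpenAI-format path:

# Sketch: list the models served by a locally running OpenAI-format endpoint
# (port 1234 is assumed to be LM Studio's default; adjust if your server differs)
curl http://localhost:1234/v1/models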

Practical Setup 2

Picking a Model

Assuming a straightforward laptop situation, you need a model that fits in RAM, with some layers potentially offloaded to the GPU.

Quantization is commonly used to reduce memory requirements.

You can grab models from LM Studio, but it is good to know:

  • Hugging Face: huggingface.co

Our focus: The recently-released Gemma 4

huggingface.co/unsloth/gemma-4-E4B-it-GGUF
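
If you would rather fetch the weights yourself than use LM Studio's downloader, the Hugging Face CLI can pull a single quantized file from that repository. A sketch, assuming the repo contains a Q4_K_M quantization (check its file list for the exact filenames):

# Sketch: download one quantization of the GGUF weights from Hugging Face
# (the Q4_K_M pattern is an assumption; pick whichever quantization fits your RAM)
huggingface-cli download unsloth/gemma-4-E4B-it-GGUF --include "*Q4_K_M*" --local-dir ./models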

Theoretical Interlude

Remember: the output distribution is a low-dimensional look into the model's feature space. But typical interactions reveal even less.

  • System prompt: governs behavior
  • Generation hyperparameters: shape the distribution

Effect of temperature on the output distribution:

  • Low temperature: deterministic
  • Medium temperature: balanced
  • High temperature: random

Output shaping and prompt re-running could also be happening behind the scenes, and the relationship between such backend changes and generation behavior is often proprietary and opaque.
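
Generation hyperparameters are usually exposed as request parameters. A sketch of nudging a chat completion toward determinism, assuming a local OpenAI-format server (such as the ones above) on port 1234:

# Sketch: a chat request pushed toward deterministic output
# ("temperature" is a standard OpenAI-format parameter; the local server is assumed)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Name a real-life city."}],
    "temperature": 0.1
  }'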

Moving to llama.cpp

What if we want a deeper view?

To start, we might want the log probabilities of candidate output tokens.

These are a prerequisite for methods like Fightin' Words comparisons across contrasting corpora.

When we request a completion from our llama.cpp server, we can also request this information.

CURLing Out Information

Start the llama.cpp server:

./llama-server -hf unsloth/gemma-4-E4B-it-GGUF --port 1234

Request a completion with log probabilities:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Replace the [MASK] in the following sentence with 
                    the name of a real-life city. I am a science-fiction 
                    writer from [MASK]"
      }
    ],
    "logprobs": true,
    "top_logprobs": 5
  }'
"logprobs": true
Return log probabilities
"top_logprobs": 5
Show top 5 candidates
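
The log probabilities come back inside the choices array. A sketch of inspecting the top candidates for the first generated token with jq, assuming the request body above is saved as request.json and the response follows the OpenAI-style logprobs layout (which llama.cpp mirrors; verify against your server's actual output):

# Sketch: save the response, then inspect the top-5 candidates for the
# first generated token (request.json holds the body shown above)
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json > response.json
jq '.choices[0].logprobs.content[0].top_logprobs' response.json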

Systemic Examination

Behavioral interpretability requires a systematic approach and multiple points of comparison.

Python is the most useful tool for this, but dedicated tools like the one I will show you next can, or could, also exist.
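
Even without Python, the single request above can be re-run systematically. A sketch in plain bash that loops the masked-city prompt over contrasting personas and saves each response for comparison (the persona list and output filenames are illustrative):

# Sketch: re-run the masked-city prompt across contrasting personas and
# save each response for later comparison (persona list is illustrative)
for persona in "science-fiction writer" "romance novelist" "war correspondent"; do
  curl -s http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"messages\": [
        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},
        {\"role\": \"user\", \"content\": \"Replace the [MASK] in the following sentence with the name of a real-life city. I am a $persona from [MASK]\"}
      ],
      \"logprobs\": true,
      \"top_logprobs\": 5
    }" > "response_${persona// /_}.json"
done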