Deep Humanities LLM Research

Getting into Interpretability

Craig Messner
JHU CDH, CS, DSAI

April 2026

The Real Power of LLMs

LLMs have been successful as chat assistants and agentic backbones, but the original excitement centered on their:

  • Multitask ability enabled by in-context learning

One approach: use LLMs to replace complicated NLP pipelines.

Example: extract cities that existed before 1500 from text

Classic NLP pipeline: Tokenization → POS Tagging → NER → Entity Linking → Knowledge Query
LLM approach: a single prompt
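
For instance, the entire pipeline above might collapse into one instruction. An illustrative prompt (the wording here is an example, not a tested recipe from this talk):

  "From the following passage, list every city that was founded before the
   year 1500. Respond with a comma-separated list of city names only.
   Passage: <your text here>"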

Interpretability

If we are interested in LLMs from a philosophical or media perspective, we want to understand their functioning.

  1. What features are represented, and how?
  2. What biases drawn from the training data persist downstream?
  3. ...and so forth

These questions belong to the ML field of interpretability, which splits into two broad approaches:

  • Behavioral
  • Mechanistic

Behavioral Interpretability

Focuses on discovering structure in the variance/invariance of output behavior under certain conditions.

In effect: probing using prompting

  • Internal: high-dimensional, dense feature representations
  • Output: low-dimensional projection onto the probability simplex

Behavioral methods offer a limited but accessible window into this internal space.

Mechanistic Interpretability

Seeks to "crack open" the black box of model representations.

  • Sparse autoencoders
    Trained over layerwise activations to isolate features (gradient)
  • Representation perturbation
    Targeted finetuning to discover feature connections (gradient)
  • LogitLens
    Projecting intermediate layers to vocabulary space (non-gradient)

What Will You Need?

Some Musts

  • 💻 Hardware: enough RAM for inference (and some patience)
  • Model Weights: an open model, often in GGUF format
  • 📊 Probing Data: datasets designed to test specific behaviors
  • >_ Shell Proficiency: a basic understanding of command-line operations

What Will You Need?

Good-to-Haves

  • 🔨 CUDA-compatible GPU (or equivalent): this becomes a must for gradient-based methods
  • 🐍 Knowledge of Python: you will need this at some point in your journey, but not for today

Practical Setup 1

Inference Without Python

You will need a way to perform inference with your chosen model.

  • LM Studio: lmstudio.ai
  • llama.cpp: github.com/ggerganov/llama.cpp
  • vLLM: docs.vllm.ai
  • Unsloth Studio (new): unsloth.ai

Today: We will use LM Studio to explore key considerations, then move to llama.cpp for a fuller behavioral interpretability demo.

Technical Interlude

Outside of performing inference directly in Python, the dominant mode of access is a web-served API, often structured in the OpenAI format, even when running locally.

You will have a chat interface, but digging in will require making requests to these locally-running API endpoints.
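
For example, once a local server is running, you can check what it exposes with a plain HTTP request. A sketch, assuming LM Studio's default local server and the standard OpenAI-format path:

# Sketch: list the models served by a locally running OpenAI-format endpoint
# (port 1234 is assumed to be LM Studio's default; adjust if your server differs)
curl http://localhost:1234/v1/models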

Practical Setup 2

Picking a Model

Assuming a straightforward laptop situation, you need a model that fits in RAM, with some layers potentially offloaded to the GPU.

Quantization is commonly used to reduce memory requirements.

You can grab models from LM Studio, but it is good to know:

  • Hugging Face: huggingface.co

Our focus: The recently-released Gemma 4

huggingface.co/unsloth/gemma-4-E4B-it-GGUF
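
If you would rather fetch the weights yourself than use LM Studio's downloader, the Hugging Face CLI can pull a single quantized file from that repository. A sketch, assuming the repo contains a Q4_K_M quantization (check its file list for the exact filenames):

# Sketch: download one quantization of the GGUF weights from Hugging Face
# (the Q4_K_M pattern is an assumption; pick whichever quantization fits your RAM)
huggingface-cli download unsloth/gemma-4-E4B-it-GGUF --include "*Q4_K_M*" --local-dir ./models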

Theoretical Interlude

Remember: the output distribution is a low-dimensional look into the model's feature space. But typical interactions reveal even less.

  • System prompt: governs behavior
  • Generation hyperparameters: shape the distribution

Effect of temperature on the output distribution:

  • Low temperature: deterministic
  • Medium temperature: balanced
  • High temperature: random

Output shaping and prompt re-running could also be happening behind the scenes, and the relationship between such backend changes and generation behavior is often proprietary and opaque.
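
Generation hyperparameters are usually exposed as request parameters. A sketch of nudging a chat completion toward determinism, assuming a local OpenAI-format server (such as the ones above) on port 1234:

# Sketch: a chat request pushed toward deterministic output
# ("temperature" is a standard OpenAI-format parameter; the local server is assumed)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Name a real-life city."}],
    "temperature": 0.1
  }'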

Moving to llama.cpp

What if we want a deeper view?

To start, we might want the log probabilities of candidate output tokens.

These are a prerequisite for methods like Fightin' Words comparisons across contrasting corpora.

When we request a completion from our llama.cpp server, we can also request this information.

CURLing Out Information

Start the llama.cpp server:

./llama-server -hf unsloth/gemma-4-E4B-it-GGUF --port 1234

Request a completion with log probabilities:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Replace the [MASK] in the following sentence with 
                    the name of a real-life city. I am a science-fiction 
                    writer from [MASK]"
      }
    ],
    "logprobs": true,
    "top_logprobs": 5
  }'
"logprobs": true
Return log probabilities
"top_logprobs": 5
Show top 5 candidates
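
The log probabilities come back inside the choices array. A sketch of inspecting the top candidates for the first generated token with jq, assuming the request body above is saved as request.json and the response follows the OpenAI-style logprobs layout (which llama.cpp mirrors; verify against your server's actual output):

# Sketch: save the response, then inspect the top-5 candidates for the
# first generated token (request.json holds the body shown above)
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json > response.json
jq '.choices[0].logprobs.content[0].top_logprobs' response.json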

Systemic Examination

Behavioral interpretability requires a systematic approach and multiple points of comparison.

Python is the most useful tool for this, but dedicated tools like the one I will show you next can, or could, also exist.
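
Even without Python, the single request above can be re-run systematically. A sketch in plain bash that loops the masked-city prompt over contrasting personas and saves each response for comparison (the persona list and output filenames are illustrative):

# Sketch: re-run the masked-city prompt across contrasting personas and
# save each response for later comparison (persona list is illustrative)
for persona in "science-fiction writer" "romance novelist" "war correspondent"; do
  curl -s http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"messages\": [
        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},
        {\"role\": \"user\", \"content\": \"Replace the [MASK] in the following sentence with the name of a real-life city. I am a $persona from [MASK]\"}
      ],
      \"logprobs\": true,
      \"top_logprobs\": 5
    }" > "response_${persona// /_}.json"
done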