# Awesome LLM Interpretability ![](https://github.com/your-username/awesome-llm-interpretability/workflows/Awesome%20Bot/badge.svg)
A curated list of amazingly awesome tools, papers, articles, and communities focused on Large Language Model (LLM) Interpretability.
## Table of Contents
- [Awesome LLM Interpretability](#awesome-llm-interpretability)
- [Tools](#llm-interpretability-tools)
- [Papers](#llm-interpretability-papers)
- [Articles](#llm-interpretability-articles)
- [Groups](#llm-interpretability-groups)
- [Survey Papers](#llm-survey-papers)
### LLM Interpretability Tools
*Tools and libraries for LLM interpretability and analysis.*
1. [The Learning Interpretability Tool](https://pair-code.github.io/lit/) - An open-source platform for visualizing and understanding ML models; supports classification, regression, and generative models (text & image data), and includes saliency methods, attention attribution, counterfactuals, TCAV, embedding visualizations, and Facets-style data analysis.
1. [Comgra](https://github.com/FlorianDietz/comgra) - Helps you analyze and debug neural networks in PyTorch.
1. [Pythia](https://github.com/EleutherAI/pythia) - Interpretability analysis to understand how knowledge develops and evolves during training in autoregressive transformers.
1. [Phoenix](https://github.com/Arize-ai/phoenix) - AI Observability & Evaluation - Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.
1. [Floom](https://github.com/FloomAI/Floom) - AI gateway and marketplace for developers; enables streamlined integration of AI features into products.
1. [Automated Interpretability](https://github.com/openai/automated-interpretability) - Code for automatically generating, simulating, and scoring explanations of neuron behavior.
1. [Fmr.ai](https://github.com/BizarreCake/fmr.ai) - AI interpretability and explainability platform.
1. [Attention Analysis](https://github.com/clarkkev/attention-analysis) - Analyzing attention maps from BERT transformer.
1. [SpellGPT](https://github.com/mwatkins1970/SpellGPT) - Explores GPT-3's ability to spell its own token strings.
1. [SuperICL](https://github.com/JetRunner/SuperICL) - Super In-Context Learning code which allows black-box LLMs to work with locally fine-tuned smaller models.
1. [Git Re-Basin](https://github.com/samuela/git-re-basin) - Code release for "Git Re-Basin: Merging Models modulo Permutation Symmetries".
1. [Functionary](https://github.com/MeetKai/functionary) - Chat language model that can interpret and execute functions/plugins.
1. [Sparse Autoencoder](https://github.com/ai-safety-foundation/sparse_autoencoder) - Sparse Autoencoder for Mechanistic Interpretability.
1. [Rome](https://github.com/kmeng01/rome) - Locating and editing factual associations in GPT.
1. [Inseq](https://github.com/inseq-team/inseq) - Interpretability for sequence generation models.
1. [Neuron Viewer](https://openaipublic.blob.core.windows.net/neuron-explainer/neuron-viewer/index.html) - Tool for viewing neuron activations and explanations.
1. [LLM Visualization](https://bbycroft.net/llm) - Visualizes the internals of LLMs at a low level.
1. [Vanna](https://github.com/vanna-ai/vanna) - Abstractions for using RAG to generate SQL with any LLM.
1. [Copy Suppression](https://copy-suppression.streamlit.app/) - Designed to help explore different prompts for GPT-2 Small, as part of a research project regarding copy-suppression in LLMs.
1. [TransformerViz](https://transformervis.github.io/transformervis/) - Interactive tool for visualizing transformer models through their latent space.
1. [TransformerLens](https://github.com/neelnanda-io/TransformerLens) - A Library for Mechanistic Interpretability of Generative Language Models (a minimal usage sketch follows this list).
1. [Awesome-Attention-Heads](https://github.com/IAAR-Shanghai/Awesome-Attention-Heads) - A carefully compiled list summarizing the diverse functions of attention heads.
1. [ecco](https://github.com/jalammar/ecco) - A Python library for exploring and explaining Natural Language Processing models using interactive visualizations.
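
As an illustration of how tools like TransformerLens expose model internals, here is a minimal sketch, assuming TransformerLens is installed (`pip install transformer-lens`); the prompt text and the choice of GPT-2 small are arbitrary.

```python
# Minimal TransformerLens sketch: load GPT-2 small, cache activations, and
# inspect the model's next-token prediction plus one layer's attention pattern.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small with hooks attached

tokens = model.to_tokens("The Eiffel Tower is in the city of")
with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)

next_id = logits[0, -1].argmax().item()            # top next-token prediction
print("Predicted next token:", model.tokenizer.decode([next_id]))

# Layer-0 attention pattern: [batch, n_heads, query_pos, key_pos]
print("Layer-0 attention shape:", tuple(cache["pattern", 0].shape))
```

The `cache` object holds the intermediate activations (residual stream, attention patterns, MLP outputs), which is the raw material that much of the circuit-analysis work listed below builds on.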
---
### LLM Interpretability Papers
*Academic and industry papers on LLM interpretability.*
1. [Interpretability Illusions in the Generalization of Simplified Models](https://arxiv.org/abs/2312.03656) - Shows how interpretability methods based on simplified models (e.g., linear probes) can be prone to generalization illusions.
1. [Self-Influence Guided Data Reweighting for Language Model Pre-training](https://arxiv.org/abs/2311.00913) - An application of training data attribution methods to re-weight training data and improve performance.
1. [Data Similarity is Not Enough to Explain Language Model Performance](https://aclanthology.org/2023.emnlp-main.695/) - Discusses the limits of embedding-based data similarity for explaining effective data selection.
1. [Post Hoc Explanations of Language Models Can Improve Language Models](https://arxiv.org/pdf/2305.11426.pdf) - Evaluates whether language-model-generated explanations can also improve model quality.
1. [Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models](https://arxiv.org/abs/2301.04213), [Tweet Summary](https://twitter.com/peterbhase/status/1613573825932697600) (NeurIPS 2023 Spotlight) - Highlights the limits of Causal Tracing: how a fact is stored in an LLM can be changed by editing weights in a different location than where Causal Tracing suggests.
1. [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://arxiv.org/abs/2305.01610) - Explores the representation of high-level human-interpretable features within neuron activations of large language models (LLMs).
1. [Copy Suppression: Comprehensively Understanding an Attention Head](https://arxiv.org/abs/2310.04625) - Investigates a specific attention head in GPT-2 Small, revealing its primary role in copy suppression.
1. [Linear Representations of Sentiment in Large Language Models](https://arxiv.org/abs/2310.15154) - Shows how sentiment is represented in Large Language Models (LLMs), finding that sentiment is linearly represented in these models.
1. [Emergent world representations: Exploring a sequence model trained on a synthetic task](https://arxiv.org/abs/2210.13382) - Explores emergent internal representations in a GPT variant trained to predict legal moves in the board game Othello.
1. [Towards Automated Circuit Discovery for Mechanistic Interpretability](https://arxiv.org/abs/2304.14997) - Introduces the Automatic Circuit Discovery (ACDC) algorithm for identifying important units in neural networks.
1. [A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations](https://arxiv.org/abs/2302.03025) - Examines small neural networks to understand how they learn group compositions, using representation theory.
1. [Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias](https://arxiv.org/abs/2004.12265) - Causal mediation analysis as a method for interpreting neural models in natural language processing.
1. [The Quantization Model of Neural Scaling](https://arxiv.org/abs/2303.13506) - Proposes the Quantization Model for explaining neural scaling laws in neural networks.
1. [Discovering Latent Knowledge in Language Models Without Supervision](https://arxiv.org/abs/2212.03827) - Presents a method for extracting accurate answers to yes-no questions from language models' internal activations without supervision.
1. [How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model](https://arxiv.org/abs/2305.00586) - Analyzes mathematical capabilities of GPT-2 Small, focusing on its ability to perform the 'greater-than' operation.
1. [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning](https://transformer-circuits.pub/2023/monosemantic-features/index.html) - Using a sparse autoencoder to decompose the activations of a one-layer transformer into interpretable, monosemantic features.
1. [Language models can explain neurons in language models](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html) - Explores how language models like GPT-4 can be used to explain the functioning of neurons within similar models.
1. [Emergent Linear Representations in World Models of Self-Supervised Sequence Models](https://arxiv.org/abs/2309.00941) - Linear representations in a world model of Othello-playing sequence models.
1. ["Toward a Mechanistic Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model"](https://openreview.net/forum?id=VJEcAnFPqC) - Explores stepwise inference in autoregressive language models using a synthetic task based on navigating directed acyclic graphs.
1. ["Successor Heads: Recurring, Interpretable Attention Heads In The Wild"](https://openreview.net/forum?id=kvcbV8KQsi) - Introduces 'successor heads,' attention heads that increment tokens with a natural ordering, such as numbers and days, in LLM’s.
1. ["Large Language Models Are Not Robust Multiple Choice Selectors"](https://openreview.net/forum?id=shr9PXz7T0) - Analyzes the bias and robustness of LLMs in multiple-choice questions, revealing their vulnerability to option position changes due to inherent "selection bias”.
1. ["Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory"](https://openreview.net/forum?id=4bSQ3lsfEV) - Presents a novel approach to understanding neural networks by examining feature complexity through category theory.
1. ["Let's Verify Step by Step"](https://openreview.net/forum?id=v8L0pN6EOi) - Focuses on improving the reliability of LLMs in multi-step reasoning tasks using step-level human feedback.
1. ["Interpretability Illusions in the Generalization of Simplified Models"](https://openreview.net/forum?id=v675Iyu0ta) - Examines the limitations of simplified representations (like SVD) used to interpret deep learning systems, especially in out-of-distribution scenarios.
1. ["The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models"](https://openreview.net/forum?id=SQGUDc9tC8) - Presents a novel approach for identifying and mitigating social biases in language models, introducing the concept of 'Social Bias Neurons'.
1. ["Interpreting the Inner Mechanisms of Large Language Models in Mathematical Addition"](https://openreview.net/forum?id=VpCqrMMGVm) - Investigates how LLMs perform the task of mathematical addition.
1. ["Measuring Feature Sparsity in Language Models"](https://openreview.net/forum?id=SznHfMwmjG) - Develops metrics to evaluate the success of sparse coding techniques in language model activations.
1. [Toy Models of Superposition](https://transformer-circuits.pub/2022/toy_model/index.html) - Investigates how models represent more features than dimensions, especially when features are sparse.
1. [Spine: Sparse interpretable neural embeddings](https://arxiv.org/abs/1711.08792) - Presents SPINE, a method transforming dense word embeddings into sparse, interpretable ones using denoising autoencoders.
1. [Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors](https://arxiv.org/abs/2103.15949) - Introduces a novel method for visualizing transformer networks using dictionary learning.
1. [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373) - Introduces Pythia, a toolset designed for analyzing the training and scaling behaviors of LLMs.
1. [On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron](https://zuva.ai/science/interpretability-and-feature-representations/) - Critically examines the effectiveness of the "Sentiment Neuron".
1. [Engineering monosemanticity in toy models](https://arxiv.org/abs/2211.09169) - Explores engineering monosemanticity in neural networks, where individual neurons correspond to distinct features.
1. [Polysemanticity and capacity in neural networks](https://arxiv.org/abs/2210.01892) - Investigates polysemanticity in neural networks, where individual neurons represent multiple features.
1. [An Overview of Early Vision in InceptionV1](https://distill.pub/2020/circuits/early-vision/) - A comprehensive exploration of the initial five layers of the InceptionV1 neural network, focusing on early vision.
1. [Visualizing and measuring the geometry of BERT](https://arxiv.org/abs/1906.02715) - Delves into BERT's internal representation of linguistic information, focusing on both syntactic and semantic aspects.
1. [Neurons in Large Language Models: Dead, N-gram, Positional](https://arxiv.org/abs/2309.04827) - An analysis of neurons in large language models, focusing on the OPT family.
1. [Can Large Language Models Explain Themselves?](https://arxiv.org/abs/2310.11207) - Evaluates the effectiveness of self-explanations generated by LLMs in sentiment analysis tasks.
1. [Interpretability in the Wild: GPT-2 small (arXiv)](https://arxiv.org/abs/2211.00593) - Provides a mechanistic explanation of how GPT-2 small performs indirect object identification (IOI) in natural language processing.
1. [Sparse Autoencoders Find Highly Interpretable Features in Language Models](https://arxiv.org/abs/2309.08600) - Explores the use of sparse autoencoders to extract more interpretable, less polysemantic features from LLMs (a minimal sparse-autoencoder sketch follows this list).
1. [Emergent and Predictable Memorization in Large Language Models](https://arxiv.org/abs/2304.11158) - Investigates whether memorization of specific training sequences can be predicted from smaller models and partially trained checkpoints.
1. [Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars](https://arxiv.org/abs/2312.01429) - Demonstrates that focusing only on specific parts like attention heads or weight matrices in Transformers can lead to misleading interpretability claims.
1. [The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets](https://arxiv.org/abs/2310.06824) - This paper investigates the representation of truth in Large Language Models (LLMs) using true/false datasets.
1. [Interpretability at Scale: Identifying Causal Mechanisms in Alpaca](https://arxiv.org/abs/2305.08809) - This study presents Boundless Distributed Alignment Search (Boundless DAS), an advanced method for interpreting LLMs like Alpaca.
1. [Representation Engineering: A Top-Down Approach to AI Transparency](https://arxiv.org/abs/2310.01405) - Introduces Representation Engineering (RepE), a novel approach for enhancing AI transparency, focusing on high-level representations rather than neurons or circuits.
1. [Explaining black box text modules in natural language with language models](https://arxiv.org/abs/2305.09863) - Natural language explanations for LLM attention heads, evaluated using synthetic text.
1. [N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models](https://arxiv.org/abs/2304.12918) - Explains each LLM neuron as a graph.
1. [Augmenting Interpretable Models with LLMs during Training](https://arxiv.org/abs/2209.11799) - Uses LLMs to build interpretable classifiers of text data.
1. [ChainPoll: A High Efficacy Method for LLM Hallucination Detection](https://www.rungalileo.io/blog/chainpoll) - ChainPoll, a novel hallucination detection methodology that substantially outperforms existing alternatives, and RealHall, a carefully curated suite of benchmark datasets for evaluating hallucination detection metrics proposed in recent literature.
1. [A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task](https://arxiv.org/abs/2402.11917) - Identifies backward chaining circuits in a transformer trained to perform pathfinding in a tree.
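
Several of the entries above (Towards Monosemanticity; Sparse Autoencoders Find Highly Interpretable Features in Language Models; Measuring Feature Sparsity in Language Models) share the same basic recipe: train an overcomplete autoencoder with a sparsity penalty on cached model activations, then treat the learned dictionary directions as candidate features. The following is a minimal PyTorch sketch of that recipe, with illustrative dimensions and coefficients rather than any paper's exact setup.

```python
# Minimal sparse-autoencoder sketch (illustrative, not any paper's exact setup):
# learn an overcomplete, sparse dictionary over residual-stream activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)  # d_hidden >> d_model (overcomplete)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))       # non-negative, ideally sparse codes
        reconstruction = self.decoder(features)
        return reconstruction, features

d_model, d_hidden, l1_coeff = 768, 8 * 768, 1e-3      # illustrative hyperparameters
sae = SparseAutoencoder(d_model, d_hidden)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(4096, d_model)              # stand-in for cached LLM activations

optimizer.zero_grad()
recon, feats = sae(activations)
loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
optimizer.step()
```

In practice the activations come from a hook on a specific layer of the model rather than random data, and decoder columns are typically normalized; both details are omitted here for brevity.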
---
### LLM Interpretability Articles
*Insightful articles and blog posts on LLM interpretability.*
1. [Do Machine Learning Models Memorize or Generalize?](https://pair.withgoogle.com/explorables/grokking/) - An interactive visualization exploring the phenomenon known as grokking (VISxAI hall of fame).
1. [What Have Language Models Learned?](https://pair.withgoogle.com/explorables/fill-in-the-blank/) - An interactive visualization for understanding how large language models work and the nature of their biases (VISxAI hall of fame).
1. [A New Approach to Computation Reimagines Artificial Intelligence](https://www.quantamagazine.org/a-new-approach-to-computation-reimagines-artificial-intelligence-20230413/?mc_cid=ad9a93c472&mc_eid=506130a407) - Discusses hyperdimensional computing, a novel method involving hyperdimensional vectors (hypervectors) for more efficient, transparent, and robust artificial intelligence.
1. [Interpreting GPT: the logit lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) - Explores how the logit lens reveals a gradual convergence of GPT's probabilistic predictions across its layers, from initial nonsensical or shallow guesses to more refined predictions (a short logit-lens sketch follows this list).
1. [A Mechanistic Interpretability Analysis of Grokking](https://www.lesswrong.com/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking) - Explores the phenomenon of 'grokking' in deep learning, where models suddenly shift from memorization to generalization during training.
1. [200 Concrete Open Problems in Mechanistic Interpretability](https://www.lesswrong.com/s/yivyHaCAmMJ3CqSyj) - Series of posts discussing open research problems in the field of Mechanistic Interpretability (MI), which focuses on reverse-engineering neural networks.
1. [Evaluating LLMs is a minefield](https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/) - Challenges in assessing the performance and biases of large language models (LLMs) like GPT.
1. [Attribution Patching: Activation Patching At Industrial Scale](https://www.neelnanda.io/mechanistic-interpretability/attribution-patching) - Method that uses gradients for a linear approximation of activation patching in neural networks.
1. [Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]](https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing) - Introduces causal scrubbing, a method for evaluating the quality of mechanistic interpretations in neural networks.
1. [A circuit for Python docstrings in a 4-layer attention-only transformer](https://www.lesswrong.com/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only) - Examines a specific circuit within a 4-layer attention-only transformer responsible for generating Python docstrings.
1. [Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks](https://arxiv.org/abs/2207.13243) - Survey on mechanistic interpretability.
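
To make the logit-lens idea above concrete, here is a hedged sketch using Hugging Face's GPT-2 (assumes the `transformers` and `torch` packages; the prompt is arbitrary). The core move is to apply the model's final layer norm and unembedding matrix to each layer's residual stream and read off an "early" next-token prediction.

```python
# Logit-lens sketch for GPT-2: project each layer's hidden state through the
# final layer norm and unembedding to see how the prediction evolves by layer.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

for layer, hidden in enumerate(outputs.hidden_states):  # embeddings + each block's output
    last_position = hidden[0, -1]                        # residual stream at the final token
    logits = model.lm_head(model.transformer.ln_f(last_position))
    top_token = tokenizer.decode(logits.argmax().item())
    print(f"layer {layer:2d}: top prediction = {top_token!r}")
```

Later layers should converge toward the model's final answer, which is the qualitative behavior the article describes.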
---
### LLM Interpretability Groups
*Communities and groups dedicated to LLM interpretability.*
1. [PAIR](https://pair.withgoogle.com) - Google's People + AI Research team, which builds [open-source tools](https://github.com/pair-code), publishes [interactive explorable visualizations](https://pair.withgoogle.com/explorables), and [researches interpretability methods](https://pair.withgoogle.com/research/).
1. [Alignment Lab AI](https://discord.com/invite/ad27GQgc7K) - Group of researchers focusing on AI alignment.
1. [Nous Research](https://discord.com/invite/AVB9jkHZ) - Research group discussing various topics on interpretability.
1. [EleutherAI](https://discord.com/invite/4pEBpVTN) - Non-profit AI research lab that focuses on interpretability and alignment of large models.
---
### LLM Survey Papers
*Survey papers on LLMs.*
1. [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223) - Provides an up-to-date review of the literature on LLMs; a useful resource for both researchers and engineers.
## Contributing and Collaborating
Please see [CONTRIBUTING](https://github.com/JShollaj/awesome-llm-interpretability/blob/master/CONTRIBUTING.md) and [CODE-OF-CONDUCT](https://github.com/JShollaj/awesome-llm-interpretability/blob/master/CODE-OF-CONDUCT.md) for details.
", Assign "at most 3 tags" to the expected json: {"id":"6342","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine "},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"