<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>ICML 2024 Mechanistic Interpretability Workshop</title>
<!-- Setup all meta-information like description and titles -->
<meta name="description" content="The Workshop on Mechanistic Interpretability seeks to explore and drive discussions on the latest advances in interpretable machine learning models. We invite submissions of
research, technological breakthroughs and demonstrations, as well as proposals for technical
discussions, to be held during the workshop." />
<meta name="keywords" content="ICML, Mechanistic Interpretability, Workshop" />
<meta name="author" content="ICML 2024 Mechanistic Interpretability" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<!-- Load fonts Gothic A1 -->
<link href="https://fonts.googleapis.com/css?family=Gothic+A1:400,700&display=swap" rel="stylesheet" />
<!-- Load style.css -->
<link rel="stylesheet" href="style.css" />
</head>
<body>
<!-- Header with a background color filling approx. 300px and that has a title of the workshop and the date as a byline -->
<header>
<h1 class="fade-in">Mechanistic Interpretability Workshop 2024</h1>
<h2 class="fade-in" style="color: white;">ICML 2024 In-Person Workshop, Vienna</h2>
<h2 class="fade-in" style="color: white;">July 27, 2024</h2>
<p class="fade-in"></p>
</header>
<!-- Content on white background with sections Overview, Schedule, Speakers and Organizing Committee -->
<main class="fade-in">
<section>
<p>This is a one-day workshop on mechanistic interpretability at ICML, held on July 27 in room Lehar 1 at the
ICML venue, the Messe Wien Exhibition Congress Center, Vienna, Austria.
</p>
</section>
<section>
<h2 id="prizes">Top Papers Prize</h2>
<p>These are our five prize-winning papers. You can see all 93 accepted papers, showcasing the latest
mechanistic interpretability research, <a
href="https://openreview.net/group?id=ICML.cc/2024/Workshop/MI&referrer=%5BHomepage%5D(%2F)#tab-accept-oral">here</a>!</p>
<ol>
<li><strong>First prize ($1000):</strong> <a href="https://openreview.net/forum?id=KXuYjuBzKo">The Geometry of
Categorical and Hierarchical Concepts in Large Language Models</a></li>
<li><strong>Second prize ($500):</strong> <a href="https://openreview.net/forum?id=P7MW0FahEq">InversionView: A
General-Purpose Method for Reading Information from Neural Activations</a></li>
<li><strong>Third prize ($250):</strong> <a href="https://openreview.net/forum?id=ibSNv9cldu">Hypothesis Testing
the
Circuit Hypothesis in LLMs</a></li>
<li><strong>Honorable mention:</strong> <a href="https://openreview.net/forum?id=pJs3ZiKBM5">Missed Causes and
Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks</a></li>
<li><strong>Honorable mention:</strong> <a href="https://openreview.net/forum?id=qzsDKwGJyB">Measuring Progress
in
Dictionary Learning for Language Model Interpretability with Board Game Models</a></li>
</ol>
</section>
<section>
<h2 id="schedule">Schedule</h2>
<table>
<colgroup>
<col style="width: 150px;">
<col style="width: 350px;">
</colgroup>
<tr>
<th>Time</th>
<th>Event</th>
</tr>
<tr>
<td>09:00 - 09:30</td>
<td>Welcome + Talk 1: David Bau</td>
</tr>
<tr>
<td>09:30 - 10:30</td>
<td><a href="posters.html">Oral Presentation</a></td>
</tr>
<tr>
<td>10:30 - 11:00</td>
<td><a href="posters.html#posters-1">Spotlights 1</a></td>
</tr>
<tr>
<td>11:00 - 12:00</td>
<td><a href="posters.html#posters-1">Poster Session 1 </a></td>
</tr>
<tr>
<td>12:00 - 13:00</td>
<td>Panel Discussion</td>
</tr>
<tr>
<td>13:00 - 14:00</td>
<td>Lunch</td>
</tr>
<tr>
<td>14:00 - 14:30</td>
<td><a href="posters.html#posters-2">Spotlights 2</a></td>
</tr>
<tr>
<td>14:30 - 15:30</td>
<td><a href="posters.html#posters-2">Poster Session 2</a></td>
</tr>
<tr>
<td>15:30 - 16:00</td>
<td>Coffee Break</td>
</tr>
<tr>
<td>16:00 - 16:30</td>
<td>Talk 2: Asma Ghandeharioun</td>
</tr>
<tr>
<td>16:30 - 17:00</td>
<td>Talk 3: Chris Olah (remote)</td>
</tr>
<tr>
<td>18:30 - late</td>
<td>Invite-only evening social (<a
href="https://docs.google.com/forms/d/e/1FAIpQLSf6EHr8JQu8NHNG1XNYoxfqyjeg89qSVYtpkg_gYbXQ8nSYJg/viewform">apply
here</a>)</td>
</tr>
</table>
</section>
<section>
<h2>Introduction</h2>
<p>Even though ever larger and more capable machine learning models are being deployed in real-world settings, we
still know concerningly little about how they implement their many impressive capabilities. This in turn can
make it difficult to rely on these models in high-stakes situations, or to reason about or address cases where
said models exhibit undesirable behavior. </p>
<p>One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse
engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining
the weights and activations of neural networks to identify circuits [<a
href="https://distill.pub/2020/circuits">Cammarata et al., 2020</a>, <a
href="https://transformer-circuits.pub/2021/framework/index.html">Elhage et al., 2021</a>] that implement
particular behaviors.</p>
<p>Though this is an ambitious goal, in the past two years, mechanistic interpretability has seen rapid progress.
For example, researchers have used newly developed mechanistic interpretability techniques to recover how large
language models implement particular behaviors [for example, <a
href="https://proceedings.ICLR.cc/paper/2021/hash/4f5c422f4d49a5a807eda27434231040-Abstract.html">Geiger et
al., 2021</a>, <a href="https://arxiv.org/abs/2211.00593">Wang et al., 2022</a>, <a
href="https://arxiv.org/abs/2209.11895">Olsson et al.,
2022</a>, <a href="https://arxiv.org/abs/2304.14767">Geva et al., 2023</a>, <a
href="https://arxiv.org/abs/2305.00586">Hanna et al., 2023</a>, <a href="https://arxiv.org/pdf/2310.13121">
Quirke and Barez, 2024</a>], illuminated various puzzles such as double descent [<a
href="https://transformer-circuits.pub/2023/toy-double-descent/index.html">Henighan et al., 2023</a>], scaling
laws [<a href="https://arxiv.org/abs/2303.13506">Michaud et al., 2023</a>], and grokking [<a
href="https://arxiv.org/abs/2301.05217">Nanda et al., 2023</a>], and explored phenomena such as superposition
[<a href="https://transformer-circuits.pub/2022/toy_model/index.html">Elhage et al., 2022</a>, <a
href="https://arxiv.org/abs/2305.01610">Gurnee et al., 2023</a>, <a
href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Bricken et al., 2023</a>] that
may be fundamental principles of how models work. Despite this progress, much mechanistic
interpretability work still occurs in relatively disparate circles; there are separate threads
of work in industry and academia that each use their own (slightly different) notation and terminology.</p>
<p>This workshop aims to bring together researchers from both industry and academia to discuss recent progress,
address the challenges faced by this field, and clarify future goals, use cases, and agendas. We believe that
this workshop can help foster a rich dialogue between researchers with a wide variety of backgrounds and ideas,
which in turn will help researchers develop a deeper understanding of how machine learning systems work in
practice.
</p>
</section>
<section>
<h2>Attending</h2>
<p>We welcome attendees from all backgrounds, regardless of your prior research experience or whether or not you
have work published at this workshop.
Note that while you <b>do not</b> need to be registered for the ICML main conference to attend this workshop,
you <b>do</b> need to be
<a href="https://icml.cc/Register">registered for the ICML workshop track</a>.
No further registration (e.g. with this specific workshop) is needed; just turn up on the day!
</p>
</section>
<section>
<h2>Speakers</h2>
<div class="speakers">
<div class="speaker">
<img src="img/chrisolah.jpeg" alt="Speaker" />
<div>
<h3><a href="https://colah.github.io/about.html">Chris Olah</a></h3>
<p>Anthropic</p>
</div>
</div>
<div class="speaker">
<img src="img/davidbau.jpeg" alt="Speaker" />
<div>
<h3><a href="https://www.khoury.northeastern.edu/people/david-bau/">David Bau</a></h3>
<p>Northeastern University</p>
</div>
</div>
<div class="speaker">
<img src="img/asmaghandeharioun.png" alt="Speaker" />
<div>
<h3><a href="https://asmadotgh.github.io/">Asma Ghandeharioun</a></h3>
<p>Google DeepMind</p>
</div>
</div>
</div>
</section>
<section>
<h2>Panelists</h2>
<div class="speakers">
<div class="speaker">
<img src="img/naomisaphra.jpeg" alt="Speaker" />
<div>
<h3><a href="https://nsaphra.net/">Naomi Saphra</a></h3>
<p>Harvard University</p>
</div>
</div>
<div class="speaker">
<img src="img/atticusgeiger.jpeg" alt="Speaker" />
<div>
<h3><a href="https://atticusg.github.io/">Atticus Geiger</a></h3>
<p>Pr(Ai)<sup>2</sup>R Group</p>
</div>
</div>
<div class="speaker">
<img src="img/stellabiderman.jpeg" alt="Speaker" />
<div>
<h3><a href="https://www.stellabiderman.com">Stella Biderman</a></h3>
<p>EleutherAI</p>
</div>
</div>
<div class="speaker">
<img src="img/arthurconmy.jpeg" alt="Speaker" />
<div>
<h3><a href="https://arthurconmy.github.io/about/">Arthur Conmy</a></h3>
<p>Google DeepMind</p>
</div>
</div>
</div>
</section>
<section>
<h2>Call for Papers</h2>
<p>We are inviting submissions of short (4 pages) and long (8 pages) papers outlining new research, with a
deadline of May 29, 2024. We welcome papers on any of the following topics (see the Potential Topics of
Discussion section for more details and example papers), or anything else that the authors convincingly argue
moves the field of mechanistic interpretability forward.</p>
<ul>
<li><b>Techniques:</b> Work inventing new mechanistic interpretability techniques, evaluating the quality of
existing techniques, or proposing benchmarks and tools for future evaluations.</li>
<li><b>Exploratory analysis:</b> Qualitative, biologically inspired analysis of components, circuits, or phenomena
inside neural networks.</li>
<li><b>Decoding superposition:</b> Work that deepens our understanding of the hypothesis that model activations
are represented in superposition, and explores techniques to decode superposed activations, such as sparse
autoencoders (see the sketch after this list).</li>
<li><b>Applications of interpretability:</b> Can we study jailbreaks, hallucinations, or other interesting
real-world phenomena of LLMs? Where does mechanistic interpretability provide value in a fair comparison with
baselines such as linear probing or finetuning?</li>
<li><b>Scaling and automation:</b> How can we reduce the dependence of mechanistic interpretability on slow,
subjective and expensive human labor? How much do our current techniques scale?</li>
<li><b>Basic science:</b> There are many fundamental mysteries of model internals, and we welcome work that can
shed any light on them: Are activations sparse linear combinations of features? Are features universal? Are
circuits and features even the right way to think about models? </li>
</ul>
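<p>For concreteness, here is a minimal, illustrative sketch of the sparse autoencoder idea mentioned in the
Decoding superposition topic above, in the spirit of <a
href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Bricken et al., 2023</a>: an
overcomplete linear encoder with a ReLU, trained with a reconstruction loss plus an L1 sparsity penalty. All
dimensions and coefficients below are placeholders, not recommendations from any particular paper.</p>
<pre><code># Minimal sparse autoencoder sketch (PyTorch); hyperparameters are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)  # overcomplete: d_hidden larger than d_model
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f        # reconstruction and features

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)          # stand-in for cached model activations
x_hat, f = sae(acts)
# Reconstruction loss plus L1 penalty encouraging sparse feature activations.
loss = ((x_hat - acts) ** 2).mean() + 1e-3 * f.abs().sum(dim=-1).mean()
loss.backward()
</code></pre>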
<p>We also welcome work that furthers the field of mechanistic interpretability in less standard ways, such as by
providing rigorous negative results, or open source software (e.g. <a
href="https://github.com/neelnanda-io/TransformerLens">TransformerLens</a>, <a
href="https://github.com/stanfordnlp/pyvene/tree/main">pyvene</a>, <a
href="https://github.com/ndif-team/nnsight">nnsight</a> or <a
href="https://github.com/google-deepmind/penzai">Penzai</a>), models or datasets that may be of value to the
community (e.g. <a href="https://arxiv.org/abs/2304.01373">Pythia</a>, <a
href="https://arxiv.org/abs/2106.16163">MultiBERTs</a> or <a
href="https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream">open
source sparse autoencoders</a>), coding tutorials (e.g. <a
href="https://arena3-chapter1-transformer-interp.streamlit.app/">the ARENA materials</a>), <a
href="https://distill.pub/2017/research-debt/">distillations of key and poorly explained concepts</a> (e.g. <a
href="https://transformer-circuits.pub/2021/framework/index.html">Elhage et al., 2021</a>), or position pieces
discussing future use cases of mechanistic interpretability or that bring clarification to complex topics such
as “what is a feature?”. </p>
<h3>Reviewing and Submission Policy</h3>
<p>All submissions must be made <a href="https://openreview.net/group?id=ICML.cc/2024/Workshop/MI">via
OpenReview</a>. Please use the <a href="https://media.icml.cc/Conferences/ICML2024/Styles/icml2024.zip">ICML
2024 LaTeX Template</a> for all submissions.</p>
<p>Submissions are non-archival. We are happy to receive submissions that are also undergoing peer review
elsewhere at the time of submission, but we will not accept submissions that have already been previously
published or accepted for publication at peer-reviewed conferences or journals. Submission is permitted for
papers presented or to be presented at other non-archival venues (e.g. other workshops).</p>
<p>Reviewing for our workshop is double blind: reviewers will not know the authors’ identity (and vice versa).
Both short (max 4 page) and long (max 8 page) papers allow unlimited pages for references and appendices, but
reviewers are not expected to read these.
Evaluation of submissions will be based on originality and novelty, technical strength, and relevance to
the workshop topics. Notifications of acceptance will be sent to applicants by email.</p>
<h3>Prizes</h3>
<ul>
<li>Best paper prize: $1000</li>
<li>Second place: $500</li>
<li>Third place: $250</li>
<li>Honorable mentions: Up to 5, no cash prize</li>
</ul>
</section>
<section>
<h2>Important Dates</h2>
<ul>
<li><a href="https://openreview.net/group?id=ICML.cc/2024/Workshop/MI">Submission open on OpenReview</a>: May
12, 2024</li>
<li>Submission Deadline: May 29, 2024</li>
<li>Notification of Acceptance: June 23, 2024</li>
<li>Camera-ready Deadline: July 14, 2024</li>
<li>Workshop Date: July 27, 2024</li>
</ul>
<p>All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).</p>
<p><b>Note:</b> You will require an OpenReview account to submit. If you do not have an institutional email (e.g.
a .edu address), OpenReview moderation can take up to 2 weeks. <b>Please make an account by May 14th at the
latest if this applies to you.</b>
</p>
</section>
<section>
<h2>Potential Topics of Discussion</h2>
<ul>
<li>Many recent papers have suggested different metrics and techniques for validating mechanistic
interpretations [<a href="https://distill.pub/2020/circuits">Cammarata et al., 2020</a>, <a
href="https://arxiv.org/abs/2106.02997">Geiger et al., 2021</a>, <a
href="https://arxiv.org/abs/2211.00593">Wang et al., 2022</a>, <a
href="https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing">Chan
et al., 2022</a>]. What are the advantages and disadvantages of these metrics, and which metrics should the
field use going forward? How do we avoid spurious explanations or “interpretability illusions” [<a
href="https://arxiv.org/abs/2104.07143">Bolukbasi et al., 2021</a>]? Are there unknown illusions for
currently popular techniques?
<li>Neural networks seem to represent more features in superposition [<a
href="https://transformer-circuits.pub/2022/toy_model/index.html">Elhage et al., 2022</a>, <a
href="https://arxiv.org/abs/2305.01610">Gurnee et al., 2023</a>] than they have dimensions, which poses a
significant challenge for identifying what features particular subcomponents are representing. How much of a
challenge does superposition pose for various approaches to mechanistic interpretability? What are approaches
that allow us to address or circumvent this challenge? We are particularly excited to see work building on
recent successes using dictionary learning to address superposition, such as Sparse Autoencoders [<a
href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Bricken et al., 2023</a>],
including studying these dictionaries, using them for circuit analysis [<a
href="https://arxiv.org/abs/2403.19647">Marks et al., 2024</a>], understanding reward models <a
href="https://arxiv.org/pdf/2310.08164">[Marks et al., 2024]</a>, and developing better training methods.
<li>Techniques from mechanistic interpretability have been used to identify, edit, and control behavior inside
of neural networks [<a href="https://arxiv.org/abs/2202.05262">Meng et al., 2022</a>, <a
href="https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector">Turner
et al., 2023</a>]. However, other recent work has suggested that these model editing and pruning techniques
often have unintended side effects, especially on larger models [<a
href="https://arxiv.org/abs/2305.17553">Hoelscher-Obermaier et al., 2023</a>, <a
href="https://arxiv.org/abs/2307.12976">Cohen et al. 2023</a>, <a
href="https://arxiv.org/abs/2402.17700">Huang et al. 2024</a>, <a href="https://arxiv.org/pdf/2401.01814">Lo
et al., 2024</a>]. How can we refine localization, editing, and pruning techniques to make them more specific
and scalable?
<li>To understand what model activations and components do, it is crucial to have principled techniques, which
ideally involve causally intervening on the model, or otherwise being faithful to the model's internal
mechanisms. For example, a great deal of work has been done around activation patching, such as (distributed)
interchange interventions [<a href="https://arxiv.org/abs/2004.12265">Vig et al., 2020</a>, <a
href="https://proceedings.iclr.cc/paper/2021/hash/4f5c422f4d49a5a807eda27434231040-Abstract.html">Geiger et
al., 2021, Geiger et al., 2024</a>], causal tracing [<a href="https://arxiv.org/abs/2202.05262">Meng et al.,
2022</a>], path patching [<a href="https://arxiv.org/abs/2211.00593">Wang et al., 2022</a>, <a
href="https://arxiv.org/abs/2304.05969">Goldowsky-Dill et al., 2023</a>], patchscopes [<a
href="https://arxiv.org/abs/2401.06102">Ghandeharioun et al., 2024</a>] and causal scrubbing [<a
href="https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing">Chan
et al., 2022</a>]. What are the strengths and weaknesses of current techniques, when should or shouldn't
they be applied, and how can they be refined? And can we find new techniques, capable of giving new insights?
(A minimal activation-patching sketch appears after this list.)
<li>Many approaches for generating mechanistic explanations are very labor intensive, leading to interest in
automated and scalable mechanistic interpretability [<a href="https://arxiv.org/abs/2304.12918">Foote et al.,
2023</a>, <a href="https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html">Bills et
al., 2023</a>, <a href="https://arxiv.org/abs/2304.14997">Conmy et al., 2023</a>, <a
href="https://arxiv.org/abs/2403.00745">Kramar et al., 2024</a>, <a
href="https://arxiv.org/abs/2305.08809">Wu et al.2024</a>]. How can we develop more scalable, efficient
techniques for interpreting ever larger and more complicated models? How do interpretability properties change
with model scale, and what will it take for the field to be able to keep up with frontier foundation models?
<li>Models are complex, high-dimensional objects, and significant insights can be gained from more qualitative,
biological-style analysis, such as studying individual neurons [<a
href="https://distill.pub/2021/multimodal-neurons/">Goh et al., 2021</a>, <a
href="https://arxiv.org/abs/2401.12181">Gurnee et al., 2024</a>], Sparse Autoencoder features [<a
href="https://arxiv.org/abs/2309.08600">Cunningham et al. 2023</a>, <a
href="https://transformer-circuits.pub/2023/monosemantic-features/index.html">Bricken et al., 2023</a>],
attention heads [<a href="https://arxiv.org/abs/2310.04625">McDougall et al., 2023</a>, <a
href="https://arxiv.org/abs/2312.09230">Gould et al., 2023</a>], or specific circuits [<a
href="https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html">Olsson et
al., 2022</a>, <a href="https://arxiv.org/abs/2211.00593">Wang et al., 2022</a>, <a
href="https://arxiv.org/abs/2307.09458">Lieberum et al., 2023</a>]. What more can we learn from such
analyses? How can we ensure they’re kept to a high standard of rigor, and what mistakes have been made in past
work?
<li>Mechanistic interpretability is sometimes criticized for a focus on cherry-picked, toy tasks. Can we
validate that our understanding is correct by doing something useful with interpretability on a real-world
task, such as reducing sycophancy [<a
href="https://www.lesswrong.com/posts/zt6hRsDE84HeBKh7E/reducing-sycophancy-and-improving-honesty-via-activation">Rimsky
2023</a>] or preventing jailbreaks [<a href="https://arxiv.org/abs/2401.18018">Zheng et al., 2024</a>]? In
particular, can we find cases where mechanistic interpretability wins in a “fair fight”, and beats strong
non-mechanistic baselines such as representation engineering [<a href="https://arxiv.org/abs/2310.01405">Zou
et al., 2023</a>] or fine-tuning?
<li>There are many mysteries in the basic science of model internals: how and whether they use superposition [<a
href="https://transformer-circuits.pub/2022/toy_model/index.html">Elhage et al., 2022</a>], whether the
linear representation hypothesis [<a href="https://arxiv.org/abs/2311.03658">Park et al., 2023</a>] is true,
if features are universal [<a href="https://distill.pub/2020/circuits/zoom-in/">Olah et al., 2020</a>], what
fine-tuning does to a model [<a href="https://finetuning.baulab.info/">Prakash et al., 2024</a>], and many
more. What are the biggest remaining open problems, and how can we make progress on them?
<li>Much current mechanistic interpretability work focuses on LLMs. How well does this generalize to other areas
and modalities, such as vision [<a href="https://distill.pub/2020/circuits/curve-circuits/">Cammarata et al.,
2021</a>], audio, video, protein folding, or reinforcement learning [<a
href="https://distill.pub/2020/understanding-rl-vision">Hilton et al., 2020</a>]? What can mechanistic
interpretability learn from related fields, such as neuroscience and the study of biological circuits, and does
mechanistic interpretability have any insights to be shared there?
<li>A significant contributor to the rapid growth of the field is the availability of introductory materials [<a
href="https://neelnanda.io/glossary">Nanda 2022</a>], beginner-friendly coding tutorials on key techniques
[<a href="https://arena3-chapter1-transformer-interp.streamlit.app/">McDougall 2023</a>], open-sourced code
and easy-to-use software packages (for example, <a
href="https://github.com/neelnanda-io/TransformerLens">Nanda and Bloom [2022]</a> or <a
href="https://github.com/ndif-team/nnsight">Fiotto-Kaufman [2023]</a>), which makes it easier for new
researchers to begin to contribute to the field. How can the field continue to foster this beginner-friendly
environment going forward?
<li>Mechanistic interpretability is sometimes analogized to the neuroscience of machine learning models.
Multimodal neurons were found in biological networks [<a
href="http://amygdala.psychdept.arizona.edu/IntroData/Readings/week5/Quiroga-reddy-kreiman-koch-Fried+invariant-visual-single-neurons-human+Nature+2005.pdf">Quiroga
et al., 2005</a>] and then artificial ones [<a href="https://distill.pub/2021/multimodal-neurons/">Goh et
al., 2021</a>], and high-low frequency detectors were found in artificial networks [<a
href="https://distill.pub/2020/circuits/frequency-edges/#:~:text=A%20family%20of%20early%2Dvision,high%20to%20low%20spatial%20frequency.">Schubert
et al., 2021</a>] then biological ones [<a
href="https://www.biorxiv.org/content/10.1101/2023.03.15.532836v1">Ding et al., 2023</a>]. How tight is this
analogy, and what can the two fields learn from each other?
</li>
</ul>
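<p>To make the activation-patching discussion above concrete, here is a minimal, illustrative sketch using the
TransformerLens package on the indirect-object-identification prompt pair of <a
href="https://arxiv.org/abs/2211.00593">Wang et al., 2022</a>. The layer sweep, patched position, and metric are
arbitrary illustrative choices, not a recommended protocol.</p>
<pre><code># Illustrative activation patching with TransformerLens; choices here are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the bar, John gave a drink to"
corrupt_prompt = "When John and Mary went to the bar, Mary gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)
assert clean_tokens.shape == corrupt_tokens.shape  # patching needs aligned positions

_, clean_cache = model.run_with_cache(clean_tokens)
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def patch_final_resid(resid, hook):
    # resid: [batch, pos, d_model]; splice the clean activation into the
    # corrupted run at the final token position only.
    resid[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return resid

for layer in range(model.cfg.n_layers):
    patched = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", patch_final_resid)],
    )
    # How much does patching at this layer restore the clean answer " Mary"?
    print(layer, (patched[0, -1, mary] - patched[0, -1, john]).item())
</code></pre>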
<p>
Besides panel discussions, invited talks, and a poster session, we also plan on running a hands-on tutorial
exploring newer results in the field using the <a
href="https://github.com/neelnanda-io/TransformerLens">TransformerLens</a> package (Nanda and Bloom, 2022).
</p>
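<p>As a small taste of what the tutorial will build on (not the tutorial material itself), TransformerLens lets
you load a model and cache every intermediate activation in one call:</p>
<pre><code># Minimal TransformerLens usage sketch.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("Mechanistic interpretability is")
print(logits.shape)                               # [batch, pos, d_vocab]
print(cache["blocks.0.attn.hook_pattern"].shape)  # layer-0 attention patterns
</code></pre>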
</section>
<section>
<h2>Organizing Committee</h2>
<div class="organizers">
<div class="Organizer">
<img src="img/fazlbarez.jpeg" alt="Speaker" />
<div>
<h3><a href="https://fbarez.github.io/">Fazl Barez</a></h3>
<p>Research Fellow, University of Oxford</p>
</div>
</div>
<div class="Organizer">
<img src="img/morgeva.jpeg" alt="Organizer" />
<div>
<h3><a href="https://mega002.github.io/">Mor Geva</a></h3>
<p>Assistant Professor, Tel Aviv University; Visiting Researcher, Google Research</p>
</div>
</div>
<div class="Organizer">
<img src="img/lawrencechan.jpeg" alt="Organizer" />
<div>
<h3><a href="https://chanlawrence.me/">Lawrence Chan</a></h3>
<p>PhD Student, UC Berkeley</p>
</div>
</div>
<div class="Organizer">
<img src="img/atticusgeiger.jpeg" alt="Organizer" />
<div>
<h3><a href="https://atticusg.github.io/">Atticus Geiger</a></h3>
<p>Pr(Ai)<sup>2</sup>R Group</p>
</div>
</div>
<div class="Organizer">
<img src="img/kayoyin.jpeg" alt="Organizer" />
<div>
<h3><a href="https://kayoyin.github.io/">Kayo Yin</a></h3>
<p>PhD Student, UC Berkeley</p>
</div>
</div>
<div class="Organizer">
<img src="img/neelnanda.jpeg" alt="Organizer" />
<div>
<h3><a href="https://www.neelnanda.io/about">Neel Nanda</a></h3>
<p>Research Engineer, Google DeepMind</p>
</div>
</div>
<div class="Organizer">
<img src="img/maxtegmark.webp" alt="Organizer" />
<div>
<h3><a href="https://physics.mit.edu/faculty/max-tegmark/">Max Tegmark</a></h3>
<p>Professor, MIT</p>
</div>
</div>
</div>
</section>
<section>
<h2>Contact</h2>
<p>
<!-- Emails [email protected] -->
Email: <a href="mailto:[email protected]">[email protected]</a>
</p>
</section>
</main>
</body>
</html>