-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathmonster-match.html
247 lines (235 loc) · 14.5 KB
/
monster-match.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
<!DOCTYPE html>
<html>
<head>
<link rel="canonical" href="https://hardmath123.github.io/monster-match.html"/>
<link rel="stylesheet" type="text/css" href="/static/base.css"/>
<title>They did the MONSTER match! - Comfortably Numbered</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<link rel="alternate" type="application/rss+xml" title="Comfortably Numbered" href="/feed.xml" />
<!--
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script>
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$']]}
});
</script>
-->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-Um5gpz1odJg5Z4HAmzPtgZKdTBHZdw8S29IecapCSB31ligYPhHQZMIlWLYQGVoc" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-YNHdsYkH6gMx9y3mRkmcJ2mFUjTd0qNQQvY9VYZgQd7DcN7env35GzlmFaZ23JGp" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-vZTG03m+2yp6N6BNi5iM4rW4oIwk5DfcNdFfxkk9ZWpDriOkXX8voJBFrAO7MpVl" crossorigin="anonymous"></script>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
// customised options
// • auto-render specific keys, e.g.:
delimiters: [
{left: '$$', right: '$$', display: true},
{left: '$', right: '$', display: false},
{left: '\\begin{align}', right: '\\end{align}', display: true},
{left: '\\(', right: '\\)', display: false},
{left: '\\[', right: '\\]', display: true}
],
// • rendering keys, e.g.:
throwOnError : false
});
});
</script>
</head>
<body>
<header id="header">
<script src="static/main.js"></script>
<div>
<a href="/"><span class="left-word">Comfortably</span> <span class="right-word">Numbered</span></a>
</div>
</header>
<article id="postcontent" class="centered">
<section>
<h1>They did the MONSTER match!</h1>
<center><em><p>Implementing regular expressions in Scratch using subway systems and coins.</p>
</em></center>
<h4>Monday, February 29, 2016 · 6 min read</h4>
<p>I needed a fun little project to tinker with while exploring
<a href="http://tjvr.org/secret-project/">tosh</a>, and I settled on implementing regexes
in Scratch. Because that’s clearly a very practical idea with no problems at
all.</p>
<p>Regular expressions are <a href="string1.html">like
magic</a>. If you haven’t come across
them yet, you soon will. They’re a way to match text against patterns, and you
can use a regex to teach a computer what, for example, an email address “looks
like”. I’m going to assume that if you’re reading this, you have some idea of
how to use regexes; but if you don’t, perhaps the aforementioned link could
help you get up to speed.</p>
<p>Since regexes are like magic, they’re also one of those things that many people
know how to use, but fewer people know how to implement, and even fewer people
know how to implement efficiently. Hopefully by the end of this article, we’ll
be part of that last group.</p>
<p>But first, something fun for you to play with: check out
<a href="https://scratch.mit.edu/projects/97121677/">this</a> Scratch project for a live
demo.</p>
<hr>
<p>Let’s talk about finite-state machines. Finite-state machines are kind of like
a subway system. Suppose you’re heading back home after visiting the Scratch
Team at MIT, and so you’re at Kendall Station on the MTA. Maybe you want to get
to Haymarket.</p>
<p>(I promise this gets relevant to regexes soon.)</p>
<p>You could head north towards Alewife or south towards Braintree. Let’s head
south. You pass Charles/MGH and then you’re at Park St. Now you have a choice:
you can either continue on the red line towards Harvard, or you can switch over
to the Green Line and go to west to Boylston or east to Haymarket (which is
your destination).</p>
<p><img src="static/monster-match/subway-spider.jpg" alt="The MTBA subway map."></p>
<p>Now imagine each little stretch of subway between two stations has a letter
associated with it. So, maybe Kendall to Charles is “B” and Charles to Park St.
is “A”, and so on. But, to be clear, the letters belong to the tracks joining
two stations, <em>not</em> to the stations themselves. You go take a ride on the
subway and read off the letter at each stretch.</p>
<p>You have now re-enacted what we call a “finite-state machine”. Instead of
subway stations, we have <em>states</em>, and instead of stretches of tracks, we have
<em>edges</em>.</p>
<p>Some details: the edges need to be <em>directed</em>, so you can’t go both ways on an
edge. This makes sense with the subway analogy—each individual track only
heads in one direction (otherwise the trains would collide!). To keep track of
this, we draw them with arrows rather than lines.</p>
<p>Here’s a simpler picture of a small part of the T that we care about, then:</p>
<p><img src="static/monster-match/subway-fsm.png" alt="FSM diagram"></p>
<p>A few things to notice here: “letters” don’t need to be from the alphabet (“!”
is a valid “letter”). You can use the same letter on multiple edges. And,
finally, placement doesn’t matter. Haymarket isn’t actually west of Downtown.</p>
<p>Haymarket is double-circled because it’s the final destination.</p>
<p>Now, the question is, if you get from Kendall (start) to Haymarket (finish),
what words can you form by reading off the letters in order? Well, you need to
read off “B” and “A” to get to Park. But then you have a choice: you could go
directly to Haymarket and read off a “!”, or you could go to Downtown and back
and read off “N” followed by “A”. You can’t go to Boylston because there’s no
way to get back to Haymarket from there.</p>
<p>And so some sequences you could read off are “BA!”, “BANA!”, “BANANA!”, and…
you might have already guessed that this matches the regex <code>BA(NA)*!</code>.</p>
<p>Interesting.</p>
<hr>
<p>So it turns out that you can turn any finite state machine into a regular
expression, and, more importantly, vice-versa. If you think of finite state
machines in terms of the <em>set of words they admit</em> and regular expressions in
terms of the <em>set of words they match</em>, then each FSM can be paired with an
equivalent regex (but that regex isn’t necessarily unique!).</p>
<p>So to match a string against a regex, you really just need to match it against
an equivalent FSM, which seems like a much easier thing to do since they’re so
much more visual.</p>
<p>The way to match an FSM is to start with a coin on the start state (Kendall).
Then, you read off the letters from the string you’re trying to match. For each
letter, you look at <em>all</em> the coins on the FSM. For each coin, you either move
it to an adjacent state if they’re connected by an edge with that letter, or
you remove it from the FSM if there is no such edge (if there are <em>multiple</em>
edges with the same letter, then you have to “clone” the coin and put a copy on
each edge’s target—this is called a <em>nondeterministic</em> FSM, and more on this
later).</p>
<p>If, when you’re done reading the string, a coin ends up in your final
(“accepting”, double-circled) state then you win. That coin represents a way to
travel the subway so that you read off that exact string, so your string must
clearly be accepted by the FSM (and therefore the regex it’s equivalent to).</p>
<p>I suggest trying this process yourself with the “BANANA!” example.</p>
<hr>
<p>And so this leaves one more question: how do you turn a regex into an FSM? This
turns out to be pretty easy.
<a href="https://en.wikipedia.org/wiki/Thompson%27s_construction">Thompson</a> gives an
algorithm to do this, but it’s really simple and I urge you to try and figure
it out yourself. The best way is to take each of the regex primitives like <code>*</code>
and <code>|</code> and <code>()</code> and figure out how you can compile each one down individually.
It’s a fun exercise, though there are some annoying subtleties with the empty
string that you might have to deal with.</p>
<p>You should also think about how you can turn other regex features like <code>+</code> and
charsets into simpler regexes that just involve the above-mentioned 3 things.
Can you compile down backreferences? Lookaheads?</p>
<p>Anyway, there’s also a way to turn the resulting nondeterministic FSM into a
<em>deterministic</em> one (i.e. no coin-cloning) using what’s called the <a href="https://en.wikipedia.org/wiki/Powerset_construction">powerset
construction</a>. This
creates a whole lot of new states, though, so it’s usually better to just make
peace with coin-cloning. And there’s a <a href="https://en.wikipedia.org/wiki/DFA_minimization">host of
algorithms</a> to make
deterministic FSMs smaller.</p>
<p>And that’s basically how my Scratch project works! There’s a small JavaScript
program that takes a regex and outputs an FSM as a data structure. A second
program then “flattens” the FSM into something that looks kind of like assembly
(you could call this the “link” phase of the compilation). The assembly-ish
stuff is just a flat, linear representation of the FSM in a Scratch list, where
instead of arrows, I have the address of the state the arrow points to. Then a
simple Scratch program written with Tosh goes and interprets that assembly
using the coin-shunting technique. The whole process is very much like
compiling a C program and running it on a processor, in fact.</p>
<hr>
<p>Now that we know a fair bit about how to implement regexes, I want to talk
about a cool application of FSM theory: in particular, a nice little result
that says that you <em>cannot</em> write a regular expression to match nested
parentheses (so <code>[]</code>, <code>[[]]</code>, <code>[[[]]]</code> all match but <code>][[]</code> doesn’t).</p>
<p>First, go ahead and try to write such a regex to convince yourself that you
can’t. No “cheating” using fancy stuff like backreferences and whatnot: I’m
talking about “pure” regexes involving only <code>*</code> and <code>|</code> and <code>()</code>.</p>
<p>Convinced? Good.</p>
<p>We’re going to <em>prove</em> that you can’t do this.</p>
<p>Recall that any regex can be turned into an FSM. So let’s suppose we had a
regex for matched parentheses, and we turned it into an FSM. Now, remember how
FSM stands for “finite-state machine”? That means it, uh, has a finite number
of states. Which means, if it has 100 states and you give that FSM the sequence
of 51 <code>(</code>s and 51 <code>)</code>s (for a total of 102 characters), then you <em>must</em> visit
some state twice. It’s the same argument as if you have 102 pigeons and 100
holes, you must have a hole with more than one pigeon in it.</p>
<p>And so if you loop back to a state, you can clearly loop back to that state
<em>again</em>! So there must be some subsequence in your string that can be repeated
while still matching the FSM. For example, in “BANANA!”, you can repeat the
“NA” as many times as you want. This fact is called the <a href="https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages">pumping
lemma</a> (not
to be confused with the pumping llama, which is a fuzzy weightlifting camelid).</p>
<p><img src="static/monster-match/llama.png" alt="A pumping llama."></p>
<blockquote>
<p>Image by Kimberly Do.</p>
</blockquote>
<p>Now, let’s think about the <code>()</code> language (called the <a href="https://en.wikipedia.org/wiki/Dyck_language">Dyck
language</a>). What part of it can
you repeat? If you repeat the <code>(</code>s then you have too many <code>(</code>s. If you repeat
the <code>)</code>s you have too many <code>)</code>s. And if you repeat something with both <code>(</code>s and
<code>)</code>s, you clearly don’t follow the <code>(...(())...)</code> structure anymore.</p>
<p>So… no such regex can exist.</p>
<p>Isn’t that a great argument? I think it’s very cool.</p>
<hr>
<p>So, let’s recap. Regular expressions correspond to finite-state machines, which
are souped-up subway systems. You can implement finite-state machines as a kind
of automatic board game which also somehow involves cloning currency. Finally,
the finite-ness of finite-state machines lets us prove a <s>llama</s> lemma
which lets you show all sorts of interesting facts about regexes.</p>
<p>Perhaps a deeper lesson here is that regexes, and to a lesser extent the rest
of computer science, might seem like magic. But it’s magic that you’re entirely
capable of learning. In the words of my hero <a href="http://blog.jamisbuck.org">Basil
Smockwhitener</a>,</p>
<blockquote>
<p>You don’t understand how powerful you are. You can learn anything. Anything
at all. But you have to be willing to balance the scale with effort.</p>
</blockquote>
</section>
<div id="comment-breaker">◊ ◊ ◊</div>
</article>
<footer id="footer">
<div>
<ul>
<li><a href="https://github.com/kach">
Github</a></li>
<li><a href="feed.xml">
Subscribe (RSS feed)</a></li>
<li><a href="https://twitter.com/hardmath123">
Twitter</a></li>
<li><a href="https://creativecommons.org/licenses/by-nc/3.0/deed.en_US">
CC BY-NC 3.0</a></li>
</ul>
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-46120535-1', 'hardmath123.github.io');
ga('require', 'displayfeatures');
ga('send', 'pageview');
</script>
</footer>
</body>
</html>