🚚 As of April 2022 this repo is no longer the home of simple-markdown. The contents and development activity have moved into the Perseus repo.
simple-markdown is a markdown-like parser designed for simplicity and extensibility.
Most markdown-like parsers aim for speed or edge case handling. simple-markdown aims for extensibility and simplicity.
What does this mean?
Many websites using markdown-like languages have custom extensions, such as @mentions or issue number linking. Unfortunately, most markdown-like parsers don't allow extension without forking, and can be difficult to modify even when forked. simple-markdown is designed to allow simple addition of custom extensions without needing to be forked.
At Khan Academy, we use simple-markdown to format over half of our math exercises, because we need markdown extensions for math text and interactive widgets.
simple-markdown is MIT licensed.
First, let's parse and output some generic markdown using simple-markdown.
If you want to run these examples in node, you should run npm install in the simple-markdown folder or npm install simple-markdown in your project's folder. Then you can acquire the SimpleMarkdown variable with:
var SimpleMarkdown = require("simple-markdown");
Then let's get a basic markdown parser and outputter.
SimpleMarkdown provides default parsers/outputters for generic markdown:
var mdParse = SimpleMarkdown.defaultBlockParse;
var mdOutput = SimpleMarkdown.defaultOutput;
mdParse can give us a syntax tree:
var syntaxTree = mdParse("Here is a paragraph and an *em tag*.");
Let's inspect our syntax tree:
// pretty-print this with 4-space indentation:
console.log(JSON.stringify(syntaxTree, null, 4));
=> [
    {
        "content": [
            {
                "content": "Here is a paragraph and an ",
                "type": "text"
            },
            {
                "content": [
                    {
                        "content": "em tag",
                        "type": "text"
                    }
                ],
                "type": "em"
            },
            {
                "content": ".",
                "type": "text"
            }
        ],
        "type": "paragraph"
    }
]
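Because the syntax tree is plain data, it can be walked with ordinary recursion. As a minimal sketch (the plainText helper is hypothetical and not part of simple-markdown), the following collects the text of the tree printed above, assuming every node's content is either a string or an array of child nodes, as it is in this example:

```javascript
// The syntax tree printed above, written out as a literal.
var exampleTree = [
  {
    content: [
      { content: "Here is a paragraph and an ", type: "text" },
      { content: [{ content: "em tag", type: "text" }], type: "em" },
      { content: ".", type: "text" },
    ],
    type: "paragraph",
  },
];

// Hypothetical helper: concatenates the text of every node, depth first.
function plainText(nodes) {
  return nodes
    .map(function (node) {
      return typeof node.content === "string"
        ? node.content
        : plainText(node.content);
    })
    .join("");
}

console.log(plainText(exampleTree));
// => Here is a paragraph and an em tag.
```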
Then to turn that into an array of React elements, we can call mdOutput:
mdOutput(syntaxTree)
=> [ { type: 'div',
key: null,
ref: null,
_owner: null,
_context: {},
_store: { validated: false, props: [Object] } } ]
Let's add an underline extension! To do this, we'll need to create a new rule and then make a new parser/outputter. The next section will explain how all of these steps work in greater detail. (To follow along with these examples, you'll also need underscore.)
First, we create a new rule. We'll look for double underscores surrounding text.
We'll put underlines right before ems, so that __ will be parsed before _ for emphasis/italics.
A regex to capture this would look something like /^__([\s\S]+?)__(?!_)/. This matches __, followed by any content until it finds another __ not followed by a third _.
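To get a feel for this regex before wiring it into a rule, it can be tried directly on a few strings. This is only a quick sketch of the regex's behaviour; the sample inputs are made up for illustration:

```javascript
var underlineRegex = /^__([\s\S]+?)__(?!_)/;

// Matches at the start of the source; capture[1] is the inner text.
var capture = underlineRegex.exec("__hello__ world");
console.log(capture[0]); // => __hello__
console.log(capture[1]); // => hello

// A closing __ followed by a third _ is skipped, so the extra
// underscore ends up inside the captured content.
var tripled = underlineRegex.exec("__a___");
console.log(tripled[1]); // => a_

// Single underscores (emphasis) do not match.
console.log(underlineRegex.exec("_em_")); // => null

// Because of the leading ^, text that does not start with __ does not match.
console.log(underlineRegex.exec("see __later__")); // => null
```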
var underlineRule = {
  // Specify the order in which this rule is to be run
  order: SimpleMarkdown.defaultRules.em.order - 0.5,
  // First we check whether a string matches
  match: function (source) {
    return /^__([\s\S]+?)__(?!_)/.exec(source);
  },
  // Then parse this string into a syntax node
  parse: function (capture, parse, state) {
    return {
      content: parse(capture[1], state),
    };
  },
  // Finally transform this syntax node into a
  // React element
  react: function (node, output) {
    return React.createElement("u", null, output(node.content));
  },
  // Or an html element:
  // (Note: you may only need to make one of `react:` or
  // `html:`, as long as you never ask for an outputter
  // for the other type.)
  html: function (node, output) {
    return "<u>" + output(node.content) + "</u>";
  },
};
Then, we need to add this rule to the other rules:
var rules = _.extend({}, SimpleMarkdown.defaultRules, {
  underline: underlineRule,
});
Finally, we need to build our parser and outputters:
var rawBuiltParser = SimpleMarkdown.parserFor(rules);
var parse = function (source) {
  var blockSource = source + "\n\n";
  return rawBuiltParser(blockSource, { inline: false });
};
// You probably only need one of these: choose depending on
// whether you want react nodes or an html string:
var reactOutput = SimpleMarkdown.outputFor(rules, "react");
var htmlOutput = SimpleMarkdown.outputFor(rules, "html");
Now we can use our custom parse and output functions to parse markdown with underlines!
var syntaxTree = parse("__hello underlines__");
console.log(JSON.stringify(syntaxTree, null, 4));
=> [
    {
        "content": [
            {
                "content": [
                    {
                        "content": "hello underlines",
                        "type": "text"
                    }
                ],
                "type": "underline"
            }
        ],
        "type": "paragraph"
    }
]
reactOutput(syntaxTree)
=> [ { type: 'div',
key: null,
ref: null,
_owner: null,
_context: {},
_store: { validated: false, props: [Object] } } ]
htmlOutput(syntaxTree)
=> '<div class="paragraph"><u>hello underlines</u></div>'
SimpleMarkdown.defaultBlockParse(source): Returns a syntax tree of the result of parsing source with the default markdown rules. Assumes a block scope.
SimpleMarkdown.defaultInlineParse(source): Returns a syntax tree of the result of parsing source with the default markdown rules, where source is assumed to be inline text. Does not emit <p> elements. Useful for allowing inline markdown formatting in one-line fields where paragraphs, lists, etc. are disallowed.
SimpleMarkdown.defaultImplicitParse(source): Parses source as block if it ends with \n\n, or inline if not.
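The block-or-inline decision described above can be sketched as a simple check on the end of the source. This illustrates the documented behaviour only and is not the library's actual code; looksLikeBlock is a hypothetical name:

```javascript
// Hypothetical helper: true when the source ends with a blank line ("\n\n").
function looksLikeBlock(source) {
  return /\n\n$/.test(source);
}

console.log(looksLikeBlock("A paragraph.\n\n")); // => true
console.log(looksLikeBlock("just *inline* text")); // => false
console.log(looksLikeBlock("one newline only\n")); // => false
```

A caller could use a check like this to decide whether to pass inline: false or inline: true in the state given to a parser.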
SimpleMarkdown.defaultOutput(syntaxTree): Returns React-renderable output for syntaxTree.
Note: raw html output will be coming soon
Elements in simple-markdown are generally created from rules.
For parsing, rules must specify match and parse methods.
For output, rules must specify a react or html method (or both), depending on which outputter you create afterwards.
Here is an example rule, a slightly modified version of what simple-markdown uses for parsing strong (bold) text:
strong: {
    match: function(source, state, lookbehind) {
        return /^\*\*([\s\S]+?)\*\*(?!\*)/.exec(source);
    },
    parse: function(capture, recurseParse, state) {
        return {
            content: recurseParse(capture[1], state)
        };
    },
    react: function(node, recurseOutput) {
        return React.createElement('strong', null, recurseOutput(node.content));
    },
    html: function(node, recurseOutput) {
        return '<strong>' + recurseOutput(node.content) + '</strong>';
    },
},
Let's look at those three methods in more detail.
simple-markdown calls your match function to determine whether the upcoming markdown source matches this rule or not.
source is the upcoming source, beginning at the current position of parsing (source[0] is the next character).
state is a mutable state object to allow for more complicated matching and parsing. The most common field on state is inline, which all of the default rules set to true when we are in an inline scope, and false or undefined when we are in a block scope.
DEPRECATED - use state.prevCapture instead. lookbehind is the string previously captured at this parsing level, to allow for lookbehind. For example, lists check that lookbehind ends with /^$|\n *$/ to ensure that lists only match at the beginning of a new line.
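To illustrate the list check mentioned above, here is a small sketch that applies the quoted regex to a few possible previous captures. The atLineStart helper and the sample strings are hypothetical; only the regex comes from the text above:

```javascript
// The check quoted above: the previous capture is empty, or it ends
// with a newline optionally followed by spaces.
var LINE_START_R = /^$|\n *$/;

// Hypothetical helper taking the previously captured string.
function atLineStart(previousCapture) {
  return LINE_START_R.test(previousCapture);
}

console.log(atLineStart("")); // => true (nothing captured yet)
console.log(atLineStart("Some text\n")); // => true
console.log(atLineStart("Some text\n  ")); // => true
console.log(atLineStart("Some text ")); // => false
```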
If this rule matches, match should return an object, array, or array-like object, which we'll call capture, where capture[0] is the full matched source, and any other fields can be used in the rule's parse function. The return value from RegExp.prototype.exec fits this requirement, and the common use case is to return the result of someRegex.exec(source).
If this rule does not match, match should return null.
NOTE: If you are using regexes in your match function, your regex should always begin with ^. Regexes without leading ^s can cause unexpected output or infinite loops.
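To see why the leading ^ matters, compare an anchored and an unanchored version of the same pattern. Without the anchor, exec can report a match that starts later in the source, even though capture[0] is expected to be the text at the current parsing position. The two regexes and the sample source below are illustrative only:

```javascript
var anchored = /^\*\*([\s\S]+?)\*\*(?!\*)/;
var unanchored = /\*\*([\s\S]+?)\*\*(?!\*)/;

var source = "plain text, then **bold**";

// The anchored regex correctly reports no match at the current position.
console.log(anchored.exec(source)); // => null

// The unanchored regex finds a match further along in the source.
var stray = unanchored.exec(source);
console.log(stray[0]); // => **bold**
console.log(stray.index); // => 17
```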
parse takes the output of match and transforms it into a syntax tree node object, which we'll call node here.
capture is the non-null result returned from match.
recurseParse is a function that can be called on sub-content and state to recursively parse the sub-content. This returns an array.
state is the mutable state threading object, which can be examined or modified, and should be passed as the second argument to any recurseParse calls.
For example, to parse inline sub-content, you can add inline: true to state, or inline: false to force block parsing (to leave the parsing scope alone, you can just pass state with no modifications).
For example:
var innerText = capture[1];
recurseParse(
  innerText,
  _.defaults(
    {
      inline: true,
    },
    state
  )
);
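If you would rather not depend on underscore for this step, a similar state copy can be sketched with Object.assign. The state object below is a stand-in made up for illustration; the point is that the copy gets inline: true while the caller's state is left untouched:

```javascript
// Stand-in for the state object passed to a rule's parse function.
var state = { inline: false, customFlag: "kept" };

// Copy every field of state, overriding inline for the nested parse.
var inlineState = Object.assign({}, state, { inline: true });

console.log(inlineState.inline); // => true
console.log(inlineState.customFlag); // => kept
console.log(state.inline); // => false (the original is not mutated)
```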
parse should return a node object, which can have custom fields that will be passed to output, below. The one reserved field is type, which designates the type of the node, which will be used for output. If no type is specified, simple-markdown will use the current rule's type (the common case). If you have multiple ways to parse a single element, it can be useful to have multiple rules that all return nodes of the same type.
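The match/parse contract described above can be illustrated with a deliberately tiny parsing loop. This is a simplified sketch and not simple-markdown's implementation: toyRules, ruleOrder, and toyParse are hypothetical names, only two rules exist, and state is simply passed through.

```javascript
// Two toy rules following the match/parse contract.
var toyRules = {
  strong: {
    match: function (source) {
      return /^\*\*([\s\S]+?)\*\*(?!\*)/.exec(source);
    },
    parse: function (capture, recurseParse, state) {
      return { content: recurseParse(capture[1], state) };
    },
  },
  text: {
    // Consume at least one character, stopping before the next "**".
    match: function (source) {
      return /^[\s\S]+?(?=\*\*|$)/.exec(source);
    },
    parse: function (capture) {
      return { content: capture[0] };
    },
  },
};
var ruleOrder = ["strong", "text"];

function toyParse(source, state) {
  var result = [];
  while (source) {
    var matched = false;
    for (var i = 0; i < ruleOrder.length && !matched; i++) {
      var type = ruleOrder[i];
      var capture = toyRules[type].match(source, state);
      if (capture) {
        var node = toyRules[type].parse(capture, toyParse, state);
        // Fall back to the rule's own type if parse did not set one.
        if (node.type == null) {
          node.type = type;
        }
        result.push(node);
        // Advance past the matched source, which starts at source[0].
        source = source.substring(capture[0].length);
        matched = true;
      }
    }
    if (!matched) {
      throw new Error("No rule matched: " + source);
    }
  }
  return result;
}

var toyTree = toyParse("a **b** c", {});
console.log(toyTree.map(function (node) { return node.type; }));
// => [ 'text', 'strong', 'text' ]
```

Note how the loop advances by capture[0].length: this is why a capture must describe the text at the very start of the remaining source.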
react takes a syntax tree node and transforms it into React-renderable output.
node is the return value from parse, which has a type field of the same type as the current rule, as well as any custom fields created by parse.
recurseOutput is a function to recursively output sub-tree nodes created by using recurseParse in parse.
state is the mutable state threading object, which can be examined or modified, and should be passed as the second argument to any recurseOutput calls.
The simple-markdown API contains several helper methods for creating rules, as well as methods for creating parsers and outputters from rules.
simple-markdown includes access to the default list of rules, as well as several functions to allow you to create parsers and outputters from modifications of those default rules, or even from a totally custom rule list.
These functions are separated so that you can customize intermediate steps in the parsing/output process, if necessary.
SimpleMarkdown.defaultRules: The default rules, specified as an object, where the keys are the rule types, and the values are objects containing order, match, parse, react, and html fields (these rules can be used for both parsing and outputting).
SimpleMarkdown.parserFor(rules): Takes a rules object and returns a parser for the rule types in the rules object, in order of increasing order fields, which must be present and a finite number for each rule. In the case of order field ties, rules are ordered lexicographically by rule name. Each of the rules in the rules object must contain a match and a parse function.
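The ordering described above can be sketched as a sort over rule names. This is an illustration of the documented ordering, not the library's code; sortedRuleNames and the order values in exampleRules are made up for the example:

```javascript
// Stand-in rules carrying only the field needed for ordering.
var exampleRules = {
  text: { order: 3 },
  em: { order: 2 },
  underline: { order: 1.5 },
  strong: { order: 2 },
};

// Hypothetical helper: rule names by increasing order, ties broken by name.
function sortedRuleNames(rules) {
  return Object.keys(rules).sort(function (nameA, nameB) {
    var orderA = rules[nameA].order;
    var orderB = rules[nameB].order;
    if (orderA !== orderB) {
      return orderA - orderB;
    }
    // Tie: fall back to comparing the rule names.
    return nameA < nameB ? -1 : nameA > nameB ? 1 : 0;
  });
}

console.log(sortedRuleNames(exampleRules));
// => [ 'underline', 'em', 'strong', 'text' ]
```

This is also why an order just below another rule's order, such as em.order - 0.5, places a custom rule immediately before that rule.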
SimpleMarkdown.outputFor(rules, key): Takes a rules object and a key that indicates which key in the rules object is mapped to the function that generates the output type you want. This will be 'react' or 'html' unless you are defining a custom output type.
It returns a function that outputs a single syntax tree node of any type that is in the rules object, given a node and a recursive output function.
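The idea of dispatching on node type and a chosen output key can be sketched with a toy outputter. This is a simplified illustration, not the library's implementation: toyOutputFor and toyOutputRules are hypothetical, arrays are joined as strings (which only makes sense for html-style output), and no HTML escaping is performed.

```javascript
// Hypothetical helper: builds an output function for the given key.
function toyOutputFor(rules, key) {
  function output(ast, state) {
    if (Array.isArray(ast)) {
      // Output each child node and concatenate the results.
      return ast
        .map(function (node) {
          return output(node, state);
        })
        .join("");
    }
    // Look up the rule for this node's type and call its output method.
    return rules[ast.type][key](ast, output, state);
  }
  return output;
}

var toyOutputRules = {
  text: {
    html: function (node) {
      return node.content;
    },
  },
  strong: {
    html: function (node, recurseOutput, state) {
      return "<strong>" + recurseOutput(node.content, state) + "</strong>";
    },
  },
};

var toyHtmlOutput = toyOutputFor(toyOutputRules, "html");
var toyHtml = toyHtmlOutput(
  [
    { type: "text", content: "a " },
    { type: "strong", content: [{ type: "text", content: "b" }] },
  ],
  {}
);
console.log(toyHtml); // => a <strong>b</strong>
```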
Given a set of rules, one can create a single function that takes an input content string and outputs a React-renderable as follows. Note that since many rules expect blocks to end in "\n\n", we append that to source input manually, in addition to specifying inline: false (inline: false is technically optional for all of the default rules, which assume inline is false if it is undefined).
var rules = {
    ...SimpleMarkdown.defaultRules,
    paragraph: {
        ...SimpleMarkdown.defaultRules.paragraph,
        react: (node, output, state) => {
            return <p key={state.key}>{output(node.content, state)}</p>;
        }
    }
};
var parser = SimpleMarkdown.parserFor(rules);
var reactOutput = SimpleMarkdown.outputFor(rules, 'react');
var htmlOutput = SimpleMarkdown.outputFor(rules, 'html');
var blockParseAndOutput = function(source) {
    // Many rules require content to end in \n\n to be interpreted
    // as a block.
    var blockSource = source + "\n\n";
    var parseTree = parser(blockSource, {inline: false});
    var outputResult = htmlOutput(parseTree);
    // Or for react output, use:
    // var outputResult = reactOutput(parseTree);
    return outputResult;
};
Documentation for the rule helper methods is coming soon.
MIT. See the LICENSE file for text.