Markdown Implementation
Issues with HTML in CommonMark
Let's look at the following example. Below is some markdown, followed by the CommonMark AST returned by the CommonMark reference parser:
<div>

_something_

</div>
<html_block><div></html_block>
<paragraph>
<emph>
<text>something</text>
</emph>
</paragraph>
<html_block></div></html_block>
Note, in particular, the html_block elements in the AST. CommonMark doesn't understand the provided markdown as one HTML element with some markdown inside, but rather as two separate HTML "blocks" with markdown in between. This can be a problem for us because CommonMark doesn't transform the content of HTML blocks, and because it specifies that, for an HTML block to end, it must be followed by a blank line. For example:
<div>
*something*

</div>
<html_block><div>
*something*</html_block>
<html_block></div></html_block>
Here, CommonMark interpreted the markdown as two HTML blocks, <div>\n*something* and </div>, and will return both of them without any transformation.
Generally speaking, we have the following rules of thumb:
- If a line starts with one of a specific set of HTML tags (e.g., div, p, h1, ul, li), or if a line contains nothing but an HTML tag (and optionally some whitespace), then everything between that line and the first blank line after it is considered a single HTML block. Examples: case 1, case 2.
- If a line contains an HTML tag, but doesn't start with it (disregarding whitespace), then said HTML tag will be treated as an inline HTML element. This behavior will not be an issue for us.
NB: The above are oversimplifications. See the CommonMark spec for the full details.
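The two rules of thumb can be sketched as a small line classifier. This is only an illustration of the simplified rules above, not of the full CommonMark algorithm: the tag list is a small, hypothetical subset, the names `BLOCK_LEVEL_TAGS` and `classifyLine` are made up for this example, and details such as indentation and the other HTML block types are ignored.

```typescript
// Small, hypothetical subset of the tag names that CommonMark treats specially.
const BLOCK_LEVEL_TAGS = ['div', 'p', 'h1', 'ul', 'li'];

type LineKind = 'html-block-start' | 'inline-html' | 'no-html';

function classifyLine(line: string): LineKind {
    const trimmed = line.trim();

    // Rule 1a: the line starts with one of the special tags.
    const start = /^<\/?([a-zA-Z][a-zA-Z0-9-]*)(?=[\s>\/]|$)/.exec(trimmed);
    if (start && BLOCK_LEVEL_TAGS.indexOf(start[1].toLowerCase()) !== -1) {
        return 'html-block-start';
    }

    // Rule 1b: the line contains nothing but a single HTML tag.
    if (/^<\/?[a-zA-Z][a-zA-Z0-9-]*(\s[^<>]*)?>$/.test(trimmed)) {
        return 'html-block-start';
    }

    // Rule 2: a tag appears elsewhere in the line, so it is inline HTML.
    if (/<\/?[a-zA-Z][^<>]*>/.test(trimmed)) {
        return 'inline-html';
    }

    return 'no-html';
}
```

Under these assumptions, a line such as `<div>*something*` starts an HTML block, whereas `text <b>bold</b>` only contains inline HTML.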
Special elements
The following table shows the results of rendering the string `<div>${'\n'.repeat(x)}*(${x},${y})*${'\n'.repeat(y)}</div>`
, for
Other elements
Remedy
We remedy the above issues in three steps:
1. Escape the HTML tags of "CommonMark type-6 HTML blocks" by appending a UUIDv4 (without dashes) to the tag name. The same UUIDv4 is used for all these tags in a given document. This ensures that all HTML content is treated the same way by the CommonMark parser.
2. Adjust the whitespace before and after the inner content of the HTML block in accordance with the following rule: Let x be the number of newline characters (e.g., '\n') between the end of the opening tag ('>') and the first non-whitespace character of the inner content. Similarly, let y be the number of newline characters between the last non-whitespace character of the inner content and the start of the closing tag ('</'). The adjusted values x' and y' are then calculated from x, y, and prefersInline(tag), where prefersInline: (tag: string) => boolean is configurable by the user via the markdown.prefersInline property of the SvelTeX configuration. (By default, prefersInline is the constant function () => true. This is because treating elements as inline elements in this sense is less invasive, since the markdown processor won't wrap the content in <p> tags.)
3. Finally, after the markdown processor has processed the document:
   - For escaped HTML tags that are immediately preceded by a <p> tag and immediately followed by a </p> tag, these <p> and </p> tags are removed. The motivation for this is to avoid ending up with something like <p><p>...</p></p>, together with the assumption and hope that this behavior mostly aligns with the output that a user might expect.
   - All occurrences of the UUIDv4 are replaced with the empty string, leaving only the original HTML tags.
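The escaping and un-escaping steps can be sketched roughly as follows. This is a minimal illustration under several assumptions, not SvelTeX's actual implementation: the function names `escapeTags` and `unescapeTags` and the short tag list are hypothetical, the id is passed in as a parameter, and the `<p>` removal treats the opening and closing side independently, which is a simplification.

```typescript
// Hypothetical subset of the tag names that start a CommonMark type-6 HTML block.
const TYPE6_TAGS = ['div', 'p', 'h1', 'ul', 'li'];

/** First step: append `id` to the name of every opening or closing tag in TYPE6_TAGS. */
function escapeTags(source: string, id: string): string {
    const names = TYPE6_TAGS.join('|');
    // Matches "<div", "</div", etc. when followed by whitespace, ">" or "/".
    const re = new RegExp(`(</?)(${names})(?=[\\s>/])`, 'gi');
    return source.replace(re, (_m, open: string, name: string) => `${open}${name}${id}`);
}

/** Last step: drop <p> and </p> directly around escaped tags, then remove the id. */
function unescapeTags(html: string, id: string): string {
    // "<p><divID ...>" becomes "<divID ...>"; "</divID></p>" becomes "</divID>".
    const opening = new RegExp(`<p>(<[a-z][a-z0-9]*${id}[^>]*>)`, 'gi');
    const closing = new RegExp(`(</[a-z][a-z0-9]*${id}>)</p>`, 'gi');
    return html.replace(opening, '$1').replace(closing, '$1').split(id).join('');
}

// Example with a fixed id. In practice the id would be generated once per
// document, e.g. via crypto.randomUUID().replace(/-/g, '').
const exampleId = '0123456789abcdef0123456789abcdef';
const escapedExample = escapeTags('<div>\n*something*\n</div>', exampleId);
```

Because the id is a long random hexadecimal string, removing every occurrence of it at the end is very unlikely to touch anything other than the escaped tag names.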
With this remedy in place, the tables from before can be merged into the following:
<any>...*(before,after)*...</any>

| Before | After | Result | Expected |
|---|---|---|---|
| 0 | 0 | <any><em>(0,0)</em></any> | |
| 0 | 1 | <any><em>(0,1)</em></any> | |
| 0 | 2 | <any><em>(0,2)</em></any> | |
| 1 | 0 | <any><em>(1,0)</em></any> or <any>\n<p><em>(1,0)</em></p>\n</any> | |
| 1 | 1 | <any><em>(1,1)</em></any> or <any>\n<p><em>(1,1)</em></p>\n</any> | |
| 1 | 2 | <any><em>(1,2)</em></any> or <any>\n<p><em>(1,2)</em></p>\n</any> | |
| ≥2 | 0 | <any>\n<p><em>(2,0)</em></p>\n</any> | |
| ≥2 | 1 | <any>\n<p><em>(2,1)</em></p>\n</any> | |
| ≥2 | ≥2 | <any>\n<p><em>(2,2)</em></p>\n</any> | |
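The table is consistent with a rule in which only the number of newlines before the inner content matters: zero newlines keeps the content inline, two or more turns it into a paragraph, and exactly one defers to prefersInline. The sketch below encodes that reading. It is an inference from the table rather than the actual formula, and `adjustedNewlines` and `adjustWhitespace` are hypothetical names; in particular, using the same adjusted count on both sides of the content is an assumption.

```typescript
type PrefersInline = (tag: string) => boolean;

/**
 * Number of newlines to place on each side of the inner content:
 * 0 keeps the content inline, 2 lets the markdown processor wrap it in <p>.
 */
function adjustedNewlines(
    tag: string,
    before: number,
    prefersInline: PrefersInline = () => true,
): 0 | 2 {
    if (before === 0) return 0;
    if (before >= 2) return 2;
    // before === 1 is the ambiguous case, so the user's preference decides.
    return prefersInline(tag) ? 0 : 2;
}

/** Rebuilds `<tag>inner</tag>` with the adjusted whitespace (attributes omitted). */
function adjustWhitespace(
    tag: string,
    inner: string,
    prefersInline: PrefersInline = () => true,
): string {
    const m = /^\s*/.exec(inner);
    const leading = m ? m[0] : '';
    const before = leading.split('\n').length - 1; // newlines before the content
    const pad = '\n'.repeat(adjustedNewlines(tag, before, prefersInline));
    return `<${tag}>${pad}${inner.trim()}${pad}</${tag}>`;
}
```

With the default `prefersInline`, a single newline before the content is collapsed so that the content stays inline; passing `() => false` would instead expand it to a blank line on both sides.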
Directives
Markdown directives are great. They're also somewhat tricky for us, since they can contain curly brackets, which could be mistaken for e.g. Svelte mustache tags. To deal with this, we provide the configuration option markdown.directives.enabled, which can be set to true to enable correct directive parsing despite any potential curly brackets. To loosen the default directive syntax a bit, one can furthermore set the property markdown.directives.bracesArePartOfDirective to a function which defines a looser syntax for directives when it comes to curly brackets.
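As a rough illustration, the two options might be set as in the sketch below. Only the option paths markdown.directives.enabled and markdown.directives.bracesArePartOfDirective come from the text above. The interface `DirectivesSketch`, the variable name, and especially the callback's signature (a single string holding the text between the braces) are assumptions made for this example, so the real signature should be checked against the SvelTeX configuration reference.

```typescript
// Hypothetical, simplified shape of the directive-related options.
interface DirectivesSketch {
    enabled: boolean;
    // Assumed signature: receives the text between a pair of curly brackets
    // and returns whether those brackets belong to a directive.
    bracesArePartOfDirective?: (braceContent: string) => boolean;
}

const directivesConfigSketch: { markdown: { directives: DirectivesSketch } } = {
    markdown: {
        directives: {
            enabled: true,
            // Illustrative rule: accept braces that contain only classes (.foo),
            // ids (#bar), or key=value pairs, separated by whitespace.
            bracesArePartOfDirective: (braceContent) =>
                /^\s*([.#][\w-]+|[\w-]+=\S+)(\s+([.#][\w-]+|[\w-]+=\S+))*\s*$/.test(braceContent),
        },
    },
};
```

Under this assumed rule, braces such as `{.note #intro}` would count as directive attributes, while something like `{count + 1}` would be left alone as a Svelte mustache tag.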