Markdown Implementation

Issues with HTML in CommonMark

Let's look at the following example: first some markdown, then the CommonMark AST that the CommonMark reference parser returns for it:

markdown
<div>

_something_

</div>
xml
<html_block>&lt;div&gt;</html_block>
<paragraph>
    <emph>
        <text>something</text>
    </emph>
</paragraph>
<html_block>&lt;/div&gt;</html_block>

Note, in particular, the html_block elements in the AST. CommonMark doesn't understand the provided markdown as one HTML element with some markdown inside, but rather as two separate HTML "blocks" with markdown in between. This can be a problem for us because CommonMark doesn't transform the contents of HTML blocks, and because it specifies that such an HTML block ends only at a blank line. For example:

markdown
<div>
*something*

</div>
xml
<html_block>&lt;div&gt;
*something*</html_block>
<html_block>&lt;/div&gt;</html_block>

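Here, CommonMark interpreted the markdown as two HTML blocks: the first spans the <div> line and the *something* line, and the second contains only </div>. It will return both of them without any transformation, so *something* is never rendered as emphasis.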

Generally speaking, we have the following rules of thumb:

  • If a line starts with one of a specific set of HTML tags (e.g., div, p, h1, ul, li, etc.), or if a line contains nothing but an HTML tag (and optionally some whitespace), then everything between that line and the first blank line after it is considered a single HTML block. Examples: case 1, case 2.
  • If a line contains an HTML tag, but doesn't start with it (disregarding whitespace), then said HTML tag will be treated as an inline HTML element. This behavior will not be an issue for us.

NB: The above are oversimplifications. See the CommonMark spec for the full details.

Special elements

The following table shows the results of rendering the string `<div>${'\n'.repeat(x)}*(${x},${y})*${'\n'.repeat(y)}</div>`, for x, y ∈ {0, 1, 2}:

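<div>...*(before,after)*...</div>

| Before | After | Result | Expected |
|--------|-------|--------|----------|
| 0 | 0 | <div>*(0,0)*</div> | |
| 0 | 1 | <div>*(0,1)*\n</div> | |
| 0 | 2 | <div>*(0,2)*\n</div> | |
| 1 | 0 | <div>\n*(1,0)*</div> | |
| 1 | 1 | <div>\n*(1,1)*\n</div> | |
| 1 | 2 | <div>\n*(1,2)*\n</div> | |
| ≥2 | 0 | <div>\n<p><em>(2,0)</em></div></p> | |
| ≥2 | 1 | <div>\n<p><em>(2,1)</em></p>\n</div> | |
| ≥2 | ≥2 | <div>\n<p><em>(2,2)</em></p>\n</div> | |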

Other elements

<foo>...*(before,after)*...</foo>

| Before | After | Result | Expected |
|--------|-------|--------|----------|
| 0 | 0 | <p><foo><em>(0,0)</em></foo></p> | |
| 0 | 1 | <p><foo><em>(0,1)</em>\n</foo></p> | |
| 0 | 2 | <p><foo><em>(0,2)</em></p>\n</foo> | |
| 1 | 0 | <foo>\n*(1,0)*</foo> | |
| 1 | 1 | <foo>\n*(1,1)*\n</foo> | |
| 1 | 2 | <foo>\n*(1,2)*\n</foo> | |
| ≥2 | 0 | <foo>\n<p><em>(2,0)</em></foo></p> | |
| ≥2 | 1 | <foo>\n<p><em>(2,1)</em>\n</foo></p> | |
| ≥2 | ≥2 | <foo>\n<p><em>(2,2)</em></p>\n</foo> | |
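
For reference, the input strings behind both tables can be generated with a small helper. This is only a sketch of the template string quoted earlier; the names `testInput` and `inputs` are made up for illustration, and the tag name is assumed to be a parameter so that the same helper covers both `div` and `foo`.

```typescript
// Builds `<tag>` + x newlines + `*(x,y)*` + y newlines + `</tag>`,
// mirroring the template string used for the tables.
function testInput(tag: string, x: number, y: number): string {
    return `<${tag}>${"\n".repeat(x)}*(${x},${y})*${"\n".repeat(y)}</${tag}>`;
}

// All nine (x, y) combinations for one tag, in the row order of the tables.
const inputs: string[] = [];
for (const x of [0, 1, 2]) {
    for (const y of [0, 1, 2]) {
        inputs.push(testInput("div", x, y));
    }
}
```

Each entry of `inputs` corresponds to one table row, read top to bottom.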

Remedy

We remedy the above issues in three steps:

  1. Escape the HTML tags of "CommonMark type-6 HTML blocks" by appending a UUIDv4 (without dashes) to the tag name. The same UUIDv4 is used for all these tags in a given document. This ensures that all HTML content is treated the same way by the CommonMark parser.

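  2. Adjust the whitespace before and after the inner content of the HTML block in accordance with the following rule: Let $x_{\text{original}} \in \mathbb{N}_0$ be the number of newline characters (i.e., '\n') between the end of the opening tag ('>') and the first non-whitespace character of the inner content. Similarly, let $y_{\text{original}} \in \mathbb{N}_0$ be the number of newline characters between the last non-whitespace character of the inner content and the start of the closing tag ('</'). We then calculate the adjusted values $x_{\text{adjusted}}$ and $y_{\text{adjusted}}$ as follows:

    $$
    x_{\text{adjusted}} = \begin{cases}
    2 & \text{if } x_{\text{original}} \geq 2, \\
    2 & \text{if } x_{\text{original}} = 1 \land \lnot\,\text{prefersInline}(\text{tag}), \\
    0 & \text{if } x_{\text{original}} = 1 \land \text{prefersInline}(\text{tag}), \\
    0 & \text{if } x_{\text{original}} = 0,
    \end{cases}
    \qquad
    y_{\text{adjusted}} = x_{\text{adjusted}},
    $$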

    where prefersInline: (tag: string) => boolean is configurable by the user via the markdown.prefersInline property of the SvelTeX configuration. (By default, prefersInline is the constant function () => true. This is because treating elements as inline elements in this sense is less invasive, since the markdown processor won't wrap the content in <p> tags.)

  3. Finally, after the markdown processor has processed the document:

    1. If an escaped HTML element is immediately preceded by a <p> tag and immediately followed by a </p> tag, these <p> and </p> tags are removed. This prevents output like <p><p>...</p></p>, and rests on the assumption (and hope) that the result mostly matches what a user would expect.
    2. All occurrences of the UUIDv4 are replaced with the empty string, leaving only the original HTML tags.
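
Steps 1 and 3 above can be sketched roughly as follows. This is a minimal illustration, not SvelTeX's actual implementation: the function names `escapeTags` and `unescapeTags`, the abbreviated tag list, and the regular expressions are all assumptions, and the sketch ignores edge cases such as tags inside code spans or nested elements with the same name. The markdown processor itself is not invoked; its output is replaced by a hand-written string.

```typescript
// Hypothetical, abbreviated list of tag names that open a CommonMark type-6 HTML block.
const TYPE6_TAGS = new Set(["div", "p", "h1", "ul", "li"]);

// Step 1: append `id` to every type-6 tag name, so <div> becomes <div{id}>.
function escapeTags(markdown: string, id: string): string {
    return markdown.replace(
        /<(\/?)([a-zA-Z][a-zA-Z0-9]*)/g,
        (match: string, slash: string, name: string) =>
            TYPE6_TAGS.has(name.toLowerCase()) ? `<${slash}${name}${id}` : match,
    );
}

// Step 3: unwrap <p><tag{id}>...</tag{id}></p>, then delete every occurrence of `id`.
function unescapeTags(html: string, id: string): string {
    const wrapped = new RegExp(
        `<p>(<([a-zA-Z][a-zA-Z0-9]*?${id})[^>]*>[\\s\\S]*?</\\2>)</p>`,
        "g",
    );
    return html.replace(wrapped, "$1").split(id).join("");
}

// Fixed stand-in for a dash-less UUIDv4; a real one would be generated once per document.
const id = "3f2b8c1e9d4a4f6b8a7c5e2d1f0b9a86";
const escaped = escapeTags("<div>*something*</div>", id);
// Hand-written stand-in for what a markdown processor might return for `escaped`.
const processed = `<p><div${id}><em>something</em></div${id}></p>`;
const restored = unescapeTags(processed, id);
```

Because the same identifier is appended to every escaped tag in a document, a single global replacement is enough to restore the original tag names at the end.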

With this remedy in place, the tables from before can be merged into the following:

<any>...*(before,after)*...</any>

| Before | After | Result | Expected |
|--------|-------|--------|----------|
| 0 | 0 | <any><em>(0,0)</em></any> | |
| 0 | 1 | <any><em>(0,1)</em></any> | |
| 0 | 2 | <any><em>(0,2)</em></any> | |
| 1 | 0 | <any><em>(1,0)</em></any> or <any>\n<p><em>(1,0)</em></p>\n</any> | |
| 1 | 1 | <any><em>(1,1)</em></any> or <any>\n<p><em>(1,1)</em></p>\n</any> | |
| 1 | 2 | <any><em>(1,2)</em></any> or <any>\n<p><em>(1,2)</em></p>\n</any> | |
| ≥2 | 0 | <any>\n<p><em>(2,0)</em></p>\n</any> | |
| ≥2 | 1 | <any>\n<p><em>(2,1)</em></p>\n</any> | |
| ≥2 | ≥2 | <any>\n<p><em>(2,2)</em></p>\n</any> | |
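
The pattern in this table follows from the whitespace rule in step 2: only the number of newlines before the content matters, and a single newline resolves to either variant depending on prefersInline. A minimal sketch of that rule, assuming hypothetical helper names (`adjustedNewlines`, `adjustInner`) that are not part of SvelTeX's API:

```typescript
type PrefersInline = (tag: string) => boolean;

// The piecewise rule from step 2: maps the original newline count to 0 or 2.
function adjustedNewlines(
    original: number,
    tag: string,
    prefersInline: PrefersInline = () => true,
): 0 | 2 {
    if (original >= 2) return 2;
    if (original === 1) return prefersInline(tag) ? 0 : 2;
    return 0;
}

// Applies the rule to the inner content of an element. The same adjusted
// count is used before and after the content (y_adjusted = x_adjusted).
function adjustInner(
    inner: string,
    tag: string,
    prefersInline: PrefersInline = () => true,
): string {
    const leading = /^\s*/.exec(inner)?.[0] ?? "";
    const x = leading.split("\n").length - 1; // newlines before the content
    const pad = "\n".repeat(adjustedNewlines(x, tag, prefersInline));
    return pad + inner.trim() + pad;
}
```

With the default `prefersInline`, `adjustInner("\n*(1,2)*\n\n", "any")` strips all surrounding newlines, which corresponds to the first variant in the rows with one leading newline. Passing `() => false` instead pads the content with two newlines on each side, which corresponds to the second variant.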

Directives

Markdown directives are great. They're also somewhat tricky for us, since they can contain curly brackets, which could be mistaken for e.g. Svelte mustache tags. To deal with this, we provide the configuration option markdown.directives.enabled, which can be set to true to enable correct directive parsing despite any potential curly brackets. To loosen the default directive syntax a bit, one can furthermore set the property markdown.directives.bracesArePartOfDirective to a function which defines a looser syntax for directives when it comes to curly brackets.
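
As a rough illustration, the options mentioned on this page might be set as shown below. The interface is a hypothetical, simplified stand-in for the relevant part of the SvelTeX configuration; only the property names markdown.prefersInline and markdown.directives.enabled are taken from the text. markdown.directives.bracesArePartOfDirective is left out because its function signature is not described here.

```typescript
// Hypothetical, simplified shape; not SvelTeX's actual configuration type.
interface MarkdownOptionsSketch {
    prefersInline?: (tag: string) => boolean;
    directives?: { enabled?: boolean };
}

const markdown: MarkdownOptionsSketch = {
    // Treat everything except a few block-like tags as inline (illustrative choice).
    prefersInline: (tag) => !["div", "section"].includes(tag),
    // Enable directive parsing so curly brackets in directives are handled correctly.
    directives: { enabled: true },
};
```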