From 6dcd2beafdfbc9f694916bcdfa822b896aa44177 Mon Sep 17 00:00:00 2001 From: John MacFarlane Date: Sun, 12 Jul 2015 23:31:01 -0700 Subject: Updated spec. --- test/spec.txt | 165 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 157 insertions(+), 8 deletions(-) diff --git a/test/spec.txt b/test/spec.txt index 0f62a54..9b5e9dc 100644 --- a/test/spec.txt +++ b/test/spec.txt @@ -2715,7 +2715,7 @@ So, we explain what counts as a block quote or list item by explaining how these can be *generated* from their contents. This should suffice to define the syntax, although it does not give a recipe for *parsing* these constructions. (A recipe is provided below in the section entitled -[A parsing strategy](#appendix-a-a-parsing-strategy).) +[A parsing strategy](#appendix-a-parsing-strategy).) ## Block quotes @@ -7940,7 +7940,10 @@ Multiple spaces -# Appendix A: A parsing strategy {-} +# Appendix: A parsing strategy {-} + +In this appendix we describe some features of the parsing strategy +used in the CommonMark reference implementations. ## Overview {-} @@ -7957,8 +7960,6 @@ are parsed into sequences of Markdown inline elements (strings, code spans, links, emphasis, and so on), using the map of link references constructed in phase 1. -## The document tree {-} - At each point in processing, the document is represented as a tree of **blocks**. The root of the tree is a `document` block. The `document` may have any number of other blocks as **children**. These children @@ -7982,7 +7983,7 @@ marked by arrows: "aliquando id" ``` -## How source lines alter the document tree {-} +## Phase 1: block structure {-} Each line that is processed has an effect on this tree. The line is analyzed and, depending on its contents, the document may be altered @@ -7997,6 +7998,36 @@ in one or more of the following ways: Once a line has been incorporated into the tree in this way, it can be discarded, so input can be read in a stream. +For each line, we follow this procedure: + +1. First we iterate through the open blocks, starting with the +root document, and descending through last children down to the last +open block. Each block imposes a condition that the line must satisfy +if the block is to remain open. For example, a block quote requires a +`>` character. A paragraph requires a non-blank line. +In this phase we may match all or just some of the open +blocks. But we cannot close unmatched blocks yet, because we may have a +[lazy continuation line]. + +2. Next, after consuming the continuation markers for existing +blocks, we look for new block starts (e.g. `>` for a block quote. +If we encounter a new block start, we close any blocks unmatched +in step 1 before creating the new block as a child of the last +matched block. + +3. Finally, we look at the remainder of the line (after block +markers like `>`, list markers, and indentation have been consumed). +This is text that can be incorporated into the last open +block (a paragraph, code block, header, or raw HTML). + +Setext headers are formed when we detect that the second line of +a paragraph is a setext header line. + +Reference link definitions are detected when a paragraph is closed; +the accumulated text lines are parsed to see if they begin with +one or more reference link definitions. Any remainder becomes a +normal paragraph. + We can see how this works by considering how the tree above is generated by four lines of Markdown: @@ -8094,7 +8125,7 @@ We thus obtain the final tree: "aliquando id" ``` -## From block structure to the final document {-} +## Phase 2: inline structure {-} Once all of the input has been parsed, all open blocks are closed. @@ -8125,5 +8156,123 @@ Notice how the [line ending] in the first paragraph has been parsed as a `softbreak`, and the asterisks in the first list item have become an `emph`. -The document can be rendered as HTML, or in any other format, given -an appropriate renderer. +### An algorithm for parsing nested emphasis and links {-} + +By far the trickiest part of inline parsing is handling emphasis, +strong emphasis, links, and images. This is done using the following +algorithm. + +When we're parsing inlines and we hit either + +- a run of `*` or `_` characters, or +- a `[` or `![` + +we insert a text node with these symbols as its literal content, and we +add a pointer to this text node to the [delimiter stack](@delimiter-stack). + +The [delimiter stack] is a doubly linked list. Each +element contains a pointer to a text node, plus information about + +- the type of delimiter (`[`, `![`, `*`, `_`) +- the number of delimiters, +- whether the delimiter is "active" (all are active to start), and +- whether the delimiter is a potential opener, a potential closer, + or both (which depends on what sort of characters precede + and follow the delimiters). + +When we hit a `]` character, we call the *look for link or image* +procedure (see below). + +When we hit the end of the input, we call the *process emphasis* +procedure (see below), with `stack_bottom` = NULL. + +#### *look for link or image* {-} + +Starting at the top of the delimiter stack, we look backwards +through the stack for an opening `[` or `![` delimiter. + +- If we don't find one, we return a literal text node `]`. + +- If we do find one, but it's not *active*, we remove the inactive + delimiter from the stack, and return a literal text node `]`. + +- If we find one and it's active, then we parse ahead to see if + we have an inline link/image, reference link/image, compact reference + link/image, or shortcut reference link/image. + + + If we don't, then we remove the opening delimiter from the + delimiter stack and return a literal text node `]`. + + + If we do, then + + * We return a link or image node whose children are the inlines + after the text node pointed to by the opening delimiter. + + * We run *process emphasis* on these inlines, with the `[` opener + as `stack_bottom`. + + * We remove the opening delimiter. + + * If we have a link (and not an image), we also set all + `[` delimiters before the opening delimiter to *inactive*. (This + will prevent us from getting links within links.) + +#### *process emphasis* {-} + +Parameter `stack_bottom` sets a lower bound to how far we +descend in the [delimiter stack]. If it is NULL, we can +go all the way to the bottom. Otherwise, we stop before +visiting `stack_bottom`. + +Let `current_position` point to the element on the [delimiter stack] +just above `stack_bottom` (or the first element if `stack_bottom` +is NULL). + +We keep track of the `openers_bottom` for each delimiter +type (`*`, `_`). Initialize this to `stack_bottom`. + +Then we repeat the following until we run out of potential +closers: + +- Move `current_position` forward in the delimiter stack (if needed) + until we find the first potential closer with delimiter `*` or `_`. + (This will be the potential closer closest + to the beginning of the input -- the first one in parse order.) + +- Now, look back in the stack (staying above `stack_bottom` and + the `openers_bottom` for this delimiter type) for the + first matching potential opener ("matching" means same delimiter). + +- If one is found: + + + Figure out whether we have emphasis or strong emphasis: + if both closer and opener spans have length >= 2, we have + strong, otherwise regular. + + + Insert an emph or strong emph node accordingly, after + the text node corresponding to the opener. + + + Remove any delimiters between the opener and closer from + the delimiter stack. + + + Remove 1 (for regular emph) or 2 (for strong emph) delimiters + from the opening and closing text nodes. If they become empty + as a result, remove them and remove the corresponding element + of the delimiter stack. If the closing node is removed, reset + `current_position` to the next element in the stack. + +- If none in found: + + + Set `openers_bottom` to the element before `current_position`. + (We know that there are no openers for this kind of closer up to and + including this point, so this puts a lower bound on future searches.) + + + If the closer at `current_position` is not a potential opener, + remove it from the delimiter stack (since we know it can't + be a closer either). + + + Advance `current_position` to the next element in the stack. + +After we're done, we remove all delimiters above `stack_bottom` from the +delimiter stack. + -- cgit v1.2.3