From 8b44dab7b3465445ac4137dc7893665f2336024b Mon Sep 17 00:00:00 2001 From: John MacFarlane Date: Tue, 23 Dec 2014 17:24:14 -0700 Subject: Added definitions of whitespace and other character classes. Closes #108. --- spec.txt | 162 +++++++++++++++++++++++++++++++++++++++------------------------ 1 file changed, 100 insertions(+), 62 deletions(-) diff --git a/spec.txt b/spec.txt index 3217e6c..bb7e620 100644 --- a/spec.txt +++ b/spec.txt @@ -189,17 +189,61 @@ Markdown, which can then be converted into other formats. In the examples, the `→` character is used to represent tabs. -# Preprocessing +# Preliminaries + +## Characters and lines + +The input is a sequence of zero or more [lines](#line). A [line](@line) is a sequence of zero or more [characters](#character) followed by a -line ending (CR, LF, or CRLF) or by the end of file. +[line ending](#line-ending) or by the end of file. A [character](@character) is a unicode code point. This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding. +A [line ending](@line-ending) is, depending on the platform, a +newline (`U+000A`), carriage return (`U+000D`), or +carriage return + newline. + +For security reasons, a conforming parser must strip or replace the +Unicode character `U+0000`. + +A line containing no characters, or a line containing only spaces +(`U+0020`) or tabs (`U+0009`), is called a [blank line](@blank-line). + +The following definitions of character classes will be used in this spec: + +A [whitespace character](@whitespace-character) is a space +(`U+0020`), tab (`U+0009`), carriage return (`U+000D`), or +newline (`U+000A`). + +[Whitespace](@whitespace) is a sequence of one or more [whitespace +characters](#whitespace-character). 
+ +A [unicode whitespace character](@unicode-whitespace-character) is +any code point in the unicode `Zs` class, or a tab (`U+0009`), +carriage return (`U+000D`), newline (`U+000A`), or form feed +(`U+000C`). + +[Unicode whitespace](@unicode-whitespace) is a sequence of one +or more [unicode whitespace characters](#unicode-whitespace-character). + +A [non-space character](@non-space-character) is anything but `U+0020`. + +A [punctuation character](@punctuation-character) is anything in +the unicode classes `Pc`, `Pd`, `Pe`, `Pf`, `Pi`, `Po`, or `Ps`. + +An [ASCII punctuation character](@ascii-punctuation-character) +is a [punctuation character](#punctuation-character) in the +ASCII class: that is, `!`, `"`, `#`, `$`, `%`, `&`, `'`, `(`, `)`, +`*`, `+`, `,`, `-`, `.`, `/`, `:`, `;`, `<`, `=`, `>`, `?`, `@`, +`[`, `\`, `]`, `^`, `_`, `` ` ``, `{`, `|`, `}`, or `~`. + +## Tab expansion + Tabs in lines are expanded to spaces, with a tab stop of 4 characters: . @@ -218,14 +262,6 @@ Tabs in lines are expanded to spaces, with a tab stop of 4 characters: . -Line endings are replaced by newline characters (LF). - -A line containing no characters, or a line containing only spaces (after -tab expansion), is called a [blank line](@blank-line). - -For security reasons, a conforming parser must strip or replace the -Unicode character `U+0000`. - # Blocks and inlines We can think of a document as a sequence of @@ -394,7 +430,8 @@ a------

---a---

. -It is required that all of the non-space characters be the same. +It is required that all of the +[non-space characters](#non-space-character) be the same. So, this is not a horizontal rule: . @@ -952,9 +989,9 @@ An [indented code block](@indented-code-block) is composed of one or more [indented chunks](#indented-chunk) separated by blank lines. An [indented chunk](@indented-chunk) is a sequence of non-blank lines, each indented four or more spaces. The contents of the code block are -the literal contents of the lines, including trailing newlines, -minus four spaces of indentation. An indented code block has no -attributes. +the literal contents of the lines, including trailing +[line endings](#line-ending), minus four spaces of indentation. +An indented code block has no attributes. An indented code block cannot interrupt a paragraph, so there must be a blank line between a paragraph and a following indented code block. @@ -1750,14 +1787,14 @@ So there is no important loss of expressive power with the new rule. ## Link reference definitions A [link reference definition](@link-reference-definition) -consists of a [link -label](#link-label), indented up to three spaces, followed -by a colon (`:`), optional blank space (including up to one -newline), a [link destination](#link-destination), optional -blank space (including up to one newline), and an optional [link +consists of a [link label](#link-label), indented up to three spaces, followed +by a colon (`:`), optional [whitespace](#whitespace) (including up to one +[line ending](#line-ending)), a [link destination](#link-destination), +optional [whitespace](#whitespace) (including up to one +[line ending](#line-ending)), and an optional [link title](#link-title), which if it is present must be separated -from the [link destination](#link-destination) by whitespace. -No further non-space characters may occur on the line. +from the [link destination](#link-destination) by [whitespace](#whitespace). 
+No further [non-space characters](#non-space-character) may occur on the line. A [link reference-definition](#link-reference-definition) does not correspond to a structural element of a document. Instead, it @@ -1874,7 +1911,7 @@ It contributes nothing to the document. . This is not a link reference definition, because there are -non-space characters after the title: +[non-space characters](#non-space-character) after the title: . [foo]: /url "title" ok @@ -2133,7 +2170,8 @@ The following rules define [block quotes](@block-quote): 2. **Laziness.** If a string of lines *Ls* constitute a [block quote](#block-quote) with contents *Bs*, then the result of deleting the initial [block quote marker](#block-quote-marker) from one or - more lines in which the next non-space character after the [block + more lines in which the next + [non-space character](#non-space-character) after the [block quote marker](#block-quote-marker) is [paragraph continuation text](#paragraph-continuation-text) is a block quote with *Bs* as its content. @@ -2494,7 +2532,8 @@ is a sequence of one of more digits (`0-9`), followed by either a The following rules define [list items](@list-item): 1. **Basic case.** If a sequence of lines *Ls* constitute a sequence of - blocks *Bs* starting with a non-space character and not separated + blocks *Bs* starting with a [non-space character](#non-space-character) + and not separated from each other by more than one blank line, and *M* is a list marker *M* of width *W* followed by 0 < *N* < 5 spaces, then the result of prepending *M* and the following spaces to the first line of @@ -2972,7 +3011,7 @@ Four spaces indent gives a code block: 4. 
**Laziness.** If a string of lines *Ls* constitute a [list item](#list-item) with contents *Bs*, then the result of deleting some or all of the indentation from one or more lines in which the - next non-space character after the indentation is + next [non-space character](#non-space-character) after the indentation is [paragraph continuation text](#paragraph-continuation-text) is a list item with the same contents and attributes. The unindented lines are called @@ -4174,11 +4213,11 @@ A [backtick string](@backtick-string) is a string of one or more backtick characters (`` ` ``) that is neither preceded nor followed by a backtick. -A [code span](@code-span) begins with a backtick string and ends with a backtick -string of equal length. The contents of the code span are the -characters between the two backtick strings, with leading and trailing -spaces and newlines removed, and consecutive spaces and newlines -collapsed to single spaces. +A [code span](@code-span) begins with a backtick string and ends with +a backtick string of equal length. The contents of the code span are +the characters between the two backtick strings, with leading and +trailing spaces and [line endings](#line-ending) removed, and +[whitespace](#whitespace) collapsed to single spaces. This is a simple code span: @@ -4206,7 +4245,7 @@ spaces:

``

. -Newlines are treated like spaces: +[Line endings](#line-ending) are treated like spaces: . `` @@ -4216,8 +4255,8 @@ foo

foo

. -Interior spaces and newlines are collapsed into single spaces, just -as they would be by a browser: +Interior spaces and [line endings](#line-ending) are collapsed into +single spaces, just as they would be by a browser: . `foo bar @@ -4231,13 +4270,13 @@ anyway? A: Because we might be targeting a non-HTML format, and we shouldn't rely on HTML-specific rendering assumptions. (Existing implementations differ in their treatment of internal -spaces and newlines. Some, including `Markdown.pl` and -`showdown`, convert an internal newline into a `<br />` tag. -But this makes things difficult for those who like to hard-wrap -their paragraphs, since a line break in the midst of a code -span will cause an unintended line break in the output. Others -just leave internal spaces as they are, which is fine if only -HTML is being targeted.) +spaces and [line endings](#line-ending). Some, including `Markdown.pl` and +`showdown`, convert an internal [line ending](#line-ending) into a +`<br />` tag. But this makes things difficult for those who like to +hard-wrap their paragraphs, since a line break in the midst of a code +span will cause an unintended line break in the output. Others just +leave internal spaces as they are, which is fine if only HTML is being +targeted.) . `foo `` bar` @@ -4355,34 +4394,32 @@ The following rules capture all of these patterns, while allowing for efficient parsing strategies that do not backtrack: 1. A single `*` character [can open emphasis](@can-open-emphasis) - iff it is not followed by whitespace. (For these purposes, - any unicode space character counts as whitespace.) + iff it is not followed by [unicode whitespace](#unicode-whitespace). 2. A single `_` character [can open emphasis](#can-open-emphasis) iff - it is not followed by whitespace and it is not preceded by an - ASCII alphanumeric character. + it is not followed by [unicode whitespace](#unicode-whitespace) + and it is not preceded by an ASCII alphanumeric character. 3. A single `*` character [can close emphasis](@can-close-emphasis) - iff it is not preceded by whitespace. + iff it is not preceded by [unicode whitespace](#unicode-whitespace). 4. A single `_` character [can close emphasis](#can-close-emphasis) iff - it is not preceded by whitespace and it is not followed by an - ASCII alphanumeric character. + it is not preceded by [unicode whitespace](#unicode-whitespace) + and it is not followed by an ASCII alphanumeric character. 5. A double `**` [can open strong emphasis](@can-open-strong-emphasis) - iff it is not followed by - whitespace. + iff it is not followed by [unicode whitespace](#unicode-whitespace). 6. A double `__` [can open strong emphasis](#can-open-strong-emphasis) - iff it is not followed by whitespace and it is not preceded by an - ASCII alphanumeric character. + iff it is not followed by [unicode whitespace](#unicode-whitespace) + and it is not preceded by an ASCII alphanumeric character. 7. 
A double `**` [can close strong emphasis](@can-close-strong-emphasis) - iff it is not preceded by whitespace. + iff it is not preceded by [unicode whitespace](#unicode-whitespace). 8. A double `__` [can close strong emphasis](#can-close-strong-emphasis) - iff it is not preceded by whitespace and it is not followed by an - ASCII alphanumeric character. + iff it is not preceded by [unicode whitespace](#unicode-whitespace) + and it is not followed by an ASCII alphanumeric character. 9. Emphasis begins with a delimiter that [can open emphasis](#can-open-emphasis) and ends with a delimiter that [can close @@ -6610,8 +6647,8 @@ baz baz

. -For a more visible alternative, a backslash before the newline may be -used instead of two spaces: +For a more visible alternative, a backslash before the +[line ending](#line-ending) may be used instead of two spaces: . foo\ @@ -6734,9 +6771,10 @@ foo A regular line break (not in a code span or HTML tag) that is not preceded by two or more spaces is parsed as a softbreak. (A -softbreak may be rendered in HTML either as a newline or as a space. -The result will be the same in browsers. In the examples here, a -newline will be used.) +softbreak may be rendered in HTML either as a +[line ending](#line-ending) or as a space. The result will be the same +in browsers. In the examples here, a [line ending](#line-ending) will +be used.) . foo @@ -6971,9 +7009,9 @@ document str "aliquando id" ``` -Notice how the newline in the first paragraph has been parsed as -a `softbreak`, and the asterisks in the first list item have become -an `emph`. +Notice how the [line ending](#line-ending) in the first paragraph has +been parsed as a `softbreak`, and the asterisks in the first list item +have become an `emph`. The document can be rendered as HTML, or in any other format, given an appropriate renderer. -- cgit v1.2.3
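
The character classes introduced by this patch can be sketched as small predicates. This is only an illustration, not code from any CommonMark implementation: every function and constant name below is hypothetical, and it assumes that the general categories reported by Python's `unicodedata` module correspond to the Unicode classes (`Zs`, `Pc`, `Pd`, ...) named in the spec text. ASCII punctuation is tested against the explicit list from the patch rather than derived from the `P*` categories, because several listed characters (for example `$` and `+`) fall in Unicode symbol categories.

```python
import string
import unicodedata

# Hypothetical names, chosen for this sketch only.
# Whitespace characters per the patch: space, tab, carriage return, newline.
WHITESPACE_CHARS = {"\u0020", "\u0009", "\u000D", "\u000A"}

# Extra code points that count as unicode whitespace besides the Zs class:
# tab, carriage return, newline, form feed.
EXTRA_UNICODE_WHITESPACE = {"\u0009", "\u000D", "\u000A", "\u000C"}

# Unicode general categories listed for "punctuation character".
PUNCTUATION_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def is_whitespace_char(ch: str) -> bool:
    """Space, tab, carriage return, or newline."""
    return ch in WHITESPACE_CHARS


def is_unicode_whitespace_char(ch: str) -> bool:
    """Any Zs code point, or tab, CR, newline, or form feed."""
    return unicodedata.category(ch) == "Zs" or ch in EXTRA_UNICODE_WHITESPACE


def is_non_space_char(ch: str) -> bool:
    """Anything but U+0020."""
    return ch != "\u0020"


def is_punctuation_char(ch: str) -> bool:
    """Anything in the categories Pc, Pd, Pe, Pf, Pi, Po, or Ps."""
    return unicodedata.category(ch) in PUNCTUATION_CATEGORIES


def is_ascii_punctuation_char(ch: str) -> bool:
    """One of the 32 ASCII characters enumerated in the patch.

    string.punctuation happens to contain exactly that list.
    """
    return ch in string.punctuation


def is_blank_line(line: str) -> bool:
    """A line with no characters, or only spaces and tabs.

    `line` is assumed to have had its line ending removed already.
    """
    return all(ch in "\u0020\u0009" for ch in line)
```

Note that under these definitions a form feed is unicode whitespace but not a whitespace character, and `$` is an ASCII punctuation character even though its Unicode category (`Sc`) is not one of the listed punctuation classes.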