Updates for new HTML block spec.

* Rewrote spec for HTML blocks. A few other spec examples also changed as a result.
* Removed old `html_block_tag` scanner. Added new `html_block_start` and `html_block_start_7`, as well as `html_block_end_n` for n = 1-5.
* Rewrote block parser for new HTML block spec.
author: John MacFarlane <jgm@berkeley.edu> 2015-07-08 17:42:22 -0700
committer: John MacFarlane <jgm@berkeley.edu> 2015-07-10 14:24:22 -0700
commit: 17e6720dd9b5d25aeb906bb23915a6ee13a07e3d (patch)
tree: 368489317ca19a0136bba3381be4ab219b1eaf21 /test
parent: 039098095da3a31dd338f2a1137e673d914489ea (diff)
1 files changed, 464 insertions, 80 deletions
diff --git a/test/spec.txt b/test/spec.txt
index 0c42aae..ed9b8e2 100644
--- a/test/spec.txt
+++ b/test/spec.txt
@@ -1,8 +1,8 @@
 ---
 title: CommonMark Spec
 author: John MacFarlane
-version: 0.20
-date: 2015-06-08
+version: 0.21-dev
+date:
 license: '[CC-BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/)'
 ...
 
@@ -237,7 +237,7 @@ or more [unicode whitespace character]s.
 
 A [space](@space) is `U+0020`.
 
-A [non-space character](@non-space-character) is any character
+A [non-whitespace character](@non-space-character) is any character
 that is not a [whitespace character].
 
 An [ASCII punctuation character](@ascii-punctuation-character)
@@ -474,7 +474,7 @@ a------
 <p>---a---</p>
 .
 
-It is required that all of the [non-space character]s be the same.
+It is required that all of the [non-whitespace character]s be the same.
 So, this is not a horizontal rule:
 
 .
@@ -564,7 +564,7 @@ consists of a string of characters, parsed as inline content, between an
 opening sequence of 1--6 unescaped `#` characters and an optional
 closing sequence of any number of `#` characters.  The opening sequence
 of `#` characters cannot be followed directly by a
-[non-space character]. The optional closing sequence of `#`s must be
+[non-whitespace character]. The optional closing sequence of `#`s must be
 preceded by a [space] and may be followed by spaces only.  The opening
 `#` character may be indented 0-3 spaces.  The raw contents of the
 header are stripped of leading and trailing spaces before being parsed
@@ -696,7 +696,7 @@ Spaces are allowed after the closing sequence:
 .
 
 A sequence of `#` characters with a
-[non-space character] following it
+[non-whitespace character] following it
 is not a closing sequence, but counts as part of the contents of the
 header:
 
@@ -765,7 +765,7 @@ ATX headers can be empty:
 ## Setext headers
 
 A [setext header](@setext-header)
-consists of a line of text, containing at least one [non-space character],
+consists of a line of text, containing at least one [non-whitespace character],
 with no more than 3 spaces indentation, followed by a [setext header
 underline].  The line of text must be
 one that, were it not followed by the setext header underline,
@@ -1593,27 +1593,65 @@ Closing code fences cannot have [info string]s:
 
 ## HTML blocks
 
-An [HTML block tag](@html-block-tag) is
-an [open tag] or [closing tag] whose tag
-name is one of the following (case-insensitive):
-`article`, `header`, `aside`, `hgroup`, `blockquote`, `hr`, `iframe`,
-`body`, `li`, `map`, `button`, `object`, `canvas`, `ol`, `caption`,
-`output`, `col`, `p`, `colgroup`, `pre`, `dd`, `progress`, `div`,
-`section`, `dl`, `table`, `td`, `dt`, `tbody`, `embed`, `textarea`,
-`fieldset`, `tfoot`, `figcaption`, `th`, `figure`, `thead`, `footer`,
-`tr`, `form`, `ul`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `video`,
-`script`, `style`.
-
-An [HTML block](@html-block) begins with an
-[HTML block tag], [HTML comment], [processing instruction],
-[declaration], or [CDATA section].
-It ends when a [blank line] or the end of the
-input is encountered.  The initial line may be indented up to three
-spaces, and subsequent lines may have any indentation.  The contents
-of the HTML block are interpreted as raw HTML, and will not be escaped
-in HTML output.
-
-Some simple examples:
+An [HTML block](@html-block) is a group of lines that is treated
+as raw HTML (and will not be escaped in HTML output).
+
+There are seven kinds of [HTML block], which can be defined
+by their start and end conditions.  The block begins with a line that
+meets a [start condition](@start-condition) (after up to three spaces
+optional indentation).  It ends with the first subsequent line that
+meets a matching [end condition](@end-condition), or the last line of
+the document, if no line is encountered that meets the
+[end condition].  If the first line meets both the [start condition]
+and the [end condition], the block will contain just that line.
+
+1.  **Start condition:**  line begins with the string `<script`,
+`<pre`, or `<style` (case-insensitive), followed by whitespace,
+the string `>`, or the end of the line.\
+**End condition:**  line contains an end tag
+`</script>`, `</pre>`, or `</style>` (case-insensitive; it
+need not match the start tag).
+
+2.  **Start condition:** line begins with the string `<!--`.\
+**End condition:**  line contains the string `-->`.
+
+3.  **Start condition:** line begins with the string `<?`.\
+**End condition:** line contains the string `?>`.
+
+4.  **Start condition:** line begins with the string `<!`
+followed by an uppercase ASCII letter.\
+**End condition:** line contains the character `>`.
+
+5.  **Start condition:**  line begins with the string
+`<![CDATA[`.\
+**End condition:** line contains the string `]]>`.
+
+6.  **Start condition:** line begins with the string `<` or `</`
+followed by one of the strings (case-insensitive) `address`,
+`article`, `aside`, `base`, `basefont`, `blockquote`, `body`,
+`caption`, `center`, `col`, `colgroup`, `dd`, `details`, `dialog`,
+`dir`, `div`, `dl`, `dt`, `fieldset`, `figcaption`, `figure`,
+`footer`, `form`, `frame`, `frameset`, `h1`, `head`, `header`, `hr`,
+`html`, `legend`, `li`, `link`, `main`, `menu`, `menuitem`, `meta`,
+`nav`, `noframes`, `ol`, `optgroup`, `option`, `p`, `param`, `pre`,
+`section`, `source`, `title`, `summary`, `table`, `tbody`, `td`,
+`tfoot`, `th`, `thead`, `title`, `tr`, `track`, `ul`, followed
+by [whitespace], the end of the line, the string `>`, or
+the string `/>`.\
+**End condition:** line is followed by a [blank line].
+
+7.  **Start condition:**  line begins with an [open tag]
+(with any [tag name]) followed only by [whitespace] or the end
+of the line.\
+**End condition:** line is followed by a [blank line].
+
+All types of [HTML blocks] except type 7 may interrupt
+a paragraph.  Blocks of type 7 may not interrupt a paragraph.
+(This restriction is intended to prevent unwanted interpretation
+of long tags inside a wrapped paragraph as starting HTML blocks.)
+
+Some simple examples follow.  Here are some basic HTML blocks
+of type 6:
 
 .
 <table>
@@ -1646,6 +1684,16 @@ okay.
          <foo><a>
 .
 
+A block can also start with a closing tag:
+
+.
+</div>
+*foo*
+.
+</div>
+*foo*
+.
+
 Here we have two HTML blocks with a Markdown paragraph between them:
 
 .
@@ -1660,7 +1708,94 @@ Here we have two HTML blocks with a Markdown paragraph between them:
 </DIV>
 .
 
-In the following example, what looks like a Markdown code block
+The tag on the first line can be partial, as long
+as it is split where there would be whitespace:
+
+.
+<div id="foo"
+  class="bar">
+</div>
+.
+<div id="foo"
+  class="bar">
+</div>
+.
+
+.
+<div id="foo" class="bar
+  baz">
+</div>
+.
+<div id="foo" class="bar
+  baz">
+</div>
+.
+
+An open tag need not be closed:
+.
+<div>
+*foo*
+
+*bar*
+.
+<div>
+*foo*
+<p><em>bar</em></p>
+.
+
+
+A partial tag need not even be completed (garbage
+in, garbage out):
+
+.
+<div id="foo"
+*hi*
+.
+<div id="foo"
+*hi*
+.
+
+.
+<div class
+foo
+.
+<div class
+foo
+.
+
+The initial tag doesn't even need to be a valid
+tag, as long as it starts like one:
+
+.
+<div *???-&&&-<---
+*foo*
+.
+<div *???-&&&-<---
+*foo*
+.
+
+In type 6 blocks, the initial tag need not be on a line by
+itself:
+
+.
+<div><a href="bar">*foo*</a></div>
+.
+<div><a href="bar">*foo*</a></div>
+.
+
+.
+<table><tr><td>
+foo
+</td></tr></table>
+.
+<table><tr><td>
+foo
+</td></tr></table>
+.
+
+Everything until the next blank line or end of document
+gets included in the HTML block.  So, in the following
+example, what looks like a Markdown code block
 is actually part of the HTML block, which continues until a blank
 line or the end of the document is reached:
 
@@ -1676,43 +1811,241 @@ int x = 33;
 ```
 .
 
-A comment:
+To start an [HTML block] with a tag that is *not* in the
+list of block-level tags in (6), you must put the tag by
+itself on the first line (and it must be complete):
+
+.
+<a href="foo">
+*bar*
+</a>
+.
+<a href="foo">
+*bar*
+</a>
+.
+
+In type 7 blocks, the [tag name] can be anything:
+
+.
+<Warning>
+*bar*
+</Warning>
+.
+<Warning>
+*bar*
+</Warning>
+.
+
+.
+<i class="foo">
+*bar*
+</i>
+.
+<i class="foo">
+*bar*
+</i>
+.
+
+These rules are designed to allow us to work with tags that
+can function as either block-level or inline-level tags.
+The `<del>` tag is a nice example.  We can surround content with
+`<del>` tags in three different ways.  In this case, we get a raw
+HTML block, because the `<del>` tag is on a line by itself:
+
+.
+<del>
+*foo*
+</del>
+.
+<del>
+*foo*
+</del>
+.
+
+In this case, we get a raw HTML block that just includes
+the `<del>` tag (because it ends with the following blank
+line).  So the contents get interpreted as CommonMark:
+
+.
+<del>
+
+*foo*
+
+</del>
+.
+<del>
+<p><em>foo</em></p>
+</del>
+.
+
+Finally, in this case, the `<del>` tags are interpreted
+as [raw HTML] *inside* the CommonMark paragraph.  (Because
+the tag is not on a line by itself, we get inline HTML
+rather than an [HTML block].)
+
+.
+<del>*foo*</del>
+.
+<p><del><em>foo</em></del></p>
+.
+
+HTML tags designed to contain literal content
+(`script`, `style`, `pre`), comments, processing instructions,
+and declarations are treated somewhat differently.
+Instead of ending at the first blank line, these blocks
+end at the first line containing a corresponding end tag.
+As a result, these blocks can contain blank lines:
+
+A pre tag (type 1):
+
+.
+<pre language="haskell"><code>
+import Text.HTML.TagSoup
+
+main :: IO ()
+main = print $ parseTags "<a href=\"foo\">bar</a>"
+</code></pre>
+.
+<pre language="haskell"><code>
+import Text.HTML.TagSoup
+
+main :: IO ()
+main = print $ parseTags "<a href=\"foo\">bar</a>"
+</code></pre>
+.
+
+A script tag (type 1):
+
+.
+<script type="text/javascript">
+// JavaScript example
+
+document.getElementById("demo").innerHTML = "Hello JavaScript!";
+</script>
+.
+<script type="text/javascript">
+// JavaScript example
+
+document.getElementById("demo").innerHTML = "Hello JavaScript!";
+</script>
+.
+
+A style tag (type 1):
+
+.
+<style
+  type="text/css">
+h1 {color:red;}
+
+p {color:blue;}
+</style>
+.
+<style
+  type="text/css">
+h1 {color:red;}
+
+p {color:blue;}
+</style>
+.
+
+If there is no matching end tag, the block will end at the
+end of the document:
+
+.
+<style
+  type="text/css">
+
+foo
+.
+<style
+  type="text/css">
+
+foo
+.
+
+The end tag can occur on the same line as the start tag:
+
+.
+<style>p{color:red;}</style>
+*foo*
+.
+<style>p{color:red;}</style>
+<p><em>foo</em></p>
+.
+
+.
+<!-- foo -->*bar*
+*baz*
+.
+<!-- foo -->*bar*
+<p><em>baz</em></p>
+.
+
+Note that anything on the last line after the
+end tag will be included in the [HTML block]:
+
+.
+<script>
+foo
+</script>1. *bar*
+.
+<script>
+foo
+</script>1. *bar*
+.
+
+A comment (type 2):
 
 .
 <!-- Foo
+
 bar
    baz -->
 .
 <!-- Foo
+
 bar
    baz -->
 .
 
-A processing instruction:
+
+A processing instruction (type 3):
 
 .
 <?php
+
   echo '>';
+
 ?>
 .
 <?php
+
   echo '>';
+
 ?>
 .
 
-CDATA:
+A declaration (type 4):
+
+.
+<!DOCTYPE html>
+.
+<!DOCTYPE html>
+.
+
+CDATA (type 5):
 
 .
 <![CDATA[
 function matchwo(a,b)
 {
-if (a < b && a < 0) then
-  {
-  return 1;
-  }
-else
-  {
-  return 0;
+  if (a < b && a < 0) then {
+    return 1;
+
+  } else {
+
+    return 0;
   }
 }
 ]]>
@@ -1720,13 +2053,12 @@ else
 <![CDATA[
 function matchwo(a,b)
 {
-if (a < b && a < 0) then
-  {
-  return 1;
-  }
-else
-  {
-  return 0;
+  if (a < b && a < 0) then {
+    return 1;
+
+  } else {
+
+    return 0;
   }
 }
 ]]>
@@ -1744,8 +2076,18 @@ The opening tag can be indented 1-3 spaces, but not 4:
 </code></pre>
 .
 
-An HTML block can interrupt a paragraph, and need not be preceded
-by a blank line.
+.
+  <div>
+
+    <div>
+.
+  <div>
+<pre><code>&lt;div&gt;
+</code></pre>
+.
+
+An HTML block of types 1--6 can interrupt a paragraph, and need not be
+preceded by a blank line.
 
 .
 Foo
@@ -1759,8 +2101,8 @@ bar
 </div>
 .
 
-However, a following blank line is always needed, except at the end of
-a document:
+However, a following blank line is needed, except at the end of
+a document, and except for blocks of types 1--5, above:
 
 .
 <div>
@@ -1774,14 +2116,16 @@ bar
 *foo*
 .
 
-An incomplete HTML block tag may also start an HTML block:
+HTML blocks of type 7 cannot interrupt a paragraph:
 
 .
-<div class
-foo
+Foo
+<a href="bar">
+baz
 .
-<div class
-foo
+<p>Foo
+<a href="bar">
+baz</p>
 .
 
 This rule differs from John Gruber's original Markdown syntax
@@ -1800,8 +2144,8 @@ here:
 - It requires a matching end tag, which it also does not allow to
   be indented.
 
-Indeed, most Markdown implementations, including some of Gruber's
-own perl implementations, do not impose these restrictions.
+Most Markdown implementations (including some of Gruber's own) do not
+respect all of these restrictions.
 
 There is one respect, however, in which Gruber's rule is more liberal
 than the one given here, since it allows blank lines to occur inside
@@ -1812,6 +2156,8 @@ if no matching end tag is found. Second, it provides a very simple
 and flexible way of including Markdown content inside HTML tags:
 simply separate the Markdown from the HTML using blank lines:
 
+Compare:
+
 .
 <div>
 
@@ -1824,8 +2170,6 @@ simply separate the Markdown from the HTML using blank lines:
 </div>
 .
 
-Compare:
-
 .
 <div>
 *Emphasized* text.
@@ -1869,11 +2213,37 @@ Hi
 </table>
 .
 
-Moreover, blank lines are usually not necessary and can be
-deleted.  The exception is inside `<pre>` tags; here, one can
-replace the blank lines with `&#10;` entities.
+There are problems, however, if the inner tags are indented
+*and* separated by spaces, as then they will be interpreted as
+an indented code block:
+
+.
+<table>
+
+  <tr>
+
+    <td>
+      Hi
+    </td>
+
+  </tr>
+
+</table>
+.
+<table>
+  <tr>
+<pre><code>&lt;td&gt;
+  Hi
+&lt;/td&gt;
+</code></pre>
+  </tr>
+</table>
+.
 
-So there is no important loss of expressive power with the new rule.
+Fortunately, blank lines are usually not necessary and can be
+deleted.  The exception is inside `<pre>` tags, but as described
+above, raw HTML blocks starting with `<pre>` *can* contain blank
+lines.
 
 ## Link reference definitions
 
@@ -1885,7 +2255,7 @@ optional [whitespace] (including up to one
 [line ending]), and an optional [link
 title], which if it is present must be separated
 from the [link destination] by [whitespace].
-No further [non-space character]s may occur on the line.
+No further [non-whitespace character]s may occur on the line.
 
 A [link reference definition]
 does not correspond to a structural element of a document.  Instead, it
@@ -2056,7 +2426,7 @@ bar
 .
 
 This is not a link reference definition, because there are
-[non-space character]s after the title:
+[non-whitespace character]s after the title:
 
 .
 [foo]: /url "title" ok
@@ -2323,7 +2693,7 @@ The following rules define [block quotes]:
 2.  **Laziness.**  If a string of lines *Ls* constitute a [block
     quote](#block-quotes) with contents *Bs*, then the result of deleting
     the initial [block quote marker] from one or
-    more lines in which the next [non-space character] after the [block
+    more lines in which the next [non-whitespace character] after the [block
     quote marker] is [paragraph continuation
     text] is a block quote with *Bs* as its content.
     [Paragraph continuation text](@paragraph-continuation-text) is text
@@ -2700,7 +3070,7 @@ is a sequence of one of more digits (`0-9`), followed by either a
 The following rules define [list items]:
 
 1.  **Basic case.**  If a sequence of lines *Ls* constitute a sequence of
-    blocks *Bs* starting with a [non-space character] and not separated
+    blocks *Bs* starting with a [non-whitespace character] and not separated
     from each other by more than one blank line, and *M* is a list
     marker of width *W* followed by 0 < *N* < 5 spaces, then the result
     of prepending *M* and the following spaces to the first line of
@@ -2758,7 +3128,7 @@ The most important thing to notice is that the position of
 the text after the list marker determines how much indentation
 is needed in subsequent blocks in the list item.  If the list
 marker takes up two spaces, and there are three spaces between
-the list marker and the next [non-space character], then blocks
+the list marker and the next [non-whitespace character], then blocks
 must be indented five spaces in order to fall under the list
 item.
 
@@ -2816,7 +3186,7 @@ put under the list item:
 
 It is tempting to think of this in terms of columns:  the continuation
 blocks must be indented at least to the column of the first
-[non-space character] after the list marker. However, that is not quite right.
+[non-whitespace character] after the list marker. However, that is not quite right.
 The spaces after the list marker determine how much relative indentation
 is needed.  Which column this indentation reaches will depend on
 how the list item is embedded in other constructions, as shown by
@@ -3069,7 +3439,7 @@ inside the code block:
 
 Note that rules #1 and #2 only apply to two cases:  (a) cases
 in which the lines to be included in a list item begin with a
-[non-space character], and (b) cases in which
+[non-whitespace character], and (b) cases in which
 they begin with an indented code
 block.  In a case like the following, where the first block begins with
 a three-space indent, the rules do not allow us to form a list item by
@@ -3301,7 +3671,7 @@ Four spaces indent gives a code block:
 5.  **Laziness.**  If a string of lines *Ls* constitute a [list
     item](#list-items) with contents *Bs*, then the result of deleting
     some or all of the indentation from one or more lines in which the
-    next [non-space character] after the indentation is
+    next [non-whitespace character] after the indentation is
     [paragraph continuation text] is a
     list item with the same contents and attributes.  The unindented
     lines are called
@@ -4360,7 +4730,7 @@ raw HTML:
 .
 <a href="/bar\/)">
 .
-<p><a href="/bar\/)"></p>
+<a href="/bar\/)">
 .
 
 But they work in all other contexts, including URLs and link titles,
@@ -4474,7 +4844,7 @@ code blocks, including raw HTML, URLs, [link title]s, and
 .
 <a href="&ouml;&ouml;.html">
 .
-<p><a href="&ouml;&ouml;.html"></p>
+<a href="&ouml;&ouml;.html">
 .
 
 .
@@ -6031,6 +6401,20 @@ in Markdown:
 <p><a href="foo):">link</a></p>
 .
 
+A link can contain fragment identifiers and queries:
+
+.
+[link](#fragment)
+
+[link](http://example.com#fragment)
+
+[link](http://example.com?foo=bar&baz#fragment)
+.
+<p><a href="#fragment">link</a></p>
+<p><a href="http://example.com#fragment">link</a></p>
+<p><a href="http://example.com?foo=bar&amp;baz#fragment">link</a></p>
+.
+
 Note that a backslash before a non-escapable character is
 just a backslash:
 
@@ -6245,7 +6629,7 @@ that [matches] a [link reference definition] elsewhere in the document.
 
 A [link label](@link-label)  begins with a left bracket (`[`) and ends
 with the first right bracket (`]`) that is not backslash-escaped.
-Between these brackets there must be at least one non-[whitespace character].
+Between these brackets there must be at least one [non-whitespace character].
 Unescaped square bracket characters are not allowed in
 [link label]s.  A link label can have at most 999
 characters inside the square brackets.
@@ -6492,7 +6876,7 @@ backslash-escaped:
 <p><a href="/uri">foo</a></p>
 .
 
-A [link label] must contain at least one non-[whitespace character]:
+A [link label] must contain at least one [non-whitespace character]:
 
 .
 []
@@ -7107,7 +7491,7 @@ consists of `"`, zero or more
 characters not including `"`, and a final `"`.
 
 An [open tag](@open-tag) consists of a `<` character, a [tag name],
-zero or more [attributes], optional [whitespace], an optional `/`
+zero or more [attributes](@attribute), optional [whitespace], an optional `/`
 character, and a `>` character.
 
 A [closing tag](@closing-tag) consists of the string `</`, a
@@ -7220,8 +7604,8 @@ Closing tags:
 </a>
 </foo >
 .
-<p></a>
-</foo ></p>
+</a>
+</foo >
 .
 
 Illegal attributes in closing tag:
@@ -7288,7 +7672,7 @@ Entities are preserved in HTML attributes:
 .
 <a href="&ouml;">
 .
-<p><a href="&ouml;"></p>
+<a href="&ouml;">
 .
 
 Backslash escapes do not work in HTML attributes:
@@ -7296,7 +7680,7 @@ Backslash escapes do not work in HTML attributes:
 .
 <a href="\*">
 .
-<p><a href="\*"></p>
+<a href="\*">
 .
 
 .
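
The seven start conditions and their end conditions introduced by this patch can be sketched in a few lines of Python. This is only an illustration of the rules as worded in the diff, not the scanner or block parser code from this commit. The names `html_block_type`, `html_block_end`, and the `in_paragraph` flag are hypothetical, the `OPEN_TAG` pattern is a simplified approximation of the spec's [open tag] grammar, and the tag list is copied from the diff as-is.

```python
import re

# Tag names for start condition 6, copied from the list in the diff above.
BLOCK_TAGS = (
    "address article aside base basefont blockquote body caption center col "
    "colgroup dd details dialog dir div dl dt fieldset figcaption figure "
    "footer form frame frameset h1 head header hr html legend li link main "
    "menu menuitem meta nav noframes ol optgroup option p param pre section "
    "source title summary table tbody td tfoot th thead tr track ul"
).split()

# Simplified approximation of the spec's [open tag] grammar (an assumption).
ATTRIBUTE = r"""\s+[A-Za-z_:][A-Za-z0-9_.:-]*(?:\s*=\s*(?:[^\s"'=<>`]+|'[^']*'|"[^"]*"))?"""
OPEN_TAG = r"<[A-Za-z][A-Za-z0-9]*(?:" + ATTRIBUTE + r")*\s*/?>"

# Start conditions, tried in order, so `<pre` is type 1 rather than type 6.
START = [
    (1, re.compile(r"<(?:script|pre|style)(?:\s|>|$)", re.I)),
    (2, re.compile(r"<!--")),
    (3, re.compile(r"<\?")),
    (4, re.compile(r"<![A-Z]")),
    (5, re.compile(r"<!\[CDATA\[")),
    (6, re.compile(r"</?(?:" + "|".join(BLOCK_TAGS) + r")(?:\s|/?>|$)", re.I)),
    (7, re.compile(OPEN_TAG + r"\s*$")),
]

# End conditions for types 1-5; types 6 and 7 end before a blank line.
END = {
    1: re.compile(r"</(?:script|pre|style)>", re.I),
    2: re.compile(r"-->"),
    3: re.compile(r"\?>"),
    4: re.compile(r">"),
    5: re.compile(r"\]\]>"),
}


def html_block_type(line, in_paragraph=False):
    """Return the HTML block type (1-7) that `line` starts, or None."""
    stripped = re.sub(r"^ {0,3}", "", line)  # up to three spaces of indentation
    for kind, pattern in START:
        if pattern.match(stripped):
            # Type 7 may not interrupt a paragraph.
            if kind == 7 and in_paragraph:
                return None
            return kind
    return None


def html_block_end(lines, start):
    """Return the index of the last line of the block starting at lines[start]."""
    kind = html_block_type(lines[start])
    if kind is None:
        raise ValueError("lines[start] does not start an HTML block")
    for i in range(start, len(lines)):
        if kind <= 5:
            # The end condition may be met on the start line itself.
            if END[kind].search(lines[i]):
                return i
        elif i + 1 < len(lines) and lines[i + 1].strip() == "":
            return i
    return len(lines) - 1  # no end condition met: block runs to end of input
```

For instance, `html_block_type("<del>")` gives 7 while `html_block_type("<del>*foo*</del>")` gives None, which mirrors the `<del>` examples in the patch: the first is an HTML block, the second is left to be parsed as a paragraph containing inline raw HTML.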