From d3c3e749f4f7b95a9604f751cf993fd488a15b19 Mon Sep 17 00:00:00 2001
From: John MacFarlane
Date: Tue, 7 Oct 2014 22:24:53 -0700
Subject: Cleaned up entity section of spec.

We convert entities to unicode characters, not UTF-8 sequences.
(Though they might ultimately be output that way.)
---
 spec.txt | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

(limited to 'spec.txt')

diff --git a/spec.txt b/spec.txt
index db62f53..489b9c0 100644
--- a/spec.txt
+++ b/spec.txt
@@ -3727,21 +3727,25 @@ foo
 
 ## Entities
 
-With the goal of making this standard as HTML-agnostic as possible, all HTML valid HTML Entities in any
-context are recognized as such and converted into their actual values (i.e. the UTF8 characters representing
-the entity itself) before they are stored in the AST.
+With the goal of making this standard as HTML-agnostic as possible, all
+valid HTML entities in any context are recognized as such and
+converted into unicode characters before they are stored in the AST.
 
-This allows implementations that target HTML output to trivially escape the entities when generating HTML,
-and simplifies the job of implementations targetting other languages, as these will only need to handle the
-UTF8 chars and need not be HTML-entity aware.
+This allows implementations that target HTML output to trivially escape
+the entities when generating HTML, and simplifies the job of
+implementations targetting other languages, as these will only need to
+handle the unicode chars and need not be HTML-entity aware.
 
 [Named entities](#name-entities) consist of `&`
-+ any of the valid HTML5 entity names + `;`. The [following document](http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json)
-is used as an authoritative source of the valid entity names and their corresponding codepoints.
++ any of the valid HTML5 entity names + `;`. The
+[following document](http://www.whatwg.org/specs/web-apps/current-work/multipage/entities.json)
+is used as an authoritative source of the valid entity names and their
+corresponding codepoints.
 
-Conforming implementations that target Markdown don't need to generate entities for all the valid
-named entities that exist, with the exception of `&quot;` (`"`), `&amp;` (`&`), `&lt;` (`<`) and `&gt;` (`>`),
-which always need to be written as entities for security reasons.
+Conforming implementations that target HTML don't need to generate
+entities for all the valid named entities that exist, with the exception
+of `&quot;` (`"`), `&amp;` (`&`), `&lt;` (`<`) and `&gt;` (`>`), which
+always need to be written as entities for security reasons.
 
 .
 &nbsp; &amp; &copy; &AElig; &Dcaron; &frac34; &HilbertSpace; &DifferentialD; &ClockwiseContourIntegral;
@@ -3750,9 +3754,10 @@ which always need to be written as entities for security reasons.
 .
 
 [Decimal entities](#decimal-entities)
-consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these entities need to be recognised
-and tranformed into their corresponding UTF8 codepoints. Invalid Unicode codepoints will be written
-as the "unknown codepoint" character (`0xFFFD`)
+consist of `&#` + a string of 1--8 arabic digits + `;`. Again, these
+entities need to be recognised and tranformed into their corresponding
+UTF8 codepoints. Invalid Unicode codepoints will be written as the
+"unknown codepoint" character (`0xFFFD`)
 
 .
 &#35; &#1234; &#992; &#98765432;
@@ -3779,7 +3784,8 @@ Here are some nonentities:
 .
 
 Although HTML5 does accept some entities without a trailing semicolon
-(such as `&copy`), these are not recognized as entities here, because it makes the grammar too ambiguous:
+(such as `&copy`), these are not recognized as entities here, because it
+makes the grammar too ambiguous:
 
 .
 &copy
@@ -3787,7 +3793,8 @@ Although HTML5 does accept some entities without a trailing semicolon
 <p>&amp;copy</p>
 .
 
-Strings that are not on the list of HTML5 named entities are not recognized as entities either:
+Strings that are not on the list of HTML5 named entities are not
+recognized as entities either:
 
 .
 &MadeUpEntity;
@@ -4836,7 +4843,7 @@ in Markdown:
 
 URL-escaping should be left alone inside the destination, as all
 URL-escaped characters are also valid URL characters. HTML entities in
-the destination will be parsed into their UTF-8 codepoints, as usual, and
+the destination will be parsed into their UTF-8 codepoints, as usual, and
 optionally URL-escaped when written as HTML.
 
 .
-- 
cgit v1.2.3
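
The entity rules this patch describes can be sketched in a few lines of Python. This is only an illustration, not the reference implementation. The function name `decode_entities` is hypothetical. Using Python's standard `html.entities.html5` table in place of the `entities.json` document is an assumption. Treating surrogates and values above `0x10FFFF` as the "invalid" code points is one interpretation of the spec text. Hexadecimal entities are outside this patch and are not handled.

```python
import re
from html.entities import html5  # keys look like "amp;" (and legacy "amp")

# A trailing semicolon is mandatory; decimal entities allow 1 to 8 digits.
_ENTITY = re.compile(r"&(?:#([0-9]{1,8})|([A-Za-z][A-Za-z0-9]*));")

REPLACEMENT = "\ufffd"  # the "unknown codepoint" character


def decode_entities(text: str) -> str:
    """Replace named and decimal entities with unicode characters."""

    def replace(match: re.Match) -> str:
        digits, name = match.group(1), match.group(2)
        if digits is not None:
            codepoint = int(digits)
            # Assumption: out-of-range values and surrogates count as invalid.
            if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
                return REPLACEMENT
            return chr(codepoint)
        # Names not in the HTML5 table are left untouched.
        return html5.get(name + ";", match.group(0))

    return _ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(decode_entities("&copy; &frac34; &MadeUpEntity; &copy"))
    print(decode_entities("&#35; &#1234; &#992; &#98765432;"))
```

Because the pattern requires the closing `;`, inputs such as `&copy` pass through unchanged, which matches the nonentity examples in the spec. A renderer that targets HTML would then escape `"`, `&`, `<` and `>` again on output.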