Age | Commit message (Collapse) | Author |
|
Note that our current procedure for removing nulls is not
working properly.
|
Also renamed tools/make_entities_h.py -> tools/make_entities_inc.py.
|
Also added the command line option `--validate-utf8`.
This option causes cmark to check for valid UTF-8,
replacing invalid sequences with the replacement
character, U+FFFD.
Reinstated API tests for UTF-8.
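A minimal sketch of such a validation pass (hypothetical helper names; a real validator is stricter, also rejecting overlong encodings and surrogate code points, which this simplified lead-byte check lets through):

```c
#include <stdint.h>
#include <string.h>

/* Length of a UTF-8 sequence given its lead byte, or 0 if invalid. */
static int utf8_seq_len(uint8_t b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Copy `len` bytes of `src` into `dst`, replacing each invalid UTF-8
 * sequence with U+FFFD (0xEF 0xBF 0xBD).  `dst` must have room for
 * 3 * len bytes in the worst case.  Returns the output length. */
size_t utf8_clean(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t i = 0, out = 0;
    while (i < len) {
        int n = utf8_seq_len(src[i]);
        int ok = n > 0 && i + (size_t)n <= len;
        for (int k = 1; ok && k < n; k++)
            ok = (src[i + k] & 0xC0) == 0x80;  /* continuation byte? */
        if (ok) {
            memcpy(dst + out, src + i, (size_t)n);
            out += (size_t)n;
            i += (size_t)n;
        } else {
            memcpy(dst + out, "\xEF\xBF\xBD", 3);  /* U+FFFD */
            out += 3;
            i += 1;  /* skip one bad byte and resync */
        }
    }
    return out;
}
```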
|
Use S_is_line_end_char.
|
This gives bad results in parsing reference links,
where we might have trailing blanks.
(finalize in blocks.c removes the bytes parsed as
a reference definition; before this change, some
blank bytes might remain on the line.)
|
We no longer validate utf8 before parsing.
|
We now replace null characters in the line splitting code.
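One way to do the replacement during line splitting (hypothetical function name, not cmark's actual code): emit U+FFFD in place of each NUL byte as the line is copied into the parser's buffer.

```c
#include <string.h>

/* Append `len` bytes of `line` to `out`, writing U+FFFD (EF BF BD)
 * in place of each NUL byte.  Returns the number of bytes written;
 * `out` must have room for 3 * len bytes in the worst case. */
size_t append_line_without_nulls(const char *line, size_t len, char *out) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (line[i] == '\0') {
            memcpy(out + n, "\xEF\xBF\xBD", 3);  /* U+FFFD */
            n += 3;
        } else {
            out[n++] = line[i];
        }
    }
    return n;
}
```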
|
This change will need to be ported to CommonMark if we
do this.
We no longer replace tabs with spaces.
Rather, we treat tabs as equivalent to spaces for purposes
of determining structure. The tab stop is still 4.
Tabs in the text remain in the text.
|
Now it just replaces bad UTF-8 sequences and NULLs.
This restores benchmarks to near their previous levels.
|
We no longer preprocess tabs to spaces before parsing.
Instead, we keep track of both the byte offset and
the (virtual) column as we parse block starts.
This allows us to handle tabs without converting
to spaces first. Tabs are left as tabs in the output.
Added `column` and `first_nonspace_column` fields to `parser`.
Added utility function to advance the offset, computing
the virtual column too.
Note that we don't need to deal with UTF-8 here at all.
Only ASCII occurs in block starts.
Significant performance improvement due to the fact that
we're not doing UTF-8 validation -- though we might want
to add that back in.
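The offset/column bookkeeping can be sketched like this (hypothetical names; cmark's actual parser keeps this state in its parser struct alongside `first_nonspace_column`):

```c
#include <stddef.h>

#define TAB_STOP 4

/* Stripped-down parser cursor: a byte offset into the line plus the
 * virtual column it corresponds to once tabs are expanded. */
typedef struct {
    size_t offset;  /* byte position in the line */
    int column;     /* virtual column, with tabs expanded to TAB_STOP */
} cursor;

/* Advance over up to `count` bytes, updating the virtual column.
 * A tab jumps to the next multiple of TAB_STOP; any other byte adds
 * one column.  Block starts are ASCII-only, so no UTF-8 decoding is
 * needed here. */
static void advance(cursor *c, const char *line, int count) {
    while (count-- > 0 && line[c->offset] != '\0') {
        if (line[c->offset] == '\t')
            c->column += TAB_STOP - (c->column % TAB_STOP);
        else
            c->column += 1;
        c->offset += 1;
    }
}
```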
|
We dispense with the hashes and just do string comparisons.
Since the array is in order, we can search intelligently
and should never need to do more than 8 or so comparisons.
This reduces binary size even further, at a small cost
in performance. (This shouldn't matter too much, as
it's only detectable in really entity-heavy sources.)
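Binary search over the sorted array is one way to get the small comparison count described above; this is an illustrative sketch (four-entry stand-in table and hypothetical function name, not the generated table in `src/entities.h`):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative excerpt; the generated table in src/entities.h holds
 * the full HTML5 entity list, sorted by name. */
typedef struct { const char *name; const char *utf8; } entity;

static const entity entities[] = {
    {"amp",  "&"},
    {"gt",   ">"},
    {"lt",   "<"},
    {"quot", "\""},
};

/* Binary search by name: for N entries we do about log2(N) strcmp
 * calls, so even a table of ~2000 entities needs only ~11. */
static const char *lookup_entity(const char *name) {
    int lo = 0;
    int hi = (int)(sizeof(entities) / sizeof(entities[0])) - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(name, entities[mid].name);
        if (cmp == 0)
            return entities[mid].utf8;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return NULL;  /* not a recognized entity name */
}
```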
|
At least with valid data.
|
This reverts commit e113185554c4d775e6fca0596011b405fa1700a5.
|
We now use -1 instead of 0 to indicate leaf nodes.
|
The primary advantage is a big reduction in the size of
the compiled library and executable (> 100K).
There should be no measurable performance difference in
normal documents. I detected a slight performance
hit (around 5%) in a file containing 1,000,000 entities.
* Removed `src/html_unescape.gperf` and `src/html_unescape.h`.
* Added `src/entities.h` (generated by `tools/make_entities_h.py`).
* Added binary tree lookup functions to `houdini_html_u.c`, and
use the data in `src/entities.h`.
|
```
[ref]: url
"title" ok
```
Here we should parse the first line as a reference.
|
See jgm/commonmark#45.
|
The old one had many errors.
The new one is derived from the list in the npm entities package.
Since the sequences can now be longer (multi-code-point), we
have bumped the length limit from 4 to 8, which also affects
houdini_html_u.c.
An example of the kind of error that was fixed is given
in jgm/commonmark.js#47: `≧̸` should be rendered as "≧̸" (U+2267
U+0338), but it was actually rendered as "≧" (which is the same as
`≧`).
|
This isn't actually needed.
|
It breaks on Windows.
|
Removed sundown from the benchmarks, because its reading was anomalous.
This commit in hoedown caused the speed difference between
sundown and hoedown that I was measuring before (on 32-bit
machines):
https://github.com/hoedown/hoedown/commit/ca829ff83580ed52cc56c09a67c80119026bae20
As Nick Wellnhofer explains: "The commit removes a rather arbitrary
limit of 16MB for buffers. Your benchmark input probably results in
a buffer larger than 16MB. It also seems that hoedown didn't check
error returns thoroughly at the time of the commit. This basically means
that large input files could produce any kind of random behavior before
that commit, and that any benchmark that results in a too large buffer
can't be relied on."
|
Now we have an array of pointers (`potential_openers`),
keyed to the delim char.
When we've failed to match a potential opener prior to point X
in the delimiter stack, we reset `potential_openers` for that opener
type to X, and thus avoid having to look again through all the openers
we've already rejected.
See jgm/commonmark#43.
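A stripped-down sketch of this bookkeeping (hypothetical struct and a two-slot table; the real inline parser keys a pointer per delimiter character and tracks more state per stack entry):

```c
#include <stddef.h>

/* Hypothetical, stripped-down delimiter stack entry. */
typedef struct delim {
    struct delim *previous;
    char ch;        /* delimiter char, '*' or '_' */
    int can_open;
} delim;

/* One "bottom" pointer per delimiter char: below this point, every
 * potential opener for that char has already been rejected. */
static delim *potential_openers[2];

static int slot(char ch) { return ch == '*' ? 0 : 1; }

/* Search downward from `closer` for a matching opener, stopping at
 * the recorded bottom instead of rescanning rejected entries. */
static delim *find_opener(delim *closer) {
    delim *bottom = potential_openers[slot(closer->ch)];
    for (delim *d = closer->previous; d != NULL && d != bottom;
         d = d->previous)
        if (d->can_open && d->ch == closer->ch)
            return d;
    /* Nothing matched: future closers of this char need not look
     * below this point again. */
    potential_openers[slot(closer->ch)] = closer;
    return NULL;
}
```

This turns the worst case for pathological inputs from quadratic scans of the delimiter stack into roughly linear work per delimiter character.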
|
"*a_ " * 20000
See jgm/commonmark#43.
|
This way tests fail instead of just hanging.
Currently we use a 1 sec timeout.
Added a failing test from jgm/commonmark#43.
|
Many link closers with no openers.
Many link openers with no closers.
Many emph openers with no closers.
|
Many closers with no openers.
|
When they have no matching openers and cannot be openers themselves,
we can safely remove them.
This helps with a performance case: "a_ " * 20000.
See jgm/commonmark.js#43.
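Removing such a dead delimiter is a plain doubly-linked-list unlink; a sketch with hypothetical types:

```c
#include <stddef.h>

/* Hypothetical doubly linked delimiter-list node. */
typedef struct delim {
    struct delim *prev, *next;
    char ch;
    int can_open, can_close;
} delim;

/* Unlink `d` from the list.  Safe for a closer that matched no opener
 * and cannot itself open: it can never participate in emphasis, so
 * dropping it shortens every later scan over the list. */
static void remove_delimiter(delim **head, delim *d) {
    if (d->prev != NULL)
        d->prev->next = d->next;
    else
        *head = d->next;
    if (d->next != NULL)
        d->next->prev = d->prev;
}
```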
|
This reverts commit 54d1249c2caebf45a24d691dc765fb93c9a5e594, reversing
changes made to bc14d869323650e936c7143dcf941b28ccd5b57d.
|
Further optimize utf8proc_valid
|
Assume a multi-byte sequence and rework switch statement into if/else
for another 2% speedup.
|
Optimize utf8proc_detab
|