Age | Commit message | Author |
|
Now it just replaces bad UTF-8 sequences and NUL characters.
This restores benchmarks to near their previous levels.
|
|
We no longer preprocess tabs to spaces before parsing.
Instead, we keep track of both the byte offset and
the (virtual) column as we parse block starts.
This allows us to handle tabs without converting
to spaces first. Tabs are left as tabs in the output.
Added `column` and `first_nonspace_column` fields to `parser`.
Added utility function to advance the offset, computing
the virtual column too.
Note that we don't need to deal with UTF-8 here at all.
Only ASCII occurs in block starts.
Significant performance improvement due to the fact that
we're not doing UTF-8 validation -- though we might want
to add that back in.
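The offset/column bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the `line_pos` struct, the `advance_offset` name and signature, and the tab stop of 4 are assumptions made for the example.

```c
#include <stddef.h>

#define TAB_STOP 4 /* assumed tab stop for this sketch */

/* Hypothetical, trimmed-down stand-in for the parser state. */
typedef struct {
    size_t offset; /* byte offset into the current line */
    size_t column; /* virtual column, as if tabs had been expanded */
} line_pos;

/* Advance by up to `count` bytes of `line`, updating the virtual column.
 * A tab moves the column to the next multiple of TAB_STOP; any other
 * byte moves it by one. Stops early at the end of the line. */
static void advance_offset(line_pos *pos, const char *line, size_t len,
                           size_t count)
{
    while (count > 0 && pos->offset < len) {
        if (line[pos->offset] == '\t') {
            pos->column += TAB_STOP - (pos->column % TAB_STOP);
        } else {
            pos->column += 1;
        }
        pos->offset += 1;
        count -= 1;
    }
}
```

Because only ASCII occurs in block starts, stepping one byte at a time is enough here; no UTF-8 decoding is involved.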
|
|
We dispense with the hashes and just do string comparisons.
Since the array is in order, we can search intelligently
and should never need to do more than 8 or so comparisons.
This reduces binary size even further, at a small cost
in performance. (This shouldn't matter too much, as
it's only detectable in really entity-heavy sources.)
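A minimal sketch of searching a sorted array with plain string comparisons, assuming a simple name/replacement table. The `entity` struct, the four-entry `sample_entities` table, and `lookup_entity` are made up for the illustration and do not reflect the real generated data or lookup code.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative entry type; the real table layout may differ. */
typedef struct {
    const char *name; /* entity name without '&' and ';' */
    const char *utf8; /* replacement text, UTF-8 encoded */
} entity;

/* Tiny sample table, sorted by name in strcmp (byte) order. */
static const entity sample_entities[] = {
    {"amp", "&"},
    {"gt", ">"},
    {"lt", "<"},
    {"quot", "\""},
};

/* Binary search over the sorted table; returns NULL if not found. */
static const char *lookup_entity(const char *name)
{
    size_t lo = 0;
    size_t hi = sizeof(sample_entities) / sizeof(sample_entities[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, sample_entities[mid].name);
        if (cmp == 0)
            return sample_entities[mid].utf8;
        if (cmp < 0)
            hi = mid;     /* name sorts before the middle entry */
        else
            lo = mid + 1; /* name sorts after the middle entry */
    }
    return NULL;
}
```

Each step halves the remaining range, so the number of comparisons grows with the logarithm of the table size.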
|
|
|
|
We now use -1 instead of 0 to indicate leaf nodes.
|
|
|
|
The primary advantage is a big reduction in the size of
the compiled library and executable (> 100K).
There should be no measurable performance difference in
normal documents. I detected a slight performance
hit (around 5%) in a file containing 1,000,000 entities.
* Removed `src/html_unescape.gperf` and `src/html_unescape.h`.
* Added `src/entities.h` (generated by `tools/make_entities_h.py`).
* Added binary tree lookup functions to `houdini_html_u.c`, and
use the data in `src/entities.h`.
|
|
```
[ref]: url
"title" ok
```
Here we should parse the first line as a reference.
|
|
|
|
See jgm/commonmark#45.
|
|
The old one had many errors.
The new one is derived from the list in the npm entities package.
Since the sequences can now be longer (multi-code-point), we
have bumped the length limit from 4 to 8, which also affects
houdini_html_u.c.
An example of the kind of error that was fixed is given
in jgm/commonmark.js#47: `&ngE;` should be rendered as "≧̸" (U+02267
U+00338), but it's actually rendered as "≧" (which is the same as
`&gE;`).
|
|
This isn't actually needed.
|
|
|
|
Now we have an array of pointers (`potential_openers`),
keyed to the delim char.
When we've failed to match a potential opener prior to point X
in the delimiter stack, we reset `potential_openers` for that opener
type to X, and thus avoid having to look again through all the openers
we've already rejected.
See jgm/commonmark#43.
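The idea can be sketched on a heavily simplified model. Everything here is an assumption for illustration: the delimiter stack is a flat array, the `delim` struct and `match_delims` are hypothetical, and the real emphasis rules (including removal of delimiters between a matched pair) are left out. Only the per-character lower bound on the backward search is shown.

```c
/* Hypothetical, simplified delimiter record. */
typedef struct {
    char ch;       /* delimiter character, e.g. '*' or '_' */
    int can_open;  /* may act as an opener */
    int can_close; /* may act as a closer */
    int matched;   /* already paired up */
} delim;

/* Pair closers with earlier openers of the same character.
 * bottom[c] is the lowest index still worth searching for character c:
 * once a search for c fails at position i, nothing below i can match a
 * later closer of that type either, so the bound is raised to i. */
static int match_delims(delim *d, int n)
{
    int bottom[256] = {0};
    int pairs = 0;

    for (int i = 0; i < n; i++) {
        if (!d[i].can_close)
            continue;
        unsigned char c = (unsigned char)d[i].ch;
        int j;
        for (j = i - 1; j >= bottom[c]; j--) {
            if (d[j].ch == d[i].ch && d[j].can_open && !d[j].matched)
                break;
        }
        if (j >= bottom[c]) {
            d[j].matched = 1;
            d[i].matched = 1;
            pairs++;
        } else {
            bottom[c] = i; /* skip the rejected openers from now on */
        }
    }
    return pairs;
}
```

Without the bound, a long run of unmatched closers would rescan the same rejected openers every time, which is the quadratic behaviour this change avoids.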
|
|
|
|
|
|
When they have no matching openers and cannot be openers themselves,
we can safely remove them.
This helps with a performance case: "a_ " * 20000.
See jgm/commonmark.js#43.
|
|
This reverts commit 54d1249c2caebf45a24d691dc765fb93c9a5e594, reversing
changes made to bc14d869323650e936c7143dcf941b28ccd5b57d.
|
|
Assume a multi-byte sequence and rework switch statement into if/else
for another 2% speedup.
|
|
Speeds up "make bench" by another percent.
|
|
Handle valid UTF-8 chars inside the main loop and avoid a call to
strbuf_put for every UTF-8 char.
Results in an 8% speedup in the UTF-8-heavy "make bench" on my system.
|
|
|
|
|
|
|
|
|
|
|
|
Guard against too large chunks passed via the API.
|
|
There are probably a couple of places I missed. But this will only
be a problem if we use a 64-bit bufsize_t at some point. Then, we'll
get warnings from -Wshorten-64-to-32.
|
|
|
|
|
|
This function was missing a couple of range checks that I'm too lazy
to fix.
|
|
Avoid potential overflow and allow for different bufsize types.
|
|
|
|
Replace macro ENSURE_SIZE with inline function S_strbuf_grow_by that
checks for overflow.
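An overflow-checked size addition of the kind such a function needs might look like this. It is a sketch under assumptions: `bufsize_t` is taken to be a signed 32-bit type, and `checked_add` is a hypothetical helper, not the actual body of `S_strbuf_grow_by`.

```c
#include <stdint.h>

typedef int32_t bufsize_t;    /* assumed buffer size type */
#define BUFSIZE_MAX INT32_MAX /* assumed upper limit */

/* Store size + add in *out and return 1 if the sum fits in bufsize_t;
 * return 0 (leaving *out untouched) if it would overflow or if either
 * argument is negative. */
static int checked_add(bufsize_t size, bufsize_t add, bufsize_t *out)
{
    if (size < 0 || add < 0 || add > BUFSIZE_MAX - size)
        return 0;
    *out = size + add;
    return 1;
}
```

The comparison is written as `add > BUFSIZE_MAX - size` so that the check itself cannot overflow.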
|
|
cmark_strbuf_grow will never truncate a buffer.
|
|
This simplifies overflow checks.
|
|
|
|
Always add 50% on top of target size. No need for a loop.
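A loop-free growth rule of this shape could be sketched as below. The helper name `alloc_size_for`, the use of `size_t`, and the clamp on overflow are assumptions for the example; any extra rounding the real code may apply is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pick an allocation size for a requested target
 * by adding 50% headroom in one step instead of growing in a loop. */
static size_t alloc_size_for(size_t target)
{
    size_t extra = target / 2;     /* 50% on top of the target */
    if (extra > SIZE_MAX - target) /* sum would overflow: clamp */
        return SIZE_MAX;
    return target + extra;
}
```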
|
|
This makes it easier to change the type later.
No functional change. The rest of the code base still has to be
adjusted to use the new type.
Also add some TODO comments in buffer.c.
|
|
|
|
Users of the strbuf API are supposed to check for an OOM condition
after appending to strbufs, but:
* This is never done in the whole code base.
* The implementation was flawed because only `ptr` was set to the
OOM value without adjusting `size` and `asize`. After an error,
subsequent calls could very well lead to segfaults, contrary to the
documentation.
Change the code to always abort on errors with a message printed to
stderr. The only alternative is to propagate errors throughout the
whole library which seems infeasible.
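The abort-on-error policy can be illustrated with a small allocation wrapper. The name `xrealloc` and the message text are assumptions for this sketch, not the code from this commit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: resize an allocation or terminate the process.
 * Callers never see a NULL result for a non-zero size, so they do not
 * need to check for an out-of-memory condition themselves. */
static void *xrealloc(void *ptr, size_t size)
{
    void *p = realloc(ptr, size);
    if (p == NULL && size != 0) {
        fputs("out of memory\n", stderr);
        abort();
    }
    return p;
}
```

The trade-off is that a library user cannot recover from an allocation failure, in exchange for not having to thread error codes through every call that appends to a buffer.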
|
|
fix ENSURE_SIZE to actually check left arg length.
|
|
|
|
|
|
Added fields `offset`, `first_nonspace`, `indent`, and `blank`
to `cmark_parser` struct.
This just removes some repetition in the code.
|
|
|
|
This fixes cases like:
```
1. a
2. b
3. c
```
|
|
See jgm/CommonMark#322.
|
|
|
|
From btrask's alternate code in the comment on
https://github.com/jgm/cmark/pull/18.
Note: this gives a 1-2% performance boost in our benchmark,
probably enough to make it worthwhile.
|