Fix #220 (hash collisions for references).

This commit ports Vicent Marti's fix in cmark-gfm.
(384cc9db4cd7a90f59c0751e58eb7b3023d38b85)

His commit message follows:

As explained on the previous commit, it is trivial to DoS the CMark
parser by generating a document where all the link reference names hash
to the same bucket in the hash table.

This will cause the lookup process for each reference to take linear
time on the amount of references in the document, and with enough link
references to lookup, the end result is a pathological O(N^2) that
causes medium-sized documents to finish parsing in 5+ minutes.

To avoid this issue, we propose the present commit. Based on the fact
that all reference lookup/resolution in a Markdown document is always
performed as a last step during the parse process, we've reimplemented
reference storage as follows:

1. New references are always inserted at the end of a linked list. This
   is an O(1) operation, and does not check whether an existing
   (duplicate) reference with the same label already exists in the
   document.

2. Upon the first call to `cmark_reference_lookup` (when it is expected
   that no further references will be added to the reference map), the
   linked list of references is written into a fixed-size array.

3. The fixed size array can then be efficiently sorted in-place in
   O(n log n). This operation only happens once. We perform this sort
   in a _stable_ manner to ensure that the earliest link reference in
   the document always has preference, as the spec dictates. To
   accomplish this, every reference is tagged with a generation number
   when initially inserted in the linked list.

4. The sorted array is then compacted in O(n). Since it was sorted in a
   stable way, the first reference for each label is preserved and the
   duplicates are removed, matching the spec.

5. We can now simply perform a binary search for the current
   `cmark_reference_lookup` query in O(log n). Any further lookup calls
   will also be O(log n), since the sorted references table only needs
   to be generated once.

The resulting implementation is notably simple (as it uses standard
library builtins `qsort` and `bsearch`), whilst performing better than
the fixed size hash table in documents that have a high number of
references and never becoming pathological regardless of the input.
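
The five steps above can be sketched in a few dozen lines of C. This is
an illustrative sketch only, not the code from src/references.c: the
`ref` and `refmap` types and the `refmap_*` functions are hypothetical
names, labels are compared with plain `strcmp` (label normalization is
left out), and allocation uses `malloc` directly rather than a
`cmark_mem` allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified types; not the structs in src/references.h. */
typedef struct ref {
  struct ref *next;
  const char *label;
  const char *url;
  unsigned int age; /* insertion order, used as the sort tie-breaker */
} ref;

typedef struct {
  ref *head, *tail;     /* insertion-ordered linked list */
  ref **sorted;         /* built lazily on the first lookup */
  unsigned int size;    /* number of references added */
  unsigned int nsorted; /* number of unique labels after compaction */
} refmap;

/* Step 1: O(1) append with no duplicate check. Returns 1 on success. */
int refmap_add(refmap *map, const char *label, const char *url) {
  ref *r;
  assert(map != NULL && map->sorted == NULL); /* no adds after a lookup */
  r = malloc(sizeof *r);
  if (r == NULL)
    return 0;
  r->next = NULL;
  r->label = label;
  r->url = url;
  r->age = map->size++;
  if (map->tail)
    map->tail->next = r;
  else
    map->head = r;
  map->tail = r;
  return 1;
}

/* Order by label, then by age. qsort is not guaranteed to be stable,
 * so the age tie-breaker is what keeps the earliest definition first. */
static int ref_cmp(const void *a, const void *b) {
  const ref *ra = *(const ref *const *)a;
  const ref *rb = *(const ref *const *)b;
  int c = strcmp(ra->label, rb->label);
  if (c != 0)
    return c;
  return (ra->age > rb->age) - (ra->age < rb->age);
}

/* bsearch comparator: the key is a label, the element is a ref pointer. */
static int key_cmp(const void *key, const void *elem) {
  const ref *r = *(const ref *const *)elem;
  return strcmp((const char *)key, r->label);
}

/* Steps 2-4: copy the list into an array, sort it, drop duplicates. */
static void refmap_sort(refmap *map) {
  unsigned int i = 0, last = 0;
  ref *r;
  ref **sorted = malloc(map->size * sizeof *sorted);
  if (sorted == NULL)
    return;
  for (r = map->head; r != NULL; r = r->next)
    sorted[i++] = r;
  qsort(sorted, map->size, sizeof *sorted, ref_cmp);
  for (i = 1; i < map->size; i++)
    if (strcmp(sorted[i]->label, sorted[last]->label) != 0)
      sorted[++last] = sorted[i];
  map->sorted = sorted;
  map->nsorted = last + 1;
}

/* Step 5: O(log n) lookup; the table is built once, on the first call. */
const ref *refmap_lookup(refmap *map, const char *label) {
  ref **hit;
  if (map->size == 0)
    return NULL;
  if (map->sorted == NULL)
    refmap_sort(map);
  if (map->sorted == NULL)
    return NULL; /* allocation failed */
  hit = bsearch(label, map->sorted, map->nsorted, sizeof *map->sorted,
                key_cmp);
  return hit ? *hit : NULL;
}

/* Duplicates dropped from the array are still on the list, so walking
 * the list frees every reference. */
void refmap_free(refmap *map) {
  ref *r = map->head, *next;
  for (; r != NULL; r = next) {
    next = r->next;
    free(r);
  }
  free(map->sorted);
  memset(map, 0, sizeof *map);
}
```

In this sketch, adding "foo", "bar", and then "foo" again and looking up
"foo" returns the first "foo" entry, because the comparator falls back
to `age` when labels are equal and the compaction pass keeps the first
entry of each run.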
author: John MacFarlane <jgm@berkeley.edu> 2020-02-16 08:50:54 -0800
committer: John MacFarlane <jgm@berkeley.edu> 2020-02-16 08:50:54 -0800
commit: b2378e459be775004af39bbe280846a98c8cbda6 (patch)
tree: f5a2436ce757ebfc59e5de8200882f32adf10339 /src/references.h
parent: 04936d63235a229c30d2cf2cd23ca5a177f0c133 (diff)
1 files changed, 4 insertions, 4 deletions
diff --git a/src/references.h b/src/references.h
index 5038c49..cc59509 100644
--- a/src/references.h
+++ b/src/references.h
@@ -7,21 +7,21 @@
 extern "C" {
 #endif
 
-#define REFMAP_SIZE 16
-
 struct cmark_reference {
   struct cmark_reference *next;
   unsigned char *label;
   unsigned char *url;
   unsigned char *title;
-  unsigned int hash;
+  unsigned int age;
 };
 
 typedef struct cmark_reference cmark_reference;
 
 struct cmark_reference_map {
   cmark_mem *mem;
-  cmark_reference *table[REFMAP_SIZE];
+  cmark_reference *refs;
+  cmark_reference **sorted;
+  unsigned int size;
 };
 
 typedef struct cmark_reference_map cmark_reference_map;