Comments (18)
Testing the new code from #21, it seems to reach its limit at:
crash-d40a5f23182756de9477e2782ba15dd2be71457d6cd8d0c7637795b5f9ecd6ba.txt
15 minutes in, it hasn't crashed yet (progress!), but it also seems to be hung/stalled.
I'm running a profiler on it to see if it's related to unravelLinkedTokens, or something else.
edit: it did eventually finish; flame graph of the run attached as flamegraph.html.log
from micromark.
Hello 👋
I've provided a fix for the three crash reports in #21
However, we still have an issue with the following, and it's unrelated to unravelLinkedTokens:
var micromark = require('./lib')
var doc = Array(35_000).fill('[](').join('')
var result = micromark(doc)
console.log(result)
It seems to be caused by the presence of a call to main() in createTokenizer().write()
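For context, micromark's tokenizer runs as a state machine: each state receives a character code and hands back the next state, and main() drives them in a loop. A rough sketch of that shape (hypothetical state and function names, not micromark's actual code):

```javascript
// Sketch of a state-machine trampoline (hypothetical, not micromark's
// real code). Each state *returns* the next state instead of calling it,
// so the driver keeps the call stack flat no matter how long the input is.
function run(state, codes) {
  let steps = 0
  for (const code of codes) {
    if (!state) break
    state = state(code)
    steps++
  }
  return steps
}

// A toy state that stays active until it sees the code 0.
const start = (code) => (code === 0 ? undefined : start)

// 100k codes, yet the stack never grows deeper than one state frame.
const codes = new Array(100000).fill(1)
console.log(run(start, codes)) // 100000
```

If a state called the next state directly instead of returning it, the same input would add one stack frame per character and eventually overflow, which is why re-entrant calls into main() from write() are suspicious here.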
Seems to be my node/mac version!
$ node test/perf.js
base
25ms
strong
4s
strong/emphasis?
4s
unclosed links
291ms
unclosed links (2)
919ms
tons of definitions
24s
Merging tokens is certainly a good idea. I hope that it will speed things up!
I managed to get the definitions case down to 0.7s by adding one line ✨ haha, phew
The unclosed links (3) case mentioned before is still slow, but the rest are fine:
base
24ms
strong
1s
strong/emphasis?
1s
unclosed links
365ms
unclosed links (2)
417ms
tons of definitions
746ms
I looked for a couple of those in test/perf too. I’m sure there are other cases where malicious data could cause either stack overflows or big delays (I believe the 10k definitions in one content block also crash).
Awesome!
Main seems to be this one; it’s where every state is called from. The self time (time actually spent in main) should be short, while the total time (time spent in functions called from main) is long.
After some thought, I did some more digging into the last issue, and we may have too many subevents. This is why we have the stack overflow: main() is only called 3 or 4 times, but subevents grow from 3 or 4 to 140,000 and cause Array#splice to stack overflow, which is weird.
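The splice part is easy to reproduce in isolation: Array#splice receives the items to insert as individual arguments, so spreading a six-figure array into it exhausts the call stack. A minimal sketch (the array size and 'event' payload are made up):

```javascript
// Minimal reproduction (array size and payload invented for the sketch):
// spreading a large array into splice passes every element as a separate
// argument, and V8 caps how many arguments one call can receive.
const events = []
const subevents = new Array(500000).fill('event')

let outcome
try {
  events.splice(0, 0, ...subevents)
  outcome = 'ok'
} catch (error) {
  outcome = error.name // RangeError ("Maximum call stack size exceeded")
}
console.log(outcome) // typically "RangeError" on Node's default stack
```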
To add some more info: for markdown, we first need to parse blocks (headings, definitions, etc.), and then we can do inlines (emphasis, links, etc.). Reason: later definitions need to be found first, before parsing references. Hence subtokenize, which takes the events from one tokenizer and passes them on to other tokenizers.
In this case, there is one giant paragraph, where every character is a separate token.
Q: do you know if those 140k events include (almost all) tokens which are data? Or are they other types of markers still?
@wooorm I confirm that all tokens are enter/exit of individual characters. I tracked the issue to the fact that Array#splice takes all items to be inserted as individual arguments. So I tried implementing a chunked version of it, and it works like a charm 🔥. I updated the original PR with some tests.
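A chunked splice could look roughly like this; a sketch only, with an assumed chunk size, not necessarily the PR's exact code:

```javascript
// Sketch of a chunked splice (assumed chunk size, not the PR's exact
// code): insert `items` into `list` at `start` in bounded-size batches
// so no single splice call receives too many arguments.
function chunkedSplice(list, start, deleteCount, items) {
  const chunkSize = 10000 // arbitrary "safe-looking" size for the sketch
  list.splice(start, deleteCount) // do the deletion once, up front
  let index = 0
  while (index < items.length) {
    const chunk = items.slice(index, index + chunkSize)
    list.splice(start + index, 0, ...chunk) // each call gets ≤ chunkSize args
    index += chunkSize
  }
  return list
}

// 50k items inserted into the middle without blowing the stack:
const list = [1, 2, 3]
chunkedSplice(list, 1, 0, new Array(50000).fill(0))
console.log(list.length) // 50003
```

The real safe limit depends on the engine's stack; the constant here is just a sketch assumption well below any limit I've seen reported for V8.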
That’s awesome!
Do you know the size of the performance improvement? (test/perf should have an example of this case, so you can quickly measure before and after the commit.)
Another idea I had: while tokenizing the text, the values are separate. When we know they aren’t links, they are marked as data. This is also done for several other things (whitespace is also a data token). Would merging adjacent enter/exits for data solve this too?
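That merge could be sketched like this, over a simplified event shape (['enter'|'exit', token] pairs with plain-number positions; micromark's real events carry richer positions and a context):

```javascript
// Sketch over a simplified event shape: an event is ['enter'|'exit', token],
// a token is {type, start, end} with plain-number positions. Whenever a
// data exit is immediately followed by a data enter, grow the first token
// over the second and drop the redundant pair between them.
function mergeDataEvents(events) {
  let index = 1
  while (index < events.length - 2) {
    if (
      events[index][0] === 'exit' &&
      events[index][1].type === 'data' &&
      events[index + 1][0] === 'enter' &&
      events[index + 1][1].type === 'data'
    ) {
      events[index][1].end = events[index + 2][1].end // extend first token
      events[index + 2][1] = events[index][1] // exit now closes merged token
      events.splice(index, 2) // drop the inner exit/enter pair
    } else {
      index++
    }
  }
  return events
}

// Three adjacent one-character data tokens collapse into one:
const a = {type: 'data', start: 0, end: 1}
const b = {type: 'data', start: 1, end: 2}
const c = {type: 'data', start: 2, end: 3}
const events = [
  ['enter', a], ['exit', a],
  ['enter', b], ['exit', b],
  ['enter', c], ['exit', c]
]
mergeDataEvents(events)
console.log(events.length, events[0][1].end) // 2 3
```

On a paragraph where every character is its own token, this turns ~2n events into 2, which is why it could help the 140k-subevents case.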
Yes, I think that we may fix the issue if the tokens are merged, but I can't say for sure that we won't have the same issue with a different input.
I ran the performance tests and got a stack overflow on the main branch:
# before
base
32ms
strong
5s
strong/emphasis?
5s
unclosed links
975ms
unclosed links (2)
1s
tons of definitions
/Users/marouane/Projects/micromark/dist/util/normalize-identifier.js:7
.replace(/[\t\n\r ]+/g, ' ')
^
RangeError: Maximum call stack size exceeded
at String.replace (<anonymous>)
at normalizeIdentifier (/Users/marouane/Projects/micromark/dist/util/normalize-identifier.js:7:8)
at Object.resolveDefinition [as resolve] (/Users/marouane/Projects/micromark/dist/tokenize/definition.js:26:20)
at addResult (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:291:27)
at onsuccessfulconstruct (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:199:5)
at ok (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:265:9)
at after (/Users/marouane/Projects/micromark/dist/tokenize/definition.js:262:14)
at main (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:130:17)
at write (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:58:5)
at flatMap (/Users/marouane/Projects/micromark/dist/util/flat-map.js:11:28)
# after
base
36ms
strong
5s
strong/emphasis?
5s
unclosed links
958ms
unclosed links (2)
1s
tons of definitions
31s
I’m currently getting (I have a couple of refactored things locally, but nothing noteworthy performance-wise):
base
25ms
strong
4s
strong/emphasis?
4s
unclosed links
285ms
unclosed links (2)
917ms
tons of definitions
/Users/tilde/Projects/oss/micromark/dist/tokenize/definition.js:291
return nok(code)
^
RangeError: Maximum call stack size exceeded
It seems my unclosed links (1) case is significantly faster, which seems to indicate that the chunked splice is slower. However, as the definitions are working for you, the while loop and the chunked splice fix the stack overflow at least?
Yes, I don't get a stack overflow when I combine the while loop and the chunked splice. Can you try the perf tests of my branch on your computer? 285ms is indeed quite fast! I don't know if the speedup/slowdown is due to the computer, the code, or the Node version 🤔
So that indicates your code fixes the crash. Adding your new test case to test/perf gives me 7s:
console.log('unclosed links (3)')
then = Date.now()
m('[](<'.repeat(35000))
console.log(ms(Date.now() - then))
Also: It might be that some of the refactoring I’ve done locally made unclosed links (1) much faster for me.
Definitions are still way too slow though, and unclosed links (3) too. I’m hopeful that merging data tokens will help there.
Wow! What was the line? 😮
This one! 09a4e43
It’s there to prevent links in links. But it walked from the link back to the start of the events (to prevent all previous brackets) every time. Now it exits if it already has one.
Could be better: it’ll still keep walking to the start in cases where there is no link opening before!
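The early exit could be sketched like this (hypothetical event shape and function name, not the actual code in 09a4e43): walking back from the bracket, the first link event we meet already tells us whether we're inside a link, so there's no need to continue to the start.

```javascript
// Sketch only (hypothetical event shape and names, not the real resolver).
// Events are ['enter'|'exit', token] pairs; walk backwards from `from`
// and stop at the first `link` event instead of always scanning to 0.
function inLink(events, from) {
  let index = from
  while (index--) {
    const [kind, token] = events[index]
    if (token.type === 'link') {
      // Hitting an `enter` first means that link is still open around us;
      // hitting an `exit` first means the previous link already closed.
      return kind === 'enter'
    }
  }
  return false // no link at all: worst case, we walked all the way back
}

const link = {type: 'link'}
const data = {type: 'data'}
console.log(inLink([['enter', link], ['enter', data]], 2)) // true
console.log(inLink([['enter', link], ['exit', link], ['enter', data]], 3)) // false
```

The `return false` path is the remaining weakness: when no link opening exists at all, the walk still reaches the start of the events.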