Code Monkey home page Code Monkey logo

Comments (18)

ChristianMurphy avatar ChristianMurphy commented on June 1, 2024 2

Testing the new code from #21 , It seems to reach it's limit at:
crash-d40a5f23182756de9477e2782ba15dd2be71457d6cd8d0c7637795b5f9ecd6ba.txt
15 minutes in, it hasn't crashed yet (progress!), but it also seems to be hung/stalled.

I'm running a profiler on it to see if it's related to unravelLinkedTokens, or something else.

edit: it did eventually finish, including flame graph of run flamegraph.html.log

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024 1

Hello 👋

I've provided a fix for the three crash reports in #21

However, we still have an issue with the following and it's unrelated to unravelLinkedTokens

var micromark = require('./lib')
var doc = Array(35_000).fill('[](').join('')

var result = micromark(doc)
console.log(result)

It seems to be caused by the presence of a call to main() in createTokenizer().write()

from micromark.

wooorm avatar wooorm commented on June 1, 2024 1

Seems to be my node/mac version!

$ node test/perf.js 
base
25ms
strong
4s
strong/emphasis?
4s
unclosed links
291ms
unclosed links (2)
919ms
tons of definitions
24s

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024 1

Merging tokens is certainly a good idea. I hope that it will speed things up !

from micromark.

wooorm avatar wooorm commented on June 1, 2024 1

I managed to get the definitions case down to .7s by adding one line ✨ haha, pfew

from micromark.

wooorm avatar wooorm commented on June 1, 2024 1

the unclosed (3) case mentioned before is still slow, but the rest are fine:

base
24ms
strong
1s
strong/emphasis?
1s
unclosed links
365ms
unclosed links (2)
417ms
tons of definitions
746ms

from micromark.

wooorm avatar wooorm commented on June 1, 2024

I looked for a couple of those in test/perf too. I’m sure there are other cases where malicious data could be made to cause either stackoverflows or a big delays (I believe the 10k definitions in one content block also crash).

from micromark.

wooorm avatar wooorm commented on June 1, 2024

Awesome!

Main seems to be this one, which is where every state is called from. It should be that the self time (time spent actually in main) is short, but the total time (time spent in functions called from main) is long

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024

After some thoughts, I did some more digging for the last issue and we may have too many subevents. This is why we have the stackoverflow, main() is only called 3 or 4 times.

However subevents go from 3, 4 to 140000 and cause Array#splice to stackoveflow which is weird

from micromark.

wooorm avatar wooorm commented on June 1, 2024

To add some more info: for markdown, we first need to parse blocks (headings, definitions, etc), and then we can do inlines (emphasis, links, etc). Reason: later definitions need to be found first, before parsing references. Hence, subtokenize, which deals with the events from one tokenizer and passes it to other tokenizers.
In this case, there is one giant paragraph, where every character is a separate token.

Q: do you know if those 140k events include (almost all) tokens which are data? Or are they other types of markers still?

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024

@wooorm I confirm that all tokens are enter/exit of individual characters. I tracked the issue to the fact that Array#splice takes all items to be inserted as individual argument. So I tried to implemented a chunked version of it at it works like a charm 🔥 . I updated the original PR with some tests

from micromark.

wooorm avatar wooorm commented on June 1, 2024

That’s awesome!

Do you know about the performance improvement? (test/perf should have an example of this case, where you can quickly measure before and after the commit?)

Another idea I had: while tokenizing the text, the values are separate. When we know they aren’t links, they are marked as data. This is also done for several other things (whitespace is also a data token). Would merging adjacent enter/exits for data solve this too?

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024

Yes, I think that we may fix the issue if the tokens are merged but I can't say for sure that we won't have the same issue with a different input.
I ran the performance tests and I got a stackoverflow on main branch

# before
base
32ms
strong
5s
strong/emphasis?
5s
unclosed links
975ms
unclosed links (2)
1s
tons of definitions
/Users/marouane/Projects/micromark/dist/util/normalize-identifier.js:7
      .replace(/[\t\n\r ]+/g, ' ')
       ^

RangeError: Maximum call stack size exceeded
    at String.replace (<anonymous>)
    at normalizeIdentifier (/Users/marouane/Projects/micromark/dist/util/normalize-identifier.js:7:8)
    at Object.resolveDefinition [as resolve] (/Users/marouane/Projects/micromark/dist/tokenize/definition.js:26:20)
    at addResult (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:291:27)
    at onsuccessfulconstruct (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:199:5)
    at ok (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:265:9)
    at after (/Users/marouane/Projects/micromark/dist/tokenize/definition.js:262:14)
    at main (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:130:17)
    at write (/Users/marouane/Projects/micromark/dist/util/create-tokenizer.js:58:5)
    at flatMap (/Users/marouane/Projects/micromark/dist/util/flat-map.js:11:28)
# after
base
36ms
strong
5s
strong/emphasis?
5s
unclosed links
958ms
unclosed links (2)
1s
tons of definitions
31s

from micromark.

wooorm avatar wooorm commented on June 1, 2024

I’m currently getting (I have a couple of refactored things locally, but nothing noteworthy performance-wise):

base
25ms
strong
4s
strong/emphasis?
4s
unclosed links
285ms
unclosed links (2)
917ms
tons of definitions
/Users/tilde/Projects/oss/micromark/dist/tokenize/definition.js:291
    return nok(code)
           ^

RangeError: Maximum call stack size exceeded

It seems my unclosed links 1 case is significantly faster, which seems to indicate that the chunked splice is slower. However, as the definitions are working for you, the while loop and the chunked splice fix the stack overflow at least?

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024

Yes I don't have a stack overflow when I combine both the while loop and the chunked splice. Can you try the perfs of my branch on your computer ? 285ms is indeed quite fast ! I don't know if the speedup/slowdown is due to the computer, the code or node's version 🤔

from micromark.

wooorm avatar wooorm commented on June 1, 2024

So that indicates your code fixes the crash. Adding your new test case to test/perf gives me 7s:

console.log('unclosed links (3)')
then = Date.now()
m('[](<'.repeat(35000))
console.log(ms(Date.now() - then))

Also: It might be that some of the refactoring I’ve done locally made unclosed links (1) much faster for me.

Definitions are still way too slow tho. And unclosed links (3) too. I’m hopeful that merging data tokens will help there.

from micromark.

fazouane-marouane avatar fazouane-marouane commented on June 1, 2024

Wow ! What was the line ? 😮

from micromark.

wooorm avatar wooorm commented on June 1, 2024

This one! 09a4e43

It’s there to prevent links in links. But it walked from the link back to the start of the events (to prevent all previous brackets) every time. Now it exits if it already has one.

Could be better: it’ll still keep walking to the start in cases where there is no link opening before!

from micromark.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.