micromark / micromark

small, safe, and great commonmark (optionally gfm) compliant markdown parser

Home Page: https://unifiedjs.com

License: MIT License

JavaScript 100.00%

markdown unified ast cst commonmark parse compile tokenize render gfm

micromark's Issues

Please provide a helloworld on how to write an extension? The real ones are kind of complex...

Subject of the feature

A helloworld is required on how to write an extension.

Problem

The real extensions are too complex for quick reference.

Expected behavior

A real simple extension.

Alternatives

Say I want to write an extension that makes links external. I'm still digging around; maybe modifying attributes would be a good fit for a hello world?

Tables not rendering properly

Honestly I haven't debugged this at all, however I'm hopeful that @wooorm will be kind enough to take a look, and probably figure out what's going on very quickly :)

The following markdown should produce a table (it does in github for example), however micromark parses it as text.

Gist: https://gist.github.com/diervo/931d9f68a08922efb3341c6faff3caea

nested ordered lists not starting with 1. are not detected

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark 3.2

Link to runnable example

https://codesandbox.io/s/trusting-worker-lt4scv

Steps to reproduce

nested ordered lists are not parsed correctly, if they don't start with a 1.

they work in gh:

a
2. foo
3. bar
b

Expected behavior

nested lists are parsed correctly on any level.

also see failing test: main...adobe-rnd:micromark:nested-lists-test

Actual behavior

if a nested ordered list doesn't start with 1. it is not parsed as list

Runtime

Node v16

Package manager

No response

OS

No response

Build and bundle tools

No response

`TokenizeContext.sliceSerialize` for `Token.type` of `setextHeading` includes non-heading content from outside the range of [`startLine`, `endLine`]

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

4.0.0

Link to runnable example

No response

Steps to reproduce

user@HOST micromark-setext % npm ls micromark
micromark-setext@ /Users/user/Documents/micromark-setext
└── micromark@4.0.0

user@HOST micromark-setext % cat issue.mjs 
import { parse } from "micromark";
import { postprocess } from "micromark";
import { preprocess } from "micromark";

const markdown = `
Text

Setext
======

Text
`;

const encoding = undefined;
const end = true;
const options = undefined;
const chunks = preprocess()(markdown, encoding, end);
const parseContext = parse(options).document().write(chunks);
const events = postprocess(parseContext);
for (const event of events) {
  const [ kind, token, context ] = event;
  if (kind === "enter") {
    const { type, start, end } = token;
    const { "line": startLine } = start;
    const { "line": endLine } = end;
    console.dir(`${type} (${startLine}-${endLine}): ${context.sliceSerialize(token)}`);
  }
}
user@HOST micromark-setext % node issue.mjs  
'lineEndingBlank (1-2): \n'
'content (2-2): Text'
'paragraph (2-2): Text'
'data (2-2): Text'
'lineEnding (2-3): \n'
'lineEndingBlank (3-4): \n'
'setextHeading (4-5): Text\n\nSetext\n======'
'setextHeadingText (4-4): Setext'
'data (4-4): Setext'
'lineEnding (4-5): \n'
'setextHeadingLine (5-5): ======'
'setextHeadingLineSequence (5-5): ======'
'lineEnding (5-6): \n'
'lineEndingBlank (6-7): \n'
'content (7-7): Text'
'paragraph (7-7): Text'
'data (7-7): Text'
'lineEnding (7-8): \n'
user@HOST micromark-setext %

Expected behavior

Note specifically this part of the output: 'setextHeading (4-5): Text\n\nSetext\n======'

While the start and end lines are correct, the output of sliceSerialize includes "Text\n\n" from lines 2 and 3 which is not part of the heading (confirmed by the associated setextHeadingText token which contains only "Setext").
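
For comparison, the text one would expect for a token can be derived from its start and end lines alone. The following is a minimal sketch that does not use micromark: `sliceLines` is a hypothetical helper, and it assumes `\n` line endings and 1-indexed line numbers, as in the output above.

```javascript
// Hypothetical helper: return the source text covered by an inclusive,
// 1-indexed line range. Assumes `\n` line endings.
function sliceLines(source, startLine, endLine) {
  return source.split('\n').slice(startLine - 1, endLine).join('\n')
}

// Same document as in the report.
const markdown = '\nText\n\nSetext\n======\n\nText\n'

// Lines 4-5 are the range reported for the `setextHeading` token.
console.log(JSON.stringify(sliceLines(markdown, 4, 5)))
```

Sliced this way, the range covers only the heading text and its underline, which is what the report expects `sliceSerialize` to return for that token.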

Actual behavior

See above.

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

No response

Using power-assert causes Webpack builds to fail

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

Use micromark in a Webpack app.

Expected behavior

The app builds without configuration changes.

Actual behavior

The build fails with:

WARNING in ../../node_modules/power-assert-formatter/lib/create.js 30:28-49
Critical dependency: the request of a dependency is an expression
 @ ../../node_modules/power-assert-formatter/index.js 12:0-40
 @ ../../node_modules/power-assert/index.js 15:16-49
 @ ../../node_modules/micromark/dev/lib/create-tokenizer.js 27:0-33 213:4-22 217:4-10 223:4-22 231:4-22 236:4-10 285:4-10 286:4-25 298:4-10 299:4-25 302:4-10 305:4-22 311:4-10 414:10-16 464:8-26 472:8-26 508:4-10 576:4-10 577:4-10 652:10-16
 @ ../../node_modules/micromark/dev/lib/parse.js 14:0-53 49:13-28

and similar warnings for any package that now has a dependency on power-assert.

There is a solution provided by power-assert here, but it seems like it would hide other warnings that I would want to see and I don't think I should need to modify my Webpack config to get micromark to work.

It would probably be better not to add power-assert as a dependency in a patch release since it's likely to break many people's builds.

Also note that the main micromark package uses power-assert, but I think it's missing from the package's dependencies.

Out of curiosity, what is the advantage of switching to power-assert?

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

`assert` is not browser friendly

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

latest

Link to runnable example

No response

Steps to reproduce

browserify/node-util#62 (comment)

Expected behavior

Use console.assert instead.

Actual behavior

assert is a node built-in module.

Runtime

Node v14

Package manager

yarn v1

OS

macOS

Build and bundle tools

Vite

Potential Heap overflow/memory leak

Subject of the issue

Fuzz testing micromark, by itself without plugins (#18 modified)

const fs = require('fs')
const micromark = require('../index')

function fuzz(buf) {
  try {
    // focus on issues in files less than 1Mb
    if (buf.length > 1000000) return

    // write result in temp file in case unrecoverable exception is thrown
    fs.writeFileSync('temp.txt', buf)

    // commonmark buffer without html
    micromark(buf)
  } catch (e) {
    throw e
  }
}

module.exports = {
  fuzz
}

after running through 10-30 files often crashes with:

<--- Last few GCs --->

[16841:0x4e8fc10]    11334 ms: Mark-sweep (reduce) 3664.6 (4118.7) -> 3664.6 (4118.7) MB, 162.9 / 0.0 ms  (average mu = 0.067, current mu = 0.000) last resort GC in old space requested
[16841:0x4e8fc10]    11494 ms: Mark-sweep (reduce) 3664.6 (4115.7) -> 3664.6 (4116.7) MB, 160.5 / 0.0 ms  (average mu = 0.033, current mu = 0.000) last resort GC in old space requested


<--- JS stacktrace --->

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0xa02dd0 node::Abort() [node]
 2: 0x94e471 node::FatalError(char const*, char const*) [node]
 3: 0xb7686e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xb76be7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xd31485  [node]
 6: 0xd43cf1 v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 7: 0xd09562 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [node]
 8: 0xd033e4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [node]
 9: 0xd0b719 v8::internal::Factory::NewInternalizedStringImpl(v8::internal::Handle<v8::internal::String>, int, unsigned int) [node]
10: 0xf3169f v8::internal::StringTable::AddKeyNoResize(v8::internal::Isolate*, v8::internal::StringTableKey*) [node]
11: 0xf3fa16 v8::internal::Handle<v8::internal::String> v8::internal::StringTable::LookupKey<v8::internal::InternalizedStringKey>(v8::internal::Isolate*, v8::internal::InternalizedStringKey*) [node]
12: 0xf3fac6 v8::internal::StringTable::LookupString(v8::internal::Isolate*, v8::internal::Handle<v8::internal::String>) [node]
13: 0xb7644b v8::internal::LookupIterator::LookupIterator(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Name>, unsigned long, v8::internal::Handle<v8::internal::JSReceiver>, v8::internal::LookupIterator::Configuration) [node]
14: 0xee1809 v8::internal::LookupIterator::LookupIterator(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::LookupIterator::Key const&, v8::internal::LookupIterator::Configuration) [node]
15: 0x106d9f9 v8::internal::Runtime::SetObjectProperty(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::StoreOrigin, v8::Maybe<v8::internal::ShouldThrow>) [node]
16: 0x106eb07 v8::internal::Runtime_SetKeyedProperty(int, unsigned long*, v8::internal::Isolate*) [node]
17: 0x13fe259  [node]
timeout: the monitored command dumped core
Aborted

on an innocuous looking file, like

# Foo

| Name | GitHub | Twitter |
| ---- | ------ | ------- |

Your environment

OS: Ubuntu
Packages: Micromark 2.8.0, including if #21 is applied
Env: Node 14

Steps to reproduce

Run fuzzer from #18

Expected behavior

no crash

Actual behavior

The same heap out of memory crash and stack trace shown above.

micromark crashes on invalid URI

Subject of the issue

Some malformed URL can crash micromark

Your environment

OS: Ubuntu 16
Packages: micromark 2.6.0
Env: Node 14

Steps to reproduce

var micromark = require('micromark')

console.log(micromark('[](<%>)'))

originally detected with #18, credit to @wooorm for a more minimal repro

Expected behavior

<p><a href="%25"></a></p>

Actual behavior

URIError: URI malformed
    at decodeURI (<anonymous>)
    at normalizeUri (micromark/dist/util/normalize-uri.js:1:1040)
    at url (micromark/dist/compile/html.js:1:54303)
    at Object.onexitmedia (micromark/dist/compile/html.js:1:61812)
    at done (micromark/dist/compile/html.js:1:50389)
    at compile (micromark/dist/compile/html.js:1:48534)
    at buffer (micromark/dist/index.js:1:2192)
    at Worker.fuzz [as fn] (micromark/fuzzer.js:1:1781)
    at process.<anonymous> (micromark/node_modules/jsfuzz/build/src/worker.js:63:30)

effect.check() modify events when construct is for document and has resolver

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark

Link to runnable example

https://github.com/wataru-chocola/report-micromark-20210827

Steps to reproduce

Run my PoC.

$ git clone https://github.com/wataru-chocola/report-micromark-20210827
$ cd report-micromark-20210827
$ npm install
$ npx node index.js

Expected behavior

document constructs are invoked twice in micromark/lib/initialize/document.js :

from checkNewContainers state

return effects.check(
  containerConstruct,
  thereIsANewContainer,
  thereIsNoNewContainer
)(code)

from documentContinued

return effects.attempt(
  containerConstruct,
  containerContinue,
  flowStart
)(code)

And I expect the first invocation effect.check(...) doesn't make any modifications on events.

 * @property {Attempt} check
 *   Attempt, then revert.
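
The "attempt, then revert" expectation can be illustrated with a simplified model. This is a sketch under stated assumptions, not micromark's tokenizer: `events` is a plain array, `construct` is any function that may rewrite it, and `check` and `defListLike` are hypothetical names.

```javascript
// Simplified model of "attempt, then revert". A shallow copy is enough here
// because the model only adds or removes entries; it does not mutate them.
function check(events, construct) {
  const snapshot = events.slice()
  const result = construct(events)
  // Revert: restore exactly the events that existed before the attempt.
  events.length = 0
  events.push(...snapshot)
  return result
}

const events = [
  ['enter', 'chunkFlow'],
  ['exit', 'chunkFlow']
]

// A construct whose resolver wraps the previous events, as in the report.
function defListLike(list) {
  list.unshift(['enter', 'defListTerm'])
  return true
}

const matched = check(events, defListLike)
console.log(matched, events.map((event) => event.join(':')))
```

After `check` returns, the events are the same two `chunkFlow` entries as before, even though the construct matched.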

Actual behavior

effect.check() does modify events if construct is for document and has resolver.

My construct in PoC code dumps context.events at the start.
On 1st run (from effects.check), we see the correct events which are generated by previous tokenization.

+ initialize tokenizer (runCount: 1)
+ previous events
[ 'enter', 'chunkFlow', 'term\n' ]
[ 'exit', 'chunkFlow', 'term\n' ]
+ run resolverTo

But on 2nd run (from effects.attempt), events are modified by resolver in the previous check execution.

+ initialize tokenizer (runCount: 2)
+ previous events
[ 'enter', 'defListTerm', 'term\n' ]
[ 'enter', 'chunkFlow', 'term\n' ]

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

List items wrapped in <p> tags due to trailing space

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

Link to runnable example

No response

Steps to reproduce

In chrome's console run:

const mm = await import('https://esm.sh/micromark@3?bundle');
console.log(mm.micromark('List1\n* item1\n* item2\n\n\n\n'));
console.log('------');
console.log(mm.micromark('List1\n* item1\n* item2\n\n\n \n'));

Note the only difference between the two examples is a single space some blank lines away from the list. Those two examples return different HTML; the latter has the list items wrapped in <p> tags:

<p>List1</p>
<ul>
<li>item1</li>
<li>item2</li>
</ul>
------
<p>List1</p>
<ul>
<li>
<p>item1</p>
</li>
<li>
<p>item2</p>
</li>
</ul>

Expected behavior

I'm not clear enough on the markdown spec to say which case is actually correct. Certainly other markdown parsers I've tried (though that is not a long list) render it like the first example.

Regardless I'd expect it to be the same between the two. In most markdown editors the trailing space is impossible to see and it can take a long time to track down why some list elements render with increased padding.

Actual behavior

See repro steps. Two examples output visually different HTML whereas I feel they should render the same.

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

& in image URL is encoded as an HTML entity

Initial checklist

I read the contributing guide
I read the support docs
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark 3.1.0

Link to runnable example

No response

Steps to reproduce

const content = '![](/imgs/i1.png?_a=center&_w=300)'

const html = micromark(content, {
    extensions: [gfm()],
    htmlExtensions: [gfmHtml()],
});


console.log(html)

Expected behavior

<p><img src="/imgs/i1.png?_a=center&_w=300" alt="" /></p>

Actual behavior

<p><img src="/imgs/i1.png?_a=center&amp;_w=300" alt="" /></p>
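
In HTML, `&amp;` inside an attribute value is a character reference for `&`, so both forms above describe the same URL once the HTML is parsed. The sketch below illustrates that round trip with two hypothetical helpers; `decodeAttribute` handles only four references and is a simplification of what an HTML parser does.

```javascript
const toEntity = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;'}
const fromEntity = {'&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"'}

// Hypothetical helper: escape characters that are unsafe in an attribute value.
function encodeAttribute(value) {
  return value.replace(/[&<>"]/g, (char) => toEntity[char])
}

// Hypothetical helper: reverse the four references produced above.
function decodeAttribute(value) {
  return value.replace(/&(?:amp|lt|gt|quot);/g, (entity) => fromEntity[entity])
}

const url = '/imgs/i1.png?_a=center&_w=300'
const encoded = encodeAttribute(url)

console.log(encoded)
console.log(decodeAttribute(encoded) === url)
```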

Runtime

Node v16

Package manager

pnpm

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)

ES5 Compatibility

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Problem

I moved from react-markdown to micromark for the reasons below:

react-markdown has a dependency that is ES6, which breaks my app when it runs on IE11
react-markdown used to have a bug that micromark does not have
micromark is the 'smallest'

However, since version 1.x the build target is ES2020, so there are a lot of consts, lets, method shorthands, and more, which will definitely break my app...

Solution

ESM is a trend, and webpack can deal with it by default. The ES6 syntax, however, means I may have to configure Babel, and that might cost more compile time.

So might it be possible to set the build target to ES5, which is for now the most compatible output? It does not hurt the ESM module.

part of my tsconfig.json

"target": "es5",
"lib": [
  "dom",
  "es2015",
  "es2017"
],

Alternatives

no..

Emphasis and strong when immediately followed by emphasis in the same word causes extra asterisks to appear

Issue from react-markdown: remarkjs/react-markdown#812

But potentially the root of the issue could live in the md parser. Below I have linked the repro links and comments from the other issue:

When processing the MD string

***123****456*

<em>
  <strong>123</strong>
</em>
<em>456</em>

React markdown renders what seems to be some additional asterisks?

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

react-markdown

Link to runnable example

No response

Steps to reproduce

Compare the result of:

This is just 1 word, where the first half is both italicized and bolded, the 2nd half is only italicized.

The MDAST that gets created from unified() => rehypeParse => rehypeRemark looks correct, so to me the issue seems to be either:

The syntax generated from the processing flow is incorrect.
The syntax is correct, and its React-Markdown's rendering of the syntax that is not correct.

Expected behavior

Actual behavior

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

Custom extensions break in development mode, despite working in production

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

3.1.0

Link to runnable example

https://github.com/chudoklates/micromark-error-demo

Steps to reproduce

Use repo provided above.

Generally, for this error to occur, the parser needs to be run through Webpack in development mode. There also needs to be an extension which calls effects.consume() in its syntax before effects.enter() is called

Expected behavior

Actions which are permissible in the production distribution should also be permissible in development mode.

Actual behavior

A TypeError is thrown when the code reaches this assertion:

// at the point of error: code: 123, context.events: []
assert(
      code === null
        ? context.events.length === 0 ||
            context.events[context.events.length - 1][0] === 'exit'
        : context.events[context.events.length - 1][0] === 'enter',
      'expected last token to be open'
    )

Uncaught TypeError: Cannot read properties of undefined (reading '0')
    at Object.consume (create-tokenizer.js:246:52)
    at onStart (extensions.js:45:13)
    at start (create-tokenizer.js:460:12)
    at start (create-tokenizer.js:401:46)
    at start (text.js:49:30)
    at go (create-tokenizer.js:229:13)
    at main (create-tokenizer.js:209:11)
    at Object.write (create-tokenizer.js:135:5)
    at subcontent (index.js:198:17)
    at subtokenize (index.js:90:30)
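
The assertion above expects a token to be open when a character is consumed. The sketch below shows a tokenizer that follows that order, using a mock `effects` object so it can run on its own. The mock, the `braceMarker` token type, and `tokenizeBrace` are hypothetical; only the `enter`, `consume`, and `exit` method names are taken from micromark.

```javascript
// Mock of the parts of `effects` used here. Like the development build, its
// `consume` throws when no token is open.
function createMockEffects() {
  const events = []
  return {
    events,
    enter(type) {
      events.push(['enter', type])
    },
    exit(type) {
      events.push(['exit', type])
    },
    consume() {
      const last = events[events.length - 1]
      if (!last || last[0] !== 'enter') {
        throw new Error('expected last token to be open')
      }
    }
  }
}

// Tokenizer for a single `{` (code 123, as at the point of error above).
function tokenizeBrace(effects, ok, nok) {
  return start

  function start(code) {
    if (code !== 123) return nok(code)
    effects.enter('braceMarker') // Open a token first...
    effects.consume(code) // ...then consume the character...
    effects.exit('braceMarker') // ...and close the token.
    return ok
  }
}

const effects = createMockEffects()
const done = () => undefined
tokenizeBrace(effects, done, done)(123)
console.log(effects.events)
```

Calling `effects.consume(code)` before `effects.enter(...)` against this mock throws, which mirrors what the development build reports.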

Runtime

Node v16

Package manager

yarn v1

OS

macOS

Build and bundle tools

Webpack

Support Eastern Arabic numerals for lists

See: https://svelte.dev/repl/982673f97faa457692eb4d7bd51998df?version=3.29.0

Tl; dr: Some languages use different numerals (eg. ١,٢,٣ instead of 1,2,3). Can those numerals also be used to mark lists?
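
As an illustration of what such support might involve, the sketch below recognizes an ordered list marker written with either ASCII digits or Arabic-Indic digits (U+0660 to U+0669) and returns its numeric start. This is a hypothetical extension of the marker rule, not behavior defined by CommonMark or implemented by micromark; `orderedMarker` and `listStart` are made-up names.

```javascript
// Hypothetical pattern: one to nine ASCII or Arabic-Indic digits, then `.` or
// `)`, then a space or tab.
const orderedMarker = /^([0-9\u0660-\u0669]{1,9})[.)][ \t]/

// Return the list start number for a line, or `undefined` if the line does
// not begin with an ordered list marker.
function listStart(line) {
  const match = orderedMarker.exec(line)
  if (!match) return undefined

  let digits = ''
  for (const char of match[1]) {
    const code = char.charCodeAt(0)
    // Map U+0660..U+0669 onto `0`..`9`; keep ASCII digits unchanged.
    digits += code >= 0x0660 ? String(code - 0x0660) : char
  }

  return Number.parseInt(digits, 10)
}

// `\u0661` and `\u0662` are the Arabic-Indic digits one and two.
console.log(listStart('\u0661. first'), listStart('\u0662. second'), listStart('12) twelfth'))
```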

A quick test with babelmark indicated that https://github.com/dotnet/docfx supports this. https://babelmark.github.io/?text=%D9%A1.+%D9%85%D8%B1%D8%AD%D8%A8%D8%A7%0A%D9%A2.+%D8%A8%D8%A7%D9%84%D8%B9%D8%A7%D9%84%D9%85

micromark handles links with custom protocol different from commonmark

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

3.0.5

Link to runnable example

https://stackblitz.com/edit/node-qr2fly?file=index.js

Steps to reproduce

import { micromark } from "micromark";
import { Parser, HtmlRenderer } from "commonmark";

const reader = new Parser();
const writer = new HtmlRenderer();

const commonmark = (buf) => writer.render(reader.parse(buf));

const content = `<test:what>`;

console.log(micromark(content));
console.log(commonmark(content));

Expected behavior

micromark and commonmark should produce the same HTML output

<p><a href="test:what">test:what</a></p>

Actual behavior

micromark produces different HTML

<p><a href="">test:what</a></p>
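
The empty `href` is consistent with a protocol allowlist being applied to link destinations. The sketch below shows how such a check could work. It is an assumption for illustration: `sanitizeHref` and its default pattern are hypothetical and may differ from what micromark uses. micromark documents an `allowDangerousProtocol` option for this kind of case, which is worth checking against the version in use.

```javascript
// Hypothetical helper: keep a URL if it is relative or if its protocol matches
// the allowlist; otherwise return an empty string.
function sanitizeHref(url, allowed = /^(https?|ircs?|mailto|xmpp)$/i) {
  const colon = url.indexOf(':')
  const slash = url.indexOf('/')
  const questionMark = url.indexOf('?')
  const numberSign = url.indexOf('#')

  if (
    // No colon at all: a relative URL.
    colon < 0 ||
    // A colon that comes after `/`, `?`, or `#` is not a protocol separator.
    (slash > -1 && colon > slash) ||
    (questionMark > -1 && colon > questionMark) ||
    (numberSign > -1 && colon > numberSign) ||
    // The protocol is on the allowlist.
    allowed.test(url.slice(0, colon))
  ) {
    return url
  }

  return ''
}

console.log(JSON.stringify(sanitizeHref('test:what')))
console.log(JSON.stringify(sanitizeHref('https://example.com')))
```

Under this model `test:what` is dropped because `test` is not on the allowlist, while relative URLs and allowed protocols pass through unchanged.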

Runtime

Node v16

Package manager

npm v7

OS

Linux

Build and bundle tools

No response

uvu shouldn't be set in dependencies

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark-core-commonmark@npm:1.0.4, micromark-extension-gfm-autolink-literal@npm:1.0.2, micromark-extension-gfm-footnote@npm:1.0.2, micromark-extension-gfm-strikethrough@npm:1.0.3, micromark-extension-gfm-table@npm:1.0.4, micromark-extension-gfm-task-list-item@npm:1.0.2

Link to runnable example

No response

Steps to reproduce

https://unpkg.com/[email protected]/package.json and you can see that uvu is in the dependencies and not devDeps

Expected behavior

uvu should be listed in devDependencies so that, when installing any of the packages defined here (like micromark-core-commonmark), uvu isn't installed too (it's a test runner, not used in the runtime code)

Actual behavior

uvu is listed in dependencies, so it gets installed

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

`index.d.ts` is missing in `micromark-util-encode` published files

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark-util-encode@1.0.0

Link to runnable example

No response

Steps to reproduce

The package.json published for micromark-util-encode v1.0.0 contains "types": "index.d.ts":

https://unpkg.com/browse/micromark-util-encode@1.0.0/package.json

Yet the index.d.ts file is not in the files whitelist:

micromark/packages/micromark-util-encode/package.json

Lines 33 to 35 in efe9c4d

    
           "files": [ 
        
             "index.js" 
        
           ],

Most likely that is the reason index.d.ts isn't published:

https://unpkg.com/browse/micromark-util-encode@1.0.0/

I haven't checked if other micromark packages have a similar issue, this is just what I discovered in my particular project:

Expected behavior

Types declared in the package.json should be published.

Actual behavior

Types declared in the package.json are not published.

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)

Improved Concrete Syntax Trees

Subject of the feature

I'm in the process of migrating my markdown editor to use remark/micromark instead of markdown-it. One of my goals is not to change the formatting style of my users' input files, at least if I can help it.

At the moment, the micromark tokenizers seem not to record information that might help reconstruct the original input Markdown in cases where Markdown has redundancies:

_ vs * for emphasis
ATX headings vs setext headings
* vs - vs + for unordered lists
* vs - vs = for hrule / thematic breaks, as well as the length of the string used to indicate the break

I'm not terribly interested in preserving superfluous whitespace the user might have, but it would be nice to at least preserve their preferences for emphasis / heading / list syntax. For instance, I personally like to use * for regular lists and +/- for pro/con lists, and at the moment there's no way to preserve that information.
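
One plugin-level approach is to read the marker back from the original source using a node's start offset. The sketch below assumes that each node carries `position.start.offset` (a character offset into the source); `markerOf` is a hypothetical helper and the nodes are hand-built for illustration rather than produced by a parser.

```javascript
// Hypothetical helper: return the character at the node's start offset, which
// for emphasis is the marker the author typed.
function markerOf(source, node) {
  return source.charAt(node.position.start.offset)
}

const source = '*a* and _b_'

// Hand-built stand-ins for two emphasis nodes in `source`.
const nodes = [
  {type: 'emphasis', position: {start: {offset: 0}}},
  {type: 'emphasis', position: {start: {offset: 8}}}
]

console.log(nodes.map((node) => markerOf(source, node)))
```

A serializer could then consult the recorded marker instead of a global preference, at the cost of keeping the source text around.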

Are there any plans for improved concrete syntax tree support in the future?
If not would PRs be welcome to record syntactic information on syntax tree nodes, as well as in the corresponding serializers?
Otherwise, any advice for implementing this feature as a set of plugins would be appreciated!

Thanks!

Problem

Expected behavior

Alternatives

unravelLinkedTokens RangeError: Maximum call stack size exceeded

Subject of the issue

#18 fed with https://github.com/remarkjs/remark/blob/8108fe54e04640dda119aad366d70e6edf2602f1/test/fixtures/input/title-attributes.text can trigger a call stack exceeded issue in unravelLinkedTokens.
These files are pretty large, about 1 MB and around 30k lines apiece; a more minimal example, at 105 KB, is also included.

It seems to be related to unterminated links, but more research is needed.

Your environment

OS: Ubuntu
Packages: micromark 2.6.1
Env: node 14

Steps to reproduce

var fs = require('fs')
var micromark = require('./index')

// var doc = fs.readFileSync('crash-395a731d55c510f1338b8c9911c159ab56329d18bc3a12a26b826b750d0b1253.txt')
// var doc = fs.readFileSync('crash-4bf6a4882505b11dea88b5e16e6f0d3766252601ae704e42ebe606d270f9f26f.txt')
var doc = fs.readFileSync('crash-7182fa3e89e1b8fb28bda27b6da6b3769f05b1ce68551d96c46acd0931d95004.txt')

var result = micromark(doc)

console.log(result)

crash-7182fa3e89e1b8fb28bda27b6da6b3769f05b1ce68551d96c46acd0931d95004.txt
crash-4bf6a4882505b11dea88b5e16e6f0d3766252601ae704e42ebe606d270f9f26f.txt
crash-395a731d55c510f1338b8c9911c159ab56329d18bc3a12a26b826b750d0b1253.txt

a more minimal example of what may be the same issue ([]( repeated 35k times in a 105kb file)

repeated-unterminated-links.txt

Expected behavior

If possible no error, alternatively a better error message could help.

Actual behavior

RangeError: Maximum call stack size exceeded
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:16585)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
    at unravelLinkedTokens (micromark/dist/util/subtokenize.js:1:17944)
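
A trace like this is typical of a recursive walk over a long chain of linked items. The sketch below shows the general remedy, replacing recursion with a loop, on a simplified model: each token is an object with a `value` and a `next` pointer. It is not the code of `unravelLinkedTokens`, and `buildChain` and `collectIterative` are hypothetical names.

```javascript
// Build a chain of `size` linked tokens and return the first one.
function buildChain(size) {
  let head
  for (let index = size - 1; index >= 0; index--) {
    head = {value: index, next: head}
  }
  return head
}

// Walk the chain with a loop, so stack depth does not grow with chain length.
function collectIterative(token) {
  const result = []
  let current = token
  while (current) {
    result.push(current.value)
    current = current.next
  }
  return result
}

// 35k links, matching the size of the minimal example in the report.
const collected = collectIterative(buildChain(35000))
console.log(collected.length)
```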

Including license in NPM packages

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Problem

While scanning my dependencies I found that micromark NPM packages don't include their actual license file. I believe it would make sense for the micromark NPM packages to include the license since the MIT license requires that it be included in all copies or substantial portions of the Software.

Solution

Since there's 22 NPM packages in the repo and they would presumably all use the same license from the root repo directory, I propose adding a release script that copies the license file from the root repo directory into each of the package directories, like this from vue router. I think it would then make sense to allow git to ignore license files in the package directories (but still allow NPM to include them).

Alternatives

It could also be solved by copy-pasting the license into each of the package directories. I think that may not be preferable because it duplicates content in the repo.

CMSM

micromark is developed jointly with CMSM: Common Markup State Machine, as it’s sometimes easier to make changes in prose.

If you’re interested in micromark, also definitely check out CMSM!

Lack of document and types for turning off constructs

Subject of the issue

micromark doesn't accept {disable: {null: []}} as an extension when using TypeScript.

Your environment

OS: macOS 10.15
Packages: [email protected], [email protected]
Env: node v14.16.0; yarn 1.22.10

Steps to reproduce

please check https://github.com/issueset/micromark-disable-typescript-issue

Expected behavior

No TypeScript error. It would also be good to document this feature in the README.md.

Actual behavior

 Type '{ disable: { null: string[]; }; }' is not assignable to type 'SyntaxExtension[]'.

Performance improvement: linked lists for events

Subject of the feature

Given that on large markdown files we are dealing with tons (literally, 100k or so) of events, improving performance might be switching from arrays to linked event objects.

Problem

Operations on big arrays can be slow, such as #21.
Switching to linked lists adds complexity (while removing it in certain other cases!), but will probably/hopefully improve perf.

Alternatives

We’re already using really fast array methods. And everything is mutating already. Maybe linked lists won’t net a lot.
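To make the trade-off concrete, here is a rough sketch of what linked events could look like. Everything in it (the `createEventList` helper, `insertAfter`, the node shape) is hypothetical and not part of micromark; it only illustrates that inserting in the middle of a linked list does not move later entries, which is the cost an array splice pays.

```javascript
// Hypothetical sketch: a singly linked list of events.
// None of these names exist in micromark; they are for illustration only.
function createEventList() {
  let head = null
  let tail = null
  let size = 0

  return {
    // Append an event at the end; returns the node so callers can keep a handle.
    push(event) {
      const node = {event, next: null}
      if (tail) tail.next = node
      else head = node
      tail = node
      size++
      return node
    },
    // Insert directly after a known node: no later entries are moved.
    insertAfter(node, event) {
      const inserted = {event, next: node.next}
      node.next = inserted
      if (tail === node) tail = inserted
      size++
      return inserted
    },
    // Walk the list from head to tail and collect the events into an array.
    toArray() {
      const result = []
      for (let node = head; node; node = node.next) result.push(node.event)
      return result
    },
    get size() {
      return size
    }
  }
}

const list = createEventList()
const enter = list.push(['enter', 'paragraph'])
list.push(['exit', 'paragraph'])
list.insertAfter(enter, ['enter', 'data'])
```

The catch, as noted above, is that anything relying on index access would need to hold node references instead, so whether this nets a real gain would have to be measured.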

Trade-off between extensibility and performance

Say we take:

Indented code: it ends when there is a bogus line. But there could be any number of blank lines before that bogus line.
HTML blocks of kind 6 or 7: one blank line ends it.

Do we backtrack to before the blank lines, and check all the tokenisers again (blank line is last probably), or is there a knowledge of what other tokenisers are enabled and can we “eat” every blank line directly?

The trade-off here is that either, with knowledge of other tokens, we can be more performant and scan the buffer fewer times, or that we are more extensible, allowing blank lines to be turned off, or alternative tokenisers from extensions dealing with them?

Ordered lists starting with non-1 are not parsed when some content is present before them (micromark 3)

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

3.0.0 (via mdast-util-from-markdown 1.0.0)

Please let me know if this issue should be moved to mdast-util-from-markdown, but I think the bug is somewhere in micromark source :)

Link to runnable example

https://codesandbox.io/s/naughty-ptolemy-y62bf?file=/src/index.ts

Steps to reproduce

This is not parsed as list (paragraph before)

Content.

2. Hello
3. world

This is also not parsed as list (empty line before)


2. Hello
3. world

This is parsed as list (when trimmed, start number is correct)

2. Hello
3. world

This is also parsed as list (obviously)

Content 

1. Hello
2. world

Expected behavior

Lists starting with non-1 numbers are parsed correctly.

Github handles it:

Hello
world

micromark pre-3 also handled it correctly.

Actual behavior

Lists starting with non-1 numbers are not parsed correctly when some paragraph or even an empty line is present before them (in container?) 🤷

Content.

2. Hello
3. world

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "Content.",
                    "position": {
                        "start": {
                            "line": 2,
                            "column": 1,
                            "offset": 1
                        },
                        "end": {
                            "line": 2,
                            "column": 9,
                            "offset": 9
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 2,
                    "column": 1,
                    "offset": 1
                },
                "end": {
                    "line": 2,
                    "column": 9,
                    "offset": 9
                }
            }
        },
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "2. Hello\n3. world",
                    "position": {
                        "start": {
                            "line": 4,
                            "column": 1,
                            "offset": 11
                        },
                        "end": {
                            "line": 5,
                            "column": 9,
                            "offset": 28
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 4,
                    "column": 1,
                    "offset": 11
                },
                "end": {
                    "line": 5,
                    "column": 9,
                    "offset": 28
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 6,
            "column": 1,
            "offset": 29
        }
    }
}

Runtime etc.

I do not think it's build / runtime dependent - it's some construct issue - but it happens both in browser & node - windows & linux.

Make `definitions` available to extensions

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Problem

I'm writing an extension where I would need the definitions defined in the document.

Solution

The definitions should be available either via the context (this.definitions) or via getData('mediaDefinitions').
The latter is probably better.

Alternatives

I could probably overwrite all the definition related enter and exit methods, and track them as well, but this sounds like a wrong approach.

What would the usage, api surface and extension points look like?

To get more clarity on where this fits in in the @unifiedjs ecosystem, could the assigned folks add some example usages of this library in this issue please?

Examples of how this would be used by @remarkjs and/or @unifiedjs would be helpful as they would clear up the following questions:

what would the usage of this library look like?
who are the potential consumers of it?
would this cause a rewrite or changes in how remark-parse is written?
what should the api surface look like?
what are potential extension points?
does this impact processor.use from the @unifiedjs world?
will this stream tokens or eat a file and spit out all the tokens at once? (assuming this is a lexer)

..and any other you folks can come up with.

The idea behind this is to discuss and land on a common understanding of this project's technical goals (e.g., is this a lexer? a parser? I've seen both words around here leading to some confusion), nail the api surface and identify potential extension points. This should help speed up dev, lead to some early "documentation" and prevent misalignment on the goals.

Thanks!

crash on reference like structure before directive with parenthesis

Subject of the issue

const micromark = require("micromark/lib");
const directive = require("micromark-extension-directive");

micromark( "[!]:)", "utf-8", { extensions: [directive()] });

throws

AssertionError [ERR_ASSERTION]: expected non-empty token (`chunkString`)

Your environment

OS: Ubuntu

Packages:

├── [email protected]
└── [email protected]

Env: node v15.2.1, npm 7.0.8

Steps to reproduce

run:

const micromark = require("micromark/lib");
const directive = require("micromark-extension-directive");

micromark( "[!]:)", "utf-8", { extensions: [directive()] });

Expected behavior

No error, or a more specific markdown syntax related error

Actual behavior

AssertionError [ERR_ASSERTION]: expected non-empty token (`chunkString`)

HTML with excess whitespace is not parsed correctly

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark 3.0.10, mdast-util-from-markdown 1.2.0

Link to runnable example

https://codesandbox.io/s/awesome-elbakyan-c0gus?file=/src/index.ts

Steps to reproduce

If you remove the line between Some HTML and Spanning multiple lines, it does work, but the excess whitespace confuses the parser (it thinks the extra line means a new paragraph starts).

Expected behavior

The HTML should be all combined in a single html node

Actual behavior

The parsing fails

How to handle “virtual spacing” in the CST?

I’m going to post a couple of problems I foresee as I’m trying to wrap my head around what micromark will be.

Take the following example:

>␉␠indented.code("in a block quote")

It’s a block quote marker, followed by a tab (tabs are forced to be treated as four spaces).
The first “virtual space” of the tab is part of the block quote marker. The second three “virtual spaces” are part of the indent of the indented code.
One extra real space, and you’ve got a code indent of four spaces, making it a proper indented code, in a block quote.

How is that represented as tokens? In a CST?

Improving performance by reducing useless parsing

Subject of the feature

There are two main places where parsing is done that is (potentially) useless.

content: At the end of a line in content, we parse ahead, to figure out if the paragraph should be closed
This is double, because when the paragraph is closed, we will actually do the parsing.
This was done in remark too. And similar to there, it is a bit optimized.
This point in parsing markdown is rather complex, because of interplay with definitions, setext headings, paragraphs, but also lazy lines.
Removing lookaheadConstruct improves performance by 13%. The alternative should be possible and hopefully is not too big.
document: to figure out whether containers continue, close flow, start new flow, or have lazy lines, another throwaway inspection is done.
Removing document completely improves performance by 28% (although lists are complex, so some time spent there is unavoidable) (SOLVED IN 939e90d)

TokenizeContext.sliceSerialize throws in sliceChunks if first chunk of token is Code instead of string

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

user@HOST micromark-issue % npm ls micromark
micromark-issue@ /Users/user/micromark-issue
└── [email protected]

user@HOST micromark-issue % cat issue.mjs 
import { parse } from "micromark/lib/parse";
import { postprocess } from "micromark/lib/postprocess";
import { preprocess } from "micromark/lib/preprocess";

function repro(markdown) {
  console.log("trying...");
  const encoding = undefined;
  const end = true;
  const options = undefined;
  const chunks = preprocess()(markdown, encoding, end);
  const parseContext = parse(options).document().write(chunks);
  const events = postprocess(parseContext);
  for (const event of events) {
    const [ _, token, context ] = event;
    context.sliceSerialize(token);
  }
  console.log("ok");
}

repro("Heading\n=======");
repro("\nHeading\n=======");
user@HOST micromark-issue % node issue.mjs 
trying...
ok
trying...
file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:520
      view[0] = view[0].slice(startBufferIndex)
                        ^

TypeError: view[0].slice is not a function
    at sliceChunks (file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:520:25)
    at sliceStream (file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:154:12)
    at Object.sliceSerialize (file:///Users/user/micromark-issue/node_modules/micromark/lib/create-tokenizer.js:149:28)
    at repro (file:///Users/user/micromark-issue/issue.mjs:15:13)
    at file:///Users/user/micromark-issue/issue.mjs:21:1
    at ModuleJob.run (node:internal/modules/esm/module_job:198:25)
    at async Promise.all (index 0)
    at async ESMLoader.import (node:internal/modules/esm/loader:385:24)
    at async loadESM (node:internal/process/esm_loader:88:5)
    at async handleMainPromise (node:internal/modules/run_main:61:12)
user@HOST micromark-issue %

Expected behavior

sliceSerialize should always be safe to call in a manner like the above and should return a meaningful string. The presence of a leading \n in Markdown (for example) should not need to be guarded against by library users.
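The failure mode can be illustrated with a simplified stand-in for the chunk slicing. This is a sketch and not micromark's implementation: `sliceChunksSketch`, its plain index and offset arguments, and the sample chunk array are all assumptions made for illustration. It assumes, as the issue title says, that a chunk is either a string or a numeric code, and it only calls `slice` on the first chunk when that chunk is a string.

```javascript
// Hypothetical sketch, not micromark's code.
// `chunks` mixes strings with numeric codes (for example, a line ending).
function sliceChunksSketch(chunks, startIndex, startOffset, endIndex) {
  const view = chunks.slice(startIndex, endIndex)

  if (view.length > 0 && startOffset > 0) {
    // Only strings can be sliced; a numeric code is a single unit and is kept as-is.
    if (typeof view[0] === 'string') {
      view[0] = view[0].slice(startOffset)
    }
  }

  return view
}

// -4 is used here as a stand-in for a line ending code (the exact value is an assumption).
const chunks = [-4, 'Heading', -4, '=======']
const fromCode = sliceChunksSketch(chunks, 0, 1, 2)
const fromString = sliceChunksSketch(chunks, 1, 4, 2)
```

Without the `typeof` guard, the first call would fail in the same way as the trace above, because numbers have no `slice` method.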

Actual behavior

Exception, see above

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)

regression: link references w/o definition are ignored

Subject of the issue

With the old remark parser, link references that didn't have a corresponding definition were nonetheless detected and converted to mdast.

for example, the following:

> [!NOTE]
> This is a note. Who'd have noted?

used to generate:

{
  "type": "root",
  "children": [
    {
      "type": "blockquote",
      "children": [
        {
          "type": "paragraph",
          "children": [
            {
              "type": "linkReference",
              "identifier": "!note",
              "label": "!NOTE",
              "referenceType": "shortcut",
              "children": [
                {
                  "type": "text",
                  "value": "!NOTE"
                }
              ]
            },
            {
              "type": "text",
              "value": "\nThis is a note. Who'd have noted?"
            }
          ]
        }
      ]
    }
  ]
}

With micromark, the linkReference is not inserted if there is no corresponding definition, and a plain paragraph is generated:

{
  "type": "root",
  "children": [
    {
      "type": "blockquote",
      "children": [
        {
          "type": "paragraph",
          "children": [
            {
              "type": "text",
              "value": "[!NOTE]\nThis is a note. Who’d have noted?"
            }
          ]
        }
      ],
    }
  ],
}

Expected behavior

It should still generate a linkReference node in the mdast, so that the client of the mdast can decide how to handle a missing definition.

Actual behavior

The parser ignores the link reference if no definition is defined.

Strings ending with `\n-` are compiled into a level 2 heading

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

4.0.0

Link to runnable example

https://codesandbox.io/s/trusting-star-wv879z?file=/src/index.mjs

Steps to reproduce

I've created a minimal reproduction of the issue here:

Expected behavior

I'd expect the string to be compiled into a paragraph with the hyphen at the start of the second line.

Actual behavior

The string is compiled into a level 2 heading

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

Incorrect handling of emphasis for Japanese language

Emphasis markup is parsed incorrectly for Japanese language

Your environment

OS: macOS 11.0.1
Packages: [email protected]
Env: yarn 1.22.10

Steps to reproduce

Input (Japanese) :

console.log(micromark("1.  **新規アプリの追加（NEW APP）**を選択します。"));

Output - incorrect:

<ol>
<li>**新規アプリの追加（NEW APP）**を選択します。</li>
</ol>

Expected behavior

Emphasis should be parsed correctly

<ol>
<li><strong>新規アプリの追加（NEW APP）</strong>を選択します。</li>
</ol>

Additional:

The same text in English and Chinese

Input (English):

console.log(micromark("1.  Select **NEW APP** (top-left corner)"));

Output - correct:

<ol>
<li>Select <strong>NEW APP</strong> (top-left corner)</li>
</ol>

Input (Chinese):

console.log(micromark("1.  选择**添加应用**（左上角）"));

Output - correct:

<ol>
<li>选择<strong>添加应用</strong>（左上角）</li>
</ol>
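For context, the difference between the Japanese and Chinese inputs can be reproduced with CommonMark's flanking rules for delimiter runs, under which a `**` run can only close strong emphasis when it is right-flanking. The sketch below is a simplified reading of those rules, not micromark's code; `flanking`, `isWhitespace`, and `isPunctuation` are hypothetical helper names, and the punctuation test (Unicode `P` categories plus ASCII punctuation) is an approximation.

```javascript
// Simplified sketch of CommonMark's flanking test for a delimiter run.
// `before` and `after` are the single characters around the run;
// pass '' for the start or end of a line, which counts as whitespace.
function isWhitespace(char) {
  return char === '' || /^[\t\n\f\r\p{Zs}]$/u.test(char)
}

function isPunctuation(char) {
  // Unicode punctuation categories plus the ASCII punctuation characters.
  return /^(?:\p{P}|[!-\/:-@\[-`{-~])$/u.test(char)
}

function flanking(before, after) {
  const left =
    !isWhitespace(after) &&
    (!isPunctuation(after) || isWhitespace(before) || isPunctuation(before))
  const right =
    !isWhitespace(before) &&
    (!isPunctuation(before) || isWhitespace(after) || isPunctuation(after))
  return {left, right}
}

// Closing `**` in the Japanese input: preceded by `）`, followed by `を`.
const closingJapanese = flanking('）', 'を')
// Closing `**` in the Chinese input: preceded by `用`, followed by `（`.
const closingChinese = flanking('用', '（')
```

Under this reading, the closing run in the Japanese sentence is left-flanking only, because it sits between punctuation and a letter, while the one in the Chinese sentence is right-flanking. That would account for the differing output, whether or not it is the behavior one wants for Japanese text.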

This bug appeared when I switched to [email protected] from [email protected].

The following code works correctly:

import unified from "unified";
import markdown from "remark-parse"; // 8.0.3
import rehype from "remark-rehype"; // 8.0.0
import stringify from "rehype-stringify"; // 8.0.0

unified()
    .use(markdown)
    .use(rehype)
    .use(stringify)
    .process("1.  __新規アプリの追加（NEW APP）__を選択します。", function(err, file) {
        console.log(String(file));
    });

Reduce coupling by using anylogger

Subject of the feature

Reduce coupling and footprint of minified file by using anylogger instead of debug

Problem

Currently, this library has a dependency on debug. Though that is an excellent library, this dependency has 2 major drawbacks:

This library is now forcing debug onto all developers that use this library (high coupling)
debug is 3.1kB minified and gzipped, directly adding 3.1kB to the minimum footprint of this library

Alternatives

Please have a look at anylogger. It's a logging facade specifically designed for libraries. It achieves these goals:

Decouple the library from the underlying logging framework
Reduce the minimal bundle footprint. Anylogger is only 370 bytes.

The decoupling is achieved by only including the minimal facade to allow client code to do logging and using adapters to back that facade with an actual logging framework. The minimal footprint follows naturally from this decoupling as the bulk of the code lives in the adapter.
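The facade-plus-adapter split described here can be sketched in a few lines. This is a generic illustration and explicitly not anylogger's API: `getLogger`, `useBackend`, and the backend signature are hypothetical names chosen for the example.

```javascript
// Hypothetical facade, for illustration only: this is NOT anylogger's API.
let backend = () => {} // default: drop all messages

// Library code asks the facade for a named logger and never imports a framework.
function getLogger(name) {
  return {
    debug: (...args) => backend(name, 'debug', args),
    error: (...args) => backend(name, 'error', args)
  }
}

// Application code (or a test) installs an adapter for its logging framework.
function useBackend(fn) {
  backend = fn
}

// Example adapter that records formatted messages in an array.
const seen = []
useBackend((name, level, args) => seen.push(`${name}:${level}:${args.join(' ')}`))
getLogger('micromark').debug('tokenizing', 'flow')
```

The library only ever ships the few lines of the facade; the weight of the actual logging framework lives behind whatever adapter the application installs.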

There are already adapters for some popular logging frameworks and more adapters can easily be created:

anylogger-console (to use the console instead of some logging framework)
anylogger-debug
anylogger-loglevel
anylogger-log4js
ulog (logger with native anylogger support)

If this library were to switch to anylogger, you could still install debug as a dev-dependency and then require('anylogger-debug') in your tests to have your tests work exactly as they always did, with debug as the logging framework, while still decoupling it from debug for all clients.

Disclaimer: anylogger was written by me so I'm self-advertising here. However I do honestly believe it is the best solution in this situation and anylogger was written specifically to decrease coupling between libraries and logging frameworks because for any large application, devs typically end up with multiple loggers in their application because some libraries depend on debug, others on loglevel, yet others on log4js and so on. This hurts bundle size badly as we add multiple kB of logging libraries to it.

micromark preserves control characters where commonmark does not

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

3.0.5

Link to runnable example

https://stackblitz.com/edit/node-aaphim?file=index.js

Steps to reproduce

import { micromark } from "micromark";
import { Parser, HtmlRenderer } from "commonmark";
import rehypeParse from "rehype-parse";
import { unified } from "unified";
import { visit } from "unist-util-visit";
import lodash from "lodash";

const reader = new Parser();
const writer = new HtmlRenderer();
function scrubber(tree) {
  visit(tree, function (node) {
    node.data = undefined;
    node.value = undefined;
    node.position = undefined;
  });

  return tree;
}

const commonmark = (buf) => writer.render(reader.parse(buf));

const content = ``;

const micromarkHtml = micromark(content, {
  allowDangerousHtml: true,
  allowDangerousProtocol: true,
}).trim();
const commonmarkHtml = commonmark(content).trim();

const micromarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(micromarkHtml)
);
const commonmarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(commonmarkHtml)
);

console.log("micromark");
console.log(micromarkHtml);
console.log("");
console.log(JSON.stringify(micromarkHtmlAst, null, 4));
console.log("");
console.log("commonmark");
console.log(commonmark(content));
console.log("");
console.log(JSON.stringify(commonmarkHtmlAst, null, 2));
console.log(lodash.isEqual(micromarkHtmlAst, commonmarkHtmlAst));

📓 the character in content is U+000C

Expected behavior

<p></p>

with the structure

{
  "type": "root",
  "children": [
    {
      "type": "element",
      "tagName": "p",
      "properties": {},
      "children": []
    }
  ]
}

Actual behavior

micromark keeps the space

<p>
</p>

changing the structure of the document

{
    "type": "root",
    "children": [
        {
            "type": "element",
            "tagName": "p",
            "properties": {},
            "children": [
                {
                    "type": "text"
                }
            ]
        }
    ]
}

Runtime

Node v16

Package manager

npm v7

OS

Linux

Build and bundle tools

No response

control character and punctuation cause extra emphasis to appear

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

3.0.5

Link to runnable example

https://stackblitz.com/edit/node-njevp4?file=index.js

Steps to reproduce

import { micromark } from 'micromark';
import { Parser, HtmlRenderer } from 'commonmark';
import rehypeParse from 'rehype-parse';
import { unified } from 'unified';
import { visit } from 'unist-util-visit';
import lodash from 'lodash';

const reader = new Parser();
const writer = new HtmlRenderer();
function scrubber(tree) {
  visit(tree, function (node) {
    node.data = undefined;
    node.value = undefined;
    node.position = undefined;
  });

  return tree;
}

const commonmark = (buf) => writer.render(reader.parse(buf));

const content = `example*�.*example example**`;

const micromarkHtml = micromark(content, {
  allowDangerousHtml: true,
  allowDangerousProtocol: true,
}).trim();
const commonmarkHtml = commonmark(content).trim();

const micromarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(micromarkHtml)
);
const commonmarkHtmlAst = scrubber(
  unified().use(rehypeParse, { fragment: true }).parse(commonmarkHtml)
);

console.log('micromark');
console.log(micromarkHtml);
console.log('');
console.log(JSON.stringify(micromarkHtmlAst, null, 4));
console.log('');
console.log('commonmark');
console.log(commonmark(content));
console.log('');
console.log(JSON.stringify(commonmarkHtmlAst, null, 2));
console.log(lodash.isEqual(micromarkHtmlAst, commonmarkHtmlAst));

Expected behavior

single emphasis in the document

<p>example*�.<em>example example</em>*</p>

with the HTML structure

{
  "type": "root",
  "children": [
    {
      "type": "element",
      "tagName": "p",
      "properties": {},
      "children": [
        {
          "type": "text"
        },
        {
          "type": "element",
          "tagName": "em",
          "properties": {},
          "children": [
            {
              "type": "text"
            }
          ]
        },
        {
          "type": "text"
        }
      ]
    }
  ]
}

Actual behavior

extra emphasis is added

<p>example<em>�.<em>example example</em></em></p>

changing the structure

{
    "type": "root",
    "children": [
        {
            "type": "element",
            "tagName": "p",
            "properties": {},
            "children": [
                {
                    "type": "text"
                },
                {
                    "type": "element",
                    "tagName": "em",
                    "properties": {},
                    "children": [
                        {
                            "type": "text"
                        },
                        {
                            "type": "element",
                            "tagName": "em",
                            "properties": {},
                            "children": [
                                {
                                    "type": "text"
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

Runtime

Node v16

Package manager

npm v7

OS

Linux

Build and bundle tools

No response

Error - [webpack] 'dist': ./node_modules/micromark-util-decode-numeric-character-reference/index.js 23:11 Module parse failed: Identifier directly after number

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

using node 18.17.1

Link to runnable example

No response

Steps to reproduce

I am using botframework-webchat, and when I try to build it, the error message below pops up.

Error - [webpack] 'dist':
./node_modules/micromark-util-decode-numeric-character-reference/index.js 23:11
Module parse failed: Identifier directly after number (23:11)
You may need an appropriate loader to handle this file type, currently no loaders are configured to process this file. See https://webpack.js.org/concepts#loaders
| code > 126 && code < 160 ||
| // Lone high surrogates and low surrogates.
> code > 55_295 && code < 57_344 ||
| // Noncharacters.
| code > 64_975 && code < 65_008 || /* eslint-disable no-bitwise */
@ ./node_modules/mdast-util-from-markdown/lib/index.js 138:0-97 1061:14-45
@ ./node_modules/mdast-util-from-markdown/index.js
@ ./node_modules/botframework-webchat/lib/markdown/private/iterateLinkDefinitions.js
@ ./node_modules/botframework-webchat/lib/markdown/renderMarkdown.js
@ ./node_modules/botframework-webchat/lib/index.js
@ ./lib/extensions/chatbotExtension/renderer/Chatbot.js
@ ./lib/extensions/chatbotExtension/renderer/ChatbotPanel.js
@ ./lib/extensions/chatbotExtension/ChatbotExtensionApplicationCustomizer.js
./node_modules/micromark-util-sanitize-uri/index.js 86:22

Expected behavior

The package should build successfully.

Actual behavior

Currently it gives an error when running npm build.

Runtime

Node v16

Package manager

npm v7

OS

Windows

Build and bundle tools

Webpack

hard break at the end of a paragraph is not properly parsed

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark

Link to runnable example

https://codesandbox.io/s/thirsty-fire-1jdcgn

Steps to reproduce

parse this:

# Trailing hard-break

This break is properly detected\
yes?

But a trailing break is not\



What's worse, it leaves a stray `\`

Checking GitHub's behaviour, it's the same :-( but of course this is rather unfortunate, and it is difficult to find a workaround.

Expected behavior

<h1>Trailing hard-break</h1>
<p>This break is properly detected<br />
yes?</p>
<p>But a trailing break is not<br /></p>
<p>What's worse, it leaves a stray <code>\</code></p>

Actual behavior

<h1>Trailing hard-break</h1>
<p>This break is properly detected<br />
yes?</p>
<p>But a trailing break is not\</p>
<p>What's worse, it leaves a stray <code>\</code></p>

Runtime

Node v14

Package manager

npm v7

OS

macOS

Build and bundle tools

No response

Attention misnests tokens

I have been wishing to write a (simple and lightweight) spec‐compliant editor for Markdown with syntax highlighting for a while now.

Now that this library has become usable (and it seems to be the first of its kind), I have finally gotten an opportunity to write a simple editor with it! (Thank you! 🎉)

Unfortunately, there appears to be a bug in the library! The issue I’m running into is that emphases marked with *** (both regular and strong) have their tokens misnested.

I have written a simple program to demonstrate what I mean:

simple reproduction example

import parser from "https://dev.jspm.io/[email protected]/lib/parse.js"
import preprocessor from "https://dev.jspm.io/[email protected]/lib/preprocess.js"
import postprocessor from "https://dev.jspm.io/[email protected]/lib/postprocess.js"

let preprocess = txt =>
{
	let write = preprocessor()
	return [...write(txt), ...write(null)]
}

let parse = text => postprocessor()(preprocess(text).flatMap(parser().document().write))

let tokens = parse("hello ***world***")
tokens.pop()

let output = ""

let i = 0
let offset
for (let [kind, {type, start, end}] of tokens)
{
	let char = "→"
	if (kind === "enter") offset = start.offset
	else offset = end.offset, i--, char = "←"
	output += `${" ".repeat(i*3) + char} ${type} at ${offset}\n`
	if (kind === "enter") i++
}

console.log(output)

(Note: I’m using dev.jspm.io for now, as opposed to jspm.dev, because jspm.dev bundles the whole library into its index file, as opposed to separating it into multiple files. See more info on jspm.dev’s announcement post)

Currently, the output is the following:

current output

→ content at 0
   → paragraph at 0
      → data at 0
      ← data at 5
      → data at 5
      ← data at 6
      → emphasis at 8
         → emphasisSequence at 8
         ← emphasisSequence at 9
         → emphasisText at 9
            → strong at 6
               → strongSequence at 6
               ← strongSequence at 8
               → strongText at 8
                  → data at 9
                  ← data at 14
               ← strongText at 15
               → strongSequence at 15
               ← strongSequence at 17
            ← strong at 17
         ← emphasisText at 14
         → emphasisSequence at 14
         ← emphasisSequence at 15
      ← emphasis at 15
   ← paragraph at 17
← content at 17

As you can see, when moving from → emphasisText at 9 to → strong at 6 (as well as in other places), the indices go down, which is unexpected. This causes my highlighter to break! 😱
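A small check makes the problem easy to spot in any event list. This is a sketch that assumes the same `[kind, token]` event shape the reproduction above destructures; `findOffsetRegressions` is a hypothetical helper name, and the sample events are a hand-copied excerpt of the output, not real parser output.

```javascript
// Hypothetical helper: list every event whose offset is lower than the previous event's.
function findOffsetRegressions(events) {
  const problems = []
  let previous = -1

  for (const [kind, token] of events) {
    // Enter events are positioned at the token start, exit events at its end.
    const offset = kind === 'enter' ? token.start.offset : token.end.offset
    if (offset < previous) {
      problems.push({kind, type: token.type, offset, previous})
    }
    previous = offset
  }

  return problems
}

// Hand-written excerpt mirroring the output above.
const sample = [
  ['enter', {type: 'emphasisText', start: {offset: 9}, end: {offset: 14}}],
  ['enter', {type: 'strong', start: {offset: 6}, end: {offset: 17}}],
  ['exit', {type: 'strong', start: {offset: 6}, end: {offset: 17}}]
]
const regressions = findOffsetRegressions(sample)
```

For properly nested tokens the result should be an empty array; here it reports the step from `emphasisText` at 9 back to `strong` at 6.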

Thanks in advance for the attention!

How far to buffer?

Markdown consists of blocks and inlines. Blocks are parsed per line.

Typically, at a certain point in a line, you know you’re right: take this ATX heading:

###### A heading

When standing on the space, you know you’re in a heading: it can’t be anything else. So ATX headings don’t really need to buffer a lot: at most 6 characters.
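As a sketch of how little lookahead this needs: the function below decides whether a line opens an ATX heading by examining at most seven characters (up to six `#` markers plus the character after them). `atxHeadingOpening` is a hypothetical name, and the sketch deliberately ignores leading indentation and closing sequences to stay short.

```javascript
// Hypothetical sketch: detect an ATX heading opening with bounded lookahead.
// Leading indentation and closing `#` sequences are ignored for brevity.
function atxHeadingOpening(line) {
  let rank = 0

  // Count `#` markers, but stop as soon as a seventh one is seen.
  while (rank < 7 && line[rank] === '#') rank++

  // No marker at all, or too many: not a heading.
  if (rank === 0 || rank > 6) return null

  // The sequence must be followed by a space, a tab, or the end of the line.
  const next = line[rank]
  if (next === undefined || next === ' ' || next === '\t') {
    return {rank, examined: Math.min(line.length, rank + 1)}
  }

  return null
}

const heading = atxHeadingOpening('###### A heading')
const notHeading = atxHeadingOpening('#hashtag')
const tooDeep = atxHeadingOpening('####### seven markers')
```

However long the rest of the line is, the decision is made without reading it, which is what keeps the buffer small for this construct.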

Other constructs need more, like this link definition:

[take]:
https://this-link-definition
'asd
> block quote?
asd
asd
asd
asd

Only at the last character, the line feed without a closing title marker before it, do you know you need to backtrack, and parse the whole thing again. And it isn’t all a paragraph either, take for example the embedded > block quote?

An alternative example that needs to buffer an unbounded number of lines is indented code:

␠␠␠␠this is a chunk (a properly indented non-blank line)
␠␠␠
␠␠
␠
␠␠
␠␠␠
␠␠␠␠
␠␠␠␠␠
␠␠␠␠
␠␠␠
␠␠
␠
␠␠␠
<-- And only here do we know the blank lines are not part of the indented code. Note that the line endings, and more than four spaces in a blank line, still show up in the code, so if we had another chunk, all the above line endings and that one extra space would be there.

🤔 So how far does one buffer? These are edge cases, not common in normal Markdown. But it could be interesting to see if we can cap this to reduce a potential memory problem.
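The indented code case can be sketched as follows, to show where the unbounded buffer comes from. This is an illustration under simplifying assumptions (spaces only, no tabs, raw lines kept as-is, input starting at the first chunk); `collectIndentedCode` is a hypothetical name and not how micromark does it.

```javascript
// Hypothetical sketch: blank lines after a chunk are buffered until
// the next non-blank line decides whether they belong to the code.
function collectIndentedCode(lines) {
  const code = []
  let pendingBlank = [] // blank lines that cannot be classified yet

  for (const line of lines) {
    const blank = line.trim() === ''

    if (!blank && line.startsWith('    ')) {
      // Another chunk: the buffered blank lines turn out to be part of the code.
      code.push(...pendingBlank, line)
      pendingBlank = []
    } else if (blank && code.length > 0) {
      // Undecided: keep buffering, with no upper bound.
      pendingBlank.push(line)
    } else {
      // Anything else (a non-indented line, or a blank line before any chunk) ends the scan.
      break
    }
  }

  // Blank lines still pending at the end were never part of the code.
  return {code, droppedBlankLines: pendingBlank.length}
}

const result = collectIndentedCode(['    a', '  ', '', '    b', '   ', 'para'])
```

A cap on `pendingBlank` would bound memory, at the cost of deciding what happens to blank lines beyond the cap.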

3.0.8 seems to introduce a module level dependency on document

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

3.0.8

Link to runnable example

No response

Steps to reproduce

I'm using Micromark in an astro project, and ever since installing micromark 3.0.8 I get this error:

[15:04:05] [snowpack] + [email protected]
[build] Unable to render src/pages/renew/checkout.astro

ReferenceError: document is not defined

While this might be snowpack being finicky, [email protected] works totally fine! So just wondering if anything has been introduced which could cause it.

Expected behavior

My project builds.

Actual behavior

[15:04:05] [snowpack] + [email protected]
[build] Unable to render src/pages/renew/checkout.astro

ReferenceError: document is not defined

Runtime

Node v16

Package manager

yarn v1

OS

macOS

Build and bundle tools

Snowpack

Split code into several packages, use export maps and conditions

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Subject

Split code in several packages, use export maps and condition

Problem

Given that:

We have instrumented development code (with assertions and more verbose code such as using codes.greaterThan instead of the actual character code) and optimized production code, that is currently split into dist/ or lib/ respectively
Many of the internals (such as constants like codes, values, constants, but also the utilities on detecting characters, the factories in tokenize/factory-, or the tiny things in lib/util/) are useful in micromark extensions (or inverted: many of the extensions currently use micromark’s internals)

Solution

I propose:

Making micromark/micromark a monorepo that houses a couple of projects
Using micromark-factory-* as a namespace in the ecosystem for factories: some housed in the monorepo (how to parse a label), some in their own repos in this org (how micromark-extension-directive parses HTML attributes or micromark-extension-expression parses JavaScript), yet some others in the ecosystem
As we already have ESM, combine several files into one exported file, that uses named export to expose their functions. For example:
- micromark-core-character would expose all the ascii*, unicode*, and markdown* functions currently in micromark/lib/character
- micromark-core-constant would expose codes, constants, values, types, html-block-names, html-raw-names
Create a small rollup config file or wrapper that takes a dev/ folder which houses a micromark extension/factory/core, and builds a prod/ folder from it, copying types, inlining constants, and removing assertions
Use export maps with conditions (see endorsed ones) set to either development / production / default (same as prod I guess)
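As a rough illustration of the last point, the sketch below mimics how a conditional export map could be resolved: the first key that is either active or `default` wins. `resolveExport`, the `exportsField` object, and the file paths are hypothetical. This is a simplification and not Node's full resolution algorithm.

```javascript
// Hypothetical sketch of conditional export resolution; paths are made up.
function resolveExport(target, activeConditions) {
  if (typeof target === 'string') return target

  // Keys are tried in the order they are written in the map.
  for (const [condition, value] of Object.entries(target)) {
    if (condition === 'default' || activeConditions.has(condition)) {
      const resolved = resolveExport(value, activeConditions)
      if (resolved !== undefined) return resolved
    }
  }

  return undefined
}

// What the `exports` field of a package could look like under this proposal.
const exportsField = {
  development: './dev/index.js',
  default: './index.js'
}

console.log(resolveExport(exportsField, new Set(['development'])))
console.log(resolveExport(exportsField, new Set(['node', 'import'])))
```

With such a map, a consumer would opt in to the instrumented code by enabling the `development` condition (for example through Node's `--conditions` flag), and everyone else would get the optimized build.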

Alternatives

images without alt should not generate alt attribute with empty string

Subject of the issue

see

micromark/test/io/text/image.mjs

Line 152 in 63cf514

'<img src="example.png" alt="" />',

I don't know if this is really a bad thing, but the new behaviour of micromark is to generate an empty string for the alt attribute, whereas the old remark-parser used to set the alt property of the mdast node to null.

There is a slight distinction: an image with an empty alt text is considered a decorative image and should be ignored by a screen reader. If the alt attribute is missing, it will just read the src (not a brilliant behaviour, either :-)

https://www.w3.org/WAI/tutorials/images/decorative/

In any case, the new behaviour allows the author to specify decorative images in markdown by default, which wasn't possible before.

Expected behavior

Not sure. But for backward compatibility's sake, a Markdown image without alt text should not create an alt attribute in HTML (the mdast node's property should be null).
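To illustrate the distinction being asked for, here is a small sketch of a serializer that treats a null alt differently from an empty one. `imageToHtml` and `escapeAttribute` are hypothetical helpers, not part of micromark or mdast utilities. The node shape (`url`, `alt`) is assumed for the example.

```javascript
// Hypothetical sketch: null alt omits the attribute, empty alt keeps it.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
}

function imageToHtml(node) {
  const attributes = ['src="' + escapeAttribute(node.url) + '"']

  // Only emit alt when the author provided one, even an empty one.
  if (node.alt !== null && node.alt !== undefined) {
    attributes.push('alt="' + escapeAttribute(node.alt) + '"')
  }

  return '<img ' + attributes.join(' ') + ' />'
}

console.log(imageToHtml({url: 'example.png', alt: ''}))
console.log(imageToHtml({url: 'example.png', alt: null}))
```

The first call keeps `alt=""` (a decorative image), the second leaves the attribute out entirely.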

`micromark-util-symbol` can not be imported by typescript

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

Try to import micromark-util-symbol in TypeScript.

Expected behavior

no type error reporting

Actual behavior

get error: Cannot find module 'micromark-util-symbol' or its corresponding type declarations.ts(2307)

Possible Solution

add a field in package.json:

  "types": "./lib/default.d.ts",

Runtime

No response

Package manager

No response

OS

No response

Build and bundle tools

No response

Reduce execution time by ~11% with a simple reimplementation of TokenizeContext.now

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

latest main branch

Link to runnable example

No response

Steps to reproduce

I ran a profile of micromark and noticed TokenizeContext.now was something like the 4th most time-consuming function. It’s quite a simple function as all it does is return a copy of point. I tried a couple of alternate implementations and found one that reduces runtime by ~11% on an Apple Silicon M1 (Mac mini). I imagine the delta is different on other hardware, so maybe other folks can give this a try on their hardware? That said, I expect this change will be more efficient everywhere because it avoids a call into Object.assign and tells the JIT exactly what needs to be done. All tests pass with this change; you can see the code change and minimal test harness here: main...DavidAnson:micromark:TokenizeContext-now.

I added a scenario to perf.js that reads the content of readme.md and calls micromark 500 times. This input seems fairly representative, but I’m happy if folks want to profile on something else. The numbers below are pretty stable, so I only took three samples before/after.

The 3 readings I did before changing anything: ((17.726 + 17.786 + 17.606) / 3) = 17.706s

The 3 readings I did after making the change: ((15.75 + 15.676 + 15.688) / 3) = 15.705s

By my math, the time eliminated is: ((17.706 - 15.705) / 17.706) = 0.1130 = 11.30%

To be sure, the alternate implementation I propose here violates the encapsulation of Point - but a simple test case could be added to ensure any future changes to Point are accommodated.

I can send a proper PR if folks are open to this change.
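For readers who do not want to open the branch, the idea can be sketched like this. It is not the actual diff: the `point` object and its field names (`line`, `column`, `offset`, `_index`, `_bufferIndex`) are assumptions about the tokenizer's internal state, and `sameShape` is a hypothetical stand-in for the guard test mentioned above.

```javascript
// Hypothetical stand-in for the tokenizer's internal point (assumed fields).
const point = {line: 1, column: 1, offset: 0, _index: 0, _bufferIndex: -1}

// Generic copy: goes through Object.assign.
function nowWithAssign() {
  return Object.assign({}, point)
}

// Explicit copy: spells out every field so the engine sees a fixed shape.
function nowWithLiteral() {
  const {line, column, offset, _index, _bufferIndex} = point
  return {line, column, offset, _index, _bufferIndex}
}

// Guard: fails if a field is ever added to point but not to the literal.
function sameShape(a, b) {
  const left = JSON.stringify(Object.keys(a).sort())
  const right = JSON.stringify(Object.keys(b).sort())
  return left === right
}

console.log(sameShape(nowWithAssign(), nowWithLiteral()))
```

The trade-off is the one described above: the literal version has to know every field of the point, so a shape check like `sameShape` is what keeps the two in sync.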

Expected behavior

N/A

Actual behavior

N/A

Runtime

Node v16

Package manager

npm v7

OS

macOS

Build and bundle tools

No response

Implementation of autolink and literalAutolink (micromark-extension-gfm-autolink-literal) are inconsistent when handling "@."

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

micromark 4.0.0, micromark-extension-gfm-autolink-literal 2.0.0

Link to runnable example

No response

Steps to reproduce

[email protected]

<[email protected]>

<[email protected]>

Expected behavior

Consistent treatment of [email protected] by autolink and literalAutolink.

Actual behavior

[email protected] and <[email protected]> are both emitted as literalAutolink. Expected behavior is observed for <[email protected]> which is emitted as autolink.

This is significant for a linter which can be confused by the current behavior into adding infinite <> wrappers attempting to turn [email protected] from literalAutolink into autolink: DavidAnson/markdownlint#1140

I propose that <[email protected]> should be treated as autolink, which is seemingly possible if emailAtSignOrDot behaved differently:

micromark/packages/micromark-core-commonmark/dev/lib/autolink.js

Lines 203 to 205 in 8b08d16

    
function emailAtSignOrDot(code) {
  return asciiAlphanumeric(code) ? emailLabel(code) : nok(code)
}

The micromark tokens (when using micromark-extension-gfm-autolink-literal) for parsing the above Markdown are:

content [email protected]
  paragraph [email protected]
    literalAutolink [email protected]
      literalAutolinkEmail [email protected]
lineEnding \n
lineEndingBlank \n
content <[email protected]>
  paragraph <[email protected]>
    data <
    literalAutolink [email protected]
      literalAutolinkEmail [email protected]
    data >
lineEnding \n
lineEndingBlank \n
content <[email protected]>
  paragraph <[email protected]>
    autolink <[email protected]>
      autolinkMarker <
      autolinkEmail [email protected]
      autolinkMarker >
lineEnding \n

Runtime

Node v16

Package manager

npm v6

OS

macOS

Build and bundle tools

Webpack

Configure collapsing newlines into a single paragraph

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Problem

I want to have several paragraphs like this:

I am a paragraph.
I am part of the same paragraph.

But I am a new paragraph.

This is compiled to the following:

I am a paragraph. I am part of the same paragraph. But I am a new paragraph.

Solution

I'd expect the following result:

I am a paragraph. I am part of the same paragraph.

But I am a new paragraph.
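The expected grouping can be sketched in a few lines. This is only an illustration of the behaviour described here, not how micromark parses paragraphs: `toParagraphs` is a hypothetical helper that treats blank lines as paragraph separators and joins the remaining lines of each block with a space.

```javascript
// Hypothetical sketch: blank lines split paragraphs, single newlines do not.
function toParagraphs(markdown) {
  return markdown
    .split(/\n(?:[ \t]*\n)+/) // One or more blank lines end a paragraph.
    .map((block) =>
      block
        .split('\n')
        .map((line) => line.trim())
        .join(' ') // A single line ending stays inside the paragraph.
        .trim()
    )
    .filter((block) => block.length > 0)
}

const input =
  'I am a paragraph.\nI am part of the same paragraph.\n\nBut I am a new paragraph.'

console.log(toParagraphs(input))
```

If the actual output has everything in one paragraph, it may be worth checking whether the blank line survives in the string that reaches the parser.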

Alternatives

I could use the  tag manually in the Markdown.
