
commonmark.js's Introduction

CommonMark

CommonMark is a rationalized version of Markdown syntax, with a spec and BSD-licensed reference implementations in C and JavaScript.

Try it now!

For more details, see https://commonmark.org.

This repository contains the spec itself, along with tools for running tests against the spec, and for creating HTML and PDF versions of the spec.

The reference implementations live in separate repositories: cmark (C) and commonmark.js (JavaScript).

There is a list of third-party libraries in a dozen different languages here.

Running tests against the spec

The spec contains over 500 embedded examples which serve as conformance tests. To run the tests using an executable $PROG:

python3 test/spec_tests.py --program $PROG

If you want to extract the raw test data from the spec without actually running the tests, you can do:

python3 test/spec_tests.py --dump-tests

and you'll get all the tests in JSON format.

JavaScript developers may find it more convenient to use the commonmark-spec npm package, which is published from this repository. It exports an array tests of JSON objects with the format

{
  "markdown": "Foo\nBar\n---\n",
  "html": "<h2>Foo\nBar</h2>\n",
  "section": "Setext headings",
  "number": 65
}
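
For instance, a minimal script along these lines (a sketch only; it assumes commonmark.js is installed as the implementation under test, but any string-to-HTML renderer could be substituted) can run every example:

// Sketch: commonmark.js as the implementation under test is an assumption.
var tests = require('commonmark-spec').tests;
var commonmark = require('commonmark');

var parser = new commonmark.Parser();
var renderer = new commonmark.HtmlRenderer();

// An example passes when rendering its markdown reproduces its html field exactly.
var failures = tests.filter(function (t) {
  return renderer.render(parser.parse(t.markdown)) !== t.html;
});

console.log(failures.length + ' of ' + tests.length + ' examples failed');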

The spec

The source of the spec is spec.txt. This is basically a Markdown file, with code examples written in a shorthand form:

```````````````````````````````` example
Markdown source
.
expected HTML output
````````````````````````````````

To build an HTML version of the spec, do make spec.html. To build a PDF version, do make spec.pdf. For both versions, you must have the lua rock lcmark installed: after installing lua and luarocks, run luarocks install lcmark. For the PDF you must also have xelatex installed.

The spec is written from the point of view of the human writer, not the computer reader. It is not an algorithm---an English translation of a computer program---but a declarative description of what counts as a block quote, a code block, and each of the other structural elements that can make up a Markdown document.

Because John Gruber's canonical syntax description leaves many aspects of the syntax undetermined, writing a precise spec requires making a large number of decisions, many of them somewhat arbitrary. In making them, we have appealed to existing conventions and considerations of simplicity, readability, expressive power, and consistency. We have tried to ensure that "normal" documents in the many incompatible existing implementations of Markdown will render, as far as possible, as their authors intended. And we have tried to make the rules for different elements work together harmoniously. In places where different decisions could have been made (for example, the rules governing list indentation), we have explained the rationale for our choices. In a few cases, we have departed slightly from the canonical syntax description, in ways that we think further the goals of Markdown as stated in that description.

For the most part, we have limited ourselves to the basic elements described in Gruber's canonical syntax description, eschewing extensions like footnotes and definition lists. It is important to get the core right before considering such things. However, we have included a visible syntax for line breaks and fenced code blocks.

Differences from original Markdown

There are only a few places where this spec says things that contradict the canonical syntax description:

  • It allows all punctuation symbols to be backslash-escaped, not just the symbols with special meanings in Markdown. We found that it was just too hard to remember which symbols could be escaped.

  • It introduces an alternative syntax for hard line breaks, a backslash at the end of the line, supplementing the two-spaces-at-the-end-of-line rule. This is motivated by persistent complaints about the “invisible” nature of the two-space rule.

  • Link syntax has been made a bit more predictable (in a backwards-compatible way). For example, Markdown.pl allows single quotes around a title in inline links, but not in reference links. This kind of difference is really hard for users to remember, so the spec allows single quotes in both contexts.

  • The rule for HTML blocks differs, though in most real cases it shouldn't make a difference. (See the section on HTML Blocks for details.) The spec's proposal makes it easy to include Markdown inside HTML block-level tags, if you want to, but also allows you to exclude this. It also makes parsing much easier, avoiding expensive backtracking.

  • It does not collapse adjacent bird-track blocks into a single blockquote:

    > these are two
    
    > blockquotes
    
    > this is a single
    >
    > blockquote with two paragraphs
    
  • Rules for content in lists differ in a few respects, though (as with HTML blocks), most lists in existing documents should render as intended. There is some discussion of the choice points and differences in the subsection of List Items entitled Motivation. We think that the spec's proposal does better than any existing implementation in rendering lists the way a human writer or reader would intuitively understand them. (We could give numerous examples of perfectly natural looking lists that nearly every existing implementation flubs up.)

  • Changing bullet characters, or changing from bullets to numbers or vice versa, starts a new list. We think that is almost always going to be the writer's intent.

  • The number that begins an ordered list item may be followed by either . or ). Changing the delimiter style starts a new list.

  • The start number of an ordered list is significant.

  • Fenced code blocks are supported, delimited by either backticks (```) or tildes (~~~).

Contributing

There is a forum for discussing CommonMark; you should use it instead of GitHub issues for questions and possibly open-ended discussions. Use the GitHub issue tracker only for simple, clear, actionable issues.

Authors

The spec was written by John MacFarlane, drawing on

  • his experience writing and maintaining Markdown implementations in several languages, including the first Markdown parser not based on regular expression substitutions (pandoc) and the first markdown parsers based on PEG grammars (peg-markdown, lunamark)
  • a detailed examination of the differences between existing Markdown implementations using BabelMark 2, and
  • extensive discussions with David Greenspan, Jeff Atwood, Vicent Marti, Neil Williams, and Benjamin Dumke-von der Ehe.

Since the first announcement, many people have contributed ideas. Kārlis Gaņģis was especially helpful in refining the rules for emphasis, strong emphasis, links, and images.

commonmark.js's Issues

tests for AST tree

Do you remember my topic about commonmark API design?

I'm in the process of figuring out how to do it in the best way. As a proof of concept, I chose to create a helper module. One of its methods is ast(), but I have no idea how to test it properly, and I didn't find any tests related to the AST in commonmark.js either.

At first I thought about deepEqual, but it failed due to circular references; then I tried a simple equal with JSON.stringify(), but the circular references broke everything there too.

How can I verify that the AST has the proper structure? Any advice or tip would be useful, thanks.
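
One possible approach (my suggestion, not something from the original thread): flatten the tree with the walker into plain objects, so the comparison never touches the circular parent/child links.

var commonmark = require('commonmark');

// Turn the AST into an array of plain {entering, type, literal} records.
function flatten(root) {
  var walker = root.walker();
  var out = [];
  var event;
  while ((event = walker.next())) {
    out.push({
      entering: event.entering,
      type: event.node.type,
      literal: event.node.literal
    });
  }
  return out;
}

var parser = new commonmark.Parser();
var ast = flatten(parser.parse('Hello *world*'));
// `ast` is now safe to feed to deepEqual or JSON.stringify.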

Fuse/merge adjacent text nodes

Currently, the AST returned by commonmark.js can contain multiple adjacent text nodes. E.g. the following (dingus):

https://www.google.com/?q=foo_bar

Results in this AST:

<document>
  <paragraph>
    <text>https://www.google.com/?q=foo</text>
    <text>_</text>
    <text>bar</text>
  </paragraph>
</document>

For uses that require post-processing the AST before rendering (e.g. autolinking plain URLs), this makes it a little bit more difficult, because adjacent text nodes may have to be merged first.

Could this be implemented in the inline parser directly, so that the resulting AST never contains adjacent text nodes?

This isn't really a bug report; it's more of a discussion starter, and a way to hear your thoughts about this. I'm thinking about how to implement auto-linking in my implementation of CommonMark, and post-processing might be a good option.
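
For what it's worth, a post-processing sketch along those lines (not the parser-level change asked about above) could fuse adjacent text nodes after parsing:

// Collect the text nodes first, then merge, so unlinking never disturbs the walker.
// Node type is 'text' in recent commonmark.js releases; older releases used 'Text'.
function mergeTextNodes(root) {
  var walker = root.walker();
  var textNodes = [];
  var event;
  while ((event = walker.next())) {
    if (event.entering && event.node.type === 'text') {
      textNodes.push(event.node);
    }
  }
  textNodes.forEach(function (node) {
    while (node.next && node.next.type === 'text') {
      node.literal += node.next.literal;
      node.next.unlink();
    }
  });
}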

Inconsistent handling of malformed link reference titles

Consider these two samples. Both have malformed link titles.

[foo]: /url
"title

[foo]

http://spec.commonmark.org/dingus/?text=%5Bfoo%5D%3A%20%2Furl%0A%22title%0A%0A%5Bfoo%5D

and

[foo]: /url
"title" ok

[foo]

http://spec.commonmark.org/dingus/?text=%5Bfoo%5D%3A%20%2Furl%0A%22title%22%20ok%0A%0A%5Bfoo%5D

The first sample creates a link reference, but the second one doesn't. I believe they should both be treated the same way, i.e., as links in both cases.

Fix url normalizer

Discussed here commonmark/commonmark-spec#270

My last attempt to use url for an honest parse caused tons of broken tests (see commonmark/commonmark-spec#270 (comment)):

  1. It always adds a missing / after the domain name (http://example.com?abc -> http://example.com/?abc)
  2. It replaces \ with / in the query string http://example.com?abc nodejs/node-v0.x-archive@f7ede33

We need to decide what to do with the tests, spec, and implementation. I've stopped working on this issue until a direction is given. The code is available in a separate branch https://github.com/markdown-it/markdown-it/commits/normalize (last commit).

Example:

node -e "console.log(require('url').parse('http://example.com?foo'))"

=>

{ protocol: 'http:',
  slashes: true,
  auth: null,
  host: 'example.com',
  port: null,
  hostname: 'example.com',
  hash: null,
  search: '?foo',
  query: 'foo',
  pathname: '/',
  path: '/?foo',
  href: 'http://example.com/?foo' }

Note the href field in the output above.

Parser adds lines to a tip which can't accept them

Here's an example which exhibits this behavior:

10. Bullet

        code


Test

The issue occurs when parsing line 5. As you can see, it checks whether the container (a CodeBlock) accepts lines, but then adds the line to the tip instead (which is a Document):

(screenshots of the relevant parser code omitted)

According to the comments on lines 117-118, the tip should be checked to see if it handles lines. Would it therefore be true that line 723 should check the tip type instead of the container type? Or is there perhaps an issue with the tip being out-of-sync with the container?

AST confusion

Given the following script:

{Parser} = require 'commonmark'

text = """
# test

some text

- list item one
- list item two
- list item three

more text
"""

parser = new Parser()
walker = parser.parse(text).walker()
console.log('''
hasNext | hasPrev | type | literal
------- | ------- | ---- | -------
''')
while event = walker.next()
  node = event.node
  console.log(
    [
      node.next isnt null
      node.prev isnt null
      node.type
      node.literal
    ].join(' | ')
  )

I get the following output:

hasNext hasPrev type literal
false false Document
true false Header
false false Text test
true false Header
true true Paragraph
false false Text some text
true true Paragraph
true true List
true false Item
false false Paragraph
false false Text list item one
false false Paragraph
true false Item
true true Item
false false Paragraph
false false Text list item two
false false Paragraph
true true Item
false true Item
false false Paragraph
false false Text list item three
false false Paragraph
false true Item
true true List
false true Paragraph
false false Text more text
false true Paragraph
false false Document

Which has a couple problems:

  • Child nodes are returned without using the walker on each node (which makes the next/prev values really confusing)
  • Container nodes are repeated (like the Header-Text-Header and Paragraph-Text-Paragraph sets). This would make sense if the AST were flat and didn't show child/parent nodes (meaning that the duplicate nodes represent start/end HTML tags). However, there's no indication of which entries are start or end tags, and the AST isn't flat. (See the sketch below.)
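
For reference, the walker reports one entering and one leaving event for every container node, which is why Header and Paragraph appear twice in the output above. A small JavaScript sketch (my addition, not from the original report) that filters on event.entering prints each node once:

var commonmark = require('commonmark');

var walker = new commonmark.Parser().parse('# test\n\nsome text\n').walker();
var event;
while ((event = walker.next())) {
  if (event.entering) {
    // literal is null for container nodes, so fall back to an empty string
    console.log(event.node.type, event.node.literal || '');
  }
}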

Source maps

See commonmark/commonmark-spec#57, especially @zdne's comment.

Instead of starting and ending line and column for each element, we need to associate each element with a possibly non-contiguous range of positions in the source.

This is because CommonMark inline elements can be broken by indicators of block structure:

> *emphasized
> text*

Here the second > should not be considered part of the emphasized text, even though it occurs after the start of the emphasized text and before the end.

isContainer, attribute or method?

Hi,

in the README, isContainer is described as an attribute of a node. But in the code, it's a method.

Updating the README should be enough, but maybe it makes more sense to modify the code to make it a real attribute. What do you think?

Blockquote termination edge-case

The spec says, “An indented code block cannot interrupt a paragraph.”

So this looks right:

$ echo -e '> 111\n    222' | ./commonmark-0.20 
<blockquote>
<p>111
222</p>
</blockquote>

But a list can:

$ echo -e '> 111\n - 222' | ./commonmark-0.20 
<blockquote>
<p>111</p>
</blockquote>
<ul>
<li>222</li>
</ul>

The trouble comes when the thing after the blockquote looks like this:

> 111
    - 222
$ echo -e '> 111\n    - 222' | ./commonmark-0.20 
<blockquote>
<p>111</p>
</blockquote>
<pre><code>- 222
</code></pre>

It terminates a blockquote saying “hey, I'm a list”, but is parsed as a code block afterwards.

Imho, it should be parsed like a paragraph continuation after all:

<blockquote>
<p>111
- 222</p>
</blockquote>

"&amp;" entity doesn't get converted to "&" character

Update: I was wrong about the issue, see comment below. Sorry.

Hello.

Specification says:

With the goal of making this standard as HTML-agnostic as possible, all valid HTML entities (except in code blocks and code spans) are recognized as such and converted into unicode characters before they are stored in the AST.

&amp; is a valid HTML entity, and when it's stored in the AST it should be converted into '&', but it's not.

It can be easily demonstrated with http://try.commonmark.org. If you input this code:

&amp;
&mu;

You will get following output in AST tab:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-2:4">
  <paragraph sourcepos="1:1-2:4">
    <text>&amp;</text>
    <softbreak />
    <text>μ</text>
  </paragraph>
</document>

The output that I expect is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CommonMark SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-2:4">
  <paragraph sourcepos="1:1-2:4">
    <text>&</text>
    <softbreak />
    <text>μ</text>
  </paragraph>
</document>

Thanks

Separate parser library

I've been using the commonmark.js parser combined with React for rendering.

Since I am not using the HTML or XML renderer, it would be great to have a separate JS file built which only contains the parser, node, and walker objects.

Definitions of `can_open` and `can_close`

This is not really an issue -- more like an extended comment, which could be used to simplify the code and/or the definitions if you think it's better. (Posting it here though it could be a comment on the spec or probably the cmark code.)

On first reading, I found the definitions of left/right flanking and their use in can-open/close very confusing. My guess is that initially the flanking concept simplified things, but IMO it's not as helpful now that more conditions are piled on. With the recent additional change I resorted to drawing Karnaugh maps, and the result is much shorter:

var sp_after  = reWhitespaceChar.test(char_after);
var sp_before = reWhitespaceChar.test(char_before);
var pn_after  = rePunctuation.test(char_after);
var pn_before = rePunctuation.test(char_before);
can_open  = !sp_after  && (pn_before || sp_before || (cc===C_ASTERISK && !pn_after ));
can_close = !sp_before && (pn_after  || sp_after  || (cc===C_ASTERISK && !pn_before));

The length is of course not too relevant, and I'm guessing that the speed would be practically the same. What seems important to me, and this might be just me, is that it's much easier to read and makes more sense as a definition (had it been in the text).

0.22.0 minified version does not parse inline links correctly

With example input:

# H1

Lorem ipsum.

## H2

[link][foo]

[foo]: http://foo.com

### H3

1. Item 1
2. Item 2

* Bullet 1
* Bullet 2

    ~~~
    blockquote here
    ~~~

* An example [link](http://example.com 'link title').

With the unminified dist/commonmark.js, the (correct) parse tree is:

Document
.Header
..Text# H1
.Paragraph
..Text# Lorem i...
.Header
..Text# H2
.Paragraph
..Link
...Text# link
.Header
..Text# H3
.List
..Item
...Paragraph
....Text# Item 1
..Item
...Paragraph
....Text# Item 2
.List
..Item
...Paragraph
....Text# Bullet 1
..Item
...Paragraph
....Text# Bullet 2
...CodeBlock# blockqu...
..Item
...Paragraph
....Text# An exam...
....Link
.....Text# link
....Text# .

However, using dist/commonmark.min.js, the parse tree is:

Document
.Header
..Text# H1
.Paragraph
..Text# Lorem i...
.Header
..Text# H2
.Paragraph
..Link
...Text# link
.Header
..Text# H3
.List
..Item
...Paragraph
....Text# Item 1
..Item
...Paragraph
....Text# Item 2
.List
..Item
...Paragraph
....Text# Bullet 1
..Item
...Paragraph
....Text# Bullet 2
...CodeBlock# blockqu...
..Item
...Paragraph
....Text# An exam...
....Text# [
....Text# link
....Text# ]
....Text# (http:/...
....Text# '
....Text# link title
....Text# '
....Text# ).

Any ideas? I am just reporting this observation; I have not tried to look into why the minified version displays this behavior. I am seeing the same thing with master 8fefa4954a76bd1b78fe7144c4aef7d4eb499cc3 as well.

Regards,
Paul

HTTPS download link

The readme has a link to the compiled source here: http://spec.commonmark.org/js/commonmark.js

I think we should try to encourage developers to get their code over HTTPS, to prevent problems like the recent Xcode attack.

Simply changing the above link to https didn't work (cert domain error). It seems to be hosted on GitHub but I couldn't figure out the correct invocation through GitHub pages or whatever. This URL does work though: https://raw.githubusercontent.com/jgm/CommonMark-site/gh-pages/js/commonmark.js

Thanks.

Edit: this would also apply to the whole CommonMark site, but one step at a time...

roadmap

I cannot find any roadmap for CommonMark. Can anybody point me to one?

class injection in code block renderer

Did you know that CommonMark supports buttons? Apparently it does:

The trouble is: the renderer appends the characters blindly to the language- part. If there are spaces there, we'll end up with <code class="language-foo bar">, which is two separate classes.

Well, technically the space character (0x20) is filtered. But HTML5 allows 5 space characters, according to section 2.4 of the spec. I'll quote:

The space characters, for the purposes of this specification, are U+0020 SPACE, "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), and "CR" (U+000D).

So, this code block will have three classes: language-foo, bar and baz:

```foo&#x09;bar&#x0C;baz
code
```

The second and third are essentially user-supplied, which is usually a bad thing.
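
A defensive sketch (my suggestion, not the current renderer code; the helper name is hypothetical): split the info string on all five HTML5 space characters and keep only the first word when building the class attribute.

// Treat U+0020, tab, LF, FF and CR as separators before building the class.
function languageClass(info) {
  var firstWord = info.trim().split(/[ \t\n\f\r]+/)[0];
  return firstWord ? 'language-' + firstWord : '';
}

// languageClass('foo\tbar\fbaz') === 'language-foo'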

Update mdurl dependency

Unfortunately the Google Closure Compiler is unable to parse the commonmark library, as there is a variable called "char", which is a reserved keyword in JavaScript.

This has been fixed upstream in mdurl.

See: markdown-it/mdurl#1 (comment)

Italics inside Bold text can parse as double em instead of strong

When trying to parse:
1. **one t*w*o three**
you get back:

<ol>
<li><em><em>one t</em>w</em>o three**</li>
</ol>

You would expect:

<ol>
<li><strong>one t<em>w</em>o three</strong></li>
</ol>

It's taking the first two asterisks as opening EMs and the other two around the "w" as closing ones.

Live example: http://spec.commonmark.org/dingus/?text=1.%20**one%20t*w*o%20three**

Note: it has the same effect if you put the string in without the list: **one t*w*o three**
Live example: http://spec.commonmark.org/dingus/?text=**one%20t*w*o%20three**

Odd list behavior

10000. ok
    1. ok

should give a list with two items, but it does not.
Similarly

   10. hi
    11. there

The trigger is a four-space indent. See this topic.

Diagnosis: In lib/blocks.js, around lines 384-399, the parser assumes that if a line is indented 4 or more spaces but it's not a code block (because it would be interrupting a paragraph), then it's a lazy paragraph continuation. That assumption is wrong, because it might be a list item.

Test #8 from smart_punct.txt

It looks like test #8 from smart_punct.txt fails, at least in the dingus.

"A paragraph with no closing quote.

"Second paragraph by same speaker, in fiction."

The dingus shows the first paragraph with a closing quote, while there should be an opening quote.

Softbreak and Hardbreak

According to the README and the code there are two types of breaks: Softbreak and Hardbreak. A Hardbreak is preceded by a double space, but I cannot understand how a Softbreak is detected. Can you help me?

## Expected
Input: "YOLO  \nmd ftw\n"
AST:
  Document: null
  Paragraph: null
  Text: YOLO
  Hardbreak: null // it’s fine, it’s how it is supposed to be
  Text: md ftw
  Paragraph: null
  Document: null

## Not Expected
Input: "YOLO\n\n\nmd ftw\n"
AST:
  Document: null
  Paragraph: null
  Text: YOLO
  // where is Softbreak here?
  Paragraph: null
  Paragraph: null
  Text: md ftw
  Paragraph: null
  Document: null
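
For comparison, a softbreak shows up when a single newline falls inside one paragraph; a blank line starts a new paragraph, so no break node appears at all. A small sketch (my addition; node type names are lowercase in recent commonmark.js releases and capitalized, e.g. Softbreak, in older ones):

var commonmark = require('commonmark');

var walker = new commonmark.Parser().parse('YOLO\nmd ftw\n').walker();
var event;
while ((event = walker.next())) {
  if (event.entering) {
    console.log(event.node.type, event.node.literal || '');
  }
}
// Prints (roughly): document, paragraph, text YOLO, softbreak, text md ftw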

Allow Node classes to be set

I have to transform some custom syntax like this to HTML:

# Demo

T> This is some tip.

W> This is some warning.

I realized I could get 90% there by transforming those into blockquotes. That's not enough, though. I would need to attach that metadata (tip, warning) there to style these appropriately.

Here's what I came up with:

'use strict';
var markdown = require('commonmark');

var mdReader = new markdown.Parser();
var mdWriter = new markdown.HtmlRenderer();

main();

function main() {
    var content = '# demo\nT> some tip\n\nW> some warning\n'
    var content2 = '> some';
    var parsed = mdReader.parse(content);

    parsed = transform(parsed);

    var result = mdWriter.render(parsed);

    console.log(result);
}

function transform(parsed) {
    var walker = parsed.walker();
    var event, node;

    while ((event = walker.next())) {
      node = event.node;
      if (event.entering && node.type === 'Text') {
        if(node.literal.indexOf('T>') === 0) {
            node._parent._classes = ['tip']; // XXX: not possible yet
            node._parent._type = 'BlockQuote';
            node.literal = node.literal.slice(2).trim();
        }
        if(node.literal.indexOf('W>') === 0) {
            // ... same thing for warning
        }
      }
    }

    return parsed;
}

Do you think it would be alright to add support for something like ._classes? This would make it so much easier to do custom stuff like this. No doubt there are some other applications.

"Smart" replacement of hyphens with em/en dash seems strange

The current tests in smart_punct.txt for en/em dashes don't define behavior for certain longer combinations, and the current code ends up leaving hanging hyphens when they could easily be replaced with only em/en dashes.

For example, a series of 10 hyphens results in 3 em dashes followed by a hyphen. In my opinion, it would make more sense for this to result in 5 en dashes. Additionally, 7 hyphens are converted into 2 em dashes and a hyphen, but I believe it should be 1 em dash and 2 en dashes. That is:

Current: ---------- => --- --- --- -  => ———-
 Better: ---------- => -- -- -- -- -- => –––––

Current: ------- => --- --- - => ——-
 Better: ------- => --- -- -- => —––

To achieve this behavior, each group of hyphens would be collected and counted at once, assuming it is 2 hyphens or more (e.g., /^(?<!-)(-{2,})/), and then the optimal grouping would be figured out (shown in PHP for thephpleague/commonmark, but it should be easy to convert to JavaScript/C):

$count = strlen($matched);
$en_dash = '–';
$en_count = 0;
$em_dash = '—';
$em_count = 0;
if ($count % 3 === 0) { // If divisible by 3, use all em dashes
    $em_count = $count / 3;
} elseif ($count % 2 === 0) { // If divisible by 2, use all en dashes
    $en_count = $count / 2;
} elseif (($count - 2) % 3 === 0) { // If 2 extra dashes, use en dash for last 2; em dashes for rest
    $em_count = floor(($count - 2) / 3);
    $en_count = 1;
} else { // Use en dashes for last 4 hyphens; em dashes for rest
    $em_count = floor(($count - 4) / 3);
    $en_count = 2;
}
$inlineContext->getInlines()->add(new Text(
    str_repeat($em_dash, $em_count).
    str_repeat($en_dash, $en_count)
));
return true;

Is this something that CommonMark would be interested in implementing? (I can do, I just don't want to spend the time writing the code if it won't be accepted.) Or should the smart_punct.txt file be updated with tests that check for these edge cases?
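
For illustration, here is a rough JavaScript port of the grouping logic in the PHP snippet above (the function name is hypothetical; this is a sketch of the proposal, not existing commonmark.js code):

// Map a run of `count` hyphens (count >= 2) to em/en dashes per the rules above.
function dashesToText(count) {
  var emDash = '\u2014'; // —
  var enDash = '\u2013'; // –
  var emCount = 0;
  var enCount = 0;
  if (count % 3 === 0) {            // divisible by 3: all em dashes
    emCount = count / 3;
  } else if (count % 2 === 0) {     // divisible by 2: all en dashes
    enCount = count / 2;
  } else if (count % 3 === 2) {     // 2 left over: finish with one en dash
    emCount = (count - 2) / 3;
    enCount = 1;
  } else {                          // 4 left over: finish with two en dashes
    emCount = (count - 4) / 3;
    enCount = 2;
  }
  return emDash.repeat(emCount) + enDash.repeat(enCount);
}

// dashesToText(10) => '–––––' (5 en dashes); dashesToText(7) => '—––' (1 em, 2 en)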

ETA until stability?

I'm considering switching dox over to using commonmark, but I got bit by the massive API and AST changes between 0.12 and 0.17 and am hesitant to make it a dependency while things are changing so much.

Do you have an idea of when the package might be stable for a 1.0 release?

semver

Does commonmark follow semver? No features have been added in the last two minor versions.

sourcepos on links

Is there a reason why we don't get data-sourcepos on links?

test [test](https://google.com)

Currently we get:
<p data-sourcepos="1:1-1:31">test <a href="https://google.com">test</a></p>
It would be nice to get:
<p data-sourcepos="1:1-1:31">test <a data-sourcepos="1:6-1:31" href="https://google.com">test</a></p>

CommonMark renderer

The C library (cmark) has a CommonMark renderer. This could be ported over to commonmark.js (the code would be fairly similar), but I haven't done it yet.

Bad perf case: Lots of delimiters that can close but no openers

  1. Paste the text from this gist into dingus.
  2. It takes about 1.3 seconds to parse
  3. Paste the text a second time (append, so that the input is doubled in size)
  4. It takes 6 to 13 seconds to parse or even longer

The test input is a_ (a, underscore, space) repeated 20000 times. The problem seems to be that in processEmphasis, for each potential closer, the opener is searched all the way back to the stack bottom.

Maybe it could be improved by removing a closer, after failing to find a corresponding opener, iff the closer cannot also be an opener, so as not to have to check it again. Not sure if this is correct in all cases, though, or if there are other worst-case inputs (need to think about it more).

No way to set list attributes on a commonmark.Node

When using the abstract syntax tree directly there is no way to set the list attributes (listType, listDelimiter, listStart and listTight) without initializing _listData yourself. As it starts with an underscore, I think it is meant to be a private property and should not be interacted with directly.

For example:

var node = new commonmark.Node('List');
node.listType = 'bullet'; // TypeError: Cannot set property 'delimiter' of null
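
A workaround sketch, building on the issue's own observation that _listData has to be initialized first (this is not a documented API, so it may break between releases):

var commonmark = require('commonmark');

// 'List' matches the release this issue refers to; recent releases use lowercase 'list'.
var node = new commonmark.Node('List');
node._listData = {};      // private field; initializing it makes the setters usable
node.listType = 'bullet';
node.listTight = true;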

unnecessary \n in empty blockquotes

This is not a bug, just a suggestion:

now:

<blockquote>
</blockquote>


could be:

<blockquote></blockquote>

Currently it's the only thing we have to normalize in the markdown-it tests (because I decided not to add special cases to the renderer).

If you don't want to make this change, just close this ticket.

tab-related regressions

I didn't quite understand the 0.21 spec changes related to tabs. The general idea seems good, but the devil is in the details, as they say. So I checked this implementation; unfortunately, its behavior appears to be buggy.

A tab immediately after a list item marker used to be allowed; now it's not. Is that intentional?

$ echo -e ' -\tlist' | ./commonmark-0.20
<ul>
<li>list</li>
</ul>

$ echo -e ' -\tlist' | ./commonmark-0.21 
<p>-    list</p>

If code block indentation is using half a tab, what happens?

$ echo -e ' - foo\n\n\t\tbar' | ./commonmark-0.20
<ul>
<li>
<p>foo</p>
<pre><code> bar
</code></pre>
</li>
</ul>

$ echo -e ' - foo\n\n\t\tbar' | ./commonmark-0.21
<ul>
<li>
<p>foo</p>
<p>ar</p>
</li>
</ul>

A variation of the bug above. It might deserve special mention because it's unclear whether - \t\tcode should be a code block or not (this might actually be a bug in 0.20):

$ echo -e ' - \t\tcode' | ./commonmark-0.20
<ul>
<li>code</li>
</ul>

$ echo -e ' - \t\tcode' | ./commonmark-0.21 
<ul>
<li>
<pre><code>de
</code></pre>
</li>
</ul>

Dingus doesn't display list markers if first block in item is Code Block

If you try the following bit of markdown in the dingus:

-     new list with indented code block

Show in dingus

the bullet will not be displayed.

This seems to be due to a clash with the Bootstrap CSS definition for pre. Bootstrap is doing a

pre {
  overflow: auto;
}

which is causing the list markers to be invisible. If you add overflow: visible to the definition for pre in dingus.css, they become visible again.

pre {
  display: block;
  padding: 0.5em;
  color: #333;
  background: #f8f8ff;
  overflow: visible;
}

Small typo in README.md

Hi. README.md says:

prependChild(child): Prepend a Node child to the end of the Node's children.

Should that say "to the beginning of the Node's children"?

sourcepos not correct

The sourcepos attribute in the AST is not correct when the parser is run more than once:

     var reader = new commonmark.Parser();
     var writer = new commonmark.HtmlRenderer({sourcepos:true});
     console.log(writer.render(reader.parse("Hello *world*")));
     console.log(writer.render(reader.parse("Hello *world*")));

Now the result is

    <p data-sourcepos="1:1-1:13">Hello <em>world</em></p>
    <p data-sourcepos="2:1-1:13">Hello <em>world</em></p>

And I wonder: why should sourcepos be 1-based instead of 0-based?
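
A workaround sketch (assumption: the stale offset comes from reusing the same Parser instance, so constructing a fresh Parser per document sidesteps it):

var commonmark = require('commonmark');
var writer = new commonmark.HtmlRenderer({sourcepos: true});

console.log(writer.render(new commonmark.Parser().parse("Hello *world*")));
console.log(writer.render(new commonmark.Parser().parse("Hello *world*")));
// With a fresh Parser each time, both lines report data-sourcepos="1:1-1:13".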
