hellux / jotdown

A Djot parser library

License: MIT License

Makefile 1.54% Rust 97.33% Awk 0.44% Shell 0.69%

jotdown's People

matklad marcusadair kickhead13 sondr3 black-desk clbarnes lucianolorenti

jotdown's Issues

Benchmarking against other implementations

It would be useful to be able to compare performance against other Djot implementations.

bug: consecutive attributes are ignored

e.g.

word{.a}{#b}

turns into

Start(Paragraph, {})
Start(Span, {class="a"})
Str("word")
End(Span)
End(Paragraph)

instead of

Start(Paragraph, {})
Start(Span, {class="a", id="b"})
Str("word")
End(Span)
End(Paragraph)
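A minimal sketch of how consecutive attribute sets could be folded into one. `Attrs`, `insert` and `union` are hypothetical names used for illustration and are not jotdown's actual types. The rule that classes accumulate while other keys are overwritten by the later value is an assumption about the intended Djot behaviour.

```rust
/// Hypothetical stand-in for an attribute set; jotdown's real type differs.
#[derive(Debug, Default, PartialEq)]
pub struct Attrs(pub Vec<(String, String)>);

impl Attrs {
    /// Insert one key/value pair. Classes accumulate (space separated),
    /// any other key is overwritten by the later value (assumed rule).
    pub fn insert(&mut self, key: &str, value: &str) {
        match self.0.iter_mut().find(|(k, _)| k.as_str() == key) {
            Some((_, v)) if key == "class" => {
                v.push(' ');
                v.push_str(value);
            }
            Some((_, v)) => *v = value.to_string(),
            None => self.0.push((key.to_string(), value.to_string())),
        }
    }

    /// Fold a directly following attribute set into this one.
    pub fn union(&mut self, other: Attrs) {
        for (k, v) in other.0 {
            self.insert(&k, &v);
        }
    }
}
```

With this, parsing `{.a}` and then `{#b}` and calling `union` on the first set would leave a single set holding both `class="a"` and `id="b"`.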

Render from borrowed events?

Hi! The jotdown format and this crate seem awesome, thanks!

I'd like to be able to create multiple html outputs from the same jotdown file, by finding different parts of the event stream and rendering them separately. Something like this:

// Keep the source alive for as long as the events that borrow from it.
let src = read_to_string(path)?;
let jd = jotdown::Parser::new(&src).collect::<Vec<_>>();

let part_a: &[&Event] = some_filter(&jd);
let mut html_a = String::new();
jotdown::html::push(part_a.iter(), &mut html_a);

let part_b: &[&Event] = other_filter(&jd);
let mut html_b = String::new();
jotdown::html::push(part_b.iter(), &mut html_b);

This fails because iterating a slice can only give an iterator of borrowed events, not owned events.

It could be fixed by using part_a.iter().cloned(), but Event currently doesn't implement Clone. I don't really see a reason why formatting HTML should require ownership of the events, so I think it should be relatively easy to add formatting of borrowed events.

If there is agreement that this would be useful, I'm open to attempting a PR that implements it.
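One possible shape for this, sketched under assumptions: the `Event` enum below is a heavily reduced stand-in for `jotdown::Event`, and this `push` is a hypothetical signature, not the crate's actual one. Bounding the item type on `Borrow<Event>` lets the same function accept both owned and borrowed events.

```rust
use std::borrow::Borrow;
use std::fmt::Write;

/// Hypothetical, heavily reduced stand-in for `jotdown::Event`.
#[derive(Debug)]
pub enum Event<'s> {
    StartParagraph,
    EndParagraph,
    Str(&'s str),
}

/// Render from any iterator whose items can be borrowed as an `Event`,
/// so both `Event` and `&Event` items are accepted.
pub fn push<'s, I, E, W>(events: I, mut out: W) -> std::fmt::Result
where
    I: Iterator<Item = E>,
    E: Borrow<Event<'s>>,
    W: Write,
{
    for e in events {
        match e.borrow() {
            Event::StartParagraph => out.write_str("<p>")?,
            Event::EndParagraph => out.write_str("</p>\n")?,
            Event::Str(s) => out.write_str(s)?,
        }
    }
    Ok(())
}
```

A collected `Vec<Event>` can then be rendered several times through `events.iter()` without cloning, and once more by value through `events.into_iter()`.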

`Render` trait

Djot is not meant to be HTML-centric, and jotdown should not be either. Let's let library consumers express anything that can render jotdown events via a Render trait (or similarly named). Like the HTML implementation, we would have to distinguish between a Unicode-accepting writer and a byte sink. The push and write names seem suitable:

pub trait Render {
    fn push<'s, I: Iterator<Item = Event<'s>>, W: fmt::Write>(events: I, out: W);
    fn write<'s, I: Iterator<Item = Event<'s>>, W: io::Write>(events: I, out: W) -> io::Result<()>;
}

The advantage of this approach over a module with standalone functions is well-defined extensibility. It also lets libraries accept any rendering format more succinctly, and it makes it clearer how to implement a custom renderer.

I'm debating whether we require a reference to self in the trait methods or not... One might want to allow extensions or rendering customisation, so might be better to include a reference.
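To make the `&self` question concrete, here is a sketch of the variant that takes a reference. Everything in it is an assumption for illustration: `Event` is a reduced stand-in, `Html` and its `paragraph_tag` option are hypothetical, and returning `fmt::Result` from `push` is a choice made here rather than part of the proposal above.

```rust
use std::fmt::{self, Write};

/// Hypothetical, reduced stand-in for `jotdown::Event`.
pub enum Event<'s> {
    Start,
    End,
    Str(&'s str),
}

/// Variant of the proposed trait that takes `&self`,
/// so that a renderer can carry its own options.
pub trait Render {
    fn push<'s, I, W>(&self, events: I, out: W) -> fmt::Result
    where
        I: Iterator<Item = Event<'s>>,
        W: Write;
}

/// Example renderer with one customisation point (hypothetical option).
pub struct Html {
    pub paragraph_tag: &'static str,
}

impl Render for Html {
    fn push<'s, I, W>(&self, events: I, mut out: W) -> fmt::Result
    where
        I: Iterator<Item = Event<'s>>,
        W: Write,
    {
        for e in events {
            match e {
                Event::Start => write!(out, "<{}>", self.paragraph_tag)?,
                Event::End => write!(out, "</{}>", self.paragraph_tag)?,
                Event::Str(s) => out.write_str(s)?,
            }
        }
        Ok(())
    }
}
```

A renderer without options would simply be a unit struct, so taking `&self` costs little for the simple case.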

Add an AST

It is often useful to work with an AST rather than a sequence of events. We could implement an optional module that provides AST objects that correspond to the AST defined by the djot spec (https://github.com/jgm/djot.js/blob/main/src/ast.ts).

It would be useful to be able to create it from events, and create events from the AST so you can e.g. parse events -> create ast -> modify ast -> create events -> render events.

It could also be useful to read/write the AST from/to e.g. json. We may then be able to read/write ASTs identically to the reference implementation. It might also be useful in tests to match against JSON produced by the reference implementation. We should be able to automatically implement the serialization/deserialization using serde, and then the downstream client can use any serde-compatible format.

A quick sketch of what it could look like:

#[cfg(feature = "ast")]
pub mod ast {
    use super::Event;

    use std::collections::HashMap as Map;

    #[cfg(feature = "serde")]
    use serde::{Deserialize, Serialize};

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Doc {
        children: Vec<Block>,
        references: Map<String, Reference>,
        footnotes: Map<String, Footnote>,
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Reference {
        // todo
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Footnote {
        // todo
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Block {
        kind: BlockKind,
        children: Vec<Block>,
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub enum BlockKind {
        Para,
        Heading { level: usize },
        // todo
    }

    pub struct Iter<'a> {
        // todo
        _s: std::marker::PhantomData<&'a ()>,
    }

    impl<'a> Iterator for Iter<'a> {
        type Item = Event<'a>;

        fn next(&mut self) -> Option<Self::Item> {
            todo!()
        }
    }

    #[derive(Debug)]
    pub enum Error {
        EventNotEnded,
        UnexpectedStart,
        BlockInsideLeaf,
    }

    impl<'s> FromIterator<Event<'s>> for Result<Doc, Error> {
        fn from_iter<I: IntoIterator<Item = Event<'s>>>(events: I) -> Self {
            todo!()
        }
    }

    impl<'a> IntoIterator for &'a Doc {
        type Item = Event<'a>;
        type IntoIter = Iter<'a>;

        fn into_iter(self) -> Self::IntoIter {
            todo!()
        }
    }
}

Client side:

let src = "# heading

para";

let events = jotdown::Parser::new(src);
let ast = events.collect::<Result<jotdown::ast::Doc, _>>().unwrap();
let json = serde_json::to_string(&ast).unwrap();

assert_eq!(
    json,
    r##"
    {
      "tag": "doc",
      "references": {},
      "footnotes": {},
      "children": [
        {
          "tag": "para",
          "children": [
            {
              "tag": "str",
              "text": "para"
            }
          ]
        }
      ]
    }
    "##
);

Attributes are parsed inside math

When writing something like \sum_{j=0} in a math environment, the {j=0} part is interpreted by the parser as an attribute. This leads to it being removed from the rendered TeX math. Everything inside a math environment should be parsed verbatim.

Comments don't work with headings

I think that this code

{%
# title

paragraph
%}

should produce nothing.

Instead, it produces:

<p>{%
# title</p>
<p>paragraph
%}</p>

So is it not possible to comment out headings and paragraphs in one go, without using multiple inline comments?

CowStr

As mentioned in #19 (comment), making CowStr an actual type instead of an alias would allow us to reuse many common patterns and helper functions. We could also then inline short strings, similar to pulldown_cmark.
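A rough sketch of what such a type could look like, loosely modelled on the three-variant approach (borrowed, owned, inlined). The variant names, the `copied` constructor and the `MAX_INLINE` value are assumptions for illustration, not jotdown's or pulldown_cmark's actual definitions.

```rust
use std::ops::Deref;

/// Longest string stored inline; the value is an arbitrary assumption.
const MAX_INLINE: usize = 22;

/// Sketch of a dedicated string type instead of an alias for `Cow<str>`.
#[derive(Debug, Clone)]
pub enum CowStr<'s> {
    Borrowed(&'s str),
    Owned(Box<str>),
    Inlined { buf: [u8; MAX_INLINE], len: u8 },
}

impl<'s> CowStr<'s> {
    /// Copy `s`, storing it inline (no heap allocation) when it is short enough.
    pub fn copied(s: &str) -> CowStr<'static> {
        if s.len() <= MAX_INLINE {
            let mut buf = [0u8; MAX_INLINE];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            CowStr::Inlined { buf, len: s.len() as u8 }
        } else {
            CowStr::Owned(s.into())
        }
    }
}

impl Deref for CowStr<'_> {
    type Target = str;

    fn deref(&self) -> &str {
        match self {
            CowStr::Borrowed(s) => s,
            CowStr::Owned(s) => s,
            // The buffer only ever holds bytes copied from a valid `&str`.
            CowStr::Inlined { buf, len } => std::str::from_utf8(&buf[..*len as usize])
                .expect("inline buffer holds valid UTF-8"),
        }
    }
}
```

Because it is a real type, helper methods and trait implementations can be attached to it directly, which an alias does not allow.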

Emit Footnotes as they are encountered

All footnote events are currently emitted at the end of the document, in the order they are referenced. The reason for this is that the block elements are already stored in the block tree, so we can just skip it and parse and render it at the end. The alternative option is for the renderer to store the parsed result and render that at the end.

The current approach is problematic if the original order and location are needed. E.g. a pretty-printer would now be forced to place all footnotes at the end.

The current approach is also impossible for a single-pass parser, which we are trying to move to.

This is related to #14; both aim for a closer 1:1 correspondence between input and events.

Avoid unsafe in span::Discontinuous::chars

The borrowed spans for inlines are currently quite unsound:

fn chars(&self) -> Self::Chars {
    unsafe { std::mem::transmute(InlineChars::new(self.src, self.spans.iter().copied())) }
}

We should try to get rid of the unsafe block here. The problem is that the Parser holds a vector of spans and we need to borrow a slice of it from within the Parser, so we get a kind of self-reference.

This doesn't really cause any problems right now, I think, but if we want to e.g. allow cloning the Parser it will be problematic.

Source positions

Currently, source positions are not available for events. However, they are very useful and we should be able to implement this.

pulldown-cmark has an into_offset_iter method on the Parser that gives an iterator of tuples, each containing an event and its start/end position in the source. We could do something similar.
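A toy sketch of the tuple-yielding iterator idea. `Words` is a hypothetical stand-in "parser" that yields whitespace-separated words, used only to show the shape of the API; the method name `into_offset_iter` is borrowed from pulldown-cmark and is an assumption as far as jotdown is concerned.

```rust
use std::ops::Range;

/// Hypothetical toy "parser" that yields each whitespace-separated word.
pub struct Words<'s> {
    src: &'s str,
    pos: usize,
}

impl<'s> Words<'s> {
    pub fn new(src: &'s str) -> Self {
        Self { src, pos: 0 }
    }

    /// Consume the parser and yield each item with its byte range in the source.
    pub fn into_offset_iter(self) -> OffsetIter<'s> {
        OffsetIter { inner: self }
    }

    /// Advance past the next word and return its byte range.
    fn next_span(&mut self) -> Option<Range<usize>> {
        let rest = &self.src[self.pos..];
        let start = self.pos + rest.find(|c: char| !c.is_whitespace())?;
        let len = self.src[start..]
            .find(char::is_whitespace)
            .unwrap_or(self.src.len() - start);
        self.pos = start + len;
        Some(start..self.pos)
    }
}

impl<'s> Iterator for Words<'s> {
    type Item = &'s str;

    fn next(&mut self) -> Option<&'s str> {
        let span = self.next_span()?;
        Some(&self.src[span])
    }
}

/// Wrapper that pairs every item with its source range.
pub struct OffsetIter<'s> {
    inner: Words<'s>,
}

impl<'s> Iterator for OffsetIter<'s> {
    type Item = (&'s str, Range<usize>);

    fn next(&mut self) -> Option<Self::Item> {
        let span = self.inner.next_span()?;
        Some((&self.inner.src[span.clone()], span))
    }
}
```

Keeping the plain iterator and the offset iterator as separate types means consumers that do not need positions pay nothing for them.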

bug: formatting is kept in link reference text tag

E.g.

[some _text_][]

[some text]: without_formatting
[some _text_]: with_formatting

yields

<p><a href="with_formatting">some <em>text</em></a></p>

instead of

<p><a href="without_formatting">some <em>text</em></a></p>

Setting up Criterion benchmarks

We should set up benchmarks in order to evaluate performance of different implementation choices in jotdown. Criterion seems like a good tool for this. To begin with, we can use the benchmark files from jgm/djot.js and generate benchmarks, similar to how the unit tests are generated right now.

Expose tag for unresolved links in Event

See previous discussion: #14 (comment)

bug: ignore closing fence inside code blocks

We currently have the bug described in jgm/djot#109.

We can track open/close of code blocks in block::Kind::continues.

There is a test for this, which can be run with:

$ make test_html_ut
$ cargo test -p test-html-ut -- --ignored fenced_divs

Multi-line block attributes

The reference implementation supports multi-line block attributes:

{#id .class
  style="color:red"}
A paragraph

<p id="id" class="class" style="color:red">A paragraph</p>

while jotdown instead treats it as an inline attr attached to nothing:

<p>
A paragraph</p>

Ambiguity between alphabetical and Roman numeral lists

Hi, great work with this library!

I've hit this bug in the list parsing: alphabetical lists are parsed correctly until C and D where they become Roman numeral lists. Those characters are ambiguous, so a little context is needed to know what kind of list item they are.

Example input:

A) first
B) second
C) third
D) fourth
E) fifth

Output:

<ol type="A">
<li>
first
</li>
<li>
second
</li>
</ol>
<ol start="100" type="I">
<li>
third
</li>
<li>
fourth
</li>
</ol>
<ol start="5" type="A">
<li>
fifth
</li>
</ol>

Rendered in a browser, this appears as:

A) first
B) second
C) third
CI) fourth
E) fifth
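One way to use context for the ambiguous markers, as a sketch only. `Numbering` and `classify` are hypothetical names, only upper-case markers are handled, and the tie-breaking rule for a list start (Roman only for `I`) is an assumption rather than something stated in this issue.

```rust
/// Numbering styles relevant to the ambiguity (upper case only, for brevity).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Numbering {
    AlphaUpper,
    RomanUpper,
}

fn is_roman(marker: &str) -> bool {
    !marker.is_empty() && marker.chars().all(|c| "IVXLCDM".contains(c))
}

fn is_alpha(marker: &str) -> bool {
    marker.len() == 1 && marker.chars().all(|c| c.is_ascii_uppercase())
}

/// Classify a list marker, preferring the numbering of the list that is
/// already open. Without context, an ambiguous single letter is read as
/// Roman only for `I` (assumed rule).
pub fn classify(marker: &str, open_list: Option<Numbering>) -> Option<Numbering> {
    match (is_alpha(marker), is_roman(marker), open_list) {
        (true, true, Some(prev)) => Some(prev),
        (true, true, None) => Some(if marker == "I" {
            Numbering::RomanUpper
        } else {
            Numbering::AlphaUpper
        }),
        (true, false, _) => Some(Numbering::AlphaUpper),
        (false, true, _) => Some(Numbering::RomanUpper),
        (false, false, _) => None,
    }
}
```

Feeding the markers `A` through `E` in order, each time passing the previous result as `open_list`, keeps `C` and `D` in the alphabetical list instead of starting a Roman one.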

Emit events for link definitions

Currently, link definitions are only used to resolve links and are then ignored. However, the position of link definitions may be useful to a consumer, e.g. for a pretty-printer. If we also want to let consumers resolve links, it would become necessary to emit them.

We would need to:

- not ignore link definitions in Parser::block
- add and emit a link definition Event
- ignore link definitions for HTML output, using html::FilteredEvents

paragraph tags disappear after numbered list

1. item

para

yields

<ol>
<li>
item
</li>
</ol>
para

instead of

<ol>
<li>
item
</li>
</ol>
<p>para</p>

The events are as expected, so this seems to be a bug in the HTML renderer.

bug: disappearing attributes

a{
%%
a=a}

yields

<p><span>a</span></p>

instead of

<p><span a="a">a</span></p>

Avoid unsafe in PrePass::new

We currently have an unsafe block when performing the prepass:

// SAFETY: used_ids is dropped before the id_auto strings in headings. even if
// the strings move due to headings reallocating, the string data on the heap
// will not move
used_ids.insert(unsafe {
    std::mem::transmute::<&str, &'static str>(id_auto.as_ref())
});

I don't think it is causing any problems but it would be good to get rid of it.
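One way to avoid the transmute, sketched under assumptions: let the set own copies of the ids instead of borrowing them. `auto_ids` is a hypothetical helper working on plain heading text, and the `-1`, `-2` suffix scheme for duplicates is chosen here for illustration, not taken from jotdown.

```rust
use std::collections::HashSet;

/// Generate a unique id for each heading text, in order.
/// Hypothetical helper; the real prepass works on parsed headings.
pub fn auto_ids(headings: &[&str]) -> Vec<String> {
    // The set owns its own copies of the ids, so it never borrows from `ids`
    // and no lifetime needs to be extended with `transmute`.
    let mut used: HashSet<String> = HashSet::new();
    let mut ids = Vec::with_capacity(headings.len());
    for text in headings {
        let base = text.split_whitespace().collect::<Vec<_>>().join("-");
        let mut id = base.clone();
        let mut n = 1;
        // `insert` returns false when the id is already taken.
        while !used.insert(id.clone()) {
            id = format!("{}-{}", base, n);
            n += 1;
        }
        ids.push(id);
    }
    ids
}
```

The cost is one extra allocation per heading, which may or may not be acceptable compared to the current zero-copy approach.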

Provide C API

As far as I know, there is no C implementation of Djot yet, so I think it would be nice to have C bindings for this library.

Escapes in inline links

The reference implementation allows escaping with \ in inline URLs, e.g. from one of its unit tests:

*[closed](hello\*)

<p>*<a href="hello*">closed</a></p>

jotdown currently does not:

<p>*<a href="hello\*">closed</a></p>

This is also useful for e.g. URLs containing parentheses.
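A small sketch of the unescaping step. `unescape` is a hypothetical helper, and the rule that a backslash only escapes ASCII punctuation is an assumption about the intended behaviour. Returning a `Cow` avoids allocating for the common case of a URL without escapes.

```rust
use std::borrow::Cow;

/// Remove a backslash that precedes an ASCII punctuation character.
/// Borrows the input unchanged when there is nothing to unescape.
pub fn unescape(url: &str) -> Cow<'_, str> {
    if !url.contains('\\') {
        return Cow::Borrowed(url);
    }
    let mut out = String::with_capacity(url.len());
    let mut chars = url.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            if let Some(&next) = chars.peek() {
                if next.is_ascii_punctuation() {
                    // Emit the escaped character and skip the backslash.
                    out.push(next);
                    chars.next();
                    continue;
                }
            }
        }
        out.push(c);
    }
    Cow::Owned(out)
}
```

Applied to the URL from the test case above, `hello\*` becomes `hello*`, and a backslash before a non-punctuation character is kept as-is.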

Set up CI

Would be nice to at least build, run tests and check formatting automatically on merge requests.

impl IntoIterator for Attributes

It should be possible to do e.g.

for (k, v) in attrs {}

and

for (k, v) in &attrs {}

The current Attributes::iter function is not very idiomatic. Ideally, there should instead be implementations for &Attributes, &mut Attributes and Attributes with corresponding iterators: attr::Iter, attr::IterMut, attr::IntoIter.
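The three implementations could be laid out roughly as follows. `Attributes` here is a hypothetical stand-in backed by a `Vec<(String, String)>`; jotdown's actual storage and item types differ, so only the overall structure is meant to carry over.

```rust
/// Hypothetical stand-in; jotdown's `Attributes` stores its data differently.
#[derive(Debug, Default)]
pub struct Attributes(pub Vec<(String, String)>);

/// Borrowing iterator, corresponding to `attr::Iter`.
pub struct Iter<'a>(std::slice::Iter<'a, (String, String)>);

impl<'a> Iterator for Iter<'a> {
    type Item = (&'a str, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        self.0.next().map(|(k, v)| (k.as_str(), v.as_str()))
    }
}

/// Mutably borrowing iterator, corresponding to `attr::IterMut`.
/// Keys stay immutable, values can be edited in place.
pub struct IterMut<'a>(std::slice::IterMut<'a, (String, String)>);

impl<'a> Iterator for IterMut<'a> {
    type Item = (&'a str, &'a mut String);

    fn next(&mut self) -> Option<Self::Item> {
        self.0.next().map(|(k, v)| (k.as_str(), v))
    }
}

impl IntoIterator for Attributes {
    type Item = (String, String);
    // Plays the role of `attr::IntoIter`.
    type IntoIter = std::vec::IntoIter<(String, String)>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}

impl<'a> IntoIterator for &'a Attributes {
    type Item = (&'a str, &'a str);
    type IntoIter = Iter<'a>;

    fn into_iter(self) -> Iter<'a> {
        Iter(self.0.iter())
    }
}

impl<'a> IntoIterator for &'a mut Attributes {
    type Item = (&'a str, &'a mut String);
    type IntoIter = IterMut<'a>;

    fn into_iter(self) -> IterMut<'a> {
        IterMut(self.0.iter_mut())
    }
}
```

With these in place, `for (k, v) in attrs`, `for (k, v) in &attrs` and `for (k, v) in &mut attrs` all work, and `iter()` and `iter_mut()` methods can simply forward to them.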

Escapes in attributes

Escapes in attributes are currently unimplemented.

@matklad had a good idea that can be used to avoid an intermediate string: jgm/djot#203 (comment).

The parser must allow escapes and ignore them. And instead of storing a Cow in the Attributes object, we can create a custom Attr object that implements Display and skips over unescaped backslashes.

Note that we may still have to copy strings that are interrupted by e.g. a blockquote:

> word{a="multi-line
> value"}

However, it should be possible to filter out newline/multiple spaces if it is just across multiple lines:

- word{a="multi-line
  value"}

Newlines and multiple spaces can safely be replaced by a single space, but we can't know if a ">" is part of the value or happens to be there because it is within a blockquote.
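The Display-based idea could be sketched as below. `AttrValue` is a hypothetical name for the custom Attr object, and the sketch only covers the case where the raw value contains no block markers: a ">" inserted by a blockquote cannot be told apart from a literal one here, so that case would still need a copy.

```rust
use std::fmt::{self, Write};

/// Hypothetical attribute value that borrows the raw source text,
/// backslash escapes and line breaks included.
pub struct AttrValue<'s> {
    pub raw: &'s str,
}

impl fmt::Display for AttrValue<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut chars = self.raw.chars();
        let mut in_space = false;
        while let Some(c) = chars.next() {
            if c.is_whitespace() {
                // Runs of spaces and newlines collapse to a single space.
                if !in_space {
                    f.write_char(' ')?;
                }
                in_space = true;
                continue;
            }
            in_space = false;
            if c == '\\' {
                // Drop the backslash and emit the escaped character verbatim.
                if let Some(escaped) = chars.next() {
                    f.write_char(escaped)?;
                }
            } else {
                f.write_char(c)?;
            }
        }
        Ok(())
    }
}
```

Since the value is only processed while being written, no intermediate string is built unless the caller asks for one with `to_string()`.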

Provide a shortcut to just render to String

95% of the time when I render Markdown (or now Djot) I just want to get a String. 4% of the time I write to an io::Write. Having top-level convenience functions for this seems like a good idea, and it doesn't make switching to the more flexible versions any harder when needed.

bug: attributes disappear from raw inline

`abc`{.cls}

yields

<p><code>abc</code></p>

instead of

<p><code class="cls">abc</code></p>
