hellux / jotdown
A Djot parser library
License: MIT License
Would be useful to be able to compare performance against other implementations.
e.g.
word{.a}{#b}
turns into
Start(Paragraph, {})
Start(Span, {class="a"})
Str("word")
End(Span)
End(Paragraph)
instead of
Start(Paragraph, {})
Start(Span, {class="a", id="b"})
Str("word")
End(Span)
End(Paragraph)
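The expected behaviour, where consecutive attribute sets on one element are merged, might be sketched as follows. `Attrs`, `insert`, `merge` and `get` are hypothetical names, not jotdown's API, and the sketch assumes that `class` values accumulate while any other key is overwritten by the later set.

```rust
/// Hypothetical attribute set: an ordered list of key-value pairs.
#[derive(Debug, Default)]
pub struct Attrs(Vec<(String, String)>);

impl Attrs {
    /// Inserts one attribute. Assumption: `class` values accumulate
    /// (space separated); any other key is overwritten by the later value.
    pub fn insert(&mut self, key: &str, value: &str) {
        if let Some(pos) = self.0.iter().position(|(k, _)| k == key) {
            let existing = &mut self.0[pos].1;
            if key == "class" {
                existing.push(' ');
                existing.push_str(value);
            } else {
                *existing = value.to_string();
            }
        } else {
            self.0.push((key.to_string(), value.to_string()));
        }
    }

    /// Merges a following attribute set, e.g. `{#b}` after `{.a}`, into this one.
    pub fn merge(&mut self, other: Attrs) {
        for (k, v) in other.0 {
            self.insert(&k, &v);
        }
    }

    /// Looks up the value stored for `key`.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.0.iter().find(|(k, _)| k == key).map(|(_, v)| v.as_str())
    }
}
```

With something like this, the parser could merge each following attribute set into the one already attached to the span, rather than dropping it.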
Hi! The jotdown format and this crate seem awesome, thanks!
I'd like to be able to create multiple html outputs from the same jotdown file, by finding different parts of the event stream and rendering them separately. Something like this:
let jd = jotdown::Parser::new(&read_to_string(path)?).collect::<Vec<_>>();
let part_a: &[&Event] = some_filter(&jd);
// The output buffers must be mutable to be written to.
let mut html_a = String::new();
jotdown::html::push(part_a.iter(), &mut html_a);
let part_b: &[&Event] = other_filter(&jd);
let mut html_b = String::new();
jotdown::html::push(part_b.iter(), &mut html_b);
This fails because iterating a slice can only give an iterator of borrowed events, not owned events.
It could be fixed by using part_a.iter().cloned(), but Event currently doesn't implement Clone. I don't really see a reason that formatting HTML should require ownership of the events, though, so I think it should be relatively easy to add formatting of borrowed events.
If there is agreement that this would be useful, I'm open to attempting to write a PR implementing it.
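One way to accept both owned and borrowed events is to make the renderer generic over `Borrow<Event>`. The following is a minimal sketch under that assumption; the `Event` enum here is a simplified stand-in and this `push` is hypothetical, not jotdown's actual signature.

```rust
use std::borrow::Borrow;
use std::fmt::Write;

/// Simplified stand-in for jotdown's event type (hypothetical subset).
#[derive(Debug)]
pub enum Event<'s> {
    Start(&'s str),
    End(&'s str),
    Str(&'s str),
}

/// Renders events to `out`. `E` may be `Event` or `&Event`, so the caller
/// can keep ownership of a collected event list and render parts of it.
pub fn push<'s, I, E, W>(events: I, mut out: W) -> std::fmt::Result
where
    I: Iterator<Item = E>,
    E: Borrow<Event<'s>>,
    W: Write,
{
    for e in events {
        match e.borrow() {
            Event::Start(tag) => write!(out, "<{}>", tag)?,
            Event::End(tag) => write!(out, "</{}>", tag)?,
            Event::Str(s) => out.write_str(s)?,
        }
    }
    Ok(())
}
```

With this shape, `push(events.iter(), &mut html)` and `push(events.into_iter(), &mut html)` would both compile, and neither requires `Event: Clone`.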
Djot is not meant to be HTML-centric, and jotdown should not be either. Let's let library consumers express anything that can render jotdown events via a Render trait (or similarly named). Like the HTML implementation, we would have to distinguish between a unicode-accepting writer and a byte sink. The push and write names seem suitable:
pub trait Render {
fn push<'s, I: Iterator<Item = Event<'s>>, W: fmt::Write>(events: I, out: W);
fn write<'s, I: Iterator<Item = Event<'s>>, W: io::Write>(events: I, out: W) -> io::Result<()>;
}
The advantage of this approach over a module with standalone functions is obvious: Well-defined extensibility. This also lets libraries accept any rendering format in a more succinct way and additionally makes it clearer how one would implement their own renderer.
I'm debating whether we require a reference to self in the trait methods or not... One might want to allow extensions or rendering customisation, so might be better to include a reference.
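To illustrate the `&self` variant, here is a sketch of the trait with a receiver plus a toy implementation that carries one option. The `Event` enum is a simplified stand-in, and `Html::newline_after_end` is a made-up option used only to show why a receiver could be useful; only the `fmt::Write` method is shown.

```rust
use std::fmt::{self, Write};

/// Simplified stand-in for jotdown's event type (hypothetical subset).
pub enum Event<'s> {
    Start(&'s str),
    End(&'s str),
    Str(&'s str),
}

pub trait Render {
    /// Renders `events` into a unicode-accepting writer.
    fn push<'s, I, W>(&self, events: I, out: W) -> fmt::Result
    where
        I: Iterator<Item = Event<'s>>,
        W: Write;
}

/// Toy HTML renderer with a single illustrative option.
pub struct Html {
    /// Hypothetical option: emit a line break after every closing tag.
    pub newline_after_end: bool,
}

impl Render for Html {
    fn push<'s, I, W>(&self, events: I, mut out: W) -> fmt::Result
    where
        I: Iterator<Item = Event<'s>>,
        W: Write,
    {
        for e in events {
            match e {
                Event::Start(tag) => write!(out, "<{}>", tag)?,
                Event::End(tag) => {
                    write!(out, "</{}>", tag)?;
                    // The receiver gives access to per-renderer configuration.
                    if self.newline_after_end {
                        out.write_char('\n')?;
                    }
                }
                Event::Str(s) => out.write_str(s)?,
            }
        }
        Ok(())
    }
}
```

A renderer without options can still implement the trait on a unit struct, so taking `&self` costs little for the simple case.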
It is often useful to work with an AST rather than a sequence of events. We could implement an optional module that provides AST objects that correspond to the AST defined by the djot spec (https://github.com/jgm/djot.js/blob/main/src/ast.ts).
It would be useful to be able to create it from events, and create events from the AST so you can e.g. parse events -> create ast -> modify ast -> create events -> render events.
It could also be useful to read/write the AST from/to e.g. json. We may then be able to read/write ASTs identically to the reference implementation. It might also be useful in tests to match against JSON produced by the reference implementation. We should be able to automatically implement the serialization/deserialization using serde, and then the downstream client can use any serde-compatible format.
A quick sketch of what it could look like:
#[cfg(feature = "ast")]
pub mod ast {
use super::Event;
use std::collections::HashMap as Map;
#[cfg(feature = "serde")]
use serde::{Deserialize, Serialize};
#[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
pub struct Doc {
children: Vec<Block>,
references: Map<String, Reference>,
footnotes: Map<String, Footnote>,
}
#[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
pub struct Reference {
// todo
}
#[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
pub struct Footnote {
// todo
}
#[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
pub struct Block {
kind: BlockKind,
children: Vec<Block>,
}
#[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
pub enum BlockKind {
Para,
Heading { level: usize },
// todo
}
pub struct Iter<'a> {
// todo
_s: std::marker::PhantomData<&'a ()>,
}
impl<'a> Iterator for Iter<'a> {
type Item = Event<'a>;
fn next(&mut self) -> Option<Self::Item> {
todo!()
}
}
#[derive(Debug)]
pub enum Error {
EventNotEnded,
UnexpectedStart,
BlockInsideLeaf,
}
impl<'s> FromIterator<Event<'s>> for Result<Doc, Error> {
fn from_iter<I: IntoIterator<Item = Event<'s>>>(events: I) -> Self {
todo!()
}
}
impl<'a> IntoIterator for &'a Doc {
type Item = Event<'a>;
type IntoIter = Iter<'a>;
fn into_iter(self) -> Self::IntoIter {
todo!()
}
}
}
clientside:
let src = "# heading
para";
let events = jotdown::Parser::new(src);
let ast = events.collect::<Result<jotdown::ast::Doc, _>>().unwrap();
let json = serde_json::to_string(&ast);
assert_eq!(
json,
r##"
{
"tag": "doc",
"references": {},
"footnotes": {},
"children": [
{
"tag": "para",
"children": [
{
"tag": "str",
"text": "para"
}
]
}
]
}
"##
);
When writing something like \sum_{j=0} in a math environment, the {j=0} part is interpreted by the parser as an attribute. This leads to it being removed from the rendered TeX math. Everything inside a math environment should be parsed verbatim.
I think that this code
{%
# title
paragraph
%}
should produce nothing.
Instead, it produces:
<p>{%
# title</p>
<p>paragraph
%}</p>
So is it not possible to comment out headings and paragraphs in one go, without using multiple inline comments?
As mentioned in #19 (comment), making CowStr an actual type instead of an alias would allow us to reuse many common patterns and helper functions. We could also then inline short strings, similar to pulldown_cmark.
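A rough sketch of what a dedicated CowStr type with inlined short strings could look like is shown below. The variant names, the `InlineStr` helper, the `copied` constructor and the 22-byte limit are assumptions made for illustration; they are not taken from jotdown or copied from pulldown_cmark.

```rust
use std::ops::Deref;

/// Maximum number of bytes stored inline (an arbitrary choice for this sketch).
const MAX_INLINE: usize = 22;

/// Short string stored without heap allocation. Fields are private so that
/// `len` never exceeds the buffer and the bytes are always valid UTF-8.
#[derive(Debug, Clone, Copy)]
pub struct InlineStr {
    len: u8,
    buf: [u8; MAX_INLINE],
}

impl InlineStr {
    /// Returns `None` if `s` does not fit into the inline buffer.
    pub fn new(s: &str) -> Option<Self> {
        if s.len() > MAX_INLINE {
            return None;
        }
        let mut buf = [0u8; MAX_INLINE];
        buf[..s.len()].copy_from_slice(s.as_bytes());
        Some(InlineStr { len: s.len() as u8, buf })
    }

    pub fn as_str(&self) -> &str {
        // Checked conversion keeps the sketch free of unsafe code.
        std::str::from_utf8(&self.buf[..self.len as usize]).expect("filled from a valid str")
    }
}

#[derive(Debug, Clone)]
pub enum CowStr<'s> {
    Borrowed(&'s str),
    Owned(Box<str>),
    Inlined(InlineStr),
}

impl<'s> CowStr<'s> {
    /// Copies `s`, inlining it when short enough and boxing it otherwise.
    pub fn copied(s: &str) -> CowStr<'static> {
        match InlineStr::new(s) {
            Some(inline) => CowStr::Inlined(inline),
            None => CowStr::Owned(s.into()),
        }
    }
}

impl<'s> Deref for CowStr<'s> {
    type Target = str;

    fn deref(&self) -> &str {
        match self {
            CowStr::Borrowed(s) => s,
            CowStr::Owned(s) => s,
            CowStr::Inlined(s) => s.as_str(),
        }
    }
}
```

Because the type derefs to `str`, existing code that only reads the string would keep working, while helper methods and trait impls can be attached to the type itself.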
All footnote events are currently emitted at the end of the document, in the order they are referenced. The reason for this is that the block elements are already stored in the block tree, so we can just skip it and parse and render it at the end. The alternative option is for the renderer to store the parsed result and render that at the end.
The current approach is problematic if the original order and location is needed. E.g. a pretty-printer would now be forced to place all footnotes at the end.
The current approach is also impossible for a single-pass parser, which we are trying to move to.
This is related to #14, both trying to achieve better 1:1 ratio between input and events.
The borrowed spans for inlines are currently quite unsound:
fn chars(&self) -> Self::Chars {
unsafe { std::mem::transmute(InlineChars::new(self.src, self.spans.iter().copied())) }
}
We should try to get rid of the unsafe block here. Problem is that the Parser holds a vector of spans and we need to borrow a slice of it from within the Parser. So we get kind of a self-reference.
This doesn't really cause any problems right now, I think, but if we want to e.g. allow cloning the Parser it will be problematic.
Currently, source positions are not available for events. However, they are very useful and we should be able to implement this.
pulldown-cmark has an into_offset_iter method on the Parser that gives an iterator of tuples that contain both an event and its start/end position as a byte range. We could do something similar.
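The API shape could look roughly like the sketch below: an iterator whose items pair an event with the byte range it was parsed from. The toy "parser" here only splits lines and stands in for the real one; `OffsetIter` and this reduced `Event` are hypothetical.

```rust
use std::ops::Range;

/// Reduced stand-in for the real event type (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Event<'s> {
    Str(&'s str),
    Softbreak,
}

/// Yields each event together with its byte range in the source.
pub struct OffsetIter<'s> {
    src: &'s str,
    pos: usize,
}

impl<'s> OffsetIter<'s> {
    pub fn new(src: &'s str) -> Self {
        OffsetIter { src, pos: 0 }
    }
}

impl<'s> Iterator for OffsetIter<'s> {
    type Item = (Event<'s>, Range<usize>);

    fn next(&mut self) -> Option<Self::Item> {
        let src: &'s str = self.src;
        let rest = &src[self.pos..];
        if rest.is_empty() {
            return None;
        }
        // Toy tokenisation: a newline is a soft break, anything else is text.
        let (event, len) = match rest.find('\n') {
            Some(0) => (Event::Softbreak, 1),
            Some(n) => (Event::Str(&rest[..n]), n),
            None => (Event::Str(rest), rest.len()),
        };
        let start = self.pos;
        self.pos += len;
        Some((event, start..self.pos))
    }
}
```

A consumer can then slice the original input with the returned range, for example to report diagnostics or to copy unmodified source in a pretty-printer.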
E.g.
[some _text_][]
[some text]: without_formatting
[some _text_]: with_formatting
yields
<p><a href="with_formatting">some <em>text</em></a></p>
instead of
<p><a href="without_formatting">some <em>text</em></a></p>
We should set up benchmarks in order to evaluate performance of different implementation choices in jotdown. Criterion seems like a good tool for this. To begin with, we can use the benchmark files from jgm/djot.js and generate benchmarks, similar to how the unit tests are generated right now.
See previous discussion: #14 (comment)
We currently have the bug described in jgm/djot#109.
We can track open/close of code blocks in block::Kind::continues.
There is a test for this, can be run with:
$ make test_html_ut
$ cargo test -p test-html-ut -- --ignored fenced_divs
Reference implementation supports multi-line block attributes:
{#id .class
style="color:red"}
A paragraph
<p id="id" class="class" style="color:red">A paragraph</p>
while jotdown instead treats it as an inline attr attached to nothing:
<p>
A paragraph</p>
Hi, great work with this library!
I've hit this bug in the list parsing: alphabetical lists are parsed correctly until C and D, where they become Roman numeral lists. Those characters are ambiguous, so a little context is needed to know what kind of list item they are.
Example input:
A) first
B) second
C) third
D) fourth
E) fifth
Output:
<ol type="A">
<li>
first
</li>
<li>
second
</li>
</ol>
<ol start="100" type="I">
<li>
third
</li>
<li>
fourth
</li>
</ol>
<ol start="5" type="A">
<li>
fifth
</li>
</ol>
A) first
B) second
C) third
CI) fourth
E) fifth
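The context-based disambiguation could be sketched as below: a marker is first checked against the previous item of the same list, and only falls back to a context-free guess when it does not continue that list. All names are hypothetical, only upper-case markers are handled, and the fallback order (roman before alphabetic) is an arbitrary assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Numbering {
    AlphaUpper,
    RomanUpper,
}

/// Value of a single upper-case letter marker such as `C` (A = 1).
fn alpha_value(marker: &str) -> Option<u32> {
    let mut chars = marker.chars();
    match (chars.next(), chars.next()) {
        (Some(c @ 'A'..='Z'), None) => Some(c as u32 - 'A' as u32 + 1),
        _ => None,
    }
}

/// Value of an upper-case roman numeral (canonical form is not validated).
fn roman_value(marker: &str) -> Option<u32> {
    let (mut total, mut prev) = (0i32, 0i32);
    for c in marker.chars().rev() {
        let v = match c {
            'I' => 1,
            'V' => 5,
            'X' => 10,
            'L' => 50,
            'C' => 100,
            'D' => 500,
            'M' => 1000,
            _ => return None,
        };
        // A smaller numeral before a larger one is subtracted, as in `IV`.
        if v < prev {
            total -= v;
        } else {
            total += v;
            prev = v;
        }
    }
    if total > 0 { Some(total as u32) } else { None }
}

/// Classifies `marker`, preferring the reading that continues `prev`.
pub fn classify(marker: &str, prev: Option<(Numbering, u32)>) -> Option<(Numbering, u32)> {
    let alpha = alpha_value(marker).map(|n| (Numbering::AlphaUpper, n));
    let roman = roman_value(marker).map(|n| (Numbering::RomanUpper, n));
    if let Some((kind, n)) = prev {
        let next = Some((kind, n + 1));
        if alpha == next {
            return alpha;
        }
        if roman == next {
            return roman;
        }
    }
    // No context, or the marker does not continue the list: guess.
    roman.or(alpha)
}
```

Under these assumptions, `C` following `B` is read as the third alphabetic item, whereas `C` without context is read as roman 100, which corresponds to the `start="100"` seen in the output above.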
Currently, link definitions are only used to resolve links and are then ignored. However, the position of link definitions may be useful to a consumer, e.g. for a pretty-printer. If we also want to let consumers resolve links, it would become necessary to emit them.
Would need to
1. item
para
yields
<ol>
<li>
item
</li>
</ol>
para
instead of
<ol>
<li>
item
</li>
</ol>
<p>para</p>
The events are as expected, so this seems to be a bug in the HTML renderer.
a{
%%
a=a}
yields
<p><span>a</span></p>
instead of
<p><span a="a">a</span></p>
We currently have an unsafe block when performing the prepass:
// SAFETY: used_ids is dropped before the id_auto strings in headings. even if
// the strings move due to headings reallocating, the string data on the heap
// will not move
used_ids.insert(unsafe {
std::mem::transmute::<&str, &'static str>(id_auto.as_ref())
});
I don't think it is causing any problems but it would be good to get rid of it.
As far as I know, there is no C implementation of djot yet, so I think it would be nice if there were C bindings for this library.
see also: https://github.com/eqrion/cbindgen
Reference implementation allows escaping with \ in inline URLs, e.g. from one of its unit tests:
*[closed](hello\*)
<p>*<a href="hello*">closed</a></p>
jotdown currently does not:
<p>*<a href="hello\*">closed</a></p>
This is also useful for e.g. URLs with parentheses.
Would be nice to at least build, run tests and check formatting automatically on merge requests.
It should be possible to do e.g.
for (k, v) in attrs {}
and
for (k, v) in &attrs {}
The current Attributes::iter function is not very idiomatic. Ideally, there should instead be IntoIterator implementations for &Attributes, &mut Attributes and Attributes with corresponding iterators: attr::Iter, attr::IterMut, attr::IntoIter.
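The three implementations could be laid out roughly as follows. The storage (`Vec<(&str, String)>`) and the item types are assumptions made for this sketch and do not reflect how jotdown stores attributes; the iterator names follow the ones proposed above.

```rust
/// Hypothetical attribute container. The field is public only to keep
/// the sketch short.
#[derive(Debug, Default)]
pub struct Attributes<'s>(pub Vec<(&'s str, String)>);

pub struct Iter<'a, 's>(std::slice::Iter<'a, (&'s str, String)>);
pub struct IterMut<'a, 's>(std::slice::IterMut<'a, (&'s str, String)>);
pub struct IntoIter<'s>(std::vec::IntoIter<(&'s str, String)>);

impl<'a, 's> Iterator for Iter<'a, 's> {
    type Item = (&'s str, &'a str);
    fn next(&mut self) -> Option<Self::Item> {
        self.0.next().map(|(k, v)| (*k, v.as_str()))
    }
}

impl<'a, 's> Iterator for IterMut<'a, 's> {
    type Item = (&'s str, &'a mut String);
    fn next(&mut self) -> Option<Self::Item> {
        // Keys stay immutable; only values can be changed in place.
        self.0.next().map(|(k, v)| (*k, v))
    }
}

impl<'s> Iterator for IntoIter<'s> {
    type Item = (&'s str, String);
    fn next(&mut self) -> Option<Self::Item> {
        self.0.next()
    }
}

impl<'a, 's> IntoIterator for &'a Attributes<'s> {
    type Item = (&'s str, &'a str);
    type IntoIter = Iter<'a, 's>;
    fn into_iter(self) -> Self::IntoIter {
        Iter(self.0.iter())
    }
}

impl<'a, 's> IntoIterator for &'a mut Attributes<'s> {
    type Item = (&'s str, &'a mut String);
    type IntoIter = IterMut<'a, 's>;
    fn into_iter(self) -> Self::IntoIter {
        IterMut(self.0.iter_mut())
    }
}

impl<'s> IntoIterator for Attributes<'s> {
    type Item = (&'s str, String);
    type IntoIter = IntoIter<'s>;
    fn into_iter(self) -> Self::IntoIter {
        IntoIter(self.0.into_iter())
    }
}
```

With these in place, `for (k, v) in attrs {}`, `for (k, v) in &attrs {}` and `for (k, v) in &mut attrs {}` would all work, mirroring the standard collections.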
Escapes in attributes are currently unimplemented.
@matklad had a good idea that can be used to avoid an intermediate string: jgm/djot#203 (comment).
The parser must allow escapes and ignore them. And instead of storing a Cow in the Attributes object, we can create a custom Attr object that implements Display and skips over unescaped backslashes.
Note that we may still have to copy strings that are interrupted by e.g. a blockquote:
> word{a="multi-line
> value"}
However, it should be possible to filter out newline/multiple spaces if it is just across multiple lines:
- word{a="multi-line
value"}
Newlines and multiple spaces can safely be replaced by a single space, but we can't know if a ">" is part of the value or happens to be there because it is within a blockquote.
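The Display-based idea could be sketched as below: the value keeps borrowing the raw source, escapes included, and the backslashes are only dropped while formatting. `AttrValue` is a hypothetical name, and the sketch assumes for simplicity that a backslash escapes whatever character follows it; the whitespace handling described above is not covered.

```rust
use std::fmt::{self, Write};

/// Hypothetical attribute value that borrows the raw source text,
/// including any backslash escapes.
pub struct AttrValue<'s> {
    raw: &'s str,
}

impl<'s> AttrValue<'s> {
    pub fn new(raw: &'s str) -> Self {
        AttrValue { raw }
    }
}

impl<'s> fmt::Display for AttrValue<'s> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let mut chars = self.raw.chars();
        while let Some(c) = chars.next() {
            if c == '\\' {
                // Drop the backslash and keep the escaped character verbatim.
                // A trailing lone backslash produces no output.
                if let Some(escaped) = chars.next() {
                    f.write_char(escaped)?;
                }
            } else {
                f.write_char(c)?;
            }
        }
        Ok(())
    }
}
```

Since no intermediate string is built, a renderer can write the value straight into its output with `write!(out, "{}", value)`.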
95% of the time when I render markdown (or now djot) I just want to get a String. 4% of the time I write to an io::Writer. Having top-level convenience functions for these seems like a good idea, and doesn't make switching to more flexible versions harder when needed.
`abc`{.cls}
yields
<p><code>abc</code></p>
instead of
<p><code class="cls">abc</code></p>