Comments (13)

jgm commented on June 19, 2024

djot.ast now contains a function that builds a table associating byte offsets with line/col/char positions.
Building this table does affect performance, but it provides good positions in both the AST and the HTML.
At this point, the -m (matches) output still just uses byte offsets, but in principle this could be changed too.

Sample output:

 % ./run.sh -ap
Hello there.

> quote
doc
  para (1:1:1-2:0:13)
    str (1:1:1-1:12:12) text="Hello there."
  blockquote (3:1:15-nil)
    para (3:3:17-4:0:22)
      str (3:3:17-3:7:21) text="quote"
references = {
}
footnotes = {
}

from djot.

jgm commented on June 19, 2024

The effect on performance of -p is pretty large: e.g., 83 ms without it, 140 ms with it on one benchmark.

jgm commented on June 19, 2024

With the latest commit I managed to get it down to 124 ms with source positions. Still a sizable penalty, but not as bad.

It would be worth providing an option to add these line/col/char source positions to matches.

matklad commented on June 19, 2024

I think one problem is that we essentially build a Lua object for each byte of input. That seems like a lot. Perhaps a faster way would be to build an index of just the \n characters?

That is, a sorted array of byte offsets of newlines, which you can then binary search to get the line number.
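A minimal sketch of that index (in Python for illustration; the implementation under discussion is Lua):

```python
import bisect

def build_newline_index(src: bytes) -> list[int]:
    # Sorted byte offsets of every '\n' in the input.
    return [i for i, b in enumerate(src) if b == 0x0A]

def line_of(newlines: list[int], offset: int) -> int:
    # 1-based line number: count the newlines strictly before `offset`.
    return bisect.bisect_left(newlines, offset) + 1
```

For the sample input above, the byte at 0-based offset 14 (the `>`) falls on line 3, matching the `3:1:15` in the AST dump.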

jgm commented on June 19, 2024

That's another option, yes. But of course that makes lookups slower, so we should probably try both approaches and benchmark. It's a tradeoff between more time constructing the table + fast lookups, and quicker table construction + slower lookups.

EDIT: Also, an index of newlines would just get us the line number; we'd then need to do some additional processing on the line to get the column (as opposed to byte). This approach also wouldn't give us a character offset into the whole document, which we currently provide and which I thought might be useful (e.g., for snipping out a section in a language that operates with character rather than byte offsets). Still, it's worth experimenting.
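The extra per-line work might look like this (again a Python sketch; note that `col` here counts code points, not grapheme clusters):

```python
import bisect

def line_and_col(src: bytes, newlines: list[int], offset: int) -> tuple[int, int]:
    # `newlines` is the sorted list of byte offsets of '\n' in `src`.
    line = bisect.bisect_left(newlines, offset) + 1
    line_start = newlines[line - 2] + 1 if line > 1 else 0
    # The column requires decoding the line's prefix: code points, not bytes.
    col = len(src[line_start:offset].decode("utf-8")) + 1
    return line, col
```

A document-wide character offset, by contrast, would require decoding everything up to `offset`, which is exactly the work the newline index was meant to avoid.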

jgm commented on June 19, 2024

In light of the above, one important question is: is it important to provide character offsets, or do line offsets and character offsets within a line (i.e. column) suffice?

If we need character offsets, then I don't think the approach you sketch will work.

jgm commented on June 19, 2024

Another option might be allowing the generation of the source positions to be "lazy."

That is, asking for the source pos corresponding to a byte position would scan through the first N bytes to calculate a position, and then stop, keeping the state information about the current line, column, and char pos in a closure.

This might avoid the need to allocate memory for a gigantic table, and maybe that would be faster. Of course, the table is better if we might need to "jump around" in asking for source positions, but if we are always going sequentially, maybe this could work. I'm not sure if our requests are guaranteed to be sequential, though.
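Assuming requests really do arrive in non-decreasing byte order, the lazy scanner could be sketched as follows (illustrative Python, not the actual implementation; positions are 1-based line:col:char as in the AST dump above):

```python
def make_pos_tracker(src: bytes):
    # Mutable scan state held in a closure: next byte offset to process,
    # plus the line/col/char position of that byte.
    state = {"i": 0, "line": 1, "col": 1, "char": 1}

    def pos_at(offset: int) -> tuple[int, int, int]:
        # Returns (line, col, char) for the byte at 0-based `offset`.
        s = state
        assert s["i"] <= offset <= len(src), "requests must be non-decreasing"
        while s["i"] < offset:
            b = src[s["i"]]
            if b == 0x0A:              # newline: next byte starts a new line
                s["line"] += 1
                s["col"] = 1
                s["char"] += 1
            elif b & 0xC0 != 0x80:     # not a UTF-8 continuation byte
                s["col"] += 1
                s["char"] += 1
            s["i"] += 1
        return s["line"], s["col"], s["char"]

    return pos_at
```

A backward request trips the assertion, which is why strictly sequential access matters for this scheme.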

jgm commented on June 19, 2024

I did an experiment with a long document, and there are a few places where we jump around and ask for a source position for an earlier byte position after asking for a later one. Perhaps we could memoize and store, say, the last 10 requests.
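That memoization could be bolted on with a small wrapper (a hypothetical helper, sketched in Python):

```python
from collections import OrderedDict

def with_recent_cache(pos_at, size=10):
    # Remember the last `size` lookups so that occasional backward jumps
    # to a recently requested offset don't hit the underlying scanner.
    cache = OrderedDict()

    def lookup(offset):
        if offset in cache:
            cache.move_to_end(offset)      # mark as most recently used
            return cache[offset]
        result = pos_at(offset)
        cache[offset] = result
        if len(cache) > size:
            cache.popitem(last=False)      # evict the oldest entry
        return result

    return lookup
```

A jump to an offset that has already fallen out of the cache would still fail under the strictly sequential scheme, so the cache size has to cover the largest backward jump observed.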

matklad commented on June 19, 2024

> In light of the above, one important question is: is it important to provide character offsets, or do line offsets and character offsets within a line (i.e. column) suffice?

tbh, I am unsure about the usefulness of line/character-in-line info altogether. You need line/column info to show stuff to the user, and for the user the column should count something like grapheme clusters (https://www.foonathan.net/2021/02/column/). So my gut instinct here would be to provide only absolute byte offsets, and maybe export a couple of helper functions that build a line table from the input string to do line/column -> byte offset translation.

On the other hand, VS Code and LSP, for example, use utf16 code-unit metrics to display column information for the user in the UI and, while it makes zero sense, it seems human users don't complain in practice?

jgm commented on June 19, 2024

One place I used the line information is in the live playground. I get the line visible at the top of the left hand pane, and then find an element with that line in its data-sourcepos attribute on the right hand pane, and scroll so that it is visible. I don't know how I'd do that without line offsets.

What about absolute character offsets? Byte offsets aren't that helpful if you're working with strings in a language that indexes them by code points rather than bytes.

You're right, though, that calculating "column" by counting code points won't always match the perceived column, because of clusters.

matklad commented on June 19, 2024

> One place I used the line information is in the live playground.

Curiously, column offsets would be the wrong thing to use in the playground for implementing, e.g., horizontal scrolling, because JavaScript indexes strings by utf16 code units.

> What about absolute character offsets? Byte offsets aren't that helpful if you're working with strings in a language that indexes them by code points rather than bytes.

I don't think the indexing scheme should be the same for all implementation languages; it seems reasonable to use whatever indexing is native for a particular language (utf16 for Java[Script], utf8 for Rust, codepoint for Unicode).

If we do want to have a single way to index things in the spec, then, yeah, unicode codepoint offsets would be the least controversial option. Though I'd prefer to use utf8 offsets, as I believe that's technically the best choice. Codepoint indices are somewhat useless, and utf8 ones are great if you store the text in utf8 internally, which all modern implementations do.

jgm commented on June 19, 2024

Questions:

  1. Should anything about sourcepos go into the "spec" for the AST?
  2. If so, what? byte offset? code point offset? also line offset? also code point offset within line?
  3. If not, is it okay if implementations add a sourcepos in whatever format is most useful?

> Codepoint indices are somewhat useless, and utf8 ones are great if you store the text in utf8 internally, which all modern implementations do.

I wouldn't want to count on this. Haskell only recently started using UTF8 in its text library; previously it used UTF16. But even now, that's just an implementation detail: the API for text doesn't allow you to index a text by a byte offset, only by a code point offset. (And this seems to me a good choice: the fact that it's UTF-8 is an implementation detail and needn't be exposed to the user; indeed, the fact that this detail wasn't exposed in the original text library made the transition from an underlying UTF16 to an underlying UTF8 representation relatively painless.)

EDIT: I'm not an expert on JavaScript, but it seems that its string interface also doesn't allow byte access?

matklad commented on June 19, 2024

  1. I don't think so. For some use cases you don't need sourcepos, and the question of which sourcepos is the best doesn't have an obvious correct answer.
  2. many options indeed
  3. yeah, I think it should be OK for impls to add extra stuff to AST

> (And this seems to me a good choice: the fact that it's UTF-8 is an implementation detail and needn't be exposed to the user;

Yeah, I have had reasonable people disagree with me on this one; my position here is definitely contrarian :) The way I see it, a utf8 offset is no more of an implementation detail than a unicode codepoint offset. utf8 offsets, utf16 offsets, and codepoint offsets are only useful to slice strings; I don't think codepoints have any inherent technical advantage. It's just a way to talk about substrings. To put it another way, the operations of "get nth character" and "get nth utf16 code unit" are equally useless, because a character is never the correct answer (indeed, Unicode doesn't even define "character", only codepoint).

If we accept that it doesn't really matter which of utf8, utf16, utf32 we pick as an "index space" for strings, utf8 has the clear advantage as the one not requiring an offset-translation table in practice.

(*) one advantage of utf32 index-space is that it is dense -- all indices between 0..n are valid. But indices larger than n are still invalid, so there's still an "is this index ok for this string?" invariant to maintain.
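A small Python illustration of that translation cost (`str` is indexed by code points, while its UTF-8 encoding gives byte offsets):

```python
s = "naïve café"
data = s.encode("utf-8")

# A utf8 (byte) offset slices the stored bytes directly:
assert data[2:4].decode("utf-8") == "ï"

# A code point offset must first be translated to a byte offset,
# which means encoding (i.e. scanning) the whole prefix:
cp_off = s.index("é")                     # code point offset 9
byte_off = len(s[:cp_off].encode("utf-8"))
assert byte_off == 10                     # the earlier "ï" takes two bytes
assert data[byte_off:byte_off + 2].decode("utf-8") == "é"
```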

> EDIT: I'm not an expert on JavaScript, but it seems that its string interface also doesn't allow byte access?

Yeah, JavaScript only gives access to utf16 code units, indexed by utf16 offsets.
