Comments (13)

jgm commented on June 19, 2024

djot.ast now contains a function that builds a table associating byte offsets with line/col/char positions.
Building this table does affect performance, but it provides good positions in both the AST and the HTML.
At this point, the -m (matches) output still just uses byte offsets, but in principle this could be changed too.

Sample output:

 % ./run.sh -ap
Hello there.

> quote
doc
  para (1:1:1-2:0:13)
    str (1:1:1-1:12:12) text="Hello there."
  blockquote (3:1:15-nil)
    para (3:3:17-4:0:22)
      str (3:3:17-3:7:21) text="quote"
references = {
}
footnotes = {
}

from djot.

jgm commented on June 19, 2024

The effect on performance of -p is pretty large: e.g., 83 ms without it, 140 ms with it on one benchmark.

jgm commented on June 19, 2024

With the latest commit I managed to get it down to 124 ms with source positions. Still a sizable penalty, but not as bad.

It would be worth providing an option to add these line/col/char source positions to matches.

matklad commented on June 19, 2024

I think one problem is that we essentially build a Lua object for each byte of input. That seems like a lot. Perhaps a faster way would be to build an index of just the \n characters?

That is, a sorted array of byte offsets of newlines, which you can then binary search to get the line number.
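A minimal sketch of that index (in Python for illustration; the implementation under discussion is Lua):

```python
import bisect

def build_newline_index(src: bytes) -> list[int]:
    # Sorted byte offsets of every '\n' in the input.
    return [i for i, b in enumerate(src) if b == 0x0A]

def line_of(newlines: list[int], offset: int) -> int:
    # 1-based line number: count the newlines strictly before `offset`.
    return bisect.bisect_left(newlines, offset) + 1
```

For the sample input above, the byte at 0-based offset 14 (the `>`) falls on line 3, matching the `3:1:15` in the AST dump.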

jgm commented on June 19, 2024

That's another option, yes. But of course that makes lookups slower, so we should probably try both approaches and benchmark. It's a tradeoff between more time constructing the table + fast lookups, and quicker table construction + slower lookups.

EDIT: Also, an index of newlines would just get us the line number; we'd then need to do some additional processing on the line to get the column (as opposed to byte). This approach also wouldn't give us a character offset into the whole document, which we currently provide and which I thought might be useful (e.g., for snipping out a section in a language that operates with character rather than byte offsets). Still, it's worth experimenting.
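The extra per-line work might look like this (again a Python sketch; note that `col` here counts code points, not grapheme clusters):

```python
import bisect

def line_and_col(src: bytes, newlines: list[int], offset: int) -> tuple[int, int]:
    # `newlines` is the sorted list of byte offsets of '\n' in `src`.
    line = bisect.bisect_left(newlines, offset) + 1
    line_start = newlines[line - 2] + 1 if line > 1 else 0
    # The column requires decoding the line's prefix: code points, not bytes.
    col = len(src[line_start:offset].decode("utf-8")) + 1
    return line, col
```

A document-wide character offset, by contrast, would require decoding everything up to `offset`, which is exactly the work the newline index was meant to avoid.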

jgm commented on June 19, 2024

In light of the above, one important question is: is it important to provide character offsets, or do line offsets and character offsets within a line (i.e. column) suffice?

If we need character offsets, then I don't think the approach you sketch will work.

jgm commented on June 19, 2024

Another option might be allowing the generation of the source positions to be "lazy."

That is, asking for the source pos corresponding to a byte position would scan through the first N bytes to calculate a position, and then stop, keeping the state information about the current line, column, and char pos in a closure.

This might avoid the need to allocate memory for a gigantic table, and maybe that would be faster. Of course, the table is better if we might need to "jump around" in asking for source positions, but if we are always going sequentially, maybe this could work. I'm not sure if our requests are guaranteed to be sequential, though.
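Assuming requests really do arrive in non-decreasing byte order, the lazy scanner could be sketched as follows (illustrative Python, not the actual implementation; positions are 1-based line:col:char as in the AST dump above):

```python
def make_pos_tracker(src: bytes):
    # Mutable scan state held in a closure: next byte offset to process,
    # plus the line/col/char position of that byte.
    state = {"i": 0, "line": 1, "col": 1, "char": 1}

    def pos_at(offset: int) -> tuple[int, int, int]:
        # Returns (line, col, char) for the byte at 0-based `offset`.
        s = state
        assert s["i"] <= offset <= len(src), "requests must be non-decreasing"
        while s["i"] < offset:
            b = src[s["i"]]
            if b == 0x0A:              # newline: next byte starts a new line
                s["line"] += 1
                s["col"] = 1
                s["char"] += 1
            elif b & 0xC0 != 0x80:     # not a UTF-8 continuation byte
                s["col"] += 1
                s["char"] += 1
            s["i"] += 1
        return s["line"], s["col"], s["char"]

    return pos_at
```

A backward request trips the assertion, which is why strictly sequential access matters for this scheme.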

jgm commented on June 19, 2024

I did an experiment with a long document, and there are a few places where we jump around and ask for a source position for an earlier byte position after asking for a later one. Perhaps we could memoize and store, say, the last 10 requests.
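That memoization could be bolted on with a small wrapper (a hypothetical helper, sketched in Python):

```python
from collections import OrderedDict

def with_recent_cache(pos_at, size=10):
    # Remember the last `size` lookups so that occasional backward jumps
    # to a recently requested offset don't hit the underlying scanner.
    cache = OrderedDict()

    def lookup(offset):
        if offset in cache:
            cache.move_to_end(offset)      # mark as most recently used
            return cache[offset]
        result = pos_at(offset)
        cache[offset] = result
        if len(cache) > size:
            cache.popitem(last=False)      # evict the oldest entry
        return result

    return lookup
```

A jump to an offset that has already fallen out of the cache would still fail under the strictly sequential scheme, so the cache size has to cover the largest backward jump observed.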

matklad commented on June 19, 2024

> In light of the above, one important question is: is it important to provide character offsets, or do line offsets and character offsets within a line (i.e. column) suffice?

tbh, I am unsure about the usefulness of line/character-in-line info altogether. You need line/column info to show stuff to the user, and for the user the column should count something like grapheme clusters (https://www.foonathan.net/2021/02/column/). So my gut instinct here would be to provide only absolute byte offsets, and maybe export a couple of helper functions that build a line table from the input string to do line/column -> byte offset translation.

On the other hand, VS Code and LSP, for example, use utf16 code-unit metrics to display column information for the user in the UI and, while it makes zero sense, it seems human users don't complain in practice?

jgm commented on June 19, 2024

One place I used the line information is in the live playground. I get the line visible at the top of the left hand pane, and then find an element with that line in its data-sourcepos attribute on the right hand pane, and scroll so that it is visible. I don't know how I'd do that without line offsets.

What about absolute character offsets? Byte offsets aren't that helpful if you're working with strings in a language that indexes them by code points rather than bytes.

You're right, though, that calculating "column" by counting code points won't always match the perceived column, because of clusters.

matklad commented on June 19, 2024

> One place I used the line information is in the live playground.

Curiously, column offsets would be the wrong thing to use in the playground for implementing, e.g., horizontal scrolling, because JavaScript indexes strings by utf16 code units.

> What about absolute character offsets? Byte offsets aren't that helpful if you're working with strings in a language that indexes them by code points rather than bytes.

I don't think the indexing scheme should be the same for all implementation languages; it seems reasonable to use whatever indexing is native for a particular language (utf16 for Java[Script], utf8 for Rust, codepoint for Unicode).

If we do want to have a single way to index things in the spec, then, yeah, unicode codepoint offsets would be the least controversial option. Though I'd prefer to use utf8 offsets, as I believe that's technically the best choice. Codepoint indices are somewhat useless, and utf8 ones are great if you store the text in utf8 internally, which all modern implementations do.

jgm commented on June 19, 2024

Questions:

  1. Should anything about sourcepos go into the "spec" for the AST?
  2. If so, what? byte offset? code point offset? also line offset? also code point offset within line?
  3. If not, is it okay if implementations add a sourcepos in whatever format is most useful?

> Codepoint indices are somewhat useless, and utf8 ones are great if you store the text in utf8 internally, which all modern implementations do.

I wouldn't want to count on this. Haskell only recently started using UTF8 in its text library; previously it used UTF16. But even now, that's just an implementation detail: the API for text doesn't allow you to index a text by a byte offset, only by a code point offset. (And this seems to me a good choice: the fact that it's UTF-8 is an implementation detail and needn't be exposed to the user; indeed, the fact that this detail wasn't exposed in the original text library made the transition from an underlying UTF16 to an underlying UTF8 representation relatively painless.)

EDIT: I'm not an expert on JavaScript, but it seems that its string interface also doesn't allow byte access?

matklad commented on June 19, 2024

  1. I don't think so. For some use cases you don't need sourcepos, and the question of which sourcepos is the best doesn't have an obvious correct answer.
  2. many options indeed
  3. yeah, I think it should be OK for impls to add extra stuff to AST

> (And this seems to me a good choice: the fact that it's UTF-8 is an implementation detail and needn't be exposed to the user;

Yeah, I have had reasonable people disagree with me on this one; my position here is definitely contrarian :) The way I see it, a utf8 offset is no more of an implementation detail than a unicode codepoint offset. utf8 offsets, utf16 offsets, and codepoint offsets are only useful to slice strings; I don't think codepoints have any inherent technical advantage. It's just a way to talk about substrings. To put it another way, the operations of "get nth character" and "get nth utf16 code unit" are equally useless, because a character is never the correct answer (indeed, Unicode doesn't even define "character", only codepoint).

If we accept that it doesn't really matter which of utf8, utf16, utf32 we pick as an "index space" for strings, utf8 has the clear advantage as the one not requiring an offset-translation table in practice.

(*) one advantage of utf32 index-space is that it is dense -- all indices between 0..n are valid. But indices larger than n are still invalid, so there's still an "is this index ok for this string?" invariant to maintain.
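A small Python illustration of that translation cost (`str` is indexed by code points, while its UTF-8 encoding gives byte offsets):

```python
s = "naïve café"
data = s.encode("utf-8")

# A utf8 (byte) offset slices the stored bytes directly:
assert data[2:4].decode("utf-8") == "ï"

# A code point offset must first be translated to a byte offset,
# which means encoding (i.e. scanning) the whole prefix:
cp_off = s.index("é")                     # code point offset 9
byte_off = len(s[:cp_off].encode("utf-8"))
assert byte_off == 10                     # the earlier "ï" takes two bytes
assert data[byte_off:byte_off + 2].decode("utf-8") == "é"
```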

> EDIT: I'm not an expert on JavaScript, but it seems that its string interface also doesn't allow byte access?

Yeah, JavaScript only gives access to utf16 code units, indexed by utf16 offsets.
