tc39 / proposal-string-prototype-codepoints

String.prototype.codePoints proposal for ECMAScript (stage 1)

Home Page: https://tc39.github.io/proposal-string-prototype-codepoints/

License: MIT License

proposal-string-prototype-codepoints's Introduction

String.prototype.codePoints

ECMAScript proposal for String.prototype.codePoints

Status

The proposal is in stage 1 of the TC39 process.

Motivation

Lexers for languages that involve code points above 0xFFFF (such as ECMAScript syntax itself) need to be able to tokenise a string into separate code points before handling them with their own state machine.

Currently, the language provides two ways to access entire code points:

codePointAt retrieves the code point at a known position. The problem is that the position is usually not known in advance when you are simply iterating over the string, so you have to compute it on each iteration with a manual for(;;) loop and a magic-looking expression like pos += currentCodePoint <= 0xFFFF ? 1 : 2.
String.prototype[Symbol.iterator] allows hassle-free iteration over a string's code points, but it yields their string values, which are inefficient to work with in performance-critical lexers, and it still provides no position information.
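
To make the trade-off concrete, here is a minimal sketch of both existing approaches. It uses only codePointAt and the default string iterator; the helper names codePointsViaCodePointAt and codePointStrings are made up for this illustration.

```javascript
// Approach 1: manual loop with codePointAt. The caller has to advance the
// position by 1 or 2 UTF-16 code units depending on the code point just read.
function codePointsViaCodePointAt(input) {
    const result = [];
    for (let pos = 0; pos < input.length; ) {
        const codePoint = input.codePointAt(pos);
        result.push({ position: pos, codePoint });
        pos += codePoint <= 0xFFFF ? 1 : 2;
    }
    return result;
}

// Approach 2: the default string iterator. No position bookkeeping is needed,
// but each step yields a one- or two-unit string rather than a number.
function codePointStrings(input) {
    return [...input];
}

const sample = "a\u{1F600}b"; // U+1F600 occupies two UTF-16 code units
console.log(codePointsViaCodePointAt(sample));
console.log(codePointStrings(sample));
```

The first helper returns positions 0, 1 and 3 for the sample, because the middle code point spans two code units. The second returns three strings and says nothing about where they start.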

Proposed solution

We propose adding a codePoints() method that is functionally similar to [@@iterator] but yields the positions and numeric values of code points instead of just their string values. This combines the benefits of both approaches above while avoiding their pitfalls in consumer code.
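
As a rough approximation of the proposed behaviour (not the spec text, and not necessarily how the polyfill listed under Implementations works), the method can be sketched as a standalone generator. The name codePointsOf is hypothetical. Returning the end position on the final result is an assumption taken from the comment in the tokeniser example.

```javascript
// Hypothetical stand-in for the proposed String.prototype.codePoints().
function* codePointsOf(input) {
    let position = 0;
    while (position < input.length) {
        const codePoint = input.codePointAt(position);
        yield { position, codePoint };
        // Code points above 0xFFFF occupy two UTF-16 code units.
        position += codePoint <= 0xFFFF ? 1 : 2;
    }
    // Assumption: the final `done` result still carries the end-of-string
    // position in its value, as the tokeniser example expects.
    return { position };
}

for (const { position, codePoint } of codePointsOf("a\u{1F600}b")) {
    console.log(position, codePoint.toString(16));
}
```

With a generator, the end position is only present on the first result with done set to true. Later calls to next() return an undefined value, so a real implementation may need a hand-written iterator.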

Naming

The name and casing of codePoints were chosen to be consistent with the existing codePointAt API.

Illustrative examples

Test if something is an identifier

function isIdent(input) {
    let codePoints = input.codePoints();
    let first = codePoints.next();

    if (first.done || !isIdentifierStart(first.value.codePoint)) {
        return false;
    }

    for (let { codePoint } of codePoints) {
        if (!isIdentifierContinue(codePoint)) {
            return false;
        }
    }

    return true;
}

Full-blown tokeniser

function toDigit(cp) {
    return cp - /* '0' */ 48;
}

// Generic helper
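class LookaheadIterator {
    constructor(inner) {
        this.inner = inner;
        // Prime `lookahead` with the first result; the value returned here is discarded
        this.next();
    }

    // for...of calls this method, so it must be a function that returns the iterator
    [Symbol.iterator]() {
        return this;
    }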

    next() {
        let next = this.lookahead;
        this.lookahead = this.inner.next();
        return next;
    }

    skipWhile(cond) {
        while (!this.lookahead.done && cond(this.lookahead.value.codePoint)) {
            this.next();
        }
        // even when `done == true`, the returned `.value.position` is still valid
        // and represents position at the end of the string
        return this.lookahead.value.position;
    }
}

// Main tokeniser
function* tokenise(input) {
    let iter = new LookaheadIterator(input.codePoints());

    for (let { position: start, codePoint } of iter) {
        if (isIdentifierStart(codePoint)) {
            yield {
                type: 'Identifier',
                start,
                end: iter.skipWhile(isIdentifierContinue)
            };
        } else if (isDigit(codePoint)) {
            yield {
                type: 'Number',
                start,
                end: iter.skipWhile(isDigit)
            };
        } else {
            throw new SyntaxError(`Expected an identifier or digit at ${start}`);
        }
    }
}

FAQ

Why does the iterator emit an object instead of an array like other key-value iterators?

The [key, value] format is usually used for entries of collections that can be directly indexed by key.

Unlike those collections, strings in ECMAScript are indexed by 16-bit units of UTF-16 text and not by code points, so the emitted objects won't have consecutive indices but rather positions that may be 1 or 2 16-bit units apart.

To make it explicit that these represent different measurement units and string representations, we decided on the { position, codePoint } object format.

See #1 for more details.

What about iteration over different string representations (code units, grapheme clusters, etc.)?

These are not covered by this particular proposal, but should be easy to add as separate methods or APIs. In particular, language-specific representations are being worked on as the Intl.Segmenter proposal.

Specification

You can view the rendered spec at https://tc39.github.io/proposal-string-prototype-codepoints/.

Implementations

Polyfill

proposal-string-prototype-codepoints's Issues

Consider aligning with Intl.Segmenter

Intl.Segmenter also has a "high-speed"/"convenient iterator" split. I'm wondering if the design for the split would be useful for this proposal. Concretely, it's a different API shape, consisting of an extra method on the iterator rather than a different iterator. Performance may be even better because it avoids allocating the IteratorResult object.

Integer versus string representation of code points

#5 (comment) reminded me that I wanted to ask: Is there a particular reason why an integer representation for code points was chosen instead of a 1-or-2-UTF16-code-unit string representation? With the latter,
"\u0041\ud801\udc00\u0042".codePoints() would then yield
"\u0041" then
"\ud801\udc00" then
"\u0042"?

I would personally find such a string representation to be more generally useful. My parsers concatenate code points into new strings much more often than they perform integer arithmetic on them. But people’s usage here may vary, I suppose.
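
For reference, converting between the two representations is possible with existing APIs, so either choice can be adapted by the consumer. A small sketch using String.fromCodePoint and codePointAt, with the same sample string as above:

```javascript
// Numeric code points -> string, e.g. when collecting the text of a token.
const codePoints = [0x41, 0x10400, 0x42];
const text = String.fromCodePoint(...codePoints);

// String -> numeric code points, e.g. when a range check is needed.
// Array.from uses the default string iterator, so each `char` is a whole code point.
const numbers = Array.from(text, (char) => char.codePointAt(0));

console.log(text === "\u0041\ud801\udc00\u0042");
console.log(numbers);
```

Whether the conversion cost matters in practice is the open question here; the sketch only shows that neither representation locks the other out.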

Value inconsistent with codePointAt

I could definitely see myself using this and agree that you would likely want position as well as the code point in most cases, but it grates somewhat that while codePointAt() returns just a number, codePoints() yields objects: from the naming I would expect s.codePointAt(0) === s.codePoints().next().value.

Similarly, without type checking, it's pretty natural to type for (const codePoint of s.codePoints()) ..., instead of for (const { codePoint } of s.codePoints()) ...

The solution might be as simple as a different name: codePointTokens for example.

Alternatively, .value.position being valid even when .done is true is already pretty funky, and position seems most useful when using the iterator directly. Perhaps next() could instead return { done: boolean, value: number, position: number }, where value is valid only when done is false but position is always valid?
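
A minimal sketch of that alternative result shape, written as a hand-rolled iterator. The class name CodePointIterator is hypothetical and this is only one way the suggestion could be read.

```javascript
// Hypothetical iterator: `value` is the numeric code point, and `position`
// sits next to `done` on the result object instead of inside `value`.
class CodePointIterator {
    constructor(input) {
        this.input = input;
        this.position = 0;
    }

    [Symbol.iterator]() {
        return this;
    }

    next() {
        const position = this.position;
        if (position >= this.input.length) {
            // `position` stays valid at the end of the string.
            return { done: true, value: undefined, position };
        }
        const value = this.input.codePointAt(position);
        this.position += value <= 0xFFFF ? 1 : 2;
        return { done: false, value, position };
    }
}

const s = "\u{1F600}b";
console.log(s.codePointAt(0) === new CodePointIterator(s).next().value);
for (const codePoint of new CodePointIterator(s)) {
    console.log(codePoint.toString(16));
}
```

With this shape, for...of sees plain numbers, while code that calls next() directly can still read position from each result.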

Discussion: [position, codePoint] pairs

In some cases it would be useful to know the current position within the string as you go through the codepoints (for example, to store it and use later for slicing or error reporting).

To achieve this, the .codePoints() iterator could yield pairs [position, codePoint] instead of just codePoint.

The obvious downside is that this would be inconsistent with the default chars iterator.

On the other hand, with regular chars iterator, this can be done in a more or less obvious manner already (you can sum up char.length as you go through the string), and, if not, we can add an extra method to yield [position, char] pairs too in future if required.
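
The "sum up char.length" approach mentioned above could look roughly like this. The generator name charsWithPositions is made up for the example.

```javascript
// Yields [position, char] pairs using only the default string iterator.
function* charsWithPositions(input) {
    let position = 0;
    for (const char of input) {
        yield [position, char];
        // Each yielded string is 1 or 2 UTF-16 code units long.
        position += char.length;
    }
}

for (const [position, char] of charsWithPositions("a\u{1F600}b")) {
    console.log(position, char);
}
```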

Thoughts?

Quantify the performance improvement

The README claims:

String.prototype[Symbol.iterator] which allows a hassle-free iteration over string codepoints, but yields their string values, which are inefficient to work with in performance-critical lexers, and still lack position information.

I would expect the iteration protocol to dominate baseline performance, which is still true for this proposal. I'm curious about the performance gains with this proposal. Is there a benchmark for this claim?

Name suggests it’s like length

The name “codepoints” to me suggests it returns the number of code points in the string, like length does. Might there be another similar name that suggests it’s an iterator?

what about chars?

codePointAt → codePoints

charAt → chars?

I don’t have a use case and this might not be needed, but I wanted to ask about the potential inconsistency.