projectfluent / fluent-langneg-rs

Library for language and locale identifier negotiation.

Home Page: https://projectfluent.org/

License: Apache License 2.0

Rust 100.00%

i18n internationalization l10n localization rust

fluent-langneg-rs's Issues

Mismatch `LanguageIdentifier` type with other Fluent crates

Other Fluent crates use the LanguageIdentifier from unic-langid, whereas fluent-langneg uses the one from icu_locid. As a result, the output of negotiate_languages cannot be passed directly to the other crates.

Deterministic results

At the moment we have a class of non-deterministic results when a request for en matches two regions (GB and CA for example). I'd like to find a way to make it deterministic if only to remove the papercut when writing tests.
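One possible way to remove this papercut, sketched under the assumption that ties are broken by the order of the available list (the issue does not settle on a rule). `Locale` and `matches_for` are hypothetical names using only the standard library; this is not the crate's API.

```rust
/// Hypothetical, simplified locale: a language plus an optional region.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Locale {
    pub language: String,
    pub region: Option<String>,
}

impl Locale {
    /// Parses "en" or "en-GB"; subtags past the second are ignored in this sketch.
    pub fn parse(tag: &str) -> Locale {
        let mut parts = tag.split('-');
        Locale {
            language: parts.next().unwrap_or("").to_ascii_lowercase(),
            region: parts.next().map(|r| r.to_ascii_uppercase()),
        }
    }
}

/// Returns every available locale that shares the requested language,
/// in the order they appear in `available`. The result order depends only
/// on the input slice, so repeated calls give the same answer.
pub fn matches_for<'a>(requested: &Locale, available: &'a [Locale]) -> Vec<&'a Locale> {
    available
        .iter()
        .filter(|loc| loc.language == requested.language)
        .collect()
}
```

With available locales en-GB, fr-FR, en-CA and a request for en, this sketch always yields en-GB before en-CA, which makes test expectations stable.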

Switch negotiation APIs to accept an iterator.

Feedback on reddit from Quxxy:

Just a quick bit of feedback: you generally shouldn't be taking Vec<_> as an argument like you are with negotiate_languages. Unless you need to own the elements (which you don't appear to), or you need resizable storage (which you don't appear to), you should be taking &[_] instead.

It's like asking for an IKEA shelf specifically, when really any brand will do.

Edit: for bonus points: if all you ever do is iterate over the elements once, you could also take an iterator for maximum client-side flexibility.

https://www.reddit.com/r/rust/comments/74rv6r/fluent_locale_library_for_language_tag/do13yto/
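A minimal sketch of what the feedback suggests: a function that accepts anything iterable instead of a Vec. `filter_available` is a hypothetical stand-in that only does case-insensitive exact matching; it is not the crate's negotiate_languages and its signature is an assumption.

```rust
/// Hypothetical stand-in for a negotiation routine. Returns the available
/// tags that match a requested tag (ASCII case-insensitively), in request order.
pub fn filter_available<'a, R, A>(requested: R, available: A) -> Vec<&'a str>
where
    R: IntoIterator,
    R::Item: AsRef<str>,
    A: IntoIterator<Item = &'a str>,
{
    // Collected once because `available` is scanned for every request.
    let available: Vec<&'a str> = available.into_iter().collect();
    let mut out: Vec<&'a str> = Vec::new();
    for req in requested {
        let req_str: &str = req.as_ref();
        let hit = available
            .iter()
            .copied()
            .find(|a| a.eq_ignore_ascii_case(req_str));
        if let Some(hit) = hit {
            // Skip duplicates so each available tag appears at most once.
            if !out.contains(&hit) {
                out.push(hit);
            }
        }
    }
    out
}
```

Callers can then pass a Vec<String>, a slice iterator, or a lazy iterator such as `"de,it".split(',')` without building a Vec first.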

copyright holder missing in LICENSE file

The license file does not name the copyright holder. It still has the placeholder value:

   Copyright {yyyy} {name of copyright owner}

https://github.com/projectfluent/fluent-langneg-rs/blob/master/LICENSE#L189C4-L189C46

Could you please fix that and release a version with that fix included?

Incorrect existing likely subtags

This issue is related to #9, and I want to try to figure it out while making a PR.

First, there are two locales for which the existing logic is simply wrong. "cs" results in "cs-CS", while the region should be CZ, and similarly for "sr", should be "sr-Cyrl-RS" rather than "sr-Cyrl-SR".
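The two corrections can be pictured as entries in a small lookup table. This is only a sketch covering the two locales named above; `likely_script_and_region` and `maximize` are hypothetical names, and whether "cs" should also gain a script is left open, as in the issue.

```rust
/// Hypothetical lookup for the two entries discussed above.
/// Returns (optional script, region) for a bare language subtag.
pub fn likely_script_and_region(language: &str) -> Option<(Option<&'static str>, &'static str)> {
    match language {
        // Czech: the region is CZ, not CS.
        "cs" => Some((None, "CZ")),
        // Serbian: the region is RS, not SR.
        "sr" => Some((Some("Cyrl"), "RS")),
        _ => None,
    }
}

/// Expands a bare language subtag using the table; unknown input is returned unchanged.
pub fn maximize(language: &str) -> String {
    match likely_script_and_region(language) {
        Some((Some(script), region)) => format!("{}-{}-{}", language, script, region),
        Some((None, region)) => format!("{}-{}", language, region),
        None => language.to_string(),
    }
}
```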

A deeper issue is that this logic is inconsistent about whether it adds a script. It feels to me like the right thing to do is to add one. However, this makes three of the negotiate tests fail. I feel that if any downstream logic depends on the script not being present (as is the case now), it is pretty fragile. However, fixing this feels like a bit of a yak-shave, so I'm filing this issue to ask for advice.

I'll also prepare a minimally invasive PR for likely subtags.

No way to own the result of negotiate_languages()

Pardon me if this is a Rust newbie question (I am one), but I'm struggling to implement this library in my app.

Very roughly, I want to do some negotiation up front and decide on a language fallback stack, then retain that in an immutable struct for the lifetime of the app (a CLI tool). I don't have any problem with this config struct otherwise; it's working and I can even put some language information in it. For example, I can use this library's accepted_languages::parse() and get an owned result back (Vec<LanguageIdentifier>), which I can easily keep in my struct. The issue I have is that there seems to be no way to use negotiate_languages() and get back something owned by the calling function. It always returns Vec<&LanguageIdentifier> (which I can't retain in my struct).

Shouldn't there be a built in method that returns something that can be owned by the parent scope?
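One workaround that needs no new API, assuming the identifier type implements Clone: clone each borrowed element into an owned Vec. The sketch below uses a hypothetical `LangId` type and a toy `negotiate` function in place of the real crate, so only the `.into_iter().cloned().collect()` step is the point.

```rust
/// Hypothetical stand-in for a language identifier type that implements Clone.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LangId(pub String);

/// Toy negotiation that borrows from `available`, mirroring a
/// `Vec<&LanguageIdentifier>`-style return type.
pub fn negotiate<'a>(requested: &[LangId], available: &'a [LangId]) -> Vec<&'a LangId> {
    available.iter().filter(|a| requested.contains(*a)).collect()
}

/// Long-lived configuration that owns its fallback stack.
pub struct Config {
    pub fallback: Vec<LangId>,
}

pub fn build_config(requested: &[LangId], available: &[LangId]) -> Config {
    // `cloned()` turns each `&LangId` into an owned `LangId`,
    // so the result no longer borrows from `available`.
    let fallback: Vec<LangId> = negotiate(requested, available)
        .into_iter()
        .cloned()
        .collect();
    Config { fallback }
}
```

Because `Config` owns its elements, it can outlive both input lists.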

Ideas for improving performance

I'm interested in a very high performance representation of locales for skribo. I think what fluent-locale has is a good base, but have some ideas how to make it more performant, both in speed and in object size.

The main cost is likely the allocation of the many small String objects in a locale. There are existing tiny string implementations (tendril, inlinable_string, iString), but I think it's possible to do better by specializing to the needs of bcp47. Most of these strings are in the ballpark of 16 bytes each, and much of the cost is the need to spill to allocation when the strings get big. In bcp47, most of the subtags have a small, fixed maximum size.

I've prototyped a "tinystr" that uses a NonZeroU32 as its backing store, and thus takes 4 bytes, even when used as an option. It also uses SIMD-like math to verify ASCII and no NUL bytes. I'm happy to PR that into this repo, or make a separate crate (there are a number of file formats that use 4 byte tags, and this would be good for those). Use of this string type would probably not be a huge code change, as it doesn't fundamentally change the architecture, just the representation. There is unsafe code, but I think it should be possible to review it to get good confidence.
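A rough sketch of the idea described above, not the prototype itself: `TinyStr4` and its methods are hypothetical names. It checks ASCII with a single word-wide mask, checks for NUL with a plain byte scan for simplicity, and uses only safe code.

```rust
use std::num::NonZeroU32;

/// Up to four ASCII bytes packed into a non-zero 32-bit word.
#[repr(transparent)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TinyStr4(NonZeroU32);

impl TinyStr4 {
    /// Returns None for empty input, input longer than 4 bytes,
    /// non-ASCII bytes, or embedded NUL bytes.
    pub fn new(s: &str) -> Option<TinyStr4> {
        let bytes = s.as_bytes();
        if bytes.is_empty() || bytes.len() > 4 {
            return None;
        }
        let mut buf = [0u8; 4];
        buf[..bytes.len()].copy_from_slice(bytes);
        let word = u32::from_le_bytes(buf);
        // One mask covers all four bytes: a set high bit means non-ASCII.
        if word & 0x8080_8080 != 0 {
            return None;
        }
        // NUL is reserved as padding, so it may not appear in the input.
        if bytes.contains(&0) {
            return None;
        }
        NonZeroU32::new(word).map(TinyStr4)
    }

    /// Length in bytes: 4 minus the number of zero padding bytes at the high end.
    pub fn len(&self) -> usize {
        4 - (self.0.get().leading_zeros() as usize / 8)
    }

    /// Copies the contents out as an owned String.
    pub fn as_string(&self) -> String {
        let buf = self.0.get().to_le_bytes();
        // Every stored byte is ASCII, so each maps to exactly one char.
        buf[..self.len()].iter().map(|&b| b as char).collect()
    }
}
```

Because the backing store is a NonZeroU32, `Option<TinyStr4>` can use the zero value for None and stays at 4 bytes.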

A more aggressive optimization is to use an enum between a fast-path and a general-case representation. The fast path would be optional 4 byte tiny strings for language, script, and region. The general case would be a boxed struct similar to the current one, but with an 8 byte tiny string for language and variant, and 4 byte tiny strings for the other subtags. This enum is 16 bytes on both 32 and 64 bit platforms.

I'm posting an issue to get a sense of how welcome these changes are, and also whether tinystr should be its own crate or just a source file in fluent-locale.

Re-export `unic_langid::{LanguageIdentifier, LanguageIdentifierError}`

Please re-export unic_langid::{LanguageIdentifier, LanguageIdentifierError} as they are included in the public API.

Empty locale (or `x-testing`) should not match everything

Our current strategy treats an empty field as a wildcard, which results in empty locales matching all other locales.

That means that unless specifically blocked (which we did in 0.4.1) x-testing matches everything.
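To illustrate the problem and one possible guard, here is a sketch with a hypothetical `LocaleFields` type in which a missing field acts as a wildcard. It assumes that a tag like x-testing ends up with all fields empty; the guard in `matches` is one option, not necessarily what 0.4.1 does.

```rust
/// Hypothetical locale where `None` in a field acts as a wildcard.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct LocaleFields {
    pub language: Option<String>,
    pub script: Option<String>,
    pub region: Option<String>,
}

/// Two fields match if they are equal or if either one is missing.
fn field_matches(a: &Option<String>, b: &Option<String>) -> bool {
    match (a, b) {
        (Some(x), Some(y)) => x == y,
        _ => true,
    }
}

impl LocaleFields {
    pub fn new(language: Option<&str>, script: Option<&str>, region: Option<&str>) -> Self {
        LocaleFields {
            language: language.map(str::to_string),
            script: script.map(str::to_string),
            region: region.map(str::to_string),
        }
    }

    pub fn is_empty(&self) -> bool {
        self.language.is_none() && self.script.is_none() && self.region.is_none()
    }

    pub fn matches(&self, other: &LocaleFields) -> bool {
        // Guard: without this check an all-empty locale would match everything,
        // because every field comparison falls through to the wildcard case.
        if self.is_empty() || other.is_empty() {
            return false;
        }
        field_matches(&self.language, &other.language)
            && field_matches(&self.script, &other.script)
            && field_matches(&self.region, &other.region)
    }
}
```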

Likely subtags

I need "likely subtags" for script-aware fallback. ICU has an implementation.

I actually have this pretty well implemented. The question is whether it belongs in fluent-locale-rs or whether it should be in skribo. I estimate that it's in the ballpark of 50k of code and data; I could probably get it down a little.

Also, I haven't implemented the deprecated subtags (for example, the conversion of "sh" to "sr_Latn"). I suspect I won't miss them for text rendering, but other applications might want them (for example, so that hyphenation can handle "no-NO"). If I submitted a PR, would you want these?

A "no" answer is fine - it'll just live in skribo.

Update the code to use 1.24

Rust 1.24 got released and it brings two goodies for us:

AsciiExt on char
stable rustfmt-preview.

I'd like to start using both in fluent-locale-rs ASAP.
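As a small illustration of the first item, the ASCII helpers can be called directly on `char` with no trait import, on a toolchain where they are inherent methods. The two functions below are hypothetical and deliberately simplified (a 2 to 3 letter language check and a 2 letter region check); they are not the crate's validation rules.

```rust
/// Simplified check: a language subtag of 2 or 3 ASCII letters.
pub fn is_language_subtag(s: &str) -> bool {
    (2..=3).contains(&s.len()) && s.chars().all(|c| c.is_ascii_alphabetic())
}

/// Simplified check: a 2-letter region, returned in upper case.
pub fn canonical_region(s: &str) -> Option<String> {
    if s.len() == 2 && s.chars().all(|c| c.is_ascii_alphabetic()) {
        Some(s.chars().map(|c| c.to_ascii_uppercase()).collect())
    } else {
        None
    }
}
```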
