

rust-language-tags


Language tags can be used to identify human languages, scripts (e.g. Latin script), countries, and other regions.

Language tags are defined in BCP 47; an introduction is "Language tags in HTML and XML" by the W3C. They are commonly used in HTML and in the HTTP Content-Language and Accept-Language header fields.

This package currently supports parsing (with a fully conformant parser), formatting, and comparing language tags.

Examples

Create a simple language tag representing the French language as spoken in Belgium and print it:

use language_tags::LanguageTag;
let langtag = LanguageTag::parse("fr-BE").unwrap();
assert_eq!(format!("{}", langtag), "fr-BE");

Parse a tag representing a special type of English specified by private agreement:

use language_tags::LanguageTag;
use std::iter::FromIterator;
let langtag: LanguageTag = "en-x-twain".parse().unwrap();
assert_eq!(langtag.primary_language(), "en");
assert_eq!(Vec::from_iter(langtag.private_use_subtags()), vec!["twain"]);

You can check for equality, but more often you should test whether two tags match. In this example we check whether a resource in German is suitable for a user from Austria. While speakers of Austrian German normally understand standard German, the opposite is not always true. So the resource can be presented to the user, but if the resource were in de-AT and the user asked for a representation in de, the request should be rejected.

use language_tags::LanguageTag;
let langtag_server = LanguageTag::parse("de-AT").unwrap();
let langtag_user = LanguageTag::parse("de").unwrap();
assert!(langtag_user.matches(&langtag_server));

Related crates

If you only want to validate and normalize the formatting of language tags or you are working with RDF consider using the oxilangtag crate. It is much more lightweight as it doesn't contain a language tag database and has a very similar interface to this crate.

rust-language-tags's People

Contributors

alyoshavasilieva, atouchet, decathorpe, jturner314, mgeisler, nox, pyfisch, tpt


rust-language-tags's Issues

Attribute citation

From what I've seen, only the docstring of LanguageTag itself includes a portion of RFC 5646 verbatim.

License-wise, this is probably not ok. RFC 5646 states at the beginning:

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info).

Given that it was published in September 2009, I had a look at the Trust Legal Provisions 3.0 from Sept 12, 2009.
The relevant portion here is

[3.c]. Licenses For Use Outside the IETF Standards Process. In addition to the rights
granted with respect to Code Components described in Section 4 below, the IETF Trust hereby
grants to each person who wishes to exercise such rights, to the greatest extent that it is permitted
to do so, a non-exclusive, royalty-free, worldwide right and license under all copyrights and
rights of authors:
  iii. to copy, publish, display and distribute unmodified portions of IETF Contributions and IETF Documents and translations thereof, provided that:
    (x) each such portion is clearly attributed to IETF and identifies the RFC or other IETF Document or IETF Contribution from which it is taken,
    (y) all IETF legends, legal notices and indications of authorship contained in the original IETF RFC must also be included where any substantial portion of the text of an IETF RFC, and in any event where more than one-fifth of such text, is reproduced in a single document or series of related documents.

Since (from what I can tell) documentation doesn't count as Code Components, you need to attribute that portion of the RFC properly.
(I'm pretty sure that that single paragraph isn't a substantial portion, so (y) probably doesn't apply.)

Of course, the paragraph does come after a link to the original document, which could maybe be seen as enough, but, at least to me, it wasn't obvious that that paragraph was a citation before I looked it up.

Disclaimer: I am not a lawyer and am only going off my layman's understanding obtained from simply reading the legal documents.

language tag validation and dependency

I have a work-in-progress CLDR crate which may be of use to this project. With CLDR data, you can perform validation and substitution/replacement of subtags. Do you have any interest in using the CLDR crate as an optional dependency for validation?

In addition, I want to use LanguageTags as keys for data lookup in CLDR data, but I don't know how to deal with circular dependencies (if language-tags uses cldr to perform validation, and cldr uses language-tags to lookup locale-specific data). Thoughts?

New release?

There have been a fair number of changes in this repo since 0.2.2, but they never made it into a released version.

Deriving Ord for LanguageTag

I need to use LanguageTag as a key in a BTreeMap; would it be possible to add PartialOrd and Ord to the derive list of that struct?
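Until such derives exist, one workaround is to key the map by the tag's string form instead; a minimal std-only sketch (plain String keys, not the crate's API):

```rust
use std::collections::BTreeMap;

// Workaround sketch: use the tag's canonical string form as the
// BTreeMap key rather than the LanguageTag struct itself.
fn insert_tag(map: &mut BTreeMap<String, u32>, tag: &str, value: u32) {
    map.insert(tag.to_string(), value);
}

fn main() {
    let mut map = BTreeMap::new();
    insert_tag(&mut map, "de-AT", 1);
    insert_tag(&mut map, "de", 2);
    // Keys iterate in lexicographic order: "de" sorts before "de-AT".
    assert_eq!(map.keys().next().map(String::as_str), Some("de"));
}
```

The trade-off is that ordering is plain byte-wise string order, not any BCP 47-aware ordering.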

Use cast to usize

Hi, this is just a tip; I saw your comment on the RFC. As long as you use a simple "C-like" enum (just labels, with no contained values in any variant), you can cast the enum variants to integers directly. You can use this to implement Ord/Eq much more efficiently.

enum Example {
    A,
    B,
    C,
}

fn example() {
    let a = Example::A as usize; // 0
    let b = Example::B as usize; // 1
}

Shouldn't zh-CN fail to validate?

I don't see zh-CN in https://datatracker.ietf.org/doc/html/rfc5646#appendix-A, but it seems to pass in this test:

assert_eq!(("zh", "zh", None, vec![]), parts(&"zh-CN".parse().unwrap()));

But browsers like Firefox and Chromium don't seem to honor the tag: when I use zh-CN, the custom font I set for zh is not applied, whereas zh-Hans-CN and zh-Hans both work.

I am currently looking for a solution for getzola/zola#2169.

Normalization of language tags?

For my use case it would be useful if I could normalize language tags into a common format. What I want to do is receive general LanguageTags through the API (like en-UK), but only use the language name (en) for storage and comparison (because I don't want to deal with too many possible values for now).

I think a good way to implement this would be a new struct PrimaryLanguageTag, which implements the same traits as LanguageTag and can be converted back and forth with that type. Then the type checker would ensure that only basic language tags end up in the database, and we can't easily forget the conversion.
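As an illustration of the normalization being proposed, here is a std-only sketch that keeps just the primary language subtag (the helper name is hypothetical, not part of the crate):

```rust
// Hypothetical helper: reduce a full language tag to its primary
// language subtag by splitting on '-' and keeping the first piece.
fn primary_language(tag: &str) -> &str {
    tag.split('-').next().unwrap_or(tag)
}

fn main() {
    assert_eq!(primary_language("en-GB"), "en");
    assert_eq!(primary_language("de"), "de");
}
```

A PrimaryLanguageTag newtype would wrap this kind of reduction behind the type system so the conversion can't be forgotten.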

BCP 47 Extensions Parsing

Problem:

Currently, the language-tags crate does not support parsing of Unicode Locale Extensions (u and t).
These extensions are crucial for specifying additional information about language behavior, such as collation order, numbering systems, and calendar preferences.

Example:

A valid BCP 47 tag with a "u" extension might look like this:

de-DE-u-co-phonebk

In this example:

  • de-DE specifies German language as used in Germany.
  • u indicates the start of the Unicode Locale Extension.
  • co is the key for collation, with phonebk specifying the collation type (phonebook order).
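The structure described above can be pulled apart with plain string handling; a minimal std-only sketch (simplified: it assumes single-subtag values and performs no validation):

```rust
// Sketch: extract (key, value) pairs from the "u" extension of a
// BCP 47 tag by plain string splitting. Simplified: assumes each key
// has exactly one value subtag and does not validate subtag lengths.
fn unicode_extension_pairs(tag: &str) -> Vec<(String, String)> {
    let subtags: Vec<&str> = tag.split('-').collect();
    let mut pairs = Vec::new();
    if let Some(pos) = subtags.iter().position(|s| s.eq_ignore_ascii_case("u")) {
        let mut rest = subtags[pos + 1..].iter();
        while let Some(key) = rest.next() {
            if key.len() == 1 {
                break; // a new singleton starts another extension
            }
            if let Some(value) = rest.next() {
                pairs.push((key.to_string(), value.to_string()));
            }
        }
    }
    pairs
}

fn main() {
    let pairs = unicode_extension_pairs("de-DE-u-co-phonebk");
    assert_eq!(pairs, vec![("co".to_string(), "phonebk".to_string())]);
}
```

First-class support in the crate would of course also validate keys and values against the registry rather than just splitting strings.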

Request:

Please consider adding support for parsing and validating these extensions in the language-tags crate. This feature would enhance the crate's utility for developers working on internationalization and localization features.

I would be interested in contributing to this feature if you decide to move forward, depending on my availability.

Resources:

You can find the BCP 47 extensions description here.

Additionally, here is the official BCP 47 Unicode Extensions description.

Bug in stringification of the language tag

Looking at

if self.language.is_none() {

I think there's a bug. If the tag has an empty language subtag but contains region/script etc., it should serialize to "und-Latn" etc. But if it also contains a private use tag, it should serialize to "und-Latn-x-foo".

Instead of just checking for the language subtag, you should probably check whether any subtag has been formatted before.
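A sketch of the suggested behavior, using a hypothetical standalone helper rather than the crate's actual Display impl:

```rust
// Hypothetical helper illustrating the fix: when the language subtag
// is empty, fall back to "und" so the remaining subtags still
// serialize into a well-formed tag.
fn format_tag(language: Option<&str>, script: Option<&str>, private: Option<&str>) -> String {
    let mut out = String::from(language.unwrap_or("und"));
    if let Some(s) = script {
        out.push('-');
        out.push_str(s);
    }
    if let Some(p) = private {
        out.push_str("-x-");
        out.push_str(p);
    }
    out
}

fn main() {
    assert_eq!(format_tag(None, Some("Latn"), None), "und-Latn");
    assert_eq!(format_tag(None, Some("Latn"), Some("foo")), "und-Latn-x-foo");
}
```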

Feature for serde (de)serialize?

It would be very helpful for use in a JSON API (and many other cases). I'm trying to implement it myself, but it seems rather complicated, so direct support would be nice.

List of supported Languages?

For my use case it would be useful if the library could expose a list of all supported languages, so that users can pick their spoken languages from it. It should also include the name of each language (without localization, so es -> Español, ko -> 한국어, etc.). For now I only care about the primary language.

I couldn't find any library with such a list in a quick search on crates.io, and I think it makes sense to include it here, as you already have the list of languages.

please update / include LICENSE file(s) to match project relicensing to dual MIT/Apache-2.0

It looks like the pull request that updated the SPDX license tag in the project's Cargo.toml file (#18) didn't also update the LICENSE files to match. It would be great if you could rename the LICENSE file to LICENSE-MIT and include the Apache-2.0 license text as LICENSE-APACHE. This matches what other dual-licensed Rust projects do (for example, serde_json).

Additionally, you should probably be aware that your published crates are currently, technically, not in compliance with your own license terms. Both the MIT and Apache-2.0 licenses require that redistributed sources contain a copy of the actual license text, but the crates that were published and are redistributed by crates.io haven't satisfied that requirement since language-tags version 0.3.0.

Setters

There has been a regression of functionality between 0.2 and 0.3: since the fields are no longer public, I can no longer mutate the struct, and for my use case I need to be able to do that.

If you do not wish for the fields to be public, would it be possible to implement setters (i.e. set_script, etc) for the fields?
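What such setters could look like, sketched on a simplified stand-in struct (hypothetical; the real LanguageTag stores its subtags differently):

```rust
// Simplified stand-in for LanguageTag, to illustrate the requested
// setter pattern without exposing the fields directly.
struct Tag {
    language: String,
    script: Option<String>,
}

impl Tag {
    // Setter in the style the issue asks for (set_script, etc.).
    fn set_script(&mut self, script: Option<&str>) {
        self.script = script.map(str::to_string);
    }
}

fn main() {
    let mut tag = Tag { language: "de".to_string(), script: None };
    tag.set_script(Some("Latn"));
    assert_eq!(tag.language, "de");
    assert_eq!(tag.script.as_deref(), Some("Latn"));
}
```

Setters keep the fields private while still allowing mutation, and give the crate a place to validate new values before accepting them.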

Add CHANGELOG.md

I'm in the process of updating a crate that depended on language-tags 0.2 to the latest version (0.3.2). The lack of a CHANGELOG.md file in the repo with a summary of changes makes this harder. I'll deal with my issue by checking the commits directly, but I still recommend keeping a changelog so it's easier to track the evolution of this lib.

You don't have to backfill entries for previous versions, but having it for future releases would be great.

Implement benchmark

It would be nice to have some benchmarks to get an idea of the performance impact of changes to the library. For example, it would be nice to see whether the IANA subtag registry integration slows down validation/normalization too much, and whether a better data structure or algorithm would provide a significant speed-up.

I would suggest using bencher for that (it is the one used in the url crate).

Converting from ISO 639-1 and ISO 639-2

I'm currently trying to compare language tags from two sources, in different formats that can be considered ISO 639-1 and ISO 639-3.
I'm converting these language tags to BCP 47 so that I have something standard across my whole dataset, then testing for equality.

The following code serves as an example:

let zh_6391 = LanguageTag::parse("zh");
let zh_6392 = LanguageTag::parse("zho");
assert_eq!(zh_6391, zh_6392); // fails

While I understand why it fails, I was wondering if there would be some way of converting BCP47 tags into a more canonical form. I thought that canonicalize would help but the test still fails (expectedly).

I have checked the Python library langcodes, and it looks like they advise using two-letter codes (based on the BCP 47 spec?); the library automagically converts ISO 639-2 into ISO 639-1 when possible, keeping a three-letter code when not.

zh_6391 = langcodes.Language.get("zh")
zh_6392 = langcodes.Language.get("zho")

assert zh_6391 == zh_6392

print(langcodes.Language.get("lij")) # prints lij
print(langcodes.Language.get("zho")) # prints zh

Would such a feature be in the scope of this library? And if so, I'd be glad to try to tackle the problem :)
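For illustration, the kind of mapping being asked for can be sketched as a std-only lookup (tiny demo table; a real implementation would need the full ISO 639 registry, which has hundreds of entries):

```rust
// Demo sketch: canonicalize a three-letter ISO 639-2/3 code to its
// two-letter ISO 639-1 equivalent when one exists, mirroring what
// langcodes does. Only a handful of entries for illustration.
fn to_639_1(code: &str) -> &str {
    match code {
        "zho" => "zh",
        "deu" | "ger" => "de",
        "fra" | "fre" => "fr",
        // Keep the three-letter code when no two-letter form exists
        // (e.g. "lij" for Ligurian).
        _ => code,
    }
}

fn main() {
    assert_eq!(to_639_1("zho"), "zh");
    assert_eq!(to_639_1("lij"), "lij");
}
```

With such a mapping applied before comparison, "zh" and "zho" would canonicalize to the same tag.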

Figure out what `LanguageTag::extension_subtags` should return

Currently, on the tag "en-r-az-qt" the LanguageTag::extension_subtags method returns [("r", "az"), ("r", "qt")]. This is not nice at all, because the same value would also be returned for "en-r-az-r-qt". We have three options here:

  1. Keep the method as it is, returning impl Iterator<Item = (&str, &str)>
  2. Make it return [("r", "az-qt")]; users could then call .split('-') on the value to recover its component subtags (same return type)
  3. Make it return [("r", ["az", "qt"])], i.e. the type impl Iterator<Item = (&str, impl Iterator<Item = &str>)>.

I have a small preference for option 2, which is both correct and lets API users do what they prefer with the value.
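Option 2 in use, sketched with plain std string handling:

```rust
// Under option 2, the iterator would yield ("r", "az-qt") once, and
// the caller splits the value to recover the component subtags.
fn extension_components(value: &str) -> Vec<&str> {
    value.split('-').collect()
}

fn main() {
    let (singleton, value) = ("r", "az-qt");
    assert_eq!(singleton, "r");
    assert_eq!(extension_components(value), vec!["az", "qt"]);
}
```

This keeps the return type simple while making "en-r-az-qt" and "en-r-az-r-qt" distinguishable.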
