elixir-gettext / expo

Low-level Elixir parser for GNU Gettext files (PO, POT, MO).

License: Apache License 2.0

Elixir 95.85% Erlang 4.15%

elixir gettext po mo pot parser

expo's Issues

Support Mo Write

Typespec for `msgctxt` is incorrect

Hello! When switching from the parsers inside Gettext to using Expo, I noticed that the typespec for msgctxt is incorrect.

expo/lib/expo/message.ex

Line 18 in e03b873

@type msgctxt :: String.t()

It should be:

  @type msgctxt :: [String.t(), ...]

When a message has a msgctxt, it will be a list of strings, not a single string -- just like the msgid and msgstr typespecs above. This can be verified elsewhere in this repo in several tests (example). I've provided an example (using v0.4.1) here as well:

msgid "single without context"
msgstr "without"

msgctxt "context single"
msgid "single with context"
msgstr "with"

msgid "singular form without context"
msgid_plural "plural form without context"
msgstr[0] "one without"
msgstr[1] "some without"

msgctxt "context plural"
msgid "singular form with context"
msgid_plural "plural form with context"
msgstr[0] "one with"
msgstr[1] "some with"

iex(22)> Expo.PO.parse_file!("context.po")         
%Expo.Messages{
  headers: [],
  messages: [
    #Expo.Message.Singular<
      msgid: ["single without context"],
      msgstr: ["without"],
      msgctxt: nil,
      comments: [],
      extracted_comments: [],
      flags: [],
      previous_messages: [],
      references: [],
      obsolete: false,
      ...
    >,
    #Expo.Message.Singular<
      msgid: ["single with context"],
      msgstr: ["with"],
      msgctxt: ["context single"],
      comments: [],
      extracted_comments: [],
      flags: [],
      previous_messages: [],
      references: [],
      obsolete: false,
      ...
    >,
    #Expo.Message.Plural<
      msgid: ["singular form without context"],
      msgid_plural: ["plural form without context"],
      msgstr: %{0 => ["one without"], 1 => ["some without"]},
      msgctxt: nil,
      comments: [],
      extracted_comments: [],
      flags: [],
      previous_messages: [],
      references: [],
      obsolete: false,
      ...
    >,
    #Expo.Message.Plural<
      msgid: ["singular form with context"],
      msgid_plural: ["plural form with context"],
      msgstr: %{0 => ["one with"], 1 => ["some with"]},
      msgctxt: ["context plural"],
      comments: [],
      extracted_comments: [],
      flags: [],
      previous_messages: [],
      references: [],
      obsolete: false,
      ...
    >
  ],
  top_comments: [],
  file: "context.po"
}

Thanks!
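
As a small illustration of the shape described above, here is a minimal sketch of code that consumes a `msgctxt` value that is either `nil` or a non-empty list of strings. The module `ContextSketch` and the function `context_string/1` are hypothetical names used only for this example and are not part of Expo.

```elixir
defmodule ContextSketch do
  # Hypothetical helper, not part of Expo: it only mirrors the shape seen
  # in the parse output above (nil, or a non-empty list of strings).
  @type msgctxt :: [String.t(), ...] | nil

  @spec context_string(msgctxt) :: String.t() | nil
  def context_string(nil), do: nil
  # Adjacent PO string parts are concatenated into one binary.
  def context_string([_ | _] = parts), do: IO.iodata_to_binary(parts)
end

IO.inspect(ContextSketch.context_string(["context single"]))
IO.inspect(ContextSketch.context_string(nil))
```

Code that pattern-matches on a bare binary for `msgctxt` would not match the parsed values, which is why the typespec matters to callers and to Dialyzer.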

Performance Comparison with Gettext

⚠️ Currently, this library is not performance optimized at all.

Based on: https://github.com/jshmrtn/expo/tree/performance_comparisor/performance_test

read.exs

Operating System: Linux
CPU Information: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Number of Available Cores: 8
Available memory: 46.77 GB
Elixir 1.13.3
Erlang 24.3.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 42 s

Benchmarking Expo.Parser.Mo.parse ...
Benchmarking Expo.Parser.Po.parse ...
Benchmarking Gettext.PO.parse_string ...

Name                              ips        average  deviation         median         99th %
Expo.Parser.Mo.parse           525.98        1.90 ms    ±24.75%        1.81 ms        2.80 ms
Gettext.PO.parse_string        116.38        8.59 ms    ±10.98%        8.77 ms       10.80 ms
Expo.Parser.Po.parse            62.41       16.02 ms    ±13.50%       15.61 ms       23.78 ms

Comparison: 
Expo.Parser.Mo.parse           525.98
Gettext.PO.parse_string        116.38 - 4.52x slower +6.69 ms
Expo.Parser.Po.parse            62.41 - 8.43x slower +14.12 ms

Memory usage statistics:

Name                       Memory usage
Expo.Parser.Mo.parse            1.57 MB
Gettext.PO.parse_string        10.35 MB - 6.59x memory usage +8.78 MB
Expo.Parser.Po.parse           45.78 MB - 29.12x memory usage +44.21 MB

**All measurements for memory usage were the same**

write.exs

Operating System: Linux
CPU Information: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Number of Available Cores: 8
Available memory: 46.77 GB
Elixir 1.13.3
Erlang 24.3.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 42 s

Benchmarking Expo.Composer.Mo.compose ...
Benchmarking Expo.Composer.Po.compose ...
Benchmarking Gettext.PO.dump ...

Name                               ips        average  deviation         median         99th %
Expo.Composer.Mo.compose       2354.75        0.42 ms    ±19.23%        0.40 ms        0.84 ms
Gettext.PO.dump                 148.69        6.73 ms    ±17.47%        6.54 ms        9.87 ms
Expo.Composer.Po.compose        136.89        7.30 ms    ±34.08%        6.64 ms       16.90 ms

Comparison: 
Expo.Composer.Mo.compose       2354.75
Gettext.PO.dump                 148.69 - 15.84x slower +6.30 ms
Expo.Composer.Po.compose        136.89 - 17.20x slower +6.88 ms

Memory usage statistics:

Name                        Memory usage
Expo.Composer.Mo.compose         0.50 MB
Gettext.PO.dump                  3.59 MB - 7.19x memory usage +3.10 MB
Expo.Composer.Po.compose         3.81 MB - 7.63x memory usage +3.31 MB

**All measurements for memory usage were the same**

Comparison is based on the following gettext file and its mo counterpart: https://github.com/jshmrtn/hygeia/blob/4f08c2b68f5de8cad6a84b9d4a0b01be63a7c32c/priv/gettext/de/LC_MESSAGES/default.po

It contains 6'355 lines of po content for 1'398 translations + the header.

String Formatting

Support restructuring of message strings.

Right now, the only option is to preserve the line splitting:

- .po (read & write) should produce an identical file
- .mo does not contain line information, everything is one single string

TODO

- Introduce rebalance_strings function on translation struct
- Introduce rebalance_strings function on all translations (including headers)

Behaviour of `rebalance_strings`

Fields
- msgid
- msgid_plural
- msgstr
- headers
Split at newlines and put every line in its own string
Split words at maxlength?
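
The "split at newlines" rule could be sketched roughly as follows. This is only an illustration under assumptions: the module name `RebalanceSketch` is hypothetical, the function works on a plain list of strings rather than on Expo's structs, and the open question about splitting words at a maximum length is not addressed.

```elixir
defmodule RebalanceSketch do
  # Hypothetical sketch, not Expo's implementation.
  # Joins all parts, then emits one string per line, keeping the "\n"
  # at the end of every line that had one.
  @spec rebalance_strings([String.t()]) :: [String.t()]
  def rebalance_strings(strings) when is_list(strings) do
    lines =
      strings
      |> IO.iodata_to_binary()
      |> String.split("\n")
      |> reattach_newlines()

    # Keep at least one (empty) string so an empty message stays representable.
    if lines == [], do: [""], else: lines
  end

  # The last chunk has no trailing newline; drop it when it is empty.
  defp reattach_newlines([""]), do: []
  defp reattach_newlines([last]), do: [last]
  defp reattach_newlines([line | rest]), do: [line <> "\n" | reattach_newlines(rest)]
end

IO.inspect(RebalanceSketch.rebalance_strings(["hello ", "world\nsecond line"]))
```

Concatenating the result always yields the original text, so a rebalanced message would compose to the same .mo content; only the line layout in the .po file changes.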

Improve Error Message with invalid entry structure

https://github.com/jshmrtn/expo/blob/363630bb98f4f71d48a0400e4a449564a0aa9668/test/expo/parser/po_test.exs#L214

Incorporate “Dump flags after references in PO files”

elixir-gettext/gettext#310 (review)

Support Po Write

Provide compile time version of `PluralForms.index/2`

Evaluating the index for every call is very expensive.

Therefore a macro PluralForms.compile_index/1 should be provided that converts the plural expression into Elixir AST.
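
To make the idea concrete, here is a minimal sketch of turning a plural expression into Elixir code once, at compile time. Everything in it is an assumption made for illustration: the tuple-based mini-AST, the module names `PluralSketch` and `FrenchPluralSketch`, and the use of unquote fragments in place of the proposed `compile_index/1` macro. It does not reflect Expo's internal representation of plural forms.

```elixir
defmodule PluralSketch do
  # Hypothetical mini-AST: integers, :n, {:!=, l, r}, {:>, l, r}
  # and {:if, condition, on_true, on_false}.
  # `var` is the quoted variable that stands for the plural count.
  def to_quoted(:n, var), do: var
  def to_quoted(int, _var) when is_integer(int), do: int

  def to_quoted({:!=, left, right}, var) do
    quote(do: unquote(to_quoted(left, var)) != unquote(to_quoted(right, var)))
  end

  def to_quoted({:>, left, right}, var) do
    quote(do: unquote(to_quoted(left, var)) > unquote(to_quoted(right, var)))
  end

  def to_quoted({:if, condition, on_true, on_false}, var) do
    quote do
      if unquote(to_quoted(condition, var)) do
        unquote(to_quoted(on_true, var))
      else
        unquote(to_quoted(on_false, var))
      end
    end
  end
end

defmodule FrenchPluralSketch do
  # Stands for the plural rule "n > 1 ? 1 : 0", written in the mini-AST.
  n = Macro.var(:n, __MODULE__)
  body = PluralSketch.to_quoted({:if, {:>, :n, 1}, 1, 0}, n)

  # The expression is expanded into the function body when this module
  # is compiled, so no expression tree is walked at call time.
  def index(unquote(n)), do: unquote(body)
end

IO.inspect(Enum.map([0, 1, 2, 5], &FrenchPluralSketch.index/1))
```

The design point is that the tree walk happens once during compilation, while each call to `index/1` runs only the generated comparison.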

Fails to parse lines starting with `#~ ##`

== Compilation error in file lib/pleroma/web/gettext.ex ==
** (Expo.PO.SyntaxError) priv/gettext/en_test/LC_MESSAGES/static_pages.po:16: unexpected token: "#" (codepoint U+0023)
    (expo 0.1.0) lib/expo/po.ex:171: Expo.PO.parse_file!/2
    (gettext 0.21.0) lib/gettext/compiler.ex:504: Gettext.Compiler.compile_po_file/5
    (gettext 0.21.0) lib/gettext/compiler.ex:449: Gettext.Compiler.compile_unified_po_file/4
    (elixir 1.11.4) lib/enum.ex:1411: Enum."-map/2-lists^map/1-0-"/2
    (elixir 1.11.4) lib/enum.ex:1411: Enum."-map/2-lists^map/1-0-"/2
    (gettext 0.21.0) expanding macro: Gettext.Compiler.__before_compile__/1
    lib/pleroma/web/gettext.ex:5: Pleroma.Web.Gettext (module)

Full log: https://git.pleroma.social/pleroma/pleroma/-/jobs/227931

Permalink to priv/gettext/en_test/LC_MESSAGES/static_pages.po:16: https://git.pleroma.social/pleroma/pleroma/-/blob/2a244b391d8c1d9d8e960532758110928cb5ef7c/priv/gettext/en_test/LC_MESSAGES/static_pages.po#L16

Expose Line Information in Translation Structs

If parsing from a .po or .pot file, add line information to the struct.

Usage: https://github.com/elixir-gettext/gettext/blob/f16cb4542687c349326f3a0fc62c3e8d1867f189/lib/gettext/compiler.ex#L622

Support multiline msgid and msgstr

I believe the current Gettext parser (and the Gettext 'standard') support multiline messages. Currently they do not parse:

iex(4)> Expo.Parser.Po.parse """
...(4)> msgid "hello            
...(4)> beautiful"              
...(4)> msgstr "ciao            
...(4)> bella"
...(4)> """
{:error,
 {:parse_error, "did not expect newline inside string",
  "\nbeautiful\"\nmsgstr \"ciao\nbella\"\n", 1}}

Ubuntu 18.04 no longer supported in CI

Ubuntu 18.04 images are no longer supported: https://github.blog/changelog/2022-08-09-github-actions-the-ubuntu-18-04-actions-runner-image-is-being-deprecated-and-will-be-removed-by-12-1-22/

This causes our CI to fail: https://github.com/elixir-gettext/expo/actions/runs/4853307131

I think the idea was to test the oldest possible version combination and the newest one.

Do we want to make sure 21.3 can install on Ubuntu 20.04, or should we just raise the minimum requirements?

Failing CI for OTP 23.3

Reported upstream: erlef/setup-beam#175

Multi line strings for plural messages in PO files lead to syntax error

Hi, I just stumbled over this problem, where multi line strings don't work in plural messages.

Multi line strings in singular messages, as well as single line strings in plural messages, work fine:

msgid "a"
msgstr "This is a"
"multi line string"

msgid "b"
msgid_plural "b_plural"
msgstr[0] "single line"
msgstr[1] "single line"

iex(7)> Expo.PO.parse_file!("good.po")
%Expo.Messages{
  headers: [],
  messages: [
    #Expo.Message.Singular<
      msgid: ["One participation request for event %{title} to process"],
      msgstr: ["a", "a"],
      msgctxt: nil,
      comments: [],
      extracted_comments: [],
      flags: [],
      previous_messages: [],
      references: [],
      obsolete: false,
      ...
    >,
    #Expo.Message.Plural<
      msgid: ["One participation request for event %{title} to process"],
      msgid_plural: ["One participation request for event %{title} to process"],
      msgstr: %{0 => ["a"], 1 => ["a"]},
      msgctxt: nil,
      comments: [],
      extracted_comments: [],
      flags: [],
      previous_messages: [],
      references: [],
      obsolete: false,
      ...
    >
  ],
  top_comments: [],
  file: "good.po"
}

But the combination of plural message and multi line string does not:

msgid "a"
msgid_plural "a_plural"
msgstr[0] "single line"
msgstr[1] "This is a"
"multi line string"

iex(8)> Expo.PO.parse_file!("bad.po") 
** (Expo.PO.SyntaxError) bad.po:5: syntax error before: "multi line string"
    (expo 0.4.0) lib/expo/po.ex:171: Expo.PO.parse_file!/2
    iex:8: (file)

But from my understanding, both files should be valid .po files (plus the necessary headers, of course)?

I already tried to look into the parsing logic, but realised that Elixir is still pretty new to me.