apertium / lttoolbox
Finite state compiler, processor and helper tools used by apertium
Home Page: http://wiki.apertium.org/wiki/Lttoolbox
License: GNU General Public License v2.0
These could be copied over from lt-proc --help.
The corpus test with the en-es pair yields different results for the master branch and the weighted branch. The difference arises in the apertium-tagger part of the pipeline, and after much investigation I have found that the sizes of the transducers for the two branches differ:
Transducer size = 591
Transducer size = 7
Either (1) update the code so that we never get these errors/warnings:
Error: Invalid dictionary (hint: the left side of an entry is empty)
Error: Invalid dictionary (hint: entry beginning with whitespace)
Or (2) give an example string so that the problem can be more easily diagnosed.
If export LC_ALL=C.utf8, then make test fails with:
runTest (lt_print.NonWeightedFst) ... FAIL
======================================================================
FAIL: runTest (lt_print.NonWeightedFst)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/lttoolbox/tests/printtest.py", line 62, in runTest
self.assertEqual(self.communicateFlush(), self.expectedOutput)
AssertionError: u'0\t1\tv\tv\t0.000000\t\n1\t2\ti\ti\t0.000000\t\n2\t3\th\th\t0.000000\t\n3\t4\t [truncated]... != u'0\t1\tv\tv\t0.000000\t\n1\t2\ti\ti\t0.000000\t\n2\t3\th\th\t0.000000\t\n3\t4\t [truncated]...
0 1 v v 0.000000
1 2 i i 0.000000
2 3 h h 0.000000
3 4 k k 0.000000
4 5 i i 0.000000
5 6 <KEPT> <KEPT> 0.000000
- 6 10 \u03b5 \u03b5 8238976959774720.000000
+ 6 10 \u03b5 \u03b5 0.000000
6 7 <MATCHSOFAR> <MATCHSOFAR> 0.000000
7 8 <STILLMATCHING> <STILLMATCHING> 0.000000
8 9 <NONMATCHL> <NONMATCHR> 0.000000
- 9 10 \u03b5 \u03b5 8238976959774720.000000
+ 9 10 \u03b5 \u03b5 0.000000
10 0.000000
If export LC_ALL=en_US.utf8, then it passes. How the hell the locale has an influence on weights, I don't yet know.
(Also, if LC_ALL is not a UTF-8 locale then all weight tests fail because they use Unicode characters, but that's expected.)
Lttoolbox generates forms, but fails to analyze them.
For example:
$ echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata
But there's no such form in the morphological analyser:
$ echo "aguata" | apertium -d . grn-morph
^aguata/*aguata$^./.<sent>$
Although some forms are analysed correctly:
$ echo "ndaguatái" | apertium -d . grn-morph
^ndaguatái/nd<neg>+a<prn><p1><sg>+guata<v><iv><pres>+i<neg>$^./.<sent>$
We will be very grateful if you fix this.
Something like:
<section id="main" type="standard">
<e w="0.6"><p><l>estación<s n="n"/><s n="f"/></l><r>station<s n="n"/></r></p></e>
<e w="0.4><p><l>estación<s n="n"/><s n="f"/></l><r>season<s n="n"/></r></p></e>
</section>
We need to implement a way to represent infinite weights.
The current outcome is strange!
$ cat sample.att
0 1 a b 2
1 2 b c 1
1 2 c d inf
2 0
$ lt-comp lr sample.att sa.bin
main@standard 3 3
$ lt-print sa.bin
0 1 a b 1.000000
1 2 b c 2.000000
1 2 c d -2.000000
2 0.000000
If you have legge# opp til in your monolingual analyser, and try to analyse the input legge opp<br/>blah in HTML format, lt-proc will shift the <br/> into the middle of the analysis:
$ echo 'legge opp<br/>blah' |apertium-deshtml
legge opp[<br\/>]blah.[][
]
↑ here it's still at the end
$ echo 'legge opp<br/>blah' |apertium-deshtml |lt-proc -we ../apertium-nno-nob/nob-nno.automorf.bin
^legge/legge<vblex><inf>$[<br\/>]^opp/opp<pr>/opp<adv>/oppe<vblex><imp>$ ^blah/*blah$^./.<sent><clb>$[][
]
but ↑ here it's in the middle of the multiword.
From the code, it seems like what happens is that we
(1) read legge; we've now seen a nonalphabetic after a final, so the index last=6 and lf=/legge<vblex><inf>;
(2) read legge opp[<br/>], where we still don't know if we'll see til at the right, so [<br/>] ends up in blankqueue;
(3) read b, meaning we can't go further in that MWE, so we have to skip back to the last full analysis:
(4) printWord with surface form legge;
(5) printSpace, which completely flushes blankqueue if there is one, and otherwise outputs a space.

A double-quote token gets a simple " analysis.
$ echo '"' | lt-proc eng.automorf.bin
"
^"/"<dquotes>$
Would it be better if we add the double quotes to the .dix
files?
Other missing characters:
°
°C could also be handled, instead of getting the °^C/*C$ analysis.
– (Unicode decimal value: 8211)

Unzip softhyph.zip. The form i is missing below:
$ lt-proc -we nob-nno.automorf.bin < softhyph
^/i<pr>/ialphabet<n><m><sg><ind>$ ^xyzzy/*xyzzy$
The third character in the input file is a soft hyphen (utf8 bytes C2AD):
$ hexdump -C softhyph
00000000 69 20 c2 ad 78 79 7a 7a 79 0a |i ..xyzzy.|
0000000a
Remove the soft hyphen, and it gives the expected
^i/i<pr>/ialphabet<n><m><sg><ind>$ ^xyzzy/*xyzzy$
This feature will allow weighting of FSTs given a tagged corpus.
Testing the uploaded lttoolbox-3.5.1.tar.gz on OpenBSD -current.
Using /ptmp/pobj/lttoolbox-3.5.1/config.site (generated)
configure: WARNING: unrecognized options: --disable-gtk-doc
configure: loading site script /ptmp/pobj/lttoolbox-3.5.1/config.site
checking build system type... x86_64-unknown-openbsd6.6
checking host system type... x86_64-unknown-openbsd6.6
checking target system type... x86_64-unknown-openbsd6.6
checking for a BSD-compatible install... /ptmp/pobj/lttoolbox-3.5.1/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... mkdir -p
checking for gawk... (cached) awk
checking whether make sets $(MAKE)... (cached) yes
checking whether make supports nested variables... yes
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... (cached) o
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether c++ accepts -g... (cached) yes
checking for style of include used by make... GNU
checking dependency style of c++... gcc3
checking how to print strings... print -r
checking for gcc... cc
checking whether we are using the GNU C compiler... (cached) yes
checking whether cc accepts -g... (cached) yes
checking for cc option to accept ISO C89... none needed
checking whether cc understands -c and -o together... yes
checking dependency style of cc... gcc3
checking for a sed that does not truncate output... (cached) /usr/bin/sed
checking for grep that handles long lines and -e... (cached) /usr/bin/grep
checking for egrep... (cached) /usr/bin/egrep
checking for fgrep... (cached) /usr/bin/fgrep
checking for ld used by cc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... (cached) 131072
checking how to convert x86_64-unknown-openbsd6.6 file names to x86_64-unknown-openbsd6.6 format... func_convert_file_noop
checking how to convert x86_64-unknown-openbsd6.6 file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... match_pattern /lib[^/]+(\.so\.[0-9]+\.[0-9]+|\.so|_pic\.a)$
checking for dlltool... no
checking how to associate runtime and link libraries... print -r --
checking for ar... (cached) ar
checking for archiver @FILE support... @
checking for strip... (cached) strip
checking for ranlib... (cached) ranlib
checking command to parse /usr/bin/nm -B output from cc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... cc -E
checking for ANSI C header files... (cached) yes
checking for sys/types.h... (cached) yes
checking for sys/stat.h... (cached) yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for memory.h... (cached) yes
checking for strings.h... (cached) yes
checking for inttypes.h... (cached) yes
checking for stdint.h... (cached) yes
checking for unistd.h... (cached) yes
checking for dlfcn.h... (cached) yes
checking for objdir... .libs
checking if cc supports -fno-rtti -fno-exceptions... yes
checking for cc option to produce PIC... -fPIC -DPIC
checking if cc PIC flag -fPIC -DPIC works... yes
checking if cc static flag -static works... yes
checking if cc supports -c -o file.o... yes
checking if cc supports -c -o file.o... (cached) yes
checking whether the cc linker (/usr/bin/ld) supports shared libraries... yes
checking whether -lc should be explicitly linked in... yes
checking dynamic linker characteristics... openbsd6.6 ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... c++ -E
checking for ld used by c++... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking whether the c++ linker (/usr/bin/ld) supports shared libraries... yes
checking for c++ option to produce PIC... -fPIC -DPIC
checking if c++ PIC flag -fPIC -DPIC works... yes
checking if c++ static flag -static works... yes
checking if c++ supports -c -o file.o... yes
checking if c++ supports -c -o file.o... (cached) yes
checking whether the c++ linker (/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... openbsd6.6 ld.so
checking how to hardcode library paths into programs... immediate
checking whether build environment is sane... yes
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for LTTOOLBOX... yes
checking whether the compiler supports wide strings... yes
checking for xmlReaderForFile in -lxml2... no
checking for ANSI C header files... (cached) yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for unistd.h... (cached) yes
checking for stddef.h... (cached) yes
checking for stdbool.h that conforms to C99... no
checking for _Bool... no
checking for an ANSI C-conforming const... (cached) yes
checking for size_t... (cached) yes
checking for error_at_line... no
checking whether fread_unlocked is declared... no
checking whether fwrite_unlocked is declared... no
checking whether fgetc_unlocked is declared... no
checking whether fputc_unlocked is declared... no
checking whether fputs_unlocked is declared... no
checking whether fgetwc_unlocked is declared... no
checking whether fputwc_unlocked is declared... no
checking whether fputws_unlocked is declared... no
checking for setlocale... (cached) yes
checking for strdup... (cached) yes
checking for getopt_long... (cached) yes
checking whether C++ compiler accepts -std=c++20... no
checking whether C++ compiler accepts -std=c++2a... yes
checking for a Python interpreter with version >= 3.4... python3
checking for python3... /usr/local/bin/python3
checking for python3 version... 3.7
checking for python3 platform... openbsd6
checking for python3 script directory... ${prefix}/lib/python3.7/site-packages
checking for python3 extension module directory... ${exec_prefix}/lib/python3.7/site-packages
checking that generated files are newer than configure... done
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: error: cannot find input file: `python/setup.py.in'
Currently, weights are specified using the w attribute. Using it in a bidix gives an entry the same weight in both directions, potentially producing unwanted effects if there are lexical units with multiple translations in both directions.
I suggest adding the attributes wr and wl, with behaviour equivalent to what vr and vl already do with variants.
Thanks!
I imagine it will be called lt-reweight. It should take two arguments:
grn.automorf.bin
grn.tagged
$ lt-reweight grn.automorf.bin grn.tagged
Where grn.tagged looks like:
^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^Guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^ombohéra/o<prn><p3><sg>+mbohéra<v><tv><pres>$
^hikuái/hikuái<aux><impf><p3><pl>$
^umi/umi<adj><dem><pl>$
^Guaranikuéra/guarani<n>+kuéra<det><pl>$
^pe/pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^teépe/tee<n>+pe<post>$
^./.<sent>$
^Guarani/guarani<n>$
^haʼe/haʼe<vbser><iv><pres>$
^peteĩva/peteĩ<num>+va<subs><dem>$
^umi/umi<adj><dem><pl>$
^teʼyikuéra/teʼyi<n>+kuéra<det><pl>$
^Amérika-gua/Amérika<np><top>+gua<post>$
^ñeʼẽnguéra/ñeʼẽ<n>+kuéra<det><pl>$
^apytépe/apytépe<post>$
^hetave/heta<adv>+ve<comp>$
^iñeʼẽhárava/iñeʼẽhárava<adj>$
^,/,<cm>$
^oñemohendáva/o<prn><p3><sg>+je<pass>+mohenda<v><tv><pres>+va<subs><dem>$
^irundy/irundy<num>$
^tetãnguéra/tetã<n>+kuéra<det><pl>$
^iñambuévape/iñambuéva<adj>+pe<post>$
^(/(<lpar>$
^Paraguái/Paraguái<np><top>$
^,/,<cm>$
^Argentina/Argentina<np><top>$
^,/,<cm>$
^Volívia/Volívia<np><top>$
^ha/ha<cnjcoo>$
^Brasil/Brasil<np><top>$
^)/)<rpar>$
^./.<sent>$
^Avei/avei<adv>$
^,/,<cm>$
^haʼe/haʼe<vbser><iv><pres>$
^ñoite/ñoite<adv>$
^ojehechakuaáva/o<prn><p3><sg>+je<pass>+hechakuaa<v><tv><pres>+va<subs><dem>$
^ñeʼẽ/ñeʼẽ<n>$
^teéramo/tee<n>+ramo<post>$
^peteĩ/peteĩ<num>$
^tetã/tetã<n>$
^Ñembyamérika-guápe/Ñembyamérika<np><top>+gua<post>+pe<post>$
^./.<sent>$
^Tupi/Tupi<n>$
^ha/ha<cnjcoo>$
^guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^aty/aty<n>$
^guasu/guasu<adj>$
^rehegua/rehegua<post>$
^,/,<cm>$
^oguereko/o<prn><p3><sg>+guereko<v><tv><pres>$
^hetáichagua/hetáichagua<adj>$
^ñeʼẽnunga/ñeʼẽnunga<n>$
^,/,<cm>$
^upéicharõ/upéicha<adv>+rõ<post>$
^jepe/jepe<adv>$
^oĩ/oĩ<v><iv><pres>$
^jekupyty/jekupyty<v><tv><pres>$
^ijapytepekuéra/i<prn><p3><sg>+japyte<n>+pe<post>+kuéra<det><pl>$
^ha/ha<cnjcoo>$
^heta/heta<adv>$
^mbaʼépe/mbaʼe<n>+pe<post>$
^ojojogua/ojojogua<n>$
^koʼã/koʼã<adj><dem><pl>$
^ñeʼẽnungakuéra/ñeʼẽnunga<n>+kuéra<det><pl>$
^./.<sent>$
^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^haʼe/haʼe<vbser><iv><pres>$
^Paraguái/Paraguái<np><top>$
^retãme/tetã<n>+pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^ary/ary<n>$
^1992/1992<num>$
^guive/guive<post>$
^./.<sent>$
^Japypateĩ/Japypateĩ<num>$
^2006/2006<num>$
^guive/guive<post>$
^haʼe/haʼe<vbser><iv><pres>$
^avei/avei<adv>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^Mercosur-pe/Mercosur<np><org>+pe<case>$
^,/,<cm>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^ha/ha<cnjcoo>$
^poytugañeʼẽ/poytugañeʼẽ<n>$
^ykére/ykére<post>$
^./.<sent>$
And the output of the analyser for e.g. poytugañeʼẽ is:
^poytugañeʼẽ/poytugañeʼẽ<n>/a<prn><p1><sg>+poytugañeʼẽ<n>/re<prn><p2><sg>+poytugañeʼẽ<n>$^./.<sent>$
So, the analyses should be weighted as:
poytugañeʼẽ : poytugañeʼẽ<n> = 1.0
poytugañeʼẽ : a<prn><p1><sg>+poytugañeʼẽ<n> = 0.0
poytugañeʼẽ : re<prn><p2><sg>+poytugañeʼẽ<n> = 0.0
Implement an option to lt-proc to output the n-best paths. We can use the same option names as in hfst-proc:
-N N, --analyses=N Output no more than N analyses
(if the transducer is weighted, the N best analyses)
--weight-classes N Output no more than N best weight classes
(where analyses with equal weight constitute a class)
They should work for both analysis and generation.
How about we switch all I/O and wide char use to ICU instead? That would get rid of all the locale irritations and make the code more portable.
We already require ICU, both directly and indirectly. We could even get rid of PCRE in downstream tools.
Start with the att_compiler. This means that you won't have to implement determinisation/minimisation code to start with. The code should not break reading existing files. It might be a good idea to have a small one-byte version header.
Here are a few places to start looking:
transducer.h:
/**
* Transitions of the transducer
*/
map<int, multimap<int, int> > transitions;
transducer.cc:
void
Transducer::write(FILE *output, int const decalage)
void
Transducer::read(FILE *input, int const decalage)
node.h:
class Node
{
private:
friend class State;
/**
* The outgoing transitions of this node.
* Schema: (input symbol, (output symbol, destination))
*/
map<int, Dest> transitions;
@TinoDidriksen do you have any thoughts on properly storing floats in a binary file?
@jimregan implemented -C in https://sourceforge.net/p/apertium/tickets/121/ for generation with dictionary-case. But it doesn't seem to be compatible with the -g option in the lt-proc main function, which just shows the help message. Since the apertium driver always adds -g, it would be nice to have them compatible.
See apertium/apertium-nob@51f51ae
<pardef n="RL_s_case">
<e> <p><l></l> <r></r></p></e>
<e r="RL"><p><l>s</l> <r><s n="gen"/></r></p></e>
</pardef>
creates a std::bad_alloc when trying to analyse something using this pardef.
The workaround is to add an entry that never matches, but this really should be fixed in lt-comp.
Build failure on OpenBSD: apertium/apertium#15
Offending commit: fb61b82
Need a better way to keep size_t separate when size_t is not one of the cstdint types.
Before 944ed25 / #52, it was possible to use monodix files with an empty <alphabet> in order to segment into all known analyses (presumably symbols without analyses were output as blanks). After the change, this is no longer possible.
See 944ed25#commitcomment-35679780 for test cases for Chinese/Japanese/Korean.
Maybe the iswalnum test could be turned off by a flag, e.g. lt-proc --no-implicit-alphabet?
hfst-proc behaviour (expected):
$ echo "с." | hfst-proc sah.automorf.hfst
^с./с.<abbr>$
$ echo "с.1" | hfst-proc sah.automorf.hfst
^с./с.<abbr>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$
lt-proc behaviour (the second one is unexpected):
$ echo "с." | lt-proc sah.automorf.bin
^с./с.<abbr>$
$ echo "с.1" | lt-proc sah.automorf.bin
^с/*с$^./.<sent>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$
Specifically, с. doesn't receive an analysis above; instead the . alone receives an analysis. My expectation is that the parsing would be LRLM (left-to-right longest match), but it seems to be something else?
lttoolbox/lttoolbox/pattern_list.cc
Line 127 in 0285bab
result.push_back(int((unsigned char) lemma[i]));
should be
result.push_back(int((wchar_t) lemma[i]));
Otherwise, Unicode lemma of tags-item in TSX file will not work.
[Test case]
unicode.tsx
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="unicode">
<tagset>
<def-label name="unicode" closed="true">
<tags-item lemma="아" tags="noun"/>
</def-label>
</tagset>
</tagger>
In case of unsigned char:
$ echo "^아/아<noun>$" | apertium-filter-ambiguity unicode.tsx
Warning: There is not coarse tag for the fine tag '아<noun>'
This is because of an incomplete tagset definition or a dictionary error
^아/아<noun>$
In case of wchar_t:
$ echo "^아/아<noun>$" | apertium-filter-ambiguity unicode.tsx
^아/아<noun>$
Same is true for apertium-tagger.
This issue is copied from 76287d2#commitcomment-36208453
@Vaydheesh (CC @unhammer), when running "make install DESTDIR=/tmp/hubba", the Python parts do not respect that installation prefix.
If I add --prefix=$(DESTDIR)$(prefix) to the Makefile.am install step, an error occurs:
make[3]: Entering directory '/misc/lttoolbox/python'
/usr/bin/python3 setup.py install --prefix=/tmp/hubba/usr/local
running install
Checking .pth file support in /tmp/hubba/usr/local/lib/python3.5/site-packages/
/usr/bin/python3 -E -c pass
TEST FAILED: /tmp/hubba/usr/local/lib/python3.5/site-packages/ does NOT support .pth files
error: bad install directory or PYTHONPATH
You are attempting to install a package to a directory that is not
on PYTHONPATH and which Python does not read ".pth" files from. The
installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:
/tmp/hubba/usr/local/lib/python3.5/site-packages/
...
Makefile:475: recipe for target 'install-exec-local' failed
But libdivvun's https://github.com/divvun/libdivvun/tree/master/python install works with just that, and I don't know why. I don't see any relevant difference, but libdivvun does not run that .pth test step.
In addition to #78, it would be great to have a tool, let's call it lt-segment, that would calculate a segment vocabulary from a .dix file. E.g.
...
<pardef n="cat__n">
<e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
<e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>
<pardef n="m/ouse__n">
<e><p><l>ouse</l><r>ouse<s n="n"/><s n="sg"/></r></p></e>
<e><p><l>ice</l><r>ouse<s n="n"/><s n="pl"/></r></p></e>
</pardef>
<pardef n="happ/y__adj">
<e><p><l>y</l><r>y<s n="adj"/></r></p></e>
<e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
<e><p><l>iest</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
</pardef>
<e><i>cat</i><par n="cat__n"/></e>
<e><i>bat</i><par n="cat__n"/></e>
<e><i>happ</i><par n="happ/y__adj"/></e>
<e><i>eas</i><par n="happ/y__adj"/></e>
<e><i>m</i><par n="m/ouse__n"/></e>
<e><i>l</i><par n="m/ouse__n"/></e>
Would produce something like
cat bat happ eas m l @s @ouse @ice @y @ier @iest
It could also be good to have the frequency.
Python 2 is about to go out of support soon, so it would be better if we could update the test scripts to be Python 3 compliant.
I tried changing the shebang line to #!/usr/bin/env python3 and it worked without any issues.
Line 1 in 1fefd20
We will also need to update the README file.
It seems a bit of overkill to have two separate file formats for this stuff.
Given the following paradigms and entries:
<pardef n="liv/e__vblex">
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="inf"/></r></p></e>
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="imp"/></r></p></e>
<e> <p><l>ed</l> <r>e<s n="vblex"/><s n="pp"/></r></p></e>
<e w="1"> <p><l>ing</l> <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
<e w="3"> <p><l>ing</l> <r>e<s n="vblex"/><s n="ger"/></r></p></e>
<e w="2"> <p><l>ing</l> <r>e<s n="vblex"/><s n="subs"/></r></p></e>
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="pres"/></r></p></e>
<e> <p><l>es</l> <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
<e> <p><l>ed</l> <r>e<s n="vblex"/><s n="past"/></r></p></e>
</pardef>
<pardef n="house__n">
<e> <p><l></l> <r><s n="n"/><s n="sg"/></r></p></e>
<e r="RL"><p><l>'s</l> <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
<e> <p><l>s</l> <r><s n="n"/><s n="pl"/></r></p></e>
<e r="RL"><p><l>s'</l> <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
</pardef>
<e lm="house" w="1"> <i>house</i><par n="house__n"/></e>
<e lm="house" w="2"> <i>hous</i><par n="liv/e__vblex"/></e>
lt-proc seems to ignore the weights for the entries:
$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:0.000000>/house<vblex><inf><W:0.000000>/house<vblex><pres><W:0.000000>/house<vblex><imp><W:0.000000>$
The expected result would be:
$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:1.000000>/house<vblex><inf><W:2.000000>/house<vblex><pres><W:2.000000>/house<vblex><imp><W:2.000000>$
However, the weights work fine when they are used inside a paradigm:
$ echo "housing" | lt-proc -wW eng-cat.automorf.bin
^housing/housing<n><sg><W:0.000000>/house<vblex><pprs><W:1.000000>/house<vblex><subs><W:2.000000>/house<vblex><ger><W:3.000000>$
lttoolbox.pc.in has @LTTOOLBOX_LIBS@ and @LTTOOLBOX_CFLAGS@, but it shouldn't have those. Need to remove them and check whether any dependents break from this.
Same goes for https://github.com/apertium/apertium/blob/master/apertium.pc.in
$ cat cmpnum.dix
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅabcdefghijklmnopqrstuvwxyzæøåcqwxzCQWXZéèêóòâôÉÊÈÓÔÒÂáàÁÀäÄöÖšŠčČðđÐýÝñÑüÜíÍ-</alphabet>
<sdefs>
<sdef n="n" c="Noun"/>
<sdef n="acr" c="Acronym"/>
<sdef n="compound-only-L" c="May only be the left-side of a compound"/>
<sdef n="compound-R" c="May be the right-side of a compound, or a full word"/>
<sdef n="guio" c="Dash"/>
</sdefs>
<pardefs>
<pardef n="blah">
<e> <p><l>blah</l> <r>blah</r></p></e>
</pardef>
</pardefs>
<section id="main" type="standard">
<e><i>9</i><p><l>-</l> <r><s n="n"/><s n="compound-only-L"/></r></p></e>
<e><i>x</i><p><l>-</l> <r><s n="n"/><s n="compound-only-L"/></r></p></e>
<e><i>y</i><p><l></l> <r><s n="acr"/><s n="compound-R"/></r></p></e>
</section>
<section id="main" type="inconditional">
<e><i>-</i><p><l></l> <r><s n="guio"/></r></p></e>
</section>
</dictionary>
$ lt-comp lr cmpnum.dix cmpnum.bin
main@inconditional 3 2
main@standard 6 7
$ echo x-y | lt-proc -we cmpnum.bin
^x-y/x<n>+y<acr>$
$ echo 9-y | lt-proc -we cmpnum.bin
9^-/-<guio>$^y/y<acr>$
$ lt-print cmpnum.bin
0 1 - - 0.000000
1 2 ε <guio> 0.000000
2 0.000000
--
0 1 9 9 0.000000
0 1 x x 0.000000
0 2 y y 0.000000
1 3 - <n> 0.000000
2 4 ε <acr> 0.000000
3 5 ε <compound-only-L> 0.000000
4 5 ε <compound-R> 0.000000
5 0.000000
The 9 and the x are represented the same in both the dix and the compiled bin as shown by lt-print above; however only the x works as a compound-left.
A tool should be included in lttoolbox which calculates a BPE vocabulary as defined in this paper: https://arxiv.org/pdf/1508.07909.pdf
The idea is to use BPE to weight our morphological transducers.
Perhaps lt-expand should escape e.g. : as part of a form/lemma (DJ:en:DJ<n><m><sg><def> looks like it has three fields). Similarly for other reserved characters.
Probably incorrect output for:
*+* => ^*+*/*+*$
* => ^*/*$
$ => ^$/$<dollar>$
/ => ^///<sent>$
Tested on apertium-tat.
Post-generation should be able to just run on everything LRLM and only apply the changes where it matches (as if it were a version of sed that respects deformatting).
Say for all words in your dictionary you want to apply the rule …inh t… → …is…. It's just noisy to have to add a <a/> (or an explicit ~ in hfst/lexc) to the RL form-side of every place in your dictionary where that happens, and it's especially noisy if the parts of the form inh are generated by different pardefs.
If postgen didn't have to have a wake-up mark, but stayed awake constantly, you could just put <l>inh<b/>t</l> <r>is</r> in post.dix and not have any changes to the generator at all.
This might have to be a new option (lt-proc -P, --post-generation-everywhere
or something).
(via https://sourceforge.net/p/apertium/mailman/message/36600451/ )
If the dictionary has
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<alphabet/>
<sdefs>
<sdef n="n"/>
<sdef n="m"/>
<sdef n="pl"/>
<sdef n="def"/>
</sdefs>
<section id="main" type="standard">
<e><p><l>kakene</l><r>kake<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
<e><p><l>pc-ane</l><r>pc<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
<e><p><l>PC-ane</l><r>PC<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
</section>
</dictionary>
then we get
$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin
kakene kakene
I would like it to just fall back to "normal" generation for words it can't find the exact case for, i.e.
$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin
kakene KAKENE kakene
while still retaining the -C functionality for words it can find exact matches for
$ echo '^PC<n><m><pl><def>$ ^pc<n><m><pl><def>$' | lt-proc -C nob.autogen.bin
PC-ane pc-ane
For the compiled Breton dictionary bre.automorf.bin:
$ lt-print bre.automorf.bin > bre.att
$ lt-comp lr bre.att bre.bin
Segmentation fault (core dumped)
This is caused by the way the AttCompiler deduces the type of an edge: https://github.com/apertium/lttoolbox/blob/master/lttoolbox/att_compiler.cc#L381
I tried setting the type of all edges to word, so that they are part of the main section. lt-comp then worked, but lt-proc now enters an infinite loop when initializing the root state (finding the epsilon closure).
I am trying to add weights to the morphological analyser.
So while I was checking last year's project (http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission) I noticed that the output of the analyser isn't correct (according to my understanding).
The wiki suggests that to do so I will need to:
$ cat test.att
0 1 c c 4.567895
1 2 a a 0.989532
2 3 t t 2.796193
3 4 @0@ + -3.824564
4 5 @0@ n 1.824564
5 0.525487
4 5 @0@ v 2.845989
$ lt-comp lr test.att test.bin
main@standard 6 6
$ lt-print test.bin
0 1 c c 4.567895
1 2 a a 0.989532
2 3 t t 2.796193
3 4 ε + -3.824564
4 5 ε n 1.824564
4 5 ε v 2.845989
5 0.525487
However, the output of the transducer is a bit strange:
$ echo "cats" | lt-proc test.bin
^cat/cat+n/cat+v$s
Shouldn't the $ sign mark the end of the analysis? Why is there an s following the $ sign?
@Vaydheesh (maybe @unhammer can help?), commit 0fd248f fails when building in parallel (make -j4) with error:
Making all in python
make[1]: Entering directory '/misc/lttoolbox/python'
/usr/bin/python3 setup.py build
make[1]: *** No rule to make target 'lttoolbox.py', needed by 'all'. Stop.
Works fine when re-run or built serially (make -j1), but we need it to work in parallel from a pristine clone.
<e lm="ta ille opp"> <i>t</i><par n="t/a__vblex_adj"/><p><l><b/>ille<b/>opp</l><r><g><b/>ille<b/>opp</g></r></p></e>
is very noisy and hard to read, and it's easy to miss a bit from the <l>
or <r>
when creating new entries by copy-pasting old ones. It'd be nice to have some syntactic sugar so we could instead write
<e lm="ta ille opp"> <i>t</i><par n="t/a__vblex_adj"/><ig><b/>ille<b/>opp</ig></e>
and have it be equivalent to the first example.
It would be useful to run a roundtrip test of lt-print and lt-comp over all available dictionaries, to make sure that nothing segfaults or causes infinite loops. See e.g. #68
We currently have a problem in transliteration mode in that sometimes characters are dropped.
I would like to be able to convert non-alphabetic apostrophes (U+2019 and U+0027) to the alphabetic apostrophe (U+02BC) cleanly in the stream, without having to rely on sed. At the moment I am using:
$ cat wiki.txt | sed "s/\([^ ]\)['’]\([^ ]\)/\1ʼ\2/g" | apertium -d apertium-grn grn-morph
If I try and do it in lttoolbox, e.g. using the following transducer:
$ lt-print apostrophe.bin
0 1 a a
0 1 á á
0 1 ã ã
0 1 b b
0 1 c c
0 1 d d
0 1 e e
0 1 é é
0 1 ê ê
0 1 ë ë
0 1 ẽ ẽ
0 1 f f
0 1 g g
0 1 h h
0 1 i i
0 1 í í
0 1 ï ï
0 1 ĩ ĩ
0 1 j j
0 1 k k
0 1 l l
0 1 m m
0 1 n n
0 1 ñ ñ
0 1 o o
0 1 ó ó
0 1 ô ô
0 1 õ õ
0 1 p p
0 1 q q
0 1 r r
0 1 s s
0 1 t t
0 1 u u
0 1 ú ú
0 1 ü ü
0 1 ũ ũ
0 1 v v
0 1 x x
0 1 y y
0 1 ý ý
0 1 ỹ ỹ
0 1 z z
1 2 ' ʼ
1 2 ’ ʼ
2 3 a a
2 3 á á
2 3 ã ã
2 3 b b
2 3 c c
2 3 d d
2 3 e e
2 3 é é
2 3 ê ê
2 3 ë ë
2 3 ẽ ẽ
2 3 f f
2 3 g g
2 3 h h
2 3 i i
2 3 í í
2 3 ï ï
2 3 ĩ ĩ
2 3 j j
2 3 k k
2 3 l l
2 3 m m
2 3 n n
2 3 ñ ñ
2 3 o o
2 3 ó ó
2 3 ô ô
2 3 õ õ
2 3 p p
2 3 q q
2 3 r r
2 3 s s
2 3 t t
2 3 u u
2 3 ú ú
2 3 ü ü
2 3 ũ ũ
2 3 v v
2 3 x x
2 3 y y
2 3 ý ý
2 3 ỹ ỹ
2 3 z z
3
I get:
$ echo "ka'aguy" | lt-proc -t apostrophe.bin
a'gy
The expected output is:
$ echo "ka'aguy ka'aguy" | lt-proc -t apostrophe.bin
kaʼaguy kaʼaguy
Hello!
There are so many non-whitespace symbols that are not recognized by Apertium's tagger and not marked in any way. For example, apertium-tat does not recognize the following symbols:
_ @ % ~ |
and thousands others.
Is it possible to use some special tag (e.g. ^_/_<unknown>$ or ^_/_<sym>$) for such cases?
Without tagging, it is difficult to process Apertium's output. The streamparser also leaves such cases in the "blank" variable or skips them.
The current problem I'm having is that Arabic commas, semicolons, question marks, etc. (،, ؛ ,؟ — all in the U+0600 block) are not placed in the punctuation level of an lttoolbox transducer when converting from HFST transducers via att format.
Probably due to the use of iswpunct() in this function:
lttoolbox/lttoolbox/att_compiler.cc
Lines 375 to 377 in 5e69502
Resolving #81 would probably resolve this issue as well.
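For what it's worth, Unicode classifies all three of these characters as punctuation (general category Po), so a Unicode-aware check (e.g. ICU, or Python's unicodedata, shown here just to verify the categories) would catch them where the locale-dependent iswpunct() may not:

```python
import unicodedata

# Arabic comma, semicolon and question mark (U+0600 block).
# All three have Unicode general category "Po" (punctuation, other).
for ch in "\u060c\u061b\u061f":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```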
There is a massive segfault / memory leak somewhere in the weight code. After upgrading to it, translations started randomly overloading, with some part of lttoolbox eating the APy machine's whole 32 + 64 GB of RAM in seconds and then dying. I haven't taken the time to isolate it yet; for now, I've rolled back the install.
[29857716.421446] lt-proc[30665]: segfault at 824100 ip 00007f728d000bbc sp 00007fffe8c3f6b0 error 4 in liblttoolbox3-3.4.so.1.0.0[7f728cfa8000+6c000]
(ping @Techievena)
The compiler should compile morpheme boundaries out by default, but there should be an option to lt-comp to retain them. This could be used to make e.g. segmenters. It is needed because at the moment many languages get around .dix
restrictions by just duplicating entries, so we can't simply add morpheme boundary symbols to the beginning of <l>
sides (although that works nicely for paradigms).
<e lm="parastin"><p><l>parast</l><r>parastin</r></p><par n="kir/__vblex_tv"/></e>
<e lm="parastin"><p><l>diparast</l><r>parastin</r></p><par n="dikir/__vblex_tv"/></e>
<e lm="parastin"><p><l>diparêz</l><r>parastin</r></p><par n="dik/e__vblex_tv"/></e>
<e lm="parastin"><p><l>biparêz</l><r>parastin</r></p><par n="bik/e__vblex_tv"/></e>
<e lm="parastin"><p><l>neparast</l><r>parastin</r></p><par n="nekir/__vblex_tv"/></e>
<e lm="parastin"><p><l>naparêz</l><r>parastin</r></p><par n="nak/e__vblex_tv"/></e>
<e lm="parastin"><p><l>neparêz</l><r>parastin</r></p><par n="nek/e__vblex_tv"/></e>
This would give segmentations like biparêz>in
for bi>parêz>in
, so it would be nice to be able to give explicit morpheme boundaries. Option (1) is to use one of the few single letters that are left, c d f h k m n o q t u v w x y z
; here is what the code looks like with each of them:
examples.txt
Ideally we could come up with something with a good mnemonic too.
<m/>: muga "border" (Basque)
<f/>: frontera "border" (Catalan), finis "border" (Latin)
<h/>: hranice "border" (Czech), határ "border" (Hungarian)
Another option (2) would be to use an XML entity, or (3) a simple Unicode symbol, like ¦
or ‖
.
Currently, lt-comp can only compile AT&T files containing a single FST.
It would be better if it could also compile multiple disjoint FSTs encoded in the same AT&T file.
Example:
$ cat transducer.att
--
0 1 a b 0.000000
1 1.000000
--
0 1 b c 0.000000
1 1.000000
--
The current behaviour is:
$ lt-comp lr transducer.att transducer.bin
Error: invalid format 'transducer.att'.
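Until lt-comp supports this natively, a wrapper could split such a file on the `--` separator lines and compile each chunk separately. A minimal sketch of the splitting step (the separator convention is the one from the example above; invoking lt-comp on each chunk is left to the caller):

```python
def split_att(text: str) -> list[str]:
    """Split a multi-FST AT&T file into individual FST chunks,
    using lines containing only '--' as separators."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip() == "--":
            if current:
                chunks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

att = "--\n0\t1\ta\tb\t0.000000\n1\t1.000000\n--\n0\t1\tb\tc\t0.000000\n1\t1.000000\n--"
print(len(split_att(att)))  # 2
```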
It would be cool to be able to define, on a per-transducer (i.e. language-specific) basis, certain characters which can appear anywhere in the stream but don't affect the analysis.
This could possibly be used for soft hyphen,[1] for tatweel[2] and various kinds of zero-width joiners/non-joiners and floating punctuation symbols, e.g. Armenian interrogative signs.[3]
There are many open questions regarding what exact form this should have and what kind of behaviours we should support.
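One possible semantics, sketched as a pre-filter (the character list is illustrative, and a real implementation would presumably need to remember the dropped characters so generation can reinsert them):

```python
# Hypothetical pre-filter: drop "transparent" characters before
# analysis: soft hyphen, tatweel, zero-width joiner/non-joiner.
TRANSPARENT = {"\u00ad", "\u0640", "\u200d", "\u200c"}

def strip_transparent(token: str) -> str:
    return "".join(ch for ch in token if ch not in TRANSPARENT)

print(strip_transparent("ex\u00adample"))  # example
```

Floating punctuation like the Armenian interrogative sign is harder, since it carries meaning and can't simply be discarded.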
"lt-comp --var-left" is used in several Makefiles, but it does not appear in the output of
lt-comp --help
fran@matxine:~/source/apertium/staging/apertium-mlt-heb$ make
apertium-validate-dictionary apertium-mlt-heb.mlt-heb.dix
lt-comp rl apertium-mlt-heb.mlt-heb.dix heb-mlt.autobil.bin
lt-trim .deps/heb.automorf.bin heb-mlt.autobil.bin heb-mlt.automorf.bin
Error: empty set of final states
Makefile:764: recipe for target 'heb-mlt.automorf.bin' failed
make: *** [heb-mlt.automorf.bin] Error 1
fran@matxine:~/source/apertium/staging/apertium-mlt-heb$ ls -lsrth heb-mlt.autobil.bin
8,0K -rw-r--r-- 1 fran fran 6,6K oct 27 03:34 heb-mlt.autobil.bin
fran@matxine:~/source/apertium/staging/apertium-mlt-heb$ lt-print heb-mlt.autobil.bin
Violació de segment (Catalan: "Segmentation fault")