apertium / lttoolbox
Finite state compiler, processor and helper tools used by apertium
Home Page: http://wiki.apertium.org/wiki/Lttoolbox
License: GNU General Public License v2.0
These could be copied over from lt-proc --help.
The corpus test with the en-es pair yields different results for the master branch and the weighted branch. The difference arises in the apertium-tagger part of the pipeline, and after much investigation I have found that the sizes of the transducers for the two branches differ:
Transducer size = 591
Transducer size = 7
Either (1) update the code so that we never get these errors/warnings:
Error: Invalid dictionary (hint: the left side of an entry is empty)
Error: Invalid dictionary (hint: entry beginning with whitespace)
Or (2) give an example string so that the problem can be more easily diagnosed.
If export LC_ALL=C.utf8, then make test fails with:
runTest (lt_print.NonWeightedFst) ... FAIL
======================================================================
FAIL: runTest (lt_print.NonWeightedFst)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/lttoolbox/tests/printtest.py", line 62, in runTest
self.assertEqual(self.communicateFlush(), self.expectedOutput)
AssertionError: u'0\t1\tv\tv\t0.000000\t\n1\t2\ti\ti\t0.000000\t\n2\t3\th\th\t0.000000\t\n3\t4\t [truncated]... != u'0\t1\tv\tv\t0.000000\t\n1\t2\ti\ti\t0.000000\t\n2\t3\th\th\t0.000000\t\n3\t4\t [truncated]...
0 1 v v 0.000000
1 2 i i 0.000000
2 3 h h 0.000000
3 4 k k 0.000000
4 5 i i 0.000000
5 6 <KEPT> <KEPT> 0.000000
- 6 10 \u03b5 \u03b5 8238976959774720.000000
+ 6 10 \u03b5 \u03b5 0.000000
6 7 <MATCHSOFAR> <MATCHSOFAR> 0.000000
7 8 <STILLMATCHING> <STILLMATCHING> 0.000000
8 9 <NONMATCHL> <NONMATCHR> 0.000000
- 9 10 \u03b5 \u03b5 8238976959774720.000000
+ 9 10 \u03b5 \u03b5 0.000000
10 0.000000
If export LC_ALL=en_US.utf8, then it passes. How the hell the locale has an influence on weights, I don't yet know.
(Also, if LC_ALL is not a UTF-8 locale then all weight tests fail because they use Unicode characters, but that's expected.)
Lttoolbox generates forms, but fails to analyze them.
For example:
$ echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata
But there's no such form in the morphological analyser:
$ echo "aguata" | apertium -d . grn-morph
^aguata/*aguata$^./.<sent>$
Although some forms are analysed correctly:
$ echo "ndaguatái" | apertium -d . grn-morph
^ndaguatái/nd<neg>+a<prn><p1><sg>+guata<v><iv><pres>+i<neg>$^./.<sent>$
We will be very grateful if you fix this.
Something like:
<section id="main" type="standard">
<e w="0.6"><p><l>estación<s n="n"/><s n="f"/></l><r>station<s n="n"/></r></p></e>
<e w="0.4><p><l>estación<s n="n"/><s n="f"/></l><r>season<s n="n"/></r></p></e>
</section>
We need to implement a way to represent infinite weights.
The current outcome is strange!
$ cat sample.att
0 1 a b 2
1 2 b c 1
1 2 c d inf
2 0
$ lt-comp lr sample.att sa.bin
main@standard 3 3
$ lt-print sa.bin
0 1 a b 1.000000
1 2 b c 2.000000
1 2 c d -2.000000
2 0.000000
If you have legge# opp til in your monolingual analyser, and try to analyse the input legge opp<br/>blah in HTML format, lt-proc will shift the <br/> into the middle of the analysis:
$ echo 'legge opp<br/>blah' |apertium-deshtml
legge opp[<br\/>]blah.[][
]
↑ here it's still at the end
$ echo 'legge opp<br/>blah' |apertium-deshtml |lt-proc -we ../apertium-nno-nob/nob-nno.automorf.bin
^legge/legge<vblex><inf>$[<br\/>]^opp/opp<pr>/opp<adv>/oppe<vblex><imp>$ ^blah/*blah$^./.<sent><clb>$[][
]
but ↑ here it's in the middle of the multiword.
From the code, it seems like what happens is that we
(1) read legge; we've now seen a nonalphabetic after a final, so the index last=6 and lf=/legge<vblex><inf>;
(2) read legge opp[<br/>], where we still don't know if we'll see til at the right, so [<br/>] ends up in blankqueue;
(3) read b, meaning we can't go further in that MWE, so we have to skip back to the last full analysis:
(4) printWord with surface form legge;
(5) printSpace, which completely flushes blankqueue if there is one, and otherwise outputs a space.

A double-quote token gets a simple " analysis.
$ echo '"' | lt-proc eng.automorf.bin
"
^"/"<dquotes>$
Would it be better if we add the double quotes to the .dix
files?
Other missing characters:
°
°C could also be handled, instead of getting the °^C/*C$ analysis.
– (Unicode decimal value: 8211)

Unzip softhyph.zip. The form i is missing below:
$ lt-proc -we nob-nno.automorf.bin < softhyph
^/i<pr>/ialphabet<n><m><sg><ind>$ ^xyzzy/*xyzzy$
The third character in the input file is a soft hyphen (utf8 bytes C2AD):
$ hexdump -C softhyph
00000000 69 20 c2 ad 78 79 7a 7a 79 0a |i ..xyzzy.|
0000000a
Remove the soft hyphen, and it gives the expected
^i/i<pr>/ialphabet<n><m><sg><ind>$ ^xyzzy/*xyzzy$
This feature will allow weighting of FSTs given a tagged corpus.
Testing the uploaded lttoolbox-3.5.1.tar.gz on OpenBSD -current.
Using /ptmp/pobj/lttoolbox-3.5.1/config.site (generated)
configure: WARNING: unrecognized options: --disable-gtk-doc
configure: loading site script /ptmp/pobj/lttoolbox-3.5.1/config.site
checking build system type... x86_64-unknown-openbsd6.6
checking host system type... x86_64-unknown-openbsd6.6
checking target system type... x86_64-unknown-openbsd6.6
checking for a BSD-compatible install... /ptmp/pobj/lttoolbox-3.5.1/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... mkdir -p
checking for gawk... (cached) awk
checking whether make sets $(MAKE)... (cached) yes
checking whether make supports nested variables... yes
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... (cached) o
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether c++ accepts -g... (cached) yes
checking for style of include used by make... GNU
checking dependency style of c++... gcc3
checking how to print strings... print -r
checking for gcc... cc
checking whether we are using the GNU C compiler... (cached) yes
checking whether cc accepts -g... (cached) yes
checking for cc option to accept ISO C89... none needed
checking whether cc understands -c and -o together... yes
checking dependency style of cc... gcc3
checking for a sed that does not truncate output... (cached) /usr/bin/sed
checking for grep that handles long lines and -e... (cached) /usr/bin/grep
checking for egrep... (cached) /usr/bin/egrep
checking for fgrep... (cached) /usr/bin/fgrep
checking for ld used by cc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... (cached) 131072
checking how to convert x86_64-unknown-openbsd6.6 file names to x86_64-unknown-openbsd6.6 format... func_convert_file_noop
checking how to convert x86_64-unknown-openbsd6.6 file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... match_pattern /lib[^/]+(\.so\.[0-9]+\.[0-9]+|\.so|_pic\.a)$
checking for dlltool... no
checking how to associate runtime and link libraries... print -r --
checking for ar... (cached) ar
checking for archiver @FILE support... @
checking for strip... (cached) strip
checking for ranlib... (cached) ranlib
checking command to parse /usr/bin/nm -B output from cc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... cc -E
checking for ANSI C header files... (cached) yes
checking for sys/types.h... (cached) yes
checking for sys/stat.h... (cached) yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for memory.h... (cached) yes
checking for strings.h... (cached) yes
checking for inttypes.h... (cached) yes
checking for stdint.h... (cached) yes
checking for unistd.h... (cached) yes
checking for dlfcn.h... (cached) yes
checking for objdir... .libs
checking if cc supports -fno-rtti -fno-exceptions... yes
checking for cc option to produce PIC... -fPIC -DPIC
checking if cc PIC flag -fPIC -DPIC works... yes
checking if cc static flag -static works... yes
checking if cc supports -c -o file.o... yes
checking if cc supports -c -o file.o... (cached) yes
checking whether the cc linker (/usr/bin/ld) supports shared libraries... yes
checking whether -lc should be explicitly linked in... yes
checking dynamic linker characteristics... openbsd6.6 ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... c++ -E
checking for ld used by c++... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking whether the c++ linker (/usr/bin/ld) supports shared libraries... yes
checking for c++ option to produce PIC... -fPIC -DPIC
checking if c++ PIC flag -fPIC -DPIC works... yes
checking if c++ static flag -static works... yes
checking if c++ supports -c -o file.o... yes
checking if c++ supports -c -o file.o... (cached) yes
checking whether the c++ linker (/usr/bin/ld) supports shared libraries... yes
checking dynamic linker characteristics... openbsd6.6 ld.so
checking how to hardcode library paths into programs... immediate
checking whether build environment is sane... yes
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for LTTOOLBOX... yes
checking whether the compiler supports wide strings... yes
checking for xmlReaderForFile in -lxml2... no
checking for ANSI C header files... (cached) yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for unistd.h... (cached) yes
checking for stddef.h... (cached) yes
checking for stdbool.h that conforms to C99... no
checking for _Bool... no
checking for an ANSI C-conforming const... (cached) yes
checking for size_t... (cached) yes
checking for error_at_line... no
checking whether fread_unlocked is declared... no
checking whether fwrite_unlocked is declared... no
checking whether fgetc_unlocked is declared... no
checking whether fputc_unlocked is declared... no
checking whether fputs_unlocked is declared... no
checking whether fgetwc_unlocked is declared... no
checking whether fputwc_unlocked is declared... no
checking whether fputws_unlocked is declared... no
checking for setlocale... (cached) yes
checking for strdup... (cached) yes
checking for getopt_long... (cached) yes
checking whether C++ compiler accepts -std=c++20... no
checking whether C++ compiler accepts -std=c++2a... yes
checking for a Python interpreter with version >= 3.4... python3
checking for python3... /usr/local/bin/python3
checking for python3 version... 3.7
checking for python3 platform... openbsd6
checking for python3 script directory... ${prefix}/lib/python3.7/site-packages
checking for python3 extension module directory... ${exec_prefix}/lib/python3.7/site-packages
checking that generated files are newer than configure... done
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: error: cannot find input file: `python/setup.py.in'
Currently, weights are specified using the w attribute. Using it in a bidix gives an entry the same weight in both directions, potentially producing unwanted effects if there are lexical units with multiple translations in both directions.
I suggest adding the attributes wr and wl, with behaviour equivalent to what vr and vl already do with variants.
Thanks!
I imagine it will be called lt-reweight. It should take two arguments:
grn.automorf.bin
grn.tagged
$ lt-reweight grn.automorf.bin grn.tagged
Where grn.tagged looks like:
^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^Guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^ombohéra/o<prn><p3><sg>+mbohéra<v><tv><pres>$
^hikuái/hikuái<aux><impf><p3><pl>$
^umi/umi<adj><dem><pl>$
^Guaranikuéra/guarani<n>+kuéra<det><pl>$
^pe/pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^teépe/tee<n>+pe<post>$
^./.<sent>$
^Guarani/guarani<n>$
^haʼe/haʼe<vbser><iv><pres>$
^peteĩva/peteĩ<num>+va<subs><dem>$
^umi/umi<adj><dem><pl>$
^teʼyikuéra/teʼyi<n>+kuéra<det><pl>$
^Amérika-gua/Amérika<np><top>+gua<post>$
^ñeʼẽnguéra/ñeʼẽ<n>+kuéra<det><pl>$
^apytépe/apytépe<post>$
^hetave/heta<adv>+ve<comp>$
^iñeʼẽhárava/iñeʼẽhárava<adj>$
^,/,<cm>$
^oñemohendáva/o<prn><p3><sg>+je<pass>+mohenda<v><tv><pres>+va<subs><dem>$
^irundy/irundy<num>$
^tetãnguéra/tetã<n>+kuéra<det><pl>$
^iñambuévape/iñambuéva<adj>+pe<post>$
^(/(<lpar>$
^Paraguái/Paraguái<np><top>$
^,/,<cm>$
^Argentina/Argentina<np><top>$
^,/,<cm>$
^Volívia/Volívia<np><top>$
^ha/ha<cnjcoo>$
^Brasil/Brasil<np><top>$
^)/)<rpar>$
^./.<sent>$
^Avei/avei<adv>$
^,/,<cm>$
^haʼe/haʼe<vbser><iv><pres>$
^ñoite/ñoite<adv>$
^ojehechakuaáva/o<prn><p3><sg>+je<pass>+hechakuaa<v><tv><pres>+va<subs><dem>$
^ñeʼẽ/ñeʼẽ<n>$
^teéramo/tee<n>+ramo<post>$
^peteĩ/peteĩ<num>$
^tetã/tetã<n>$
^Ñembyamérika-guápe/Ñembyamérika<np><top>+gua<post>+pe<post>$
^./.<sent>$
^Tupi/Tupi<n>$
^ha/ha<cnjcoo>$
^guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^aty/aty<n>$
^guasu/guasu<adj>$
^rehegua/rehegua<post>$
^,/,<cm>$
^oguereko/o<prn><p3><sg>+guereko<v><tv><pres>$
^hetáichagua/hetáichagua<adj>$
^ñeʼẽnunga/ñeʼẽnunga<n>$
^,/,<cm>$
^upéicharõ/upéicha<adv>+rõ<post>$
^jepe/jepe<adv>$
^oĩ/oĩ<v><iv><pres>$
^jekupyty/jekupyty<v><tv><pres>$
^ijapytepekuéra/i<prn><p3><sg>+japyte<n>+pe<post>+kuéra<det><pl>$
^ha/ha<cnjcoo>$
^heta/heta<adv>$
^mbaʼépe/mbaʼe<n>+pe<post>$
^ojojogua/ojojogua<n>$
^koʼã/koʼã<adj><dem><pl>$
^ñeʼẽnungakuéra/ñeʼẽnunga<n>+kuéra<det><pl>$
^./.<sent>$
^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^haʼe/haʼe<vbser><iv><pres>$
^Paraguái/Paraguái<np><top>$
^retãme/tetã<n>+pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^ary/ary<n>$
^1992/1992<num>$
^guive/guive<post>$
^./.<sent>$
^Japypateĩ/Japypateĩ<num>$
^2006/2006<num>$
^guive/guive<post>$
^haʼe/haʼe<vbser><iv><pres>$
^avei/avei<adv>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^Mercosur-pe/Mercosur<np><org>+pe<case>$
^,/,<cm>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^ha/ha<cnjcoo>$
^poytugañeʼẽ/poytugañeʼẽ<n>$
^ykére/ykére<post>$
^./.<sent>$
And the output of the analyser for e.g. poytugañeʼẽ is:
^poytugañeʼẽ/poytugañeʼẽ<n>/a<prn><p1><sg>+poytugañeʼẽ<n>/re<prn><p2><sg>+poytugañeʼẽ<n>$^./.<sent>$
So, the analyses should be weighted as:
poytugañeʼẽ : poytugañeʼẽ<n> = 1.0
poytugañeʼẽ : a<prn><p1><sg>+poytugañeʼẽ<n> = 0.0
poytugañeʼẽ : re<prn><p2><sg>+poytugañeʼẽ<n> = 0.0
Implement an option to lt-proc to output the n-best paths. We can use the same option names as in hfst-proc:
-N N, --analyses=N Output no more than N analyses
(if the transducer is weighted, the N best analyses)
--weight-classes N Output no more than N best weight classes
(where analyses with equal weight constitute a class)
They should work for both analysis and generation.
How about we switch all I/O and wide char use to ICU instead? That would get rid of all the locale irritations and make the code more portable.
We already require ICU, both directly and indirectly. We could even get rid of PCRE in downstream tools.
Start with the att_compiler. This means that you won't have to implement determinisation/minimisation code to start with. The code should not break reading existing files. It might be a good idea to have a small one-byte version header.
Here are a few places to start looking:
transducer.h:
/**
* Transitions of the transducer
*/
map<int, multimap<int, int> > transitions;
transducer.cc:
void
Transducer::write(FILE *output, int const decalage)
void
Transducer::read(FILE *input, int const decalage)
node.h:
class Node
{
private:
friend class State;
/**
* The outgoing transitions of this node.
* Schema: (input symbol, (output symbol, destination))
*/
map<int, Dest> transitions;
@TinoDidriksen do you have any thoughts on properly storing floats in a binary file?
@jimregan implemented -C in https://sourceforge.net/p/apertium/tickets/121/ for generation with dictionary-case. But it doesn't seem to be compatible with the -g option in the lt-proc main function, which just shows the help message. Since the apertium driver always adds -g, it would be nice to have them compatible.
See apertium/apertium-nob@51f51ae
<pardef n="RL_s_case">
<e> <p><l></l> <r></r></p></e>
<e r="RL"><p><l>s</l> <r><s n="gen"/></r></p></e>
</pardef>
creates a std::bad_alloc when trying to analyse something using this pardef.
The workaround is to add an entry that never matches, but this really should be fixed in lt-comp.
Build failure on OpenBSD: apertium/apertium#15
Offending commit: fb61b82
Need a better way to keep size_t separate when size_t is not one of the cstdint types.
Before 944ed25 / #52, it was possible to use monodix files with an empty <alphabet> in order to segment into all known analyses (presumably symbols without analyses were output as blanks). After the change, this is no longer possible.
See 944ed25#commitcomment-35679780 for test cases for Chinese/Japanese/Korean.
Maybe the iswalnum test could be turned off by a flag, e.g. lt-proc --no-implicit-alphabet?
hfst-proc behaviour (expected):
$ echo "с." | hfst-proc sah.automorf.hfst
^с./с.<abbr>$
$ echo "с.1" | hfst-proc sah.automorf.hfst
^с./с.<abbr>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$
lt-proc behaviour (the second one is unexpected):
$ echo "с." | lt-proc sah.automorf.bin
^с./с.<abbr>$
$ echo "с.1" | lt-proc sah.automorf.bin
^с/*с$^./.<sent>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$
Specifically, с. doesn't receive an analysis above; instead the . alone receives an analysis. My expectation is that the parsing would be LRLM (left-to-right longest match), but it seems to be something else?
lttoolbox/lttoolbox/pattern_list.cc
Line 127 in 0285bab
result.push_back(int((unsigned char) lemma[i]));
should be
result.push_back(int((wchar_t) lemma[i]));
Otherwise, Unicode lemma of tags-item in TSX file will not work.
[Test case]
unicode.tsx
<?xml version="1.0" encoding="UTF-8"?>
<tagger name="unicode">
<tagset>
<def-label name="unicode" closed="true">
<tags-item lemma="아" tags="noun"/>
</def-label>
</tagset>
</tagger>
In case of unsigned char:
$ echo "^아/아<noun>$" | apertium-filter-ambiguity unicode.tsx
Warning: There is not coarse tag for the fine tag '아<noun>'
This is because of an incomplete tagset definition or a dictionary error
^아/아<noun>$
In case of wchar_t:
$ echo "^아/아<noun>$" | apertium-filter-ambiguity unicode.tsx
^아/아<noun>$
Same is true for apertium-tagger.
This issue is copied from 76287d2#commitcomment-36208453
@Vaydheesh (CC @unhammer), when running "make install DESTDIR=/tmp/hubba", the Python parts do not respect that installation prefix.
If I add --prefix=$(DESTDIR)$(prefix) to the Makefile.am install step, an error occurs:
make[3]: Entering directory '/misc/lttoolbox/python'
/usr/bin/python3 setup.py install --prefix=/tmp/hubba/usr/local
running install
Checking .pth file support in /tmp/hubba/usr/local/lib/python3.5/site-packages/
/usr/bin/python3 -E -c pass
TEST FAILED: /tmp/hubba/usr/local/lib/python3.5/site-packages/ does NOT support .pth files
error: bad install directory or PYTHONPATH
You are attempting to install a package to a directory that is not
on PYTHONPATH and which Python does not read ".pth" files from. The
installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:
/tmp/hubba/usr/local/lib/python3.5/site-packages/
...
Makefile:475: recipe for target 'install-exec-local' failed
But libdivvun's https://github.com/divvun/libdivvun/tree/master/python install works with just that, and I don't know why. I don't see any relevant difference, but libdivvun does not run that .pth test step.
In addition to #78, it would be great to have a tool, let's call it lt-segment, that would calculate a segment vocabulary from a .dix file. E.g.
...
<pardef n="cat__n">
<e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
<e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>
<pardef n="m/ouse__n">
<e><p><l>ouse</l><r>ouse<s n="n"/><s n="sg"/></r></p></e>
<e><p><l>ice</l><r>ouse<s n="n"/><s n="pl"/></r></p></e>
</pardef>
<pardef n="happ/y__adj">
<e><p><l>y</l><r>y<s n="adj"/></r></p></e>
<e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
<e><p><l>iest</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
</pardef>
<e><i>cat</i><par n="cat__n"/></e>
<e><i>bat</i><par n="cat__n"/></e>
<e><i>happ</i><par n="happ/y__adj"/></e>
<e><i>eas</i><par n="happ/y__adj"/></e>
<e><i>m</i><par n="m/ouse__n"/></e>
<e><i>l</i><par n="m/ouse__n"/></e>
Would produce something like
cat bat happ eas m l @s @ouse @ice @y @ier @iest
It could also be good to have the frequency.
Python 2 is about to go out of support soon, so it would be better if we could update the test scripts to be Python 3 compliant.
I tried changing the shebang line to #!/usr/bin/env python3 and it worked without any issues.
Line 1 in 1fefd20
We will also need to update the README file.
It seems a bit of overkill to have two separate file formats for this stuff.
Given the following paradigms and entries:
<pardef n="liv/e__vblex">
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="inf"/></r></p></e>
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="imp"/></r></p></e>
<e> <p><l>ed</l> <r>e<s n="vblex"/><s n="pp"/></r></p></e>
<e w="1"> <p><l>ing</l> <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
<e w="3"> <p><l>ing</l> <r>e<s n="vblex"/><s n="ger"/></r></p></e>
<e w="2"> <p><l>ing</l> <r>e<s n="vblex"/><s n="subs"/></r></p></e>
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="pres"/></r></p></e>
<e> <p><l>es</l> <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
<e> <p><l>ed</l> <r>e<s n="vblex"/><s n="past"/></r></p></e>
</pardef>
<pardef n="house__n">
<e> <p><l></l> <r><s n="n"/><s n="sg"/></r></p></e>
<e r="RL"><p><l>'s</l> <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
<e> <p><l>s</l> <r><s n="n"/><s n="pl"/></r></p></e>
<e r="RL"><p><l>s'</l> <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
</pardef>
<e lm="house" w="1"> <i>house</i><par n="house__n"/></e>
<e lm="house" w="2"> <i>hous</i><par n="liv/e__vblex"/></e>
lt-proc seems to ignore the weights for the entries:
$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:0.000000>/house<vblex><inf><W:0.000000>/house<vblex><pres><W:0.000000>/house<vblex><imp><W:0.000000>$
The expected result would be:
$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:1.000000>/house<vblex><inf><W:2.000000>/house<vblex><pres><W:2.000000>/house<vblex><imp><W:2.000000>$
However, the weights work fine when they are used inside a paradigm:
$ echo "housing" | lt-proc -wW eng-cat.automorf.bin
^housing/housing<n><sg><W:0.000000>/house<vblex><pprs><W:1.000000>/house<vblex><subs><W:2.000000>/house<vblex><ger><W:3.000000>$
lttoolbox.pc.in has @LTTOOLBOX_LIBS@ and @LTTOOLBOX_CFLAGS@, but it shouldn't have those. Need to remove them and check whether any dependents break from this.
Same goes for https://github.com/apertium/apertium/blob/master/apertium.pc.in
$ cat cmpnum.dix
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅabcdefghijklmnopqrstuvwxyzæøåcqwxzCQWXZéèêóòâôÉÊÈÓÔÒÂáàÁÀäÄöÖšŠčČðđÐýÝñÑüÜíÍ-</alphabet>
<sdefs>
<sdef n="n" c="Noun"/>
<sdef n="acr" c="Acronym"/>
<sdef n="compound-only-L" c="May only be the left-side of a compound"/>
<sdef n="compound-R" c="May be the right-side of a compound, or a full word"/>
<sdef n="guio" c="Dash"/>
</sdefs>
<pardefs>
<pardef n="blah">
<e> <p><l>blah</l> <r>blah</r></p></e>
</pardef>
</pardefs>
<section id="main" type="standard">
<e><i>9</i><p><l>-</l> <r><s n="n"/><s n="compound-only-L"/></r></p></e>
<e><i>x</i><p><l>-</l> <r><s n="n"/><s n="compound-only-L"/></r></p></e>
<e><i>y</i><p><l></l> <r><s n="acr"/><s n="compound-R"/></r></p></e>
</section>
<section id="main" type="inconditional">
<e><i>-</i><p><l></l> <r><s n="guio"/></r></p></e>
</section>
</dictionary>
$ lt-comp lr cmpnum.dix cmpnum.bin
main@inconditional 3 2
main@standard 6 7
$ echo x-y | lt-proc -we cmpnum.bin
^x-y/x<n>+y<acr>$
$ echo 9-y | lt-proc -we cmpnum.bin
9^-/-<guio>$^y/y<acr>$
$ lt-print cmpnum.bin
0 1 - - 0.000000
1 2 ε <guio> 0.000000
2 0.000000
--
0 1 9 9 0.000000
0 1 x x 0.000000
0 2 y y 0.000000
1 3 - <n> 0.000000
2 4 ε <acr> 0.000000
3 5 ε <compound-only-L> 0.000000
4 5 ε <compound-R> 0.000000
5 0.000000
The 9 and the x are represented the same in both the dix and the compiled bin as shown by lt-print above; however only the x works as a compound-left.
A tool should be included in lttoolbox which calculates a BPE vocabulary as defined in this paper: https://arxiv.org/pdf/1508.07909.pdf
The idea is to use BPE to weight our morphological transducers.
Perhaps lt-expand should escape e.g. : as part of a form/lemma (DJ:en:DJ<n><m><sg><def> looks like it has three fields). Similarly for other reserved characters.
Probably incorrect output for:
*+* => ^*+*/*+*$
* => ^*/*$
$ => ^$/$<dollar>$
/ => ^///<sent>$
Tested on apertium-tat.
Post-generation should be able to just run on everything LRLM and only apply the changes where it matches (as if it were a version of sed that respects deformatting).
Say for all words in your dictionary you want to apply the rule …inh t… → …is…. It's just noisy to have to add a <a/> (or an explicit ~ in hfst/lexc) to the RL form-side of every place in your dictionary where that happens, and it's especially noisy if the parts of the form inh are generated by different pardefs.
If postgen didn't have to have a wake-up mark, but stayed awake constantly, you could just put <l>inh<b/>t</l> <r>is</r> in post.dix and not have any changes to the generator at all.
This might have to be a new option (lt-proc -P, --post-generation-everywhere
or something).
(via https://sourceforge.net/p/apertium/mailman/message/36600451/ )
If the dictionary has
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<alphabet/>
<sdefs>
<sdef n="n"/>
<sdef n="m"/>
<sdef n="pl"/>
<sdef n="def"/>
</sdefs>
<section id="main" type="standard">
<e><p><l>kakene</l><r>kake<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
<e><p><l>pc-ane</l><r>pc<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
<e><p><l>PC-ane</l><r>PC<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
</section>
</dictionary>
then we get
$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin
kakene kakene
I would like it to just fall back to "normal" generation for words it can't find the exact case for, i.e.
$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin
kakene KAKENE kakene
while still retaining the -C functionality for words it can find exact matches for
$ echo '^PC<n><m><pl><def>$ ^pc<n><m><pl><def>$' | lt-proc -C nob.autogen.bin
PC-ane pc-ane
For the compiled Breton dictionary bre.automorf.bin:
$ lt-print bre.automorf.bin > bre.att
$ lt-comp lr bre.att bre.bin
Segmentation fault (core dumped)
This is caused by the way the AttCompiler deduces the type of an edge: https://github.com/apertium/lttoolbox/blob/master/lttoolbox/att_compiler.cc#L381
I tried setting the type of all edges to word, so that they are part of the main section. lt-comp then worked, but lt-proc now enters an infinite loop when initializing the root state (finding the epsilon closure).
I am trying to add weights to the morphological analyser.
So while I was checking last year's project (http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission) I noticed that the output of the analyser isn't correct (according to my understanding).
The wiki suggests that to do so I will need to:
$ cat test.att
0 1 c c 4.567895
1 2 a a 0.989532
2 3 t t 2.796193
3 4 @0@ + -3.824564
4 5 @0@ n 1.824564
5 0.525487
4 5 @0@ v 2.845989
$ lt-comp lr test.att test.bin
main@standard 6 6
$ lt-print test.bin
0 1 c c 4.567895
1 2 a a 0.989532
2 3 t t 2.796193
3 4 ε + -3.824564
4 5 ε n 1.824564
4 5 ε v 2.845989
5 0.525487
However, the output of the transducer is a bit strange:
$ echo "cats" | lt-proc test.bin
^cat/cat+n/cat+v$s
Shouldn't the $ sign mark the end of the analysis? Why is there an s following the $ sign?
@Vaydheesh (maybe @unhammer can help?), commit 0fd248f fails when building in parallel (make -j4) with error:
Making all in python
make[1]: Entering directory '/misc/lttoolbox/python'
/usr/bin/python3 setup.py build
make[1]: *** No rule to make target 'lttoolbox.py', needed by 'all'. Stop.
Works fine when re-run or built serially (make -j1), but we need it to work in parallel from a pristine clone.
<e lm="ta ille opp"> <i>t</i><par n="t/a__vblex_adj"/><p><l><b/>ille<b/>opp</l><r><g><b/>ille<b/>opp</g></r></p></e>
is very noisy and hard to read, and it's easy to miss a bit from the <l>
or <r>
when creating new entries by copy-pasting old ones. It'd be nice to have some syntactic sugar so we could instead write
<e lm="ta ille opp"> <i>t</i><par n="t/a__vblex_adj"/><ig><b/>ille<b/>opp</ig></e>
and have it be equivalent to the first example.
It would be useful to run a roundtrip test of lt-print and lt-comp over all available dictionaries, to make sure that nothing segfaults or causes infinite loops. See e.g. #68
We currently have a problem in transliteration mode in that sometimes characters are dropped.
I would like to be able to convert non-alphabetic apostrophes (U+2019 and U+0027) to the alphabetic apostrophe (U+02BC) cleanly in the stream, without having to rely on sed. At the moment I am using:
$ cat wiki.txt | sed "s/\([^ ]\)['’]\([^ ]\)/\1ʼ\2/g" | apertium -d apertium-grn grn-morph
If I try and do it in lttoolbox, e.g. using the following transducer:
$ lt-print apostrophe.bin
0 1 a a
0 1 á á
0 1 ã ã
0 1 b b
0 1 c c
0 1 d d
0 1 e e
0 1 é é
0 1 ê ê
0 1 ë ë
0 1 ẽ ẽ
0 1 f f
0 1 g g
0 1 h h
0 1 i i
0 1 í í
0 1 ï ï
0 1 ĩ ĩ
0 1 j j
0 1 k k
0 1 l l
0 1 m m
0 1 n n
0 1 ñ ñ
0 1 o o
0 1 ó ó
0 1 ô ô
0 1 õ õ
0 1 p p
0 1 q q
0 1 r r
0 1 s s
0 1 t t
0 1 u u
0 1 ú ú
0 1 ü ü
0 1 ũ ũ
0 1 v v
0 1 x x
0 1 y y
0 1 ý ý
0 1 ỹ ỹ
0 1 z z
1 2 ' ʼ
1 2 ’ ʼ
2 3 a a
2 3 á á
2 3 ã ã
2 3 b b
2 3 c c
2 3 d d
2 3 e e
2 3 é é
2 3 ê ê
2 3 ë ë
2 3 ẽ ẽ
2 3 f f
2 3 g g
2 3 h h
2 3 i i
2 3 í í
2 3 ï ï
2 3 ĩ ĩ
2 3 j j
2 3 k k
2 3 l l
2 3 m m
2 3 n n
2 3 ñ ñ
2 3 o o
2 3 ó ó
2 3 ô ô
2 3 õ õ
2 3 p p
2 3 q q
2 3 r r
2 3 s s
2 3 t t
2 3 u u
2 3 ú ú
2 3 ü ü
2 3 ũ ũ
2 3 v v
2 3 x x
2 3 y y
2 3 ý ý
2 3 ỹ ỹ
2 3 z z
3
I get:
$ echo "ka'aguy" | lt-proc -t apostrophe.bin
a'gy
The expected output is:
$ echo "ka'aguy ka'aguy" | lt-proc -t apostrophe.bin
kaʼaguy kaʼaguy
Hello!
There are so many non-whitespace symbols that are not recognized by Apertium's tagger and not marked in any way. For example, apertium-tat does not recognize the following symbols:
_ @ % ~ |
and thousands others.
Is it possible to use some special tag (e.g. ^_/_<unknown>$ or ^_/_<sym>$) for such cases?
Without tagging, it is difficult to process Apertium's output. The streamparser also leaves such cases in the "blank" variable or skips them.
The current problem I'm having is that Arabic commas, semicolons, question marks, etc. (،, ؛ ,؟ — all in the U+0600 block) are not placed in the punctuation level of an lttoolbox transducer when converting from HFST transducers via att format.
Probably due to the use of iswpunct() in this function:
lttoolbox/lttoolbox/att_compiler.cc
Lines 375 to 377 in 5e69502
Resolving #81 would probably resolve this issue as well.
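For what it's worth, Unicode classifies all three of these characters as punctuation (general category Po), so a Unicode-aware check (e.g. ICU, or Python's unicodedata, shown here just to verify the categories) would catch them where the locale-dependent iswpunct() may not:

```python
import unicodedata

# Arabic comma, semicolon and question mark (U+0600 block).
# All three have Unicode general category "Po" (punctuation, other).
for ch in "\u060c\u061b\u061f":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```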
There is a massive segfault / memory leak somewhere in the weight code. After upgrading to it, translations started randomly overloading, with some part of lttoolbox eating the APy machine's whole 32 + 64 GB of RAM in seconds and then dying. I haven't taken the time to isolate it yet; for now, I've rolled back the install.
[29857716.421446] lt-proc[30665]: segfault at 824100 ip 00007f728d000bbc sp 00007fffe8c3f6b0 error 4 in liblttoolbox3-3.4.so.1.0.0[7f728cfa8000+6c000]
(ping @Techievena)
The compiler should compile morpheme boundaries out by default, but there should be an option to lt-comp to retain them. This could be used to make e.g. segmenters. It is needed because at the moment many languages get around .dix
restrictions by just duplicating entries, so we can't simply add morpheme boundary symbols to the beginning of <l>
sides (although that works nicely for paradigms).
<e lm="parastin"><p><l>parast</l><r>parastin</r></p><par n="kir/__vblex_tv"/></e>
<e lm="parastin"><p><l>diparast</l><r>parastin</r></p><par n="dikir/__vblex_tv"/></e>
<e lm="parastin"><p><l>diparêz</l><r>parastin</r></p><par n="dik/e__vblex_tv"/></e>
<e lm="parastin"><p><l>biparêz</l><r>parastin</r></p><par n="bik/e__vblex_tv"/></e>
<e lm="parastin"><p><l>neparast</l><r>parastin</r></p><par n="nekir/__vblex_tv"/></e>
<e lm="parastin"><p><l>naparêz</l><r>parastin</r></p><par n="nak/e__vblex_tv"/></e>
<e lm="parastin"><p><l>neparêz</l><r>parastin</r></p><par n="nek/e__vblex_tv"/></e>
This would give segmentations like biparêz>in
for bi>parêz>in
, so it would be nice to be able to give explicit morpheme boundaries. Option (1) is to use one of the few single letters that are left, c d f h k m n o q t u v w x y z
; here is what the code looks like with each of them:
examples.txt
Ideally we could come up with something with a good mnemonic too.
<m/>: muga "border" (Basque)
<f/>: frontera "border" (Catalan), finis "border" (Latin)
<h/>: hranice "border" (Czech), határ "border" (Hungarian)
Another option (2) would be to use an XML entity, or (3) a simple Unicode symbol, like ¦
or ‖
.
Currently, lt-comp can only compile AT&T files containing a single FST.
It would be better if it could also compile multiple disjoint FSTs encoded in the same AT&T file.
Example:
$ cat transducer.att
--
0 1 a b 0.000000
1 1.000000
--
0 1 b c 0.000000
1 1.000000
--
The current behaviour is:
$ lt-comp lr transducer.att transducer.bin
Error: invalid format 'transducer.att'.
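Until lt-comp supports this natively, a wrapper could split such a file on the `--` separator lines and compile each chunk separately. A minimal sketch of the splitting step (the separator convention is the one from the example above; invoking lt-comp on each chunk is left to the caller):

```python
def split_att(text: str) -> list[str]:
    """Split a multi-FST AT&T file into individual FST chunks,
    using lines containing only '--' as separators."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.strip() == "--":
            if current:
                chunks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

att = "--\n0\t1\ta\tb\t0.000000\n1\t1.000000\n--\n0\t1\tb\tc\t0.000000\n1\t1.000000\n--"
print(len(split_att(att)))  # 2
```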
It would be cool to be able to define, on a per-transducer (i.e. language-specific) basis, certain characters which can appear anywhere in the stream but don't affect the analysis.
This could possibly be used for soft hyphen,[1] for tatweel[2] and various kinds of zero-width joiners/non-joiners and floating punctuation symbols, e.g. Armenian interrogative signs.[3]
There are many open questions regarding what exact form this should have and what kind of behaviours we should support.
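One possible semantics, sketched as a pre-filter (the character list is illustrative, and a real implementation would presumably need to remember the dropped characters so generation can reinsert them):

```python
# Hypothetical pre-filter: drop "transparent" characters before
# analysis: soft hyphen, tatweel, zero-width joiner/non-joiner.
TRANSPARENT = {"\u00ad", "\u0640", "\u200d", "\u200c"}

def strip_transparent(token: str) -> str:
    return "".join(ch for ch in token if ch not in TRANSPARENT)

print(strip_transparent("ex\u00adample"))  # example
```

Floating punctuation like the Armenian interrogative sign is harder, since it carries meaning and can't simply be discarded.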
"lt-comp --var-left" is used in several Makefiles, but it does not appear in the output of
lt-comp --help
fran@matxine:~/source/apertium/staging/apertium-mlt-heb$ make
apertium-validate-dictionary apertium-mlt-heb.mlt-heb.dix
lt-comp rl apertium-mlt-heb.mlt-heb.dix heb-mlt.autobil.bin
lt-trim .deps/heb.automorf.bin heb-mlt.autobil.bin heb-mlt.automorf.bin
Error: empty set of final states
Makefile:764: recipe for target 'heb-mlt.automorf.bin' failed
make: *** [heb-mlt.automorf.bin] Error 1
fran@matxine:~/source/apertium/staging/apertium-mlt-heb$ ls -lsrth heb-mlt.autobil.bin
8,0K -rw-r--r-- 1 fran fran 6,6K oct 27 03:34 heb-mlt.autobil.bin
fran@matxine:~/source/apertium/staging/apertium-mlt-heb$ lt-print heb-mlt.autobil.bin
Violació de segment (Catalan: "Segmentation fault")