accek / pantera-tagger

PANTERA Morphosyntactic Tagger for Polish

License: GNU General Public License v3.0

Shell 41.00% Java 0.89% Python 1.25% C++ 56.79% C 0.08%

pantera-tagger's People

dlozinski rkorszun magsoch

pantera-tagger's Issues

Support reading plain text.

Support reading plain text.

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 11:09

Core dumped on simple training

An attempt to train pantera on a cesAna file (attached):

pantera  --create-engine moj.btengine --training-data input.txt.xml

ends with:

OpenMP parallelism enabled (processors: 1, dynamic thread allocation: 0)
Scanning for training files ... 1 found
[6308] LEXER input.txt.xml
Sending data to all worker processes ...
Training unigram tagger for phase 0...
Training unigram tagger for phase 1...
Preparing initial tagging for phase 0 ...  done.
[jp-VirtualBox:06308] *** Process received signal ***
[jp-VirtualBox:06308] Signal: Naruszenie ochrony pamięci (11)
[jp-VirtualBox:06308] Signal code: Address not mapped (1)
[jp-VirtualBox:06308] Failing at address: (nil)
[jp-VirtualBox:06308] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) 
[0x7f90e2fa0cb0]
[jp-VirtualBox:06308] *** End of error message ***
Naruszenie ochrony pamięci (core dumped)

Pantera version on Ubuntu 12.04.1 LTS:

OpenMP parallelism enabled (processors: 1, dynamic thread allocation: 0)
pantera-tagger 0.9

Original issue reported on code.google.com by [email protected] on 9 Oct 2012 at 11:20

Attachments:

input.txt.xml

empty disamb section


Hard to reproduce: when I ran pantera again on the same two files, the problem 
disappeared.

Among hundreds of thousands of files annotated for the National Corpus, two did 
not get a proper disamb. The <fs> element was empty (had attributes but no content).

Original issue reported on code.google.com by [email protected] on 5 Oct 2011 at 11:14

Poor sentence splitting

Sentence splitting does not perform very well. It can be easily improved by 
using more sophisticated rules for the Segment program. Details can be found here: 
http://zil.ipipan.waw.pl/Segment

Original issue reported on code.google.com by [email protected] on 17 May 2012 at 7:06

XML entity support

The tagger behaves incorrectly if the text to be tagged contains XML entities 
such as &apos; or &amp;. For example, if the orth Kennedy'ego is correctly 
encoded in XML as:

<orth>Kennedy&apos;ego</orth>

it will end up being treated literally as Kennedy(ampersand)apos(semicolon)ego, 
which is then split (presumably by Morfeusz) into separate tokens (Kennedy & 
apos ; ego), for an even worse effect.

I believe the tagger should support at least the basic XML entities that 
technically should always be escaped.
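
A minimal sketch of the requested decoding step, written in Python rather than Pantera's C++. It assumes the orth value is available as a raw string; `decode_orth` is a hypothetical helper and not part of Pantera.

```python
from xml.sax.saxutils import unescape

# unescape() handles &amp;, &lt; and &gt; itself; the other two predefined
# XML entities have to be passed in explicitly.
EXTRA_ENTITIES = {"&apos;": "'", "&quot;": '"'}


def decode_orth(raw: str) -> str:
    """Replace the five predefined XML entities in a raw <orth> value."""
    return unescape(raw, EXTRA_ENTITIES)
```

With this, `decode_orth("Kennedy&apos;ego")` yields the single orth `Kennedy'ego`, which can then be passed on to morphological analysis as one token.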

Original issue reported on code.google.com by [email protected] on 11 Oct 2010 at 10:33

xces and xces-disamb implicitly assume different lemma selection strategy

What steps will reproduce the problem?

echo "Mają kręgi." > seg.txt
~/pantera/bin/pantera -t nkjp --engine ~/pantera/engines/ultimarum-tertia-np0-6.btengine seg.txt -o xces-disamb
mv seg.txt.disamb seg.xces
~/pantera/bin/pantera -t nkjp --engine ~/pantera/engines/ultimarum-tertia-np0-6.btengine seg.txt -o xces
mv seg.txt.disamb seg.xces-sh

The standard XCES output (-o xces-disamb) implicitly assumes that only one 
(lemma,tag) pair is selected. The selection seems arbitrary.

The “sh” XCES dialect preserves both lemmata when they share the same tag.

Is this really intended?
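
The difference between the two strategies can be sketched as follows. This is an illustration in Python, not Pantera's code; the lemma and tag values for "kręgi" are assumed for the example and were not taken from actual tagger output.

```python
from collections import defaultdict

# Illustrative analyses for the token "kręgi" (assumed values).
ANALYSES = [("krąg", "subst:pl:nom:m3"), ("kręg", "subst:pl:nom:m3")]


def one_lemma_per_tag(analyses):
    """Keep only the first lemma seen for each tag (the lossy strategy)."""
    chosen = {}
    for lemma, tag in analyses:
        chosen.setdefault(tag, lemma)
    return [(lemma, tag) for tag, lemma in chosen.items()]


def all_lemmas_per_tag(analyses):
    """Keep every lemma that shares a tag."""
    grouped = defaultdict(list)
    for lemma, tag in analyses:
        grouped[tag].append(lemma)
    return dict(grouped)
```

The first function drops one of the two lemmata depending only on input order, which is the kind of arbitrary selection described above; the second keeps both.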

Original issue reported on code.google.com by [email protected] on 20 Feb 2012 at 2:52

Add option for specifying segmentation disambiguation rules

--segm-disamb-rules

Original issue reported on code.google.com by [email protected] on 2 Oct 2010 at 10:50

IPI-PAN lexer will sometimes merge input sentences

What steps will reproduce the problem?

1. Use Pantera on a large enough XCES (ipipan) corpus
2. Some sentences have been merged in the output, with the total token count 
remaining the same.

This seems to be an issue with the regular expression in ipipanlexer.h, notably 
the <chunk type=... one not matching the closing bracket of the tag. The 
attached patch fixes the problem for me.

Original issue reported on code.google.com by [email protected] on 8 Mar 2011 at 10:34

Attachments:

ipipanlexer-fix.diff

Segfaults when tagging

I tried to run pantera on a directory from input.tar.bz2 attachment:

pantera -t nkjp --no-guesser --engine /path/to/the/engine ./Posiedzenia_Plenarne-012/

It fails on one of the subdirectories (Posiedzenia_Plenarne-012-02). However, 
it doesn't fail when invoked on that subdirectory alone.

The output is in attachment.

I use the newest pantera version downloaded from SVN with morfeusz 0.7 and 
"ultimarum-tertia-np0-6.btengine" engine file.

Original issue reported on code.google.com by [email protected] on 14 Aug 2011 at 9:37

Attachments:

Ipipan tagset does not work because of hardcoded Morfeusz workarounds

I successfully trained Pantera on a corpus tagged in the IPI-PAN tagset, 
passing the relevant --tagset option. Running the resulting engine, however, 
resulted in an error complaining about a "numcol" grammatical class not being 
present in the tagset. "numcol" does not appear anywhere in the training corpus 
I used, nor does it exist in the IPI-PAN tagset as defined in Pantera (and in 
other tools using the IPI-PAN tagset such as Morfeusz or TaKIPI).

It seems that the hardcoded fixes for Morfeusz output in 
src/nlpcommon/morfeusz.h are tailored for the "new" SGJP Morfeusz, and are not 
appropriate for the old version. I have patched the issue locally by wrapping 
the offending block in a #ifdef, and I'm attaching the diff.

Original issue reported on code.google.com by [email protected] on 17 Jan 2011 at 9:45

Attachments:

fix-ipipan-morfeusz.diff

IPIPAN lexer assumes no tags between <tok> and <orth>

If there is anything in between, tokens are not recognized and, incidentally, 
the lexer starts to consume a lot of memory (because current_lex is never reset).
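
For comparison, a tolerant reader can be sketched like this. It is a Python illustration that uses a real XML parser instead of the lexer's pattern matching; the sample token and the `<note>` element are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical token with an extra element between <tok> and <orth>.
SAMPLE = (
    "<tok><note>x</note><orth>kot</orth>"
    "<lex><base>kot</base><ctag>subst</ctag></lex></tok>"
)


def read_orth(tok_xml: str) -> str:
    """Return the <orth> text of a <tok>, wherever <orth> sits among its children."""
    tok = ET.fromstring(tok_xml)
    orth = tok.find("orth")
    if orth is None:
        raise ValueError("token has no <orth> child")
    return orth.text or ""
```

Because each token is parsed as its own element, no state is carried over from a token that fails to match, so the unbounded growth described above would not occur in this form.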

Original issue reported on code.google.com by [email protected] on 7 Sep 2010 at 6:16

libcorpus as dependency

Pantera includes the whole TaKIPI distribution as a ‘third-party’ tool. This is 
weird and cumbersome, e.g. when a newer TaKIPI is already installed. In 
fact, only libcorpus is used. It'd be far more natural to add libcorpus as 
Pantera's dependency.

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 1:51

Missing segmentation disambiguation rule action for 'gdzieś'

... we wanted to make it separate at the beginning of a question.

Original issue reported on code.google.com by [email protected] on 2 Oct 2010 at 10:48

Empty values for optional categories generated from morfeusz's _

(Not sure if this is a bug or not)

Example: je

         <fs type="lex" xml:id="morph_p-1.3-seg_1-lex">
          <f name="base">
           <string>on</string>
          </f>
          <f name="ctag">
           <symbol value="ppron3"/>
          </f>
          <f name="msd">
           <vAlt>
            <symbol value="pl:acc:m2:ter:npraep" xml:id="morph_p-1.3-seg_1-msd"/>
            <symbol value="pl:acc:m3:ter:npraep" xml:id="morph_p-1.3-seg_2-msd"/>
            <symbol value="pl:acc:f:ter:npraep" xml:id="morph_p-1.3-seg_3-msd"/>
            <symbol value="sg:acc:n1:ter:npraep" xml:id="morph_p-1.3-seg_4-msd"/>
            <symbol value="pl:acc:n1:ter:npraep" xml:id="morph_p-1.3-seg_5-msd"/>
            <symbol value="sg:acc:n2:ter:npraep" xml:id="morph_p-1.3-seg_6-msd"/>
            <symbol value="pl:acc:n2:ter:npraep" xml:id="morph_p-1.3-seg_7-msd"/>
            <symbol value="pl:acc:p2:ter:npraep" xml:id="morph_p-1.3-seg_8-msd"/>
            <symbol value="pl:acc:p3:ter:npraep" xml:id="morph_p-1.3-seg_9-msd"/>
            <symbol value="pl:acc:m2:ter:akc:npraep" xml:id="morph_p-1.3-seg_10-msd"/>
            <symbol value="pl:acc:m3:ter:akc:npraep" xml:id="morph_p-1.3-seg_11-msd"/>
            <symbol value="pl:acc:f:ter:akc:npraep" xml:id="morph_p-1.3-seg_12-msd"/>
            <symbol value="sg:acc:n1:ter:akc:npraep" xml:id="morph_p-1.3-seg_13-msd"/>
            <symbol value="pl:acc:n1:ter:akc:npraep" xml:id="morph_p-1.3-seg_14-msd"/>
            <symbol value="sg:acc:n2:ter:akc:npraep" xml:id="morph_p-1.3-seg_15-msd"/>
            <symbol value="pl:acc:n2:ter:akc:npraep" xml:id="morph_p-1.3-seg_16-msd"/>
            <symbol value="pl:acc:p2:ter:akc:npraep" xml:id="morph_p-1.3-seg_17-msd"/>
            <symbol value="pl:acc:p3:ter:akc:npraep" xml:id="morph_p-1.3-seg_18-msd"/>
            <symbol value="pl:acc:m2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_19-msd"/>
            <symbol value="pl:acc:m3:ter:nakc:npraep" xml:id="morph_p-1.3-seg_20-msd"/>
            <symbol value="pl:acc:f:ter:nakc:npraep" xml:id="morph_p-1.3-seg_21-msd"/>
            <symbol value="sg:acc:n1:ter:nakc:npraep" xml:id="morph_p-1.3-seg_22-msd"/>
            <symbol value="pl:acc:n1:ter:nakc:npraep" xml:id="morph_p-1.3-seg_23-msd"/>
            <symbol value="sg:acc:n2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_24-msd"/>
            <symbol value="pl:acc:n2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_25-msd"/>
            <symbol value="pl:acc:p2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_26-msd"/>
            <symbol value="pl:acc:p3:ter:nakc:npraep" xml:id="morph_p-1.3-seg_27-msd"/>
           </vAlt>
          </f>
         </fs>

Original issue reported on code.google.com by [email protected] on 16 Aug 2010 at 9:11

Error messages when contradicting command line options passed

- --create-engine and --engine.
- --create-engine with no training files
- also when no --create-engine, but no input files found
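
The three checks listed above could look roughly like this. It is a Python sketch with hypothetical function and parameter names; only the `--create-engine` and `--engine` flags come from Pantera, and the message texts are illustrative.

```python
def check_options(create_engine, engine, training_files, input_files):
    """Return a list of error messages for inconsistent options.

    create_engine / engine: values of --create-engine / --engine (or None).
    The message texts are illustrative, not Pantera's actual output.
    """
    errors = []
    if create_engine and engine:
        errors.append("--create-engine and --engine cannot be used together")
    if create_engine and not training_files:
        errors.append("--create-engine given, but no training files were found")
    if not create_engine and not input_files:
        errors.append("no input files were found")
    return errors
```

Returning all messages at once, rather than stopping at the first, lets the user fix every problem in a single pass.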

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:13

string-range offsets >= 1000 printed with separator

What steps will reproduce the problem?
1. Create a text with a paragraph (<p>) longer than 1000 chars
2. Tag it

What is the expected output? What do you see instead?
Expected: <seg corresp="text_structure.xml#string-range(p-363,1007,9)" ...>
Got:  <seg corresp="text_structure.xml#string-range(p-363,1,007,9)" ...>
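
The output looks as if digit grouping were applied when the offset is formatted; that cause is an assumption, not something confirmed in the report. The Python sketch below contrasts plain and grouped formatting of the same values (both function names are hypothetical).

```python
def string_range(paragraph_id: str, offset: int, length: int) -> str:
    """Build a corresp value with plain, ungrouped decimal integers."""
    return f"text_structure.xml#string-range({paragraph_id},{offset},{length})"


def string_range_grouped(paragraph_id: str, offset: int, length: int) -> str:
    """Same, but with thousands grouping, reproducing the reported output."""
    return f"text_structure.xml#string-range({paragraph_id},{offset:,},{length:,})"
```

Since the comma is also the argument separator in string-range(...), the grouped form turns three arguments into four, so offsets should be written without any grouping.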

Original issue reported on code.google.com by [email protected] on 16 May 2011 at 11:18

Make NKJP writers support writing data which came not necessarily from NKJP readers

This is the only way to produce NKJP output from, say, plain text.

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 12:28

Verify NKJP output format

Recently the specs for the output format changed.
This needs some adjustments.

Also: formally validate outputs.
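
As a first step towards validating outputs, a well-formedness check can be sketched in Python as below. Note the limitation: this only checks that the XML parses; validating against the NKJP schemas would need a RELAX NG validator, which this sketch does not include.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (no schema validation)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

A check like this would already flag output in which an element is left unclosed.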

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:09

Removing Oxygen processing instructions

ann_segmentation.xml and ann_morphosyntax.xml contain:
<?oxygen RNGSchema="NKJP_morphosyntax.rng" type="xml"?>
<?oxygen RNGSchema="NKJP_segmentation.rng" type="xml"?>

It's probably better to remove them.
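
Until the writers stop emitting them, the instructions can be stripped in post-processing. The following is a Python sketch with a hypothetical helper name; it assumes the processing instructions look exactly like the two quoted above.

```python
import re

# Matches an <?oxygen ...?> processing instruction and any trailing whitespace.
OXYGEN_PI = re.compile(r"<\?oxygen\b.*?\?>\s*", re.DOTALL)


def strip_oxygen_pis(xml_text: str) -> str:
    """Remove oXygen processing instructions, leaving everything else intact."""
    return OXYGEN_PI.sub("", xml_text)
```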

Original issue reported on code.google.com by [email protected] on 4 Nov 2011 at 4:20

unneeded nkjp:nps after a <lb/>

What steps will reproduce the problem?
1. Tag a text_structure that has an <lb/>, e.g.

nie wywieraj nacisku na innych, żeby wznosili toasty<lb/>nie zmuszaj nikogo do 
picia alkoholu

What I see in segmentation is:
      <!-- toasty -->
      <seg corresp="text_structure.xml#string-range(p-151,46,6)" xml:id="segm_p-151.2471-seg"/>
      <!-- nie -->
      <seg corresp="text_structure.xml#string-range(p-151,52,3)" nkjp:nps="true" xml:id="segm_p-151.2472-seg"/>
      <!-- zmuszaj -->
      <seg corresp="text_structure.xml#string-range(p-151,56,7)" xml:id="segm_p-151.2473-seg"/>

while it would make more sense if there was no nkjp:nps="true" for the second 
segment, so that the 'projected' text is "toasty nie zmuszaj", i.e. when we 
lose information about newline (that is inevitable here), let's change it to a 
space, not to an empty string.
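
The effect on the projected text can be sketched as follows. This is a Python illustration with a hypothetical helper; it assumes the projection inserts a space before every segment except those marked nkjp:nps="true".

```python
def project_text(segments):
    """Join segment texts, omitting the space before segments marked nps.

    segments: list of (text, nps) pairs, where nps mirrors nkjp:nps="true".
    """
    out = []
    for text, nps in segments:
        if out and not nps:
            out.append(" ")
        out.append(text)
    return "".join(out)
```

With the flag set on "nie", as in the output above, the three segments project to "toastynie zmuszaj"; without it they project to "toasty nie zmuszaj".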

Original issue reported on code.google.com by [email protected] on 20 May 2011 at 11:14

adjust MORFEUSZ_TAGSET


The wiki says:

If you use different version of Morfeusz, you may need to adjust some paths in 
third_party/morfeusz/Makefile.am. If the version of Morfeusz used emits the new 
tagset, adjust MORFEUSZ_TAGSET in src/nlpcommon/morfeusz.h to nkjp.

while, as far as I understand, now NKJP tagset is the default, and you need to 
uncomment #define MORFEUSZ_IPI to use the old one.

Original issue reported on code.google.com by [email protected] on 23 Aug 2011 at 10:32

Add option to skip tagging

this makes sense.

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 12:27

will not overwrite just-created stats.h at 'make install'

test -z "/home/eliasz/workspace/pantera/include/nlpcommon" || /bin/mkdir -p 
"/home/eliasz/workspace/pantera/include/nlpcommon"
 /usr/bin/install -c -m 644 exception.h ipipanlexer.h lexer.h progress.h tag.h util.h writer.h tagset.h category.h pos.h scorer.h cascorer.h finderrors.h datwriter.h datlexer.h stats.h poliqarp-weights.h category-weights.h stats.h lexemesfilter.h plaintextwriter.h plaintextlexer.h morfeusz.h nkjptextlexer.h nkjplexerdata.h nkjpwriter.h libsegmentsentencer.h tagset_convert.h polish_tagset_convert.h _pstream.h '/home/eliasz/workspace/pantera/include/nlpcommon'
/usr/bin/install: will not overwrite just-created 
`/home/eliasz/workspace/pantera/include/nlpcommon/stats.h' with `stats.h'
make[4]: *** [install-libnlpcommonincludeHEADERS] Error 1
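
The header list passed to install above names stats.h twice, which matches the error message. Where the duplicate originates in the build files is not stated here, so treat that as an open question. A small Python sketch for spotting such duplicates in a list of file names (the helper name is hypothetical):

```python
from collections import Counter


def duplicated_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]
```

Running it over the whitespace-split header list from the install command reports `stats.h`.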

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 2:07

Do not crash if tagging fails on some inputs

... progress as far as possible.
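
The requested behaviour can be sketched in Python as below. `tag_one` is a hypothetical callable standing in for the per-file tagging step, and the sketch only covers failures that surface as exceptions; a crash such as a segmentation fault would need process-level isolation instead.

```python
def tag_all(paths, tag_one):
    """Run tag_one on each path; collect failures instead of aborting.

    tag_one is a hypothetical callable standing in for the tagging step.
    """
    done, failed = [], []
    for path in paths:
        try:
            tag_one(path)
        except Exception as exc:  # keep going, but remember what went wrong
            failed.append((path, str(exc)))
        else:
            done.append(path)
    return done, failed
```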

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 11:46

Freeze instead of error message when file cannot be created

What steps will reproduce the problem?
1. For instance, chmod -R u-w DIRECTORY
2. run Pantera on the DIRECTORY


What is the expected output? What do you see instead?
I'd expect an error message. Instead, the program just freezes (still using 
CPU) until killed with a signal or something.

What version of the product are you using? On what operating system?
RHEL5, but I expect the same behaviour anywhere (e.g. when the disk is full, or 
if the OS doesn't support access rights)
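
The expected behaviour, reporting the failure instead of hanging, can be sketched in Python like this. The helper name and message text are hypothetical.

```python
def try_create(path: str):
    """Try to create path for writing; return an error message or None."""
    try:
        with open(path, "w", encoding="utf-8"):
            pass
    except OSError as exc:
        # Covers missing directories, missing permissions, a full disk, etc.
        return f"cannot create {path}: {exc.strerror}"
    return None
```

A caller would print the returned message and skip or abort that input rather than retrying.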



Original issue reported on code.google.com by [email protected] on 22 Apr 2011 at 10:02

Support reading NKJP ann_morphosyntax.xml

This may be very useful for training the tagger.

Original issue reported on code.google.com by [email protected] on 2 Sep 2010 at 4:28

Double 'ign' when word not known to odgadywacz

Double 'ign' when word not known to odgadywacz

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:06

Tagset translation needs more work

(in nlpcommon/polish_tagset_convert.h)

- incompatible categories should not be remapped to ign
- investigate if there are any incompatible grammatical category values

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:08

Missing segmentation disambiguation

Missing segmentation disambiguation

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:06

Remove google-protobuf format support

... it's not even maintained and no one uses it.

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 11:56

Incorrect treatment of comments in a <p> (TEI NKJP)

Comment inside a paragraph:

<p xml:id="p-13"><!--<4>--></p>

There are definitely no segments here, yet Pantera, for reasons I don't 
know, identifies (and tags) the following:

<!-- -\- -->
<seg corresp="text_structure.xml#string-range(p-13,0,2)" 
xml:id="segm_p-13.697-seg"/>
<!-- > -->
<seg corresp="text_structure.xml#string-range(p-13,2,1)" nkjp:nps="true" 
xml:id="segm_p-13.698-seg"/>


I haven't checked what happens when the comment is not the only content of the 
paragraph. It should nevertheless work in that case too, as if the commented 
text were not there.
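
For reference, an XML parser that discards comments sees no text at all in this paragraph. The Python sketch below shows this with the standard library parser; it illustrates the expected result and is not Pantera's lexer.

```python
import xml.etree.ElementTree as ET

# The paragraph from the report: its only content is a comment.
PARAGRAPH = '<p xml:id="p-13"><!--<4>--></p>'


def paragraph_text(p_xml: str) -> str:
    """Return the character data of a paragraph; comments are not text."""
    return "".join(ET.fromstring(p_xml).itertext())
```

Here `paragraph_text(PARAGRAPH)` is the empty string, so there is nothing to segment.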

Original issue reported on code.google.com by [email protected] on 15 Jul 2011 at 10:13

Missing m4 files

Attached patch includes the missing files, for convenience. Taken from:
https://svn.apache.org/repos/asf/incubator/mesos/trunk/m4/acx_pthread.m4
and
http://stuff.mit.edu/afs/athena/software/elmer/current/src/hutiter/acx_mpi.m4

Original issue reported on code.google.com by joregan on 19 Mar 2012 at 2:06

Attachments:

missing-m4.patch

Required locale warning

What I get is:

Loading engine ...
Sending data to all worker processes ...
Scanning for input files ... 1 found
[5252] LEXER /tmp/text_structure.xml
[5252] SENTENCER /tmp/text_structure.xml
Warning: system does not support required locale 'pl_PL'. We will continue with 
the 'en_US.UTF-8' locale, but things may not work as expected.
[5252] MORPH /tmp/text_structure.xml
[5252] SEGM-DISAMB /tmp/text_structure.xml
[5252] PRE-TAGGER /tmp/text_structure.xml
[5252] TAGGER /tmp/text_structure.xml
[5252] POST-TAGGER /tmp/text_structure.xml
[5252] WRITER /tmp/text_structure.xml
[5252] DONE /tmp/text_structure.xml
Processing ...  done.
All done.


while I do have the appropriate locale installed

$ locale -a | grep pl_PL
pl_PL.utf8

and the locale test program described on the wiki page compiles and runs.
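
The warning names 'pl_PL' while the system lists 'pl_PL.utf8', so one possible cause (an assumption) is that the requested name lacks an encoding suffix. A fallback over several spellings can be sketched in Python; the helper name and the candidate list are hypothetical.

```python
import locale


def first_supported_locale(candidates):
    """Return the first name accepted by setlocale(), or None."""
    for name in candidates:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            continue
        return name
    return None
```

It could be called as `first_supported_locale(["pl_PL", "pl_PL.UTF-8", "pl_PL.utf8"])`, with a warning printed only if the result is None.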

The system is Ubuntu 10.04 LTS.

Original issue reported on code.google.com by [email protected] on 24 Aug 2011 at 10:15

Bad XML produced - unclosed <p>'s

What steps will reproduce the problem?
1. Running Pantera (standard parameters, incl. model) on the provided 
text_structures.

What is the expected output? What do you see instead?

Some paragraphs are not closed ("</p>" line is missing).


What version of the product are you using? On what operating system?
0.9, from package 0.9-r154-1, Ubuntu 10.04


Original issue reported on code.google.com by [email protected] on 12 Aug 2013 at 12:49

Attachments:

lostPs.tgz
