
pantera-tagger's People

Contributors

accek

pantera-tagger's Issues

Core dumped on simple training

An attempt to train Pantera on a cesAna file (attached):

pantera  --create-engine moj.btengine --training-data input.txt.xml

ends with:

OpenMP parallelism enabled (processors: 1, dynamic thread allocation: 0)
Scanning for training files ... 1 found
[6308] LEXER input.txt.xml
Sending data to all worker processes ...
Training unigram tagger for phase 0...
Training unigram tagger for phase 1...
Preparing initial tagging for phase 0 ...  done.
[jp-VirtualBox:06308] *** Process received signal ***
[jp-VirtualBox:06308] Signal: Naruszenie ochrony pamięci (11)
[jp-VirtualBox:06308] Signal code: Address not mapped (1)
[jp-VirtualBox:06308] Failing at address: (nil)
[jp-VirtualBox:06308] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f90e2fa0cb0]
[jp-VirtualBox:06308] *** End of error message ***
Naruszenie ochrony pamięci (core dumped)

Pantera version on Ubuntu 12.04.1 LTS:

OpenMP parallelism enabled (processors: 1, dynamic thread allocation: 0)
pantera-tagger 0.9






Original issue reported on code.google.com by [email protected] on 9 Oct 2012 at 11:20

Attachments:

empty disamb section


Hard to reproduce: when I ran Pantera again on the same two files, the problem 
disappeared.

Among the hundreds of thousands of files annotated for the National Corpus, two 
did not get a proper disamb section. The <fs> element was empty (it had 
attributes but no content).
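
Since the problem is intermittent, one way to catch it is to scan the output for empty <fs> elements after each run. The following is a minimal sketch in Python, not part of Pantera; the helper names and the sample input are hypothetical, and real NKJP files use namespaces, which is why the tag is matched by local name.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def empty_fs_elements(xml_text):
    """Return <fs> elements that have no child elements and no text."""
    root = ET.fromstring(xml_text)
    return [
        el for el in root.iter()
        if isinstance(el.tag, str)
        and local_name(el.tag) == "fs"
        and len(el) == 0
        and not (el.text or "").strip()
    ]

# Hypothetical sample: the first <fs> has attributes but no content.
sample = '<seg><fs type="morph"/><fs type="lex"><f name="base"/></fs></seg>'
for el in empty_fs_elements(sample):
    print("empty <fs> with attributes:", el.attrib)
```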


Original issue reported on code.google.com by [email protected] on 5 Oct 2011 at 11:14

Poor sentence splitting

Sentence splitting does not perform very well. It could easily be improved by 
using more sophisticated rules for the Segment program. Details can be found here: 
http://zil.ipipan.waw.pl/Segment

Original issue reported on code.google.com by [email protected] on 17 May 2012 at 7:06

XML entity support

The tagger behaves incorrectly if the text to be tagged contains XML entities 
such as &apos; or &amp;. For example, if the orth Kennedy'ego is correctly 
encoded in XML as:

<orth>Kennedy&apos;ego</orth>

it will end up being treated literally as Kennedy(ampersand)apos(semicolon)ego, 
which is then split (presumably by Morfeusz) into separate tokens (Kennedy & 
apos ; ego), for an even worse effect.

I believe the tagger should support at least the basic XML entities that 
technically should always be escaped.
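
As a rough illustration of the requested behaviour (a Python sketch, not Pantera's code), the five predefined XML entities can be decoded before the text reaches morphological analysis. It uses the standard `xml.sax.saxutils.unescape`; the helper name `decode_basic_entities` is hypothetical.

```python
from xml.sax.saxutils import unescape

# unescape() handles &amp;, &lt; and &gt; by default; &apos; and &quot;
# have to be passed in explicitly.
EXTRA_ENTITIES = {"&apos;": "'", "&quot;": '"'}

def decode_basic_entities(text):
    """Decode the five predefined XML entities in element content."""
    return unescape(text, EXTRA_ENTITIES)

print(decode_basic_entities("Kennedy&apos;ego"))
```

Because `unescape` replaces `&amp;` last, a doubly escaped input such as `&amp;apos;` is decoded only one level, to `&apos;`.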


Original issue reported on code.google.com by [email protected] on 11 Oct 2010 at 10:33

xces and xces-disamb implicitly assume different lemma selection strategy

What steps will reproduce the problem?

echo "Mają kręgi." > seg.txt
~/pantera/bin/pantera -t nkjp --engine ~/pantera/engines/ultimarum-tertia-np0-6.btengine seg.txt -o xces-disamb
mv seg.txt.disamb seg.xces
~/pantera/bin/pantera -t nkjp --engine ~/pantera/engines/ultimarum-tertia-np0-6.btengine seg.txt -o xces
mv seg.txt.disamb seg.xces-sh

The standard XCES output (-o xces-disamb) implicitly assumes that only one 
(lemma,tag) pair is selected. The selection seems arbitrary.

The “sh” XCES dialect preserves both lemmata when they have the same tag.

Is this really intended?

Original issue reported on code.google.com by [email protected] on 20 Feb 2012 at 2:52

IPI-PAN lexer will sometimes merge input sentences

What steps will reproduce the problem?

1. Use Pantera on a large enough XCES (ipipan) corpus
2. Some sentences have been merged in the output, with the total token count 
remaining the same.

This seems to be an issue with the regular expression in ipipanlexer.h, notably 
the <chunk type=... one not matching the closing bracket of the tag. The 
attached patch fixes the problem for me.

Original issue reported on code.google.com by [email protected] on 8 Mar 2011 at 10:34

Attachments:

Segfaults when tagging

I tried to run Pantera on a directory from the attached input.tar.bz2:

pantera -t nkjp --no-guesser --engine /path/to/the/engine ./Posiedzenia_Plenarne-012/

It fails on one of the subdirectories (Posiedzenia_Plenarne-012-02). However, 
it doesn't fail when invoked on that subdirectory alone.

The output is in attachment.

I am using the newest Pantera version downloaded from SVN, with Morfeusz 0.7 
and the "ultimarum-tertia-np0-6.btengine" engine file.






Original issue reported on code.google.com by [email protected] on 14 Aug 2011 at 9:37

Attachments:

Ipipan tagset does not work because of hardcoded Morfeusz workarounds

I successfully trained Pantera on a corpus tagged in the IPI-PAN tagset, 
passing the relevant --tagset option. Running the resulting engine, however, 
resulted in an error complaining about a "numcol" grammatical class not being 
present in the tagset. "numcol" does not appear anywhere in the training corpus 
I used, nor does it exist in the IPI-PAN tagset as defined in Pantera (and in 
other tools using the IPI-PAN tagset such as Morfeusz or TaKIPI).

It seems that the hardcoded fixes for Morfeusz output in 
src/nlpcommon/morfeusz.h are tailored for the "new" SGJP Morfeusz, and are not 
appropriate for the old version. I have patched the issue locally by wrapping 
the offending block in an #ifdef, and I'm attaching the diff.

Original issue reported on code.google.com by [email protected] on 17 Jan 2011 at 9:45

Attachments:

libcorpus as dependency

Pantera includes the whole TaKIPI distribution as a ‘third-party’ tool. This is 
weird and cumbersome, e.g. when a newer TaKIPI is already installed. In 
fact, only libcorpus is used. It would be far more natural to add libcorpus as 
a dependency of Pantera.

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 1:51

Empty values for optional categories generated from morfeusz's _

(Not sure if this is a bug or not)

Example: je

         <fs type="lex" xml:id="morph_p-1.3-seg_1-lex">
          <f name="base">
           <string>on</string>
          </f>
          <f name="ctag">
           <symbol value="ppron3"/>
          </f>
          <f name="msd">
           <vAlt>
            <symbol value="pl:acc:m2:ter:npraep" xml:id="morph_p-1.3-seg_1-msd">
            <symbol value="pl:acc:m3:ter:npraep" xml:id="morph_p-1.3-seg_2-msd">
            <symbol value="pl:acc:f:ter:npraep" xml:id="morph_p-1.3-seg_3-msd">
            <symbol value="sg:acc:n1:ter:npraep" xml:id="morph_p-1.3-seg_4-msd">
            <symbol value="pl:acc:n1:ter:npraep" xml:id="morph_p-1.3-seg_5-msd">
            <symbol value="sg:acc:n2:ter:npraep" xml:id="morph_p-1.3-seg_6-msd">
            <symbol value="pl:acc:n2:ter:npraep" xml:id="morph_p-1.3-seg_7-msd">
            <symbol value="pl:acc:p2:ter:npraep" xml:id="morph_p-1.3-seg_8-msd">
            <symbol value="pl:acc:p3:ter:npraep" xml:id="morph_p-1.3-seg_9-msd">
            <symbol value="pl:acc:m2:ter:akc:npraep" xml:id="morph_p-1.3-seg_10-msd">
            <symbol value="pl:acc:m3:ter:akc:npraep" xml:id="morph_p-1.3-seg_11-msd">
            <symbol value="pl:acc:f:ter:akc:npraep" xml:id="morph_p-1.3-seg_12-msd">
            <symbol value="sg:acc:n1:ter:akc:npraep" xml:id="morph_p-1.3-seg_13-msd">
            <symbol value="pl:acc:n1:ter:akc:npraep" xml:id="morph_p-1.3-seg_14-msd">
            <symbol value="sg:acc:n2:ter:akc:npraep" xml:id="morph_p-1.3-seg_15-msd">
            <symbol value="pl:acc:n2:ter:akc:npraep" xml:id="morph_p-1.3-seg_16-msd">
            <symbol value="pl:acc:p2:ter:akc:npraep" xml:id="morph_p-1.3-seg_17-msd">
            <symbol value="pl:acc:p3:ter:akc:npraep" xml:id="morph_p-1.3-seg_18-msd">
            <symbol value="pl:acc:m2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_19-msd">
            <symbol value="pl:acc:m3:ter:nakc:npraep" xml:id="morph_p-1.3-seg_20-msd">
            <symbol value="pl:acc:f:ter:nakc:npraep" xml:id="morph_p-1.3-seg_21-msd">
            <symbol value="sg:acc:n1:ter:nakc:npraep" xml:id="morph_p-1.3-seg_22-msd">
            <symbol value="pl:acc:n1:ter:nakc:npraep" xml:id="morph_p-1.3-seg_23-msd">
            <symbol value="sg:acc:n2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_24-msd">
            <symbol value="pl:acc:n2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_25-msd">
            <symbol value="pl:acc:p2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_26-msd">
            <symbol value="pl:acc:p3:ter:nakc:npraep" xml:id="morph_p-1.3-seg_27-msd">
           </vAlt>
          </f>
         </fs>

Original issue reported on code.google.com by [email protected] on 16 Aug 2010 at 9:11

string-range offsets >= 1000 printed with separator

What steps will reproduce the problem?
1. Create a text with a paragraph (<p>) longer than 1000 chars
2. Tag it

What is the expected output? What do you see instead?
Expected: <seg corresp="text_structure.xml#string-range(p-363,1007,9)" ...>
Got:  <seg corresp="text_structure.xml#string-range(p-363,1,007,9)" ...>
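
The report does not name the cause. One plausible explanation (an assumption) is that the offset is written through number formatting that applies digit grouping. The Python sketch below only illustrates the difference between plain and grouped integer formatting; both function names are hypothetical.

```python
def string_range(paragraph_id, offset, length):
    """Build the pointer with plain integer formatting (no digit grouping)."""
    return "text_structure.xml#string-range({},{},{})".format(
        paragraph_id, offset, length
    )

def string_range_grouped(paragraph_id, offset, length):
    """Reproduce the reported symptom by grouping digits in the offset."""
    return "text_structure.xml#string-range({},{:,},{})".format(
        paragraph_id, offset, length
    )

print(string_range("p-363", 1007, 9))
print(string_range_grouped("p-363", 1007, 9))
```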


Original issue reported on code.google.com by [email protected] on 16 May 2011 at 11:18

Verify NKJP output format

Recently the specs for the output format changed.
This needs some adjustments.

Also: formally validate outputs.

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:09

Removing Oxygen processing instructions

ann_segmentation.xml and ann_morphosyntax.xml contain:
<?oxygen RNGSchema="NKJP_morphosyntax.rng" type="xml"?>
<?oxygen RNGSchema="NKJP_segmentation.rng" type="xml"?>

It's probably better to remove them.
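
Until the writer stops emitting them, the instructions can be stripped in a post-processing step. This is a minimal Python sketch under the assumption that each instruction looks like the two lines quoted above; the function name is hypothetical.

```python
import re

# Matches one <?oxygen ...?> processing instruction plus the rest of its line.
OXYGEN_PI = re.compile(r"<\?oxygen\b.*?\?>[ \t]*\r?\n?", re.DOTALL)

def strip_oxygen_pis(xml_text):
    """Remove Oxygen processing instructions, leaving everything else intact."""
    return OXYGEN_PI.sub("", xml_text)

document = (
    '<?xml version="1.0"?>\n'
    '<?oxygen RNGSchema="NKJP_segmentation.rng" type="xml"?>\n'
    "<teiCorpus/>\n"
)
print(strip_oxygen_pis(document), end="")
```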

Original issue reported on code.google.com by [email protected] on 4 Nov 2011 at 4:20

unneeded nkjp:nps after a <lb/>

What steps will reproduce the problem?
1. Tag a text_structure that has an <lb/>, e.g.

nie wywieraj nacisku na innych, żeby wznosili toasty<lb/>nie zmuszaj nikogo do 
picia alkoholu

What I see in segmentation is:
      <!-- toasty -->
      <seg corresp="text_structure.xml#string-range(p-151,46,6)" xml:id="segm_p-151.2471-seg"/>
      <!-- nie -->
      <seg corresp="text_structure.xml#string-range(p-151,52,3)" nkjp:nps="true" xml:id="segm_p-151.2472-seg"/>
      <!-- zmuszaj -->
      <seg corresp="text_structure.xml#string-range(p-151,56,7)" xml:id="segm_p-151.2473-seg"/>

It would make more sense to have no nkjp:nps="true" on the second segment, so 
that the 'projected' text is "toasty nie zmuszaj". Since the information about 
the newline is inevitably lost here, it should be turned into a space, not into 
an empty string.
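
The effect of the flag on the projected text can be sketched as follows. This is an illustrative Python snippet, not Pantera's code; it assumes nkjp:nps="true" means "no preceding space", and the function name and the tuple representation of segments are hypothetical.

```python
def project_text(segments):
    """Rebuild text from (orth, nps) pairs; nps=True means no preceding space."""
    parts = []
    for index, (orth, nps) in enumerate(segments):
        if index > 0 and not nps:
            parts.append(" ")
        parts.append(orth)
    return "".join(parts)

# Segmentation as reported: "nie" is flagged nps because it follows <lb/>.
reported = [("toasty", False), ("nie", True), ("zmuszaj", False)]
# Segmentation as proposed: the <lb/> is treated like a space.
proposed = [("toasty", False), ("nie", False), ("zmuszaj", False)]

print(project_text(reported))
print(project_text(proposed))
```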

Original issue reported on code.google.com by [email protected] on 20 May 2011 at 11:14

adjust MORFEUSZ_TAGSET


The wiki says:

If you use different version of Morfeusz, you may need to adjust some paths in 
third_party/morfeusz/Makefile.am. If the version of Morfeusz used emits the new 
tagset, adjust MORFEUSZ_TAGSET in src/nlpcommon/morfeusz.h to nkjp.

However, as far as I understand, the NKJP tagset is now the default, and you 
need to uncomment #define MORFEUSZ_IPI to use the old one.

Original issue reported on code.google.com by [email protected] on 23 Aug 2011 at 10:32

will not overwrite just-created stats.h at 'make install'

test -z "/home/eliasz/workspace/pantera/include/nlpcommon" || /bin/mkdir -p "/home/eliasz/workspace/pantera/include/nlpcommon"
 /usr/bin/install -c -m 644 exception.h ipipanlexer.h lexer.h progress.h tag.h util.h writer.h tagset.h category.h pos.h scorer.h cascorer.h finderrors.h datwriter.h datlexer.h stats.h poliqarp-weights.h category-weights.h stats.h lexemesfilter.h plaintextwriter.h plaintextlexer.h morfeusz.h nkjptextlexer.h nkjplexerdata.h nkjpwriter.h libsegmentsentencer.h tagset_convert.h polish_tagset_convert.h _pstream.h '/home/eliasz/workspace/pantera/include/nlpcommon'
/usr/bin/install: will not overwrite just-created `/home/eliasz/workspace/pantera/include/nlpcommon/stats.h' with `stats.h'
make[4]: *** [install-libnlpcommonincludeHEADERS] Error 1

Original issue reported on code.google.com by [email protected] on 6 Sep 2010 at 2:07

Freeze instead of error message when file cannot be created

What steps will reproduce the problem?
1. For instance, chmod -R u-w DIRECTORY
2. run Pantera on the DIRECTORY


What is the expected output? What do you see instead?
I'd expect an error message. Instead, the program just freezes (still using 
CPU) until killed with a signal or something.
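
The expected behaviour can be sketched like this (Python for illustration only; Pantera's writer is not shown in this report, and the function name `open_output` is hypothetical): any failure to create the output file is reported immediately and ends the program.

```python
import sys

def open_output(path):
    """Open an output file for writing, or exit with a clear error message."""
    try:
        return open(path, "w", encoding="utf-8")
    except OSError as error:
        # Covers missing write permission, a full disk, a missing directory, etc.
        sys.exit("error: cannot create {}: {}".format(path, error.strerror))
```

`sys.exit` with a string prints the message to standard error and exits with status 1, so a batch run stops instead of hanging.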

What version of the product are you using? On what operating system?
RHEL5, but I expect the same behaviour anywhere (e.g. when the disk is full, if 
the OS doesn't support access rights)



Original issue reported on code.google.com by [email protected] on 22 Apr 2011 at 10:02

Tagset translation needs more work

(in nlpcommon/polish_tagset_convert.h)

- incompatible categories should not be remapped to ign
- investigate if there are any incompatible grammatical category values

Original issue reported on code.google.com by [email protected] on 11 Aug 2010 at 8:08

Incorrect treatment of comments in a <p> (TEI NKJP)

Comment inside a paragraph:

<p xml:id="p-13"><!--<4>--></p>

There are definitely no segments here, yet Pantera, for a reason I don't 
know, identifies (and tags) the following:

<!-- -\- -->
<seg corresp="text_structure.xml#string-range(p-13,0,2)" xml:id="segm_p-13.697-seg"/>
<!-- > -->
<seg corresp="text_structure.xml#string-range(p-13,2,1)" nkjp:nps="true" xml:id="segm_p-13.698-seg"/>


Haven't checked what happens when the comment is not the only content of the 
paragraph. However, it should work in this case as well, as if the commented 
text was not there.
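
For comparison, a standard XML parser already treats comments as absent from the character data. The Python sketch below (illustrative only; the function name is hypothetical) extracts paragraph text with `xml.etree.ElementTree`, whose default parser discards comments.

```python
import xml.etree.ElementTree as ET

def paragraph_text(p_xml):
    """Return the character data of a <p> element, ignoring XML comments."""
    paragraph = ET.fromstring(p_xml)
    return "".join(paragraph.itertext())

# The paragraph from this report: a comment is its only content.
print(repr(paragraph_text('<p xml:id="p-13"><!--<4>--></p>')))
# A comment between two pieces of text.
print(repr(paragraph_text("<p>a<!--x-->b</p>")))
```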

Original issue reported on code.google.com by [email protected] on 15 Jul 2011 at 10:13

Missing m4 files

Attached patch includes the missing files, for convenience. Taken from:
https://svn.apache.org/repos/asf/incubator/mesos/trunk/m4/acx_pthread.m4
and
http://stuff.mit.edu/afs/athena/software/elmer/current/src/hutiter/acx_mpi.m4

Original issue reported on code.google.com by joregan on 19 Mar 2012 at 2:06

Attachments:

Required locale warning

What I get is:

Loading engine ...
Sending data to all worker processes ...
Scanning for input files ... 1 found
[5252] LEXER /tmp/text_structure.xml
[5252] SENTENCER /tmp/text_structure.xml
Warning: system does not support required locale 'pl_PL'. We will continue with the 'en_US.UTF-8' locale, but things may not work as expected.
[5252] MORPH /tmp/text_structure.xml
[5252] SEGM-DISAMB /tmp/text_structure.xml
[5252] PRE-TAGGER /tmp/text_structure.xml
[5252] TAGGER /tmp/text_structure.xml
[5252] POST-TAGGER /tmp/text_structure.xml
[5252] WRITER /tmp/text_structure.xml
[5252] DONE /tmp/text_structure.xml
Processing ...  done.
All done.


even though I do have the appropriate locale installed:

$ locale -a | grep pl_PL
pl_PL.utf8

and the locale test program described on the wiki page compiles and runs.
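
One possible explanation (an assumption; the report does not confirm it) is that the bare name 'pl_PL' is requested while only 'pl_PL.utf8' is installed. A tool could try several spellings before falling back, as in this Python sketch; the function name and the candidate list are hypothetical.

```python
import locale

def set_first_available_locale(candidates, category=locale.LC_ALL):
    """Try each locale name in order; return the first one the system accepts."""
    for name in candidates:
        try:
            locale.setlocale(category, name)
            return name
        except locale.Error:
            continue
    return None

# "C" is always available, so this call always selects something.
chosen = set_first_available_locale(["pl_PL.UTF-8", "pl_PL.utf8", "pl_PL", "C"])
print("using locale:", chosen)
```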

The system is Ubuntu 10.04 LTS.

Original issue reported on code.google.com by [email protected] on 24 Aug 2011 at 10:15

Bad XML produced - unclosed <p>'s

What steps will reproduce the problem?
1. Running Pantera (standard parameters, incl. model) on the provided 
text_structures.

What is the expected output? What do you see instead?

Some paragraphs are not closed ("</p>" line is missing).
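
A well-formedness check on each output file would flag this kind of problem right after tagging. Below is a minimal Python sketch using the standard `xml.etree.ElementTree` parser; the function name is hypothetical, and the check covers well-formedness only, not validity against the NKJP schemas.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text):
    """Return None if xml_text parses, else a message with line and column."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return "line {}, column {}: {}".format(line, column, error)
    return None

print(well_formedness_error("<body><p>closed</p></body>"))
print(well_formedness_error("<body><p>unclosed</body>"))
```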


What version of the product are you using? On what operating system?
0.9, from package 0.9-r154-1, Ubuntu 10.04

Original issue reported on code.google.com by [email protected] on 12 Aug 2013 at 12:49

Attachments:
