accek / pantera-tagger
PANTERA Morphosyntactic Tagger for Polish
License: GNU General Public License v3.0
Support reading plain text.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 11:09
An attempt to train Pantera on a cesAna file (attached):
pantera --create-engine moj.btengine --training-data input.txt.xml
ends with:
OpenMP parallelism enabled (processors: 1, dynamic thread allocation: 0)
Scanning for training files ... 1 found
[6308] LEXER input.txt.xml
Sending data to all worker processes ...
Training unigram tagger for phase 0...
Training unigram tagger for phase 1...
Preparing initial tagging for phase 0 ... done.
[jp-VirtualBox:06308] *** Process received signal ***
[jp-VirtualBox:06308] Signal: Naruszenie ochrony pamięci (11)
[jp-VirtualBox:06308] Signal code: Address not mapped (1)
[jp-VirtualBox:06308] Failing at address: (nil)
[jp-VirtualBox:06308] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)
[0x7f90e2fa0cb0]
[jp-VirtualBox:06308] *** End of error message ***
Naruszenie ochrony pamięci (core dumped)
Pantera version on Ubuntu 12.04.1 LTS:
OpenMP parallelism enabled (processors: 1, dynamic thread allocation: 0)
pantera-tagger 0.9
Original issue reported on code.google.com by [email protected]
on 9 Oct 2012 at 11:20
Attachments:
Hard to reproduce, when I ran pantera again on the same two files the problem
disappeared.
Among hundreds of thousands of files annotated for the National Corpus, two did not get a
proper disamb. The <fs> element was empty (had attributes but no content).
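A check along the following lines could flag such files after a run. It is only a minimal sketch, assuming the output is well-formed XML in which feature structures are fs elements (possibly namespaced); the function name empty_fs_elements is hypothetical and not part of Pantera.

```python
import xml.etree.ElementTree as ET


def empty_fs_elements(xml_text):
    """Return the attributes of every <fs> element with no children and no text."""
    root = ET.fromstring(xml_text)
    empty = []
    for elem in root.iter():
        # Strip a namespace prefix of the form "{namespace-uri}fs".
        local_name = elem.tag.rsplit("}", 1)[-1]
        has_text = bool(elem.text and elem.text.strip())
        if local_name == "fs" and len(elem) == 0 and not has_text:
            empty.append(dict(elem.attrib))
    return empty
```

Running it over each output file and reporting any non-empty result would at least make the rare failure visible, even if it stays hard to reproduce.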
Original issue reported on code.google.com by [email protected]
on 5 Oct 2011 at 11:14
Sentence splitting does not perform very well. It can easily be improved by
using more sophisticated rules for the Segment program. Details can be found here:
http://zil.ipipan.waw.pl/Segment
Original issue reported on code.google.com by [email protected]
on 17 May 2012 at 7:06
The tagger behaves incorrectly if the text to be tagged contains XML entities
such as &apos; or &amp;. For example, if the orth Kennedy'ego is correctly
encoded in XML as:
<orth>Kennedy&apos;ego</orth>
it will end up being literally treated as Kennedy(ampersand)amp(semicolon),
which is then split (presumably by Morfeusz) into separate tokens (Kennedy &
apos ; ego), for an even worse effect.
I believe the tagger should support at least the basic XML entities that
technically should always be escaped.
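To illustrate the requested behaviour, here is a minimal sketch of decoding the five predefined XML entities before the text reaches tokenisation. It uses Python's standard xml.sax.saxutils.unescape; the helper name decode_orth is hypothetical and this is not how Pantera itself reads its input.

```python
from xml.sax.saxutils import unescape

# unescape() handles &lt;, &gt; and &amp; itself; the other two predefined
# XML entities have to be passed in explicitly.
EXTRA_ENTITIES = {"&apos;": "'", "&quot;": '"'}


def decode_orth(raw):
    """Decode the five predefined XML entities in the raw content of an <orth> element."""
    return unescape(raw, EXTRA_ENTITIES)
```

With this, decode_orth("Kennedy&apos;ego") yields Kennedy'ego as a single string, so the entity never reaches the morphological analyser as separate characters.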
Original issue reported on code.google.com by [email protected]
on 11 Oct 2010 at 10:33
What steps will reproduce the problem?
echo "Mają kręgi." > seg.txt
~/pantera/bin/pantera -t nkjp --engine
~/pantera/engines/ultimarum-tertia-np0-6.btengine seg.txt -o xces-disamb
mv seg.txt.disamb seg.xces
~/pantera/bin/pantera -t nkjp --engine
~/pantera/engines/ultimarum-tertia-np0-6.btengine seg.txt -o xces
mv seg.txt.disamb seg.xces-sh
The standard XCES output (-o xces-disamb) implicitly assumes that only one
(lemma,tag) pair is selected. The selection seems arbitrary.
The “sh” XCES dialect preserves both lemmata when having the same tag.
Is this really intended?
Original issue reported on code.google.com by [email protected]
on 20 Feb 2012 at 2:52
--segm-disamb-rules
Original issue reported on code.google.com by [email protected]
on 2 Oct 2010 at 10:50
What steps will reproduce the problem?
1. Use Pantera on a large enough XCES (ipipan) corpus
2. Some sentences have been merged in the output, with the total token count
remaining the same.
This seems to be an issue with the regular expression in ipipanlexer.h, notably
the <chunk type=... one not matching the closing bracket of the tag. The
attached patch fixes the problem for me.
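The attached patch is not reproduced here. Purely as an illustration of the idea, the sketch below shows a pattern that consumes the whole opening chunk tag, including the closing bracket. The pattern and the name chunk_types are assumptions for this example and are not taken from ipipanlexer.h.

```python
import re

# Illustrative pattern only: it matches the entire opening tag, including
# the closing ">" that the report says the original expression missed.
CHUNK_OPEN = re.compile(r'<chunk\b[^>]*\btype="([^"]*)"[^>]*>')


def chunk_types(xces_text):
    """Return the type attribute of every opening <chunk> tag, in order."""
    return [m.group(1) for m in CHUNK_OPEN.finditer(xces_text)]
```

Counting the chunks of type "s" on the input and output sides is one way to confirm that no sentences were merged.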
Original issue reported on code.google.com by [email protected]
on 8 Mar 2011 at 10:34
Attachments:
I tried to run pantera on a directory from input.tar.bz2 attachment:
pantera -t nkjp --no-guesser --engine /path/to/the/engine
./Posiedzenia_Plenarne-012/
It fails on one of the subdirectories (Posiedzenia_Plenarne-012-02). However,
it doesn't fail when invoked on that subdirectory alone.
The output is in attachment.
I use the newest pantera version downloaded from SVN with morfeusz 0.7 and
"ultimarum-tertia-np0-6.btengine" engine file.
Original issue reported on code.google.com by [email protected]
on 14 Aug 2011 at 9:37
Attachments:
I successfully trained Pantera on a corpus tagged in the IPI-PAN tagset,
passing the relevant --tagset option. Running the resulting engine, however,
resulted in an error complaining about a "numcol" grammatical class not being
present in the tagset. "numcol" does not appear anywhere in the training corpus
I used, nor does it exist in the IPI-PAN tagset as defined in Pantera (and in
other tools using the IPI-PAN tagset such as Morfeusz or TaKIPI).
It seems that the hardcoded fixes for Morfeusz output in
src/nlpcommon/morfeusz.h are tailored for the "new" SGJP Morfeusz, and are not
appropriate for the old version. I have patched the issue locally by wrapping
the offending block in a #ifdef, and I'm attaching the diff.
Original issue reported on code.google.com by [email protected]
on 17 Jan 2011 at 9:45
Attachments:
if there is something in between, tokens are not recognized, and incidentally
the lexer starts to eat a lot of memory (because current_lex is never reset).
Original issue reported on code.google.com by [email protected]
on 7 Sep 2010 at 6:16
Pantera includes the whole TaKIPI distribution as a ‘third-party’ tool. This is
weird and cumbersome, e.g. when a newer TaKIPI is already installed. In
fact, only libcorpus is used. It'd be far more natural to add libcorpus as
Pantera's dependency.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 1:51
... we wanted to make it separate at the beginning of a question.
Original issue reported on code.google.com by [email protected]
on 2 Oct 2010 at 10:48
(Not sure if this is a bug or not)
Example: je
<fs type="lex" xml:id="morph_p-1.3-seg_1-lex">
<f name="base">
<string>on</string>
</f>
<f name="ctag">
<symbol value="ppron3"/>
</f>
<f name="msd">
<vAlt>
<symbol value="pl:acc:m2:ter:npraep" xml:id="morph_p-1.3-seg_1-msd">
<symbol value="pl:acc:m3:ter:npraep" xml:id="morph_p-1.3-seg_2-msd">
<symbol value="pl:acc:f:ter:npraep" xml:id="morph_p-1.3-seg_3-msd">
<symbol value="sg:acc:n1:ter:npraep" xml:id="morph_p-1.3-seg_4-msd">
<symbol value="pl:acc:n1:ter:npraep" xml:id="morph_p-1.3-seg_5-msd">
<symbol value="sg:acc:n2:ter:npraep" xml:id="morph_p-1.3-seg_6-msd">
<symbol value="pl:acc:n2:ter:npraep" xml:id="morph_p-1.3-seg_7-msd">
<symbol value="pl:acc:p2:ter:npraep" xml:id="morph_p-1.3-seg_8-msd">
<symbol value="pl:acc:p3:ter:npraep" xml:id="morph_p-1.3-seg_9-msd">
<symbol value="pl:acc:m2:ter:akc:npraep" xml:id="morph_p-1.3-seg_10-msd">
<symbol value="pl:acc:m3:ter:akc:npraep" xml:id="morph_p-1.3-seg_11-msd">
<symbol value="pl:acc:f:ter:akc:npraep" xml:id="morph_p-1.3-seg_12-msd">
<symbol value="sg:acc:n1:ter:akc:npraep" xml:id="morph_p-1.3-seg_13-msd">
<symbol value="pl:acc:n1:ter:akc:npraep" xml:id="morph_p-1.3-seg_14-msd">
<symbol value="sg:acc:n2:ter:akc:npraep" xml:id="morph_p-1.3-seg_15-msd">
<symbol value="pl:acc:n2:ter:akc:npraep" xml:id="morph_p-1.3-seg_16-msd">
<symbol value="pl:acc:p2:ter:akc:npraep" xml:id="morph_p-1.3-seg_17-msd">
<symbol value="pl:acc:p3:ter:akc:npraep" xml:id="morph_p-1.3-seg_18-msd">
<symbol value="pl:acc:m2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_19-msd">
<symbol value="pl:acc:m3:ter:nakc:npraep" xml:id="morph_p-1.3-seg_20-msd">
<symbol value="pl:acc:f:ter:nakc:npraep" xml:id="morph_p-1.3-seg_21-msd">
<symbol value="sg:acc:n1:ter:nakc:npraep" xml:id="morph_p-1.3-seg_22-msd">
<symbol value="pl:acc:n1:ter:nakc:npraep" xml:id="morph_p-1.3-seg_23-msd">
<symbol value="sg:acc:n2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_24-msd">
<symbol value="pl:acc:n2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_25-msd">
<symbol value="pl:acc:p2:ter:nakc:npraep" xml:id="morph_p-1.3-seg_26-msd">
<symbol value="pl:acc:p3:ter:nakc:npraep" xml:id="morph_p-1.3-seg_27-msd">
</vAlt>
</f>
</fs>
Original issue reported on code.google.com by [email protected]
on 16 Aug 2010 at 9:11
- --create-engine and --engine.
- --create-engine with no training files
- also when no --create-engine, but no input files found
Original issue reported on code.google.com by [email protected]
on 11 Aug 2010 at 8:13
What steps will reproduce the problem?
1. Create a text with a paragraph (<p>) longer than 1000 chars
2. Tag it
What is the expected output? What do you see instead?
Expected: <seg corresp="text_structure.xml#string-range(p-363,1007,9)" ...>
Got: <seg corresp="text_structure.xml#string-range(p-363,1,007,9)" ...>
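The stray comma looks like a thousands separator, which suggests locale-aware number formatting somewhere in the writer; that is an assumption, the report does not confirm the cause. The sketch below only shows the expected pointer format built with plain decimal digits. The helper name string_range is hypothetical.

```python
def string_range(paragraph_id, offset, length):
    """Build a string-range pointer with plain decimal digits, no digit grouping."""
    # str() on an int never inserts grouping characters, unlike the ","
    # format specifier or locale-aware formatting.
    return "text_structure.xml#string-range({},{},{})".format(
        paragraph_id, str(offset), str(length)
    )
```

A regression test for this issue only needs one paragraph with an offset of 1000 or more.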
Original issue reported on code.google.com by [email protected]
on 16 May 2011 at 11:18
This is the only way to allow producing NKJP output from, say, plain text.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 12:28
Recently the specs for the output format changed.
This needs some adjustments.
Also: formally validate outputs.
Original issue reported on code.google.com by [email protected]
on 11 Aug 2010 at 8:09
ann_segmentation.xml and ann_morphosyntax.xml contain:
<?oxygen RNGSchema="NKJP_morphosyntax.rng" type="xml"?>
<?oxygen RNGSchema="NKJP_segmentation.rng" type="xml"?>
It's probably better to remove them.
Original issue reported on code.google.com by [email protected]
on 4 Nov 2011 at 4:20
What steps will reproduce the problem?
1. Tag a text_structure that has an <lb/>, e.g.
nie wywieraj nacisku na innych, żeby wznosili toasty<lb/>nie zmuszaj nikogo do
picia alkoholu
What I see in segmentation is:
<!-- toasty -->
<seg corresp="text_structure.xml#string-range(p-151,46,6)" xml:id="segm_p-151.2471-seg"/>
<!-- nie -->
<seg corresp="text_structure.xml#string-range(p-151,52,3)" nkjp:nps="true" xml:id="segm_p-151.2472-seg"/>
<!-- zmuszaj -->
<seg corresp="text_structure.xml#string-range(p-151,56,7)" xml:id="segm_p-151.2473-seg"/>
while it would make more sense if there was no nkjp:nps="true" for the second
segment, so that the 'projected' text is "toasty nie zmuszaj", i.e. when we
lose information about newline (that is inevitable here), let's change it to a
space, not to an empty string.
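The effect of the flag on the projected text can be sketched as follows, assuming that nkjp:nps="true" means "no preceding space" and that segments are otherwise joined with a single space. The function project_text is hypothetical and only mirrors the reasoning above.

```python
def project_text(segments):
    """Rebuild running text from (orth, nps) pairs.

    nps mirrors nkjp:nps="true": the segment had no preceding space.
    """
    parts = []
    for index, (orth, nps) in enumerate(segments):
        if index > 0 and not nps:
            parts.append(" ")
        parts.append(orth)
    return "".join(parts)
```

With the flag set on "nie", as in the output shown, the projection glues the two words together; without it, the line break is rendered as a space.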
Original issue reported on code.google.com by [email protected]
on 20 May 2011 at 11:14
The wiki says:
If you use different version of Morfeusz, you may need to adjust some paths in
third_party/morfeusz/Makefile.am. If the version of Morfeusz used emits the new
tagset, adjust MORFEUSZ_TAGSET in src/nlpcommon/morfeusz.h to nkjp.
whereas, as far as I understand, the NKJP tagset is now the default, and you need to
uncomment #define MORFEUSZ_IPI to use the old one.
Original issue reported on code.google.com by [email protected]
on 23 Aug 2011 at 10:32
this makes sense.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 12:27
test -z "/home/eliasz/workspace/pantera/include/nlpcommon" || /bin/mkdir -p
"/home/eliasz/workspace/pantera/include/nlpcommon"
/usr/bin/install -c -m 644 exception.h ipipanlexer.h lexer.h progress.h tag.h util.h writer.h tagset.h category.h pos.h scorer.h cascorer.h finderrors.h datwriter.h datlexer.h stats.h poliqarp-weights.h category-weights.h stats.h lexemesfilter.h plaintextwriter.h plaintextlexer.h morfeusz.h nkjptextlexer.h nkjplexerdata.h nkjpwriter.h libsegmentsentencer.h tagset_convert.h polish_tagset_convert.h _pstream.h '/home/eliasz/workspace/pantera/include/nlpcommon'
/usr/bin/install: will not overwrite just-created
`/home/eliasz/workspace/pantera/include/nlpcommon/stats.h' with `stats.h'
make[4]: *** [install-libnlpcommonincludeHEADERS] Error 1
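The install command above lists stats.h twice, which is what the error message complains about. A throwaway helper like the one below (hypothetical, not part of the build system) can spot such duplicates in a header list copied from the Makefile or the log.

```python
from collections import Counter


def duplicated_names(header_list):
    """Return names that occur more than once in a whitespace-separated list."""
    counts = Counter(header_list.split())
    return sorted(name for name, count in counts.items() if count > 1)
```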
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 2:07
... progress as far as possible.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 11:46
What steps will reproduce the problem?
1. For instance, chmod -R u-w DIRECTORY
2. run Pantera on the DIRECTORY
What is the expected output? What do you see instead?
I'd expect an error message. Instead, the program just freezes (still using
CPU) until killed with a signal or something.
What version of the product are you using? On what operating system?
RHEL5, but expect the same behaviour anywhere (e.g. when disk is full, if the
OS doesn't support access rights)
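One way to get the expected error message is to check the output directory before any processing starts. The sketch below assumes output files are written into the input directory; check_output_dir is a hypothetical name. It does not cover the full-disk case, which can only be detected when a write actually fails.

```python
import os


def check_output_dir(directory):
    """Raise an error early if output files cannot be created in directory."""
    if not os.path.isdir(directory):
        raise NotADirectoryError(directory)
    # Creating a file needs write and search (execute) permission on the directory.
    if not os.access(directory, os.W_OK | os.X_OK):
        raise PermissionError("cannot write to " + directory)
```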
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 22 Apr 2011 at 10:02
This may be very useful for training the tagger.
Original issue reported on code.google.com by [email protected]
on 2 Sep 2010 at 4:28
Double 'ign' when a word is not known to the guesser (odgadywacz)
Original issue reported on code.google.com by [email protected]
on 11 Aug 2010 at 8:06
(in nlpcommon/polish_tagset_convert.h)
- incompatible categories should not be remapped to ign
- investigate if there are any incompatible grammatical category values
Original issue reported on code.google.com by [email protected]
on 11 Aug 2010 at 8:08
Missing segmentation disambiguation
Original issue reported on code.google.com by [email protected]
on 11 Aug 2010 at 8:06
... it's not even maintained and no one uses it.
Original issue reported on code.google.com by [email protected]
on 6 Sep 2010 at 11:56
Comment inside a paragraph:
<p xml:id="p-13"><!--<4>--></p>
There are definitely no segments here, yet Pantera, for a reason I don't
know, identifies (and tags) the following:
<!-- -\- -->
<seg corresp="text_structure.xml#string-range(p-13,0,2)"
xml:id="segm_p-13.697-seg"/>
<!-- > -->
<seg corresp="text_structure.xml#string-range(p-13,2,1)" nkjp:nps="true"
xml:id="segm_p-13.698-seg"/>
Haven't checked what happens when the comment is not the only content of the
paragraph. However, it should work in this case as well, as if the commented
text was not there.
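The expected behaviour can be sketched with a standard XML parser, which drops comments before the text is segmented. This is a minimal illustration using Python's xml.etree.ElementTree, not Pantera's own lexer, and paragraph_text is a hypothetical name. How the offsets in string-range should account for a skipped comment is left open here.

```python
import xml.etree.ElementTree as ET


def paragraph_text(paragraph_xml):
    """Return the character data of a paragraph, ignoring XML comments."""
    # ElementTree's default parser discards comments, so only real
    # character data is left to be segmented.
    paragraph = ET.fromstring(paragraph_xml)
    return "".join(paragraph.itertext())
```

For the paragraph above the result is an empty string, so there is nothing to segment or tag.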
Original issue reported on code.google.com by [email protected]
on 15 Jul 2011 at 10:13
Attached patch includes the missing files, for convenience. Taken from:
https://svn.apache.org/repos/asf/incubator/mesos/trunk/m4/acx_pthread.m4
and
http://stuff.mit.edu/afs/athena/software/elmer/current/src/hutiter/acx_mpi.m4
Original issue reported on code.google.com by joregan
on 19 Mar 2012 at 2:06
Attachments:
What I get is:
Loading engine ...
Sending data to all worker processes ...
Scanning for input files ... 1 found
[5252] LEXER /tmp/text_structure.xml
[5252] SENTENCER /tmp/text_structure.xml
Warning: system does not support required locale 'pl_PL'. We will continue with
the 'en_US.UTF-8' locale, but things may not work as expected.
[5252] MORPH /tmp/text_structure.xml
[5252] SEGM-DISAMB /tmp/text_structure.xml
[5252] PRE-TAGGER /tmp/text_structure.xml
[5252] TAGGER /tmp/text_structure.xml
[5252] POST-TAGGER /tmp/text_structure.xml
[5252] WRITER /tmp/text_structure.xml
[5252] DONE /tmp/text_structure.xml
Processing ... done.
All done.
while I do have the appropriate locale installed
$ locale -a | grep pl_PL
pl_PL.utf8
and the locale test program described on the wiki page compiles and runs.
The system is Ubuntu 10.04 LTS.
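The warning names 'pl_PL' while the installed locale is listed as pl_PL.utf8, so the exact name requested may matter; that is a guess, not something verified against the source. The sketch below shows how several spellings could be probed in turn. The function name and the candidate list are assumptions.

```python
import locale


def first_available_locale(candidates):
    """Return the first locale name the C library accepts, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # leave the process unchanged
```

Calling first_available_locale(["pl_PL.UTF-8", "pl_PL.utf8", "pl_PL"]) on the affected machine would show which of these names is actually accepted.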
Original issue reported on code.google.com by [email protected]
on 24 Aug 2011 at 10:15
What steps will reproduce the problem?
1. Running Pantera (standard parameters, incl. model) on the provided
text_structures.
What is the expected output? What do you see instead?
Some paragraphs are not closed ("</p>" line is missing).
What version of the product are you using? On what operating system?
0.9, from package 0.9-r154-1, Ubuntu 10.04
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 12 Aug 2013 at 12:49
Attachments: