libfolia's People

Contributors

kosloot, proycon

libfolia's Issues

folialint produces invalid FoLiA out of dubious input

related to proycon/flat#138

Consider this file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="libfolia-v0.14" version="0.12.0">
  <metadata type="native">
    <annotations>
      <pos-annotation set="pos"/>
      <syntax-annotation set="syn"/>
    </annotations>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1">
	<t>Is@</t>
	<pos class="BEP" />
      </w>
      <syntax>
	<su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <w xml:id="s.1.su.w.1">
              <t>*exp*</t>
              <pos class="EX" />
            </w>
          </su>
	</su>
    </syntax>
    </s>
  </text>
</FoLiA>

It contains a <w> inside the <su> node that is NOT present in the <s> itself.
This is a construction that had not been considered until now.

When running folialint on this file, an INVALID output is produced:

<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="libfolia-v1.20" version="0.12.0">
  <metadata type="native">
    <annotations>
      <pos-annotation set="pos"/>
      <syntax-annotation set="syn"/>
    </annotations>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1">
        <t>Is@</t>
        <pos class="BEP"/>
      </w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <wref id="s.1.su.w.1" t="*exp*"/>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>

A <wref> is generated that points to a non-existent word!

Desired behavior:

  • Either reject the input
  • Or emit a real <w> instead of a <wref>
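
A check for the first option can be sketched outside libfolia. The following is a minimal illustration in Python using only the standard library, not libfolia code; the function name dangling_wrefs and the trimmed sample document are assumptions made for this example. It collects every xml:id in the document and reports each <wref> whose id matches none of them.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_wrefs(xml_text):
    """Return the ids of all <wref> elements that point to no element in the document."""
    root = ET.fromstring(xml_text)
    known_ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None}
    return [
        wref.get("id")
        for wref in root.iter("{%s}wref" % FOLIA_NS)
        if wref.get("id") not in known_ids
    ]


# Trimmed version of the invalid output shown above.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="test">
  <text xml:id="test.text">
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Is@</t></w>
      <syntax>
        <su xml:id="s.1.su.1" class="IP-MAT">
          <su xml:id="s.1.su.2" class="NP-SBJ">
            <wref id="s.1.su.w.1" t="*exp*"/>
          </su>
        </su>
      </syntax>
    </s>
  </text>
</FoLiA>"""

print(dangling_wrefs(SAMPLE))
```

A validator could refuse to write the document whenever this list is non-empty.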

configure.ac:44: error: possibly undefined macro: AC_MSG_ERROR

While trying to create the FreeBSD port, I'm getting this error for version 1.15:

$ sh bootstrap.sh 
aclocal: installing 'm4/libtool.m4' from '/usr/local/share/aclocal/libtool.m4'
aclocal: installing 'm4/ltoptions.m4' from '/usr/local/share/aclocal/ltoptions.m4'
aclocal: installing 'm4/ltsugar.m4' from '/usr/local/share/aclocal/ltsugar.m4'
aclocal: installing 'm4/ltversion.m4' from '/usr/local/share/aclocal/ltversion.m4'
aclocal: installing 'm4/lt~obsolete.m4' from '/usr/local/share/aclocal/lt~obsolete.m4'
aclocal: installing 'm4/pkg.m4' from '/usr/local/share/aclocal/pkg.m4'
libtoolize: putting auxiliary files in '.'.
libtoolize: linking file './ltmain.sh'
configure.ac:44: error: possibly undefined macro: AC_MSG_ERROR
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf-2.69: /usr/local/bin/autoconf-2.69 failed with exit status: 1

I have autoreconf-2.69 and pkg-config installed.

Installation of libfolia

When I try to install libfolia, I consistently receive the following error message:

grep: /usr/lib/libiconv.la: No such file or directory
sed: /usr/lib/libiconv.la: No such file or directory
libtool: error: '/usr/lib/libiconv.la' is not a valid libtool archive

I am working on a Mac OS Sierra environment, version 10.12.3. The folder that is mentioned does not contain a libiconv.la file. There is a libiconv.dylib file, however. I have tried to install libiconv using homebrew on my machine, but this did not create a libiconv.la file either.

Does anybody have an idea of what may be causing this problem, and of how I may solve this issue? I have attached the config.log file.

Any help would be greatly appreciated!

config-log.txt

Delimiter problem in corrected text

given this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <sentence-annotation />
      <token-annotation />
    </annotations>
  </metadata>
  <text xml:id="bug">
    <s xml:id="s.1">
      <t class="in">Dit is een test</t>
      <w xml:id="w.1">
        <t class="in">Dit</t>
      </w>
      <w xml:id="w.2">
        <t class="in">is</t>
      </w>
      <correction>
        <original>
          <w xml:id="w.3">
            <t class="in">een</t>
          </w>
        </original>
      </correction>
      <w xml:id="w.4">
        <t class="in">test</t>
      </w>
    </s>
  </text>
</FoLiA>

folialint gives this erroneous message:

tests/textproblem_2.xml failed: inconsistent text: node s(s.1) has a mismatch for the text in set:in
the element text ='Dit is een test'
 the deeper text ='Dit is eentest'

Apparently, the delimiter is lost somewhere.

textclass properties on entities not honoured when interpreting wref/@t

folialint breaks on the following document with error (foliavalidator does not complain):

XML error: WordRefence id=TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3 has another value for  the t attribute them it's reference. (Zuidhollanschen versus Zuydthollanschen)

It should look in the right textclass, which is explicitly specified at the entity level.

'Minimal' FoLiA example (http://lst.science.ru.nl/~proycon/issue52.folia.xml):

    <s xml:id="TEI.1.par">                                                                                                                                                                                                                                     
            <w xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3" class="WORD" set="tokconfig-nld">                                                                                                                                                            
              <t>Zuydthollanschen</t>                                                                                                                                                                                                                          
              <t class="contemporary">Zuidhollanschen</t>                                                                                                                                                                                                      
              <pos class="SPEC(deeleigen)" confidence="1" head="SPEC" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" textclass="contemporary">                                                                                                              
                <feat class="deeleigen" subset="spectype"/>                                                                                                                                                                                                    
              </pos>                                                                                                                                                                                                                                           
              <lemma class="Zuidhollanschen" set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl" textclass="contemporary"/>                                                                                                                                       
            </w>                                                                                                                                                                                                                                               
            <w xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.4" class="WORD" set="tokconfig-nld" space="no">                                                                                                                                                 
              <t>Synodi</t>                                                                                                                                                                                                                                    
              <t class="contemporary">Sijnodi</t>                                                                                                                                                                                                              
              <pos class="SPEC(deeleigen)" confidence="1" head="SPEC" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-cgn" textclass="contemporary">                                                                                                              
                <feat class="deeleigen" subset="spectype"/>                                                                                                                                                                                                    
              </pos>                                                                                                                                                                                                                                           
              <lemma class="Sijnodi" set="http://ilk.uvt.nl/folia/sets/frog-mblem-nl" textclass="contemporary"/>                                                                                                                                               
            </w>                                                                                                                                                                                                                                               
            <entities xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.entities.1">                                                                                                                                                                               
              <entity xml:id="TEI.1.text.1.body.1.div1.1.head.1.s.1.entities.1.entity.1" class="pro" confidence="0.68202" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" textclass="contemporary">                                                             
                <wref id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.3" t="Zuidhollanschen"/>                                                                                                                                                                     
                <wref id="TEI.1.text.1.body.1.div1.1.head.1.s.1.w.4" t="Sijnodi"/>                                                                                                                                                                             
              </entity>                                                                                                                                                                                                                                        
            </entities>                                                                                                                                                                                                                                        
    </s>     
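
The lookup that this issue asks for can be sketched as follows. This is an illustration in Python with the standard library, not the libfolia implementation; the function names and the shortened ids are made up for the example, and it assumes that a <t> without a class attribute belongs to the text class "current". The wref/@t value is compared with the word text in the text class named on the enclosing <entity>.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def word_text(word, textclass):
    """Return the text of a <w> in the given text class, or None if absent."""
    for t in word.findall(NS + "t"):
        # assumption: a <t> without class is in the default class "current"
        if t.get("class", "current") == textclass:
            return "".join(t.itertext())
    return None


def mismatched_wrefs(xml_text):
    """List (id, t attribute, actual text) for wrefs that disagree with their word."""
    root = ET.fromstring(xml_text)
    words = {w.get(XML_ID): w for w in root.iter(NS + "w")}
    problems = []
    for entity in root.iter(NS + "entity"):
        # the text class is taken from the entity, not from a global default
        textclass = entity.get("textclass", "current")
        for wref in entity.findall(NS + "wref"):
            word = words.get(wref.get("id"))
            expected = wref.get("t")
            if word is None or expected is None:
                continue
            actual = word_text(word, textclass)
            if actual != expected:
                problems.append((wref.get("id"), expected, actual))
    return problems


SAMPLE = """<s xmlns="http://ilk.uvt.nl/folia" xml:id="s.1">
  <w xml:id="w.3">
    <t>Zuydthollanschen</t>
    <t class="contemporary">Zuidhollanschen</t>
  </w>
  <entities>
    <entity class="pro" textclass="contemporary">
      <wref id="w.3" t="Zuidhollanschen"/>
    </entity>
  </entities>
</s>"""

print(mismatched_wrefs(SAMPLE))
```

Ignoring the textclass on the entity reproduces the reported mismatch, because the comparison then falls back to the default text.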

Libfolia generated document with invalid ids (not an NCName)

Martin came across the following document:

<FoLiA generator="libfolia-v0.11" xml:id="_gid001199401_01_0096" xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink">

Which fails to be read because:

reason: XML error: XML error: '_gid001199401_01_0096' is not a valid NCName. (must start with character). 

I think we need some kind of quick check whenever an ID is set; this may be an issue for the Python version too.
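
Such a quick check might look like the sketch below (Python, standard library only; is_valid_id and its parameter are hypothetical names). It is an ASCII-only simplification of the NCName grammar, so non-ASCII name characters that XML permits are rejected here. As far as I can tell, the XML Namespaces grammar allows an NCName to start with a letter or an underscore, while the error message above demands a letter, so the stricter rule is left as a switch.

```python
import re

# ASCII-only approximation of NCName: a letter or underscore, followed by
# letters, digits, '.', '-' or '_'. A colon is never allowed.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_id(value, allow_leading_underscore=True):
    """Check an id before it is stored on an element."""
    if not _NCNAME_ASCII.fullmatch(value):
        return False
    # optional stricter rule, matching the 'must start with character' message
    return allow_leading_underscore or not value.startswith("_")


for candidate in ["_gid001199401_01_0096", "s.1.w.1", "1abc", "a:b"]:
    print(candidate, is_valid_id(candidate), is_valid_id(candidate, False))
```

Whichever rule is chosen, applying the same check on writing and on reading avoids producing documents that the library itself cannot read back.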

Moving includes out-of-header causes issues for Cython bindings

Commit 2b0bb9a breaks the Python/Cython bindings as I need to duplicate more headers natively in Cython. Can we perhaps revert this or find an elegant solution that does allow the includes to be present in the headers? That way Cython needs not be aware of too much and the c++ compiler can handle it.

How to handle missing version information

Sometimes FoLiA files don't carry version information.
In fact I think this should be an error, but they have been created, and we have to handle them.
At the moment libfolia accepts those files and assigns the CURRENT FoLiA version to them.
(This has probably been the case for a few months.)
But that has proven to be wrong. Files may have been created with a pre-1.5 version and contain inconsistent text elements (text not matching the deeper text), in which case the parser bails out.
This can be circumvented by setting the environment variable FOLIA_TEXT_CHECK to NO,
which will create a document with a version (like 1.5.1) that still contains the error.
This is highly undesirable.

Maybe it is better to set the version to some pre-1.5 value like 1.4.10? That will keep the files 'valid' forever.

@proycon opinions on this?
And should we forbid 'version less FoLiA' in the future? I'm in favor.
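
Both options can be expressed in a few lines. The sketch below is Python with the standard library and does not describe libfolia's current behaviour; effective_version, the strict flag and the constant name are assumptions for illustration, and 1.4.10 is simply the value suggested above.

```python
import xml.etree.ElementTree as ET

# The pre-1.5 fallback proposed above; an assumption, not an agreed value.
FALLBACK_VERSION = "1.4.10"


def effective_version(xml_text, strict=False):
    """Return the declared FoLiA version, a fallback, or fail when strict."""
    root = ET.fromstring(xml_text)
    version = root.get("version")
    if version:
        return version
    if strict:
        # the 'forbid version-less FoLiA' variant
        raise ValueError("FoLiA document carries no version attribute")
    return FALLBACK_VERSION


WITH_VERSION = '<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="a" version="2.2"/>'
WITHOUT_VERSION = '<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="a"/>'

print(effective_version(WITH_VERSION))
print(effective_version(WITHOUT_VERSION))
```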

fix id parsing

libfolia accepts constructions like

<s id="s.1"/>

This should be:

<s xml:id="s.1"/>

[Debian bug 843053] ABI breakage

Frog (and ucto?) break in debian because of a symbol error: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843053

The packages were compiled with libfolia 1.4, but state dependency requirement (>= 1.3). I assume the user still has libfolia v1.3 then and the ABI changed in the meantime? It sounds like we need to increase the so version for every ABI change to prevent such issues? Or tighten the package dependencies even more strictly if that suffices.

reduce the number of is_declared functions

folia_engine contains a lot of is_declared variants (as does folia_document).
Most of them are NOT or seldom used (not in other programs such as foliautils either).
They can be removed to clean things up. Better now than after someone starts using them :)

The same for a lot of declare() variants.

allow for multiple foreign metadata nodes in FoLiA, even in 'native' mode.

Use case:
we have a file where we want to add a foreign metadata node:

  <metadata type="foreign">
     <foreign-data>
        <paragraphStyles xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
...
    </foreign-data>
  </metadata>

This works.
But we also would like to add:

 <metadata type="native">
   <meta id="abby_file">piroska.abby.xml</meta>
  </metadata>

But the current implementation doesn't allow this.
Is this an oversight? I assume submetadata is meant for this?
It seems the API doesn't provide an easy way.

value of Textcontent disappears (empty string) upon add?

Something goes wrong when I add TextContent with value eologico*phijsico*metaphijsicum, libfolia adds an empty text content element instead! I've no idea what triggers this (special meaning for the asterisk perhaps??), other words process fine.

I add TextContent as follows:
https://github.com/LanguageMachines/foliautils/blob/wordtranslate/src/FoLiA-wordtranslate.cxx#L134

Debug output, I explicitly check if I'm not passing an empty string (after trimming even):

$ FoLiA-wordtranslate --outputclass contemporary -d lexicon.1637-2010.250.lexserv.vandale.tsv -p preservation2010.txt -r rules.machine aa__001biog01_01.tok.folia.xml
                                                                                                                                                                                                                                                            
Loading dictionary...                                                                                                                                                                                                                                       
Loading preserve lexicon...                                                                                                                                                                                                                                 
Loading rules...                                                                                                                                                                                                                                            
DEBUG: target before sanity check 'eologico*phijsico*metaphijsicum'                                                                                                                                                                                         
DEBUG: target after sanity check 'eologico*phijsico*metaphijsicum'
DEBUG: text after adding textcontent ''
finished aa__001biog01_01.tok.folia.xml 

failed operations should not (always?) invalidate a FoLiA document

Some operations, especially appending, should leave the original document untouched when they fail.
This enables users/programs to continue with the document and attempt other operations, like edit/save etc.
At the moment the document is probably not always preserved: a failed addition may be left IN the document, rendering it invalid.
Some code overhauling might be necessary.
@proycon is the Python version robust against these problems ?
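
One way to get this guarantee is to validate first and mutate last. The sketch below is a toy model in Python, not the libfolia classes; Element, accepted and check_addable are hypothetical names. A rejected append raises before the child list is touched, so the caller can continue with an unchanged parent.

```python
class Element:
    """Tiny stand-in for a FoLiA element; not the libfolia class."""

    def __init__(self, tag, accepted=()):
        self.tag = tag
        self.accepted = set(accepted)   # tags this element may contain
        self.children = []

    def check_addable(self, child):
        """Raise if the child may not be added; must not modify anything."""
        if child.tag not in self.accepted:
            raise ValueError("cannot add <%s> to <%s>" % (child.tag, self.tag))

    def append(self, child):
        # validate first, mutate last: a failed append leaves self untouched
        self.check_addable(child)
        self.children.append(child)
        return child


s = Element("s", accepted={"w", "t"})
s.append(Element("w"))
try:
    s.append(Element("p"))              # rejected
except ValueError as err:
    print("rejected:", err)
print([child.tag for child in s.children])
```

When a check can only be done after insertion, the same effect can be had by removing the child again in an exception handler before re-raising.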

Occurrence behaviour is not defined correctly?

After adding a second lemma, in a different set, I get:

terminate called after throwing an instance of 'folia::DuplicateAnnotationError'
what(): Unable to add another object of type lemma to w. There are already 1 instances of this class, which is the maximum.

Something seems to be wrong here. Investigating...

Implement group annotations

Group annotations provide the ability to add inline annotations on multi-word spans (group annotations) and solve related multi-word issues. These were previously reserved only for use with structural elements. See proycon/folia#51

AbstractInlineAnnotation is now in AbstractSpanAnnotation.ACCEPTED_DATA to allow this, but the library has to check if the declaration actually allows it.

Implement an incremental construction of FoLiA (output) files

For some purposes it is handy to make it possible to incrementally construct a FoLiA document
while keeping only fragments in memory.
There are probably many ways to accomplish this, but I would like to sketch a simple approach, which might be useful for ucto or frog.

This approach doesn't rely on SAX or XmlReader based features yet. That might be needed in the long run, but for now I propose an even simpler idea.

I will describe the API, which should be almost self-explanatory...

  • constructor:
    foliaStream( const string& id, ostream& os )
    creates an empty FoLiA document with id=id. The document contains a basic structure, including an empty <text> node, which is considered the root.
    We will manipulate this doc, and gradually write it to a stream.

The xmltag could be a parameter too? for other roots.

Also in the future we could initialize from a FoLiA file, using XmlReader to initialize only a base document.

  • members:
    • document()
      returns the FoLiA proto-document
    • root()
      returns the root of the FoLiA proto-document. (so the Text, for now)
    • output_header()
      Output the header of the folia document to the stream, up to the opening tag of the root node.
      This should FREEZE all global operations on the document (adding annotations and sets and such) as they can never be output again.
      Calling it more than once is prohibited.
    • add_node( FoLiAElement * )
      add a FoLiA subtree under the root. This tree should be created as parts of document() (to get indexes and sets right)
      It can be called multiple times.
    • flush( )
      Writes all available FoLiA nodes under root() to the file and REMOVES them from the document, keeping the footprint of the document low.
      When output_header() is not called yet, do that first.
    • output_footer()
      Output the tail of the document, from the closing tag of the root node to the end.
      Can be called only once. Flushes the foliaStream first, so it even writes the header if that was not already done.
  • destructor:
    calls output_footer() when necessary, then closes the stream and removes the document.

Typical usage:
In UCTO:

  • create a foliaStream
  • declare all needed annotations/setnames in the document.
  • output_header()
    loop:
    • read a 'line'
    • tokenize
    • create FoLiA node(s) relative to the document()
    • add()
    • flush()
    • loop again
      output_footer()

In FROG

  • create a foliaStream
  • declare all needed annotations/setnames in the document.
  • output_header()
    loop:
    • read a 'line' / sentence /paragraph?
    • tokenize the line to a FoLiAFragment. (might need some ucto hacking)
    • perform MBLEM/MBMA/NER etc on the fragment. (might need work)
    • add()
    • flush()
    • loop again
      output_footer()
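
The proposed API and the usage loops above can be prototyped quickly. What follows is a hypothetical Python analogue using only the standard library, not an existing libfolia or ucto/frog interface; the class FoliaStream, the declare() method and the sentence() helper are assumptions made for this sketch. It writes the header once, freezes declarations afterwards, and drops every subtree from memory as soon as it has been flushed.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


class FoliaStream:
    """Hypothetical Python analogue of the proposed foliaStream class."""

    def __init__(self, doc_id, out):
        self.doc_id = doc_id
        self.out = out
        self.declarations = []   # annotation declarations, e.g. "token-annotation"
        self.pending = []        # subtrees added but not yet written
        self.header_done = False
        self.footer_done = False

    def declare(self, name):
        # global operations are frozen once the header has been written
        if self.header_done:
            raise RuntimeError("header already written, declarations are frozen")
        self.declarations.append(name)

    def output_header(self):
        if self.header_done:
            raise RuntimeError("output_header() may be called only once")
        write = self.out.write
        write('<?xml version="1.0" encoding="UTF-8"?>\n')
        write('<FoLiA xmlns="%s" xml:id=%s>\n' % (FOLIA_NS, quoteattr(self.doc_id)))
        write('<metadata type="native"><annotations>')
        for name in self.declarations:
            write("<%s/>" % name)
        write("</annotations></metadata>\n")
        write("<text xml:id=%s>\n" % quoteattr(self.doc_id + ".text"))
        self.header_done = True

    def add_node(self, element):
        self.pending.append(element)

    def flush(self):
        if not self.header_done:
            self.output_header()
        for element in self.pending:
            self.out.write(ET.tostring(element, encoding="unicode") + "\n")
        self.pending.clear()     # keep the memory footprint low

    def output_footer(self):
        if self.footer_done:
            raise RuntimeError("output_footer() may be called only once")
        self.flush()             # also writes the header if needed
        self.out.write("</text>\n</FoLiA>\n")
        self.footer_done = True


def sentence(sid, text):
    """Build a minimal <s>; unprefixed tags inherit the header's default namespace."""
    s = ET.Element("s", {XML_ID: sid})
    ET.SubElement(s, "t").text = text
    return s


out = io.StringIO()
stream = FoliaStream("doc", out)
stream.declare("sentence-annotation")
stream.declare("text-annotation")
for i, line in enumerate(["Dit is een test", "Nog een zin"], start=1):
    stream.add_node(sentence("doc.s.%d" % i, line))
    stream.flush()
stream.output_footer()
print(out.getvalue())
```

The sketch serialises each subtree independently, which is why nodes have to be complete before add_node() is called; id bookkeeping and set validation, which the proposal delegates to document(), are left out.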

(re-)add possibility to include or check 'external' references

The "External" implementation has been simplified in commit af1817f

In fact NO resolving is done anymore now. This also implies that NO CHECKING is done on those external files.
It would be nice, and really simple to implement, to have a way to check the sanity.

Like adding a new 'mode' for the documents: "check-externals".
Even easier is implementing a "resolve-externals" mode, which is nothing other than the old include="yes".

WARNING: for the "check-externals" variant it is important to be sure that the external document doesn't introduce new sets or annotations in the 'master' document. (Is that even possible?)

How to handle DOCTYPE and entities

At the moment, processing of FoLiA documents with a !DOCTYPE seems risky:

Given this file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="0.8" generator="libfolia-v0.4">
  <metadata>
    <annotations/>
  </metadata>
  <text xml:id="WR-P-E-J-0000000001.text">
    <s xml:id="WR-P-E-J-0000000001.head.1.s.1">
      <t>Dit is als het ware é&eacute;n test.</t>
    </s>
  </text>
</FoLiA>

The XML parser (for instance in folialint) chokes:
XML-error: Entity 'eacute' not defined
foliavalidator says:

Malformed XML!

This is as such NOT an ERROR.
xmllint says the same:

folia.xml:5: parser error : Entity 'eacute' not defined
      <t>Dit is als het ware é&eacute;n test.</t>

This can be solved by adding a !DOCTYPE with an ENTITY

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE test [
<!ENTITY eacute "é">
]>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="0.8" generator="libfolia-v0.4">
  <metadata>
    <annotations/>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="test.1.s.1">
      <t>Dit is als het ware é&eacute;n test.</t>
    </s>
  </text>
</FoLiA>

Both xmllint and folialint now accept this document.

But: folialint ditches the !DOCTYPE producing:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="libfolia-v1.20" version="0.8">
  <metadata type="native">
    <annotations/>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="test.s.1">
      <t>Dit is als het ware één test.</t>
    </s>
  </text>
</FoLiA>

So the entity IS resolved, but leaving out the DOCTYPE might be a problem in the future. Don't know...
@proycon any opinion on this?
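
The same three observations can be reproduced with another expat-based parser, which suggests the behaviour is generic rather than specific to libfolia. The sketch below uses Python's standard xml.etree.ElementTree on a reduced input (a bare <t> element, an assumption made to keep the example short): the undefined entity is rejected, the entity declared in the internal DOCTYPE subset is expanded, and the serialised result contains the resolved text but no DOCTYPE.

```python
import xml.etree.ElementTree as ET

WITHOUT_DOCTYPE = "<t>Dit is als het ware é&eacute;n test.</t>"
WITH_DOCTYPE = '<!DOCTYPE t [\n<!ENTITY eacute "é">\n]>\n' + WITHOUT_DOCTYPE


def parse_text(xml_text):
    """Return the element text, or None when the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


def roundtrip(xml_text):
    """Parse and serialise again; only the element tree is written back."""
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")


print(parse_text(WITHOUT_DOCTYPE))
print(parse_text(WITH_DOCTYPE))
print(roundtrip(WITH_DOCTYPE))
```

Since the entity is already expanded in the tree, the output no longer needs the declaration; the DOCTYPE would only matter if entities had to survive a round trip unexpanded.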

extracting text() from <part> nodes ignores the space="no" attribute

Given this FoLiA:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="hbr" generator="libfolia-v2.8" version="2.4.0">
  <metadata type="native">
    <annotations>
      <paragraph-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <part-annotation/>
      <hyphenation-annotation/>
    </annotations>
  </metadata>
  <text xml:id="hbr.text">
    <p xml:id="hbr.text.p">
      <part xml:id="hbr.text.part.1" space="no">
        <t>White<t-hbr/>water Moun<t-hbr/></t>
      </part>
      <part xml:id="hbr.text.part.2">
        <t>tains.</t>
      </part>
    </p>
  </text>
</FoLiA>

the Python function folia2txt (rightfully) extracts the text:
Whitewater Mountains.

But its C++ counterpart FoLiA-2text extracts:
Whitewater Moun tains.
ignoring the space="no". This is most probably a bug in libfolia.
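
The expected joining rule can be written down in a few lines. This is a sketch in Python with the standard library, neither the folia2txt nor the libfolia implementation; paragraph_text is a hypothetical helper, and it only looks at <part> children of a single <p>. A delimiter is emitted after a part unless that part carries space="no".

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"


def paragraph_text(p):
    """Join the text of the <part> children of p, honouring space="no"."""
    pieces = []
    for part in p.findall(NS + "part"):
        t = part.find(NS + "t")
        if t is None:
            continue
        # itertext() contributes nothing for the empty <t-hbr/> markers,
        # so the text on both sides of a hyphenation break is glued together
        pieces.append("".join(t.itertext()))
        # a delimiter follows the part unless it says space="no"
        pieces.append("" if part.get("space") == "no" else " ")
    return "".join(pieces).strip()


SAMPLE = """<p xmlns="http://ilk.uvt.nl/folia" xml:id="hbr.text.p">
  <part xml:id="hbr.text.part.1" space="no">
    <t>White<t-hbr/>water Moun<t-hbr/></t>
  </part>
  <part xml:id="hbr.text.part.2">
    <t>tains.</t>
  </part>
</p>"""

print(paragraph_text(ET.fromstring(SAMPLE)))
```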

spurious 'text' element is not detected

Given this simple file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="build1" generator="libfolia-v2.4" version="2.2.1">
  <metadata type="native">
    <annotations>
      <paragraph-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="build1.text">
    <p xml:id="p3">
      <t>paragraaf 3</t>
    </p>
    </text>wat nu?
</FoLiA>

folialint doesn't detect the spurious 'wat nu?' after the final </text> and states that the file is valid FoLiA

foliavalidator DOES detect the problem:

Error on line 2: Element FoLiA has extra content: text
VALIDATION ERROR against RelaxNG schema (stage 1/3), in /home/sloot/Downloads/build.xml
Element FoLiA has extra content: text, line 2

When the 'wat nu?' is moved to the </p> node, both programs complain, but the message from
folialint is quite cryptic:

XML error: Unable to append object of type _XmlText to a <text> (id=build1.text)

It would be nice to have a clearer message by folialint AND to detect it on the 'top' node too.
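
Detecting the top-level case is cheap. The sketch below (Python, standard library; stray_text is a hypothetical name and the sample is a shortened version of the file above) reports any non-whitespace character data that sits directly inside the root element, either before the first child or after one of its children.

```python
import xml.etree.ElementTree as ET


def stray_text(xml_text):
    """Return non-whitespace character data found directly inside the root element."""
    root = ET.fromstring(xml_text)
    found = []
    if root.text and root.text.strip():
        found.append(root.text.strip())      # text before the first child
    for child in root:
        if child.tail and child.tail.strip():
            found.append(child.tail.strip())  # text following a child element
    return found


SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="build1">
  <text xml:id="build1.text">
    <p xml:id="p3"><t>paragraaf 3</t></p>
  </text>wat nu?
</FoLiA>"""

print(stray_text(SAMPLE))
```

Reporting the offending string itself would also make for a clearer message than naming the internal node type.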

refactoring idea 3

folia_engine has a lot of functions with a depth parameter. This could probably also be part of the Engine Class as an internal property.

libfolia-based software fails on Mac OS X

Some software doesn't link with libfolia, see issue LanguageMachines/wopr#1 and LanguageMachines/foliatest#1

Other software does fully compile and link, but symbol lookup fails for some reason. I assume it's related to the linker error of the aforementioned issues, but could be wrong. This affects latest master branch, and most probably latest stable releases as well...

$ ucto
dyld: Symbol not found: __ZNK5folia9FoliaImpl10element_idEv
  Referenced from: /Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib
  Expected in: flat namespace
 in /Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib
Trace/BPT trap: 5
$ frog
dyld: Symbol not found: __ZNK5folia9FoliaImpl10element_idEv
  Referenced from: /Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib
  Expected in: flat namespace
 in /Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib
Trace/BPT trap: 5

Deeper investigation:

$ otool -L /Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib
/Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib:
    /Users/proycon/LaMachine/lamachine/lib/libucto.2.dylib (compatibility version 3.0.0, current version 3.0.0)
    /Users/proycon/LaMachine/lamachine/lib/libfolia.4.dylib (compatibility version 5.0.0, current version 5.0.0)
    /usr/local/opt/icu4c/lib/libicui18n.55.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/local/opt/icu4c/lib/libicuuc.55.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/local/opt/icu4c/lib/libicudata.55.1.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/local/opt/icu4c/lib/libicuio.55.dylib (compatibility version 55.0.0, current version 55.1.0)
    /Users/proycon/LaMachine/lamachine/lib/libticcutils.2.dylib (compatibility version 3.0.0, current version 3.0.0)
    /usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.5)
    /usr/local/opt/boost/lib/libboost_regex-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/lib/libxml2.2.dylib (compatibility version 10.0.0, current version 10.9.0)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)

$ otool -L /Users/proycon/LaMachine/lamachine/lib/libfolia.4.dylib
/Users/proycon/LaMachine/lamachine/lib/libfolia.4.dylib:
    /Users/proycon/LaMachine/lamachine/lib/libfolia.4.dylib (compatibility version 5.0.0, current version 5.0.0)
    /Users/proycon/LaMachine/lamachine/lib/libticcutils.2.dylib (compatibility version 3.0.0, current version 3.0.0)
    /usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.5)
    /usr/local/opt/boost/lib/libboost_regex-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/local/opt/icu4c/lib/libicui18n.55.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/local/opt/icu4c/lib/libicuuc.55.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/local/opt/icu4c/lib/libicudata.55.1.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/local/opt/icu4c/lib/libicuio.55.dylib (compatibility version 55.0.0, current version 55.1.0)
    /usr/lib/libxml2.2.dylib (compatibility version 10.0.0, current version 10.9.0)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)

$ nm /Users/proycon/LaMachine/lamachine/lib/libfolia.4.dylib | grep __ZNK5folia9FoliaImpl10element_idEv
000000000002ca60 t __ZNK5folia9FoliaImpl10element_idEv

1.16 fails to configure: error: Could not find a version of the Boost::Regex library!

configure:16328: (cc -c -Werror -Wunknown-warning-option  -pthread -O2 -pipe -fno-omit-frame-pointer  -fstack-protector-strong -fno-strict-aliasing  -fno-omit-frame-pointer conftest.c >&5) && (echo ==== >&5) && (cc -o conftest -Werror -Wunknown-warning-option  -pthread -O2 -pipe -fno-omit-frame-pointer  -fstack-protector-strong -fno-strict-aliasing  -fno-omit-frame-pointer  -fstack-protector-strong  conftest.o  >&5)
====
configure:16328: $? = 0
configure:16346: result: no
configure:16495: checking for joinable pthread attribute
configure:16513: cc -o conftest -O2 -pipe -fno-omit-frame-pointer  -fstack-protector-strong -fno-strict-aliasing  -pthread -fno-omit-frame-pointer  -fstack-protector-strong  conftest.c   >&5
configure:16513: $? = 0
configure:16521: result: PTHREAD_CREATE_JOINABLE
configure:16535: checking whether more special flags are required for pthreads
configure:16548: result: no
configure:16556: checking for PTHREAD_PRIO_INHERIT
configure:16572: cc -o conftest -O2 -pipe -fno-omit-frame-pointer  -fstack-protector-strong -fno-strict-aliasing  -pthread -fno-omit-frame-pointer  -fstack-protector-strong  conftest.c   >&5
configure:16572: $? = 0
configure:16581: result: yes
configure:16827: checking for boostlib >= 1.50 (105000)
configure:16859: c++ -c -O2 -pipe -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fno-omit-frame-pointer   -pthread -fno-omit-frame-pointer -I/usr/local/include conftest.cpp >&5
configure:16859: $? = 0
configure:16861: result: yes
configure:17041: checking whether the Boost::Regex library is available
configure:17064: c++ -c -O2 -pipe -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fno-omit-frame-pointer   -pthread -fno-omit-frame-pointer -I/usr/local/include -I/usr/local/include conftest.cpp >&5
conftest.cpp:34:15: warning: empty parentheses interpreted as a function declaration [-Wvexing-parse]
boost::regex r(); return 0;
              ^~
conftest.cpp:34:15: note: replace parentheses with an initializer to declare a variable
boost::regex r(); return 0;
              ^~
              {}
1 warning generated.
configure:17064: $? = 0
configure:17078: result: yes
configure:17226: error: Could not find a version of the Boost::Regex library!

boost-libs-1.70.0

Handle renamed elements

Various elements have been renamed. The renamed ones are in the oldmaps variable in the specification (already propagated to the C++ code by foliaspec). Furthermore, AbstractHigherOrderAnnotation is a new grouping, and AbstractExtendedTokenAnnotation is removed.

generate id fails when parent doesn't have an id

When adding a word to a sentence without an ID (for instance in ucto), you get a failure:
"Unable to generate an id from ID= "

Generate_id should look 'upwards' for a parent with an id,
with the document id as a final resort.
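
The proposed lookup could be sketched as follows. This is only an illustration under stated assumptions: `Node`, `nearest_id` and `generate_id` are hypothetical names, not the libfolia classes or functions, and the "parent.tag.index" id layout is assumed from the ids seen in FoLiA documents.

```cpp
#include <string>

// Hypothetical stand-in for a FoLiA element; not the libfolia class.
struct Node {
  std::string id;      // empty when the element has no xml:id
  const Node* parent;  // nullptr at the top of the tree
};

// Walk towards the root and return the first non-empty id.
// Fall back to the document id when no ancestor has one.
std::string nearest_id(const Node* start, const std::string& doc_id) {
  for (const Node* n = start; n != nullptr; n = n->parent) {
    if (!n->id.empty()) {
      return n->id;
    }
  }
  return doc_id;
}

// Build a new id such as "doc.p.1.w.1" from the nearest usable id.
std::string generate_id(const Node* parent, const std::string& doc_id,
                        const std::string& tag, int index) {
  return nearest_id(parent, doc_id) + "." + tag + "." + std::to_string(index);
}
```

With a sentence that has no id inside a paragraph "doc.p.1", the new word would be numbered relative to the paragraph; with no id anywhere, relative to the document.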

folialint should detect missing annotation declarations

Consider this document:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="test" generator="ko" version="2.2.0">
  <metadata type="native">
    <annotations>
      <text-annotation set="tekst" />
    </annotations>
  </metadata>
  <text xml:id="test.text">
    <s xml:id="s1">
      <t>Een zin.</t>
    </s>
  </text>
</FoLiA>

When running foliavalidator on this file we get:
ParseError: FoLiA exception in handling of <s> @ line 10 (in parent <text> @ parent line 9) : [DeclarationError] Encountered an instance without proper declaration: Sentence <s>!

With the `-a` option, the document is accepted and an (empty) annotation declaration is added:

<sentence-annotation/>

folialint happily accepts this and just ignores the problem.
We would need folialint to warn, AND probably to add something like the -a option too.

NOTE:
When the text-annotation declaration is left out too, libfolia DOES autodeclare it.
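
A minimal sketch of the kind of check asked for here, assuming a simple lookup from element tag to the declaration it needs. The table `kRequiredDeclaration` and the function `missing_declarations` are hypothetical and cover only a few tags; they are not part of libfolia or folialint.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative subset: element tag -> declaration it requires.
const std::map<std::string, std::string> kRequiredDeclaration = {
  {"t", "text-annotation"},
  {"s", "sentence-annotation"},
  {"w", "token-annotation"},
  {"p", "paragraph-annotation"},
};

// Return each declaration that a used tag needs but that is not declared.
// Every missing declaration is reported once, in order of first use.
std::vector<std::string> missing_declarations(
    const std::set<std::string>& declared,
    const std::vector<std::string>& used_tags) {
  std::set<std::string> reported;
  std::vector<std::string> missing;
  for (const auto& tag : used_tags) {
    auto it = kRequiredDeclaration.find(tag);
    if (it == kRequiredDeclaration.end()) {
      continue;  // tag not covered by this illustrative table
    }
    const std::string& decl = it->second;
    if (declared.count(decl) == 0 && reported.insert(decl).second) {
      missing.push_back(decl);
    }
  }
  return missing;
}
```

For the document above (only text-annotation declared, `<s>` and `<t>` used) this reports sentence-annotation. A linter could then either warn or, with something like the -a option, add the declaration.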

incorrect extraction of deep text from a document with corrections

The text() extraction function fails to extract the correct text from a sentence where the last Word is a Correction, and the sentence is followed by another sentence.
This came up in: LanguageMachines/foliautils#66

When the last Word is truly a Word, a space separator is added, and everything is fine. But in case of a Correction the space is omitted, gluing the text of the two sentences together.
Example (rather braindead, but it proves the point):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.1">
  <metadata type="native">
    <annotations>
      <token-annotation alias="tokconfig-deu" set="https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl">
        <annotator processor="FoLiA-correct.1"/>
        <annotator processor="ucto.1"/>
      </token-annotation>
      <paragraph-annotation>
        <annotator processor="ucto.1"/>
      </paragraph-annotation>
      <sentence-annotation>
        <annotator processor="ucto.1"/>
      </sentence-annotation>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation set="Ticcl-set">
        <annotator processor="FoLiA-correct.1"/>
      </correction-annotation>
    </annotations>
    <provenance>
      <processor xml:id="ucto.1" begindatetime="2022-10-06T12:10:53" command="ucto -X -L deu --textredundancy=full --id Walter bug.in bug.folia.xml" folia_version="2.5.1" host="kobus" name="ucto" user="sloot" version="0.26">
        <processor xml:id="ucto.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
        <processor xml:id="uctodata.1" name="uctodata" type="datasource" version="0.9.1">
          <processor xml:id="uctodata.1.1" name="tokconfig-deu" type="datasource" version="0.2"/>
        </processor>
      </processor>
      <processor xml:id="FoLiA-correct.1" begindatetime="2022-10-06T12:11:06" command="FoLiA-correct --ngram=3 -e folia.xml -O OUT --rank=data/DeutscheEssays.RANK.withunderscore.ranked --unk=data/DeutscheEssays.UNK.withunderscore.unk --punct=data/DeutscheEssays.UNK.withunderscore.punct" folia_version="2.5.1" host="kobus" name="FoLiA-correct" user="sloot" version="0.19">
        <processor xml:id="FoLiA-correct.1.generator" folia_version="2.5.1" name="libfolia" type="generator" version="2.12"/>
      </processor>
    </provenance>
    <meta id="language">deu</meta>
  </metadata>
  <text xml:id="Walter.text">
    <p xml:id="Walter.p.1">
      <t>chat... Von</t>
      <s xml:id="Walter.p.1.s.1">
        <t>chat...</t>
        <w xml:id="Walter.p.1.s.1.w.1" class="WORD" processor="ucto.1" space="no">
          <t>chat</t>
        </w>
        <correction xml:id="Walter.p.1.s.1.correction.1">
          <new>
            <w xml:id="Walter.p.1.s.1.w.3.edit.1" processor="FoLiA-correct.1">
              <t>...</t>
            </w>
          </new>
          <original auth="no">
            <w xml:id="Walter.p.1.s.1.w.3" class="PUNCTUATION-MULTI" processor="ucto.1">
              <t>...</t>
            </w>
          </original>
        </correction>
      </s>
      <s xml:id="Walter.p.1.s.2">
        <t>Von</t>
        <w xml:id="Walter.p.1.s.2.w.1" class="WORD" processor="ucto.1">
          <t>Von</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

When parsing this file with folialint:

bug.xml failed: inconsistent text: node p(Walter.p.1) has a mismatch for the text in set:current
the element text ='chat... Von'
 the deeper text ='chat...Von'
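
The expected behaviour can be illustrated with a simplified model. This is a sketch, not the libfolia implementation: `Word`, `Item`, `Sentence` and `deep_text` are hypothetical, and a correction is assumed to be resolved to the word under `<new>`. The point is that the separator depends on the resolved word's space flag, not on whether the word is wrapped in a correction.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified model; not the libfolia data structures.
struct Word {
  std::string text;
  bool space;  // mirrors the FoLiA space attribute; true unless space="no"
};

struct Item {
  bool is_correction;
  Word word;  // for a correction: the word found under <new>
};

typedef std::vector<Item> Sentence;

// Concatenate the words of all sentences. A separator is emitted before the
// next word whenever the previous word asked for one, also across sentence
// boundaries and also when the previous word sits inside a correction.
std::string deep_text(const std::vector<Sentence>& sentences) {
  std::string out;
  bool pending_space = false;
  for (const auto& sentence : sentences) {
    for (const auto& item : sentence) {
      if (pending_space) {
        out += ' ';
      }
      out += item.word.text;
      pending_space = item.word.space;
    }
  }
  return out;
}
```

Fed with the two sentences from the example (the word "chat" with space="no", the corrected "...", then "Von"), this model yields the same string as the paragraph's own text element.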

Implement support for provenance data

FoLiA v2.0 documents may now carry provenance data. An annotation type uses either this or the old system of annotators.

  • Implement parsing of provenance data (holding <processor> elements)
  • Implement parsing of <annotator> elements in the annotation declarations; this is what links declarations to processors in the provenance data
  • Implement parsing of the processor attribute (a common FoLiA attribute that links any element to its processor, analogous to annotator/annotatortype pre-2.0)
  • Implement serialisation of provenance data
  • Implement serialisation of <annotator>
  • Implement serialisation of the processor attribute
  • Implement an API for setting the processor (see proycon/foliapy#9)
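
As a rough illustration of what the parsed provenance data might look like in memory, here is a sketch with a nested processor tree and a lookup by id, which is what resolving a processor attribute or an <annotator> reference would need. The `Processor` struct and `find_processor` are assumptions made for this example and do not describe the actual libfolia design; the field names follow the attributes seen on <processor> elements.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form of a <processor> element.
struct Processor {
  std::string id;       // xml:id
  std::string name;
  std::string type;     // e.g. "generator" or "datasource"; empty when absent
  std::string version;
  std::vector<Processor> children;  // nested <processor> elements
};

// Depth-first lookup of a processor by id; returns nullptr when not found.
const Processor* find_processor(const std::vector<Processor>& roots,
                                const std::string& id) {
  for (const auto& p : roots) {
    if (p.id == id) {
      return &p;
    }
    if (const Processor* hit = find_processor(p.children, id)) {
      return hit;
    }
  }
  return nullptr;
}
```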

See:

Implement stricter declarations but retain backward compatibility.

FoLiA v2.0 requires everything that has an annotationtype to be declared, which makes things simpler. However, since we need to retain backward compatibility, we also need to keep supporting the pre-v2 situation. See the closing post in proycon/folia#54 for a summary.

The Python library actually does a lot of declarations implicitly (I have a Document.autodeclare that now defaults to True, but I don't think the C++ version necessarily requires such an extra convenience function).

Forward-compatibility: less strict version checking

Right now libfolia refuses to read FoLiA files that are newer than the version known at compile time:

$ folialint example.xml
FAIL: XML error: XML error: FoLiA Document has unsupported version: 1.3 (1.2.0 is supported.)

I propose we become a bit less strict and do not immediately refuse to parse newer FoLiA documents. Instead we should output a very clear warning that the document is newer and that any failures that occur after that point are most likely related to that fact. In practice newer FoLiA documents do not necessarily use features that were not present in earlier versions, so in many cases parsing would work, although it is of course no longer guaranteed: the bigger the version discrepancy, the larger the chance of failure.

I think such behaviour may be preferable in cases where people linger on older versions of libfolia for a bit longer (as in Debian, for instance). I already got one support question related to this a while back.
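
The proposal could look roughly like this. It is a sketch only: `parse_version`, `is_newer` and `warn_if_newer` are hypothetical helpers, not existing libfolia functions, and version strings are assumed to be purely numeric, dot-separated fields (std::stoi throws on anything else).

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Split "1.3" or "1.2.0" into three numeric parts; missing parts count as 0.
std::vector<int> parse_version(const std::string& v) {
  std::vector<int> parts;
  std::istringstream in(v);
  std::string field;
  while (std::getline(in, field, '.')) {
    parts.push_back(std::stoi(field));
  }
  parts.resize(3, 0);
  return parts;
}

// True when the document version is strictly newer than the supported one.
// std::vector compares lexicographically, i.e. major, then minor, then sub.
bool is_newer(const std::string& doc, const std::string& supported) {
  return parse_version(doc) > parse_version(supported);
}

// Emit a warning instead of refusing to parse. Returns true when a warning
// was written, so the caller can still decide to continue parsing.
bool warn_if_newer(const std::string& doc, const std::string& supported,
                   std::ostream& log) {
  if (!is_newer(doc, supported)) {
    return false;
  }
  log << "WARNING: document has FoLiA version " << doc
      << ", which is newer than the supported version " << supported
      << "; any failures from here on are most likely related to that.\n";
  return true;
}
```

A caller would invoke `warn_if_newer(doc_version, supported_version, std::cerr)` before parsing and continue regardless of the result.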

String Element should support str()

Using str() on a String element yields the default word 'str' (the tag name).
text() is working correctly, so a workaround is available.
But str() should work too.

Maybe it is better to define str() in terms of text() generally?
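
Defining str() in terms of text() could be sketched like this. `Element` and `StringElement` are hypothetical classes for illustration and do not reflect the real libfolia class hierarchy or signatures.

```cpp
#include <string>
#include <utility>

// Hypothetical base class; not the libfolia FoliaElement interface.
class Element {
public:
  virtual ~Element() = default;
  // Elements without text content return an empty string.
  virtual std::string text() const { return ""; }
  // Default str() delegates to text(), so any element that implements
  // text() gets a matching str() without extra work.
  virtual std::string str() const { return text(); }
};

// A String-like element that only overrides text().
class StringElement : public Element {
public:
  explicit StringElement(std::string content) : content_(std::move(content)) {}
  std::string text() const override { return content_; }
private:
  std::string content_;
};
```

Subclasses that need a different str() can still override it; the default only removes the need for a workaround.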

Compiling against libxml2-2.9.1 fails on CentOS 7

Latest stable libfolia fails to compile on CentOS 7 (pretty old, I know) against libxml2-2.9.1 (2013, so old as well):

  folia_impl.cxx: In member function ‘virtual folia::FoliaElement* folia::AbstractElement::parseXml(const xmlNode*)’:
  folia_impl.cxx:3400:38: error: invalid conversion from ‘const xmlNode*’ {aka ‘const _xmlNode*’} to ‘xmlNodePtr’ {aka ‘_xmlNode*’} [-fpermissive]
   3400 |     int sp = xmlNodeGetSpacePreserve(node);
        |                                      ^~~~
        |                                      |
        |                                      const xmlNode* {aka const _xmlNode*}
  In file included from /usr/local/include/ticcutils/XMLtools.h:34,
                   from folia_impl.cxx:42:
  /usr/include/libxml2/libxml/tree.h:1081:39: note:   initializing argument 1 of ‘int xmlNodeGetSpacePreserve(xmlNodePtr)’
   1081 |   xmlNodeGetSpacePreserve (xmlNodePtr cur);
        |                            ~~~~~~~~~~~^~~

Still, libfolia advertises libxml2 2.6.16 or later...

Ref: https://github.com/proycon/python-ucto/actions/runs/3950834083/jobs/6763898599#step:5:1472
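
One possible workaround is to cast the constness away at the call site. The sketch below uses stand-in types so it compiles without libxml2: `FakeNode`, `legacy_get_space_preserve` and `get_space_preserve` are hypothetical, with the legacy function only mimicking the shape of the old non-const signature quoted in the error. Whether this is the right fix for libfolia is an assumption; it relies on the callee not modifying the node.

```cpp
// Stand-ins for the libxml2 2.9.1 declarations, so the sketch is self-contained.
struct FakeNode {
  int space_preserve;
};
typedef FakeNode* FakeNodePtr;

// Same shape as the old signature: takes a non-const node pointer.
int legacy_get_space_preserve(FakeNodePtr cur) {
  return cur->space_preserve;
}

// Wrapper that accepts a const node and removes the constness for the
// legacy call. This is only safe because the callee merely reads the node.
int get_space_preserve(const FakeNode* node) {
  return legacy_get_space_preserve(const_cast<FakeNode*>(node));
}
```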
