arademaker / hs-conllu
CoNLL-U/UD library
License: GNU Lesser General Public License v3.0
plus they can have the weird bracket notation, which I don't know what it is used for. For example:
```
8 قوللىرى قول NOUN N Case=Nom|Number=Plur|Number[psor]=Plur,Sing|Person[psor]=3 10 nsubj _ Translit=qolliri
```
from the Uyghur UD test set.
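For what it's worth, the bracketed part (`Number[psor]`) marks a layered feature in the UD guidelines, here features of the possessor. A minimal sketch of splitting such a feature into name, layer, and values; `parseFeat` is a hypothetical helper, not part of hs-conllu:

```haskell
-- Split a string on a separator character (no library dependencies).
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, _ : rest) -> a : splitOn c rest
  (a, [])       -> [a]

-- Split "Number[psor]=Plur,Sing" into ("Number", Just "psor", ["Plur","Sing"]);
-- a plain feature like "Case=Nom" yields ("Case", Nothing, ["Nom"]).
-- Assumes a well-formed FEAT=VALUE input (partial match on '=' otherwise).
parseFeat :: String -> (String, Maybe String, [String])
parseFeat fv =
  let (name, '=' : vals) = break (== '=') fv
  in case break (== '[') name of
       (n, '[' : layerRest) ->
         (n, Just (takeWhile (/= ']') layerRest), splitOn ',' vals)
       _ -> (name, Nothing, splitOn ',' vals)
```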
For users of your parser who do not use it standalone, but integrate it with other code, two simpler functions would be beneficial:

```haskell
-- | Parse a text (no IO) into sentences.
parseConllu :: P.Parser [T.Sentence] -> Text -> ErrOrVal [T.Sentence]
parseConllu parser text =
  case r of
    Left err -> Left (s2t $ M.parseErrorPretty err)
    Right ss -> Right ss
  where
    r = M.parse parser "" (t2s text) -- why is a source name required?

-- | Pretty-print a single sentence.
prettyPrintConlluSentence :: T.Sentence -> Text
prettyPrintConlluSentence = s2t . Pr.fromDiffList . Pr.printSent
```

where `ErrOrVal` is `Either Text` and `s2t` is a conversion function (`pack` from `Data.Text`).
In case you wonder: I use your package to parse output from CoreNLP (udfeats) and stick the results into a triple store. Thank you for your effort!
Is there a plan to incorporate the same level of validation performed by the Universal Dependencies tool, validate.py, found at https://github.com/UniversalDependencies/tools? Also, is there a plan to do a performance comparison between the official Universal Dependencies validation Python script and hs-conllu?
The Universal Dependencies organization's validation software provides 5 levels of validation. Because it is written in Python, I suspect it would be slower than an equivalent written in Haskell or C++.
so that we can accumulate all errors and report them, instead of getting them one at a time.
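The "accumulate all errors" idea can be sketched with a Validation-style applicative. This type is not in hs-conllu, and the field checks and tag lists below are illustrative only:

```haskell
-- A Validation applicative: unlike Either's Monad, its (<*>) keeps
-- going past a failure and concatenates all the errors it meets.
data Validation e a = Failure e | Success a
  deriving Show

instance Functor (Validation e) where
  fmap _ (Failure e) = Failure e
  fmap f (Success a) = Success (f a)

instance Monoid e => Applicative (Validation e) where
  pure = Success
  Failure e1 <*> Failure e2 = Failure (e1 <> e2) -- errors accumulate here
  Failure e  <*> Success _  = Failure e
  Success _  <*> Failure e  = Failure e
  Success f  <*> Success a  = Success (f a)

-- Hypothetical per-field checks on a token line (tiny tag lists for demo):
checkUpos :: String -> Validation [String] String
checkUpos t
  | t `elem` ["NOUN", "VERB", "ADJ"] = Success t
  | otherwise = Failure ["unknown UPOS: " ++ t]

checkDeprel :: String -> Validation [String] String
checkDeprel d
  | d `elem` ["nsubj", "obj", "root"] = Success d
  | otherwise = Failure ["unknown deprel: " ++ d]

-- Both problems are reported at once instead of one at a time:
checkToken :: String -> String -> Validation [String] (String, String)
checkToken u d = (,) <$> checkUpos u <*> checkDeprel d
```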
```
% cabal install hs-conllu
Resolving dependencies...
cabal: Could not resolve dependencies:
[__0] trying: hs-conllu-0.1.2 (user goal)
[__1] next goal: megaparsec (dependency of hs-conllu)
[__1] rejecting: megaparsec-9.0.1 (conflict: hs-conllu => megaparsec>=6 && <7)
[__1] skipping: megaparsec-9.0.0, megaparsec-8.0.0, megaparsec-7.0.5,
megaparsec-7.0.4, megaparsec-7.0.3, megaparsec-7.0.2, megaparsec-7.0.1,
megaparsec-7.0.0 (has the same characteristics that caused the previous
version to fail: excluded by constraint '>=6 && <7' from 'hs-conllu')
[__1] trying: megaparsec-6.5.0
[__2] next goal: base (dependency of hs-conllu)
[__2] rejecting: base-4.13.0.0/installed-4.13.0.0 (conflict: megaparsec =>
base>=4.7 && <4.13)
[__2] skipping: base-4.14.1.0, base-4.14.0.0, base-4.13.0.0 (has the same
characteristics that caused the previous version to fail: excluded by
constraint '>=4.7 && <4.13' from 'megaparsec')
[__2] rejecting: base-4.12.0.0, base-4.11.1.0, base-4.11.0.0, base-4.10.1.0,
base-4.10.0.0, base-4.9.1.0, base-4.9.0.0, base-4.8.2.0, base-4.8.1.0,
base-4.8.0.0, base-4.7.0.2, base-4.7.0.1, base-4.7.0.0, base-4.6.0.1,
base-4.6.0.0, base-4.5.1.0, base-4.5.0.0, base-4.4.1.0, base-4.4.0.0,
base-4.3.1.0, base-4.3.0.0, base-4.2.0.2, base-4.2.0.1, base-4.2.0.0,
base-4.1.0.0, base-4.0.0.0, base-3.0.3.2, base-3.0.3.1 (constraint from
non-upgradeable package requires installed instance)
[__2] fail (backjumping, conflict set: base, hs-conllu, megaparsec)
After searching the rest of the dependency tree exhaustively, these were the
goals I've had most trouble fulfilling: base, megaparsec, hs-conllu
```
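Reading the solver log: hs-conllu pins `megaparsec >=6 && <7`, forcing megaparsec-6.5.0, which in turn caps `base <4.13`, while the installed GHC ships a newer base. Two possible workarounds, neither tested against this package:

```shell
# Ask cabal to ignore the offending upper bounds (may or may not compile):
cabal install hs-conllu --allow-newer=hs-conllu:megaparsec,megaparsec:base

# Or use an older GHC whose base fits the bounds, e.g. via ghcup:
ghcup install ghc 8.6.5   # ships base-4.12, inside megaparsec-6.5.0's bounds
ghcup set ghc 8.6.5
cabal install hs-conllu
```

The real fix, of course, is a release with relaxed bounds in the `.cabal` file.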
it will parse `cc` first and then complain about a spurious `o`
at least to check a few properties, such as parsing/printing being the identity.
maybe use this library? https://github.com/mrkkrp/hspec-megaparsec
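The parse/print identity can be phrased as a simple property. Below is a toy stand-in (`Tok`, `render`, `parseLine` are not hs-conllu's types) just to show the shape such a test would take, whether run through QuickCheck/hspec-megaparsec or by hand:

```haskell
-- Toy token and a matching printer/parser pair.
data Tok = Tok { tokId :: Int, tokForm :: String }
  deriving (Eq, Show)

render :: Tok -> String
render (Tok i f) = show i ++ "\t" ++ f

parseLine :: String -> Maybe Tok
parseLine s = case break (== '\t') s of
  (i, '\t' : f) | [(n, "")] <- reads i -> Just (Tok n f)
  _ -> Nothing

-- The property under test: parsing a printed token gives it back.
roundtrips :: Tok -> Bool
roundtrips t = parseLine (render t) == Just t
```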
The function `printSent` seems to add a newline at the beginning of a sentence (and one at the end as well), which causes `print . read` not to be the identity. Perhaps you could remove the line which is commented out:
```haskell
printSent :: Sentence -> DiffList Char
printSent ss =
  mconcat
    [ printComments (_meta ss)
    -- , diffLSpace -- causes an extra space initially
    , printTks (_tokens ss)
    , diffLSpace
    ]
```
triggered by #17 and UniversalDependencies/UD_English-EWT#60.
files here.
ideally, semantic validation would be separate from parsing (syntactic validation), but maybe this would require too much mapping between types?
I wrote the text below as an open reply to @arademaker for our conversation on #32 about plans for the library.
I'd like to change the structure of the library a bit: first have a really dumb parser that accepts anything remotely matching the CoNLL-U format, then do light validation on top of it according to user specification. This would mean not hardcoding deprels and other entities, but reading a file that lists the acceptable ones (such files already exist for the canonical validating script, and the user could tweak them if they wanted to).
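A sketch of that split, with the permissive parse stage left untouched and validation reading a user-editable whitelist file. The file format and function names here are illustrative, not hs-conllu's actual API:

```haskell
type Deprel = String

-- One acceptable entity per line, in the spirit of the canonical
-- validator's data files; the user can edit the file to tweak
-- what is accepted.
loadDeprels :: FilePath -> IO [Deprel]
loadDeprels path = filter (not . null) . lines <$> readFile path

-- Validation is a separate, optional pass over already-parsed values,
-- so parsing itself never needs to know about deprels.
validateDeprel :: [Deprel] -> Deprel -> Either String Deprel
validateDeprel allowed d
  | d `elem` allowed = Right d
  | otherwise        = Left ("deprel not in whitelist: " ++ d)
```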
I also think the megaparsec library might be unnecessary, since the CoNLL-U format is so simple; but its performance is not bad and its error-reporting facilities are great (are we using them as well as we could?), so maybe I'd leave it be. If a performance need arises, then we might think about it.
I don't think it's worth it to implement full conllu validation, for the reasons I said on #34.
At some point I had plans for a query interface like the one at http://match.grew.fr (see master...query), but honestly I don't think it's worth implementing, since just loading the data into a graph database would give better-performing queries and visualization facilities for free :)
Finally, I started writing this library a long time ago when I first started learning Haskell, so I would also change the code quite a bit to reflect some of what I learned since then.
I used the parser on a small example and found that `dobj` is missing in the Dep enumerated type.
It might be better to move the tagset into separate modules to be imported qualified (to avoid constructions like AUXpos and similar), and to publish the tagset independently in a separate package for all to use and improve (I have constructed one which I hope is more complete and may submit a PR).
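For instance (module and constructor names illustrative, not hs-conllu's), a tagset module imported qualified lets constructors keep their plain names:

```haskell
-- This would live in its own file, e.g. Conllu/UPos.hs with a
-- "module Conllu.UPos where" header, and consumers would write
--   import qualified Conllu.UPos as UPos
-- referring to UPos.Aux instead of a mangled AUXpos.
data UPos = Adj | Adp | Adv | Aux | Cconj | Det | Noun | Verb
  deriving (Eq, Show, Enum, Bounded)

-- Enum/Bounded give the full tag list for free:
allTags :: [UPos]
allTags = [minBound .. maxBound]
```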
The error is produced by the code at https://github.com/cpdoc/dhbb-nlp/blob/master/bin/CountLexicon.hs; I can't say whether the problem is in the script or the library.
```
$ cat out.hs.orgs.error
countlex: Maybe.fromJust: Nothing
$ cat out.hs.people.error
countlex: Maybe.fromJust: Nothing
```
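`Maybe.fromJust: Nothing` carries no context, which is exactly why it's hard to tell whether the script or the library is at fault. A sketch of two safer alternatives; the lexicon-lookup function is made up for illustration:

```haskell
import Data.Maybe (fromMaybe)

-- Supply a visible default instead of crashing:
lookupLemma :: [(Int, String)] -> Int -> String
lookupLemma lexicon i =
  fromMaybe ("<missing:" ++ show i ++ ">") (lookup i lexicon)

-- Or fail loudly, but with context, instead of fromJust's bare message:
lookupLemma' :: [(Int, String)] -> Int -> String
lookupLemma' lexicon i = case lookup i lexicon of
  Just l  -> l
  Nothing -> error ("no lexicon entry for token id " ++ show i)
```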
http://universaldependencies.org/u/overview/enhanced-syntax.html, including the `ref` relation.
```
Resolving dependencies...
cabal: Could not resolve dependencies:
trying: Hs-conllu-0.0.1 (user goal)
next goal: base (dependency of Hs-conllu-0.0.1)
rejecting: base-4.8.1.0/installed-075... (conflict: Hs-conllu => base>=4.9 &&
<5)
rejecting: base-4.10.0.0, 4.9.1.0, 4.9.0.0, 4.8.2.0, 4.8.1.0, 4.8.0.0,
4.7.0.2, 4.7.0.1, 4.7.0.0, 4.6.0.1, 4.6.0.0, 4.5.1.0, 4.5.0.0, 4.4.1.0,
4.4.0.0, 4.3.1.0, 4.3.0.0, 4.2.0.2, 4.2.0.1, 4.2.0.0, 4.1.0.0, 4.0.0.0,
3.0.3.2, 3.0.3.1 (global constraint requires installed instance)
Dependency tree exhaustively searched.
```