arademaker / hs-conllu
CoNLL-U/UD library
License: GNU Lesser General Public License v3.0
plus they can have the weird bracket notation, which I don't know what it is used for. For example:
```
8 قوللىرى قول NOUN N Case=Nom|Number=Plur|Number[psor]=Plur,Sing|Person[psor]=3 10 nsubj _ Translit=qolliri
```
from the Uyghur UD test set.
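For what it's worth, the bracketed part (`Number[psor]`) marks a layered feature in the UD guidelines, here features of the possessor. A minimal sketch of splitting such a feature into name, layer, and values; `parseFeat` is a hypothetical helper, not part of hs-conllu:

```haskell
-- Split a string on a separator character (no library dependencies).
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, _ : rest) -> a : splitOn c rest
  (a, [])       -> [a]

-- Split "Number[psor]=Plur,Sing" into ("Number", Just "psor", ["Plur","Sing"]);
-- a plain feature like "Case=Nom" yields ("Case", Nothing, ["Nom"]).
-- Assumes a well-formed FEAT=VALUE input (partial match on '=' otherwise).
parseFeat :: String -> (String, Maybe String, [String])
parseFeat fv =
  let (name, '=' : vals) = break (== '=') fv
  in case break (== '[') name of
       (n, '[' : layerRest) ->
         (n, Just (takeWhile (/= ']') layerRest), splitOn ',' vals)
       _ -> (name, Nothing, splitOn ',' vals)
```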
For users of your parser who do not use it standalone, but integrate it with other code, two simpler functions would be beneficial:

```haskell
-- | Parse a text (no IO) into sentences.
parseConllu :: P.Parser [T.Sentence] -> Text -> ErrOrVal [T.Sentence]
parseConllu parser text =
  case r of
    Left err -> Left (s2t $ M.parseErrorPretty err)
    Right ss -> Right ss
  where
    r = M.parse parser "" (t2s text) -- why is a source name required?

-- | Pretty-print a single sentence.
prettyPrintConlluSentence :: T.Sentence -> Text
prettyPrintConlluSentence = s2t . Pr.fromDiffList . Pr.printSent
```

where `ErrOrVal` is `Either Text` and `s2t` is a conversion function (`pack` from `Data.Text`).
In case you wonder: I use your package to parse output from CoreNLP (udfeats) and stick the results into a triple store. Thank you for your effort!
Is there a plan to incorporate the same level of validation performed by the Universal Dependencies tool, validate.py, found at https://github.com/UniversalDependencies/tools? Also, is there a plan to do a performance comparison between the official Universal Dependencies validation Python script and hs-conllu?
The Universal Dependencies organization's validation software provides 5 levels of validation. Because it is written in Python, I suspect it would be slower than an equivalent written in Haskell or C++.
so that we can accumulate all errors and report them, instead of getting them one at a time.
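The "accumulate all errors" idea can be sketched with a Validation-style applicative. This type is not in hs-conllu, and the field checks and tag lists below are illustrative only:

```haskell
-- A Validation applicative: unlike Either's Monad, its (<*>) keeps
-- going past a failure and concatenates all the errors it meets.
data Validation e a = Failure e | Success a
  deriving Show

instance Functor (Validation e) where
  fmap _ (Failure e) = Failure e
  fmap f (Success a) = Success (f a)

instance Monoid e => Applicative (Validation e) where
  pure = Success
  Failure e1 <*> Failure e2 = Failure (e1 <> e2) -- errors accumulate here
  Failure e  <*> Success _  = Failure e
  Success _  <*> Failure e  = Failure e
  Success f  <*> Success a  = Success (f a)

-- Hypothetical per-field checks on a token line (tiny tag lists for demo):
checkUpos :: String -> Validation [String] String
checkUpos t
  | t `elem` ["NOUN", "VERB", "ADJ"] = Success t
  | otherwise = Failure ["unknown UPOS: " ++ t]

checkDeprel :: String -> Validation [String] String
checkDeprel d
  | d `elem` ["nsubj", "obj", "root"] = Success d
  | otherwise = Failure ["unknown deprel: " ++ d]

-- Both problems are reported at once instead of one at a time:
checkToken :: String -> String -> Validation [String] (String, String)
checkToken u d = (,) <$> checkUpos u <*> checkDeprel d
```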
```
% cabal install hs-conllu
Resolving dependencies...
cabal: Could not resolve dependencies:
[__0] trying: hs-conllu-0.1.2 (user goal)
[__1] next goal: megaparsec (dependency of hs-conllu)
[__1] rejecting: megaparsec-9.0.1 (conflict: hs-conllu => megaparsec>=6 && <7)
[__1] skipping: megaparsec-9.0.0, megaparsec-8.0.0, megaparsec-7.0.5,
megaparsec-7.0.4, megaparsec-7.0.3, megaparsec-7.0.2, megaparsec-7.0.1,
megaparsec-7.0.0 (has the same characteristics that caused the previous
version to fail: excluded by constraint '>=6 && <7' from 'hs-conllu')
[__1] trying: megaparsec-6.5.0
[__2] next goal: base (dependency of hs-conllu)
[__2] rejecting: base-4.13.0.0/installed-4.13.0.0 (conflict: megaparsec =>
base>=4.7 && <4.13)
[__2] skipping: base-4.14.1.0, base-4.14.0.0, base-4.13.0.0 (has the same
characteristics that caused the previous version to fail: excluded by
constraint '>=4.7 && <4.13' from 'megaparsec')
[__2] rejecting: base-4.12.0.0, base-4.11.1.0, base-4.11.0.0, base-4.10.1.0,
base-4.10.0.0, base-4.9.1.0, base-4.9.0.0, base-4.8.2.0, base-4.8.1.0,
base-4.8.0.0, base-4.7.0.2, base-4.7.0.1, base-4.7.0.0, base-4.6.0.1,
base-4.6.0.0, base-4.5.1.0, base-4.5.0.0, base-4.4.1.0, base-4.4.0.0,
base-4.3.1.0, base-4.3.0.0, base-4.2.0.2, base-4.2.0.1, base-4.2.0.0,
base-4.1.0.0, base-4.0.0.0, base-3.0.3.2, base-3.0.3.1 (constraint from
non-upgradeable package requires installed instance)
[__2] fail (backjumping, conflict set: base, hs-conllu, megaparsec)
After searching the rest of the dependency tree exhaustively, these were the
goals I've had most trouble fulfilling: base, megaparsec, hs-conllu
```
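Reading the solver log: hs-conllu pins `megaparsec >=6 && <7`, forcing megaparsec-6.5.0, which in turn caps `base <4.13`, while the installed GHC ships a newer base. Two possible workarounds, neither tested against this package:

```shell
# Ask cabal to ignore the offending upper bounds (may or may not compile):
cabal install hs-conllu --allow-newer=hs-conllu:megaparsec,megaparsec:base

# Or use an older GHC whose base fits the bounds, e.g. via ghcup:
ghcup install ghc 8.6.5   # ships base-4.12, inside megaparsec-6.5.0's bounds
ghcup set ghc 8.6.5
cabal install hs-conllu
```

The real fix, of course, is a release with relaxed bounds in the `.cabal` file.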
it will parse `cc` first and then complain about a spurious `o`
at least to check a few properties, such as parsing/printing being the identity.
maybe use this library? https://github.com/mrkkrp/hspec-megaparsec
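The parse/print identity can be phrased as a simple property. Below is a toy stand-in (`Tok`, `render`, `parseLine` are not hs-conllu's types) just to show the shape such a test would take, whether run through QuickCheck/hspec-megaparsec or by hand:

```haskell
-- Toy token and a matching printer/parser pair.
data Tok = Tok { tokId :: Int, tokForm :: String }
  deriving (Eq, Show)

render :: Tok -> String
render (Tok i f) = show i ++ "\t" ++ f

parseLine :: String -> Maybe Tok
parseLine s = case break (== '\t') s of
  (i, '\t' : f) | [(n, "")] <- reads i -> Just (Tok n f)
  _ -> Nothing

-- The property under test: parsing a printed token gives it back.
roundtrips :: Tok -> Bool
roundtrips t = parseLine (render t) == Just t
```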
The function `printSent` seems to add a newline at the beginning of a sentence (and one at the end as well), which causes `print . read` not to be the identity. Perhaps you could remove the line which is commented out:
```haskell
printSent :: Sentence -> DiffList Char
printSent ss =
  mconcat
    [ printComments (_meta ss)
    -- , diffLSpace -- causes an extra space initially
    , printTks (_tokens ss)
    , diffLSpace
    ]
```
triggered by #17 and UniversalDependencies/UD_English-EWT#60.
files here.
ideally, semantic validation would be separate from parsing (syntactic validation), but maybe this would require too much mapping between types?
I wrote the text below as an open reply to @arademaker for our conversation on #32 about plans for the library.
I'd like to change the structure of the library a bit: first have a really dumb parser that accepts anything remotely matching the CoNLL-U format, then do light validation on top of it according to user specification. This would mean not hardcoding deprels and other entities, but reading a file that lists the acceptable ones (such files already exist for the canonical validating script, and the user could tweak them if they wanted to).
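A sketch of that split, with the permissive parse stage left untouched and validation reading a user-editable whitelist file. The file format and function names here are illustrative, not hs-conllu's actual API:

```haskell
type Deprel = String

-- One acceptable entity per line, in the spirit of the canonical
-- validator's data files; the user can edit the file to tweak
-- what is accepted.
loadDeprels :: FilePath -> IO [Deprel]
loadDeprels path = filter (not . null) . lines <$> readFile path

-- Validation is a separate, optional pass over already-parsed values,
-- so parsing itself never needs to know about deprels.
validateDeprel :: [Deprel] -> Deprel -> Either String Deprel
validateDeprel allowed d
  | d `elem` allowed = Right d
  | otherwise        = Left ("deprel not in whitelist: " ++ d)
```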
I also think the megaparsec library might be unnecessary, since the CoNLL-U format is so simple; but its performance is not bad and its error-reporting facilities are great (are we using them as well as we could?), so maybe I'd leave it be. If a performance need arises, then we might think about it.
I don't think it's worth it to implement full conllu validation, for the reasons I said on #34.
At some point I had plans for a query interface like the one at http://match.grew.fr (see master...query), but honestly I don't think it's worth implementing, since just loading the data into a graph database would give better-performing queries and visualization facilities for free :)
Finally, I started writing this library a long time ago when I first started learning Haskell, so I would also change the code quite a bit to reflect some of what I learned since then.
I used the parser on a small example and found that `dobj` is missing in the Dep enumerated type.
It might be better to move the tagset into separate modules to be imported qualified (to avoid constructions like AUXpos and similar), and to publish the tagset independently in a separate package for all to use and improve (I have constructed one which I hope is more complete and may submit a PR).
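For instance (module and constructor names illustrative, not hs-conllu's), a tagset module imported qualified lets constructors keep their plain names:

```haskell
-- This would live in its own file, e.g. Conllu/UPos.hs with a
-- "module Conllu.UPos where" header, and consumers would write
--   import qualified Conllu.UPos as UPos
-- referring to UPos.Aux instead of a mangled AUXpos.
data UPos = Adj | Adp | Adv | Aux | Cconj | Det | Noun | Verb
  deriving (Eq, Show, Enum, Bounded)

-- Enum/Bounded give the full tag list for free:
allTags :: [UPos]
allTags = [minBound .. maxBound]
```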
The error is produced by the code at https://github.com/cpdoc/dhbb-nlp/blob/master/bin/CountLexicon.hs; I can't say whether the problem is in the script or the library.
```
$ cat out.hs.orgs.error
countlex: Maybe.fromJust: Nothing
$ cat out.hs.people.error
countlex: Maybe.fromJust: Nothing
```
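`Maybe.fromJust: Nothing` carries no context, which is exactly why it's hard to tell whether the script or the library is at fault. A sketch of two safer alternatives; the lexicon-lookup function is made up for illustration:

```haskell
import Data.Maybe (fromMaybe)

-- Supply a visible default instead of crashing:
lookupLemma :: [(Int, String)] -> Int -> String
lookupLemma lexicon i =
  fromMaybe ("<missing:" ++ show i ++ ">") (lookup i lexicon)

-- Or fail loudly, but with context, instead of fromJust's bare message:
lookupLemma' :: [(Int, String)] -> Int -> String
lookupLemma' lexicon i = case lookup i lexicon of
  Just l  -> l
  Nothing -> error ("no lexicon entry for token id " ++ show i)
```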
http://universaldependencies.org/u/overview/enhanced-syntax.html, including the `ref` relation.
```
Resolving dependencies...
cabal: Could not resolve dependencies:
trying: Hs-conllu-0.0.1 (user goal)
next goal: base (dependency of Hs-conllu-0.0.1)
rejecting: base-4.8.1.0/installed-075... (conflict: Hs-conllu => base>=4.9 &&
<5)
rejecting: base-4.10.0.0, 4.9.1.0, 4.9.0.0, 4.8.2.0, 4.8.1.0, 4.8.0.0,
4.7.0.2, 4.7.0.1, 4.7.0.0, 4.6.0.1, 4.6.0.0, 4.5.1.0, 4.5.0.0, 4.4.1.0,
4.4.0.0, 4.3.1.0, 4.3.0.0, 4.2.0.2, 4.2.0.1, 4.2.0.0, 4.1.0.0, 4.0.0.0,
3.0.3.2, 3.0.3.1 (global constraint requires installed instance)
Dependency tree exhaustively searched.
```