Comments (7)
Will fix (it is not totally trivial and might take a moment): I suggest all newlines are OK (CRLF, CR, LF). I don't think we can force Unix LF only, because such files would not open properly in Windows/Mac editors. @jnivre, @dan-zeman: Do you have any experience with the newline character? Has it caused issues previously in shared tasks, evaluation, etc.?
Yes and no. I think the standard has (implicitly) been unix LF, and the occasional problems have been caused by files imported from other systems. I am not sure whether it is best to have a “forgiving” standard, which might lead to more problems when people apply tools developed under a particular system, or whether we should try to enforce one standard and make it more complicated for people to use other platforms. Perhaps we should allow all newlines but give a warning for cases other than LF saying something like: “Your file has newlines of type X, which may cause problems when …”. I don’t know …
Joakim
Not sure about shared tasks, I think that the data there were always LF-only but I don't know what the evaluation scripts would do if a participating system submitted files with CRLF.
Otherwise, I use a mixed environment (I run most computation on Linux machines but do most of my editing on a Windows laptop) and I try to stick with LF-only everywhere. I switched to Windows editors that can handle LF-only files, and I configured them to use LF as the default. Obviously, other Windows-based people may be more sensitive to Linux line breaks.
I would suggest that the tools should not crash on CRLF, and the validators should optionally approve CRLF files (default option would be LF-only). And mixed files (one line LF, the other CRLF) should not be allowed.
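As a rough illustration of the policy suggested here (LF-only by default, CRLF accepted only on request, mixed files always rejected), here is a minimal Python sketch. The names `newline_styles`, `newlines_ok` and `allow_crlf` are hypothetical and not taken from any existing UD tool.

```python
from __future__ import annotations

import re

# Hypothetical helpers for illustration only; not part of any UD tool.
# CRLF is listed first so that "\r\n" is matched as one terminator.
_TERMINATOR = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\r": "CR", b"\n": "LF"}


def newline_styles(data: bytes) -> set[str]:
    """Return the set of line-terminator styles found in the raw bytes."""
    return {_NAMES[m.group()] for m in _TERMINATOR.finditer(data)}


def newlines_ok(data: bytes, allow_crlf: bool = False) -> bool:
    """Apply the suggested policy to the raw file content.

    - Mixed terminators are never accepted.
    - LF-only is always accepted.
    - CRLF-only is accepted only when allow_crlf is True.
    - Bare CR is never accepted.
    """
    styles = newline_styles(data)
    if len(styles) > 1:
        return False  # mixed file, e.g. one line LF and another CRLF
    allowed = {"LF", "CRLF"} if allow_crlf else {"LF"}
    return styles <= allowed
```

The check works on bytes rather than decoded text because opening a file in text mode with default settings translates newlines and would hide the difference.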
+1 for LF-only at least for the format specification. Not crashing on CRLF would be a bonus :-)
Re: "forgiving" vs. "strict" standards: in our role as format designers, I'd suggest we aim to be "conservative in what you send ..." (http://en.wikipedia.org/wiki/Robustness_principle).
Permitting CRLF but strongly encouraging LF-only (for example) invites a category of subtle bugs in tools written (by others) to consume the CoNLL-U format which only surface for the rare (but valid) datasets that make use of the CRLF allowance.
I had the robustness principle in mind when I suggested that we make LF the default (send LF only) but do not disallow CRLF (accept it).
But I agree with your point about the subtle bugs. I personally have no problem with postulating that the CR character is banned from CoNLL-U files.
The validator now survives CRLF (no crash; it checks the file as usual for other errors). A validation error is given if the file contains any line terminator other than LF. I'll close this now, but will reopen if it dies on a Windows machine. I have zero experience with those.
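For readers who want to see the general shape of such a check, here is a minimal Python sketch of the behaviour described: it keeps going through the whole input and records an error for every line that ends in something other than LF. This is an assumption-laden illustration, not the validator's actual code; the function name `check_line_endings` and the message wording are made up.

```python
from __future__ import annotations


def check_line_endings(data: bytes) -> list[str]:
    """Collect one error message per line that does not end in a plain LF.

    Hypothetical sketch: the real validator may be structured differently.
    bytes.splitlines() splits on LF, CR and CRLF, so CRLF input is handled
    without crashing and the remaining lines are still visited.
    """
    errors = []
    for lineno, line in enumerate(data.splitlines(keepends=True), start=1):
        if line.endswith(b"\r\n"):
            errors.append(f"Line {lineno}: CRLF line terminator, expected LF")
        elif line.endswith(b"\r"):
            errors.append(f"Line {lineno}: CR line terminator, expected LF")
    return errors
```

For example, `check_line_endings(b"# sent_id = 1\r\n1\tword\n")` reports only the first line, and an LF-only input yields an empty list.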