Comments (5)

goodmami avatar goodmami commented on August 30, 2024

Are you talking about warning during validation or during normal use of Wn? To me, the former seems acceptable but not the latter, as this is just bad formatting and not something that makes the data less correct or usable.

In Wn, I try to store in the database an accurate representation of what was in the WN-LMF file, such that exporting the data would result in an equivalent WN-LMF file, so I don't think stripping the definitions is a good solution. However, it would be fine with me if OMW wanted to fix these things during the compilation of its wordnets.

from wn.

fcbond avatar fcbond commented on August 30, 2024

goodmami avatar goodmami commented on August 30, 2024

> I was indeed thinking of warning during validation.

Ok, good. It wasn't clear, so I changed the title. We could also check for similar whitespace issues in other elements like <ILIDefinition>, <Count>, <Tag>, and <Pronunciation>, or in attribute values, like for writtenForm or subcategorizationFrame.
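
As a rough illustration of the kind of validation check discussed here, the following is a minimal sketch that flags leading or trailing whitespace in element text and attribute values. It uses only the standard library's `xml.etree.ElementTree`; the function names and the lists of elements and attributes are assumptions made for this example, not Wn's actual validation code.

```python
import xml.etree.ElementTree as ET

# Hypothetical lists for illustration; not Wn's actual validation rules.
TEXT_ELEMENTS = {"Definition", "ILIDefinition", "Count", "Tag", "Pronunciation"}
CHECKED_ATTRIBUTES = {"writtenForm", "subcategorizationFrame"}


def has_edge_whitespace(value):
    """Return True for leading/trailing whitespace, including whitespace-only values."""
    return value != value.strip()


def find_whitespace_issues(xml_text):
    """Yield (tag, kind, raw_value) for each suspicious text or attribute value."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in TEXT_ELEMENTS and elem.text and has_edge_whitespace(elem.text):
            yield (elem.tag, "text", elem.text)
        for name, value in elem.attrib.items():
            if name in CHECKED_ATTRIBUTES and has_edge_whitespace(value):
                yield (elem.tag, name, value)


# A made-up fragment, not a complete or valid WN-LMF file.
SAMPLE = (
    "<LexicalResource><Lexicon id='test-en'>"
    "<LexicalEntry id='e1'><Lemma writtenForm=' dog ' partOfSpeech='n'/></LexicalEntry>"
    "<Synset id='s1'><Definition>\n\t\t\n\t\t\n</Definition></Synset>"
    "<Synset id='s2'><Definition>a domesticated animal</Definition></Synset>"
    "</Lexicon></LexicalResource>"
)

for tag, kind, raw in find_whitespace_issues(SAMPLE):
    print(f"warning: <{tag}> {kind} has surrounding whitespace: {raw!r}")
```

A check like this only reports problems and leaves the data untouched, so what is stored would still match the file.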

> I think returning \n\t\t\n\t\t\n for the definition of a synset, rather than None, is less correct and does make it less usable. However, as you say, the ideal time to catch this is when the wordnet is made, not when we load.

Right. It's less correct for the language, but it's an accurate representation of what's in the data. I don't think Wn should be deciding what it thinks a language should look like. The data should do that.

francis-dion avatar francis-dion commented on August 30, 2024

My understanding is that, in XML, white space after the opening tag and before the closing tag should be ignored.
I didn't trace the original specs, but I found multiple references, including this one from Adobe:
> XML ignores the first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag. XML translates non-space characters (tab and new-line) into a space character and consolidates all multiple space characters into a single space

If the author of a wordnet wants or needs white space preserved, they should use the xml:space attribute. Here's a quote from O'Reilly's XML Pocket Reference:
> When xml:space is used on an element with a value of preserve, the whitespace in that element's content must be preserved as is by the application that processes it. The whitespace is always passed on to the processing application, but xml:space provides the application with a hint regarding how to process it.

Otherwise, I believe leading/trailing white space should definitely be stripped. I also think (albeit less strongly :-) that wn should be translating non-space white space characters (tab and new-line) into a space character and consolidating all multiple space characters into a single space.
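
The stripping and consolidation described above could look something like the following minimal sketch. The function name is hypothetical, and returning None for whitespace-only input is an assumption of this example rather than anything Wn does today.

```python
import re

# XML 1.0 treats only space, tab, carriage return, and line feed as white space.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_whitespace(text):
    """Collapse runs of XML white space to one space and strip both ends.

    Returns None when nothing but white space remains (an assumption of
    this sketch).
    """
    if text is None:
        return None
    collapsed = _XML_WS.sub(" ", text).strip(" ")
    return collapsed or None


print(normalize_whitespace("\n\t\t\n\t\t\n"))
print(normalize_whitespace("  a  small\n\tdog "))
```

The pattern is limited to the four XML white space characters on purpose, so other Unicode spacing such as a no-break space is left alone.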

goodmami avatar goodmami commented on August 30, 2024

Thanks, @francis-dion, that's a good point. I'd forgotten about xml:space. The W3 spec says this about the default value:

> The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space.

So when xml:space is not specified, it's not that the spacing should be stripped, but that the application should use its default whitespace processing. So, yes, Wn could strip (and normalize) whitespace if xml:space is not present. One issue arises if a wordnet author wishes to preserve whitespace. Obviously the answer is to use xml:space on the element, but the WN-LMF spec needs to declare the attribute for it to be used. From the same W3 spec:

> In valid documents, this attribute, like any other, MUST be declared if it is used.
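
To make the idea concrete, here is a minimal sketch of reading element text while honoring xml:space, with normalization as the application's default mode. It relies on `xml.etree.ElementTree`, which reports the attribute under the reserved XML namespace. The helper name `iter_text` and the sample fragment are assumptions for illustration. The parser used here does not validate against a DTD, so the sketch says nothing about whether the attribute is declared.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
_XML_WS = re.compile(r"[ \t\r\n]+")


def iter_text(elem, preserve=False):
    """Yield (element, text) pairs, normalizing unless xml:space="preserve" applies.

    The setting is inherited by descendants until an element overrides it.
    """
    setting = elem.get(XML_SPACE)
    if setting == "preserve":
        preserve = True
    elif setting == "default":
        preserve = False
    text = elem.text or ""
    if not preserve:
        # Hypothetical default mode: collapse runs and strip the ends.
        text = _XML_WS.sub(" ", text).strip(" ")
    yield elem, text
    for child in elem:
        yield from iter_text(child, preserve)


# A made-up fragment, not a complete or valid WN-LMF file.
SAMPLE = (
    "<Synset id='s1'>"
    "<Definition>\n\t a small   dog\n</Definition>"
    "<Example xml:space='preserve'>  two  spaces  </Example>"
    "</Synset>"
)

root = ET.fromstring(SAMPLE)
texts = {elem.tag: text for elem, text in iter_text(root)}
print(repr(texts["Definition"]))
print(repr(texts["Example"]))
```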
