
nlcst

Natural Language Concrete Syntax Tree format.


nlcst is a specification for representing natural language in a syntax tree. It implements the unist spec.

This document may not be released. See releases for released documents. The latest released version is 1.0.2.

Introduction

This document defines a format for representing natural language as a concrete syntax tree. Development of nlcst started in May 2014, in the now deprecated textom project for retext, before unist existed. This specification is written in a Web IDL-like grammar.

Where this specification fits

nlcst extends unist, a format for syntax trees, to benefit from its ecosystem of utilities.

nlcst relates to JavaScript in that it has an ecosystem of utilities for working with compliant syntax trees in JavaScript. However, nlcst is not limited to JavaScript and can be used in other programming languages.

nlcst relates to the unified and retext projects in that nlcst syntax trees are used throughout their ecosystems.

Types

If you are using TypeScript, you can use the nlcst types by installing them with npm:

npm install @types/nlcst

Nodes (abstract)

Literal

interface Literal <: UnistLiteral {
  value: string
}

Literal (UnistLiteral) represents a node in nlcst containing a value.

Its value field is a string.
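
As a rough illustration, a literal can be written as a plain JavaScript object. The check below is a minimal sketch; `isLiteral` is a hypothetical helper name and is not part of nlcst or its utilities.

```javascript
// Hypothetical helper (not part of nlcst): true when `node` has the shape
// that the Literal interface describes, i.e. a string `type` and `value`.
function isLiteral(node) {
  return (
    node !== null &&
    typeof node === 'object' &&
    typeof node.type === 'string' &&
    typeof node.value === 'string'
  )
}

const text = {type: 'TextNode', value: 'Alpha'}
const word = {type: 'WordNode', children: [text]}

console.log(isLiteral(text)) // true
console.log(isLiteral(word)) // false
```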

Parent

interface Parent <: UnistParent {
  children: [Paragraph | Punctuation | Sentence | Source | Symbol | Text | WhiteSpace | Word]
}

Parent (UnistParent) represents a node in nlcst containing other nodes (said to be children).

Its content is limited to only other nlcst content.
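
Because a parent only holds other nlcst nodes, its text can be recovered by walking down to the literals and joining their values. The following is a minimal sketch of that idea; `stringify` is a hypothetical name used for illustration and is not the implementation of any existing utility.

```javascript
// Hypothetical helper: concatenates the values of all literals under `node`.
function stringify(node) {
  // A literal contributes its own value.
  if (typeof node.value === 'string') return node.value
  // A parent contributes the joined text of its children.
  if (Array.isArray(node.children)) {
    return node.children.map((child) => stringify(child)).join('')
  }
  return ''
}

const word = {
  type: 'WordNode',
  children: [
    {type: 'TextNode', value: 'AT'},
    {type: 'SymbolNode', value: '&'},
    {type: 'TextNode', value: 'T'}
  ]
}

console.log(stringify(word)) // AT&T
```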

Nodes

Paragraph

interface Paragraph <: Parent {
  type: 'ParagraphNode'
  children: [Sentence | Source | WhiteSpace]
}

Paragraph (Parent) represents a unit of discourse dealing with a particular point or idea.

Paragraph can be used in a root node. It can contain sentence, whitespace, and source nodes.
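
For example, the text "Alpha bravo. Charlie." could be represented as the paragraph below. This is a hand-written sketch of one plausible tree, not the output of a particular parser.

```javascript
// "Alpha bravo. Charlie." as a paragraph: two sentences separated by white space.
const paragraph = {
  type: 'ParagraphNode',
  children: [
    {
      type: 'SentenceNode',
      children: [
        {type: 'WordNode', children: [{type: 'TextNode', value: 'Alpha'}]},
        {type: 'WhiteSpaceNode', value: ' '},
        {type: 'WordNode', children: [{type: 'TextNode', value: 'bravo'}]},
        {type: 'PunctuationNode', value: '.'}
      ]
    },
    {type: 'WhiteSpaceNode', value: ' '},
    {
      type: 'SentenceNode',
      children: [
        {type: 'WordNode', children: [{type: 'TextNode', value: 'Charlie'}]},
        {type: 'PunctuationNode', value: '.'}
      ]
    }
  ]
}

// Count the sentences that are direct children of the paragraph.
const sentenceCount = paragraph.children.filter(
  (child) => child.type === 'SentenceNode'
).length

console.log(sentenceCount) // 2
```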

Punctuation

interface Punctuation <: Literal {
  type: 'PunctuationNode'
}

Punctuation (Literal) represents typographical devices which aid understanding and correct reading of other grammatical units.

Punctuation can be used in sentence or word nodes.

Root

interface Root <: Parent {
  type: 'RootNode'
}

Root (Parent) represents a document.

Root can be used as the root of a tree, never as a child. Its content model is not limited: it can contain any nlcst content, with the restriction that all content must be of the same category.

Sentence

interface Sentence <: Parent {
  type: 'SentenceNode'
  children: [Punctuation | Source | Symbol | WhiteSpace | Word]
}

Sentence (Parent) represents a grouping of grammatically linked words that in principle expresses a complete thought, although it may make little sense when taken out of context.

Sentence can be used in a paragraph node. It can contain word, symbol, punctuation, whitespace, and source nodes.
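
A sentence such as "Hello, world!" could be assembled as sketched below. The `literal`, `parent`, and `word` builders are hypothetical helpers introduced only to keep the example short.

```javascript
// Hypothetical builders, named here for illustration only.
const literal = (type, value) => ({type, value})
const parent = (type, children) => ({type, children})
const word = (value) => parent('WordNode', [literal('TextNode', value)])

// "Hello, world!" as a sentence.
const sentence = parent('SentenceNode', [
  word('Hello'),
  literal('PunctuationNode', ','),
  literal('WhiteSpaceNode', ' '),
  word('world'),
  literal('PunctuationNode', '!')
])

// List the types of the sentence's direct children, in order.
const childTypes = sentence.children.map((child) => child.type)

console.log(childTypes.join(' '))
// WordNode PunctuationNode WhiteSpaceNode WordNode PunctuationNode
```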

Source

interface Source <: Literal {
  type: 'SourceNode'
}

Source (Literal) represents an external (ungrammatical) value embedded into a grammatical unit: a hyperlink, code, and such.

Source can be used in root, paragraph, sentence, or word nodes.
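
The sketch below shows why marking embedded content as source can be useful: a consumer that only inspects words never sees it. The sentence and the decision to treat `npm install` as source are assumptions made for this example; how a parser decides what counts as source is outside its scope.

```javascript
// "Run npm install first." where the command is embedded source, not prose.
const sentence = {
  type: 'SentenceNode',
  children: [
    {type: 'WordNode', children: [{type: 'TextNode', value: 'Run'}]},
    {type: 'WhiteSpaceNode', value: ' '},
    {type: 'SourceNode', value: 'npm install'},
    {type: 'WhiteSpaceNode', value: ' '},
    {type: 'WordNode', children: [{type: 'TextNode', value: 'first'}]},
    {type: 'PunctuationNode', value: '.'}
  ]
}

// Collect the text of each word; the source node is skipped entirely.
const words = sentence.children
  .filter((child) => child.type === 'WordNode')
  .map((wordNode) => wordNode.children.map((c) => c.value).join(''))

console.log(words) // [ 'Run', 'first' ]
```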

Symbol

interface Symbol <: Literal {
  type: 'SymbolNode'
}

Symbol (Literal) represents typographical devices other than characters that represent sounds (such as letters and numerals), white space, or punctuation.

Symbol can be used in sentence or word nodes.

Text

interface Text <: Literal {
  type: 'TextNode'
}

Text (Literal) represents actual content in nlcst documents: one or more characters.

Text can be used in word nodes.

WhiteSpace

interface WhiteSpace <: Literal {
  type: 'WhiteSpaceNode'
}

WhiteSpace (Literal) represents typographical devices devoid of content, separating other units.

WhiteSpace can be used in root, paragraph, or sentence nodes.

Word

interface Word <: Parent {
  type: 'WordNode'
  children: [Punctuation | Source | Symbol | Text]
}

Word (Parent) represents the smallest element that may be uttered in isolation with semantic or pragmatic content.

Word can be used in a sentence node. It can contain text, symbol, punctuation, and source nodes.
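
The content models above can be collected into a single table and checked mechanically. The following is a minimal sketch under that assumption; `contentModel` and `findViolations` are hypothetical names, and the extra restriction on Root (all content of the same category) is not checked here.

```javascript
// Allowed child types per parent, transcribed from the interfaces in this document.
const anyContent = [
  'ParagraphNode', 'PunctuationNode', 'SentenceNode', 'SourceNode',
  'SymbolNode', 'TextNode', 'WhiteSpaceNode', 'WordNode'
]
const contentModel = new Map([
  ['RootNode', anyContent],
  ['ParagraphNode', ['SentenceNode', 'SourceNode', 'WhiteSpaceNode']],
  ['SentenceNode', ['PunctuationNode', 'SourceNode', 'SymbolNode', 'WhiteSpaceNode', 'WordNode']],
  ['WordNode', ['PunctuationNode', 'SourceNode', 'SymbolNode', 'TextNode']]
])

// Hypothetical helper: returns one message per child that its parent may not contain.
function findViolations(node, problems = []) {
  const allowed = contentModel.get(node.type)
  // Literals and unknown types have no children to check.
  if (!allowed || !Array.isArray(node.children)) return problems
  for (const child of node.children) {
    if (!allowed.includes(child.type)) {
      problems.push(`${child.type} is not allowed in ${node.type}`)
    }
    findViolations(child, problems)
  }
  return problems
}

const valid = {type: 'WordNode', children: [{type: 'TextNode', value: 'Alpha'}]}
const invalid = {type: 'WordNode', children: [{type: 'WhiteSpaceNode', value: ' '}]}

console.log(findViolations(valid)) // []
console.log(findViolations(invalid)) // [ 'WhiteSpaceNode is not allowed in WordNode' ]
```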

Glossary

See the unist glossary.

List of utilities

See the unist list of utilities for more utilities.

Related

  • mdast — Markdown Abstract Syntax Tree format
  • hast — Hypertext Abstract Syntax Tree format
  • xast — Extensible Abstract Syntax Tree

Contribute

See contributing.md in syntax-tree/.github for ways to get started. See support.md for ways to get help. Ideas for new utilities and tools can be posted in syntax-tree/ideas.

A curated list of awesome syntax-tree, unist, mdast, hast, xast, and nlcst resources can be found in awesome syntax-tree.

This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.

Acknowledgments

The initial release of this project was authored by @wooorm.

Thanks to @nwtn, @tmcw, @muraken720, and @dozoisch for contributing to nlcst and related projects!

License

CC-BY-4.0 © Titus Wormer

nlcst's People

Contributors

christianmurphy, tbroadley, wooorm

nlcst's Issues

Rename node types

Subject of the feature

The rest of unified uses lowercase, suffixless values (for example, RootNode -> root). It would make sense to align them.

Actual behaviour

ParagraphNode

Expected behaviour

paragraph

Alternatives

Keep it as it is. It’s been this way for a while, and changing it would also cause some confusion for a while.
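
If the rename were adopted, existing trees could be converted with a small mapping. The sketch below assumes the rule is "drop the `Node` suffix and lowercase the rest". Only `RootNode` -> `root` and `ParagraphNode` -> `paragraph` come from this issue; how other names such as `WhiteSpaceNode` would be spelled is an assumption, and `toSuffixlessType` and `renameTypes` are hypothetical helpers.

```javascript
// Hypothetical mapping: drops the `Node` suffix and lowercases the rest.
// The casing of multi-word names (e.g. `WhiteSpaceNode`) is assumed.
function toSuffixlessType(type) {
  return type.replace(/Node$/, '').toLowerCase()
}

// Returns a renamed copy of the tree and leaves the input untouched.
function renameTypes(node) {
  const copy = {...node, type: toSuffixlessType(node.type)}
  if (Array.isArray(node.children)) {
    copy.children = node.children.map((child) => renameTypes(child))
  }
  return copy
}

const tree = {
  type: 'RootNode',
  children: [{type: 'ParagraphNode', children: []}]
}

const renamed = renameTypes(tree)
console.log(renamed.type) // root
console.log(renamed.children[0].type) // paragraph
```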

Should add `Symbol`

Interesting conundrum. Punctuation marks are, in fact, symbols. However, not every symbol is a punctuation mark.

Implementation:

  • Add a SymbolNode, much like the current PunctuationNode;
  • PunctuationNode should sub-class SymbolNode;
  • WhiteSpaceNode should, I’d argue, keep on sub-classing PunctuationNode.
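
The proposed hierarchy can be sketched with JavaScript classes. This only illustrates the sub-classing described in the list above; the constructors are assumptions, and the interfaces earlier in this document define Symbol, Punctuation, and WhiteSpace as separate literals instead.

```javascript
// Sketch of the proposed hierarchy only; the class names mirror the node
// types in the issue, and the constructors are assumptions.
class SymbolNode {
  constructor(value) {
    this.type = 'SymbolNode'
    this.value = value
  }
}

// Every punctuation mark is a symbol.
class PunctuationNode extends SymbolNode {
  constructor(value) {
    super(value)
    this.type = 'PunctuationNode'
  }
}

// White space keeps sub-classing punctuation, as the issue argues.
class WhiteSpaceNode extends PunctuationNode {
  constructor(value) {
    super(value)
    this.type = 'WhiteSpaceNode'
  }
}

const comma = new PunctuationNode(',')
const ampersand = new SymbolNode('&')

console.log(comma instanceof SymbolNode) // true
console.log(ampersand instanceof PunctuationNode) // false
```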

`WordNode`s sometimes contain ending punctuation.

I'm having a hard time tracking down exactly where this bug is. When I run documents through retext-spell, the WordNodes being passed in sometimes contain the ending punctuation, which causes the word to be flagged as misspelled. For example, I've got the following Markdown:

This tool uses [retext](https://github.com/wooorm/retext) to check the quality of writing in your project's documentation using these plugins;

* [remark-lint](https://github.com/wooorm/remark-lint) checks for proper markdown formatting.
* [retext-readability](https://github.com/wooorm/retext-readability) checks the reading level of the whole document.
* [retext-simplify](https://github.com/wooorm/retext-simplify) warns on over-complicated phrases.
* [retext-equality](https://github.com/wooorm/retext-equality) warns on insensitive, inconsiderate language.
* [retext-intensify](https://github.com/wooorm/retext-intensify) warns on filler, weasel and hedge words.

I'm passing it through retext().use(spell) here, and when the tree is passed through retext-spell, I'm getting WordNodes that look like this:

{ type: 'WordNode',
  children:
   [ { type: 'TextNode', value: 'language', position: [Object] },
     { type: 'PunctuationNode', value: '.', position: [Object] } ],
  position:
   { start: { line: 81, column: 100, offset: 2580 },
     end: { line: 81, column: 109, offset: 2589 } } }

Finally, calling toString(node) on these nodes returns the following strings, which fail the spell check: `formatting.`, `document.`, `phrases.`, and `language.`. For some reason the last node doesn't include the ending punctuation.

Any idea what might be going on here and which module a fix should be applied in? I know I could work around the problem by modifying nodes with ending punctuation in retext-spell, like this:

// If the last child of the WordNode is a PunctuationNode, remove it.
const last = node.children[node.children.length - 1];
if (last && last.type === 'PunctuationNode') {
    node.children.pop();
}

But that doesn't seem like the right place to fix the problem.

Thoughts?
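
One alternative to mutating the tree is to compute the word's text while ignoring trailing punctuation. The following is a minimal sketch of that approach, not a fix for the underlying cause; `wordWithoutTrailingPunctuation` is a hypothetical helper and is not part of retext-spell.

```javascript
// Hypothetical helper: text of a word, ignoring trailing PunctuationNode
// children, without modifying the node itself.
function wordWithoutTrailingPunctuation(wordNode) {
  const children = wordNode.children || []
  let end = children.length
  // Step back over any run of punctuation at the end of the word.
  while (end > 0 && children[end - 1].type === 'PunctuationNode') {
    end--
  }
  return children
    .slice(0, end)
    .map((child) => (typeof child.value === 'string' ? child.value : ''))
    .join('')
}

const word = {
  type: 'WordNode',
  children: [
    {type: 'TextNode', value: 'language'},
    {type: 'PunctuationNode', value: '.'}
  ]
}

console.log(wordWithoutTrailingPunctuation(word)) // language
console.log(word.children.length) // 2
```

Punctuation inside a word, such as an apostrophe followed by more text, is left in place because only the trailing run is skipped.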
