danieldk / conllx-rs

CoNLL-X reader and writers for Rust

License: Apache License 2.0

Rust 99.65% Shell 0.35%

rust conll conllx treebank corpus

conllx-rs's Introduction

Hi! 👋 I am Daniël.

🔭 I’m currently working on:
- Machine learning models for natural language processing.
- Contributing to various parts of nixpkgs.
🔨 I currently use: [CP]ython 🐍, Rust 🦀, macOS, Fedora, NixOS ❄, and Torch 🔥.
🌱 I’m currently working at Hugging Face 🤗.
😄 Pronouns: he/him/his
📫 How to reach me: Send me an e-mail!

conllx-rs's People

Contributors

Stargazers

Watchers

Forkers

sebpuetz twuebi

conllx-rs's Issues

Add a method to `DepGraphMut` to remove an edge

Handling colons in features

If a feature string contains multiple colons, everything after the second colon is discarded:

let feature_map = Features::from("key:val").as_map();
feature_map.get("key").unwrap() == "val";

let feature_map = Features::from("key:val:something_else").as_map();
feature_map.get("key").unwrap() == "val";

In both cases the same result is returned. The behaviour comes from lines 255-261 in conllx::token:

        for fv in self.features.split('|') {
            let mut iter = fv.split(':');
            if let Some(k) = iter.next() {
                let v = iter.next().map(|s| s.to_owned());
                features.insert(k.to_owned(), v.to_owned());
            }
        }

A simple workaround could be to use find(':') to locate the index of the first colon in the key-value pair and split the string at that index.

Something like (untested):

            if let Some(idx) = fv.find(':') {
                // Split at the first colon only; later colons stay in the value.
                let k = fv[..idx].to_owned();
                let v = fv[idx + 1..].to_owned();
                features.insert(k, Some(v));
            } else {
                // No colon: a key without a value.
                features.insert(fv.to_owned(), None);
            }
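For reference, a self-contained sketch of the split-on-first-colon idea, using `str::split_once` from the standard library instead of `find`. The free function `parse_features` is a hypothetical stand-in for the loop in conllx::token, not the crate's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: parses a `|`-separated feature string,
/// splitting each key-value pair on the first colon only.
fn parse_features(features: &str) -> BTreeMap<String, Option<String>> {
    let mut map = BTreeMap::new();
    for fv in features.split('|') {
        match fv.split_once(':') {
            // Everything after the first colon is the value,
            // including any further colons.
            Some((k, v)) => map.insert(k.to_owned(), Some(v.to_owned())),
            // No colon: a key without a value.
            None => map.insert(fv.to_owned(), None),
        };
    }
    map
}
```

With this, `parse_features("key:val:something_else")` maps `key` to `val:something_else` rather than to `val`.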

Since many downstream tasks need a graph representation of the CoNLL-X data, it would be great if this crate could provide that. However, there are many different approaches, and we should hash them out a bit more.

Relevant to the discussion of different approaches is how the graph representation relates to Vec<Token>.

A conversion of Vec<Token> to the graph representation is provided.

Advantages:

No API changes.
Simple representation that works well for tasks where syntax is irrelevant (tagging, etc.).

Disadvantages:

Requires conversion.
Multiple representations in the same crate.
The token indices and head()/p_head() indices mismatch.
Needs to own tokens for the graph to be usefully mutable.
Probably needs its own Token type, otherwise head()/p_head() and the graph could become inconsistent.

The graph representation becomes the representation of CoNLL-X sentences.

Advantages:

A single representation, no conversions.
The standard representation is the representation that many downstream tasks use.

Disadvantages:

Large API change.
Could be awkward for token-oriented tasks if no direct indexing of tokens is provided.
Will (depending on the implementation) be less efficient to construct.

Let the user decide the representation and provide read_to_graph() and read_to_vec() on the ReadSentence trait. (@sebpuetz)

Benefits:

Does not break the current API.
No conversion is required, but one could still be provided through From and Into.

Disadvantages:

Multiple Token types would be required; some of this could probably be alleviated through traits/type parameters.
Could be confusing to have two methods/representations in parallel.

Graph representations

Cheap hack: prepend root to `Vec<Token>`

A very simple approach could just prepend the artificial root token to the sentence. head()/p_head() could then be used to directly index into the graph.

Advantages:

Simple change
As fast to construct as the current representation.

Disadvantages:

Breaks APIs in an ugly way, because there is existing code that assumes that the indices are off-by-one. If we go this path, the Vec should probably be wrapped in another type to force downstream to update code. Also, Token should probably become an enum that provides a variant for the root.
Cannot benefit from existing functions from a crate such as petgraph.
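A minimal sketch of what the wrapped, root-prepended vector could look like. All names here (`RootedSentence`, `head_of`, the pared-down `Token`) are hypothetical and only illustrate the idea; they are not the crate's types.

```rust
/// Hypothetical, pared-down token with only the fields this sketch needs.
#[derive(Debug, PartialEq)]
struct Token {
    form: String,
    /// Index of the head; 0 refers to the artificial root.
    head: Option<usize>,
}

/// A node is either the artificial root or a regular token.
#[derive(Debug, PartialEq)]
enum Node {
    Root,
    Token(Token),
}

/// Wrapper type, so that downstream code relying on the old
/// off-by-one indexing has to be updated explicitly.
struct RootedSentence(Vec<Node>);

impl RootedSentence {
    /// Prepends the root, so that token i of the input lives at index i + 1.
    fn new(tokens: Vec<Token>) -> Self {
        let mut nodes = Vec::with_capacity(tokens.len() + 1);
        nodes.push(Node::Root);
        nodes.extend(tokens.into_iter().map(Node::Token));
        RootedSentence(nodes)
    }

    /// Returns the head of the node at `idx`; the head index is used
    /// directly as an index into the vector.
    fn head_of(&self, idx: usize) -> Option<&Node> {
        match self.0.get(idx)? {
            Node::Root => None,
            Node::Token(token) => self.0.get(token.head?),
        }
    }
}
```

Because the root sits at index 0, a token whose head is 0 resolves to `Node::Root` without any index arithmetic.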

`petgraph`

This approach would use the petgraph crate and use a type such as DiGraph<Token, DepRel>. Token would then not contain dependency relations anymore. DepRel would be an enum for projective and non-projective relations.

Benefits:

Probably not much slower than the current representation: node and edge insertions are O(1).
The user gets to leverage all the functionality and algorithms of petgraph.

Downsides:

Large API change.
How do we compare two sentences for equality? This is now used a lot in e.g. unit tests. Graph isomorphism is in NP, with no known polynomial-time algorithm. And I think there is no way in petgraph to enforce through the type system that the graph is, e.g., a tree or has ordered vertices with in-degree 1.
The user could turn the dependency tree into a graph that is not a tree. How do we serialize such cases to CoNLL-X again?

Constrained graph

In this approach, we would create our own representation. It would be similar to the first approach (prepending the root to Vec<Token>), except that we would separate the dependencies from the tokens. So the structure would be something like:

struct DependencyGraph {
  tokens: Vec<Node>,
  deprels: Vec<Option<(usize, String)>>,
  p_deprels: Vec<Option<(usize, String)>>,
}

enum Node {
  Root,
  Token {
    // ... the usual fields, but no head/head_rel/p_head/p_head_rel.
  },
}

Benefits:

Nearly the same efficiency as the current representation.
Tokens are separated from dependency relations.
Less confusing than the first approach, whose representation is nearly the same as the current one.

Disadvantages:

Cannot benefit from existing functions from a crate such as petgraph. Though I guess that we could try to implement some of the petgraph traits.
Somewhat large API change.
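A rough sketch of how the constrained representation could behave, under a few assumptions: only the projective relations are modelled, `Node::Token` carries a single hypothetical `form` field, and the methods `new`, `set_head` and `head` are invented for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Root,
    // Hypothetical stand-in for "the usual fields".
    Token { form: String },
}

#[derive(Debug, PartialEq)]
struct DependencyGraph {
    tokens: Vec<Node>,
    // One slot per node: at most one incoming edge, so in-degree <= 1
    // holds by construction.
    deprels: Vec<Option<(usize, String)>>,
}

impl DependencyGraph {
    /// Builds a graph with the root at index 0 and no edges.
    fn new(forms: &[&str]) -> Self {
        let mut tokens = vec![Node::Root];
        tokens.extend(forms.iter().map(|f| Node::Token { form: (*f).to_owned() }));
        let deprels = vec![None; tokens.len()];
        DependencyGraph { tokens, deprels }
    }

    /// Sets the head of `dependent`, replacing any earlier head.
    /// Returns false for out-of-range indices or if `dependent` is the root.
    fn set_head(&mut self, head: usize, dependent: usize, relation: &str) -> bool {
        if dependent == 0 || head >= self.tokens.len() || dependent >= self.tokens.len() {
            return false;
        }
        self.deprels[dependent] = Some((head, relation.to_owned()));
        true
    }

    /// Returns the head index and relation of `dependent`, if set.
    fn head(&self, dependent: usize) -> Option<(usize, &str)> {
        self.deprels
            .get(dependent)?
            .as_ref()
            .map(|(head, rel)| (*head, rel.as_str()))
    }
}
```

Since both fields are ordered vectors, the derived `PartialEq` compares two sentences in linear time, which sidesteps the equality question raised for petgraph. Note that this sketch does not prevent cycles.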

Equality behaviour on Features

The PartialEq impl on Features compares the feature String rather than the key-value pairs, which leads to some odd behaviour.

conllx-rs/src/token.rs

Lines 290 to 294 in 72ccc8e

    
    impl PartialEq for Features {
        fn eq(&self, other: &Features) -> bool {
            self.features.eq(&other.features)
        }
    }

I think that the following tokens should be seen as equal, but they currently compare as unequal:

let token: Token = TokenBuilder::new("a")
                .features(Features::from_string("a|b:c"))
                .into();
let other: Token = TokenBuilder::new("a")
                .features(Features::from_string("b:c|a"))
                .into();
// Passes today, although both tokens carry the same features.
assert_ne!(token, other);
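One possible fix, sketched on a hypothetical stand-in for `Features` (the real type has more to it): compare the parsed key-value maps instead of the raw strings, so that the order of the features no longer matters.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the crate's `Features` type.
#[derive(Debug)]
struct Features {
    features: String,
}

impl Features {
    fn from_string(s: impl Into<String>) -> Self {
        Features { features: s.into() }
    }

    /// Parses the `|`-separated string into a sorted key-value map.
    fn as_map(&self) -> BTreeMap<&str, Option<&str>> {
        self.features
            .split('|')
            .map(|fv| match fv.split_once(':') {
                Some((k, v)) => (k, Some(v)),
                None => (fv, None),
            })
            .collect()
    }
}

impl PartialEq for Features {
    /// Order-insensitive: two feature strings are equal if they
    /// contain the same key-value pairs.
    fn eq(&self, other: &Features) -> bool {
        self.as_map() == other.as_map()
    }
}
```

This parses both sides on every comparison, so it is slower than comparing the strings. If `Hash` is also implemented for the type, it would need to agree with this notion of equality.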

Improve lifetime of string reference returned by DepTriple::relation

The lifetime of the string reference returned by DepTriple::relation is bound to DepTriple. This is ok when DepTriple owns a String. However, if the string is borrowed, a more appropriate lifetime would be the actual lifetime of the string reference.

This would make life easier for users of DepTriple: currently, something like depgraph.head(idx).unwrap().relation() fails to compile, because the returned relation would outlive the temporary triple.
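The idea can be sketched as follows, assuming `DepTriple` is generic over the string type of its relation. The struct below and the helper `relation_of` are hypothetical stand-ins, not the crate's definitions.

```rust
/// Hypothetical stand-in for `DepTriple`, generic over the relation's string type.
struct DepTriple<S> {
    head: usize,
    dependent: usize,
    relation: Option<S>,
}

impl<'a> DepTriple<&'a str> {
    /// Returns the relation with the lifetime of the borrowed string,
    /// not the lifetime of `self`.
    fn relation(&self) -> Option<&'a str> {
        self.relation
    }
}

/// Hypothetical helper: builds a temporary triple and returns only its relation.
fn relation_of<'a>(head: usize, dependent: usize, relation: &'a str) -> Option<&'a str> {
    let triple = DepTriple { head, dependent, relation: Some(relation) };
    // Compiles only because the result borrows from `relation`,
    // not from the local `triple`.
    triple.relation()
}
```

If `relation` returned a reference tied to `&self` instead, `relation_of` would be rejected, because `triple` is dropped at the end of the function. An owning `DepTriple<String>` would still need a separate impl whose return value borrows from `self`.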

Implement FromIterator on Sentence

While writing a conversion method to Sentence I thought it'd be nice to just call

let sentence = tokens.into_iter().collect::<Sentence>();

rather than writing

let mut sentence = Sentence::new();
for token in tokens {
    sentence.push(token);
}
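A sketch of the requested impl, using hypothetical minimal `Token` and `Sentence` types in place of the crate's own; it simply wraps the loop above.

```rust
use std::iter::FromIterator;

/// Hypothetical minimal token, holding only a form.
#[derive(Debug, PartialEq)]
struct Token(String);

/// Hypothetical minimal sentence, wrapping a vector of tokens.
#[derive(Debug, Default, PartialEq)]
struct Sentence(Vec<Token>);

impl Sentence {
    fn new() -> Self {
        Sentence::default()
    }

    fn push(&mut self, token: Token) {
        self.0.push(token);
    }
}

impl FromIterator<Token> for Sentence {
    /// Builds a sentence by pushing every token of the iterator in order.
    fn from_iter<I: IntoIterator<Item = Token>>(iter: I) -> Self {
        let mut sentence = Sentence::new();
        for token in iter {
            sentence.push(token);
        }
        sentence
    }
}
```

With this impl in place, `tokens.into_iter().collect::<Sentence>()` works as in the first snippet.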

Add a method to get a `DiGraph`

This is now possible using the From instance, but it is not always ergonomic.

