
conllx-rs's Introduction

Hi! 👋 I am Daniël.

  • 🔭 I’m currently working on:
    • Machine learning models for natural language processing.
    • Contributing to various parts of nixpkgs.
  • 🔨 I currently use: [CP]ython 🐍, Rust 🦀, macOS, Fedora, NixOS ❄, and Torch 🔥.
  • 🌱 I’m currently working at Hugging Face 🤗.
  • 😄 Pronouns: he/him/his
  • 📫 How to reach me: Send me an e-mail!

conllx-rs's People

Contributors

danieldk, sebpuetz

Forkers

sebpuetz twuebi

conllx-rs's Issues

Handling colons in features

If a feature contains more than one colon, everything from the second colon onward is discarded:

let feature_map = Features::from("key:val").as_map();
assert_eq!(feature_map.get("key").unwrap().as_deref(), Some("val"));
let feature_map = Features::from("key:val:something_else").as_map();
// ":something_else" is silently dropped.
assert_eq!(feature_map.get("key").unwrap().as_deref(), Some("val"));

In both cases, the same result gets returned, coming from lines 255-261 in conllx::token:

        for fv in self.features.split('|') {
            let mut iter = fv.split(':');
            if let Some(k) = iter.next() {
                let v = iter.next().map(|s| s.to_owned());
                features.insert(k.to_owned(), v.to_owned());
            }
        }

A simple workaround could be to use find(':') to locate the index of the first colon in the key-value pair and split the string there.

Something like (untested):

            if let Some(idx) = fv.find(':') {
                // Split at the first colon only; later colons stay in the value.
                let k = fv[..idx].to_owned();
                let v = fv[idx + 1..].to_owned();
                features.insert(k, Some(v));
            } else {
                // No colon: a feature without a value.
                features.insert(fv.to_owned(), None);
            }
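
As a self-contained illustration of the same idea, here is a minimal sketch that parses a whole feature string. The function name parse_features is hypothetical and not part of conllx; the map type is assumed from the snippet quoted from conllx::token above. It uses str::split_once from the standard library instead of find, which splits at the first colon only:

```rust
use std::collections::BTreeMap;

/// Parse a `|`-separated feature string into a key -> optional value map.
/// Only the first colon in each pair separates key from value, so any
/// further colons remain part of the value.
/// (Hypothetical helper, not part of the conllx API.)
fn parse_features(features: &str) -> BTreeMap<String, Option<String>> {
    let mut map = BTreeMap::new();
    for fv in features.split('|') {
        match fv.split_once(':') {
            // "key:val:rest" yields ("key", "val:rest").
            Some((k, v)) => map.insert(k.to_owned(), Some(v.to_owned())),
            // No colon: the feature has no value.
            None => map.insert(fv.to_owned(), None),
        };
    }
    map
}
```

With this sketch, "key:val:something_else" maps key to "val:something_else" rather than to "val".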

Graph representations

cc: @sebpuetz, @DiveFish

Since many downstream tasks need a graph representation of the CoNLL-X data, it would be great if this crate could provide that. However, there are many different approaches, and we should hash them out a bit more.

Relevant to the discussion of different approaches is how the graph representation relates to Vec<Token>.

  1. A conversion of Vec<Token> to the graph representation is provided.

Advantages:

  • No API changes.
  • Simple representation that works well for tasks where syntax is irrelevant (tagging, etc.).

Disadvantages:

  • Requires conversion.
  • Multiple representations in the same crate.
  • The token indices and head()/p_head() indices mismatch.
  • Needs to own tokens for the graph to be usefully mutable.
  • Probably needs its own Token type, otherwise head()/p_head() and the graph could become inconsistent.
  2. The graph representation becomes the representation of CoNLL-X sentences.

Advantages:

  • A single representation, no conversions.
  • The standard representation is the representation that many downstream tasks use.

Disadvantages:

  • large API change.
  • could be awkward for token-oriented tasks if no direct indexing of tokens is provided.
  • will (depending on the implementation) be less efficient to construct.
  3. Let the user decide the representation and provide read_to_graph() and read_to_vec() on the ReadSentence trait. (@sebpuetz)

Benefits:

  • not breaking the current API
  • no required conversion but could still be provided through From and Into

Disadvantages:

  • multiple Token types would be required, some of this could probably be alleviated through traits/type parameters
  • could be confusing to have two methods/representations in parallel

Graph representations

Cheap hack: prepend root to Vec<Token>

A very simple approach could just prepend the artificial root token to the sentence. head()/p_head() could then be used to directly index into the graph.

Advantages:

  • Simple change
  • As fast to construct as the current representation.

Disadvantages:

  • Breaks APIs in an ugly way, because there is existing code that assumes that the indices are off-by-one. If we go this path, the Vec should probably be wrapped in another type to force downstream to update code. Also, Token should probably become an enum that provides a variant for the root.
  • Cannot benefit from existing functions from a crate such as petgraph.
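
To make the indexing idea concrete, here is a minimal sketch of the wrapped Vec with a root variant. All names (Node, RootedSentence, head_of) and the reduced token fields are hypothetical and only serve the illustration; they are not the crate's types:

```rust
/// Hypothetical node type: the artificial root or a regular token.
#[derive(Debug, PartialEq)]
enum Node {
    Root,
    Token { form: String, head: Option<usize> },
}

/// Hypothetical wrapper that always keeps the root at index 0, so that
/// head indices can be used to index the sentence directly.
struct RootedSentence(Vec<Node>);

impl RootedSentence {
    fn new() -> Self {
        RootedSentence(vec![Node::Root])
    }

    fn push(&mut self, form: &str, head: Option<usize>) {
        self.0.push(Node::Token { form: form.to_owned(), head });
    }

    /// Look up the head node of the node at `idx`, if it has one.
    fn head_of(&self, idx: usize) -> Option<&Node> {
        match self.0.get(idx)? {
            Node::Token { head: Some(h), .. } => self.0.get(*h),
            _ => None,
        }
    }
}
```

Because the root occupies index 0, a head value of 0 resolves to Node::Root without any off-by-one adjustment.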

petgraph

This approach would use the petgraph crate and use a type such as DiGraph<Token, DepRel>. Token would then not contain dependency relations anymore. DepRel would be an enum for projective and non-projective relations.

Benefits:

  • Probably not much slower than the current representation --- node and edge insertions are O(1).
  • The user gets to leverage all the functionality and algorithms of petgraph.

Downsides:

  • Large API change.
  • How do we compare two sentences for equality? This is now used a lot, e.g. in unit tests. Graph isomorphism is in NP, and no polynomial-time algorithm is known for the general problem. I also think there is no way in petgraph to enforce through the type system that the graph is, e.g., a tree or has ordered vertices with in-degree 1.
  • The user could turn the dependency tree into a graph that is not a tree. How do we serialize such cases to CoNLL-X again?

Constrained graph

In this approach, we would create our own representation. It would be similar to approach 1, except that we would separate the dependencies from the tokens. So the structure would be something like:

struct DependencyGraph {
  tokens: Vec<Node>,
  deprels: Vec<Option<(usize, String)>>,
  p_deprels: Vec<Option<(usize, String)>>,
}

enum Node {
  Root,
  Token {
    // ... the usual fields, but no head/head_rel/p_head/p_head_rel.
  },
}

Benefits:

  • Nearly the same efficiency as the current representation.
  • Tokens are separated from dependency relations.
  • Not as confusing as approach 1 in having nearly the same representation.

Disadvantages:

  • Cannot benefit from existing functions from a crate such as petgraph. Though I guess that we could try to implement some of the petgraph traits.
  • Somewhat large API change.
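
A minimal sketch of how such a constrained graph could behave, assuming a reduced Node with only a form field and leaving out p_deprels for brevity. The method names (push_token, set_head, head) are hypothetical:

```rust
/// Hypothetical node type with the usual token fields reduced to `form`.
#[derive(Debug, PartialEq)]
enum Node {
    Root,
    Token { form: String },
}

/// Tokens and dependency relations are stored separately; index 0 is the root.
struct DependencyGraph {
    tokens: Vec<Node>,
    deprels: Vec<Option<(usize, String)>>,
}

impl DependencyGraph {
    fn new() -> Self {
        DependencyGraph { tokens: vec![Node::Root], deprels: vec![None] }
    }

    /// Append a token and return its index.
    fn push_token(&mut self, form: &str) -> usize {
        self.tokens.push(Node::Token { form: form.to_owned() });
        self.deprels.push(None);
        self.tokens.len() - 1
    }

    /// Attach `dependent` to `head`. Returns false when an index is out of
    /// bounds or when the root would become a dependent.
    fn set_head(&mut self, dependent: usize, head: usize, relation: &str) -> bool {
        if dependent == 0 || dependent >= self.tokens.len() || head >= self.tokens.len() {
            return false;
        }
        self.deprels[dependent] = Some((head, relation.to_owned()));
        true
    }

    /// Head index and relation of `dependent`, if attached.
    fn head(&self, dependent: usize) -> Option<(usize, &str)> {
        self.deprels
            .get(dependent)?
            .as_ref()
            .map(|(head, relation)| (*head, relation.as_str()))
    }
}
```

Storing one optional head per token means every node has in-degree of at most 1 by construction. This sketch does not detect cycles, so it does not yet guarantee a tree.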

Equality behaviour on Features

The PartialEq impl on Features compares the feature String rather than the key-value pairs, which leads to some odd behaviour.

conllx-rs/src/token.rs

Lines 290 to 294 in 72ccc8e

impl PartialEq for Features {
    fn eq(&self, other: &Features) -> bool {
        self.features.eq(&other.features)
    }
}

I think that the following tokens should be seen as equal, but they currently compare as unequal:

let token: Token = TokenBuilder::new("a")
                .features(Features::from_string("a|b:c"))
                .into();
let other: Token = TokenBuilder::new("a")
                .features(Features::from_string("b:c|a"))
                .into();
// Passes today, because the raw feature strings differ in order.
assert_ne!(token, other);
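
One possible fix is to compare the parsed key-value pairs instead of the raw string. The sketch below uses a stand-in Features struct with a borrowed as_map; both are assumptions made for this example and differ from the crate's actual types:

```rust
use std::collections::BTreeMap;

/// Stand-in for the crate's Features type (assumption for this sketch).
#[derive(Debug)]
struct Features {
    features: String,
}

impl Features {
    /// Borrowed key -> optional value view of the feature string.
    fn as_map(&self) -> BTreeMap<&str, Option<&str>> {
        self.features
            .split('|')
            .map(|fv| match fv.split_once(':') {
                Some((k, v)) => (k, Some(v)),
                None => (fv, None),
            })
            .collect()
    }
}

impl PartialEq for Features {
    /// Two feature sets are equal when they hold the same key-value
    /// pairs, regardless of the order in the underlying string.
    fn eq(&self, other: &Features) -> bool {
        self.as_map() == other.as_map()
    }
}
```

This parses both strings on every comparison, so it trades some speed for order-insensitive equality.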

Improve lifetime of string reference returned by DepTriple::relation

The lifetime of the string reference returned by DepTriple::relation is bound to DepTriple. This is ok when DepTriple owns a String. However, if the string is borrowed, a more appropriate lifetime would be the actual lifetime of the string reference.

This would make life easier for users of DepTriple, because currently something like depgraph.head(idx).unwrap().relation() makes the compiler complain that the relation outlives the triple.
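
A minimal sketch of the idea, assuming DepTriple is generic over its string storage. The struct layout and the helper relation_of are hypothetical and may differ from the crate's definitions:

```rust
/// Hypothetical stand-in for DepTriple, generic over string storage.
struct DepTriple<S> {
    head: usize,
    dependent: usize,
    relation: Option<S>,
}

impl DepTriple<String> {
    /// Owned string: the reference can only live as long as the triple.
    fn relation(&self) -> Option<&str> {
        self.relation.as_deref()
    }
}

impl<'a> DepTriple<&'a str> {
    /// Borrowed string: return the original lifetime 'a, which is
    /// independent of how long the triple itself lives.
    fn relation(&self) -> Option<&'a str> {
        self.relation
    }
}

/// Builds a temporary triple and returns its relation. This compiles only
/// because the returned reference is tied to `relations`, not to `triple`.
fn relation_of<'a>(relations: &'a [String], idx: usize) -> Option<&'a str> {
    let triple = DepTriple {
        head: 0,
        dependent: idx,
        relation: relations.get(idx).map(String::as_str),
    };
    triple.relation()
}
```

With the borrowed variant, the relation stays usable after the temporary triple has been dropped.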

Implement FromIterator on Sentence

While writing a conversion method to Sentence I thought it'd be nice to just call

let sentence = tokens.into_iter().collect::<Sentence>();

rather than writing

let mut sentence = Sentence::new();
for token in tokens {
    sentence.push(token);
}
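
A minimal sketch of what such an impl could look like. Token and Sentence here are simplified stand-ins, not the crate's types, which carry more fields:

```rust
/// Simplified stand-in for the crate's Token (assumption for this sketch).
#[derive(Debug, PartialEq)]
struct Token {
    form: String,
}

/// Simplified stand-in for the crate's Sentence.
#[derive(Debug, Default, PartialEq)]
struct Sentence {
    tokens: Vec<Token>,
}

impl Sentence {
    fn new() -> Self {
        Sentence::default()
    }

    fn push(&mut self, token: Token) {
        self.tokens.push(token);
    }
}

impl std::iter::FromIterator<Token> for Sentence {
    /// Build a sentence by pushing every token from the iterator.
    fn from_iter<I: IntoIterator<Item = Token>>(iter: I) -> Self {
        let mut sentence = Sentence::new();
        for token in iter {
            sentence.push(token);
        }
        sentence
    }
}
```

With this in place, tokens.into_iter().collect::<Sentence>() works on any iterator of Token values.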
