cc: @sebpuetz, @DiveFish
Since many downstream tasks need a graph representation of the CoNLL-X data, it would great if this crate could provide that. However, there are many different approaches, and we should hash them out a bit more.
Relevant to the discussion of different approaches is how the graph representation relates to Vec<Token>
.
- A conversion of
Vec<Token>
to the graph representation is provided.
Advantages:
- No API changes.
- Simple representation that works well tasks where syntax is irrelevant (tagging, etc.).
Disadvantages:
- Requires conversion.
- Multiple representations in the same crate.
- The token indices and
head()
/p_head()
indices mismatch.
- Needs to own tokens for the graph to be usefully mutable.
- Probably needs its own
Token
type, otherwise head()
/p_head()
and the graph could become inconsistent.
- The graph representation becomes the representation of CoNLL-X sentences.
Advantages:
- A single representation, no conversions.
- The standard representation is the representation that many downstream tasks use.
Disadvantages:
- large API change.
- could be awkward for token-oriented tasks if no direct indexing of tokens is provided.
- will (depending on the implementation) be less efficient to construct.
- Let the user decide the representation and provide
read_to_graph()
and read_to_vec()
on the ReadSentence trait. (@sebpuetz )
Benefits:
- not breaking the current API
- no required conversion but could still be provided through From and Into
Disadvantages:
- multiple Token types would be required, some of this could probably be alleviated through traits/type parameters
- could be confusing to have two methods/representations in parallel
Graph representations
Cheap hack: prepend root to Vec<Token>
A very simple approach could just prepend its sentence with the artificial root token. head()
/p_head()
could then be used to directly index into the graph.
Advantages:
- Simple change
- As fast to construct as the current representation.
Disadvantages:
- Breaks APIs in an ugly way, because there is existing code that assumes that the indices are off-by-one. If we go this path, the
Vec
should probably be wrapped in another type to force downstream to update code. Also, Token
should probably become an enum
that provides a variant for the root.
- Cannot benefit from existing functions from a crate such as
petgraph
.
petgraph
This approach would use the petgraph
crate and use a type such as Digraph<Token, DepRel>
. Token
would then not contain dependency relations anymore. DepRel
would be an enum for projective and non-projective relations.
Benefits:
- Probably not much slower than the current representation --- node and edge insertions are O(1).
- The user gets to leverage all the functionality and algorithms of
petgraph
.
Downsides:
- Large API change.
- How do we compare two sentences for equality? This is now used a lot in e.g. unit tests. Graph isomorpism is NP. And I think there is no way to force in the type system that the graph is a e.g. a tree or has ordered vertices with in-degree 1 in
petgraph
.
- The user could turn the dependency tree into a graph that is not a tree. How do we serialize such cases to CoNLL-X again?
Constrained graph
In this approach, we would create our own representation. It would be similar to approach 1, except that we would separate the dependencies from the tokens. So the structure would be something like:
struct DependencyGraph {
tokens: Vec<Node>,
deprels: Vec<Option<(usize, String)>>,
p_deprels: Vec<Option<(usize, String)>>,
}
enum Node {
Root,
Token {
// ... the usual fields, but no head/head_rel/p_head/p_head_rel.
},
}
Benefits:
- Nearly the same efficiency as the current representation.
- Tokens are separated from dependency relations.
- Not as confusing as approach 1 in having nearly the same representation.
Disadvantages:
- Cannot benefit from existing functions from a crate such as
petgraph
. Though I guess that we could try to implement some of the petgraph
traits.
- Somewhat large API change.