Comments (9)
Adding this to test_dragnn.sh fixed it (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding('utf8')
from syntaxnet.
Thank you, I think I'll just use the second Russian treebank, which is much bigger and appears to have proper tags.
from syntaxnet.
Hi,
I tried to run the baseline model described here (https://github.com/tensorflow/models/tree/master/syntaxnet/g3doc/conll2017),
but the inference step fails with a UTF-8 warning followed by a std::out_of_range exception:
...
2017-04-01 09:57:58.442684: I syntaxnet/embedding_feature_extractor.cc:35] Features: input.focus;input.focus stack.focus stack(1).focus;stack.focus stack(1).focus
2017-04-01 09:57:58.442689: I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: lookahead;tagger;rnn-stack
2017-04-01 09:57:58.442692: I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;64;64
2017-04-01 09:57:58.442810: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
2017-04-01 09:57:58.442830: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string
INFO:tensorflow:Read 0 documents
...
Since I haven't found a way to fix it,
I worked around it by dropping the 'char2word' layer when building 'master_spec'.
After that, everything works fine.
https://github.com/dsindex/syntaxnet#dragnn
If you are interested in training and testing on the Russian corpus:
- download the Russian UD corpus from http://universaldependencies.org
- compile
  $ pwd
  /path/to/models/syntaxnet
  $ bazel build -c opt //work/dragnn_examples:write_master_spec
  $ bazel build -c opt //work/dragnn_examples:train_dragnn
  $ bazel build -c opt //work/dragnn_examples:inference_dragnn
- train
  - assume the UD_Russian directory is at this path
    $ pwd
    /path/to/work/UD_Russian
  - edit train_dragnn.sh
    SRC_CORPUS_DIR=${CDIR}/UD_Russian
    TRAIN_FILE=${DATA_DIR}/ru-ud-train.conllu.conv
    DEV_FILE=${DATA_DIR}/ru-ud-dev.conllu.conv
  - run
    $ nohup ./train_dragnn.sh -v -v &
- test
  - run
    $ cat textfile | ./test_dragnn.sh -v -v
Note again that loading a downloaded model for annotation is not yet supported in my repo.
However, I think the original code at https://github.com/tensorflow/models/tree/master/syntaxnet/dragnn/tools
may work well (I haven't tested it).
from syntaxnet.
Thank you very much for such a detailed response! I will reply shortly if I run into issues. Great stuff.
from syntaxnet.
Got this error at the inference stage (with a model trained on the Russian dataset): UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128).
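For reference, 0xd0 is the first UTF-8 byte of many Cyrillic letters (for example 'П'), so this error usually means UTF-8 input was decoded with the default ASCII codec. Below is a minimal sketch of decoding explicitly instead. It is written for Python 3, and the helper name decode_lines and the sample sentence are illustrative assumptions, not part of test_dragnn.sh.

```python
import io


def decode_lines(byte_stream, encoding="utf-8"):
    """Yield text lines decoded explicitly, instead of relying on the default codec."""
    for raw_line in byte_stream:
        yield raw_line.decode(encoding).rstrip("\r\n")


# Hypothetical sample input: a Russian sentence encoded as UTF-8 bytes.
sample = "Привет, мир\n".encode("utf-8")

# Decoding as ASCII fails on the very first byte (0xd0), as in the error above.
try:
    sample.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as err:
    ascii_error = str(err)

# Decoding as UTF-8 succeeds.
lines = list(decode_lines(io.BytesIO(sample)))
print(ascii_error)
print(lines)
```

Decoding at the point where bytes enter the program avoids changing the interpreter-wide default encoding.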
from syntaxnet.
@dsindex Maybe you could also help with converting the output to the Brat standoff (.ann) format so it can be displayed in Brat?
I commented out ${CONLL2TREE} --alsologtostderr in test_dragnn for this,
but then I need to convert the CoNLL-U format to standoff. I'm trying this repo: https://github.com/spyysalo/conllu.py
but I'm getting multiple parse issues. Could you advise something?
from syntaxnet.
That is a cool repo!
I'm not sure about the multiple parse issues you mentioned,
but conllu.py looks like it does file-based processing in two passes:
one for the text, the other for the annotations. It is tricky.
I think we'd better save the CoNLL-U output (from test_dragnn.sh) to a file and then run conllu.py on it:
$ cat file.txt | ./test_dragnn.sh > file.conllu
$ python conllu.py/convert.py -o outdir file.conllu
If we want to run it in an online manner,
we would have to modify conllu.py/convert.py and conllu.py/conllu/conllu.py,
which seems time-consuming.
By the way, I have a question about the brat tool.
nlplab/brat#1221
As I reported in that issue, I can't annotate relations
because there is no dialog action.
Do you know how to fix it?
from syntaxnet.
I only use brat as a comparison tool; if I figure it out, I'll let you know.
@dsindex With the same commands you wrote, I'm getting
conllu.conllu.FormatError: invalid CPOSTAG: PRP$ (line 4)
on a file with Russian sentences.
from syntaxnet.
@alexfridlyand Thank you :)
Hmm... with the UD_English and Korean corpora, there is no error.
I guess the CPOSTAG value is not in the expected format: PRP$ contains '$', which this pattern rejects.
    CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')
    ...
    # some character set constraints
    if not CPOSTAG_RE.match(self.cpostag):
        raise FormatError('invalid CPOSTAG: %s' % self.cpostag)
Here, self.cpostag is produced by the from_string method:
    def from_string(cls, s):
        fields = s.split('\t')
        if len(fields) != 10:
            raise FormatError('got %d/10 field(s)' % len(fields), s)
        fields[5] = [] if fields[5] == '_' else fields[5].split('|')  # feats
        fields[8] = [] if fields[8] == '_' else fields[8].split('|')  # deps
        return cls(*fields)
Since I don't know exactly why such a character is there,
filtering the fields list is the approach I would take.
Hope it helps.
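One way to locate the offending rows before running the converter is to apply the same pattern to the fourth tab-separated column (CPOSTAG) of each token line. This is only a sketch under assumptions: find_invalid_cpostags is a hypothetical helper, not part of conllu.py, and the sample rows are made up for illustration.

```python
import re

# Same pattern as the one quoted from conllu.py above.
CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')


def find_invalid_cpostags(lines):
    """Return (line_number, tag) pairs whose CPOSTAG column fails CPOSTAG_RE."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Token lines have 10 columns; skip anything else.
        if len(fields) != 10:
            continue
        if not CPOSTAG_RE.match(fields[3]):
            bad.append((number, fields[3]))
    return bad


# Hypothetical sample rows, for illustration only.
sample = [
    "# text = Its fine\n",
    "1\tIts\tits\tPRP$\tPRP$\t_\t2\tnmod\t_\t_\n",
    "2\tfine\tfine\tADJ\tJJ\t_\t0\troot\t_\t_\n",
    "\n",
]
invalid = find_invalid_cpostags(sample)
print(invalid)
```

Once the bad rows are known, they can be inspected to decide whether to fix the tags upstream or filter them before conversion.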
from syntaxnet.