
Comments (9)

alexfridlyand commented on June 3, 2024

Added this to the test_dragnn.sh and it works:

import sys
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')

from syntaxnet.

alexfridlyand commented on June 3, 2024

Thank you, I think I'll just use the second Russian treebank, which is much bigger and appears to have proper tags.


dsindex commented on June 3, 2024

@alexfridlyand

hi~

I tried to run the baseline model described at https://github.com/tensorflow/models/tree/master/syntaxnet/g3doc/conll2017

but the inference step fails with UTF-8 warnings and a std::out_of_range exception.

...
2017-04-01 09:57:58.442684: I syntaxnet/embedding_feature_extractor.cc:35] Features: input.focus;input.focus stack.focus stack(1).focus;stack.focus stack(1).focus
2017-04-01 09:57:58.442689: I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: lookahead;tagger;rnn-stack
2017-04-01 09:57:58.442692: I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;64;64
2017-04-01 09:57:58.442810: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
2017-04-01 09:57:58.442830: W util/utf8/unicodetext.cc:260] UTF-8 buffer is not interchange-valid.
libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string
INFO:tensorflow:Read 0 documents
...
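
One way to narrow down a warning like "UTF-8 buffer is not interchange-valid" is to look for lines in the input that do not decode as UTF-8 at all. The sketch below is only an illustration: `find_invalid_utf8_lines` is a hypothetical helper, not part of SyntaxNet, and it checks plain UTF-8 validity, which may be looser than the check behind the warning.

```python
def find_invalid_utf8_lines(data):
    """Return (line_number, raw_line) for every line of `data` (bytes)
    that cannot be decoded as UTF-8."""
    bad = []
    for number, raw_line in enumerate(data.split(b"\n"), start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, raw_line))
    return bad


# Made-up sample: the second line is a truncated two-byte sequence.
sample = "мама\n".encode("utf-8") + b"\xd0\n" + b"ok\n"
print(find_invalid_utf8_lines(sample))
```

For a real corpus, the bytes would come from something like `open(path, "rb").read()`.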

Since I haven't found a way to fix it,
I decided to work around it by dropping the 'char2word' layer when building 'master_spec'.

After that, everything works fine.

https://github.com/dsindex/syntaxnet#dragnn

If you are interested in training on the Russian corpus and testing the result:

  1. download Russian UD corpus from http://universaldependencies.org

  2. compile

$ pwd
/path/to/models/syntaxnet
$ bazel build -c opt //work/dragnn_examples:write_master_spec
$ bazel build -c opt //work/dragnn_examples:train_dragnn
$ bazel build -c opt //work/dragnn_examples:inference_dragnn
  3. train
  • say the UD_Russian directory is at this path
$ pwd
/path/to/work/UD_Russian
  • edit train_dragnn.sh
SRC_CORPUS_DIR=${CDIR}/UD_Russian
TRAIN_FILE=${DATA_DIR}/ru-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/ru-ud-dev.conllu.conv
  • run
$ nohup ./train_dragnn.sh -v -v &
  4. test
  • run
$ cat textfile | ./test_dragnn.sh -v -v

Note again that loading a downloaded model for annotation is not available here yet.

But I think the original code at https://github.com/tensorflow/models/tree/master/syntaxnet/dragnn/tools
may work well (I didn't test it).


alexfridlyand commented on June 3, 2024

Thank you very much for such a detailed response! I will reply shortly if I run into issues. Great stuff.


alexfridlyand commented on June 3, 2024

Got this error at the inference stage (with a model trained on the Russian dataset): UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128).
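
For context, byte 0xd0 is the lead byte of many Cyrillic letters in UTF-8, so this error typically means raw UTF-8 bytes were decoded with the ASCII codec somewhere. A minimal sketch of decoding explicitly, written in Python 3 syntax (the thread's scripts ran under Python 2); `decode_utf8` and `read_utf8_lines` are hypothetical helpers, not SyntaxNet functions:

```python
import io


def decode_utf8(raw):
    """Decode bytes as UTF-8 explicitly instead of relying on a default codec."""
    return raw.decode("utf-8")


def read_utf8_lines(stream):
    """Read text lines from a binary stream, decoding them as UTF-8."""
    wrapper = io.TextIOWrapper(stream, encoding="utf-8")
    return [line.rstrip("\n") for line in wrapper]


raw = "Привет".encode("utf-8")  # the first byte is 0xd0

try:
    raw.decode("ascii")  # reproduces the failure from the error message
except UnicodeDecodeError as err:
    print("ascii failed:", err.reason)

print(decode_utf8(raw))
```

In a script that reads piped text, `read_utf8_lines(sys.stdin.buffer)` would be the Python 3 counterpart of the `cat textfile | ...` usage.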


alexfridlyand commented on June 3, 2024

@dsindex Maybe you could also help with converting the output to the Brat standoff (.ann) format so it can be displayed in Brat?
I commented out ${CONLL2TREE} --alsologtostderr in test_dragnn for this,
but then I need to convert the CoNLL-U format to standoff. I'm trying with this repo: https://github.com/spyysalo/conllu.py

but I'm getting multiple parse issues. Could you advise something?


dsindex commented on June 3, 2024

@alexfridlyand

That is a cool repo!

I'm not sure about the parse issues you mentioned,
but conllu.py seems to do file-based processing in two passes:
one for the text, the other for the annotations. It is tricky..... ;;
I think we'd better save the CoNLL-U files (from test_dragnn.sh) and run conllu.py on them.

$ cat file.txt | ./test_dragnn.sh > file.conllu
$ python conllu.py/convert.py -o outdir file.conllu

If we want to run it in an online manner,
we have to modify conllu.py/convert.py and conllu.py/conllu/conllu.py,
which seems time-consuming.

By the way, I have a question about the brat tool.
As described in the issue I reported (nlplab/brat#1221), I can't annotate relations,
because no dialog appears.

Do you know how to fix it?


alexfridlyand commented on June 3, 2024

I use brat only as a comparison tool. If I figure it out, I'll let you know.

@dsindex With the same commands you wrote, I'm getting
conllu.conllu.FormatError: invalid CPOSTAG: PRP$ (line 4)
on a file with Russian sentences.


dsindex commented on June 3, 2024

@alexfridlyand thank you :)

Hmm.... with the UD_English and Korean corpora, there is no error.
I guess the CPOSTAG is not in the right format:

CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')
...
        # some character set constraints
        if not CPOSTAG_RE.match(self.cpostag):
            raise FormatError('invalid CPOSTAG: %s' % self.cpostag)

Here, self.cpostag is produced by the from_string method:

def from_string(cls, s):
    fields = s.split('\t')
    if len(fields) != 10:
        raise FormatError('got %d/10 field(s)' % len(fields), s)
    fields[5] = [] if fields[5] == '_' else fields[5].split('|')  # feats
    fields[8] = [] if fields[8] == '_' else fields[8].split('|')  # deps
    return cls(*fields)

Since I don't know exactly why such a character is there,
filtering the fields list is the approach I'd take ;;
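
The filtering idea could be sketched roughly as follows, as a pre-processing pass over the CoNLL-U lines before they reach the converter. `sanitize_cpostag` is a hypothetical helper, not part of conllu.py. It assumes CPOSTAG is the fourth tab-separated column (index 3), which matches the feats/deps indices quoted above, and the fallback tag "X" is also an assumption.

```python
import re

CPOSTAG_RE = re.compile(r'^[a-zA-Z]+$')  # same pattern as the check quoted above
CPOSTAG_INDEX = 3  # assumption: CPOSTAG is the fourth CoNLL-U column


def sanitize_cpostag(line, fallback="X"):
    """Strip non-letter characters from the CPOSTAG column of one token line.

    Comment lines, blank lines, and lines without exactly 10 fields are
    returned unchanged so the parser can still report them itself.
    """
    if not line.strip() or line.startswith("#"):
        return line
    fields = line.split("\t")
    if len(fields) != 10:
        return line
    tag = fields[CPOSTAG_INDEX]
    if not CPOSTAG_RE.match(tag):
        cleaned = re.sub(r"[^a-zA-Z]", "", tag)
        # Use the fallback tag when nothing alphabetic is left.
        fields[CPOSTAG_INDEX] = cleaned or fallback
    return "\t".join(fields)


# Made-up token line with the tag from the error message in the CPOSTAG column.
row = "1\tего\tон\tPRP$\tPRP$\t_\t2\tdet\t_\t_"
print(sanitize_cpostag(row).split("\t")[CPOSTAG_INDEX])
```

Stripping characters only makes the tag pass the regular expression. It does not turn it into a valid universal POS tag, so a proper mapping table may be the better fix if tag quality matters.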

hope it helps.

