I had a problem with [training parser from Sejong treebank corpus]

./sejong/c2d.sh error

dsindex commented on June 13, 2024

./sejong/c2d.sh error

from syntaxnet.

Comments (5)

dsindex commented on June 13, 2024

@YeopIn

You need to place the constituent parse tree corpus (sejong_treebank.txt.v1) in the sejong directory.

$ ls
align.py  align_r.py  c2d.py  c2d.sh  context.pbtxt_p  env.sh  eval.py  log  sejong_treebank.sample  sejong_treebank.txt.v1  split.py  split.sh  tagged_input.sample  tagger.py  wdir
$ more sejong_treebank.txt.v1
; 1993/06/08 19
(NP	(NP 1993/SN + //SP + 06/SN + //SP + 08/SN)
	(NP 19/SN))

; 엠마누엘 웅가로 /
(NP	(NP	(NP 엠마누엘/NNP)
		(NP 웅가로/NNP))
	(X //SP))

; 의상서 실내 장식품으로…
(NP_AJT	(NP_AJT 의상/NNG + 서/JKB)
	(NP_AJT	(NP 실내/NNG)
		(NP_AJT 장식품/NNG + 으로/JKB + …/SE)))

; 디자인 세계 넓혀
(VP	(NP_OBJ	(NP 디자인/NNG)
		(NP_OBJ 세계/NNG))
	(VP 넓히/VV + 어/EC))
...
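
For readers who want to inspect the corpus before running the scripts, here is a minimal sketch that pulls the morpheme/tag pairs out of the leaf constituents in the sample above. It is not part of c2d.py; the function names `leaves` and `split_morph` are hypothetical, and the leaf layout "(LABEL morph/TAG + morph/TAG)" is assumed from the sample only.

```python
import re

# Assumption: a leaf constituent has no nested parentheses and looks like
# "(LABEL morph/TAG + morph/TAG ...)", as in the sample entries above.
LEAF = re.compile(r"\(([A-Z_]+)\s+([^()]+)\)")


def split_morph(token):
    """Split "morph/TAG" on the last slash, so "//SP" yields ("/", "SP")."""
    morph, _, tag = token.rpartition("/")
    return morph, tag


def leaves(tree_text):
    """Return [(label, [(morph, tag), ...]), ...] for each leaf constituent."""
    result = []
    for label, body in LEAF.findall(tree_text):
        tokens = body.strip().split(" + ")
        result.append((label, [split_morph(t.strip()) for t in tokens]))
    return result
```

This sketch does not handle morphemes that are themselves parentheses, so treat it as an inspection aid rather than a full parser.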

Then run split.sh. It produces the following files:

$ ls wdir
sejong_treebank.txt.v1.test
sejong_treebank.txt.v1.training
sejong_treebank.txt.v1.tuning

Then run c2d.sh.

As you can see, this script generates the .v2 and .v3 files:

for SET in training tuning test; do
    ${python} ${CDIR}/c2d.py --mode=0 < ${WDIR}/sejong_treebank.txt.v1.${SET} > ${WDIR}/sejong_treebank.txt.v2.${SET} 2> ${WDIR}/sejong_treebank.txt.v2.${SET}.err
    ${python} ${CDIR}/c2d.py --mode=1 < ${WDIR}/sejong_treebank.txt.v2.${SET} > ${WDIR}/deptree.txt.v2.${SET}         2> ${WDIR}/deptree.txt.v2.${SET}.err
    [ "${SET}" == "training" ] && extend=1 || extend=0
    ${python} ${CDIR}/align.py --extend=${extend} < ${WDIR}/deptree.txt.v2.${SET} > ${WDIR}/deptree.txt.v3.${SET}
done

If you run into trouble, run a single step by hand like this:

$ python c2d.py --mode=0 < wdir/sejong_treebank.txt.v1.training > wdir/sejong_treebank.txt.v2.training

This should show you where the problem is.
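
Because c2d.sh redirects the stderr of each c2d.py step to a .err file in the working directory, another way to narrow things down is to look for .err files that are not empty. The helper below is a minimal sketch of that check; `nonempty_err_files` is a hypothetical name, and a non-empty file may contain only warnings rather than a real failure.

```python
from pathlib import Path


def nonempty_err_files(wdir):
    """Return the sorted paths of .err files directly under wdir that have content."""
    return sorted(p for p in Path(wdir).glob("*.err") if p.stat().st_size > 0)
```

Calling `nonempty_err_files("wdir")` after a c2d.sh run returns the logs worth opening first.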

YeopIn commented on June 13, 2024

I solved this problem, thank you.

How do I train a Korean POS tagger?
Can I use train_dragnn.sh for Korean POS tagging, with UD_Korean (universal_dependencies-2.0-ud_treebans-v2.0tgz) as the data?
Do I need sejong_treebank.v1? As far as I know, sejong_treebank.v1 is for the Korean parser.

I downloaded UD_Korean version 2.0.

In train_dragnn.sh, I changed SRC_CORPUS_DIR to UD_Korean, TRAIN_FILE to kr-ud-train.conllu, and DEV_FILE to kr-ud-dev.conllu.

But I get an out-of-range error. What should I do?

dsindex commented on June 13, 2024

@YeopIn

Can I use train_dragnn.sh for Korean POS tagging?

-> No, train_dragnn.sh only trains a dependency parser. It is basically the same as train_dragnn_sejong.sh.

With UD_Korean (universal_dependencies-2.0-ud_treebans-v2.0tgz) as the data?
Do I need sejong_treebank.v1? As far as I know, sejong_treebank.v1 is for the Korean parser ...

-> I think you need to check the *.conllu.conv files. convert.py generates the .conv files, and those files are used as the training and tuning corpora:

TRAIN_FILE=${DATA_DIR}/en-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/en-ud-dev.conllu.conv
CHECKPOINT_FILE=${DATA_DIR}/checkpoint.model

function convert_corpus {
    local _corpus_dir=$1
    for corpus in $(ls ${_corpus_dir}/*.conllu); do
        ${python} ${CDIR}/convert.py < ${corpus} > ${corpus}.conv
    done
}

...
--training_corpus_path=${TRAIN_FILE} 
--tune_corpus_path=${DEV_FILE}
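
The thread does not say what caused the out-of-range error. One cheap thing to rule out is malformed rows in the corpus files. The sketch below flags token rows that do not have the 10 tab-separated columns of the CoNLL-U format. `bad_rows` is a hypothetical helper, and whether the .conv files produced by convert.py keep the same 10-column layout is an assumption that should be checked against convert.py.

```python
def bad_rows(lines, expected_columns=10):
    """Yield (line_number, column_count) for token rows with an unexpected column count.

    Blank lines (sentence separators) and lines starting with "#" (comments)
    are skipped, following the CoNLL-U conventions.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != expected_columns:
            yield number, len(columns)
```

To use it, open a corpus file with `encoding="utf-8"` and pass the file object as `lines`. An empty result only means the column counts look right, not that the file is otherwise valid.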

YeopIn commented on June 13, 2024

Thank you so much.
My final goal is to train both a Korean tagger and a parser on the Sejong corpus data. Is there a way to do that?

dsindex commented on June 13, 2024

There was a similar discussion before:
#4 (comment)

However, I couldn't find a proper way to train a Korean POS tagger.
I think it would be worth either using another Korean POS tagger (KoNLPy) or implementing a character-based POS tagger for Korean and then reconstructing the morphemes from the inflected forms.
For example:

tagging : '하늘을 나는 새를 본다' -> '하/b-ncn 늘/i-ncn 을/b-jks 나/b-vv 는/b-etm 새/b-ncn 를/b-jko 본/b-vv 다/b-ec'
reconstruct : '하늘/ncn 을/jks 날/vv 는/etm 새/ncn 를/jko 보/vv ㄴ다/ec'

Of course, you need some extra resources to convert '본/b-vv 다/b-ec' into '보/vv ㄴ다/ec'.
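
The character-level b-/i- scheme in the example can be sketched as follows. This is an illustration only, with hypothetical function names, and it assumes the text is in precomposed (NFC) Hangul so that each syllable is one character. It merges characters back into surface morphs; it does not restore base forms such as '보/vv ㄴ다/ec', which needs the extra resources mentioned above.

```python
def to_char_tags(morphs):
    """Expand (morph, tag) pairs into per-character (char, "b-tag" / "i-tag") pairs."""
    tagged = []
    for morph, tag in morphs:
        for i, ch in enumerate(morph):
            tagged.append((ch, ("b-" if i == 0 else "i-") + tag))
    return tagged


def from_char_tags(tagged):
    """Merge per-character tags back into surface (morph, tag) pairs.

    An "i-" tag extends the previous morph when the tags agree;
    anything else starts a new morph.
    """
    morphs = []
    for ch, bi_tag in tagged:
        prefix, _, tag = bi_tag.partition("-")
        if prefix == "i" and morphs and morphs[-1][1] == tag:
            morphs[-1] = (morphs[-1][0] + ch, tag)
        else:
            morphs.append((ch, tag))
    return morphs
```

Applied to the tagging output above, `from_char_tags` gives '하늘/ncn 을/jks 나/vv 는/etm 새/ncn 를/jko 본/vv 다/ec', which is the surface segmentation that the reconstruction step would then have to normalize.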
