Comments (5)
- You need to place a constituent parse tree corpus (sejong_treebank.txt.v1) in the sejong directory.
$ ls
align.py align_r.py c2d.py c2d.sh context.pbtxt_p env.sh eval.py log sejong_treebank.sample sejong_treebank.txt.v1 split.py split.sh tagged_input.sample tagger.py wdir
$ more sejong_treebank.txt.v1
; 1993/06/08 19
(NP (NP 1993/SN + //SP + 06/SN + //SP + 08/SN)
(NP 19/SN))
; 엠마누엘 웅가로 /
(NP (NP (NP 엠마누엘/NNP)
(NP 웅가로/NNP))
(X //SP))
; 의상서 실내 장식품으로…
(NP_AJT (NP_AJT 의상/NNG + 서/JKB)
(NP_AJT (NP 실내/NNG)
(NP_AJT 장식품/NNG + 으로/JKB + …/SE)))
; 디자인 세계 넓혀
(VP (NP_OBJ (NP 디자인/NNG)
(NP_OBJ 세계/NNG))
(VP 넓히/VV + 어/EC))
...
- Run split.sh, and you will have:
$ ls wdir
sejong_treebank.txt.v1.test
sejong_treebank.txt.v1.training
sejong_treebank.txt.v1.tuning
- Run 'c2d.sh'.
- As you can see, this script generates the .v2 and .v3 files:
for SET in training tuning test; do
${python} ${CDIR}/c2d.py --mode=0 < ${WDIR}/sejong_treebank.txt.v1.${SET} > ${WDIR}/sejong_treebank.txt.v2.${SET} 2> ${WDIR}/sejong_treebank.txt.v2.${SET}.err
${python} ${CDIR}/c2d.py --mode=1 < ${WDIR}/sejong_treebank.txt.v2.${SET} > ${WDIR}/deptree.txt.v2.${SET} 2> ${WDIR}/deptree.txt.v2.${SET}.err
[ "${SET}" == "training" ] && extend=1 || extend=0
${python} ${CDIR}/align.py --extend=${extend} < ${WDIR}/deptree.txt.v2.${SET} > ${WDIR}/deptree.txt.v3.${SET}
done
- If you run into trouble, test a single step like this:
$ python c2d.py --mode=0 < wdir/sejong_treebank.txt.v1.training > wdir/sejong_treebank.txt.v2.training
- You should then be able to see which points are causing the problem.
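Before debugging the conversion itself, it can help to confirm that the corpus file has the shape shown in the sample above. The following is a minimal sketch and not part of the repository's scripts. It assumes each record is a line starting with ';' (the raw sentence) followed by one or more bracketed tree lines, and that leaves look like morph/TAG. The names `read_records` and `split_leaf` are hypothetical.

```python
import io


def read_records(lines):
    """Group a treebank stream into (sentence, tree_text) pairs.

    Assumption: a line starting with ';' opens a record and the
    following non-empty lines hold its bracketed tree.
    """
    records = []
    sentence, tree_lines = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            if sentence is not None:
                records.append((sentence, " ".join(tree_lines)))
            sentence, tree_lines = line[1:].strip(), []
        elif sentence is not None:
            tree_lines.append(line)
    if sentence is not None:
        records.append((sentence, " ".join(tree_lines)))
    return records


def split_leaf(token):
    """Split 'morph/TAG' at the last slash, so '//SP' gives ('/', 'SP')."""
    morph, tag = token.rsplit("/", 1)
    return morph, tag


sample = io.StringIO(
    "; 디자인 세계 넓혀\n"
    "(VP (NP_OBJ (NP 디자인/NNG)\n"
    "(NP_OBJ 세계/NNG))\n"
    "(VP 넓히/VV + 어/EC))\n"
)
for sentence, tree in read_records(sample):
    # Records with an empty tree are candidates for manual inspection.
    print(sentence, "->", tree if tree else "<no tree found>")
```

To run it on the real corpus, pass an open file object (for example one opened with encoding="utf-8", if that is how your copy is encoded) instead of the in-memory sample.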
from syntaxnet.
I solved this problem, thank you.
How do I train Korean POS tagging?
Is it correct to use train_dragnn.sh for Korean POS tagging, with UD_Korean data (universal_dependencies-2.0-ud_treebans-v2.0tgz)?
Do I need sejong_treebank.v1? I thought sejong_treebank.v1 was for the Korean parser.
I downloaded UD_Korean version 2.0
and set SRC_CORPUS_DIR = UD_Korean, TRAIN_FILE = kr-ud-train.conllu, and DEV_FILE = kr-ud-dev.conllu in train_dragnn.sh,
but I get an out-of-range error. What should I do?
from syntaxnet.
- Is that true for Korean pos tagging using train_dragnn.sh?
-> No, train_dragnn.sh is for training the dependency parser only. It is basically the same as train_dragnn_sejong.sh.
- data using UD_Korean(universal_dependencies-2.0-ud_treebans-v2.0tgz)?
Is it need sejong_treebank.v1? I knew sejong_treebank.v1 is for Korean parser ...
-> I think you need to check the *.conllu.conv files. 'convert.py' generates the '.conv' files, and those files are used as the training/tuning corpus:
TRAIN_FILE=${DATA_DIR}/en-ud-train.conllu.conv
DEV_FILE=${DATA_DIR}/en-ud-dev.conllu.conv
CHECKPOINT_FILE=${DATA_DIR}/checkpoint.model
function convert_corpus {
local _corpus_dir=$1
for corpus in $(ls ${_corpus_dir}/*.conllu); do
${python} ${CDIR}/convert.py < ${corpus} > ${corpus}.conv
done
}
...
--training_corpus_path=${TRAIN_FILE}
--tune_corpus_path=${DEV_FILE}
from syntaxnet.
Thank you so much.
My final goal is to train both a Korean tagger and a parser with the Sejong corpus data. Is there a way to do this?
from syntaxnet.
There was a similar discussion before:
#4 (comment)
but I couldn't find a proper way to train a Korean POS tagger.
I think it would be worthwhile either to use another Korean POS tagger (KoNLPy) or to implement a character-based POS tagger for Korean and reconstruct the morphemes from the inflected forms.
For example,
tagging : '하늘을 나는 새를 본다' -> '하/b-ncn 늘/i-ncn 을/b-jks 나/b-vv 는/b-etm 새/b-ncn 를/b-jko 본/b-vv 다/b-ec'
reconstruct : '하늘/ncn 을/jks 날/vv 는/etm 새/ncn 를/jko 보/vv ㄴ다/ec'
Of course, you need some extra resources for converting '본/b-vv 다/b-ec' -> '보/vv ㄴ다/ec'.
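The tagging side of this idea can be sketched as follows. This is only an illustration of how the character-level labels in the example could be produced from tagged surface chunks. It is not an existing tool, the function name `to_char_tags` is hypothetical, and the reconstruction step (for example '본' + '다' back to '보' + 'ㄴ다') is not covered because it needs the extra resources mentioned above.

```python
def to_char_tags(segments):
    """Turn (surface_chunk, tag) pairs into 'char/b-tag' and 'char/i-tag' labels.

    The first character of each chunk gets the 'b-' prefix and the
    remaining characters get 'i-', as in the example above.
    """
    labels = []
    for chunk, tag in segments:
        for i, ch in enumerate(chunk):
            prefix = "b" if i == 0 else "i"
            labels.append(f"{ch}/{prefix}-{tag}")
    return labels


# Surface chunks for '하늘을 나는 새를 본다', tagged as in the example above.
segments = [
    ("하늘", "ncn"), ("을", "jks"), ("나", "vv"), ("는", "etm"),
    ("새", "ncn"), ("를", "jko"), ("본", "vv"), ("다", "ec"),
]
print(" ".join(to_char_tags(segments)))
# prints: 하/b-ncn 늘/i-ncn 을/b-jks 나/b-vv 는/b-etm 새/b-ncn 를/b-jko 본/b-vv 다/b-ec
```

Labels in this form could serve as training targets for a character-level sequence tagger. The sketch assumes the text uses precomposed Hangul syllables, so that each syllable is one Python character.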
from syntaxnet.