
FASPell's Issues

Question about manually drawing the curve

What exactly does "manually drawing" refer to here? After visually identifying a suitable curve, how is it turned into a formula that can be fed to the program?
Why not just train a binary classifier directly?

What are the fine-tuning samples for, and how are they used?

```
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt
```

After the steps above I obtain the fine-tuning samples tf_examples.tfrecord. I would like to know what these fine-tuning samples are for and how to use them.
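
For context, the tf_examples.tfrecord produced by these commands is what run_pretraining.py consumes through its --input_file flag (see the full command quoted in a later issue on this page). Below is a minimal, hypothetical sketch for peeking at its contents, assuming the TensorFlow 1.x that this repo targets; the exact feature names depend on what create_tf_record.py writes.

```python
import tensorflow as tf  # TF 1.x, as used by bert_modified

# Print the feature keys of the first serialized example in the
# fine-tuning tfrecord (tf_record_iterator is deprecated but works in TF 1.x).
for record in tf.python_io.tf_record_iterator("tf_examples.tfrecord"):
    example = tf.train.Example.FromString(record)
    print(sorted(example.features.feature.keys()))
    break
```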

FileNotFoundError: [Errno 2] No such file or directory: '/path/to/training/data/file'

Quoting the "Fine-tuning training" section of the README:
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

When running $ python create_data.py -f /path/to/training/data/file, I get the error FileNotFoundError: [Errno 2] No such file or directory: '/path/to/training/data/file'

Data alignment question

Hello, can this training method be used when the original sentence and the corrected sentence are not aligned? For example, the original sentence is “有着举足轻的作用” and the corrected sentence is “有着举足轻重的作用”.

Question about the glyph representation of Chinese characters

How can I obtain, for example, the IDS of 人 --> ⿰丿㇏?
In cjkvi-ids/ids.txt there is only the entry U+4EBA 人 人. How can it be decomposed further? Is there an additional dictionary? Thanks.

Running run_pretraining.py fails

I ran these commands to produce tf_examples.tfrecord:

$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

and then fed it to run_pretraining.py:

python3 run_pretraining.py \
      --input_file=/tmp/tf_examples.tfrecord \
      --output_dir=/tmp/pretraining_output \
      --do_train=True \
      --do_eval=True \
      --bert_config_file=./tmp/model/bert_config.json \
      --init_checkpoint=./tmp/model/bert_model.ckpt \
      --train_batch_size=32 \
      --max_seq_length=128 \
      --max_predictions_per_seq=20 \
      --num_train_steps=20 \
      --num_warmup_steps=10 \
      --learning_rate=2e-5

But I got this error:

WARNING: Logging before flag parsing goes to stderr.
W1112 13:50:39.595919 4557219264 deprecation_wrapper.py:119] From ../bert_modified/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W1112 13:50:39.596668 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:496: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W1112 13:50:39.597512 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:410: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W1112 13:50:39.597633 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:410: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W1112 13:50:39.597735 4557219264 deprecation_wrapper.py:119] From ../bert_modified/modeling.py:92: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

W1112 13:50:39.598400 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:417: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

W1112 13:50:39.598561 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:421: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W1112 13:50:39.599843 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:423: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

I1112 13:50:39.599960 4557219264 run_pretraining.py:423] *** Input Files ***
W1112 13:50:40.611162 4557219264 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W1112 13:50:40.611783 4557219264 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x1389af2f0>) includes params argument, but params are not passed to Estimator.
I1112 13:50:40.612485 4557219264 estimator.py:209] Using config: {'_model_dir': '/tmp/pretraining_output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x13ca037b8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2), '_cluster': None}
I1112 13:50:40.612823 4557219264 tpu_context.py:209] _TPUContext: eval_on_tpu True
W1112 13:50:40.612979 4557219264 tpu_context.py:211] eval_on_tpu ignored because use_tpu is False.
I1112 13:50:40.613079 4557219264 run_pretraining.py:462] ***** Running training *****
I1112 13:50:40.613152 4557219264 run_pretraining.py:463]   Batch size = 32
W1112 13:50:40.623688 4557219264 deprecation.py:323] From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1112 13:50:40.632311 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:340: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W1112 13:50:40.637573 4557219264 deprecation.py:323] From run_pretraining.py:371: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W1112 13:50:40.637722 4557219264 deprecation.py:323] From /usr/local/lib/python3.7/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
E1112 13:50:40.639678 4557219264 error_handling.py:70] Error recorded from training_loop: Tensor conversion requested dtype string for Tensor with dtype float32: 'Tensor("args_0:0", shape=(), dtype=float32)'
I1112 13:50:40.639847 4557219264 error_handling.py:96] training_loop marked as finished
W1112 13:50:40.639980 4557219264 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
  File "run_pretraining.py", line 496, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining.py", line 469, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_default
    input_fn, ModeKeys.TRAIN))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1022, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2828, in _call_input_fn
    return input_fn(**kwargs)
  File "run_pretraining.py", line 371, in input_fn
    cycle_length=cycle_length))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1853, in apply
    return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1290, in apply
    dataset = transformation_func(self)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/experimental/ops/interleave_ops.py", line 93, in _apply_fn
    buffer_output_elements, prefetch_input_elements)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 224, in __init__
    map_func, self._transformation_name(), dataset=input_dataset)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2555, in __init__
    self._function = wrapper_fn._get_concrete_function_internal()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1355, in _get_concrete_function_internal
    *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1349, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1652, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1545, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 715, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2549, in wrapper_fn
    ret = _wrapper_helper(*args)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2489, in _wrapper_helper
    ret = func(*nested_args)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 57, in _create_or_validate_filenames_dataset
    filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1087, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1145, in convert_to_tensor_v2
    as_ref=False)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1224, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1018, in _TensorTensorConversionFunction
    (dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: 'Tensor("args_0:0", shape=(), dtype=float32)'

Could you share the char_meta.txt file?

Thanks to the author for open-sourcing this!
Following issues #13 and #5, I have spent two days and still cannot produce char_meta.txt. Has anyone actually managed to generate it from the replies in those issues?
Could the author share this file? Thanks.

extension() bug

As described in the extension() method:

"""this function is to resolve the bug that when two adjacent full-width numbers/letters are fed to mlm, the output will be merged as one output, thus lead to wrong alignments."""

But this leads to another bug: when I test the sentence "本是几经济报道",
BERT masks 几 and outputs "21"; the extension method then splits this into 2 and 1.
When execution reaches line 292 of faspell.py: char = sentences[i][j - 1]

an error occurs: list index out of range

Question about training the LM

I don't quite understand why typos are substituted in when training the LM. It feels like simply predicting the masked positions would already yield character-level semantic embeddings. When typos are substituted, wouldn't they inject erroneous information and cause interference during self-attention? I'm a bit confused and would appreciate an explanation.

char_meta.txt question

Hello, thanks for open-sourcing your work, it's excellent. I have a question: does the char_meta.txt file need to be prepared by ourselves in advance? How should it be prepared? The data links you provided don't seem to include it.

NSP

I'd like to understand the NSP objective in the pre-training here. Looking at the code, a_tokens and b_tokens never seem to have any chance of coming from the same sentence; sentences are simply picked at random and concatenated as AAAABBBB. So what purpose does NSP serve here? After all, the BERT paper describes it as:

Specifically, when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).

Could the authors point me in the right direction? Thanks. @eugene-yh @jwu26

@eugene-yh The strokes obtained by decomposing '不' with cjklib's getStrokeOrder cannot be found in ids.txt?

import cjklib
from cjklib import characterlookup
cjk = characterlookup.CharacterLookup('T')

...
decomp = cjk.getStrokeOrder(ss[1].decode('utf-8'))

The result is as follows (here ss[1] is the character '不'; on my machine Python 2.7 cannot display Chinese characters, so everything appears as hex escapes):
ss[1].decode('utf-8')
u'\u4e0d'
decomp
[u'\u31d0', u'\u31d2', u'\u31d1', u'\u31d4']

The final result is:
U+4E0D 不 ㇐㇒㇑㇔

The stroke characters ㇐㇒㇑㇔ have no corresponding entries in ids.txt; how should this be handled?

The operation in tackle_n_gram_bias

Hello, what is the purpose of the tackle_n_gram_bias strategy? Why does it keep only the candidate with the highest confidence and put all the others into error_delete_positions? Does it mean that as long as any position's result points there, no correction is made?

Sharing a self-generated feature file: char_meta.txt

char_meta.txt

It is about 10 MB in size; I cannot guarantee it is completely correct.

A few lines of preview:

U+4E07	万	wan4,mo4;maan6,mak6;MAN,MWUK;MAN,BAN;vạn	⿱一⿰丿𠃌
U+4E08	丈	zhang4;zoeng6;CANG;JOU,CHOU;trượng	⿻一⿻㇇乀
U+4E09	三	san1;saam1,saam3;SAM;SAN;tam	⿱一⿱一一
U+4E0A	上	shang4,shang3;soeng5,soeng6;SANG;JOU,SHOU;thượng	⿱⿰丨一一
U+4E0B	下	xia4;haa5,haa6;HA;KA,GE;hạ	⿱一⿻丨丶
U+4E0C	丌	qi2,ji1;gei1;KI;KI,GI;null	⿱一⿰丿丨
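
These preview lines are tab-separated: codepoint, character, pronunciations, stroke-level IDS, with the pronunciations joined by semicolons. A small hypothetical parser, assuming the language order visible above (Mandarin; Cantonese; Korean; Japanese on'yomi; Vietnamese) and that missing readings appear as "null":

```python
def parse_char_meta_line(line):
    """Parse one line of the shared char_meta.txt (format inferred from the preview)."""
    codepoint, char, pronunciations, ids = line.rstrip("\n").split("\t")
    mandarin, cantonese, korean, japanese_on, vietnamese = pronunciations.split(";")

    def readings(field):
        # several readings are comma-separated; "null" means no reading
        return [] if field == "null" else field.split(",")

    return {
        "codepoint": codepoint,
        "char": char,
        "mandarin": readings(mandarin),
        "cantonese": readings(cantonese),
        "korean": readings(korean),
        "japanese_on": readings(japanese_on),
        "vietnamese": readings(vietnamese),
        "ids": ids,
    }
```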

Generation process

  • Unihan_Readings.txt: obtain the pronunciations of each character in the various languages
    • Mandarin (including polyphones): kHanyuPinyin, kMandarin, kTGHZ2013, kXHC1983
    • Other languages: kCantonese, kKorean, kJapaneseOn, kVietnamese
  • ids.txt: recursively decompose characters into strokes (a rough sketch follows this list)
    • Complex characters are decomposed using the component strokes provided in ids.txt
    • Some simple characters have no stroke data there; makemeahanzi's strokes are used instead
    • Collect the set of all Chinese strokes, i.e. whatever still counts as a single stroke after cjklib decomposition
    • A few simple characters still had no strokes; I annotated them manually (using Wiktionary and my own labeling) and ran the traversal again
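
A rough, hypothetical sketch of the recursive ids.txt traversal mentioned above (variant decompositions with source tags, the makemeahanzi fallback, and the manual annotations are not handled here):

```python
IDC = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")  # ideographic description characters

def load_ids(path="ids.txt"):
    """Map each character in cjkvi-ids' ids.txt to its IDS decomposition."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(";") or not line.strip():
                continue  # skip comment and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                _, char, decomposition = fields[0], fields[1], fields[2]
                table[char] = decomposition
    return table

def decompose(char, table):
    """Recursively expand a character until only IDC symbols and strokes remain."""
    decomposition = table.get(char, char)
    if decomposition == char:
        # simple characters such as 人 map to themselves in ids.txt and
        # cannot be expanded further without an extra stroke dictionary
        return char
    return "".join(
        c if c in IDC else decompose(c, table) for c in decomposition
    )
```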

Some queries about this code

I have read the code and the README, and I'd like to train a model suited to my own business data. I have a few questions:

  1. In masked_lm.py, I don't see the GPU being used when training the model?
  2. The README says training should be done in three ordered steps: pre-training the masked language model, fine-tuning it, and then training the CSD filter. I don't quite understand this. If I train on my own dataset, do I have to follow these three steps in order? Normally, couldn't I train a model for my own use case directly from the pre-trained model files and my own dataset? Why split it into three steps?
  3. According to the README, the CSD filter is fairly complicated to train, and I still don't understand how its dataset should be prepared.
  4. The original code already includes the OCR_train and OCR_test datasets, whose format is fairly clear. Before training, they presumably also need to be run through create_data.py to produce the wrong.txt and correct.txt files, after which the three generated files (besides wrong.txt and correct.txt there is also mask_probability.sav) can be used directly for training?
  5. According to the hints in the code, once the -m argument is set to e and args.train is true, training can be started directly? But there is no validation step?
    Any guidance would be appreciated.

Discussion

Hello, is there a discussion group for this project? I'm still having some trouble generating the char_meta.txt character file.

Some characters have no glyph representation

Thanks for the author's contribution. However, for some characters, e.g. “牛”, no glyph representation can be found in either ids.txt or makemeahanzi. How were these obtained? Also, I haven't read the paper carefully yet: does the model perform well with only a few thousand training samples? Would adding a large amount of typo-free training data affect the model?

Data format question for fine-tuning

$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

Hello, regarding the fine-tuning described in the README: what exactly are the three files /path/to/training/data/file, correct.txt and wrong.txt, and what format should they be in? The README does not seem to mention this.

Looking forward to a reply, thanks.

pre-trained model

Hi, I could not find the pre-trained model. The link points to the pre-trained BERT model instead.


Handling of the stroke-level IDS

Hello! After downloading ids.txt, I found that it has no stroke-level information for simple characters, for example:
U+4EBA 人 人
so recursive processing should not be able to reach the representation described in the paper. How did you handle this? Is additional external data required?

Could you share the trained model files?

Hello, thank you for your work. I'd like to reproduce the model's results, but training is interrupted by out-of-memory errors when fine-tuning the masked language model. Could you share the trained model files?

Any suggestions for improving low recall?

With the CSD filter I obtain fairly high precision, but recall is only 56%; even without filtering, recall is only 65%. Are there ways to further improve recall, for example switching the BERT model to HIT's Chinese-BERT-wwm or the newer ALBERT, or adding richer training data? Would that work?

In the CSD part, do the candidates only need to be stored once? Is my understanding correct?

top_difference=True, sim_type='shape', rank=0   # stored on this pass; call it A
top_difference=True, sim_type='shape', rank=1   # reuses A
top_difference=True, sim_type='shape', rank=2   # reuses A
... , ... , ...
top_difference=True, sim_type='sound', rank=0   # reuses A
top_difference=True, sim_type='sound', rank=1   # reuses A
top_difference=True, sim_type='sound', rank=2   # reuses A
... , ... , ...
top_difference=False, sim_type='shape', rank=0  # reuses A
top_difference=False, sim_type='shape', rank=1  # reuses A
top_difference=False, sim_type='shape', rank=2  # reuses A
... , ... , ...
top_difference=False, sim_type='sound', rank=0
top_difference=False, sim_type='sound', rank=1
top_difference=False, sim_type='sound', rank=2
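
If that reading is right, the reuse could be expressed roughly like the hypothetical sketch below: the expensive candidate generation runs once per batch of sentences, and every later (top_difference, sim_type, rank) configuration just reads the cached result. This only illustrates the question, not FASPell's actual code.

```python
_candidate_cache = {}

def cached_candidates(sentences, generate_fn):
    """Return MLM candidates for `sentences`, computing them only once.

    `generate_fn` stands in for whatever produces the candidates
    (e.g. the masked-LM forward pass); later calls with the same
    sentences reuse the stored result ("A" in the list above).
    """
    key = tuple(sentences)
    if key not in _candidate_cache:
        _candidate_cache[key] = generate_fn(sentences)
    return _candidate_cache[key]
```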

Validation dataset

Hello, are both the validation set and the test set the test data from SIGHAN?
