iqiyi / faspell
2019 SOTA spell checker for Simplified and Traditional Chinese: FASPell Chinese Spell Checker (Chinese spell check / Chinese spelling error detection / Chinese spelling correction)
License: GNU General Public License v3.0
“## Fine-tuning”
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt
Running $ python create_data.py -f /path/to/training/data/file fails with FileNotFoundError: [Errno 2] No such file or directory: '/path/to/training/data/file'
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt
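Hello, regarding the fine-tuning steps in the README: what are the three files /path/to/training/data/file, correct.txt, and wrong.txt, and what format should each of them have? The README does not seem to say.
Looking forward to your reply, thanks.
For example, TensorFlow 2.0 no longer has flags, so the code fails when I run it.
Thanks to the authors for their contribution. However, for some characters, such as “牛”, no glyph representation can be found in either ids or makemeahanzi. How did you obtain it? Also, I have not read the paper closely yet: does the model work well with a dataset of only a few thousand examples? Would adding a large amount of training data without spelling errors affect the model?
I would like to understand the NSP task in the pre-training here. Looking at the code, I found no way for a_tokens and b_tokens to come from the same sentence; sentences are picked at random and concatenated as AAAABBBB. So what purpose does NSP serve here? After all, the BERT paper describes it like this: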
Specifically,
when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual
next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from
the corpus (labeled as NotNext).
Could the authors clear this up for me? Thank you. @eugene-yh @jwu26
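For reference, the sampling rule in the quoted BERT passage can be sketched as follows. This is a minimal illustration of that description only, not the code in this repository; `make_nsp_pairs` and the toy documents are hypothetical names, and the sketch assumes at least two documents.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples as the BERT paper describes.

    `documents` is a list of documents, each a list of sentences.
    Assumes there are at least two documents to draw random sentences from.
    """
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # IsNext: B is the sentence that actually follows A.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # NotNext: B is a random sentence from a different document.
                others = [d for j, d in enumerate(documents) if j != doc_index]
                pairs.append((doc[i], rng.choice(rng.choice(others)), False))
    return pairs

# Toy corpus: the first letter names the document, the digit the position.
docs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
pairs = make_nsp_pairs(docs, random.Random(0))
```

If A and B are always random sentences, every pair carries the NotNext label, which is the situation the question describes.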
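With CSD filtering I get fairly high precision, but recall is only 56%, and even without filtering recall is only 65%. Are there any ideas for improving recall further, for example switching the BERT model to HIT's Chinese-BERT-wwm or the latest ALBERT model, or adding richer training data? Would that work?
Could anyone share how to solve the alignment problem in extension() when several digits or letters appear in a row?
I do not quite understand why misspelled characters are substituted in when training the LM. It seems that predicting masks alone would already yield character-level semantic embeddings. Once misspelled characters are substituted in, won't they feed wrong information into self-attention and cause interference? I am a bit confused and hope someone can explain.
As in the title.
Does anyone know how to generate mask_probability.sav for fine-tuning a custom model?
The provided training data, the Unihan pinyin data, and the ids data are all in Traditional Chinese; none of it is in Simplified Chinese.
Can the trained model be applied well to Simplified Chinese?
Looking forward to your reply, thanks.
I ran these commands to produce tf_examples.tfrecord: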
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt
and finally fed it to run_pretraining.py:
python3 run_pretraining.py \
--input_file=/tmp/tf_examples.tfrecord \
--output_dir=/tmp/pretraining_output \
--do_train=True \
--do_eval=True \
--bert_config_file=./tmp/model/bert_config.json \
--init_checkpoint=./tmp/model/bert_model.ckpt \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=10 \
--learning_rate=2e-5
but got this error:
WARNING: Logging before flag parsing goes to stderr.
W1112 13:50:39.595919 4557219264 deprecation_wrapper.py:119] From ../bert_modified/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
W1112 13:50:39.596668 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:496: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.
W1112 13:50:39.597512 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:410: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
W1112 13:50:39.597633 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:410: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
W1112 13:50:39.597735 4557219264 deprecation_wrapper.py:119] From ../bert_modified/modeling.py:92: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
W1112 13:50:39.598400 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:417: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.
W1112 13:50:39.598561 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:421: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.
W1112 13:50:39.599843 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:423: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.
I1112 13:50:39.599960 4557219264 run_pretraining.py:423] *** Input Files ***
W1112 13:50:40.611162 4557219264 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W1112 13:50:40.611783 4557219264 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x1389af2f0>) includes params argument, but params are not passed to Estimator.
I1112 13:50:40.612485 4557219264 estimator.py:209] Using config: {'_model_dir': '/tmp/pretraining_output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x13ca037b8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2), '_cluster': None}
I1112 13:50:40.612823 4557219264 tpu_context.py:209] _TPUContext: eval_on_tpu True
W1112 13:50:40.612979 4557219264 tpu_context.py:211] eval_on_tpu ignored because use_tpu is False.
I1112 13:50:40.613079 4557219264 run_pretraining.py:462] ***** Running training *****
I1112 13:50:40.613152 4557219264 run_pretraining.py:463] Batch size = 32
W1112 13:50:40.623688 4557219264 deprecation.py:323] From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1112 13:50:40.632311 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:340: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
W1112 13:50:40.637573 4557219264 deprecation.py:323] From run_pretraining.py:371: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W1112 13:50:40.637722 4557219264 deprecation.py:323] From /usr/local/lib/python3.7/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
E1112 13:50:40.639678 4557219264 error_handling.py:70] Error recorded from training_loop: Tensor conversion requested dtype string for Tensor with dtype float32: 'Tensor("args_0:0", shape=(), dtype=float32)'
I1112 13:50:40.639847 4557219264 error_handling.py:96] training_loop marked as finished
W1112 13:50:40.639980 4557219264 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
File "run_pretraining.py", line 496, in <module>
tf.app.run()
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "run_pretraining.py", line 469, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_default
input_fn, ModeKeys.TRAIN))
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1022, in _get_features_and_labels_from_input_fn
self._call_input_fn(input_fn, mode))
File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2828, in _call_input_fn
return input_fn(**kwargs)
File "run_pretraining.py", line 371, in input_fn
cycle_length=cycle_length))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1853, in apply
return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1290, in apply
dataset = transformation_func(self)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/experimental/ops/interleave_ops.py", line 93, in _apply_fn
buffer_output_elements, prefetch_input_elements)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 224, in __init__
map_func, self._transformation_name(), dataset=input_dataset)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2555, in __init__
self._function = wrapper_fn._get_concrete_function_internal()
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1355, in _get_concrete_function_internal
*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1349, in _get_concrete_function_internal_garbage_collected
graph_function, _, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1652, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1545, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 715, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2549, in wrapper_fn
ret = _wrapper_helper(*args)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2489, in _wrapper_helper
ret = func(*nested_args)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
filenames, compression_type, buffer_size, num_parallel_reads)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
filenames = _create_or_validate_filenames_dataset(filenames)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 57, in _create_or_validate_filenames_dataset
filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1087, in convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1145, in convert_to_tensor_v2
as_ref=False)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1224, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1018, in _TensorTensorConversionFunction
(dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: 'Tensor("args_0:0", shape=(), dtype=float32)'
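One likely cause, offered as an assumption rather than a confirmed diagnosis: the log prints "*** Input Files ***" with no file names after it, which suggests that --input_file=/tmp/tf_examples.tfrecord matched nothing (the commands above write tf_examples.tfrecord into bert_modified, not /tmp). In TensorFlow 1.x an empty Python list converts to a float32 tensor, which would produce this dtype error. The sketch below fails early with a clearer message; `resolve_input_files` is a hypothetical helper, and it uses the standard-library `glob` where the script appears to use `tf.gfile.Glob`.

```python
import glob
import os
import tempfile

def resolve_input_files(input_file_flag):
    """Expand a comma-separated list of glob patterns; fail early if empty."""
    files = []
    for pattern in input_file_flag.split(","):
        files.extend(glob.glob(pattern.strip()))
    if not files:
        raise FileNotFoundError(
            "no TFRecord files match %r; check --input_file" % input_file_flag)
    return sorted(files)

# Demonstration with a temporary directory instead of real training data.
with tempfile.TemporaryDirectory() as tmp:
    record = os.path.join(tmp, "tf_examples.tfrecord")
    open(record, "wb").close()
    found = resolve_input_files(os.path.join(tmp, "*.tfrecord"))
    try:
        resolve_input_files(os.path.join(tmp, "missing.tfrecord"))
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

Calling a check like this before the dataset is built turns the opaque tensor-conversion error into a missing-file error.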
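Hello, and thank you for your work. I want to reproduce the model's results, but fine-tuning the masked language model is interrupted because it runs out of memory. Could you share the trained model files?
About 10 MB in size; not guaranteed to be fully correct.
A few lines as a preview: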
U+4E07 万 wan4,mo4;maan6,mak6;MAN,MWUK;MAN,BAN;vạn ⿱一⿰丿𠃌
U+4E08 丈 zhang4;zoeng6;CANG;JOU,CHOU;trượng ⿻一⿻㇇乀
U+4E09 三 san1;saam1,saam3;SAM;SAN;tam ⿱一⿱一一
U+4E0A 上 shang4,shang3;soeng5,soeng6;SANG;JOU,SHOU;thượng ⿱⿰丨一一
U+4E0B 下 xia4;haa5,haa6;HA;KA,GE;hạ ⿱一⿻丨丶
U+4E0C 丌 qi2,ji1;gei1;KI;KI,GI;null ⿱一⿰丿丨
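How it was generated:
Unihan_Readings.txt
Provides each character's pronunciation in the different languages:
kHanyuPinyin, kMandarin, kTGHZ2013, kXHC1983
kCantonese, kKorean, kJapaneseOn, kVietnamese
ids.txt
Traverse it to decompose each character into strokes.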
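A minimal sketch of reading one line in the preview format above. The field order (Mandarin, Cantonese, Korean, Japanese On, Vietnamese) and the meaning of "null" are assumptions inferred from the preview lines and the Unihan field names listed; `parse_char_meta_line` is a hypothetical helper, not part of the repository.

```python
def parse_char_meta_line(line):
    """Split one preview line into code point, character, readings, and IDS."""
    code_point, char, readings, ids = line.split()
    # Assumed order of the ';'-separated reading groups.
    languages = ("mandarin", "cantonese", "korean", "japanese_on", "vietnamese")
    pron = {}
    for lang, field in zip(languages, readings.split(";")):
        # "null" appears to mark a language with no recorded reading.
        pron[lang] = [] if field == "null" else field.split(",")
    return {"code_point": code_point, "char": char, "pron": pron, "ids": ids}

entry = parse_char_meta_line("U+4E0C 丌 qi2,ji1;gei1;KI;KI,GI;null ⿱一⿰丿丨")
```

Because `str.split()` splits on any whitespace, the same function works whether the real file separates fields with tabs or spaces.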
Hi,
Thanks for your great work. I have read your paper but could not find the token-level results for SIGHAN13 and SIGHAN14. Could you share them? Thanks.
Hello! After downloading ids.txt, I found that simple characters have no stroke-level information, for example:
U+4EBA 人 人
After recursive processing, this should not produce what the paper describes. How did you handle this? Is other external data needed?
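The behaviour described in the question can be reproduced with a small sketch of recursive IDS expansion. The table rows for 从 and 众 are toy entries in the style of ids.txt, the stroke-level entry for 人 is the one quoted in another question on this page, and `expand_ids` is a hypothetical helper rather than the authors' code.

```python
def expand_ids(char, table, seen=frozenset()):
    """Recursively replace components with their own IDS until nothing changes."""
    ids = table.get(char, char)
    if ids == char or char in seen:
        # Atomic entry (or a cycle): no further decomposition is available.
        return char
    return "".join(expand_ids(c, table, seen | {char}) for c in ids)

# Toy table in the style of ids.txt; 人 maps to itself, as in the question.
toy_table = {
    "人": "人",
    "从": "⿰人人",
    "众": "⿱人从",
}

# Hypothetical extra stroke-level data, overlaid on the component-level table.
stroke_table = {"人": "⿰丿㇏"}
merged = {**toy_table, **stroke_table}

component_level = expand_ids("众", toy_table)   # stops at 人
stroke_level = expand_ids("从", merged)          # continues down to strokes
```

With ids.txt alone the recursion stops at 人, so reaching stroke level needs some additional source for characters that map to themselves.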
top_difference=True, sim_type='shape', rank=0  # store the result of this run; call it A
top_difference=True, sim_type='shape', rank=1  # reuse the first run's result
top_difference=True, sim_type='shape', rank=2  # reuse the first run's result
... , ... , ...
top_difference=True, sim_type='sound', rank=0  # reuse the first run's result
top_difference=True, sim_type='sound', rank=1  # reuse the first run's result
top_difference=True, sim_type='sound', rank=2  # reuse the first run's result
... , ... , ...
top_difference=False, sim_type='shape', rank=0  # reuse the first run's result
top_difference=False, sim_type='shape', rank=1  # reuse the first run's result
top_difference=False, sim_type='shape', rank=2  # reuse the first run's result
... , ... , ...
top_difference=False, sim_type='sound', rank=0
top_difference=False, sim_type='sound', rank=1
top_difference=False, sim_type='sound', rank=2
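The idea in the list above (compute the expensive result once, then reuse it for every combination of settings) can be sketched like this. `masked_lm_output` is a hypothetical stand-in for the expensive model pass, not a function from the repository, and the rank limit of 3 is only for illustration.

```python
import itertools
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def masked_lm_output(sentence):
    """Stand-in for the expensive masked-LM pass; cached per sentence."""
    calls["count"] += 1
    return tuple(sentence)  # placeholder for real candidate lists

def run_all_settings(sentence, max_rank=3):
    """Visit every (top_difference, sim_type, rank) setting in the listed order."""
    results = {}
    settings = itertools.product((True, False), ("shape", "sound"), range(max_rank))
    for top_difference, sim_type, rank in settings:
        candidates = masked_lm_output(sentence)  # computed once, then reused
        results[(top_difference, sim_type, rank)] = candidates
    return results

results = run_all_settings("示例")
```

`itertools.product` yields the settings in the same order as the list, and the cache ensures the model pass runs only for the first one.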
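Why does the text alignment in extension() consider only two cases?
Hello, are the validation set and the test set both taken from the SIGHAN test data?
Hello, thank you for open-sourcing your work; it is excellent. I have a question: does char_meta.txt have to be prepared by the user in advance? If so, how? The data links you provide do not seem to include it.
What does “drawing manually” refer to here? Once a suitable curve has been found by eye, how is it converted into a formula that can be given to the program?
Why not train a binary classifier directly?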
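One way to turn a hand-picked curve into something a program can evaluate, sketched under the assumption that the filter is a function of confidence and similarity: read two points off the plot, derive the line through them, and keep candidates above it. The point values and the names `line_through` and `make_filter` are hypothetical.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (confidence, similarity) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def make_filter(p1, p2):
    """Build a predicate that accepts candidates lying above the line."""
    slope, intercept = line_through(p1, p2)

    def keep(confidence, similarity):
        # Accept a candidate only if it lies above the hand-drawn line.
        return similarity > slope * confidence + intercept

    return keep

# Two points read off the scatter plot by eye (made-up values).
keep = make_filter((0.5, 1.0), (1.0, 0.5))
accepted = keep(0.9, 0.9)
rejected = keep(0.6, 0.5)
```

A curve made of several segments can be built the same way by combining several such predicates with `and` / `or`.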
import cjklib
from cjklib import characterlookup
cjk = characterlookup.CharacterLookup('T')
...
decomp = cjk.getStrokeOrder(ss[1].decode('utf-8'))
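The result is as follows (here ss[1] is the character '不'; Python 2.7 on my machine cannot display Chinese characters, so they all appear as hexadecimal escapes):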
ss[1].decode('utf-8')
u'\u4e0d'
decomp
[u'\u31d0', u'\u31d2', u'\u31d1', u'\u31d4']
The final result is:
U+4E0D 不 ㇐㇒㇑㇔
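The characters ㇐㇒㇑㇔ have no corresponding entries in idx.txt. How should this be handled?
How can I obtain the IDS of a character such as 人, i.e. ⿰丿㇏?
cjkvi-ids/ids.txt contains only U+4EBA 人 人. How can it be decomposed further? Is there an additional dictionary? Thanks.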
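In Python 3 strings are Unicode, so the stroke list shown above can be joined and printed directly without decode(). The sketch below only reproduces the formatting step; `format_stroke_line` is a hypothetical helper and the stroke list is copied from the output above rather than obtained from cjklib.

```python
def format_stroke_line(char, strokes):
    """Format one 'U+XXXX <char> <strokes>' line from a list of stroke characters."""
    return "U+%04X %s %s" % (ord(char), char, "".join(strokes))

def is_cjk_stroke(ch):
    """True if ch lies in the Unicode CJK Strokes block (U+31C0 to U+31EF)."""
    return 0x31C0 <= ord(ch) <= 0x31EF

strokes = ["\u31d0", "\u31d2", "\u31d1", "\u31d4"]
line = format_stroke_line("\u4e0d", strokes)
all_strokes = all(is_cjk_stroke(s) for s in strokes)
```

All four characters fall in the CJK Strokes block, so they are stroke primitives rather than components that would have their own decomposition entries.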
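For example, 红米k20pro written as 红米k20po,
or 荣耀V30 written as 荣耀V300 or 荣耀VV30.
Does anyone have a solution?
During testing, some Simplified Chinese errors get corrected into Traditional Chinese characters. Thanks.
After reading the code and the README, I want to train a model suited to my own business, and I have a few questions: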
···
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt
···
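After the steps above I obtain the fine-tuning examples in tf_examples.tfrecord. What is this file for, and how is it used?
Is what gets saved here a txt file, or some other format?
Hello, how can I plot the second and third confidence-similarity graphs in Figure 3 of the paper? Is there any code I can refer to? Thanks.
Where is the apted.jar file?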
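Hello, can this training method be used when the original sentence and the corrected sentence are not aligned? For example, the original is “有着举足轻的作用” and the correction is “有着举足轻重的作用”.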
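To see where such a pair stops being aligned, the standard-library `difflib` can list the edit operations between the two sentences. This is only a diagnostic sketch with a hypothetical helper name, `alignment_ops`; it does not answer whether the training scripts accept such pairs.

```python
import difflib

def alignment_ops(wrong, correct):
    """List the non-equal edit operations between the two sentences."""
    matcher = difflib.SequenceMatcher(None, wrong, correct, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

wrong = "有着举足轻的作用"
correct = "有着举足轻重的作用"
ops = alignment_ops(wrong, correct)
same_length = len(wrong) == len(correct)
```

An `insert` or `delete` operation marks a position after which the characters of the two sentences no longer line up one to one, whereas a pure substitution error shows up as a `replace` of equal length.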
As in the title, thanks.
Hello, is there a discussion group for this project? I am still having some problems processing the char_meta.txt file.
Hi, I could not find the pre-trained model. The link only points to the pre-trained BERT model.
As in the title.
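Hello, what is the purpose of the tackle_n_gram_bias strategy? Why does it keep only the candidate with the highest confidence and put all the others into error_delete_positions? Does this mean that a position is left uncorrected as soon as any result points to it in error_delete_positions?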
As described in the extension() method:
"""this function is to resolve the bug that when two adjacent full-width numbers/letters are fed to mlm, the output will be merged as one output, thus lead to wrong alignments."""
But this leads to another bug. When I test the sentence "本是几经济报道",
BERT predicts "21" for the masked 几, and extension() then splits it into 2 and 1.
When execution reaches line 292 of faspell.py: char = sentences[i][j - 1]
an error occurs: list index out of range
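A defensive sketch of one way to avoid the index error, assuming the model returns one token per input position: a token longer than one character is not split, and the source character is kept instead, so the output always has the same length as the input. `align_predictions` is a hypothetical helper and is not the repository's extension().

```python
def align_predictions(original, predicted_tokens):
    """Return one output character per input character.

    `predicted_tokens` has one entry per input position; an entry may hold
    more than one character when the model emits a multi-character token.
    """
    if len(predicted_tokens) != len(original):
        raise ValueError("expected one prediction per input character")
    aligned = []
    for source_char, token in zip(original, predicted_tokens):
        # A multi-character token such as "21" cannot replace one character
        # without shifting later positions, so keep the source character.
        aligned.append(token if len(token) == 1 else source_char)
    return "".join(aligned)

sentence = "本是几经济报道"
fixed = align_predictions(sentence, ["本", "是", "21", "经", "济", "报", "道"])
```

This trades a possible correction at that position for a guarantee that later indexing such as `sentences[i][j - 1]` stays within bounds.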
Download URL for the SIGHAN15 test set
I have not changed bert_config.json, and in faspell_configs.json I only changed file paths, yet line 292 of faspell.py reports a character index out of range. Why?