iqiyi / faspell

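2019 SOTA spell checker for Simplified and Traditional Chinese: FASPell Chinese Spell Checker (Chinese Spell Check / Chinese spelling error detection / Chinese spelling correction / Chinese spell checking)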

License: GNU General Public License v3.0

Python 100.00%

faspell's Issues

FileNotFoundError: [Errno 2] No such file or directory: '/path/to/training/data/file'

From the README section "Fine-tuning":
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

Running $ python create_data.py -f /path/to/training/data/file fails with FileNotFoundError: [Errno 2] No such file or directory: '/path/to/training/data/file'
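The path in the README command reads as a placeholder rather than a file shipped with the repository, so it has to be replaced with the location of your own data. A minimal sketch of a pre-flight check follows; `resolve_training_file` and `README_PLACEHOLDER` are hypothetical names, not part of FASPell, and the check says nothing about the format the file must have.

```python
from pathlib import Path

# The literal placeholder string used in the README command.
README_PLACEHOLDER = "/path/to/training/data/file"


def resolve_training_file(path_str):
    """Return the absolute path of an existing training file, or raise."""
    if path_str == README_PLACEHOLDER:
        raise ValueError(
            "replace the README placeholder with the path to your own data file"
        )
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"training data file not found: {path}")
    return path.resolve()
```

Calling this before `create_data.py` turns the bare `FileNotFoundError` into a message that says what to change.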

Data format for fine-tune training

$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

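Hello, regarding the fine-tune training mentioned in the README: what are the three files /path/to/training/data/file, correct.txt, and wrong.txt, and what format should each of them have? The README doesn't seem to say.

Looking forward to your reply. Thanks.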

Hi, the code isn't compatible with tensorflow 2.0.0, is it?

For example, tensorflow 2.0 has no flags, and the code fails when I run it

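Some characters have no glyph representation

Thanks to the authors for their contribution. For some characters, such as "牛", no glyph (shape) representation can be found in either ids or makemeahanzi. How did you obtain it? Also, I haven't read the paper closely yet: does the model work well with only a few thousand training examples? And would adding a large amount of error-free training data affect the model?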

In faspell.py, class SpellChecker(): does def get_error() always return None?

NSP

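I'd like to understand the NSP used in pretraining here. Looking at the code, I found no possibility that a_tokens and b_tokens come from the same sentence; sentences are always picked at random and concatenated as AAAABBBB. So what purpose does NSP serve here? After all, the BERT paper describes it as follows: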

Specifically,
when choosing the sentences A and B for each pretraining example, 50% of the time B is the actual
next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from
the corpus (labeled as NotNext).

Could the authors help clear this up? Thanks. @eugene-yh @jwu26
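For reference, the 50/50 sampling described in the BERT quote above can be sketched in plain Python. This is an illustration of the quoted scheme, not code from this repository; `make_nsp_pairs` is a hypothetical name, and drawing the NotNext sentence from a different document is a simplification of "a random sentence from the corpus".

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples.

    documents: list of documents, each a list of sentences in reading order.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    if len(documents) < 2:
        raise ValueError("need at least two documents to sample NotNext pairs")
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # IsNext: B is the sentence that actually follows A.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # NotNext: B is a random sentence from some other document.
                others = [d for j, d in enumerate(documents) if j != doc_index]
                pairs.append((doc[i], rng.choice(rng.choice(others)), False))
    return pairs
```

If A and B are always unrelated random sentences, as the issue describes, every example would carry the NotNext label and the NSP head would have nothing to discriminate.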

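Any suggestions for improving low recall in the experimental results?

With CSD filtering I obtained fairly high precision, but recall is only 56%; even without filtering, recall is only 65%. Are there ideas for raising recall further, for example swapping in HIT's Chinese-BERT-wwm or the latest ALBERT model, or adding richer training data? Would that work?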

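Could anyone share a fix for extension(), i.e. the alignment problem when several digits or letters appear consecutively?

About training the LM

I don't quite understand why erroneous characters are substituted in when training the LM. It seems that predicting masks alone would already yield character-level semantic representations. With wrong characters substituted in, won't they feed incorrect information into self-attention and cause interference? I'm a bit confused and hope someone can explain.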

Could the fine-tuned model that is used together with the pre-trained model be uploaded to Google Drive?

As in the title.

Does anyone know how to generate mask_probability.sav?

Does anyone know how to generate mask_probability.sav for fine-tuning a custom model?

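The provided training data, the Unihan pinyin data, and the ids data are all Traditional Chinese, with no Simplified Chinese

The provided training data, the Unihan pinyin data, and the ids data are all Traditional Chinese, with no Simplified Chinese.
Can the trained model be applied well to Simplified Chinese?
Looking forward to your reply. Thanks.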

Running run_pretraining.py fails

I ran these commands to produce tf_examples.tfrecord

$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

and finally fed it to run_pretraining.py

python3 run_pretraining.py \
      --input_file=/tmp/tf_examples.tfrecord \
      --output_dir=/tmp/pretraining_output \
      --do_train=True \
      --do_eval=True \
      --bert_config_file=./tmp/model/bert_config.json \
      --init_checkpoint=./tmp/model/bert_model.ckpt \
      --train_batch_size=32 \
      --max_seq_length=128 \
      --max_predictions_per_seq=20 \
      --num_train_steps=20 \
      --num_warmup_steps=10 \
      --learning_rate=2e-5

but got an error

WARNING: Logging before flag parsing goes to stderr.
W1112 13:50:39.595919 4557219264 deprecation_wrapper.py:119] From ../bert_modified/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W1112 13:50:39.596668 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:496: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W1112 13:50:39.597512 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:410: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W1112 13:50:39.597633 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:410: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W1112 13:50:39.597735 4557219264 deprecation_wrapper.py:119] From ../bert_modified/modeling.py:92: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

W1112 13:50:39.598400 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:417: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

W1112 13:50:39.598561 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:421: The name tf.gfile.Glob is deprecated. Please use tf.io.gfile.glob instead.

W1112 13:50:39.599843 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:423: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

I1112 13:50:39.599960 4557219264 run_pretraining.py:423] *** Input Files ***
W1112 13:50:40.611162 4557219264 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W1112 13:50:40.611783 4557219264 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x1389af2f0>) includes params argument, but params are not passed to Estimator.
I1112 13:50:40.612485 4557219264 estimator.py:209] Using config: {'_model_dir': '/tmp/pretraining_output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x13ca037b8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2), '_cluster': None}
I1112 13:50:40.612823 4557219264 tpu_context.py:209] _TPUContext: eval_on_tpu True
W1112 13:50:40.612979 4557219264 tpu_context.py:211] eval_on_tpu ignored because use_tpu is False.
I1112 13:50:40.613079 4557219264 run_pretraining.py:462] ***** Running training *****
I1112 13:50:40.613152 4557219264 run_pretraining.py:463]   Batch size = 32
W1112 13:50:40.623688 4557219264 deprecation.py:323] From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1112 13:50:40.632311 4557219264 deprecation_wrapper.py:119] From run_pretraining.py:340: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

W1112 13:50:40.637573 4557219264 deprecation.py:323] From run_pretraining.py:371: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W1112 13:50:40.637722 4557219264 deprecation.py:323] From /usr/local/lib/python3.7/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
E1112 13:50:40.639678 4557219264 error_handling.py:70] Error recorded from training_loop: Tensor conversion requested dtype string for Tensor with dtype float32: 'Tensor("args_0:0", shape=(), dtype=float32)'
I1112 13:50:40.639847 4557219264 error_handling.py:96] training_loop marked as finished
W1112 13:50:40.639980 4557219264 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
  File "run_pretraining.py", line 496, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining.py", line 469, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_default
    input_fn, ModeKeys.TRAIN))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1022, in _get_features_and_labels_from_input_fn
    self._call_input_fn(input_fn, mode))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2828, in _call_input_fn
    return input_fn(**kwargs)
  File "run_pretraining.py", line 371, in input_fn
    cycle_length=cycle_length))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1853, in apply
    return DatasetV1Adapter(super(DatasetV1, self).apply(transformation_func))
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1290, in apply
    dataset = transformation_func(self)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/experimental/ops/interleave_ops.py", line 93, in _apply_fn
    buffer_output_elements, prefetch_input_elements)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 224, in __init__
    map_func, self._transformation_name(), dataset=input_dataset)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2555, in __init__
    self._function = wrapper_fn._get_concrete_function_internal()
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1355, in _get_concrete_function_internal
    *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1349, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1652, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1545, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 715, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2549, in wrapper_fn
    ret = _wrapper_helper(*args)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2489, in _wrapper_helper
    ret = func(*nested_args)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 335, in __init__
    filenames, compression_type, buffer_size, num_parallel_reads)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 295, in __init__
    filenames = _create_or_validate_filenames_dataset(filenames)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/data/ops/readers.py", line 57, in _create_or_validate_filenames_dataset
    filenames = ops.convert_to_tensor(filenames, dtype=dtypes.string)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1087, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1145, in convert_to_tensor_v2
    as_ref=False)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1224, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1018, in _TensorTensorConversionFunction
    (dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: 'Tensor("args_0:0", shape=(), dtype=float32)'
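One plausible reading of this log, not confirmed anywhere in the thread: the line `*** Input Files ***` is followed by no file names, which suggests the `--input_file` pattern matched nothing. The commands above write tf_examples.tfrecord into bert_modified, while run_pretraining.py is pointed at /tmp/tf_examples.tfrecord. An empty file list is a known way to end up with a float32 tensor where a string tensor is expected. The sketch below uses the standard-library `glob` to fail early with a clearer message; `expand_input_patterns` is a hypothetical helper, not part of the repository.

```python
import glob


def expand_input_patterns(input_file_flag):
    """Expand a comma-separated list of glob patterns into concrete file paths."""
    files = []
    for pattern in input_file_flag.split(","):
        files.extend(glob.glob(pattern.strip()))
    if not files:
        raise FileNotFoundError(
            f"no input files matched {input_file_flag!r}; "
            "check the path given to --input_file"
        )
    return sorted(files)
```

Running such a check on the value passed to `--input_file` before building the dataset would surface a wrong path immediately instead of deep inside the TensorFlow input pipeline.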

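Could you share the trained model data?

Hello, and thank you for your work. I'd like to reproduce the model's results, but fine-tuning the masked language model gets interrupted because it runs out of memory. Could you share the trained model data?

Sharing a feature file I generated myself: char_meta.txt

char_meta.txt

About 10 MB; not guaranteed to be fully correct.

A few lines as a preview: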

U+4E07	万	wan4,mo4;maan6,mak6;MAN,MWUK;MAN,BAN;vạn	⿱一⿰丿𠃌
U+4E08	丈	zhang4;zoeng6;CANG;JOU,CHOU;trượng	⿻一⿻㇇乀
U+4E09	三	san1;saam1,saam3;SAM;SAN;tam	⿱一⿱一一
U+4E0A	上	shang4,shang3;soeng5,soeng6;SANG;JOU,SHOU;thượng	⿱⿰丨一一
U+4E0B	下	xia4;haa5,haa6;HA;KA,GE;hạ	⿱一⿻丨丶
U+4E0C	丌	qi2,ji1;gei1;KI;KI,GI;null	⿱一⿰丿丨
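A minimal sketch of reading one line of this preview in Python. It assumes the four columns are tab-separated (code point, character, readings, IDS) and that the `;`-separated readings are ordered Mandarin, Cantonese, Korean, Japanese On, Vietnamese, with `null` marking a missing reading. That ordering is inferred from the preview lines and the field list given in this issue; it is not a documented format, and `parse_char_meta_line` is a hypothetical name.

```python
from typing import NamedTuple


class CharMeta(NamedTuple):
    code_point: str
    char: str
    readings: dict
    ids: str


# Assumed order of the ';'-separated reading groups (inferred, not documented).
LANGS = ("mandarin", "cantonese", "korean", "japanese_on", "vietnamese")


def parse_char_meta_line(line):
    """Split one tab-separated char_meta.txt line into its four fields."""
    code_point, char, readings_field, ids = line.rstrip("\n").split("\t")
    readings = {}
    for lang, group in zip(LANGS, readings_field.split(";")):
        # "null" marks a language with no recorded reading.
        readings[lang] = [] if group == "null" else group.split(",")
    return CharMeta(code_point, char, readings, ids)
```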

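Generation process

Pronunciations of each character in several languages are taken from Unihan_Readings.txt:
- Mandarin, including polyphonic characters: kHanyuPinyin, kMandarin, kTGHZ2013, kXHC1983
- Other languages: kCantonese, kKorean, kJapaneseOn, kVietnamese
Strokes are obtained by traversing ids.txt and decomposing each character:
- Complex characters are decomposed wherever ids.txt provides their components and strokes
- Some simple characters have no strokes there, so the strokes from makemeahanzi are used
- The set of all Chinese strokes is collected: the characters that have only 1 stroke after cjklib decomposition
- Some simple characters still have no strokes; these are annotated and the traversal is repeated (using wiki dictionary and my own annotations)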
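The recursive decomposition step can be sketched as follows. This is an illustration under assumptions, not the script that produced the file: `ids_table` stands for entries loaded from ids.txt, `stroke_table` for a hand-made or makemeahanzi-derived fallback for characters that ids.txt maps to themselves, and both are tiny toy tables here. The 人 entry mirrors the ⿰丿㇏ example quoted in another issue on this page. Real ids.txt lines can carry extra annotations that this sketch ignores.

```python
# Ideographic Description Characters, U+2FF0..U+2FFB (for example ⿰ and ⿱).
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def expand_ids(char, ids_table, stroke_table, seen=frozenset()):
    """Recursively expand a character into a stroke-level IDS string."""
    if char in IDC:
        return char  # layout operators are kept as they are
    decomposition = ids_table.get(char, char)
    if decomposition == char or char in seen:
        # ids.txt has nothing finer (or we hit a cycle): use the fallback table,
        # or keep the character itself if no fallback is known.
        return stroke_table.get(char, char)
    return "".join(
        expand_ids(part, ids_table, stroke_table, seen | {char})
        for part in decomposition
    )


# Toy tables for illustration only.
ids_table = {"众": "⿱人从", "从": "⿰人人", "人": "人"}
stroke_table = {"人": "⿰丿㇏"}
expanded = expand_ids("众", ids_table, stroke_table)
```

Characters left unexpanded in the result are the ones that still need a manual entry in the fallback table.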

What are the token level results of SIGHAN13, SIGHAN14

Hi,
Thanks for your great work. I've read your paper but couldn't find the token-level results for SIGHAN13 and SIGHAN14. Could you share them? Thanks.

Handling stroke-level IDS

Hello! After downloading ids.txt, I found that it has no stroke-level information for simple characters, for example:
U+4EBA 人人
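After recursive processing, this shouldn't yield what the paper describes. How did you handle this? Is other external data needed?

Hello, in the data generation stage, why does running create_data from the command line give different results from running it in PyCharm?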

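The result in PyCharm differs from the result on the command line: in PyCharm, correct.txt and wrong.txt end up with unequal numbers of sentences, which makes it impossible to generate the tf.record afterwards. Do you know of a solution?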
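A quick consistency check on the two files can localize this kind of mismatch before create_tf_record.py is run. The sketch assumes the files are line-parallel (one sentence per line, line n of correct.txt paired with line n of wrong.txt) and that paired sentences have the same number of characters; neither assumption is documented in this thread, and `check_parallel_files` is a hypothetical helper.

```python
def check_parallel_files(correct_lines, wrong_lines):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if len(correct_lines) != len(wrong_lines):
        problems.append(
            "line count differs: %d correct vs %d wrong"
            % (len(correct_lines), len(wrong_lines))
        )
    pairs = zip(correct_lines, wrong_lines)
    for number, (correct, wrong) in enumerate(pairs, start=1):
        correct, wrong = correct.rstrip("\n"), wrong.rstrip("\n")
        if len(correct) != len(wrong):
            problems.append(
                "line %d: %d characters vs %d" % (number, len(correct), len(wrong))
            )
    return problems
```

With real files, the two arguments would come from `open("correct.txt", encoding="utf-8").readlines()` and the same call for wrong.txt.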

CSD: do the candidates only need to be saved once? Is my understanding correct?

top_difference=True, sim_type='shape', rank=0 # save the candidates on this run; call them A
top_difference=True, sim_type='shape', rank=1 # reuse the first run's
top_difference=True, sim_type='shape', rank=2 # reuse the first run's
... , ... , ...
top_difference=True, sim_type='sound', rank=0 # reuse the first run's
top_difference=True, sim_type='sound', rank=1 # reuse the first run's
top_difference=True, sim_type='sound', rank=2 # reuse the first run's
... , ... , ...
top_difference=False, sim_type='shape', rank=0 # reuse the first run's
top_difference=False, sim_type='shape', rank=1 # reuse the first run's
top_difference=False, sim_type='shape', rank=2 # reuse the first run's
... , ... , ...
top_difference=False, sim_type='sound', rank=0
top_difference=False, sim_type='sound', rank=1
top_difference=False, sim_type='sound', rank=2
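If the candidates really do not depend on top_difference, sim_type, or rank (the question above is not answered in this thread), the compute-once, reuse-afterwards pattern would look roughly like this. It is a generic caching sketch, not FASPell's own dump mechanism; `load_or_compute_candidates` is a hypothetical name and JSON is chosen only for illustration.

```python
import json
import os


def load_or_compute_candidates(cache_path, compute_fn):
    """Load candidates from cache_path, or compute them once and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    candidates = compute_fn()  # the expensive masked-LM pass would go here
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(candidates, f, ensure_ascii=False)
    return candidates
```

Each subsequent filter configuration would then call the same function with the same path and skip the expensive pass.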

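About text alignment in extension()

Why does the text alignment in extension() consider only 2 cases?

How to add loss and performance printing

Validation dataset

Hello, are the validation and test datasets both the SIGHAN test sets?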

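char_meta.txt question

Hello, thank you for open-sourcing your work; it's great. I have a question: does the char_meta.txt file need to be prepared by ourselves in advance? If so, how should it be prepared? It doesn't seem to be included in the data links you provided.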

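Confusion about manually drawing the curve

What does "manually drawing" refer to here? After a person eyeballs a curve that looks suitable, how is it turned into a formula that can be given to the program?
Why not just train a binary classifier directly?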
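One way to turn an eyeballed straight line into something a program can evaluate is to read two points off the plot and test which side of the line each candidate falls on. The sketch below assumes confidence is on the x-axis, similarity on the y-axis, and that candidates on or above the line are kept; those choices and the name `line_filter` are assumptions for illustration, not the repository's filter definitions.

```python
def line_filter(p1, p2):
    """Build keep(confidence, similarity) from two (confidence, similarity) points."""
    (c1, s1), (c2, s2) = p1, p2
    if c1 == c2:
        raise ValueError("the two points must differ in confidence")
    slope = (s2 - s1) / (c2 - c1)

    def keep(confidence, similarity):
        # Keep a candidate when it lies on or above the line through p1 and p2.
        return similarity >= s1 + slope * (confidence - c1)

    return keep


# Points read off a scatter plot by eye (made-up values).
keep = line_filter((0.0, 0.8), (1.0, 0.4))
```

A curve made of several segments can be expressed by combining such predicates with `and` / `or`.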

@eugene-yh After decomposing the character '不' with cjklib's getStrokeOrder, the resulting strokes cannot be found in ids.txt?

import cjklib
from cjklib import characterlookup
cjk = characterlookup.CharacterLookup('T')

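...
decomp = cjk.getStrokeOrder(ss[1].decode('utf-8'))

The result is as follows (here ss[1] is the character '不'; Python 2.7 on my machine cannot display Chinese characters, so they appear as hexadecimal escapes):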
ss[1].decode('utf-8')
u'\u4e0d'
decomp
[u'\u31d0', u'\u31d2', u'\u31d1', u'\u31d4']

The final result is:
U+4E0D 不㇐㇒㇑㇔

The characters ㇐㇒㇑㇔ have no corresponding entries in ids.txt. How should this be handled?
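The four code points returned above (U+31D0, U+31D2, U+31D1, U+31D4) all fall in the Unicode CJK Strokes block, U+31C0 to U+31EF, so they are individual strokes rather than characters that would have a decomposition entry. A small check like the following can tell the two apart; treating such code points as leaves of the decomposition is a suggestion, not something the thread confirms, and `is_cjk_stroke` is a hypothetical helper.

```python
def is_cjk_stroke(ch):
    """True if ch lies in the Unicode CJK Strokes block (U+31C0..U+31EF)."""
    return 0x31C0 <= ord(ch) <= 0x31EF


# The stroke order reported for '不' in the issue above.
decomp = ["\u31d0", "\u31d2", "\u31d1", "\u31d4"]
all_strokes = all(is_cjk_stroke(ch) for ch in decomp)
```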

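Question about character shape representation

How can I find, for example, the ids of 人 --> ⿰丿㇏ ?
In cjkvi-ids/ids.txt there is only U+4EBA 人人. How can it be decomposed further? Is there an additional dictionary? Thanks.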

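Is anyone working on correcting Chinese and English together? Let's compare notes

For example, 红米k20pro written as 红米k20po
荣耀V30 written as 荣耀V300 or 荣耀VV30

Does anyone have a solution?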
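One possible direction, offered as a sketch and not as part of FASPell: insertions and deletions such as these are outside what a character-substitution model handles, but they can be matched against a lexicon of known product names by string similarity. The example uses the standard-library `difflib`; the lexicon, the 0.8 cutoff, and the name `suggest` are assumptions for illustration.

```python
import difflib


def suggest(term, lexicon, cutoff=0.8):
    """Return the closest lexicon entry, or the term itself if nothing is close."""
    matches = difflib.get_close_matches(term, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else term


# A made-up product lexicon built from the names in this issue.
lexicon = ["红米k20pro", "荣耀V30"]
fixed = [suggest(t, lexicon) for t in ["红米k20po", "荣耀V300", "荣耀VV30"]]
```

A lexicon lookup like this only works for a closed set of names, so it would complement rather than replace the spell checker.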

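How to avoid changing Simplified characters into Traditional ones

At test time, some Simplified-character errors get corrected into Traditional characters. Thanks.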

some queries for this code

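I've read the code and the README and want to train a model suited to my own business. I have a few questions:

In masked_lm.py, I don't see the GPU being used when training the model?
The README says training requires three steps in order: pre-training the masked language model, fine-tuning the masked language model, and training the CSD filter. I don't quite understand this. When training on my own dataset, do I need to run all three steps in order? Normally, can't one just take the pre-trained model files plus one's own dataset and train a model for one's own business? Why three steps?
According to the README, training the CSD filter is fairly complicated. But I still don't understand how its dataset is prepared.
The original code already includes the OCR_train and OCR_test datasets, whose format is fairly clear. Presumably they also need to be run through create_data.py before training to generate wrong.txt and correct.txt, after which the 3 resulting files (wrong.txt, correct.txt, and mask_probability.sav) can be used directly for training?
According to hints in the code, with the -m parameter set to e and args.train true, model training can be run directly? But there is no validation step?
Any pointers would be appreciated.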

What are the fine-tuning samples for, and how are they used?

···
$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt
···
After the steps above I get the fine-tuning samples tf_examples.tfrecord. I'd like to know what these samples are for and how to use them.

Setting "dump_candidates" to a save path saves the candidate characters.

Is what gets saved here a txt file or some other format?

How to plot the confidence-similarity graph

Hello, how can I plot the confidence-similarity graphs shown in panels ② and ③ of Figure 3 in the paper? Is there any code I can refer to? Thanks.

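Where is the apted.jar file?

Data alignment question

Hello, can this training method be used when the original sentence and the corrected sentence are not aligned? For example, the original sentence is "有着举足轻的作用" and the corrected sentence is "有着举足轻重的作用".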

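Could you share the char_meta.txt file?

Thanks to the authors for open-sourcing this!
Following issues #13 and #5, I spent two days on it and really couldn't produce char_meta.txt. Has anyone actually managed to build it from the replies in those issues?
Could the authors share this file? Thanks.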

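Hello, is there a PyTorch version?

As in the title, thanks.

Discussion

Hello, is there a discussion group for the project? I'm still having some trouble processing the char_meta.txt character file.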

pre-trained model

Hi, I could not find the pre-trained model. The link is the pre-trained bert model.

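The link gives the pre-trained data, not the fully trained data

As in the title.

What is the format of correct.txt and wrong.txt?

What is the format of correct.txt and wrong.txt? Is there a tool to generate them?

The operation in tackle_n_gram_bias

Hello, what is the purpose of the tackle_n_gram_bias strategy? Why does it keep only the candidate with the highest confidence and put all the rest into error_delete_positions? Does it mean that as long as any one result points to a position in error_delete_positions, no correction is made there?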

~

Asking for help: how should the processor be written for fine-tuning?

extension() bug

As described in the extension() method:

"""this function is to resolve the bug that when two adjacent full-width numbers/letters are fed to mlm, the output will be merged as one output, thus lead to wrong alignments."""

But this led to another bug. When I test the sentence "本是几经济报道", BERT predicts "21" for the masked 几, and extension() then splits this into 2 and 1.
When execution reaches line 292 of faspell.py: char = sentences[i][j - 1]

an error occurred: list index out of range
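A defensive alternative, sketched here without access to the code around line 292: instead of splitting a multi-character prediction, which makes the output longer than the input and can push the index past the end of the sentence, treat any prediction that is not exactly one character as "no correction" for that position. `align_predictions` is a hypothetical helper, not the repository's extension() method.

```python
def align_predictions(sentence, predictions):
    """Return one output character per input character.

    predictions holds one string per input position; a prediction that is not
    exactly one character long is replaced by the original character.
    """
    aligned = []
    for original, predicted in zip(sentence, predictions):
        aligned.append(predicted if len(predicted) == 1 else original)
    return aligned


# The example from the report: the model predicts "21" for the masked 几.
result = align_predictions("本是几经济报道", ["本", "是", "21", "经", "济", "报", "道"])
```

This keeps the output the same length as the input, at the cost of never correcting a position where the model emits a merged token.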

Could you provide a download link for the test data, the SIGHAN15 test set?

SIGHAN15测试集 download url

Index out of range when running python faspell.py -m e -t -d

I haven't changed bert_config.json, and in faspell_configs.json I only changed file paths. Line 292 of faspell.py reports a character index out of range. Why?