
dtt-multi-branch's People

Contributors

kaijuml, marco-roberti

dtt-multi-branch's Issues

Issues when running dependency parsing script

After successfully running the script for POS-tagging, I am facing issues with the dependency parsing part. When running the respective command python3 dep_parsing_spacy.py --orig wikibio/train_output.txt --dest wikibio/train_deprel.txt --format sent, I get the following error:

Loading sentences...
[OK] (526575 sentences in 14.364 seconds)
Loading SpaCy parser...
[OK] (1.922 seconds)
Using 32 processes, starting now.
Parsing sentences:   0%|                                                                                                          | 0/526575 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "dep_parsing_spacy.py", line 94, in _deal_with_one_instance
    ret = nlp.parser(Doc(nlp.vocab, sentence))
AttributeError: 'English' object has no attribute 'parser'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dep_parsing_spacy.py", line 111, in <module>
    total=len(sentences)
  File "dep_parsing_spacy.py", line 104, in <listcomp>
    processed_sentences = [item for item in tqdm.tqdm(
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 325, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
AttributeError: 'English' object has no attribute 'parser'

I am using the default spaCy version 3.0.6, but maybe some other version is needed.
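
For reference, here is a minimal sketch of how the failing call could be adapted, assuming the error is caused by spaCy 3.x having removed the nlp.parser shortcut that existed in 2.x; the model name and sample sentence below are placeholders, not taken from the repo:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")  # any English pipeline that includes a parser

# In spaCy 3.x, nlp() accepts a pre-built Doc, so a pre-tokenized sentence can be
# wrapped in a Doc and run through the whole pipeline, parser included.
words = ["walter", "extra", "is", "a", "german", "pilot", "."]
doc = nlp(Doc(nlp.vocab, words=words))
print([(tok.text, tok.dep_, tok.head.i) for tok in doc])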

Unable to successfully run POS-tagging cmd

I am trying to run the POS-tagging command python3 pos_tagging.py --do_train --do_tagging train --gpus 0 1 --dataset_folder wikibio listed in the README of the data folder, but it doesn't complete successfully. I get the following error:

Using the following devices: [1,2]
Using the following environment variables, please edit the script if needed
CUDA_VISIBLE_DEVICES=1,2
Using the following arguments, please edit the script if needed
--data_dir ./pos --model_type bert --labels ./pos/labels.txt --model_name_or_path bert-base-uncased --output_dir ./pos/trained --max_seq_length 256 --num_train_epochs 3 --per_gpu_train_batch_size 32 --save_steps 750
Traceback (most recent call last):
  File "run_ner.py", line 70, in <module>
    (),
  File "run_ner.py", line 68, in <genexpr>
    for conf in (BertConfig, RobertaConfig, DistilBertConfig, CamembertConfig, XLMRobertaConfig)
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'
Loading examples from train
Traceback (most recent call last):
  File "run_ner.py", line 70, in <module>
    (),
  File "run_ner.py", line 68, in <genexpr>
    for conf in (BertConfig, RobertaConfig, DistilBertConfig, CamembertConfig, XLMRobertaConfig)
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'
Traceback (most recent call last):
  File "pos_tagging.py", line 225, in <module>
    do_tagging(args.pos_folder, args.dataset_folder, args.do_tagging, gpus, args.max_seq_length, args.split_size)
  File "pos_tagging.py", line 163, in do_tagging
    run_script(examples, pos_folder, dest, gpus, max_seq_length)
  File "pos_tagging.py", line 139, in run_script
    open(orig, mode="r", encoding='utf8') as origfile:
FileNotFoundError: [Errno 2] No such file or directory: './pos/trained/test_predictions.txt'

The first error happens when running the run_ner.py script, and I think it is related to the version of the transformers package being used. As I have tested, when the version is >= 3.x.x, the error AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map' is thrown. When a lower version of the package is used, say 2.0.0, the error ImportError: cannot import name 'CamembertConfig' from 'transformers' (/opt/conda/lib/python3.7/site-packages/transformers/__init__.py) is thrown instead. Maybe a specific version of the package should be pinned with which neither error happens.
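
As a quick sanity check (just a sketch based on the two errors above, not a known-good pin), one can verify whether an installed transformers release satisfies both requirements of this run_ner.py:

import transformers
from transformers import BertConfig

print(transformers.__version__)
# run_ner.py relies on this attribute, which newer transformers releases dropped
print(hasattr(BertConfig, "pretrained_config_archive_map"))
try:
    # ...and it also imports CamembertConfig, which very old releases don't provide
    from transformers import CamembertConfig
    print("CamembertConfig importable")
except ImportError:
    print("CamembertConfig missing")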

The second and more important error says that the file ./pos/trained/test_predictions.txt is not found. Where should I find this file? Do I need to get it from somewhere else? Thanks in advance :)

About loading models in pos_tagging.py

I have a question about the implementation of Part-of-Speech tagging.
The following command will tag the POS.

python3 pos_tagging.py --do_train --do_tagging train --gpus 0 1 --dataset_folder wikibio

--do_train will load the pre-trained model bert-base-uncased, perform fine-tuning, and save the model in ./pos/trained.
But why does --do_tagging load --model_name_or_path bert-base-uncased in run_script instead of loading the stored model in ./pos/trained?

cmd = " ".join([
    f'CUDA_VISIBLE_DEVICES={gpus}',
    'python run_ner.py',
    f'--data_dir {pos_folder}/',
    '--model_type bert',
    f'--labels {os.path.join(pos_folder, "labels.txt")}',
    '--model_name_or_path bert-base-uncased',
    f'--output_dir {os.path.join(pos_folder, "trained")}',
    f'--max_seq_length {max_seq_length}',
    '--do_predict',
    '--per_gpu_eval_batch_size 64'
])
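
For comparison, here is a hedged sketch of what I expected instead (not a confirmed fix), with the predict run pointed at the checkpoint saved by --do_train; the values assigned to pos_folder, gpus and max_seq_length are only assumed examples:

import os

pos_folder = "./pos"      # assumed, mirroring the logged --data_dir ./pos
gpus = "0,1"              # assumed example value
max_seq_length = 256      # assumed, mirroring the logged --max_seq_length 256

cmd = " ".join([
    f'CUDA_VISIBLE_DEVICES={gpus}',
    'python run_ner.py',
    f'--data_dir {pos_folder}/',
    '--model_type bert',
    f'--labels {os.path.join(pos_folder, "labels.txt")}',
    # load the fine-tuned model saved by --do_train instead of bert-base-uncased
    f'--model_name_or_path {os.path.join(pos_folder, "trained")}',
    f'--output_dir {os.path.join(pos_folder, "trained")}',
    f'--max_seq_length {max_seq_length}',
    '--do_predict',
    '--per_gpu_eval_batch_size 64'
])
print(cmd)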

Error when running format weight script | broken download link | assertion error during the preprocessing step

I am trying to run the format weights script and I get the following errors.

Initially, I can't download the file given by the wget https://datacloud.di.unito.it/index.php/s/KPr9HnbMyNWqRdj/download command because I get a 404 Not Found error.

Moreover, I thought of slightly changing the next command so that it uses the preprocessed files in data/wikibio instead of data/download. The changed command looks like this: python3 data/format_weights.py --orig data/wikibio --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1 --eos_weights 1 0, and when I execute it, I get the error below:

Writting formatted wieghts to: /workspace/src/repos/dtt-multi-branch/train_weights.txt
Reading orig file. Can take up to a minute.
WARNING: path is data/wikibio but format is txt (by default).
Traceback (most recent call last):
  File "data/format_weights.py", line 240, in <module>
    args.orig, func=lambda x,y: (x, float(y)))
  File "data/format_weights.py", line 239, in <listcomp>
    sent for sent in TaggedFileIterable.from_filename(
  File "/workspace/src/repos/dtt-multi-branch/data/utils.py", line 124, in __getitem__
    return next(self)
  File "/workspace/src/repos/dtt-multi-branch/data/utils.py", line 134, in __next__
    return next(self._iterable)
  File "/opt/conda/lib/python3.7/site-packages/more_itertools/more.py", line 2670, in __next__
    item = next(self._source)
  File "/workspace/src/repos/dtt-multi-branch/data/utils.py", line 173, in read_file
    with open(path, mode='r', encoding='utf8') as f:
IsADirectoryError: [Errno 21] Is a directory: 'data/wikibio'

I am not sure what the next course of action should be.
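
A small sketch of what the traceback boils down to (the path is just the one from my command; the directory listing is only illustrative):

import os

orig = "data/wikibio"
# TaggedFileIterable.from_filename eventually calls open() on --orig,
# so --orig has to be a single weights file rather than a directory.
if os.path.isdir(orig):
    print(f"{orig} is a directory; its contents are: {os.listdir(orig)}")
else:
    with open(orig, mode="r", encoding="utf8") as f:
        print(f.readline())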

'eos_weights' args missing

When you run
python3 data/format_weights.py --orig download --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1

the argument --eos_weights seems to be required. What value do you suggest?

Thank you
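
Not an official recommendation, but for reference the command quoted in another issue above passes two values for the flag:

python3 data/format_weights.py --orig download --dest train_weights.txt --strategy thresholds --thresholds 0.4 --normalize --weight_regularization 1 --eos_weights 1 0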

POS tagging part: file not found

After running python3 pos_tagging.py --do_train --do_tagging train --gpus 0 1 --dataset_folder wikibio,
I get the error No such file or directory: './pos/trained/test_predictions.txt'.

Typo in sentence scoring script

When running the sentence scoring script, there is a typo in one of the import statements here. Since we are already inside the data folder when running the command, the import should rather be from utils import FileIterable, TaggedFileIterable in order to avoid import errors.
