tasksource's Issues

TypeError: Unhashable type: 'list'

In tasksource version 0.0.39, loading MMLU with a list of config names fails:

#mmlu = MultipleChoice('question',labels='answer',choices_list='choices',splits=['validation','dev','test'],
#    dataset_name="tasksource/mmlu",
#    config_name=get_dataset_config_names("tasksource/mmlu")
#)
from tasksource.tasks import mmlu

dataset = mmlu.load()

The traceback is:

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/tasksource/preprocess.py", line 37, in load
    return self(datasets.load_dataset(self.dataset_name,self.config_name))
  File ".venv/lib/python3.10/site-packages/datasets/load.py", line 2106, in load_dataset
    builder_instance = load_dataset_builder(
  File ".venv/lib/python3.10/site-packages/datasets/load.py", line 1829, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File ".venv/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File ".venv/lib/python3.10/site-packages/datasets/builder.py", line 571, in _create_builder_config
    is_custom = (config_id not in self.builder_configs) and config_id != "default"
TypeError: unhashable type: 'list'

The cause seems to be that `config_name` is not expected to be a `List[str]` such as the result of get_dataset_config_names("tasksource/mmlu") (i.e. ['abstract_algebra', 'anatomy', 'astronomy', ...]).
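The failure can be reproduced without the datasets library at all: a Python list is unhashable, so the dict membership test in `_create_builder_config` raises as soon as `config_id` is a list. A minimal sketch (the `builder_configs` contents here are illustrative):

```python
# Minimal reproduction of the root cause: `datasets` evaluates
# `config_id not in self.builder_configs`, and a list cannot be hashed,
# so the dict membership test itself raises TypeError.
builder_configs = {"abstract_algebra": None, "anatomy": None}  # illustrative

assert "abstract_algebra" in builder_configs  # a single str config works

config_id = ["abstract_algebra", "anatomy"]   # a List[str], as in the issue
try:
    config_id in builder_configs
    raised = False
except TypeError as err:
    raised = "unhashable type: 'list'" in str(err)
assert raised
```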

No such errors occur for the examples shown in README.md.

Loading MMLU does not give correct answers

Current Behaviour

When loading the dataset as in the example, the labels do not appear to match the original answers. For example:

from tasksource import MultipleChoice

mmlu = MultipleChoice(
    'question',
    choices_list='choices',
    labels='answer',
    splits=['validation','dev','test'],
    dataset_name='tasksource/mmlu',
    config_name="high_school_computer_science",
)

dataset = mmlu.load()

for datum in dataset['test']:
    print(datum)
    break

The output is the following, which does not appear to contain the correct answer:

{'inputs': 'Let x = 1. What is x << 3 in Python 3?', 'labels': 0, 'choice0': '8', 'choice1': '1', 'choice2': '3', 'choice3': '16'}

Expected Behaviour

The output should give the correct answer, i.e. the answer from the Hugging Face dataset:

from datasets import load_dataset

dataset = load_dataset("tasksource/mmlu", "high_school_computer_science")

for datum in dataset["test"]:
    print(datum)
    break

This yields the correct answer:

{'question': 'Let x = 1. What is x << 3 in Python 3?', 'choices': ['1', '3', '8', '16'], 'answer': 2}
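The two records above can also be compared mechanically. Assuming tasksource reorders the choices (note the `choice0..3` columns are in a different order than the HF `choices` list), the label may still be consistent once both indices are resolved to answer strings:

```python
# Compare the answer *strings*, not the indices: the tasksource record is
# consistent with the HF record iff the choice at index `labels` equals
# the HF choice at index `answer`. Records copied from the output above.
ts = {'inputs': 'Let x = 1. What is x << 3 in Python 3?', 'labels': 0,
      'choice0': '8', 'choice1': '1', 'choice2': '3', 'choice3': '16'}
hf = {'question': 'Let x = 1. What is x << 3 in Python 3?',
      'choices': ['1', '3', '8', '16'], 'answer': 2}

ts_answer = ts[f"choice{ts['labels']}"]   # resolve tasksource label
hf_answer = hf['choices'][hf['answer']]   # resolve HF label

assert ts_answer == hf_answer == '8'      # both resolve to the same string
```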

super_glue/multirc is bugged

Hi, thanks for the great collection of datasets.
But it seems that not all datasets in it are correctly preprocessed. MultiRC requires the paragraph, the question, and each individual answer to be concatenated together for classification, but here only the first field (the question itself) is taken, without the rest of the data. In tasks.py:
super_glue___multirc = Classification(sentence1="question", labels="label")
And during load we get:

from tasksource import list_tasks, load_task
ddf = load_task('super_glue/multirc')

index  sentence1                                                      labels
0      What did the high-level effort to persuade Pakistan include?  0
1      What did the high-level effort to persuade Pakistan include?  0
2      What did the high-level effort to persuade Pakistan include?  1
3      What did the high-level effort to persuade Pakistan include?  1
4      What did the high-level effort to persuade Pakistan include?  1

This input does not make sense, and no model can be trained on it in a meaningful way. The code could be replaced with something like the following to put all the fields together (following the WiC example):

super_glue___multirc = Classification(
    sentence1=cat(["paragraph", "question", "answer"], " : "),
    labels='label'
)
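A plain-Python sketch of what the proposed `cat(...)` field concatenation would produce for one record (the field values below are illustrative, not taken from the dataset):

```python
# Illustration of the proposed fix: join paragraph, question, and answer
# into a single input string, separated by " : ", so all three fields
# reach the classifier instead of the question alone.
record = {
    "paragraph": "A high-level effort to persuade Pakistan began ...",
    "question": "What did the high-level effort to persuade Pakistan include?",
    "answer": "Economic aid",
    "label": 1,
}

sentence1 = " : ".join(record[f] for f in ("paragraph", "question", "answer"))
print(sentence1)
```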

Error when using model in HF pipeline

Hi,
Thanks for this great work.

I'm getting a type error when trying to use the model sileod/deberta-v3-base-tasksource-nli in a HF pipeline.

Could you please check?

Code to reproduce:

from transformers import pipeline

# Fails: TypeError: _batch_encode_plus() got an unexpected keyword argument 'candidate_labels'
model = "sileod/deberta-v3-base-tasksource-nli"

# Works:
# model = "facebook/bart-large-mnli"

pipe = pipeline(model=model)

print(pipe("I have a problem with my iphone that needs to be resolved asap!",
           candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"]))

The full exception log is:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    print(pipe("I have a problem with my iphone that needs to be resolved asap!",
  File "venv/lib/python3.9/site-packages/transformers/pipelines/text_classification.py", line 155, in __call__
    result = super().__call__(*args, **kwargs)
  File "venv/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1119, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "venv/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1125, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "venv/lib/python3.9/site-packages/transformers/pipelines/text_classification.py", line 179, in preprocess
    return self.tokenizer(inputs, return_tensors=return_tensors, **tokenizer_kwargs)
  File "venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2654, in _call_one
    return self.encode_plus(
  File "venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2727, in encode_plus
    return self._encode_plus(
  File "venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
TypeError: _batch_encode_plus() got an unexpected keyword argument 'candidate_labels'

As mentioned in the code above - the model facebook/bart-large-mnli works fine (uncomment the appropriate line to verify).
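One workaround worth trying (this is a hypothesis about the root cause, not confirmed): `pipeline(model=...)` may auto-resolve this checkpoint to plain text-classification, in which case `candidate_labels` gets forwarded to the tokenizer. Pinning the task explicitly routes the call through the zero-shot pipeline instead:

```python
from transformers import pipeline

# Hypothesis: without an explicit task, pipeline() dispatches this model
# to TextClassificationPipeline, so `candidate_labels` leaks into the
# tokenizer call. Naming the task selects ZeroShotClassificationPipeline.
pipe = pipeline("zero-shot-classification",
                model="sileod/deberta-v3-base-tasksource-nli")

print(pipe("I have a problem with my iphone that needs to be resolved asap!",
           candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"]))
```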

Output of transformers-cli env:

- `transformers` version: 4.29.2
- Platform: macOS-13.3.1-arm64-arm-64bit
- Python version: 3.9.6
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Thanks!

Using tasksource for question answering datasets

Hi, I am really new to the AI field. I wanted to train an ELECTRA-large model on several question answering tasks. Can I do that using tasksource? I am confused about which tasks are applicable to tasksource. Does it only support the 600+ tasks it includes? What if I want to train on some other dataset? There are no blogs or resources available on tasksource.

Please, help.

Feature request: select tasks by language

Currently, the package doesn't allow selecting tasks by language.
Many people developing models for a specific language (or set of languages) would like to access task data for that language, so implementing this functionality would be a great help.
