
k2's People

Contributors

davendw49, eltociear, richardscottoz


k2's Issues

🆘 I need the geosignal.json dataset for fine-tuning.

The amount of data to translate is too large. Is there a Chinese version of geosignal.json? I am badly in need of the Chinese version of the dataset. It's fine if there isn't one. Sorry to take up an issue. Thanks a lot. 🙏

Issues in evaluation code.

It looks like the provided code "run_eval.py" is not consistent with the provided benchmark dataset. I've encountered a few issues and would like to know how to resolve them:

  1. args.geobenchmark is set to "npee", but there is no such benchmark file. I've read in another issue that you asked to replace it with "geobenchmark_npee.json". However, I am not sure how the code will run when we pass the apstudy.json benchmark, since the code is only written for npee.

  2. In the following image, regarding the code line "for the_answer_is in ['wa', 'woa']": can you please explain what 'wa' and 'woa' are? They are not mentioned anywhere in the code base or in the npee dataset.

  3. In the same image, regarding the code line "source = source_target['source'][question_type][the_answer_is]": if you load the npee.json benchmark file as JSON, then source_target['source'] raises a KeyError, since only six keys are available (['noun', 'choice', 'completion', 'tf', 'qa', 'discussion']), so this key seems to be wrong.

  4. Moreover, even if you say "source_target[question_type][the_answer_is]" is the correct format, "the_answer_is" still raises a KeyError, since only ['question', 'answer'] exist in the "choice" element of the npee file. What's the right format?
    (screenshot from the original issue)

  5. How do you evaluate and test the apstudy.json benchmark, given that the code is not written for it?
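The indexing problem in points 3 and 4 can be reproduced without the real benchmark file. The stand-in dictionary below mirrors the six top-level keys and the 'question'/'answer' fields reported in this issue; all values are placeholders, not actual benchmark data, and the "fixed" indexing at the end is a guess based on that structure, not a confirmed fix.

```python
import json

# Minimal stand-in for geobenchmark_npee.json, using the six top-level
# keys reported in this issue. Values are placeholders, not real data.
source_target = {
    "noun": {"question": [], "answer": []},
    "choice": {"question": ["Which rock is igneous?"], "answer": ["A"]},
    "completion": {"question": [], "answer": []},
    "tf": {"question": [], "answer": []},
    "qa": {"question": [], "answer": []},
    "discussion": {"question": [], "answer": []},
}

question_type = "choice"

# The line from run_eval.py fails: there is no 'source' key at the top level.
try:
    source = source_target["source"][question_type]
except KeyError as e:
    print("KeyError:", e)

# Indexing by question type directly does work with this structure:
questions = source_target[question_type]["question"]
print(questions)
```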

input_ls.json

Your generate example references this file, but there doesn't seem to be an example file in the repository?

Maybe include one with a few sample geoscience questions (held out from training) so people can check whether they get results similar to yours.

'What is the most common igneous rock?' or things like that, perhaps?
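Something along these lines would do. Since the real schema of input_ls.json is not documented, the field names below ("instruction", "input") are assumptions in a common instruction-tuning style, not the project's actual format.

```python
import json

# Hypothetical input_ls.json with a couple of held-out geoscience
# questions. The schema here is a guess, not K2's documented format.
sample = [
    {"instruction": "What is the most common igneous rock?", "input": ""},
    {"instruction": "Briefly describe how basalt forms.", "input": ""},
]

with open("input_ls.json", "w") as f:
    json.dump(sample, f, indent=2)
```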

Due to the dependency conflict in k2.yaml, running the k2 model is impossible. Please update the file.

There is an error when I try to run conda env create -f k2.yaml as shown in the README file. These are the errors that pip showed:

ERROR: Cannot install datasets==2.11.0, evaluate==0.4.0, gradio-client==0.1.3, gradio==3.27.0, huggingface-hub==0.13.3 and transformers==4.32.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested huggingface-hub==0.13.3
    datasets 2.11.0 depends on huggingface-hub<1.0.0 and >=0.11.0
    evaluate 0.4.0 depends on huggingface-hub>=0.7.0
    gradio 3.27.0 depends on huggingface-hub>=0.13.0
    gradio-client 0.1.3 depends on huggingface-hub>=0.13.0
    transformers 4.32.0 depends on huggingface-hub<1.0 and >=0.15.1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
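Reading the resolver output above, the pin that cannot be satisfied is huggingface-hub==0.13.3, since transformers 4.32.0 requires huggingface-hub>=0.15.1. One possible fix is to loosen that single pin in the pip section of k2.yaml; the fragment below is a sketch, and the surrounding contents of the real k2.yaml are assumed, not copied from the repository.

```
# Sketch of the pip section of k2.yaml with the conflicting pin relaxed.
# Only the huggingface-hub line changes: transformers 4.32.0 needs
# >=0.15.1, and datasets/gradio accept anything below 1.0.
  - pip:
      - datasets==2.11.0
      - evaluate==0.4.0
      - gradio==3.27.0
      - gradio-client==0.1.3
      - huggingface-hub>=0.15.1,<1.0
      - transformers==4.32.0
```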


Example website generation parameters?

For your examples, it would be interesting to know what generation parameters you set for each query (and whether they are the same), so results can be compared.

e.g. I get somewhat different results, or outright failures, with the defaults.
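For reference, these are the kinds of knobs in question. The names below are standard Hugging Face transformers generate() arguments, but the values are illustrative guesses, not the settings the k2 demo website actually uses.

```python
# Typical sampling parameters passed to model.generate() in the
# transformers library. Values here are placeholders, not the
# k2 demo's actual configuration.
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.7,        # lower = more deterministic output
    "top_p": 0.9,              # nucleus sampling cutoff
    "top_k": 50,               # restrict to the 50 most likely tokens
    "max_new_tokens": 512,     # cap on generated length
    "repetition_penalty": 1.1, # discourage verbatim loops
}

# Usage (assuming `model` and `input_ids` already exist):
# output = model.generate(input_ids, **gen_kwargs)
```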

Tokenizer not available error

(transformers) ubuntu@:~/data/k2$ python -m apply_delta --base-model-path decapoda-research/llama-7b-hf --target-model-path /home/ubuntu/geollama/ --delta-path daven3/k2_fp_delta
Loading the delta weights from daven3/k2_fp_delta
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/transformers/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/transformers/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/data/k2/apply_delta.py", line 165, in <module>
    apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
  File "/home/ubuntu/data/k2/apply_delta.py", line 127, in apply_delta
    delta_tokenizer = AutoTokenizer.from_pretrained(delta_path, use_fast=False)
  File "/home/ubuntu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 720, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ubuntu/anaconda3/envs/transformers/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'daven3/k2_fp_delta'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'daven3/k2_fp_delta' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Now, because you have a different sort of model, it perhaps can't find a tokenizer from the path name. Do you actually have this tokenizer available so you could upload it?
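A possible local workaround while the tokenizer files are missing from the delta repo is to fall back to the base model's tokenizer. The helper below is a generic sketch, not part of apply_delta.py; the `load` argument stands in for something like AutoTokenizer.from_pretrained, and whether the base tokenizer is actually compatible with the delta weights is an assumption.

```python
def load_tokenizer_with_fallback(load, delta_path, base_path):
    """Try to load a tokenizer from the delta repo; if that repo has no
    tokenizer files (OSError, as in the traceback above), fall back to
    the base model's tokenizer.

    `load` stands in for a loader such as AutoTokenizer.from_pretrained.
    """
    try:
        return load(delta_path)
    except OSError:
        return load(base_path)


# Demonstration with a stub loader that mimics the failing delta repo:
def fake_load(path):
    if path == "daven3/k2_fp_delta":
        raise OSError("Can't load tokenizer")
    return f"tokenizer-from-{path}"


tok = load_tokenizer_with_fallback(
    fake_load, "daven3/k2_fp_delta", "decapoda-research/llama-7b-hf"
)
print(tok)  # tokenizer-from-decapoda-research/llama-7b-hf
```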
