

English | Chinese

IEPile: A Large-Scale Information Extraction Corpus

This is the official repository for IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Datasets | Paper | Usage | Limitations | Statement & License | Citation

Please note that our IEPile may undergo updates (we will inform you upon their release). It is recommended to utilize the most current version.

News

  • [2024/05] The paper IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus is accepted by ACL 2024 main conference.
  • [2024/04] We release a new bilingual (Chinese and English) schema-based information extraction model called OneKE based on Chinese-Alpaca-2-13B.
  • [2024/02] We release a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction dataset named IEPile, along with two models trained with IEPile, baichuan2-13b-iepile-lora and llama2-13b-iepile-lora.
  • [2023/10] We released a new bilingual (Chinese and English) theme-based Information Extraction (IE) instruction dataset named InstructIE, along with its accompanying paper.
  • [2023/08] We introduced a dedicated 13B model for Information Extraction (IE), named knowlm-13b-ie.
  • [2023/05] We initiated an instruction-based Information Extraction project.

1.Introduction

IEPile dataset download links: Google Drive | Hugging Face | WiseModel | ModelScope

Please be aware that the data contained in the dataset links provided above has already excluded any part related to the ACE2005 dataset. Should you require access to the unfiltered, complete dataset and have successfully obtained the necessary permissions, please do not hesitate to contact us via email at [email protected] or [email protected]. We will provide the complete dataset resources for your use.

Model download links for LLaMA2-IEPile | Baichuan2-IEPile | OneKE: zjunlp/llama2-13b-iepile-lora | zjunlp/baichuan2-13b-iepile-lora | zjunlp/OneKE

[Figure: statistics of the datasets included in IEPile]

We have collected and cleaned existing Information Extraction (IE) datasets, integrating a total of 26 English IE datasets and 7 Chinese IE datasets. As shown in the Figure, these datasets cover multiple domains including general, medical, financial, and others.

In this study, we adopted the proposed "schema-based batched instruction generation strategy" to create a large-scale, high-quality, bilingual (Chinese and English) IE instruction tuning dataset named IEPile, containing approximately 0.32B tokens.

Based on IEPile, we fine-tuned the Baichuan2-13B-Chat and LLaMA2-13B-Chat models using the LoRA technique. Experiments demonstrate that the fine-tuned Baichuan2-IEPile and LLaMA2-IEPile models perform remarkably well in fully supervised settings and achieve improvements in zero-shot information extraction tasks.

[Figure: zero-shot IE results on English benchmarks]

[Figure: zero-shot IE results on Chinese benchmarks]

Supervision Results

[Figures: fully supervised results on NER, RE, and EE benchmarks]

2.Data

2.1Construction of IEPile

We concentrate on instruction-based IE, thus the construction of schema within the instructions is crucial. This is because they reflect the specific extraction requirements and are dynamically variable. Previous approaches with existing IE datasets often employ a rather extensive schema processing strategy when constructing instructions, utilizing all schemas within a label set for instruction building, raising two potential issues:

  1. Inconsistent numbers of schema queries per instruction between training and evaluation. For example, a model trained with roughly 20 schema queries per instruction will perform worse when tested with 10 or 30 queries, even if the training and evaluation schemas are similar in content.
  2. Inadequate differentiation among schemas within an instruction. For example, semantically similar schemas such as "layoffs", "depart", and "dismissals" can present co-occurrence ambiguities that confuse LLMs; such schemas should co-occur more frequently within an instruction so the model learns to distinguish them.

Therefore, we introduce the following solutions: 1) Hard Negative Schema; and 2) Batched Instruction Generation.

[Figure: overview of IEPile construction with the hard negative schema dictionary and batched instruction generation]

Hard Negative Schema

Assume that a dataset $\mathcal{D}$ has a full label set $L$. For a given text $S$, the schemas present in its annotation constitute the positive schema set $Pos_L$, while others form the negative schema set $Neg_L$. In our analysis, we discover that the primary cause of model misjudgment stems from the semantic ambiguity of the schema. In traditional approaches, $Neg_L$ is simply defined as $L - Pos_L$. However, this overlooks a critical aspect: it is important to pay special attention to negative schemas that are semantically close to positive schemas. Inspired by the theory of contrastive learning, we construct a hard negative schema dictionary $\mathcal{K}$, where each key represents a unique schema and the associated value is a collection of schemas that are semantically similar to the key schema. Based on this, we define the hard negative schema set as $Hard_L = \mathcal{K}[Pos_L]$, and the other negative schema set as $Other_L = L - Pos_L - Hard_L$. The final $Neg_L$ is constituted by $Hard_L$ and a small subset of $Other_L$. Through this strategy, we not only present semantically similar schemas more frequently within the instruction but also reduce the number of training instances without sacrificing model performance.
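Below is a minimal sketch, not the repository's implementation, of how such a negative schema set could be assembled. Here hard_dict stands in for the dictionary $\mathcal{K}$, and the function name, other_ratio, and the example labels are purely illustrative:

```python
import random

def build_negative_schemas(full_label_set, positive_schemas, hard_dict, other_ratio=0.2, seed=0):
    """Assemble Neg_L = Hard_L plus a small random subset of Other_L."""
    pos = set(positive_schemas)
    # Hard_L: schemas semantically close to any positive schema (excluding the positives themselves)
    hard = {s for p in pos for s in hard_dict.get(p, [])} - pos
    # Other_L: schemas that are neither positive nor hard negative
    other = [s for s in full_label_set if s not in pos and s not in hard]
    random.Random(seed).shuffle(other)
    keep = max(1, int(len(other) * other_ratio)) if other else 0
    return sorted(hard) + other[:keep]

# Hypothetical dictionary K: "layoffs", "depart" and "dismissals" are mutually confusable.
hard_dict = {"layoffs": ["depart", "dismissals"],
             "depart": ["layoffs", "dismissals"],
             "dismissals": ["layoffs", "depart"]}
labels = ["layoffs", "depart", "dismissals", "acquisition", "ipo", "lawsuit"]
print(build_negative_schemas(labels, ["layoffs"], hard_dict))  # hard negatives plus one other schema
```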

Batched Instruction Generation

Subsequently, we obtain the final schema set $L' = Pos_L + Neg_L$. We employ a batched instruction generation method, limiting the number of schemas queried in each instruction to $split_num$, which ranges from 4 to 6. Therefore, $L'$ is divided into $\lceil |L'| / split_num \rceil$ batches for querying, with each batch querying at most $split_num$ schemas. Consequently, even if the number of schemas queried during evaluation differs from that during training, the batched mechanism keeps every query at roughly $split_num$ schemas, thereby mitigating the decline in generalization performance.
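The batching step itself is simple; the sketch below (illustrative only, with the task description abbreviated) shows how the queried schema list $L'$ could be split into instructions of at most $split_num$ schemas, following the data format of Section 2.2:

```python
import json

def batch_instructions(schemas, input_text, task_description, split_num=4):
    """Split the queried schema list L' into chunks of at most split_num schemas,
    producing one instruction (a JSON string) per chunk."""
    instructions = []
    for i in range(0, len(schemas), split_num):
        instructions.append(json.dumps({
            "instruction": task_description,
            "schema": schemas[i:i + split_num],
            "input": input_text,
        }, ensure_ascii=False))
    return instructions

schemas = ["person", "organization", "location", "else", "layoffs", "depart"]
desc = "You are an expert in named entity recognition. ..."  # abbreviated task description
for ins in batch_instructions(schemas, "284 Robert Allenby ( Australia ) ...", desc, split_num=4):
    print(ins)  # two instructions: one with 4 schemas, one with the remaining 2
```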

2.2Data Format of IEPile

Each instance in IEPile contains four fields: task, source, instruction, and output.

Below is a data example:

{
    "task": "NER", 
    "source": "CoNLL2003", 
    "instruction": "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}", 
    "output": "{\"person\": [\"Robert Allenby\", \"Allenby\", \"Miguel Angel Martin\"], \"organization\": [], \"else\": [], \"location\": [\"Australia\", \"Spain\"]}"
}

The data instance belongs to the NER task, is part of the CoNLL2003 dataset, the schema list to be extracted includes ["person", "organization", "else", "location"], and the text to be extracted from is "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )". The output is {"person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"], "organization": [], "else": [], "location": ["Australia", "Spain"]}.

Note that the order of schemas in the output is consistent with the order in the instruction.

More task examples:
{
  "task": "EE", 
  "source": "PHEE", 
  "instruction": "{\"instruction\": \"You are an expert in event extraction. Please extract events from the input that conform to the schema definition. Return an empty list for events that do not exist, and return NAN for arguments that do not exist. If an argument has multiple values, please return a list. Respond in the format of a JSON string.\", \"schema\": [{\"event_type\": \"potential therapeutic event\", \"trigger\": true, \"arguments\": [\"Treatment.Time_elapsed\", \"Treatment.Route\", \"Treatment.Freq\", \"Treatment\", \"Subject.Race\", \"Treatment.Disorder\", \"Effect\", \"Subject.Age\", \"Combination.Drug\", \"Treatment.Duration\", \"Subject.Population\", \"Subject.Disorder\", \"Treatment.Dosage\", \"Treatment.Drug\"]}, {\"event_type\": \"adverse event\", \"trigger\": true, \"arguments\": [\"Subject.Population\", \"Subject.Age\", \"Effect\", \"Treatment.Drug\", \"Treatment.Dosage\", \"Treatment.Freq\", \"Subject.Gender\", \"Treatment.Disorder\", \"Subject\", \"Treatment\", \"Treatment.Time_elapsed\", \"Treatment.Duration\", \"Subject.Disorder\", \"Subject.Race\", \"Combination.Drug\"]}], \"input\": \"Our findings reveal that even in patients without a history of seizures, pregabalin can cause a cortical negative myoclonus.\"}", 
  "output": "{\"potential therapeutic event\": [], \"adverse event\": [{\"trigger\": \"cause \", \"arguments\": {\"Subject.Population\": \"NAN\", \"Subject.Age\": \"NAN\", \"Effect\": \"cortical negative myoclonus\", \"Treatment.Drug\": \"pregabalin\", \"Treatment.Dosage\": \"NAN\", \"Treatment.Freq\": \"NAN\", \"Subject.Gender\": \"NAN\", \"Treatment.Disorder\": \"NAN\", \"Subject\": \"patients without a history of seizures\", \"Treatment\": \"pregabalin\", \"Treatment.Time_elapsed\": \"NAN\", \"Treatment.Duration\": \"NAN\", \"Subject.Disorder\": \"NAN\", \"Subject.Race\": \"NAN\", \"Combination.Drug\": \"NAN\"}}]}"
}

{
  "task": "RE", 
  "source": "NYT11", 
  "instruction": "{\"instruction\": \"You are an expert in relationship extraction. Please extract relationship triples that match the schema definition from the input. Return an empty list for relationships that do not exist. Please respond in the format of a JSON string.\", \"schema\": [\"neighborhood of\", \"nationality\", \"children\", \"place of death\"], \"input\": \" In the way New Jersey students know that Thomas Edison 's laboratory is in West Orange , the people of Colma know that Wyatt Earp 's ashes are buried at Hills of Eternity , a Jewish cemetery he was n't ; his wife was , and that Joe DiMaggio is at Holy Cross Cemetery , where visitors often lean bats against his gravestone . \"}", 
  "output": "{\"neighborhood of\": [], \"nationality\": [], \"children\": [], \"place of death\": [{\"subject\": \"Thomas Edison\", \"object\": \"West Orange\"}]}"
}

Below are the explanations for each field:

| Field | Description |
| :--- | :--- |
| task | The task to which the instance belongs, one of the five types (NER, RE, EE, EET, EEA). |
| source | The dataset to which the instance belongs. |
| instruction | The instruction for inputting into the model, processed into a JSON string via json.dumps, including three parts: "instruction", "schema", and "input". |
| output | The output in the format of a dictionary's JSON string, where the key is the schema and the value is the extracted content. |

The instruction format of IEPile adopts a JSON-like string structure, essentially a dictionary-type string composed of the following three main components: (1) 'instruction': the task description, which outlines the task to be performed (one of NER, RE, EE, EET, EEA); (2) 'schema': a list of schemas to be extracted (entity types, relation types, or event types); (3) 'input': the text from which information is to be extracted.
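Since both instruction and output are JSON strings, they can be unpacked with json.loads. The small check below (illustrative, with the instruction text and input abbreviated) also confirms the point above: the keys of the parsed output follow the order of the schema list:

```python
import json

instance = {
    "task": "NER",
    "source": "CoNLL2003",
    "instruction": '{"instruction": "You are an expert in named entity recognition. ...", '
                   '"schema": ["person", "organization", "else", "location"], '
                   '"input": "284 Robert Allenby ( Australia ) ..."}',
    "output": '{"person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"], '
              '"organization": [], "else": [], "location": ["Australia", "Spain"]}',
}

prompt = json.loads(instance["instruction"])   # dict with "instruction", "schema", "input"
result = json.loads(instance["output"])        # dict keyed by schema

assert list(result.keys()) == prompt["schema"]  # output keys mirror the schema order
print(prompt["schema"], "->", result)
```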

The file instruction.py provides instructions for various tasks.

3.Using IEPile to Train Models

3.1Environment

Before you begin, make sure to create an appropriate virtual environment following the instructions below:

conda create -n IEPile python=3.9   # Create a virtual environment
conda activate IEPile               # Activate the environment
pip install -r requirements.txt     # Install dependencies

3.2Download Data and Models

IEPile dataset download links: Google Drive | Hugging Face

IEPile
├── train.json    # Training set
└── dev.json      # Validation set

Here are some of the models supported by the code in this repository: [llama, alpaca, vicuna, zhixi, falcon, baichuan, chatglm, qwen, moss, openba]

mkdir data         # Put data here
mkdir models       # Put base models here
mkdir results      # Put prediction results here
mkdir lora         # Put LoRA fine-tuning results here

Data should be placed in the ./data directory.

3.3LoRA Fine-tuning

Important Note: All the commands below should be executed within the IEPile directory. For example, if you want to run the fine-tuning script, you should use the following command: bash ft_scripts/fine_llama.bash. Please ensure your current working directory is correct. Please make sure that each entry in the training/validation files includes the instruction, output fields.

output_dir='lora/llama2-13b-chat-v1'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/llama2-13b-chat' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --bf16 
  • CUDA_VISIBLE_DEVICES="0,1,2,3": used to specify which GPUs are available for the current training task. In this case, "0,1,2,3" means that the four GPUs with IDs 0, 1, 2, and 3 are being utilized. If your machine is equipped with more than four GPUs, this setting allows you to select any four of them for use.
  • --nproc_per_node=4: specifies the number of processes to be launched on each node. Since four GPUs have been specified in this example, it is necessary to start four separate processes, with each process corresponding to one GPU.
  • For training tasks that use only a single GPU, the command CUDA_VISIBLE_DEVICES=0 python src/finetune.py can be used to initiate the training. Here, CUDA_VISIBLE_DEVICES=0 designates GPU number 0 for this training task.
  • model_name: Specifies the name of the model architecture you want to use (7B, 13B, Base, Chat belong to the same model architecture). Currently supported models include: ["llama", "alpaca", "vicuna", "zhixi", "falcon", "baichuan", "chatglm", "qwen", "moss", "openba"]. Please note, this parameter should be distinguished from --model_name_or_path.
  • model_name_or_path: Model path, please download the corresponding model from HuggingFace.
  • template: The name of the template used, including: alpaca, baichuan, baichuan2, chatglm3, etc. Refer to src/datamodule/template.py to see all supported template names. The default is the alpaca template. For Chat versions of models, it is recommended to use the matching template, while Base version models can default to using alpaca.
  • train_file, valid_file (optional): The file paths for the training set and the validation set, respectively. Note: Only JSON format files are currently supported. ⚠️If valid_file is not specified, a subset of val_set_size entries will be automatically allocated from train_file to serve as the validation set.
  • output_dir: The path to save the weight parameters after LoRA fine-tuning.
  • val_set_size: The number of samples in the validation set, default is 1000.
  • per_device_train_batch_size, per_device_eval_batch_size: The batch_size on each GPU device, adjust according to the size of the memory. For RTX3090, it is recommended to set between 2 and 4.
  • max_source_length, max_target_length, cutoff_len: The maximum input and output lengths, and the cutoff length, which can simply be considered as the maximum input length + maximum output length. Set appropriate values according to specific needs and memory size.
  • If you run out of GPU memory when saving the model after the evaluation phase, set evaluation_strategy to no.

Quantization can be performed by setting bits to 4; it is recommended for the RTX3090.

To learn more about parameter configuration, please refer to src/utils/args.

The specific script for fine-tuning the LLaMA2-13B-Chat model can be found in ft_scripts/fine_llama.bash.

The specific script for fine-tuning the Baichuan2-13B-Chat model can be found in ft_scripts/fine_baichuan.bash.

4.Continued Training with In-Domain Data

Although the Baichuan2-IEPile and LLaMA2-IEPile models have undergone extensive instruction fine-tuning on multiple general datasets and thus possess a degree of general information extraction capability, they may still exhibit certain limitations when processing data in specific domains (such as law, education, science, telecommunications). To address this challenge, it is recommended to conduct secondary training of these models on datasets specific to these domains. This will help the models better adapt to the semantic and structural characteristics of the specific domains, enhancing their information extraction capability within those domains.

4.1Training Data Conversion

Firstly, it's necessary to format the data to include instruction and output fields. For this purpose, we provide a script convert_func.py, which can batch convert data into a format that can be directly used by the model.

Before using the convert_func.py script, please make sure to refer to the data directory. This directory provides detailed instructions on the data format required for each task. Refer to sample.json to understand the format of the data before conversion, schema.json to see the organization of the schema, and train.json to describe the data format after conversion.

Additionally, you can directly use the bilingual (Chinese and English) information extraction dataset zjunlp/InstructIE, which includes 12 themes such as characters, vehicles, works of art, natural science, man-made objects, astronomical objects, etc.

python ie2instruction/convert_func.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/train.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num 6 \       
    --random_sort \
    --split train
  • language: Supports two languages, zh (Chinese) and en (English), with different instruction templates used for each language.
  • task: Currently supports five types of tasks: ['RE', 'NER', 'EE', 'EET', 'EEA'].
  • split_num: Defines the maximum number of schemas that can be included in a single instruction. The default value is 4, and setting it to -1 means no splitting is done. The recommended number of task splits varies by task: 6 for NER, and 4 for RE, EE, EET, EEA.
  • random_sort: Whether to randomize the order of schemas in the instructions. The default is False, which means schemas are sorted alphabetically.
  • split: Specifies the type of dataset, with options train or test.

The converted training data will contain four fields: task, source, instruction, output.

Generation of Hard Negative Samples: this promotes the co-occurrence of semantically close and easily confused schemas, while reducing the number of training samples.

python ie2instruction/convert_func.py \
    --src_path data/SPO/sample.json \
    --tgt_path data/SPO/train.json \
    --schema_path data/SPO/schema.json \
    --cluster_mode \
    --hard_negative_path data/hard_negative/SPO_DuIE2.0.json \
    --language zh \
    --task SPO \
    --split_num 4 \
    --random_sort \
    --split train

Note the two additional parameters --cluster_mode and --hard_negative_path data/hard_negative/SPO_DuIE2.0.json, where --hard_negative_path points to the hard negative schema dictionary. The hard_dict.json file contains the hard negative dictionaries for all datasets involved in IEPile.

4.2Continued Training

Model download links for LLaMA2-IEPile | Baichuan2-IEPile | LLaMA3-IEPile | Qwen1.5-IEPile | OneKE: zjunlp/llama2-13b-iepile-lora | zjunlp/baichuan2-13b-iepile-lora | zjunlp/llama3-8b-iepile-lora | zjunlp/qwen1.5-14b-iepile-lora | zjunlp/OneKE

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| :--- | :--- | :--- | :--- | :--- |
| llama2-13b-iepile-lora | LLaMA2-13B-Chat | llama | bf16 | llama2 |
| baichuan2-13b-iepile-lora | BaiChuan2-13B-Chat | baichuan | bf16 | baichuan2 |
| llama3-8b-iepile-lora | LLaMA3-8B-Instruct | llama | bf16 | alpaca |
| qwen1.5-14b-iepile-lora | Qwen1.5-14B-Chat | qwen2 | bf16 | qwen |
| OneKE | OneKE | llama | bf16 | llama2_zh |

output_dir='lora/llama2-13b-chat-v1-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/llama2-13B-Chat' \
    --checkpoint_dir 'zjunlp/llama2-13b-iepile-lora' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16 
  • Please refer to 3.3LoRA Fine-tuning above for further parameter descriptions.
  • To continue training based on the fine-tuned LoRA weights, simply point the --checkpoint_dir parameter to the path of the LoRA weights, for example by setting it to 'zjunlp/llama2-13b-iepile-lora'.

Quantization can be performed by setting bits to 4; it is recommended for the RTX3090.

Please note that when using LLaMA2-IEPile or Baichuan2-IEPile, keep both lora_r and lora_alpha at 64; we do not provide recommendations for other values of these parameters.

  • To continue training based on the fine-tuned model weights, just set the --model_name_or_path parameter to the path of the weights, such as 'zjunlp/KnowLM-IE-v2', without setting --checkpoint_dir.

The script can be found at ft_scripts/fine_continue.bash.

4.3Continued Training OneKE

4.3.1Full SFT

output_dir='lora/OneKE-continue'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --bf16 

4.3.2LoRA SFT

output_dir='lora/OneKE-continue-lora'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/test_finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path 'models/OneKE' \
    --stage 'sft' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --train_file 'data/train.json' \
    --valid_file 'data/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --bf16 

5.Prediction

5.1Test Data Conversion

Before converting the test data, please visit the data directory to understand the data structure required for each task: 1) For the input data format, see sample.json. 2) For the schema format, please refer to schema.json. 3) For the format of the transformed data, refer to train.json. Unlike the training data, the test data input does not need to include annotation fields (entity, relation, event).

python ie2instruction/convert_func.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/test.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --split_num 6 \
    --split test

When setting split to test, select the appropriate number of schemas according to the task type: 6 is recommended for NER, while 4 is recommended for RE, EE, EET, EEA. The transformed test data will contain five fields: id, task, source, instruction, label.

The label field will be used for subsequent evaluation. If the input data lacks the annotation fields (entity, relation, event), the transformed test data will not contain the label field, which is suitable for scenarios where no original annotated data is available.

5.2Basic Model + LoRA Prediction

Model download links for LLaMA2-IEPile | Baichuan2-IEPile : zjunlp/llama2-13b-iepile-lora | zjunlp/baichuan2-13b-iepile-lora

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| :--- | :--- | :--- | :--- | :--- |
| llama2-13b-iepile-lora | LLaMA2-13B-Chat | llama | bf16 | llama2 |
| baichuan2-13b-iepile-lora | BaiChuan2-13B-Chat | baichuan | bf16 | baichuan2 |
| llama3-8b-iepile-lora | LLaMA3-8B-Instruct | llama | bf16 | alpaca |
| qwen1.5-14b-iepile-lora | Qwen1.5-14B-Chat | qwen2 | bf16 | qwen |

⚠️ When performing Basic Model + LoRA Prediction, you must download not only the LoRA weight parameters but also the base model parameters. For example, when using baichuan2-13b-iepile-lora (specified with --checkpoint_dir), you must also download BaiChuan2-13B-Chat (specified with --model_name_or_path). 🚫You cannot merely set --model_name_or_path lora/baichuan2-13b-iepile-lora.

CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --stage sft \
    --model_name_or_path 'models/llama2-13B-Chat' \
    --checkpoint_dir 'lora/llama2-13b-IEPile-lora' \
    --model_name 'llama' \
    --template 'llama2' \
    --do_predict \
    --input_file 'data/NER/test.json' \
    --output_file 'results/llama2-13b-IEPile-lora_output.json' \
    --finetuning_type lora \
    --output_dir 'lora/test' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 4
  • During inference, model_name, template, and bf16 must be the same as the settings used during training.
  • model_name_or_path: Specify the path to the base model being used, which must match the corresponding LoRA model.
  • checkpoint_dir: The path to the LoRA weight files.
  • output_dir: This parameter does not take effect during inference and any path can be specified.
  • input_file, output_file: Specify the input path for the test file and the output path for the prediction results, respectively.
  • cutoff_len, max_new_tokens: Set the maximum input length and the number of new tokens to be generated, adjusting according to device performance.

Quantization can be performed by setting bits to 4; it is recommended for the RTX3090.
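For reference, here is a condensed sketch (not the project's src/inference.py) of performing base model + LoRA prediction directly with transformers and peft. The paths are placeholders, and the prompt must be wrapped in the same template that was used during training (see --template above):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GenerationConfig
from peft import PeftModel

base_path = "models/llama2-13B-Chat"       # base model, as in --model_name_or_path
lora_path = "lora/llama2-13b-IEPile-lora"  # LoRA weights, as in --checkpoint_dir

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),  # roughly --bits 4
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, lora_path)  # attach the LoRA adapter
model.eval()

prompt = "..."  # an `instruction` JSON string wrapped in the template's special tokens
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=300))
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```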

5.3IE-Specific Model Prediction

| checkpoint_dir | model_name_or_path | model_name | fp16/bf16 | template |
| :--- | :--- | :--- | :--- | :--- |
| OneKE | OneKE | llama | bf16 | llama2_zh |

Model download link for OneKE (based on Chinese-Alpaca-2): zjunlp/OneKE

CUDA_VISIBLE_DEVICES=0 python src/inference.py \
    --stage sft \
    --model_name_or_path 'models/OneKE' \
    --model_name 'llama' \
    --template 'llama2_zh' \
    --do_predict \
    --input_file 'data/NER/test.json' \
    --output_file 'results/OneKE_output.json' \
    --output_dir 'lora/test' \
    --predict_with_generate \
    --cutoff_len 512 \
    --bf16 \
    --max_new_tokens 300 \
    --bits 4

model_name_or_path: The path to the weights of the model specialized for Information Extraction (IE).

6.Evaluation

We provide scripts for evaluating the F1 scores for various tasks.

python ie2instruction/eval_func.py \
  --path1 data/NER/processed.json \
  --task NER 
  • task: Currently supports five types of tasks: ['RE', 'NER', 'EE', 'EET', 'EEA'].
  • You can set sort_by to source to calculate the F1 scores on each dataset separately.
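As a rough illustration of what such an evaluation computes, the sketch below (not the repository's eval_func.py, which performs task-specific matching) scores span-level F1 between the parsed output and label dictionaries described in 5.1Test Data Conversion:

```python
import json

def span_f1(gold_records, pred_records):
    """Micro precision/recall/F1 over extracted mentions, grouped by schema."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_records, pred_records):
        for schema, gold_mentions in gold.items():
            pred_mentions = set(pred.get(schema, []))
            gold_mentions = set(gold_mentions)
            n_gold += len(gold_mentions)
            n_pred += len(pred_mentions)
            tp += len(gold_mentions & pred_mentions)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [json.loads('{"person": ["Robert Allenby"], "location": ["Australia", "Spain"]}')]
pred = [json.loads('{"person": ["Robert Allenby"], "location": ["Australia"]}')]
print(span_f1(gold, pred))  # (1.0, 0.666..., 0.8)
```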

7.Statement and License

We believe that annotated data contains the wisdom of humanity, and its existence is to promote the benefit of all humankind and help enhance our quality of life. We strongly urge all users not to use our corpus for any actions that may harm national or public security or violate legal regulations. We have done our best to ensure the quality and legality of the data provided. However, we also recognize that despite our efforts, there may still be some unforeseen issues, such as concerns about data protection and risks and problems caused by data misuse. We will not be responsible for these potential problems. For original data that is subject to usage permissions stricter than the CC BY-NC-SA 4.0 agreement, IEPile will adhere to those stricter terms. In all other cases, our operations will be based on the CC BY-NC-SA 4.0 license agreement.

8.Limitations

From the data perspective, our study primarily focuses on schema-based IE, which limits our ability to generalize to human instructions that do not follow our specific format requirements. Additionally, we do not explore the field of Open Information Extraction (Open IE); however, if we remove schema constraints, our dataset would be suitable for Open IE scenarios. Besides, IEPile is confined to data in English and Chinese, and in the future, we hope to include data in more languages.

From the model perspective, due to computational resource limitations, our research only assessed two models: Baichuan and LLaMA, along with some baseline models. Our dataset can be applied to any other large language models (LLMs), such as Qwen, ChatGLM, Gemma.

9.Cite

If you use IEPile or the code, please cite the paper:

@article{DBLP:journals/corr/abs-2402-14710,
  author       = {Honghao Gui and
                  Lin Yuan and
                  Hongbin Ye and
                  Ningyu Zhang and
                  Mengshu Sun and
                  Lei Liang and
                  Huajun Chen},
  title        = {IEPile: Unearthing Large-Scale Schema-Based Information Extraction
                  Corpus},
  journal      = {CoRR},
  volume       = {abs/2402.14710},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2402.14710},
  doi          = {10.48550/ARXIV.2402.14710},
  eprinttype    = {arXiv},
  eprint       = {2402.14710},
  timestamp    = {Tue, 09 Apr 2024 07:32:43 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2402-14710.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

10.Acknowledgements

We are very grateful for the inspiration provided by the MathPile and KnowledgePile projects. Special thanks are due to the builders and maintainers of the following datasets: AnatEM, BC2GM, BC4CHEMD, NCBI-Disease, BC5CDR, HarveyNER, CoNLL2003, GENIA, ACE2005, MIT Restaurant, MIT Movie, FabNER, MultiNERD, Ontonotes, FindVehicle, CrossNER, MSRA NER, Resume NER, CLUE NER, Weibo NER, Boson, ADE Corpus, GIDS, CoNLL2004, SciERC, Semeval-RE, NYT11-HRL, KBP37, NYT, Wiki-ZSL, FewRel, CMeIE, DuIE, COAE2016, IPRE, SKE2020, CASIE, PHEE, CrudeOilNews, RAMS, WikiEvents, DuEE, DuEE-Fin, FewFC, CCF law, and more. These datasets have significantly contributed to the advancement of this research. We are also grateful for the valuable contributions in the field of information extraction made by InstructUIE and YAYI-UIE, both in terms of data and model innovation. Our research results have benefitted from their creativity and hard work as well. Additionally, our heartfelt thanks go to hiyouga/LLaMA-Factory; our fine-tuning code implementation owes much to their work. The assistance provided by these academic resources has been instrumental in the completion of our research, and for this, we are deeply appreciative.

iepile's People

Contributors

guihonghao, hunxuewangzi, zxlzr


iepile's Issues

The code example on Hugging Face has a problem

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    load_in_4bit=True,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

When I load the model with quantization following the example code, I keep getting this error:
ValueError: You can't pass load_in_4bit or load_in_8bit as a kwarg when passing quantization_config argument at the same time.
I checked several times and the example does pass this argument. After looking into it, I found that setting load_in_4bit inside quantization_config is enough; removing load_in_4bit=True from the model-loading call lets it load normally. The Hugging Face repo has no issues tab, so I am posting it here.
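In code, the fix described above amounts to the following (a sketch; model_path and the remaining keyword arguments mirror the snippet at the top of this issue):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

model_path = "path/to/model"  # placeholder
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # request 4-bit loading here only
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",
    quantization_config=quantization_config,  # load_in_4bit is NOT passed again as a kwarg
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```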

ValueError: Cannot merge LORA layers when the model is gptq quantized

The error message is as follows:
(IEPile) C:\Users\apoll\IEPile>predict.bat
07/10/2024 15:09:10 - INFO - main - model_class:<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>
tokenizer_class:<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
CUDA extension not installed.
CUDA extension not installed.
C:\anaconda\envs\IEPile\lib\site-packages\transformers\modeling_utils.py:4225: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
warnings.warn(
Some weights of the model checkpoint at C:\Users\apoll\IEPile\models\qwen-0.5b were not used when initializing Qwen2ForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.o_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.11.mlp.gate_proj.bias', 'model.layers.11.mlp.up_proj.bias', 'model.layers.11.self_attn.o_proj.bias', 'model.layers.12.mlp.down_proj.bias', 'model.layers.12.mlp.gate_proj.bias', 'model.layers.12.mlp.up_proj.bias', 'model.layers.12.self_attn.o_proj.bias', 'model.layers.13.mlp.down_proj.bias', 'model.layers.13.mlp.gate_proj.bias', 'model.layers.13.mlp.up_proj.bias', 'model.layers.13.self_attn.o_proj.bias', 'model.layers.14.mlp.down_proj.bias', 'model.layers.14.mlp.gate_proj.bias', 'model.layers.14.mlp.up_proj.bias', 'model.layers.14.self_attn.o_proj.bias', 'model.layers.15.mlp.down_proj.bias', 'model.layers.15.mlp.gate_proj.bias', 'model.layers.15.mlp.up_proj.bias', 'model.layers.15.self_attn.o_proj.bias', 'model.layers.16.mlp.down_proj.bias', 'model.layers.16.mlp.gate_proj.bias', 'model.layers.16.mlp.up_proj.bias', 'model.layers.16.self_attn.o_proj.bias', 'model.layers.17.mlp.down_proj.bias', 'model.layers.17.mlp.gate_proj.bias', 'model.layers.17.mlp.up_proj.bias', 'model.layers.17.self_attn.o_proj.bias', 'model.layers.18.mlp.down_proj.bias', 'model.layers.18.mlp.gate_proj.bias', 'model.layers.18.mlp.up_proj.bias', 'model.layers.18.self_attn.o_proj.bias', 'model.layers.19.mlp.down_proj.bias', 'model.layers.19.mlp.gate_proj.bias', 'model.layers.19.mlp.up_proj.bias', 'model.layers.19.self_attn.o_proj.bias', 'model.layers.2.mlp.down_proj.bias', 'model.layers.2.mlp.gate_proj.bias', 'model.layers.2.mlp.up_proj.bias', 'model.layers.2.self_attn.o_proj.bias', 'model.layers.20.mlp.down_proj.bias', 'model.layers.20.mlp.gate_proj.bias', 'model.layers.20.mlp.up_proj.bias', 'model.layers.20.self_attn.o_proj.bias', 'model.layers.21.mlp.down_proj.bias', 'model.layers.21.mlp.gate_proj.bias', 'model.layers.21.mlp.up_proj.bias', 'model.layers.21.self_attn.o_proj.bias', 'model.layers.22.mlp.down_proj.bias', 'model.layers.22.mlp.gate_proj.bias', 'model.layers.22.mlp.up_proj.bias', 'model.layers.22.self_attn.o_proj.bias', 'model.layers.23.mlp.down_proj.bias', 'model.layers.23.mlp.gate_proj.bias', 'model.layers.23.mlp.up_proj.bias', 'model.layers.23.self_attn.o_proj.bias', 'model.layers.3.mlp.down_proj.bias', 'model.layers.3.mlp.gate_proj.bias', 'model.layers.3.mlp.up_proj.bias', 'model.layers.3.self_attn.o_proj.bias', 'model.layers.4.mlp.down_proj.bias', 'model.layers.4.mlp.gate_proj.bias', 'model.layers.4.mlp.up_proj.bias', 'model.layers.4.self_attn.o_proj.bias', 'model.layers.5.mlp.down_proj.bias', 'model.layers.5.mlp.gate_proj.bias', 'model.layers.5.mlp.up_proj.bias', 'model.layers.5.self_attn.o_proj.bias', 'model.layers.6.mlp.down_proj.bias', 'model.layers.6.mlp.gate_proj.bias', 'model.layers.6.mlp.up_proj.bias', 'model.layers.6.self_attn.o_proj.bias', 'model.layers.7.mlp.down_proj.bias', 'model.layers.7.mlp.gate_proj.bias', 'model.layers.7.mlp.up_proj.bias', 'model.layers.7.self_attn.o_proj.bias', 'model.layers.8.mlp.down_proj.bias', 'model.layers.8.mlp.gate_proj.bias', 
'model.layers.8.mlp.up_proj.bias', 'model.layers.8.self_attn.o_proj.bias', 'model.layers.9.mlp.down_proj.bias', 'model.layers.9.mlp.gate_proj.bias', 'model.layers.9.mlp.up_proj.bias', 'model.layers.9.self_attn.o_proj.bias']

  • This IS expected if you are initializing Qwen2ForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing Qwen2ForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    07/10/2024 15:09:11 - INFO - model.adapter - Fine-tuning method: LoRA
    Traceback (most recent call last):
    File "C:\Users\apoll\IEPile\src\inference.py", line 121, in
    main()
    File "C:\Users\apoll\IEPile\src\inference.py", line 115, in main
    inference(model_args, data_args, training_args, finetuning_args, generating_args, inference_args)
    File "C:\Users\apoll\IEPile\src\inference.py", line 47, in inference
    model, tokenizer = load_model_and_tokenizer(
    File "C:\Users\apoll\IEPile\src\model\loader.py", line 140, in load_model_and_tokenizer
    model = init_adapter(model, model_args, finetuning_args, is_trainable)
    File "C:\Users\apoll\IEPile\src\model\adapter.py", line 166, in init_adapter
    model = model.merge_and_unload()
    File "C:\anaconda\envs\IEPile\lib\site-packages\peft\tuners\lora\model.py", line 838, in merge_and_unload
    return self._unload_and_optionally_merge(
    File "C:\anaconda\envs\IEPile\lib\site-packages\peft\tuners\lora\model.py", line 445, in _unload_and_optionally_merge
    self._check_merge_allowed()
    File "C:\anaconda\envs\IEPile\lib\site-packages\peft\tuners\lora\model.py", line 423, in _check_merge_allowed
    raise ValueError("Cannot merge LORA layers when the model is gptq quantized")
    ValueError: Cannot merge LORA layers when the model is gptq quantized
    How can this be resolved? Thank you.

Why must the model's output include all schemas from the instruction and keep the same order?

Hello, thank you very much for collecting so many datasets and open-sourcing them!

However, I have a question about the data processing, as in the title: why must the model's output include all schemas from the instruction and keep the same order?
If a sentence contains no entities/relations/events at all, or only a few, the model is required to output a JSON whose values are all empty (or very sparse), and this JSON can be very long (for example on the CASIE dataset). Why not output only the entities/relations/events that the sentence actually contains? Reducing the output length might also improve training and inference efficiency.

I hope you can answer my question, thank you!

The model cannot produce correct output when given long input texts

When I use the following inputs:
{"id": "a79d7267c800a36b6a7bde4d70684b84e193faca2d8c4468ceee8bc6c74e0416", "input": "相比之下,青岛海牛队和广州松日队的雨中之战虽然也是0∶0,但乏善可陈\n", "instruction": "假设你是一位语言专家,请抽下列文本中的所有实体。"}
{"id": "1e073138ed48eeb6f9726dc34addc6dff821cef502f4ba292c911351d597a8e6", "input": "理由多多,最无奈的却是:5月恰逢双重考试,她攻读的博士学位论文要通考;她任教的两所学校,也要在这段时日大考。", "instruction": "假设你是一位语言专家,请抽下列文本中的所有实体。"}

CUDA_VISIBLE_DEVICES=0 python src/inference.py --stage sft --model_name_or_path 'models/baichuan2-13B-Chat' --checkpoint_dir 'lora/baichuan2-13b-IEPile-lora' --model_name 'baichuan' --template 'baichuan2' --do_predict --input_file 'data/Mydata/ner_results.json' --output_file 'results/baichuan2-13b-IEPile-lora_output.json' --finetuning_type lora --output_dir 'lora/test' --predict_with_generate --cutoff_len 512 --bf16 --max_new_tokens 300 --bits 4
with this command, I can obtain the correct output below:
[199, 31106, 30938, 31203, 3068, 31302, 7234, 5593, 72, 31488, 32482, 21738, 31271, 31267, 3026, 2724, 19529, 73, 5, 9971, 14862, 72, 11843, 31474, 32039, 31635, 31188, 8570, 32017, 31224, 28811, 31963, 31177, 31278, 31607, 3841, 2327, 52, 35030, 52, 72, 31354, 32868, 31909, 31197, 32058, 5, 200]
inputs:
<reserved_106> 假设你是一位语言专家,请抽下列文本中的所有实体。
相比之下,青岛海牛队和广州松日队的雨中之战虽然也是0∶0,但乏善可陈
<reserved_107>
在这段文本中,实体有:

  1. 青岛海牛队
  2. 广州松日队
  3. 雨中之战
    在这段文本中,实体包括:
  4. 5月
  5. 双重考试
  6. 博士学位论文
  7. 两所学校

However, when I replace the short text with the following:
福建省漳州市中级人民法院 民 事 判 决 书 (2020)闽06民终945号 上诉人(原审被告):苏玲,女,1979年1月10日出生,汉族,住漳州市芗城区。 委托诉讼代理人:郑志伟,福建三和律师事务所执业律师。 被上诉人(原审原告):杨晓红,女,1979年8月5出生,汉族,住漳州市芗城区。 委托诉讼代理人:吕子雄,福建衡评律师事务所执业律师。 上诉人苏玲因与被上诉人杨晓红民间借贷纠纷一案,不服福建省漳州市芗城区人民法院(2019)闽0602民初5390号民事判决,向本院提起上诉。本院于2020年4月1日立案后,依法组成合议庭,进行了审理。本案现已审理终结。 苏玲上诉请求:1、撤销一审判决,改判驳回杨晓红的诉讼请求或将本案发回重审;2、一、二审诉讼费用由杨晓红负担。事实和理由:本案借条利率与案涉债权投资产品的利率一致,苏玲与杨晓红的微信聊天记录也可证明案涉款项实际上是苏玲以其目前名义代杨晓红购买债权投资产品,双方没有借款合意,苏玲系因重大误解写下借条。如双方系借贷关系,则之后杨晓红不可能向苏玲表示要借款,而应要求苏玲提前还款。一审判决认定本案是民间借贷法律关系是错误的,据此所作判决亦是错误的。 杨晓红辩称,本案一审认定的事实清楚,判决正确,苏玲应当按一审的判决偿还款项。1.本案双方是借贷关系,苏玲说存在委托理财,但未提供由杨晓红授权的授权委托书或双方签订的合同,不能认定双方有达成理财产品的合意。根据双方的微信聊天记录,苏玲在转账当天出具了借条,苏玲发了一份借条模版、一份欠条模版,苏玲出具的是借条,以此证明双方是借贷的法律关系。2.杨晓红在本案所涉购买理财产品的过程之前是有购买过理财产品,杨晓红有相应的理财产品账号,如要购买杨晓红可以自行在手机上购买,不用再通过苏玲购买。同时在苏玲购买时也并非是用杨晓红的名义购买理财产品,本案310000购买理财产品时是苏玲的母亲名义购买不是杨晓红的名义。3.借款到期后,杨晓红有向苏玲多次催讨,苏玲支付了10000元,后又支付了20000元,如系委托理财产品,苏玲无需向杨晓红还款,可见双方是借贷的关系。 杨晓红向一审法院起诉请求:1、判令苏玲偿还杨晓红借款28万元及利息(自2018年4月25日起至实际还清之日止,按年利率10.42%计算);2、判令苏玲支付杨晓红逾期还款违约金3000元;3、本案诉讼费由苏玲承担。 一审法院认定事实:2018年4月25日,苏玲出具借条一份给杨晓红收执。借条记载:今借杨晓红人民币31万元整,大写:叁拾壹万元整,所有现金已收到。约定于2019年4月26日归还,年利率为10.42%,全部本息共计人民币344340元,如不能按期足额归还借款,借款人应向出借人支付违金人民币3000元整。备注:到期有7天周期,以实际到账为主。杨晓红以转账形式予苏玲。,苏玲在借款人处签名捺印。当日,杨晓红通过银行转账的方式向苏玲汇款合计30.9万元,通过微信转账的方式向苏玲转账1000元。在本案审理过程中,申请人杨晓红于2019年6月3日申请财产保全,冻结被申请人苏玲名下价值相当于317993.83元的财产,并提供担保函作为担保。一审法院于2019年6月3日作出(2019)闽0602民初5390号民事裁定书,裁定查封、冻结被申请人苏玲名下价值317993.83元的财产。 一审法院认为,杨晓红提供借据及转账凭证证明杨晓红、苏玲之间存在民间借贷关系,合法有据,杨晓红与苏玲的民间借贷关系依法有效,受法律保护。苏玲对本案借条无异议且确认收到31万元,但辩称杨晓红向其转账31万元系杨晓红委托其向信和财富投资管理(北京)有限公司投资债权,杨晓红、苏玲之间系委托理财关系而非借贷关系,该辩称得到杨晓红的否认,杨晓红在诉讼中述称,苏玲确实在2018年4月向其推荐金信网的理财产品,其也有意购买,后因为担心金信网的资信不够决定不买并将资金借给苏玲,根据《最高人民法院关于审理民间借贷案件适用法律若干问题的规定》第十五条第一款杨晓红以借据、收据、欠条等债权凭证为依据提起民间借贷诉讼,苏玲依据基础法律关系提出抗辩或者反诉,并提供证据证明债权纠纷非民间借贷行为引起的,人民法院应当依据查明的案件事实,按照基础法律关系审理。,苏玲提供的微信聊天记录、银行账户交易明细以及证人胡某、证人王某的证言,不足以证明杨晓红委托苏玲投资第三方债权产品的事实,本案应按借款合同关系处理,故苏玲的辩称,证据不足,一审法院不予采纳。本案借款借期已届满,苏玲未按期归还借款,构成违约。杨晓红要求苏玲归还尚欠的借款本金28万元、自2018年4月25日起至实际还清款项之日止按年利率10.42%计付的利息以及逾期还款违约金3000元,合法有据,一审法院予以支持。依照《中华人民共和国合同法》第一百九十六条、第二百零五条、第二百零六条、第二百零七条、第一百零七条、第一百一十四条第一款,《最高人民法院关于审理民间借贷案件适用法律若干问题的规定》第十五条第一款,《最高人民法院关于民事诉讼证据的若干规定》第二条以及《中华人民共和国民事诉讼法》第六十四条的规定,判决如下:一、苏玲应于判决生效后十日内偿付杨晓红借款本金28万元及以本金28万元为基数按年利率10.42%从2018年4月25日起至实际还清款项之日止计付的利息;二、苏玲应于判决生效后十日内支付杨晓红逾期还款违约金3000元。如果未按判决指定的期间履行给付金钱义务,应当依照《中华人民共和国民事诉讼法》第二百五十三条之规定,加倍支付迟延履行期间的债务利息。一审案件受理费6069.9元,减半收取计3034.95,保全费2110元,均由苏玲承担。 二审中,当事人没有提交新证据。对一审认定的事实,当事人均无异议,本院予以确认。 本案争议焦点:苏玲与杨晓红的借贷关系能否成立。 本院认为,苏玲出具的条据是借条,该借条不仅约定借款金额,还约定了借款期限和借款利率,故该借条从形式到内容均符合民间借贷法律特征,且杨晓红还依约支付款项。杨晓红请求判令苏玲偿还借款,有事实和法律依据。苏玲主张双方系委托代理投资关系,但未提供书面委托代理协议,双方的聊天记录和一审证人证言均不足以推翻涉案借条的证明效力,且杨晓红在向苏玲转账前曾以自己名义在网贷平台购买过投资产品,其具有自行投资的能力,而案涉投资产品不是以杨晓红的名义购买,故苏玲主张本案系委托理财纠纷,不是民间借贷法律关系,依据不足。 综上所述,苏玲的上诉请求不能成立,应予驳回;一审判决认定事实清楚,适用法律正确,应予维持。依照《中华人民共和国民事诉讼法》第一百七十条第一款第一项规定,判决如下: 驳回上诉,维持原判。 二审案件受理费6069.9元,由苏玲负担。 本判决为终审判决。 审 判 长 周月华 审 判 员 傅志杰 审 判 员 傅 京 二〇二〇年五月十八日 法官助理 詹立宇 书 记 员 邹晓燕 关注公众号“马克数据网”

With long texts like this, the model does not raise an error, but it cannot produce the correct information. I also searched the code and the README, but could not determine which parts need to be modified to support entity, event, and relation extraction from long texts. Does this model support such long texts, or do some parameters in certain files need to be changed?

Program error when running llama-2-13b-chat-hf + llama2-13b-iepile-lora in 4-bit

import torch
from transformers import BitsAndBytesConfig
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '2'

from transformers import (
AutoConfig,
AutoTokenizer,
AutoModelForCausalLM,
GenerationConfig
)
from peft import PeftModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = '/IEPile/models/pretrain/llama-2-13b-chat-hf'
lora_path = '/IEPile/models/pretrain/llama2-13b-iepile-lora'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
model_path,
config=config,
device_map="auto",
quantization_config=quantization_config,
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)

model = PeftModel.from_pretrained(
model,
lora_path,
)

model.to(device)

model.eval()

sintruct = "{"instruction": "You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.", "schema": ["person", "organization", "else", "location"], "input": "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )"}"
sintruct = '<reserved_106>' + sintruct + '<reserved_107>'

input_ids = tokenizer.encode(sintruct, return_tensors="pt").to(device)
input_length = input_ids.size(1)
print(input_ids)
print(input_length)
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=512, max_new_tokens=256, return_dict_in_generate=True))
generation_output = generation_output.sequences[0]
generation_output = generation_output[input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)
Error:
Traceback (most recent call last):
File "/home/admin/wangsj/workspace/IEPile/script/infer_4bit.py", line 50, in
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=512, max_new_tokens=256, return_dict_in_generate=True))
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/peft/peft_model.py", line 977, in generate
outputs = self.base_model.generate(**kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/generation/utils.py", line 1602, in generate
return self.greedy_search(
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/generation/utils.py", line 2450, in greedy_search
outputs = self(
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
outputs = self.model(
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
layer_outputs = decoder_layer(
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 311, in forward
query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
File "/root/anaconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 311, in
query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
RuntimeError: mat1 and mat2 shapes cannot be multiplied (137x5120 and 1x2560)

Package Version


accelerate 0.21.0
aiohttp 3.9.5
aiosignal 1.3.1
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.39.1
certifi 2024.2.2
charset-normalizer 3.3.2
cmake 3.29.2
datasets 2.16.1
dill 0.3.7
filelock 3.14.0
frozenlist 1.4.1
fsspec 2023.10.0
huggingface-hub 0.20.3
idna 3.7
jieba 0.42.1
Jinja2 3.1.3
lit 18.1.4
MarkupSafe 2.1.5
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
networkx 3.2.1
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
packaging 24.0
pandas 2.2.2
peft 0.4.0
pip 23.3.1
protobuf 3.20.1
psutil 5.9.8
pyarrow 16.0.0
pyarrow-hotfix 0.6
pydantic 1.10.7
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.4.28
requests 2.31.0
rouge-chinese 1.0.3
safetensors 0.4.3
scipy 1.9.1
sentencepiece 0.1.98
setuptools 68.2.2
six 1.16.0
sympy 1.12
tiktoken 0.6.0
tokenizers 0.13.3
torch 2.0.0
tqdm 4.66.2
transformers 4.33.0
triton 2.0.0
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4

Question about the DuIE-fin dataset in the EE-zh task

Why, during event extraction, is an event with the same event type and the same argument composition extracted repeatedly? Will this affect the model's performance? Example: record 6997 in DuIE-fin test.json:
{ "text": "网宿科技(300017.SZ)拟回购注销73.72万股限制性股票\n股市震荡,需要注意什么?\n跨年行情,应该如何布局?\n【立即开户,领取福利】", "event": [ { "event_trigger": "回购", "event_type": "股份回购", "arguments": [ { "argument": "网宿科技", "role": "回购方" } ] }, { "event_trigger": "回购", "event_type": "股份回购", "arguments": [ { "argument": "网宿科技", "role": "回购方" } ] } ], "task": "EE" }

ValueError: Target modules ['c_attn', 'attn.c_proj', 'w1', 'w2', 'mlp.c_proj'] not found in the base model. Please check the target modules and try again

Hello, does the framework currently not support fine-tuning with Qwen as the base model? When attempting to fine-tune, the error in the title (Target modules not found in the base model) occurs.
The contents of the .bash command file are as follows:

output_dir='lora/qwen1.5-14b-chat-v1'
mkdir -p ${output_dir}
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun --nproc_per_node=8 --master_port=1288 /chentao/wuqi/project/IEPile/src/finetune.py \
    --do_train --do_eval \
    --overwrite_output_dir \
    --model_name_or_path '/chentao/wuqi/model/Qwen1.5-14B-Chat' \
    --stage 'sft' \
    --model_name 'qwen' \
    --template 'qwen' \
    --train_file '/chentao/wuqi/project/IEPile/data/IEPILE/train.json' \
    --valid_file '/chentao/wuqi/project/IEPile/data/IEPILE/dev.json' \
    --output_dir=${output_dir} \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 16 \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --optim "adamw_torch" \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 10 \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --bf16 \
    --deepspeed configs/ds_config_bf16.json

The dataset released on Hugging Face raises a format error.

Hello, I am trying to load zjunlp/iepie with the following code (should this name be changed to zjunlp/iepile? 😂):

import datasets
datasets.load_dataset('zjunlp/iepie')

After the download completes, the following error is reported:
File "/home/zkhu143/anaconda3/envs/llama2/lib/python3.8/site-packages/datasets/table.py", line 2286, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")

After inspecting the files in the zjunlp/iepie repo, I found that the JSON format differs across files, and many do not match the format
{'task': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None)}.

Does this mean that an update will follow?
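One possible workaround (our assumption, not an official fix) is to download just the JSON files you need, for example train.json and dev.json from the layout shown in 3.2Download Data and Models, and load them explicitly with the json builder so that datasets does not try to unify mismatched schemas across all files:

```python
import datasets

# Hypothetical local paths after downloading the files you need
data_files = {"train": "IEPile/train.json", "validation": "IEPile/dev.json"}
ds = datasets.load_dataset("json", data_files=data_files)
print(ds)
```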

Model merging

Hello, I would like to directly deploy the two released models trained on IEPile, baichuan2-13b-iepile-lora and llama2-13b-iepile-lora. These are the LoRA files produced by fine-tuning, so I tried to merge the fine-tuned LoRA files with the original base model using the merging approach from the Firefly framework, but the following error occurred:

size mismatch

size mismatch for base_model.model.model.layers.19.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([13824, 64]) from checkpoint, the shape in current model is torch.Size([13696, 64])

Converting the InstructIE dataset to SPO format

May I ask: can I directly use the convert_func.py script to convert the InstructIE dataset into the SPO task format? The code does appear to include processing for SPO task data, and after adding a template to the script I was able to run convert_func.py directly.
