StrucText-Eval: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding

🎉 Overview

Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose an evaluation-data generation method for assessing LLMs' ability to understand structure-rich text, which generates structured data of controllable complexity based on manually crafted question templates and generation rules. Building on this generation method, we introduce StrucText-Eval, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StrucText-Eval-Hard, which includes 3,016 questions designed to further examine the gap between LLMs and human performance. Results indicate that the best-performing LLM currently achieves an accuracy of 65.0% on StrucText-Eval-Hard, while human accuracy reaches 95.7%. Moreover, while fine-tuning on StrucText-Eval can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types.

This repo mainly consists of:

  1. The StrucText-Eval dataset used in the paper
  2. The code used to generate the dataset

🔥 Updates

  • 2024/6/29: We released the code for customizing your own StrucText-Eval.
  • 2024/6/19: We released the initial version of the dataset used in the paper.
  • 2024/6/15: We released the first version of our paper.

💡 Introduction to the Existing StrucText-Eval

Eight types of structure-rich languages are covered in StrucText-Eval: seven existing languages and one customized language.

StrucText-Eval contains eight different tasks, each preset at two levels of difficulty. The statistics of the existing StrucText-Eval and the descriptions of the tasks are listed as follows:
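To make the "controllable complexity" idea concrete, the sketch below generates a nested structure of a given depth and width and pairs it with a template question whose answer is verifiable by construction. This is a hypothetical illustration, not the repo's actual generator: the real keys, values, and question templates come from the paper's manually crafted templates and rules.

```python
import random
import string

def gen_structure(depth, width, rng):
    """Recursively build a nested dict with the given depth and branching width.

    Illustrative only: StrucText-Eval's real generator uses its own
    templates and rules; here leaves are just random 6-letter strings.
    """
    if depth == 0:
        return "".join(rng.choices(string.ascii_lowercase, k=6))
    return {
        f"node_{i}": gen_structure(depth - 1, width, rng)
        for i in range(width)
    }

def gen_path_question(depth, width, seed=0):
    """Pair the structure with a template question about one leaf value."""
    rng = random.Random(seed)
    data = gen_structure(depth, width, rng)
    # Walk a random path down to a leaf so the gold answer is known.
    path, node = [], data
    while isinstance(node, dict):
        key = rng.choice(sorted(node))
        path.append(key)
        node = node[key]
    question = f"What value is stored at {' -> '.join(path)}?"
    return data, question, node

if __name__ == "__main__":
    data, question, answer = gen_path_question(depth=3, width=2)
    print(question)
    print(answer)
```

Depth and width here play the same role as the `nodes` and `n_ary_ratio` parameters in the configuration described below: raising either makes the structure, and hence the question, harder.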

โš™๏ธ Customize Your Own Benchmark for Evaluating LLMs' Structure-Rich Text Understanding Ability

Dependency Installation

conda create -n fcs python=3.10.9
conda activate fcs
pip install fire

Benchmark Generation with Different Setting

Edit the settings in generate_setting.json. Descriptions of all the parameters are listed below:

{
  "#": 
  {  // Overall settings
    "output_dir": "",  // the directory where the generated benchmark files are written
    "few_shots": []  // the numbers of few-shot examples to include in the benchmark
  },
  "csv": 
  {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // unused for this format
  },
  "json": 
  {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "latex": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "markdown": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "org": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "tree": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // unused for this format
  },
  "xml": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "yaml": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  }
}
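Before running the generator, it can help to sanity-check the settings file. The sketch below assumes the layout shown above (a "#" block with overall settings plus one block per structured language) and that the real generate_setting.json contains plain JSON without the explanatory comments; the function name `check_settings` is our own, not part of the repo.

```python
import json

# Keys expected in every per-language block, per the layout documented above.
REQUIRED_FORMAT_KEYS = {"nodes", "n_ary_ratio", "para_len_ratio"}

def check_settings(path="generate_setting.json"):
    """Lightweight sanity check for generate_setting.json (hypothetical helper)."""
    with open(path) as f:
        cfg = json.load(f)
    overall = cfg.get("#", {})
    assert "output_dir" in overall and "few_shots" in overall, "missing overall settings"
    for name, block in cfg.items():
        if name == "#":
            continue
        missing = REQUIRED_FORMAT_KEYS - block.keys()
        assert not missing, f"{name} is missing {sorted(missing)}"
    return cfg
```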

Then generate the benchmark with:

bash generate_dataset.sh

or

cd LLMStructure
python datagen.py
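After generation, a quick way to spot-check the result is to preview a few files from the output directory. This snippet assumes the generator writes JSON files into the directory set as "output_dir" in generate_setting.json; adjust the glob pattern if your build emits a different extension.

```python
import json
from pathlib import Path

def preview_benchmark(output_dir, limit=3):
    """Print a truncated view of the first few generated files (assumed JSON)."""
    files = sorted(Path(output_dir).glob("*.json"))
    for fp in files[:limit]:
        with open(fp) as f:
            sample = json.load(f)
        # Truncate each sample so long structures stay readable.
        print(fp.name, "->", str(sample)[:80])
    return files
```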

📒 Citation

@article{gu2024StructBench,
  title={StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding},
  author={Gu, Zhouhong and Ye, Haoning and Zhou, Zeyang and Feng, Hongwei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2406.10621},
  year={2024}
}
