StrucText-Eval: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding

🎉 Overview

Given the substantial volumes of structured data held by many companies, enabling Large Language Models (LLMs) to directly understand structured text in non-structured forms could significantly enhance their capabilities across various business scenarios. To this end, we propose an evaluation-data generation method for assessing LLMs' ability to understand structure-rich text, which generates structured data of controllable complexity based on manually crafted question templates and generation rules. Building on this generation method, we introduce StrucText-Eval, a benchmark comprising 6,032 questions across 8 different structured languages and 29 specific tasks. Furthermore, considering human proficiency in rule-based tasks, we also present StrucText-Eval-Hard, which includes 3,016 questions designed to further examine the gap between LLMs and human performance. Results indicate that the best-performing LLM currently achieves an accuracy of 65.0% on StrucText-Eval-Hard, while human accuracy reaches 95.7%. Moreover, while fine-tuning on StrucText-Eval can enhance existing LLMs' understanding of all structured languages, it does not necessarily improve performance across all task types.

This repo mainly consists of:

  1. The StrucText-Eval dataset used in the paper
  2. The code used to generate the dataset

🔥 Updates

  • 2024/6/29: We released the code for customizing your own StrucText-Eval.
  • 2024/6/19: We released the initial version of the dataset used in the paper.
  • 2024/6/15: We released the first version of our paper.

💡 Introduction to the Existing StrucText-Eval

Eight types of structure-rich languages are covered in StrucText-Eval: seven existing languages and one customized language.

StrucText-Eval contains eight different tasks, each preset at two levels of difficulty. The statistics of the existing StrucText-Eval and the descriptions of the tasks are listed as follows:
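To make the "controllable complexity" idea concrete, the sketch below generates a nested structure of a given depth and width and pairs it with a template question whose answer is verifiable by construction. This is a hypothetical illustration, not the repo's actual generator: the real keys, values, and question templates come from the paper's manually crafted templates and rules.

```python
import random
import string

def gen_structure(depth, width, rng):
    """Recursively build a nested dict with the given depth and branching width.

    Illustrative only: StrucText-Eval's real generator uses its own
    templates and rules; here leaves are just random 6-letter strings.
    """
    if depth == 0:
        return "".join(rng.choices(string.ascii_lowercase, k=6))
    return {
        f"node_{i}": gen_structure(depth - 1, width, rng)
        for i in range(width)
    }

def gen_path_question(depth, width, seed=0):
    """Pair the structure with a template question about one leaf value."""
    rng = random.Random(seed)
    data = gen_structure(depth, width, rng)
    # Walk a random path down to a leaf so the gold answer is known.
    path, node = [], data
    while isinstance(node, dict):
        key = rng.choice(sorted(node))
        path.append(key)
        node = node[key]
    question = f"What value is stored at {' -> '.join(path)}?"
    return data, question, node

if __name__ == "__main__":
    data, question, answer = gen_path_question(depth=3, width=2)
    print(question)
    print(answer)
```

Depth and width here play the same role as the `nodes` and `n_ary_ratio` parameters in the configuration described below: raising either makes the structure, and hence the question, harder.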

โš™๏ธ Customize Your Own Benchmark for Evaluating LLMs' Structure-Rich Text Understanding Ability

Dependency Installation

conda create -n fcs python=3.10.9
conda activate fcs
pip install fire

Benchmark Generation with Different Setting

Edit the settings in generate_setting.json. Descriptions of all the parameters are listed below:

{
  "#": 
  {  // Overall settings
    "output_dir": "",  // the directory where the generated benchmark files are written
    "few_shots": []  // the numbers of few-shot examples to include in the benchmark
  },
  "csv": 
  {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // unused for this format
  },
  "json": 
  {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "latex": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "markdown": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "org": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "tree": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // unused for this format
  },
  "xml": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  },
  "yaml": {
    "nodes": [],  // the depth of the structure-rich data
    "n_ary_ratio": 1,  // the width of each layer
    "para_len_ratio": 6  // the maximum number of characters per item
  }
}
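Before running the generator, it can help to sanity-check the settings file. The sketch below assumes the layout shown above (a "#" block with overall settings plus one block per structured language) and that the real generate_setting.json contains plain JSON without the explanatory comments; the function name `check_settings` is our own, not part of the repo.

```python
import json

# Keys expected in every per-language block, per the layout documented above.
REQUIRED_FORMAT_KEYS = {"nodes", "n_ary_ratio", "para_len_ratio"}

def check_settings(path="generate_setting.json"):
    """Lightweight sanity check for generate_setting.json (hypothetical helper)."""
    with open(path) as f:
        cfg = json.load(f)
    overall = cfg.get("#", {})
    assert "output_dir" in overall and "few_shots" in overall, "missing overall settings"
    for name, block in cfg.items():
        if name == "#":
            continue
        missing = REQUIRED_FORMAT_KEYS - block.keys()
        assert not missing, f"{name} is missing {sorted(missing)}"
    return cfg
```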

Then generate the benchmark with:

bash generate_dataset.sh

or

cd LLMStructure
python datagen.py
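After generation, a quick way to spot-check the result is to preview a few files from the output directory. This snippet assumes the generator writes JSON files into the directory set as "output_dir" in generate_setting.json; adjust the glob pattern if your build emits a different extension.

```python
import json
from pathlib import Path

def preview_benchmark(output_dir, limit=3):
    """Print a truncated view of the first few generated files (assumed JSON)."""
    files = sorted(Path(output_dir).glob("*.json"))
    for fp in files[:limit]:
        with open(fp) as f:
            sample = json.load(f)
        # Truncate each sample so long structures stay readable.
        print(fp.name, "->", str(sample)[:80])
    return files
```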

📒 Citation

@article{gu2024StructBench,
  title={StructBench: An Autogenerated Benchmark for Evaluating Large Language Model's Ability in Structure-Rich Text Understanding},
  author={Gu, Zhouhong and Ye, Haoning and Zhou, Zeyang and Feng, Hongwei and Xiao, Yanghua},
  journal={arXiv preprint arXiv:2406.10621},
  year={2024}
}
