PROSE Public Benchmark Suite

This repository contains the Microsoft PROSE public benchmark suite.

This suite contains benchmarks drawn from four classes of tasks:

  • Transformation.Text: string-to-string transformation
  • Split.Text: text-to-table transformation
  • Extraction.Text: substring extraction from semi-structured text
  • LastMile.Repair: syntax (and some semantic) program repair for code that can be achieved with few edits

For more details on LastMile.Repair, see the README in LastMile.Repair/.

See below for more detailed descriptions of each class.

Where did this data come from?

A subset of the benchmarks was derived from publicly redistributable data sources. The source and license for each such benchmark are detailed in LICENSE.

The remainder of the benchmarks were synthetically generated by the PROSE team to conform to patterns that might be observed in real-world systems and databases. For example, numbers were randomly generated to simulate social security numbers. No personal data was used to generate this data. There is a small chance that synthetically generated data could inadvertently match one or more attributes of a real person. If you have any concerns that synthetically generated data matches attributes of a real person, please contact us at [email protected] with details. We are offering the data under a permissive license, but to help address any concerns of this nature at the source, we encourage you not to redistribute the data.

File Structure

  • Transformation.Text
    • <benchmark dir>
      • meta.json
      • spec.json
    • <benchmark dir>
      • meta.json
      • spec.json
    • ...
  • Split.Text
    • <benchmark dir>
      • meta.json
      • input.txt
      • output.json
    • <benchmark dir>
      • meta.json
      • input.txt
      • output.json
    • ...
  • Extraction.Text
    • <benchmark dir>
      • meta.json
      • input.txt
      • output.json
      • <ancestor>.<descendant>.spec.json
    • <benchmark dir>
      • meta.json
      • input.txt
      • output.json
      • <ancestor>.<descendant>.spec.json
    • ...

All files are encoded in UTF-8 without BOM.
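
The layout above can be walked with a few lines of Python. This is a minimal sketch, assuming a local checkout of the repository; the function name iter_benchmarks and its root argument are illustrative and not part of the repository. It only covers the three classes shown in the tree above.

```python
import json
from pathlib import Path

# Task classes that follow the <class>/<benchmark dir>/meta.json layout above.
TASK_CLASSES = ("Transformation.Text", "Split.Text", "Extraction.Text")


def iter_benchmarks(root):
    """Yield (task_class, benchmark_dir, meta) for every benchmark under root.

    `root` is assumed to be the path of a local checkout (hypothetical).
    """
    for task_class in TASK_CLASSES:
        class_dir = Path(root) / task_class
        if not class_dir.is_dir():
            continue
        for bench_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            meta_path = bench_dir / "meta.json"
            if not meta_path.is_file():
                continue  # skip directories that are not benchmarks
            # Files are UTF-8 without BOM, so plain "utf-8" is sufficient.
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
            yield task_class, bench_dir, meta
```

Because the function is a generator, benchmarks are read lazily, which keeps memory use flat even when the whole suite is scanned.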

Metadata

Each benchmark directory contains a distinguished meta.json file that annotates that benchmark with descriptive metadata.

The Features field denotes a list of classifications describing the transformations required to solve the task. At this time, only benchmarks in Transformation.Text have these annotations.

  • Casing: converting character cases between capitals and non-capitals
  • Concatenation: concatenating strings
  • Conditional: conditioning transformations on predicates
  • DateTimeRange: computing date/time ranges
  • DateTimeRounding: rounding dates/times
  • DateTime: manipulating dates/times
  • Multicolumn: transforming inputs consisting of multiple strings
  • Numeric: manipulating numbers
  • NumericRange: computing numeric ranges
  • NumericRounding: rounding numbers
  • Substring: extracting substrings

If the benchmark was generated synthetically, the Synthetic field in meta.json is true; otherwise it is false.
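
The two metadata fields described above can be used to filter benchmarks. The sketch below is illustrative only: the example meta.json content is made up, and the helper names has_feature and is_synthetic are hypothetical. Only the field names Features and Synthetic come from the description above.

```python
import json

# Hypothetical meta.json content; only the field names Features and
# Synthetic are taken from the description above.
EXAMPLE_META = '{"Features": ["Substring", "Conditional"], "Synthetic": true}'


def has_feature(meta, feature):
    """True if `feature` is listed in the benchmark's Features field."""
    # Features is only annotated for Transformation.Text, so it may be absent.
    return feature in (meta.get("Features") or [])


def is_synthetic(meta):
    """True if the benchmark is marked as synthetically generated."""
    return bool(meta.get("Synthetic", False))


meta = json.loads(EXAMPLE_META)
print(has_feature(meta, "Substring"), has_feature(meta, "DateTime"), is_synthetic(meta))
# prints: True False True
```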

In addition to the meta.json file, each benchmark has the following structure:

Transformation.Text

A Transformation.Text benchmark consists of a single JSON file spec.json containing input-output pairs. Example:

{
  "Examples": [
    {
      "Input": [
        "/libero/enim7.png"
      ],
      "Output": "enim7"
    },
    {
      "Input": [
        "/"
      ],
      "Output": "root"
    },
    {
      "Input": [
        "/libero/enim9.png"
      ],
      "Output": "enim9"
    }
  ]
}
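
As a rough illustration of how such a file might be consumed, the sketch below scores a hand-written candidate function against the examples shown above. The names candidate and accuracy are hypothetical; in practice the candidate would be a synthesized program, and the spec would be read from spec.json rather than inlined.

```python
import json

# The spec.json example shown above, inlined so the sketch is self-contained.
SPEC = json.loads("""
{"Examples": [
  {"Input": ["/libero/enim7.png"], "Output": "enim7"},
  {"Input": ["/"], "Output": "root"},
  {"Input": ["/libero/enim9.png"], "Output": "enim9"}
]}
""")


def candidate(inputs):
    """Hand-written candidate: file name without extension, or "root" for "/"."""
    path = inputs[0]  # Input is a list of strings; this task uses one column
    if path == "/":
        return "root"
    name = path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[0]


def accuracy(program, spec):
    """Fraction of examples on which program(Input) equals Output."""
    examples = spec["Examples"]
    hits = sum(program(ex["Input"]) == ex["Output"] for ex in examples)
    return hits / len(examples)


print(accuracy(candidate, SPEC))  # prints: 1.0
```

Passing the whole Input list to the candidate, rather than a single string, keeps the same harness usable for Multicolumn benchmarks.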

Split.Text

A Split.Text benchmark consists of two files:

  1. A text file input.txt containing the raw string from which to extract a table. Example:

    (58.326261139, 89.99508561)
    (65.889370802, 72.93175018)
    
  2. A JSON file output.json describing the tabular structure of (1). Note that nonconforming rows are represented as lists populated with null in each element, so that every constituent list in Rows contains the same number of elements. Example:

    {
      "Rows": [
        [
          "(",
          "58.326261139",
          ", ",
          "89.99508561",
          ")"
        ],
        [
          "(",
          "65.889370802",
          ", ",
          "72.93175018",
          ")"
        ],
        [
          null,
          null,
          null,
          null,
          null
        ]
      ]
    }
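
A minimal sketch of reading this structure follows, with the example above inlined in place of output.json. The helper names column_count and is_nonconforming are hypothetical. The sketch checks that all rows share one width and separates the all-null rows from the rest.

```python
import json

# The output.json example shown above, inlined so the sketch is self-contained.
OUTPUT = json.loads("""
{"Rows": [
  ["(", "58.326261139", ", ", "89.99508561", ")"],
  ["(", "65.889370802", ", ", "72.93175018", ")"],
  [null, null, null, null, null]
]}
""")


def column_count(table):
    """Return the shared row width, raising if the rows are ragged."""
    widths = {len(row) for row in table["Rows"]}
    if len(widths) > 1:
        raise ValueError(f"ragged rows: widths {sorted(widths)}")
    return widths.pop() if widths else 0


def is_nonconforming(row):
    """A nonconforming row has null (None in Python) in every cell."""
    return all(cell is None for cell in row)


conforming = [row for row in OUTPUT["Rows"] if not is_nonconforming(row)]
print(column_count(OUTPUT), len(conforming))  # prints: 5 2
print("".join(conforming[0]))  # prints: (58.326261139, 89.99508561)
```

In this example, joining the cells of a conforming row reproduces the corresponding input line; whether that holds for every benchmark is not stated here, so treat it as an observation rather than a guarantee.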

Extraction.Text

An Extraction.Text benchmark consists of three or more files:

  1. A text file input.txt containing the raw string from which extraction is to be performed. Example:

    Header0
        Subheading0a
    	Subheading0aa
    	Subheading0ba
        Subheading0b
    	Subheading0ba
    	Subheading0ca
    Header1
        Subheading1a
    	Subheading1aa
    	Subheading1ab
    Header2
        Subheading2a
    	Subheading2aa
    	Subheading2ab
        Subheading2b
    	Subheading2ba
    	Subheading2bb
    Header3
        Subheading3a
        Subheading3b
    	Subheading3ba
        Subheading3c
    	Subheading3ca
    	Subheading3cb
  2. A JSON file output.json specifying the tree structure of (1). Example:

    {
      "Property": "root",
      "Start": 0,
      "End": 363,
      "Children": [
        {
          "Property": "HeaderStruct",
          "Start": 0,
          "End": 102,
          "Children": [
            {
              "Property": "Header",
              "Start": 0,
              "End": 7,
              "Value": "Header0"
            },
            {
              "Property": "SubHeaderStruct",
              "Start": 12,
              "End": 59,
              "Children": [
                {
                  "Property": "SubHeader",
                  "Start": 12,
                  "End": 24,
                  "Value": "Subheading0a"
                },
                {
                  "Property": "SubSubHeaderStruct",
                  "Start": 26,
                  "End": 41,
                  "Children": [
                    {
                      "Property": "SubSubHeader",
                      "Start": 26,
                      "End": 39,
                      "Value": "Subheading0aa"
                    }
                  ]
                },
                {
                  "Property": "SubSubHeaderStruct",
                  "Start": 41,
                  "End": 59,
                  "Children": [
                    {
                      "Property": "SubSubHeader",
                      "Start": 41,
                      "End": 54,
                      "Value": "Subheading0ba"
                    }
                  ]
                }
              ]
            },
    ...

    Each node has a Property field containing a descriptive label. Each node also has Start and End indices denoting the node's corresponding character extent from input.txt. Each leaf node additionally has a Value field with corresponding substring, while each non-leaf node instead has a Children field containing a list of its subnodes. The distinguished Property value root is reserved for the root node that covers the entire input string.

  3. One or more JSON files with the naming scheme <ancestor>.<descendant>.spec.json. Each contains examples for extracting strings with Property <descendant> from strings with Property <ancestor>.

    Each .spec.json file denotes one of two possible kinds of extractions. If Kind is Sequence, then the extraction is of type string -> list of string, and the task is to extract a list of substrings from the input string. An instance of Microsoft.ProgramSynthesis.Extraction.Text.SequenceProgram in the PROSE SDK can be one solution to such a task. Example:

    {
      "Kind": "Sequence",
      "Examples": [
        {
          "Input": [
            {
              "Start": 0,
              "End": 363,
              "Value": "Header0\n    Subheading0a\n\tSubheading0aa\n\tSubheading0ba\n    Subheading0b\n\tSubheading0ba\n\tSubheading0ca\nHeader1\n    Subheading1a\n\tSubheading1aa\n\tSubheading1ab\nHeader2\n    Subheading2a\n\tSubheading2aa\n\tSubheading2ab\n    Subheading2b\n\tSubheading2ba\n\tSubheading2bb\nHeader3\n    Subheading3a\n    Subheading3b\n\tSubheading3ba\n    Subheading3c\n\tSubheading3ca\n\tSubheading3cb\n"
            }
          ],
          "Output": [
            [
              {
                "Start": 0,
                "End": 7,
                "Value": "Header0"
              },
              {
                "Start": 102,
                "End": 109,
                "Value": "Header1"
              },
              {
                "Start": 157,
                "End": 164,
                "Value": "Header2"
              },
              {
                "Start": 259,
                "End": 266,
                "Value": "Header3"
              }
            ]
          ]
        }
      ]
    }

    In this case, each input string in Input has a corresponding list of output strings in Output to be extracted from it.

    If Kind is Field, then the extraction is of type string -> string, and the task is to extract a substring from the input string. An instance of Microsoft.ProgramSynthesis.Extraction.Text.RegionProgram in the PROSE SDK can be one solution to such a task. Example:

    {
      "Kind": "Field",
      "Examples": [
        {
          "Input": [
            {
              "Start": 0,
              "End": 102,
              "Value": "Header0\n    Subheading0a\n\tSubheading0aa\n\tSubheading0ba\n    Subheading0b\n\tSubheading0ba\n\tSubheading0ca\n"
            },
            {
              "Start": 102,
              "End": 157,
              "Value": "Header1\n    Subheading1a\n\tSubheading1aa\n\tSubheading1ab\n"
            },
            {
              "Start": 157,
              "End": 259,
              "Value": "Header2\n    Subheading2a\n\tSubheading2aa\n\tSubheading2ab\n    Subheading2b\n\tSubheading2ba\n\tSubheading2bb\n"
            },
            {
              "Start": 259,
              "End": 363,
              "Value": "Header3\n    Subheading3a\n    Subheading3b\n\tSubheading3ba\n    Subheading3c\n\tSubheading3ca\n\tSubheading3cb\n"
            }
          ],
          "Output": [
            {
              "Start": 0,
              "End": 7,
              "Value": "Header0"
            },
            {
              "Start": 102,
              "End": 109,
              "Value": "Header1"
            },
            {
              "Start": 157,
              "End": 164,
              "Value": "Header2"
            },
            {
              "Start": 259,
              "End": 266,
              "Value": "Header3"
            }
          ]
        }
      ]
    }

    In this case, Input is a list of strings, and its corresponding Output is a list of substrings, one for each respective input string.
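
The examples above suggest that Start and End are zero-based offsets into input.txt, with End exclusive, and that output offsets are absolute rather than relative to the input region. Under that reading, a spec can be sanity-checked as sketched below. The helper names region_matches and check_spec are hypothetical, the Field example is trimmed to two inputs for brevity, and the sketch assumes offsets count characters in the same way Python string indices do, which holds for the ASCII data shown here.

```python
import json

# Field example from above, trimmed to its first two inputs and outputs.
SPEC = json.loads(r"""
{"Kind": "Field", "Examples": [{
  "Input": [
    {"Start": 0, "End": 102, "Value": "Header0\n    Subheading0a\n\tSubheading0aa\n\tSubheading0ba\n    Subheading0b\n\tSubheading0ba\n\tSubheading0ca\n"},
    {"Start": 102, "End": 157, "Value": "Header1\n    Subheading1a\n\tSubheading1aa\n\tSubheading1ab\n"}
  ],
  "Output": [
    {"Start": 0, "End": 7, "Value": "Header0"},
    {"Start": 102, "End": 109, "Value": "Header1"}
  ]
}]}
""")


def region_matches(parent, region):
    """Check that `region` lies inside `parent` and that its Value is the
    substring at the same offsets (Start inclusive, End exclusive)."""
    lo = region["Start"] - parent["Start"]
    hi = region["End"] - parent["Start"]
    if lo < 0 or hi > len(parent["Value"]):
        return False
    return parent["Value"][lo:hi] == region["Value"]


def check_spec(spec):
    """Return True if every output region is consistent with its input."""
    for example in spec["Examples"]:
        for parent, out in zip(example["Input"], example["Output"]):
            # Field: one region per input; Sequence: a list of regions per input.
            regions = out if spec["Kind"] == "Sequence" else [out]
            if not all(region_matches(parent, r) for r in regions):
                return False
    return True


print(check_spec(SPEC))  # prints: True
```

The same region_matches check could be applied to the nodes of output.json, using the root node's extent as the parent.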


This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
