ACE2005-toolkit

ACE 2005 data preprocessing

File structure

ACE2005-toolkit
├── ace_2005 (the ACE2005 raw data)
│   ├── data
│   │   └── ...
│   ├── docs
│   │   └── ...
│   ├── dtd
│   │   └── ...
│   └── index.html
├── cache_data (empty before run)
│   ├── Arabic/
│   ├── Chinese/
│   └── English/
├── filelist (train/dev/test doc files)
│   ├── ace.ar.dev
│   ├── ace.ar.test
│   ├── ace.ar.train
│   ├── ace.en.dev
│   ├── ace.en.test
│   ├── ace.en.train
│   ├── ace.zh.dev
│   ├── ace.zh.test
│   └── ace.zh.train
│
├── output (final output, empty before run)
│   ├── BIO (BIO output)
│   │   ├── train/
│   │   ├── test/
│   │   └── dev/
│   └── ...
├── udpipe (udpipe files)
│   ├── arabic-padt-ud-2.5-191206
│   ├── chinese-gsd-ud-2.5-191206
│   └── english-ewt-ud-2.5-191206
├── ace_parser.py
├── extract.py
├── format.py
├── transform.py
├── udpipe.py
├── requirements.txt
└── run.sh

Preprocessing steps

  1. Download the ACE2005 raw data and rename the directory to ace_2005;
  2. Install the requirements with pip install -r requirements.txt;
  3. Start preprocessing with bash run.sh en; en can be replaced by zh or ar;
  4. Enter n to split the data according to filelist, or enter y followed by the train/dev/test ratios (e.g. 0.8 0.1 0.1) to split the data by sentences;
  5. Enter y to transform the data into BIO format. The transformed data will be in output/BIO/, and each train (test or dev) set will be transformed into 4 BIO-style JSON files (token, entity_BIO, event_trigger_BIO and event_argument_BIO);
  6. The final output will be in the output/ directory.
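
The sentence-level split in step 4 can be pictured with the sketch below. It is only an illustration of dividing a list by train/dev/test ratios such as 0.8 0.1 0.1; the function name `split_by_ratio`, the shuffling, and the `seed` parameter are assumptions and may not match what run.sh actually does.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_ratio(
    items: Sequence[T],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Hypothetical helper: split items into train/dev/test by ratio."""
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1")

    # Shuffling is an assumption; the toolkit may keep the original order.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to test
    return train, dev, test


sentences = [f"sentence {i}" for i in range(10)]
train, dev, test = split_by_ratio(sentences)
print(len(train), len(dev), len(test))
```

Giving the remainder to the test set keeps every sentence in exactly one split even when the ratios do not divide the sentence count evenly.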

Output format

The output is saved as separate files in output/, and each file can be loaded with json.loads(). After loading, the data is a Python list, and each entry is a Python dict:

{
    "sentence": "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region.",
    "tokens": [
        "Orders",
        "went",
        "out",
        "today",
        "to",
        "deploy",
        "17,000",
        "U.S.",
        "Army",
        "soldiers",
        "in",
        "the",
        "Persian",
        "Gulf",
        "region",
        "."
    ],
    "golden-entity-mentions": [
        {
            "entity-id": "CNN_CF_20030303.1900.02-E4-186",
            "entity-type": "GPE:Nation",
            "text": "U.S",
            "sent_id": "4",
            "position": [
                7,
                7
            ],
            "head": {
                "text": "U.S",
                "position": [
                    7,
                    7
                ]
            }
        },
        ...
    ],
    "golden-event-mentions": [
        {
            "event-id": "CNN_CF_20030303.1900.02-EV1-1",
            "event_type": "Movement:Transport",
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [
                        6,
                        9
                    ],
                    "role": "Artifact",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "the Persian Gulf region",
                    "sent_id": "4",
                    "position": [
                        11,
                        15
                    ],
                    "role": "Destination",
                    "entity-id": "CNN_CF_20030303.1900.02-E76-191"
                }
            ],
            "text": "Orders went out today to deploy 17,000 U.S. Army soldiers\nin the Persian Gulf region",
            "sent_id": "4",
            "position": [
                0,
                15
            ],
            "trigger": {
                "text": "deploy",
                "position": [
                    5,
                    5
                ]
            }
        },
        ...
    ],
    "golden-relation-mentions": [
        {
            "relation-id": "CNN_CF_20030303.1900.02-R1-1",
            "relation-type": "ORG-AFF:Employment",
            "text": "17,000 U.S. Army soldiers",
            "sent_id": "4",
            "position": [
                6,
                9
            ],
            "arguments": [
                {
                    "text": "17,000 U.S. Army soldiers",
                    "sent_id": "4",
                    "position": [
                        6,
                        9
                    ],
                    "role": "Arg-1",
                    "entity-id": "CNN_CF_20030303.1900.02-E25-1"
                },
                {
                    "text": "U.S. Army",
                    "sent_id": "4",
                    "position": [
                        7,
                        8
                    ],
                    "role": "Arg-2",
                    "entity-id": "CNN_CF_20030303.1900.02-E66-157"
                }
            ]
        }, 
        ...
    ]
}

The output files contain all the golden annotations for entities, events and relations.
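
As a rough illustration of how a record in this format could be consumed, the sketch below parses one record with json.loads() and derives token-level BIO tags from "golden-entity-mentions". It assumes that "position" holds inclusive [start, end] token indices, as the example above suggests. The helper `entity_bio_tags` and the shortened sample record (with re-indexed positions) are hypothetical, and the toolkit's own BIO files may use a different tag scheme.

```python
import json
from typing import Dict, List


def entity_bio_tags(record: Dict) -> List[str]:
    """Hypothetical helper: one BIO tag per token from entity mentions."""
    tags = ["O"] * len(record["tokens"])
    for mention in record.get("golden-entity-mentions", []):
        # Assumption: position is [start, end] with both ends inclusive.
        start, end = mention["position"]
        label = mention["entity-type"]
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


# Shortened, hypothetical record in the same shape as the example above.
sample = json.loads("""
{
  "tokens": ["deploy", "17,000", "U.S.", "Army", "soldiers"],
  "golden-entity-mentions": [
    {"entity-type": "GPE:Nation", "text": "U.S", "position": [2, 2]}
  ]
}
""")

for token, tag in zip(sample["tokens"], entity_bio_tags(sample)):
    print(token, tag)
```

Nested or overlapping mentions simply overwrite each other in this sketch, so a real conversion would need a rule for which mention wins.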

Adjustment

You can edit the file names listed in filelist/ to change which documents belong to train/dev/test. By default we use a 529/30/40 division.
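
After editing the lists, a quick count per split can confirm the division. The sketch below assumes each file in filelist/ holds one document name per line, which the README does not state; the helpers `read_filelist` and `split_sizes` are hypothetical and not part of the toolkit.

```python
from pathlib import Path
from typing import Dict, List


def read_filelist(path: Path) -> List[str]:
    """Return the non-empty lines of one filelist file.

    Assumption: one document name per line.
    """
    with path.open(encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def split_sizes(filelist_dir: Path, lang: str = "en") -> Dict[str, int]:
    """Count the documents listed for each split of one language."""
    sizes = {}
    for split in ("train", "dev", "test"):
        # File names follow the ace.<lang>.<split> pattern shown in the tree.
        sizes[split] = len(read_filelist(filelist_dir / f"ace.{lang}.{split}"))
    return sizes


if __name__ == "__main__":
    print(split_sizes(Path("filelist"), "en"))
```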

Related work

Email us

If you have any questions, contact us at [email protected].

ace2005-toolkit's Issues

Chinese BIO conversion

@Clearailhc, thanks for sharing.
When running the Chinese BIO conversion, a list-index-out-of-range error is reported, while the English conversion does not produce this error.
