Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines

This repository contains data and additional information regarding the paper: Borchmann, Łukasz; Wiśniewski, Dawid; Gretkowski, Andrzej; Kosmala, Izabela; Jurkiewicz, Dawid; Szałkiewicz, Łukasz; Pałka, Gabriela; Kaczmarek, Karol; Kaliska, Agnieszka; Graliński, Filip. Contract Discovery: Dataset and a Few-Shot Semantic Retrieval Challenge with Competitive Baselines (to appear in Findings of EMNLP).

[Paper, BibTeX]

Semantic Retrieval Challenge

We propose a new shared task of semantic retrieval from legal texts. In this so-called contract discovery task, legal clauses are extracted from documents, given a few examples of similar clauses from other legal acts. The task differs substantially from conventional NLI and from shared tasks on legal information extraction (e.g., one has to identify a text span rather than a single document, page, or paragraph).

The aim of this task is to identify spans in the requested documents (referred to as target documents) that represent clauses analogous to the spans selected in other documents (referred to as seed documents).
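To make the seed/target setup concrete, here is a minimal sketch of a toy span retriever. It is not one of the baselines from the paper: it slides a fixed-size token window over a target document and scores each window by bag-of-words cosine similarity to the seed clauses. The function names, the `window` and `stride` parameters, and the example texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_best_span(seed_clauses, target_text, window=12, stride=4):
    """Return (start_token, end_token, score) for the window closest to the seeds.

    Hypothetical helper: token offsets refer to the output of tokenize().
    """
    seed_bags = [Counter(tokenize(clause)) for clause in seed_clauses]
    tokens = tokenize(target_text)
    best = (0, min(window, len(tokens)), 0.0)
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        end = min(start + window, len(tokens))
        bag = Counter(tokens[start:end])
        # Average the similarity over all seed clauses.
        score = sum(cosine(bag, seed) for seed in seed_bags) / len(seed_bags)
        if score > best[2]:
            best = (start, end, score)
    return best


seeds = ["This agreement shall be governed by the laws of the State of New York."]
target = (
    "The company sells widgets and gadgets to many customers worldwide every year. "
    "This agreement is governed by the laws of the State of Delaware. "
    "Payment is due within thirty days of invoice."
)
start, end, score = find_best_span(seeds, target)
print(" ".join(tokenize(target)[start:end]), round(score, 3))
```

A fixed window is a deliberate simplification. Real clauses vary in length, so a practical system would need to score candidate spans of different sizes.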

The Semantic Retrieval Challenge directory contains a dataset in a form suitable for performing a repeated random sub-sampling validation procedure.
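As a rough illustration of repeated random sub-sampling validation in general, the sketch below draws several independent random splits of a set of annotated examples into a few seed examples and a held-out remainder, then averages a score over the splits. It does not reflect the actual file layout of the challenge directory. The names `repeated_subsampling`, `evaluate`, `n_seeds`, and `n_repeats` are hypothetical.

```python
import random


def repeated_subsampling(examples, n_seeds, n_repeats, rng_seed=0):
    """Yield (seed_examples, held_out_examples) pairs, one per repetition."""
    rng = random.Random(rng_seed)
    for _ in range(n_repeats):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        yield shuffled[:n_seeds], shuffled[n_seeds:]


def evaluate(examples, score_fn, n_seeds=3, n_repeats=5):
    """Average score_fn(seeds, held_out) over independent random splits."""
    scores = [
        score_fn(seeds, held_out)
        for seeds, held_out in repeated_subsampling(examples, n_seeds, n_repeats)
    ]
    return sum(scores) / len(scores)


# Placeholder data and scorer: ten dummy examples, scored by held-out size.
print(evaluate(list(range(10)), lambda seeds, held_out: len(held_out)))
```

Unlike k-fold cross-validation, the splits are drawn independently, so an example may land in the seed set in several repetitions or in none.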

[More information]

Corpus of Legal Documents, Language Models

We release a large, cleaned, plain-text corpus of legal and financial texts for unsupervised model training or fine-tuning. All documents available in US EDGAR as of November 19, 2018, were crawled. The resulting corpus consists of approximately 1M documents and 2B words in total (1.5G of text after xz compression).
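Because the corpus is distributed xz-compressed, it can be streamed without unpacking it to disk. The sketch below assumes a single xz-compressed UTF-8 text file, which may not match the actual archive layout. The function name and the path in the usage comment are hypothetical.

```python
import lzma


def count_words(path):
    """Stream an xz-compressed UTF-8 text file and count whitespace-separated words."""
    total = 0
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            total += len(line.split())
    return total


# Usage (hypothetical path): count_words("legal_corpus.txt.xz")
```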

In addition, our GPT-1 and GPT-2 large models, fine-tuned on legal documents, are now publicly available. After unpacking, the checkpoints can be loaded with the Transformers library. Use them with the default openai-gpt and gpt2-large tokenizers.
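A minimal loading sketch follows. It assumes the unpacked directory holds a configuration file and weights in a format that `from_pretrained` accepts, which has not been verified against the released checkpoints. The helper name, the `kind` labels, and the directory path are hypothetical.

```python
# Default tokenizer names from the text above, keyed by a hypothetical short label.
TOKENIZER_NAMES = {"gpt1": "openai-gpt", "gpt2": "gpt2-large"}


def load_legal_model(checkpoint_dir, kind):
    """Load a fine-tuned checkpoint together with its default tokenizer.

    Assumes checkpoint_dir is the unpacked checkpoint directory.
    """
    # Imported lazily so the module can be read without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAMES[kind])
    model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
    return tokenizer, model


# Usage (hypothetical path):
# tokenizer, model = load_legal_model("path/to/unpacked/legal-gpt2", "gpt2")
```

The tokenizer is fetched by its default name rather than from the checkpoint directory, following the instruction above to use the stock tokenizers.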

[Legal Corpus, Legal GPT-1, Legal GPT-2]
