Open Phrasebank

Building your own phrasebank. ✨

This repository provides open, ready-to-use phrase banks: collections of frequently used phrases that can feed, for example, the auto-complete function of an IDE or editor. (Note: this library does not provide an IDE or auto-complete functionality itself; it only offers the phrasebanks.)

The repository also includes tools for building a phrase bank from your own text or from an open corpus.
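As an illustration of the auto-complete use case, here is a minimal sketch of prefix lookup over a sorted phrase list. It uses only the Python standard library and is not part of openphrasebank; the function name `complete` and the sample phrases are hypothetical.

```python
from bisect import bisect_left


def complete(prefix, phrases, limit=5):
    """Return up to `limit` phrases that start with `prefix`.

    `phrases` must be sorted: a binary search finds the first candidate,
    then a short scan collects the contiguous matches.
    """
    start = bisect_left(phrases, prefix)
    matches = []
    for phrase in phrases[start:]:
        if not phrase.startswith(prefix):
            break
        matches.append(phrase)
        if len(matches) == limit:
            break
    return matches


# Hypothetical sample; in practice, read the lines of a downloaded phrasebank file.
phrasebank = sorted([
    "as a result of",
    "as shown in",
    "as well as",
    "in terms of",
    "on the other hand",
])

print(complete("as ", phrasebank))
```

Because the phrasebank files are plain sorted lines, the same lookup should work after reading a file into a list, assuming it is sorted the same way Python sorts strings.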

Why Use a Phrase Bank

Boosting Typing Experience with Phrasebank 🚀

Academic Writing 🕵️‍♀️

You can further customize a phrasebank to your needs, for example for a particular discipline, writing style (descriptive, analytical, persuasive, or critical), or section (abstract, body text), as long as you can find good source material.

Open Phrasebanks

Academic Phrasebank

Elsevier OA CC-BY contains 40k articles from Elsevier's journals, ranging from Arts and Business to STEM and Social Sciences [1].

| No. | Phrasebank | Source | N-gram Length | Lines | Comments |
|-----|------------|--------|---------------|-------|----------|
| 1 | academic_phrasebank | Book Academic Phrasebank 2014 | 2-5 | 2,190 | Extracted from PDF (Zhihao, 2024) |
| 2 | elsevier_phrasebank | Corpus Elsevier OA CC-BY 2020 | 2-6 | 3,792 | Extracted by n-gram (Zhihao, 2024) |
| 3 | bawe_1000.csv | Corpus British Academic Written English | 4-6 | 1,000 | Because the corpus is not openly accessible, only the 1,000 most frequent entries are listed here. (Zhihao, 2024) |
| 4 | academic_word_list | Academic Word List, Coxhead (2000) | 1 | 570 | The 570 words of academic English (excluding the 2,000 most frequent words) |
| 5 | elsevier_awl | 2, 4 | 2-6 | 994 | The Elsevier phrasebank entries that contain AWL words (Zhihao, 2024) |
| 6 | elsevier_ENVI_EART | 2 | 2-7 | 3,700 | Environment & Earth Science 3,700 collection (Zhihao, 2024) |
| 7 | elsevier_PSYC_SOCI | 2 | 2-7 | 3,700 | Social Science & Psychology 3,700 collection (Zhihao, 2024) |
| 8 | elsevier_MEDI | 2 | 2-7 | 3,700 | Medicine 3,700 collection (Zhihao, 2024) |
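Row 5 combines a phrasebank with the Academic Word List. How that file was actually produced is not documented here; the following is only a sketch of one plausible approach, keeping phrases that contain at least one listed word. The function name and sample data are hypothetical, and matching is by exact token, so inflected forms would need extra handling.

```python
def phrases_with_words(phrases, word_list):
    """Keep phrases in which at least one token appears in `word_list`."""
    words = {word.lower() for word in word_list}
    return [
        phrase for phrase in phrases
        if any(token in words for token in phrase.lower().split())
    ]


# Hypothetical samples standing in for the phrasebank and the word list.
sample_phrases = ["a significant role in", "as well as", "the analysis of"]
sample_words = ["analysis", "significant"]

print(phrases_with_words(sample_phrases, sample_words))
```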

English Frequent Phrasebank

| No. | Phrasebank | Source | N-gram Length | Lines | Comments |
|-----|------------|--------|---------------|-------|----------|
| 1 | google-10000-english | Google Books Corpus | 1 | 10,000 | The 10,000 most common English words from Google Books Corpus |
| 2 | Wordlist 1200.txt | Internet | 1 | 2,000 | The 2,000 most common English words |

Other Phrasebank

| No. | Phrasebank | Source | N-gram Length | Lines | Comments |
|-----|------------|--------|---------------|-------|----------|
| 1 | emoji | | 1 | 745 | (Zhihao, 2024) |

Quickstart

You can download the pre-made phrasebanks from the tables above. If you need a custom one, install the package and read on.

pip install openphrasebank

Get a Self-defined Phrasebank in 3 Steps

Below is an example based on n-gram frequency. More examples, such as extracting phrases from a PDF, are available in the documentation.

1️⃣ Load and Tokenize the Data

import openphrasebank as opb

tokens_gen = opb.load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by",
                                        subject_areas=['PSYC', 'SOCI'],
                                        keys=['title', 'abstract', 'body_text'],
                                        save_cache=True,
                                        cache_file='temp_tokens.json')

2️⃣ Generate N-grams

n_values = [1,2,3,4,5,6,7,8]
# Keep the result: the filtering step indexes it by n-gram length
ngram_freqs = opb.generate_multiple_ngrams(tokens_gen, n_values)
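To make the idea behind this step concrete, here is a minimal standard-library sketch of n-gram frequency counting. It is not the implementation of `generate_multiple_ngrams`; the helper `count_ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, joined with spaces."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Hypothetical sample; a real run would use the tokens loaded in step 1.
tokens = "the results of this study show that the results of".split()

# One counter per n-gram length, keyed by n.
ngram_counts = {n: count_ngrams(tokens, n) for n in (2, 3)}
print(ngram_counts[2].most_common(2))
```

Keeping one counter per n-gram length makes it easy to apply a separate top limit to each length in the next step.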

3️⃣ Filter and save

# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}

# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = opb.filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)

# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))

# Write the sorted phrases to a text file, one phrase per line
with open('../elsevier_phrasebank_PSYC_SOCI.txt', 'w') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
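The filtering step combines a per-length cap with a minimum frequency. A rough sketch of that logic follows, assuming the input is a `collections.Counter`. The helper `top_ngrams` is hypothetical: it mirrors the `(phrases, frequencies)` return shape used above, but the behavior of the library's `filter_frequent_ngrams` may differ.

```python
from collections import Counter


def top_ngrams(freqs, limit, min_freq=1):
    """Keep the `limit` most frequent n-grams that occur at least `min_freq` times."""
    kept = [
        (gram, count) for gram, count in freqs.most_common()
        if count >= min_freq
    ][:limit]
    phrases = [gram for gram, _ in kept]
    counts = [count for _, count in kept]
    return phrases, counts


# Hypothetical frequencies for illustration only.
sample_freqs = Counter({"in terms of": 40, "as well as": 35, "on the basis": 5})
sample_phrases, sample_counts = top_ngrams(sample_freqs, limit=2, min_freq=20)
print(sample_phrases, sample_counts)
```

Applying the minimum frequency before the cap means a rare n-gram never fills a slot just because few candidates exist for that length.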

How to Contribute

You can contribute either phrasebanks or code. Check out our contributing guide.

Known Issues

| Phrasebank | Issues |
|------------|--------|
| academic_phrasebank | Tables in the PDF file were not handled properly, so many sentences were not extracted correctly. (zhihao) |
| elsevier_phrasebank | |

Footnotes

  1. Over 20 disciplines. See orieg/elsevier-oa-cc-by (Datasets at Hugging Face).
