🪐 langWhich: NER for Programming Languages

This project seeks to detect programming languages using datasets from Stack Overflow and Reddit.

📋 Abstract

The goal of the project is to build a model that can be used generally. Here it is applied specifically as a Named Entity Recognition exercise on Stack Overflow and Reddit posts, with Sentiment Analysis as a vertical built on top. It is an attempt to investigate how different communities think about different programming languages.

Two models are evaluated in this project: a pattern-matching model and a spaCy NER model. The aim is to compare rule-based and machine-learning approaches to NLP.
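To make the rule-based side of that comparison concrete, here is a minimal sketch of a pattern-matching detector written with the standard library only. It is not the project's implementation (which builds its rule model with spaCy); the `LANGUAGES` list, the `PROG_LANG` label, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical vocabulary; the project's real rules live in
# configs/proglang_patterns.jsonl.
LANGUAGES = ["Python", "JavaScript", "Java", "Rust", "Go", "C++"]


def build_matcher(names):
    """Compile one case-insensitive regex that matches any listed name."""
    # Longest names first so "JavaScript" is tried before "Java".
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds instead of \b so names ending in symbols (C++) still match
    # and names are never matched inside a longer word.
    return re.compile(rf"(?<![\w+#])(?:{alternatives})(?![\w+#])", re.IGNORECASE)


def find_languages(text, matcher):
    """Return NER-style (start, end, label) character spans."""
    return [(m.start(), m.end(), "PROG_LANG") for m in matcher.finditer(text)]


matcher = build_matcher(LANGUAGES)
text = "I moved from Java to JavaScript, then tried Rust."
spans = find_languages(text, matcher)
print([text[start:end] for start, end, _ in spans])
# ['Java', 'JavaScript', 'Rust']
```

Rules like these are cheap and predictable, but they cannot tell the language "Go" from the verb "go". That kind of ambiguity is where a trained NER model, which uses context, can be expected to differ.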

To package the trained model so that it can be installed and run on your local machine, use the spacy project run package command.

🗂 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| preprocess | Convert the data to spaCy's binary format |
| patternmod | Generate a named entity recognition model based on rules |
| train | Train a named entity recognition model |
| evaluate | Evaluate the model and export metrics |
| package | Package the trained model so it can be installed |
| show-stats | Show the statistics that compare both models |
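The rule that commands are only re-run when their inputs have changed is handled by spaCy itself. The sketch below only illustrates the general idea using file checksums; the lock-file layout, the file names, and the helper names `needs_rerun` and `record_run` are assumptions for the example, not spaCy's internals.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_checksum(path):
    """Hash a file's bytes so content changes can be detected."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_lock(lock_path):
    """Read the recorded checksums, or an empty dict on first use."""
    return json.loads(lock_path.read_text()) if lock_path.exists() else {}


def needs_rerun(name, deps, lock_path):
    """True when the command has no record or any input checksum differs."""
    current = {str(dep): file_checksum(dep) for dep in deps}
    return load_lock(lock_path).get(name) != current


def record_run(name, deps, lock_path):
    """Store the input checksums after a successful run."""
    lock = load_lock(lock_path)
    lock[name] = {str(dep): file_checksum(dep) for dep in deps}
    lock_path.write_text(json.dumps(lock, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.jsonl"
    lock_file = Path(tmp) / "demo.lock"
    data.write_text('{"text": "I like Rust"}\n')

    first = needs_rerun("preprocess", [data], lock_file)   # never run before
    record_run("preprocess", [data], lock_file)
    second = needs_rerun("preprocess", [data], lock_file)  # inputs unchanged
    data.write_text('{"text": "I like Go"}\n')
    third = needs_rerun("preprocess", [data], lock_file)   # input edited

print(first, second, third)  # True False True
```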

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | preprocess → patternmod → train → evaluate |

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/stackoverflow-train.jsonl | Local | JSONL-formatted training data |
| assets/stackoverflow-valid.jsonl | Local | JSONL-formatted validation data |
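JSONL stores one JSON object per line, so the assets can be streamed record by record. The sketch below shows one way to read such a file and pull out entity offsets. The record schema used here (a `text` field plus a `spans` list with `start`, `end`, and `label`) is an assumption for illustration; the actual fields in the project's assets may differ.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed record per non-empty line."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_no}: invalid JSON ({err.msg})") from err


def to_entities(record):
    """Convert the assumed 'spans' field to (start, end, label) tuples."""
    return [(s["start"], s["end"], s["label"]) for s in record.get("spans", [])]


# In-memory stand-in for a file such as assets/stackoverflow-train.jsonl.
sample = io.StringIO(
    '{"text": "Rust is fast", "spans": [{"start": 0, "end": 4, "label": "PROG_LANG"}]}\n'
    "\n"
    '{"text": "No language here", "spans": []}\n'
)

records = list(read_jsonl(sample))
for record in records:
    print(record["text"], to_entities(record))
```

With a real file, `open(path, encoding="utf-8")` can be passed in place of the `io.StringIO` object.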

🗂 Config Files

The following configuration files are defined by the project.

| File | Source | Description |
| --- | --- | --- |
| configs/config.cfg | Local | CFG-formatted base config |
| configs/proglang_patterns.jsonl | Local | JSONL-formatted rule patterns |
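For orientation, spaCy's rule-based entity matching reads pattern files in which each line is a JSON object with a `label` and a `pattern`. The sketch below generates a few such lines. The `PROG_LANG` label, the language names, and the `phrase_pattern` helper are assumptions; the contents of the project's own patterns file are not shown in this README.

```python
import json


def phrase_pattern(label, name):
    """Build a token pattern: one case-insensitive dict per word."""
    tokens = [{"LOWER": word.lower()} for word in name.split()]
    return {"label": label, "pattern": tokens}


# Hypothetical names, used only to show the file layout.
names = ["Python", "Visual Basic", "Rust"]
lines = [json.dumps(phrase_pattern("PROG_LANG", name)) for name in names]
print("\n".join(lines))
```

Splitting on whitespace is a simplification. Names that contain punctuation may be tokenized differently and would likely need hand-written patterns.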

🗂 Scripts

The following Python scripts are defined by the project.

| File | Source | Description |
| --- | --- | --- |
| scripts/preprocess.py | Local | Pre-processing script |
| scripts/save_pattern_model.py | Local | Pattern NER script |
| scripts/print_stats.py | Local | Results comparison script |
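A results comparison of this kind usually comes down to precision, recall, and F1 over predicted entity spans. The sketch below shows that calculation for two sets of predictions. It is not the contents of scripts/print_stats.py, and the spans are toy values invented for the example, not results from the project.

```python
def prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy spans as (doc_id, start, end); purely illustrative.
gold = {(0, 13, 17), (0, 21, 31), (1, 0, 4)}
rule_pred = {(0, 13, 17), (0, 21, 31), (1, 10, 12)}  # one miss, one false hit
ner_pred = {(0, 13, 17), (0, 21, 31), (1, 0, 4)}     # matches gold exactly

print(f"{'model':<10}{'P':>6}{'R':>6}{'F1':>6}")
for model_name, pred in [("rules", rule_pred), ("ner", ner_pred)]:
    p, r, f = prf(gold, pred)
    print(f"{model_name:<10}{p:>6.2f}{r:>6.2f}{f:>6.2f}")
```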

📋 Command Line Interface

The commands and workflows can be used with the CLI as follows:

Initialize: project run

Command Execution: project preprocess

Workflow Execution: project all

Metrics: project show-stats

📋 References

spaCy and spaCy Projects: Documentation

Explosion Templates: GitHub Repository

Vincent Warmerdam: GitHub

📋 Note

Part of this documentation has been auto-generated using the spacy project document command!

Contributors

kunal-bhar
