Context-based Classification of Software Mentions in Scientific Data (Bio-medical & Life Sciences)

Motivation

  • To find out which software and frameworks researchers use most in their work
  • To learn about software availability and the attribution of software developers

Objective

    • To classify software mentions appearing in text extracted from biomedical and social sciences articles and research papers.
    • To provide a basis for building a software knowledge graph.

    Classification

    Software mentions are categorized into four classes according to their context:
    1. Usage: the software is actually used by the researcher.
    2. Mention: the software is only mentioned, disclosed, or referred to by the researcher, but not actually used.
    3. Creation: the software is developed by the researcher.
    4. Deposition: the software is created by the researcher and then deposited somewhere for future availability.

    Dataset Copyrights

    This dataset was originally created by developers at Rostock University and is their property. For commercial re-use of this data, contact the university administration.

    Data Pre-processing

    1. Annotated software mentions using the Brat Annotation Tool
      • 1,727 files were annotated
      • 5,309 sentences were annotated and extracted to build the feature space
    2. Converted Brat Standoff Format to BIO Encoded Format
      • Relevant Sentence Extraction
      • Sentence Tokenization
      • BIO Encoding
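
The conversion from Brat standoff annotations to BIO tags can be sketched as follows. This is a minimal illustration, not the project's actual code: the entity label `Software`, the regex tokenizer, and the function names `parse_brat_ann` and `bio_encode` are assumptions made for the example, and discontinuous Brat spans are skipped.

```python
import re

def parse_brat_ann(ann_text):
    """Return (label, start, end) for each text-bound annotation (lines starting with 'T')."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, and notes
        type_and_offsets = line.split("\t")[1]
        if ";" in type_and_offsets:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = type_and_offsets.split(" ")
        spans.append((label, int(start), int(end)))
    return spans

def bio_encode(text, spans):
    """Tokenize with character offsets (assumed regex tokenizer), then assign BIO tags."""
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for label, start, end in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged

# Hypothetical sentence and annotation line, made up for illustration.
text = "We analysed the data with SPSS Statistics 22."
ann = "T1\tSoftware 26 41\tSPSS Statistics"
encoded = bio_encode(text, parse_brat_ann(ann))
```

Here `encoded` tags `SPSS` as `B-Software`, `Statistics` as `I-Software`, and every other token as `O`.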

    Feature Engineering

    • Replace software mentions with a placeholder
    • Extract contextual features of software mentions
      • Find the position of each software mention
      • Extract contextual words using a window_size of 3
      • Pad if needed
    • Generate word embeddings of contextual words
      • Used a pre-trained model (wikipedia pubmed and PMC w2v.bin)
    • Generate word embeddings for POS tags of contextual words
    • Generate class-specific features
      • Frequent words, frequent tags, etc.
    • Concatenate features
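
The placeholder, context-window, and concatenation steps above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the tokens `<SOFTWARE>` and `<PAD>`, the function names, and the reading of window_size as "3 words on each side of the mention" are all hypothetical, and a plain dict of toy vectors stands in for the pre-trained word2vec model.

```python
PLACEHOLDER = "<SOFTWARE>"  # hypothetical placeholder token
PAD = "<PAD>"               # hypothetical padding token

def replace_mention(tokens, bio_tags):
    """Collapse each B-/I- tagged mention into a single placeholder token."""
    out = []
    for token, tag in zip(tokens, bio_tags):
        if tag.startswith("B-"):
            out.append(PLACEHOLDER)
        elif tag.startswith("I-"):
            continue  # continuation of a mention already replaced
        else:
            out.append(token)
    return out

def context_window(tokens, position, window_size=3):
    """Return window_size tokens on each side of position, padded at sentence edges."""
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    left = [PAD] * (window_size - len(left)) + left
    right = right + [PAD] * (window_size - len(right))
    return left + right

def embed_and_concatenate(words, vectors, dim):
    """Look up each word's vector (zeros if unknown) and concatenate into one feature list."""
    features = []
    for word in words:
        features.extend(vectors.get(word, [0.0] * dim))
    return features

tokens = ["We", "analysed", "the", "data", "with", "SPSS", "Statistics", "22", "."]
tags = ["O", "O", "O", "O", "O", "B-Software", "I-Software", "O", "O"]

replaced = replace_mention(tokens, tags)
context = context_window(replaced, replaced.index(PLACEHOLDER))
# context == ["the", "data", "with", "22", ".", "<PAD>"]

toy_vectors = {"the": [0.1, 0.2], "data": [0.3, 0.4]}  # toy 2-dimensional embeddings
features = embed_and_concatenate(context, toy_vectors, dim=2)
```

The same concatenation idea would extend to POS-tag embeddings and class-specific features by appending their values to the same feature list.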

    Modeling

    • Chose the Random Forest Classifier (RF) from scikit-learn
      • Regarded as a strong classical machine learning algorithm
      • Expected to give good performance and predictability
    • Hyperparameters in RF
      • n_estimators: number of trees in the forest
      • max_depth: maximum depth of each tree
      • criterion: split-quality criterion (e.g. information gain) used at each node split
      • max_features: number of features to consider when looking for the best split at a node
      • min_samples_leaf: minimum number of samples required at a leaf node
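
As a rough sketch of how these hyperparameters map onto scikit-learn's `RandomForestClassifier`, the example below fits a forest on synthetic four-class data. The hyperparameter values and the synthetic dataset are assumptions chosen for illustration; the values actually used in this project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix: 4 classes, like the 4 mention types.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=4, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees (illustrative value)
    max_depth=10,          # maximum depth of each tree (illustrative value)
    criterion="entropy",   # information-gain based split criterion
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=2,    # minimum samples at a leaf (illustrative value)
    random_state=0,
)
clf.fit(X, y)
predictions = clf.predict(X[:5])
```

Limiting `max_depth` and raising `min_samples_leaf` are common ways to restrain how closely the trees fit the training samples.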
    Results

    • Training dataset: 70%, test dataset: 30%
    • F1 scores:

      Data      F1 Score (%)
      Training  97.55
      Test      60.37
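
For reference, a multi-class F1 score can be computed as sketched below. The averaging scheme behind the reported numbers is not stated, so macro averaging (the unweighted mean of per-class F1) is an assumption here, and the labels and predictions are made up for the example.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class from true positives, false positives, and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Hypothetical labels and predictions.
y_true = ["Usage", "Usage", "Mention", "Creation"]
y_pred = ["Usage", "Mention", "Mention", "Creation"]
score = macro_f1(y_true, y_pred)
```

A training score far above the test score, as in the table above, is usually read as a sign of overfitting to the training split.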

    License and Copyright

    This code was written as part of my pre-thesis at Rostock University, Germany. Licensed under the [MIT License](LICENSE).

    Contributors

    zohaibramzan