Classifies software mentions appearing in text extracted from biomedical and social science articles and research papers, and provides a basis for building a software knowledge graph.
Context-based Classification of Software Mentions in Scientific Data (Biomedical & Life Sciences)
Motivation
To find out which software and frameworks researchers use most in their work
To learn about software availability and the attribution of software developers
Objective
To classify software mentions appearing in text extracted from biomedical and social science articles and research papers.
To provide a basis for building a software knowledge graph.
Classification
Software mentions are categorized into four classes according to their context.
Usage: the software is actually used by the researcher.
Mention: the software is only mentioned, disclosed, or referred to by the researcher, but not actually used.
Creation: the software is being developed by the researcher.
Deposition: the software is first created by the researcher and then deposited somewhere for future availability.
Dataset Copyrights
This dataset was originally created by Rostock University developers and is their property.
For commercial re-use of this data, contact the university administration.
Data Pre-processing
Annotated Software mentions using Brat Annotation Tool
1,727 Files were annotated
5,309 Sentences were annotated and extracted for Feature Space
Brat Standoff Format to BIO Encoded Format
Relevant Sentence Extraction
Sentence Tokenization
BIO Encoding
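The conversion from Brat standoff to BIO encoding can be sketched as follows. This is a minimal illustration, not the repository's actual code: the entity type name `Software`, the whitespace tokenizer, and the helper names `parse_brat_entities` and `bio_encode` are assumptions made for the example.

```python
import re


def parse_brat_entities(ann_text):
    """Parse text-bound annotations (lines starting with 'T') from a Brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, and notes
        _id, type_and_span, _surface = line.split("\t", 2)
        if ";" in type_and_span:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = type_and_span.split(" ")
        entities.append((int(start), int(end), label))
    return entities


def bio_encode(sentence, entities):
    """Return (token, tag) pairs for a whitespace-tokenized sentence."""
    encoded = []
    for match in re.finditer(r"\S+", sentence):
        tag = "O"
        for start, end, label in entities:
            if start <= match.start() < end:
                # First token of the span gets B-, later tokens get I-
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        encoded.append((match.group(), tag))
    return encoded


sentence = "We analysed the data with IBM SPSS version 22."
ann = "T1\tSoftware 26 34\tIBM SPSS"  # character offsets into the sentence

for token, tag in bio_encode(sentence, parse_brat_entities(ann)):
    print(token, tag)
```

Character offsets in a real `.ann` file refer to the whole document, so a sentence-level version would need to subtract each sentence's start offset first.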
Feature Engineering
Replace Software Mentions with a placeholder
Extract Software Mentions Contextual Features
Find out Software Mentions Position
Extract Contextual Words as per window_size of 3
Padding if needed
Generate Word Embeddings of Contextual Words
Used the pre-trained model wikipedia-pubmed-and-PMC-w2v.bin
Generate Word Embeddings for POS tags of Contextual Words
Generate Specific Class based features
Frequent Words, Frequent tags etc.
Features Concatenation
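The placeholder replacement, position lookup, context extraction with a window size of 3, and padding steps listed above might look like this. It is a sketch under assumptions: the `<SOFTWARE>` and `<PAD>` tokens and the function names are hypothetical, and the embedding lookup and concatenation steps are left out.

```python
PLACEHOLDER = "<SOFTWARE>"  # hypothetical placeholder token
PAD = "<PAD>"               # hypothetical padding token


def replace_mention(tokens, start, end):
    """Replace tokens[start:end] (the software mention) with a single placeholder."""
    return tokens[:start] + [PLACEHOLDER] + tokens[end:]


def context_window(tokens, window_size=3):
    """Return the mention position and its padded left and right context words."""
    position = tokens.index(PLACEHOLDER)
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    # Pad on the outer side so both contexts always have window_size items
    left = [PAD] * (window_size - len(left)) + left
    right = right + [PAD] * (window_size - len(right))
    return position, left, right


tokens = "We analysed the data with IBM SPSS version 22 .".split()
masked = replace_mention(tokens, 5, 7)  # "IBM SPSS" occupies tokens 5 and 6
print(context_window(masked))

# A mention at the start of the sentence needs left padding
print(context_window(replace_mention("SPSS was used .".split(), 0, 1)))
```

Each context word would then be mapped to its word embedding and POS-tag embedding, and the resulting vectors concatenated with the class-based features into one feature vector per mention.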
Modeling
Chose the Random Forest Classifier (RF) from scikit-learn
Considered the best-suited classical machine learning algorithm for this task
Expected to give good performance and predictive power
Hyperparameters in RF
n_estimators: number of trees in the forest
max_depth: maximum depth of each tree fitted to the samples
criterion: information gain criterion used at each node split
max_features: number of features to consider when looking for the best split at a node
min_samples_leaf: minimum number of samples required at a leaf node
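The hyperparameters above map directly onto the constructor arguments of scikit-learn's `RandomForestClassifier`. The snippet below is only a sketch: the values are illustrative and are not the tuned values used in this project, and the synthetic four-class data stands in for the real feature vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the concatenated feature vectors (four classes)
X, y = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=8,
    n_classes=4,
    random_state=0,
)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees (illustrative value)
    max_depth=10,          # maximum depth of each tree (illustrative value)
    criterion="entropy",   # split quality measured by information gain
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=2,    # minimum samples at a leaf (illustrative value)
    random_state=0,
)
model.fit(X, y)
print(model.predict(X[:5]))
```

In practice these values would typically be chosen by a search such as `GridSearchCV` over a held-out or cross-validated split.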
Results
Data split: 70% training, 30% test
Training F1 score: 97.55%
Test F1 score: 60.37%
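For reference, a multi-class F1 score can be computed from per-class precision and recall as sketched below. The averaging scheme is not stated here, so macro averaging is an assumption, and the labels and predictions are made-up toy values rather than project results.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels for illustration only
y_true = ["Usage", "Usage", "Mention", "Creation", "Deposition", "Usage"]
y_pred = ["Usage", "Mention", "Mention", "Creation", "Usage", "Usage"]

print(f"Macro F1: {100 * macro_f1(y_true, y_pred):.2f}%")
```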
License and Copyright
This code was written as part of my pre-thesis at Rostock University, Germany.
Licensed under the [MIT License](LICENSE).