Parsing-References-using-SVM

EXTRACTION OF REFERENCE META DATA

This document gives an overview of the procedure. For details, refer to the individual scripts, which are commented.

The goal is to extract metadata from references. This metadata includes authors (tagged as <PERSON>), organizations, locations, year of publication, volume, sub-volume, start page, end page and miscellaneous items. The extraction is done in two parts.

EXTRACTION OF NAMES OF AUTHORS, ORGANIZATIONS AND LOCATIONS OF PUBLICATION (ALPHABETS)

For the first part, we use the Stanford Named Entity Recognizer (available at http://nlp.stanford.edu/software/CRF-NER.shtml). It produces an inlineXML output file in which person names are tagged as persons and journal names are tagged as organizations, in the format shown here: <PERSON>Lee S.K.</PERSON>. It also produces a MISC tag, which marks many words that do not fall into any of the above categories.
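As an illustration of how the inlineXML output could be consumed downstream, here is a minimal Python sketch. The project's own scripts are Perl; the function name `extract_entities` and the sample line are assumptions made for this example, not part of the package.

```python
import re
from collections import defaultdict

# Matches <TAG>text</TAG>, where the closing tag repeats the opening tag name.
ENTITY_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>")

def extract_entities(line):
    """Group the tagged spans of one inlineXML line by tag name."""
    entities = defaultdict(list)
    for tag, text in ENTITY_RE.findall(line):
        entities[tag].append(text.strip())
    return dict(entities)

# Hypothetical sample line in the inlineXML style described above.
sample = "<PERSON>Lee S.K.</PERSON>, <ORGANIZATION>J. Appl. Phys.</ORGANIZATION> 12"
print(extract_entities(sample))
```

Untagged text, such as the trailing number in the sample, is skipped here; numbers are handled by the second part of the procedure.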

EXTRACTION OF YEAR, VOLUME, SUB-VOLUME (NUMBERS)

For this part, after analysing various types of documents, we concluded that hard-coding patterns was not feasible, so we decided to use a Support Vector Machine (SVM) classifier instead. SVM classification divides points in an n-dimensional space into categories; the complexity of the boundary depends largely on the kernel function used. We use the libSVM library to implement the SVM, with a Gaussian kernel.
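To make the role of the kernel concrete, the sketch below computes a Gaussian (RBF) kernel value in the form K(x, y) = exp(-gamma * ||x - y||^2), which is how libSVM defines its radial basis function kernel. It is plain Python for illustration only; the `gamma` value is an assumption, since this document does not state which value was used.

```python
import math

def gaussian_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical points give 1.0; the value decays towards 0 as points move apart.
print(gaussian_kernel([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], gamma=0.5))
print(gaussian_kernel([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], gamma=0.5))
```

A larger `gamma` makes the similarity fall off faster with distance, which allows a more intricate decision boundary at the cost of a higher risk of overfitting.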

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

A. REQUIREMENTS AND ASSUMPTIONS

  • Input files should be in .csv format
  • Final tagged files are contained in the folder named "tagged"
  • Estimated time: ~2.72 sec per file

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

B. STEP-BY-STEP IMPLEMENTATION

  1. Files are copied from the location specified by the user, and zprocess.pl is run to add a dot after each single capital letter in the files. This is necessary because the stanford-ner library would tag, say, "Hammond L." as "<PERSON>Hammond L.</PERSON>" but "Hammond L" as "<PERSON>Hammond</PERSON> L". The stanford-ner library functions are then run on the csv files to produce the stage1 tagged files (see above).

  2. For stage2, parser.pl first takes the original csv files and converts each number in them (ignoring the indexes) to a 3-dimensional vector: the number itself is the 2nd coordinate, and the 1st and 3rd coordinates are the punctuation marks around the number. All alphabetic text is ignored, as it has already been tagged in stage1. parser.pl creates files with "trainfile" appended to the basenames, in the format required by the libsvm library (Manual_tag/dummy_string\s1st_coordinate,2nd_coordinate,3rd_coordinate).

  3. Conversion to svm format:- super.svm is the master file which contains manually tagged reference metadata (numbers only). svm-train creates super.svm.model from it, which is used for the final prediction task. svm-predict is then invoked to classify each document and print the tags (in the form of numbers, like "1" for "year", "4" for "volume" etc.) to a file "temp_out...".

  4. The first column of this file (the numeric tags) is copied and printed to another file "out..." along with the vectors, so that each number and its tag appear in the same row.

  5. Finally, tagger.pl searches the stage1 tagged files for the numbers listed in "out..." and places the corresponding tags around them.
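The first two steps above can be sketched roughly as follows. This is a Python illustration, not the Perl code of zprocess.pl or parser.pl. Both function names are hypothetical. The punctuation encoding (the character code, or 0 when the neighbour is not punctuation) is an assumption, and the sketch does not skip reference indexes as parser.pl does. Dividing the number by 1000 follows the note in section C that the tagged number is the middle value multiplied by 1000.

```python
import re
import string

def add_initial_dots(text):
    """Add a dot after each stand-alone capital letter not already followed by one."""
    # Note: this also dots a stand-alone article such as "A".
    return re.sub(r"\b([A-Z])\b(?!\.)", r"\1.", text)

def punct_code(ch):
    """Assumed encoding: character code for punctuation, 0 for anything else."""
    if ch != "" and ch in string.punctuation:
        return ord(ch)
    return 0

def numbers_to_vectors(line):
    """Turn every number in a line into (before, number / 1000, after)."""
    vectors = []
    for match in re.finditer(r"\d+", line):
        # Characters immediately adjacent to the number, or "" at the line edges.
        before = line[match.start() - 1] if match.start() > 0 else ""
        after = line[match.end()] if match.end() < len(line) else ""
        scaled = int(match.group()) / 1000
        vectors.append((punct_code(before), scaled, punct_code(after)))
    return vectors

print(add_initial_dots("Hammond L, Smith J."))
for vector in numbers_to_vectors("12(3), 45-67"):
    print(vector)
```

The surrounding punctuation carries most of the signal: a number in parentheses, a number followed by a hyphen, and a number preceded by a hyphen produce clearly different vectors, which the classifier can learn to separate.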

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

C. HOW TO TRAIN REFERENCE TAGGER FOR BETTER RESULTS

  1. Copy everything up to and including "done" (line 141; up to section 3) to a new bash script, and comment out the three "rm" commands in lines 137-139.

  2. Manually tag all the .csv.svm files in the stage2 folder by replacing "0" with the appropriate tag (1-year, 2-start page, 3-end page, 4-volume, 6-subvolume). The number to be tagged is the middle number in each row multiplied by 1000, and the numbers appear in the same order as in the original file supplied to the package.

  3. Copy all the manually tagged rows of the .csv.svm files and append them to the end of super.svm.

The new training set is ready!
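The last step can be automated with a few lines of Python. This is a sketch under stated assumptions, not part of the package: the function name `append_tagged_rows` is hypothetical, the tag is assumed to be the first whitespace-separated column of each row, and the master file is assumed to end with a newline.

```python
from pathlib import Path

def append_tagged_rows(tagged_file, master_file):
    """Append rows with a non-zero tag to the master training file."""
    rows = []
    for line in Path(tagged_file).read_text().splitlines():
        fields = line.split()
        # Rows still tagged "0" have not been labelled manually, so skip them.
        if fields and fields[0] != "0":
            rows.append(line)
    if rows:
        with open(master_file, "a") as out:
            out.write("\n".join(rows) + "\n")
    return len(rows)
```

Calling `append_tagged_rows` once per .csv.svm file in the stage2 folder, with super.svm as `master_file`, extends the training set. The function returns the number of rows appended, which makes it easy to check that the manual tags were picked up.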
