Comments (10)
You can train the parser module with XML files (see the samples in res/parser). However, the finder module is trained using TTX files (see res/finder). TTX is a custom format that makes annotating long text documents relatively easy. Using XML for this would be cleaner but would require a lot of extra tooling to make it feasible (for training documents you need to label every line of a document after all).
Please note that the documents in res/finder are only one part of the sources used to train the default finder module. We can't publish the other documents for copyright reasons.
from anystyle.
Thank you. So the XML must have that specific structure, or can I use my own annotation?
Also, can I use PDF documents as input with the parse command, or should I provide text files instead?
The XML used for the parser must use the same structure as the sample files, i.e. one <dataset> with any number of <sequence> records. The tag names used inside a sequence correspond to the labels known to the model, which means you can introduce your own labels there. However, I'd definitely stick with the labels of the default model (author, date, title, etc.) because the feature extraction and normalization helpers operate on those specifically. If you introduce new labels, you might want to add your own normalizer code, for example, to process such segments.
The parse command takes text input (one reference per line). But of course you can use the finder and parser modules in combination, for example with the CLI tool. The finder module would extract the references from a PDF or text document and pass them on to the parser module, which would then segment and label each reference individually.
Basically, the finder module takes entire documents; it splits the document into lines and operates on each line: every line is assigned a label; multiple lines with the same label are grouped together; reference groups are extracted; a heuristic based on regular expressions is applied to try and separate individual references.
The parser module takes one or more lines as input; each line is interpreted as a single reference; the line is split into word-tokens and each word is labeled; successive words with the same label are grouped together; normalizer routines are applied for specific labels.
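The grouping step that both modules share (successive items with the same label are merged) can be sketched in plain Ruby. This is only a toy illustration with hand-written labels, not AnyStyle's actual implementation: `group_by_label` is a hypothetical helper, and in practice the labels would come from the trained model.

```ruby
# Toy illustration: merge successive tokens that carry the same label.
# The [label, token] pairs are written by hand here; in AnyStyle they
# would be produced by the trained model.
def group_by_label(labeled_tokens)
  labeled_tokens
    .chunk_while { |(a, _), (b, _)| a == b }
    .map { |run| [run.first.first, run.map(&:last).join(' ')] }
end

tokens = [
  [:author, 'Doe,'], [:author, 'J.'],
  [:date, '(2001).'],
  [:title, 'An'], [:title, 'example'], [:title, 'title.']
]

group_by_label(tokens).each { |label, text| puts "#{label}: #{text}" }
```

Only adjacent runs are merged, so a label that reappears later in the line starts a new segment rather than being folded into the earlier one.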
Thank you again. Does AnyStyle provide a converter for creating TTX files?
Yes, you can save documents as TTX. TTX is just plain text but with a certain prefix on each line; it was built for manual annotation using diff and simple text editors like nvim. You can also find more background info in some issue threads here.
So, if I understood correctly:
I could train the parser model by annotating my XML files according to your model and then use the trained parser as the default when running the find command. Is that correct?
How can I set the trained parser as the default?
When using the CLI tool you can pass the model file as an argument on the command line. If you use the Ruby Gem you can set the model option of a given parser instance (or change the default setting).
While trying to train the parser I got this error:
error: undefined method 'strip' for nil:NilClass
I cannot work out how to fix it. What could be causing it?
I think this is probably a cryptic error message due to invalid training data. It's usually something like a blank segment (i.e., something like <title></title>).
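One way to hunt for such blank segments before training is a small stand-alone check. The following is a minimal sketch, assuming the <dataset>/<sequence> layout described earlier in this thread; `blank_segments` is a hypothetical helper built on Ruby's bundled REXML library, not part of AnyStyle, and the sample data is made up.

```ruby
require 'rexml/document'

# Return [sequence_number, tag_name] for every segment that is empty or
# contains only whitespace. Assumes one <dataset> root holding
# <sequence> records, as in the sample training files.
def blank_segments(xml)
  doc = REXML::Document.new(xml)
  blanks = []
  doc.elements.to_a('dataset/sequence').each_with_index do |sequence, index|
    sequence.elements.each do |segment|
      # Element#text is nil for an element without any text node.
      blanks << [index + 1, segment.name] if segment.text.to_s.strip.empty?
    end
  end
  blanks
end

sample = <<~XML
  <dataset>
    <sequence>
      <author>Doe, J.</author>
      <title></title>
      <date>2001.</date>
    </sequence>
  </dataset>
XML

blank_segments(sample).each do |number, tag|
  puts "sequence #{number}: <#{tag}> is blank"
end
```

To check a real training file, pass `File.read` of its path to `blank_segments`; any sequence it reports is a candidate for the error above.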