bratutils's People

Contributors

hugosousa, jeanphilippegoldman, savkov

bratutils's Issues

Discontinuous Annotations

I would like to use this in a corpus annotation project that uses discontinuous annotations, but I receive the following error.

Traceback (most recent call last):
File "vso-inter-annotator.py", line 5, in
doc = a.DocumentCollection('data/BoireAnnotations/VSO_Hypertension1/')
File "build/bdist.macosx-10.10-intel/egg/bratutils/agreement.py", line 834, in init
File "build/bdist.macosx-10.10-intel/egg/bratutils/agreement.py", line 654, in init
File "build/bdist.macosx-10.10-intel/egg/bratutils/agreement.py", line 292, in init
File "build/bdist.macosx-10.10-intel/egg/bratutils/agreement.py", line 301, in _parse_annotation
ValueError: invalid literal for int() with base 10: '6419;6435'

vso-inter-annotator.py contains the following:
#VSO inter-rater agreement using BRAT utils

from bratutils import agreement as a

doc = a.DocumentCollection('data/BoireAnnotations/VSO_Hypertension1/')
doc2 = a.DocumentCollection('data/HerringAnnotations/VSO_Hypertension1/')

doc.make_gold()
statistics = doc2.compare_to_gold(doc)

print statistics

Here is the annotation file that is causing the error.

T1 VSO_0000005 3395 3407 182/107 mmHg
T2 VSO_0000005 4300 4312 200/100 mmHg
T3 VSO_0000008 4254 4260 36.8°C
T4 VSO_0000005 6518 6529 160/80 mmHg
T5 VSO_0000005 15833 15844 170/80 mmHg
T6 VSO_0000038 16385 16408 Systolic blood pressure
T7 VSO_0000005 16438 16446 200 mmHg
T8 VSO_0000005 16867 16878 135/95 mmHg
T9 VSO_0000005 16959 16971 160/100 mmHg
T10 VSO_0000005 17659 17671 220/120 mmHg
T11 VSO_0000005 18143 18154 135/95 mmHg
T12 VSO_0000004 3370 3384 blood pressure
T13 VSO_0000007 4239 4250 temperature
T14 VSO_0000004 4282 4296 blood pressure
T15 VSO_0000004 6486 6500 Blood pressure
T16 VSO_0000004 15802 15816 Blood pressure
T17 VSO_0000004 16826 16840 Blood pressure
T18 VSO_0000004 16941 16955 blood pressure
T19 VSO_0000004 17624 17638 Blood pressure
T20 GO_0008217 17713 17738 Blood pressure normalized
T21 VSO_0000004 18125 18139 blood pressure
T23 VSO_0000030 4341 4360 63 beats per minute
T24 GO_0008217 6405 6419;6435 6442 blood pressure control
T31 GO_0008217 16046 16060;16072 16079 blood pressure control
T33 VSO_0000006 16826 16844;16855 16863 Blood pressure was measured
T34 GO_0008217 17015 17029;17041 17048 blood pressure control
T38 VSO_0000029 4314 4324 Heart rate
T39 VSO_0000004 6147 6161 blood pressure
T41 GO_0008217 6486 6514 Blood pressure was decreased
T43 VSO_0000006 15802 15829 Blood pressure was measured
T22 VSO_0000004 6405 6419 blood pressure
T25 VSO_0000004 16046 16060 blood pressure
T26 VSO_0000004 17015 17029 blood pressure
T27 VSO_0000004 17713 17727 Blood pressure
T28 VSO_0000004 18517 18531 blood pressure

Thank you
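
The parser fails on T24, T31, T33, and T34 because they are discontinuous annotations: brat separates the fragment offsets with a semicolon (e.g. 6405 6419;6435 6442), while the parser expects a single start and end offset and calls int() on the second field. As a stopgap until discontinuous spans are supported, the files could be preprocessed so that each discontinuous annotation is collapsed into one covering span. The sketch below is untested and the function names are illustrative; note that the surface-text column is left unchanged, so it will no longer match the widened span exactly.

import re

def collapse_discontinuous(line):
    # Rewrite a text-bound (T) line with fragmented offsets, e.g.
    # "T24<TAB>GO_0008217 6405 6419;6435 6442<TAB>blood pressure control",
    # into a single covering span: "GO_0008217 6405 6442".
    if not line.startswith('T'):
        return line
    parts = line.rstrip('\n').split('\t')
    if len(parts) < 2 or ';' not in parts[1]:
        return line
    fields = parts[1].split(' ')
    label, offsets = fields[0], ' '.join(fields[1:])
    numbers = [int(n) for n in re.split('[ ;]', offsets)]
    parts[1] = '%s %d %d' % (label, min(numbers), max(numbers))
    return '\t'.join(parts) + '\n'

def collapse_file(src_path, dst_path):
    # Write a copy of an .ann file with all discontinuous spans collapsed.
    with open(src_path) as fin, open(dst_path, 'w') as fout:
        for line in fin:
            fout.write(collapse_discontinuous(line))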

Incorrect number of spurious tags when there is no overlap

Hello again,

I have two .ann files.

The gold set:

T1  Medical-Concept 36 41   tumor
T2  Medical-Concept 327 351 síndrome mielodisplásica
T3  Medical-Concept 440 445 tumor
T4  Medical-Concept 22 32   morfologia
T5  Medical-Concept 79 117  Nomenclatura Sistematizada de Medicina
T6  Medical-Concept 120 126 SNOMED
T7  Medical-Concept 189 204 Linfoma maligno
T8  Medical-Concept 207 216 folicular
T9  Medical-Concept 220 227 nodular
T10 Medical-Concept 270 310 Anemia refratária com excesso de blastos
T11 Medical-Concept 356 366 deleção 5q
T12 Medical-Concept 368 371 5q-

And the candidate set:

T1  Medical-Concept 270 287 Anemia refratária
T2  Medical-Concept 327 335 Síndrome
T3  Medical-Concept 471 476 seção

For the comparison, I'm running the following code:

from bratutils import agreement as a


__author__ = 'Aleksandar Savkov'

doc = '3711'
gold = a.Document('../res/ht_gold/' + doc + '.ann')
extension = a.Document('../res/ht_extension/' + doc + '.ann')

gold.make_gold()
statistics = extension.compare_to_gold(gold)

print statistics

This should produce the result: 0 correct, 12 missing, and 3 spurious tags. Right?

The produced result is 3 missing tags and 0 correct/partial/spurious. I think the spurious tags are not being handled correctly.

Is my thinking right, or is this actually the desired output?

Hugo
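
For what it's worth, the expected figures can be checked with a small stand-alone script that counts an annotation as correct only on an exact (type, start, end) match. This deliberately ignores bratutils' partial-match category and is not its internal scoring logic, just a sanity check of the numbers above.

# Exact-match sanity check of the counts discussed above.
gold = {('Medical-Concept', s, e) for s, e in [
    (36, 41), (327, 351), (440, 445), (22, 32), (79, 117), (120, 126),
    (189, 204), (207, 216), (220, 227), (270, 310), (356, 366), (368, 371)]}
candidate = {('Medical-Concept', s, e) for s, e in [
    (270, 287), (327, 335), (471, 476)]}

print(len(gold & candidate), len(gold - candidate), len(candidate - gold))
# -> 0 12 3  (correct, missing, spurious under strict matching)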

Relations and attributes crash the parsing function

As reported in #14, relations and attributes crash the parsing function. This should be easy to fix, as the problem seems to be that the parsing is not generic enough. It also looks like a good place to start when adding support for relations and attributes to the agreement calculation.

Attribute support

I have never used attribute annotations and am not sure which tasks they are part of. It would be nice if someone took the lead in designing this; I would be willing to help integrate it into the project.

Relations not supported

Hi,

I was very happy to find this code; I was looking for something to compare brat annotations across files.
Did you ever look into implementing relations?

Relations support

Support for relations has long been asked for, but I've been reluctant to implement it because the code is not my best and I don't want to go back into the heavy logic. However, I just worked on getting the parsing function to handle all types gracefully, and it looks like relations can be implemented in a way that is self-contained and probably quite straightforward. So I'll lay out what I want to do here and ask for feedback.


Relations are effectively triples of two arguments and a relation type. Assuming that the possible arguments are predetermined, e.g. arguments can only be tokens, chunks, or some other pre-annotated spans, evaluating the agreement is really quite easy: an F1-score where each triple is treated as a unique annotation. I can probably copy a lot of the code straight from bioeval.
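
For illustration, treating each (type, arg1, arg2) triple as a unique annotation and scoring with an F1-measure could look roughly like the sketch below; the names are illustrative and this is not existing bratutils API.

def relation_f1(gold_triples, candidate_triples):
    # Each relation is reduced to a hashable (type, arg1_span, arg2_span)
    # triple; two annotators agree on a relation iff the triples match.
    gold, cand = set(gold_triples), set(candidate_triples)
    tp = len(gold & cand)
    precision = tp / float(len(cand)) if cand else 0.0
    recall = tp / float(len(gold)) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [('refers_to', (34, 43), (24, 31)), ('refers_to', (55, 66), (46, 54))]
cand = [('refers_to', (34, 43), (24, 31))]
print(relation_f1(gold, cand))  # 0.67: precision 1.0, recall 0.5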

I haven't thought about this for very long, but using an F1-score seems to be a bit of a cop-out here: the probability of a random assignment of a relation is not vanishingly small, so maybe kappa could be implemented instead.
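
The chance correction itself is cheap once the per-item relation decisions are aligned; the hard part, as the next paragraph notes, is defining the item set. For reference, Cohen's kappa is just:

def kappa(observed_agreement, chance_agreement):
    # (p_o - p_e) / (1 - p_e): 1.0 is perfect agreement,
    # 0.0 is agreement no better than chance.
    return (observed_agreement - chance_agreement) / (1.0 - chance_agreement)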

Additionally, in many cases the arguments are not predetermined, which would be quite hard to evaluate at the same time, and honestly I have no idea how to do that at the moment.

So I'm looking for some input here. It would be nice to hear what you think.

cc @jeanphilippegoldman @soluna1

Apply to attributes and relations too

Hi Savkov,

Knowing about this tool earlier would have saved me a lot of time. I used the NLTK package to measure the IAA of brat annotation files, and it was a bit of a nightmare to convert the .ann files into something readable. So I think this tool is very useful, and the code is great, congratulations!

Our problem is that we have data structured in this way:

T1 Food 24 31 bacalao
T2 Restaurant 0 8 Un sitio
T3 Restaurant 46 54 Un lugar
T5 Restaurant 55 66 con encanto
A3 Polarity T5 POS
A4 Restaurant_Aspects T5 General_experience
R2 refers_to Arg1:T5 Arg2:T3
T4 Food 34 43 riquísimo
A1 Polarity T4 POS
A2 Food_Aspects T4 General_experience
R1 refers_to Arg1:T4 Arg2:T1

We want to measure agreement for all three categories: entities (e.g. Food for "bacalao"), attributes (e.g. the aspect General_experience and the polarity POS for "con encanto"), and relations (R1 refers_to ...). Are you planning to implement these options too? It would be really useful for aspect-based sentiment analysis annotation.

Many thanks
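
Until such support exists, the attribute (A) and relation (R) lines in the sample above are at least easy to collect alongside the entities, since brat's standoff lines are tab-separated between the id, the annotation body, and the surface text. The sketch below only parses them and does not score agreement; the path is a placeholder, and contiguous spans are assumed. For agreement, the T ids would still have to be resolved to spans, because ids generally differ between annotators.

entities, attributes, relations = {}, [], []
with open('annotator1.ann') as f:                      # placeholder path
    for line in f:
        ann_id, rest = line.rstrip('\n').split('\t', 1)
        body = rest.split('\t')[0].split(' ')          # drop surface text on T lines
        if ann_id.startswith('T'):                     # T1  Food 24 31   bacalao
            entities[ann_id] = (body[0], int(body[1]), int(body[2]))
        elif ann_id.startswith('A'):                   # A3  Polarity T5 POS
            attributes.append((body[0], body[1], body[2] if len(body) > 2 else True))
        elif ann_id.startswith('R'):                   # R2  refers_to Arg1:T5 Arg2:T3
            relations.append((body[0], body[1].split(':')[1], body[2].split(':')[1]))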

Document instance has no attribute 'postag_list'

Hello. I need to compare automatic annotations performed by a software application with manual annotations (in brat standoff format), and this seems to be a nice tool to use.

While testing it and trying to understand the source code, I tried the following small sample code

import agreement as a

doc = a.Document("myfile.ann")
doc2 = a.Document("myfile.ann")

doc.make_gold()
statistics = doc2.compare_to_gold(doc)

However, when the compare_to_gold function is executed, it says that the Document instance has no attribute 'postag_list', which is true, but I don't understand where this attribute is supposed to come from either.

Am I missing something? Could you perhaps post a small working example for comparing two .ann files? I'd appreciate that.

Thanks.
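
For reference, the directory-based usage shown in the other issues of this thread looks like the sketch below; whether it avoids the postag_list error will depend on the bratutils version, so treat it only as a starting point. The directory names are placeholders, one directory of .ann files per annotator.

from bratutils import agreement as a

# Each directory holds one annotator's .ann files with matching file names.
ann_a = a.DocumentCollection('annotator_a/')
ann_b = a.DocumentCollection('annotator_b/')

ann_a.make_gold()
statistics = ann_b.compare_to_gold(ann_a)

print(statistics)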

Relations still not supported

I'm guessing relation support never got added, as I still receive errors. Has anyone come up with a simple fix to ignore relations so that it still runs?
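
One blunt workaround is to make copies of the .ann files that keep only the text-bound T lines, dropping the relation and attribute lines the parser chokes on, and then point bratutils at the copies. A rough sketch, with placeholder paths:

import glob
import os

src_dir, dst_dir = 'annotations/', 'annotations_entities_only/'
os.makedirs(dst_dir, exist_ok=True)
for path in glob.glob(os.path.join(src_dir, '*.ann')):
    with open(path) as fin:
        # Keep only text-bound annotations (T lines).
        kept = [line for line in fin if line.startswith('T')]
    with open(os.path.join(dst_dir, os.path.basename(path)), 'w') as fout:
        fout.writelines(kept)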
