Code for the algorithms in the paper: Vaibhav B Sinha, Sukrut Rao, Vineeth N Balasubramanian. Fast Dawid-Skene: A Fast Vote Aggregation Scheme for Sentiment Classification. KDD WISDOM 2018
Hi, I tried to test the algorithm on some real-life data. I found the limitation of having a fixed number of labelers per question a little impractical. I wonder whether this is a limitation of this particular implementation or of the aggregation model itself? What if a user provided some reasonable upper limit on the number of labelers per task (3, 5, 7)? Thank you!
Hi,
I have cloned the repo and activated a conda environment based on the requirements file you provided. I use Python 2.7, and when I run python scripts/fast_dawid_skene.py --dataset toy --k 2 --mode aggregate --algorithm FDS --print_result I get this error: TypeError: argument of type 'int' is not iterable. This is the stack trace:
Traceback (most recent call last):
File "scripts/fast_dawid_skene.py", line 64, in &lt;module&gt;
main()
File "scripts/fast_dawid_skene.py", line 58, in main
run(args)
File "/home/konstantina/projects/Fast-Dawid-Skene/scripts/../fast_dawid_skene/main.py", line 41, in run
result_annotations.reset_index(level=0, inplace=True)
File "/home/konstantina/anaconda3/envs/dawid-skene/lib/python2.7/site-packages/pandas/core/frame.py", line 3055, in reset_index
if level is None or i in level:
TypeError: argument of type 'int' is not iterable
I changed level=0 to level={0} in main.py and the error goes away. Is this an appropriate fix?
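For what it's worth, a minimal reproduction of the failing call (assuming the DataFrame in main.py carries a MultiIndex, as the traceback suggests) shows that wrapping the level in a list also avoids the non-iterable-int check, and a list matches the documented signature of reset_index more closely than a set:

```python
import pandas as pd

# Sketch of the situation in main.py: a DataFrame with a MultiIndex
# whose first level should become a regular column (the index names
# here are made up for illustration).
df = pd.DataFrame(
    {"label": [1, 0, 1]},
    index=pd.MultiIndex.from_tuples(
        [("q1", "a1"), ("q2", "a1"), ("q3", "a2")],
        names=["question", "annotator"],
    ),
)

# Passing the level inside a list sidesteps the "argument of type
# 'int' is not iterable" error seen in the affected pandas version;
# a set ({0}) happens to work too, but a list is the conventional form.
df.reset_index(level=[0], inplace=True)
print(list(df.columns))  # ['question', 'label']
```

The reset level is inserted as the first column, and the remaining "annotator" level stays as the index.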
Hey,
I am currently running the command python scripts/fast_dawid_skene.py --dataset toy --mode aggregate --algorithm FDS --print_result successfully for my dataset, but when I check the results (either in the output file or the shell output), I see one final annotation per annotator. I thought one label per document would be created. I checked what happens at the end of main.py for printing, and also the code in utils.to_csv, but I am unsure what to change to make it work. Could you please explain how to get the final labels? Thanks!
Hi, I was able to use the aggregator after all, thank you very much!
It squeezed 1.8M responses down to 500K labels in 4.5 hours on 1 thread on the server, using 35 GB of memory ;) I think we can incorporate the solution, but I need to implement some enhancements to make it more useful in a production scenario. I will share my thoughts here just to let you know what we think would be useful in our real situation:
for the MLtoRank scenario we need to constantly receive new labels and aggregate new judgements. A 5-hour delay for adding an extra 1000 labels may be too long and too much electricity to burn, so we need to learn how to perform an incremental step. That may include the possibility to back up all distributions and other state, prefill new cells with defaults, and perform 1-2 extra iterations.
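The incremental step described above could be sketched roughly as follows. This is not part of the Fast-Dawid-Skene repo; it assumes you persisted the per-annotator confusion matrices and the class marginals from a previous full run, and scores only the new items against that saved state (an E-step over the new votes), which costs time proportional to the new votes instead of re-running EM over all 1.8M responses:

```python
import numpy as np

def warm_start_posteriors(votes, error_rates, class_priors):
    """Score a batch of *new* items against saved model state.

    error_rates: array [n_annotators, K, K], row = true class,
                 column = reported label (saved from a full run).
    class_priors: array [K], saved class marginals.
    votes: list of items; each item is a list of (annotator_id, label).
    Returns one posterior distribution over the K classes per item.
    """
    posteriors = []
    for item_votes in votes:
        # Prior times the likelihood of each observed vote, in log space
        # for numerical stability.
        log_p = np.log(class_priors)
        for a, label in item_votes:
            log_p += np.log(error_rates[a, :, label] + 1e-12)
        p = np.exp(log_p - log_p.max())
        posteriors.append(p / p.sum())
    return np.array(posteriors)

# Toy saved state: 2 annotators, 2 classes, both annotators 90% accurate.
error_rates = np.array([[[0.9, 0.1], [0.1, 0.9]]] * 2)
class_priors = np.array([0.5, 0.5])

# One new item with two agreeing votes for class 1.
post = warm_start_posteriors([[(0, 1), (1, 1)]], error_rates, class_priors)
print(post)  # heavily favours class 1
```

After this cheap scoring pass you could run the 1-2 extra full (or partial) iterations you mention to let the confusion matrices absorb the new evidence.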
the previous point may be enhanced by implementing a "partial" step that updates only the rows empirically close to the changed ones. Then after one partial step we can perform one full step to settle down if needed.
we have a process to order 3 extra marks if the first 3 do not give a confident label. So we will need the aggregator to output a confidence level for the chosen label, so we can decide whether to place an extra order.
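The re-ordering decision itself is a one-liner once a per-item posterior is available (note that the pure FDS hard assignment does not expose one; the DS or hybrid modes, or an E-step over saved state, would). A hypothetical helper, with the names and threshold made up:

```python
def needs_more_judgements(posterior, threshold=0.9):
    """Decide whether to order extra annotations for one item.

    posterior: per-class probability vector for the item.
    The max posterior serves as the confidence of the chosen label;
    below the threshold, the item goes back for 3 extra marks.
    """
    confidence = max(posterior)
    return confidence < threshold

print(needs_more_judgements([0.55, 0.45]))  # True: order extra marks
print(needs_more_judgements([0.02, 0.98]))  # False: confident enough
```

The threshold would be tuned on your own data against the cost of an extra order.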
And one extra off-topic note:
actually we have binary labels with an extra "grey" option. This is not truly ordinal, because the "grey" option is rare (a few percent): annotators are allowed to use it in complex situations, and it can also be produced when we have many confident answers split between true and false. I think we can write some sort of heuristic based on the label probability distribution, e.g. calculate P(white) * 0.3 + P(grey) + P(black) * 0.3 and compare it with P(white) and P(black).
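That heuristic is small enough to sketch directly (the function name, label names, and the 0.3 leak factor are all placeholders from the description above, not anything in the repo):

```python
def resolve_grey(p_white, p_grey, p_black, leak=0.3):
    """Pick white / grey / black from a label probability distribution.

    Grey absorbs a fraction ``leak`` of the mass of each confident
    pole; with rare grey votes, the grey score only wins when the
    white and black evidence genuinely conflicts.
    """
    grey_score = leak * p_white + p_grey + leak * p_black
    scores = {"white": p_white, "grey": grey_score, "black": p_black}
    return max(scores, key=scores.get)

print(resolve_grey(0.9, 0.0, 0.1))   # white: one confident pole
print(resolve_grey(0.4, 0.2, 0.4))   # grey: conflicting evidence
```

In the second call the grey score is 0.3*0.4 + 0.2 + 0.3*0.4 = 0.44, which beats either pole's 0.4, matching the intended behaviour of flagging conflicted items.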
I would be happy to hear any thoughts regarding this, thank you!