garrafao / wugs

Code to process Word Usage Graphs

License: GNU General Public License v3.0

Python 2.54% Shell 0.33% Jupyter Notebook 95.24% HTML 1.37% CSS 0.13% JavaScript 0.39%

wugs's Issues

Move iterations in graph2cluster2.sh into Python file

some data created by graph2plot2.py not readable

The Stats, Grouping stats, and Agreement stats now work both on the server and locally. However, Plotting stats and Annotator data do not work locally, and the info and annotator filters do not work either.
To reproduce, please run:
scripts/run_system2.sh test_uug/ correlation spring

remove quotes in data in context column (confusing)

https://github.com/Garrafao/WUGs/blob/main/durel_system/upload_formats/test_project/uses/adventure.csv

The quotes are not needed. They make it look as if the context must be quoted, but it should not be.

Russian and Chinese data

add Russian and Chinese data to durel_system/upload_formats/

additional quotes introduced in one_for_all notebook

check these two lines in the notebook:

judgments_wug.to_csv('judgements_all.csv', index=False)
uses_wug.to_csv('uses_all.csv', index=False)

This should be fixed, because these calls introduce additional quotation marks in the output.
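A minimal sketch of how the extra quotation marks can arise, using the standard-library csv module as a stand-in for the notebook code. The sample row is hypothetical. Under default quoting, a field that already contains a double quote is wrapped in quotes and its inner quotes are doubled. pandas' to_csv exposes similar quoting and quotechar parameters, but whether changing them is the right fix for the notebook is an assumption that needs checking.

```python
import csv
import io


def to_csv_text(rows, **fmt):
    """Serialize rows with csv.writer and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerows(rows)
    return buf.getvalue()


# Hypothetical sample: the context already contains double quotes.
rows = [["identifier", "context"], ["use-1", 'He said "no" twice.']]

# Default settings (QUOTE_MINIMAL, doublequote=True) wrap the field in
# quotes and double every quote inside it.
default_text = to_csv_text(rows)
print(default_text)
```

Reading the file back with a CSV parser restores the original text, so the doubled quotes only matter for tools that read the file line by line without CSV parsing.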

test wsbm with dense edge weights

WUG Pipeline Sub-processes

What are the separate sub-processes of the pipeline? Which values are calculated in these sub-processes? What is the format of the output values?

please add pandas to requirements.txt

Input Options WUG Pipeline

Which parameters can be passed to the pipeline to filter / specify the process? In which format should the parameters be presented? For which sub-processes of the pipeline are the parameters needed?

[visualization] if not all json files are generated, none of them are loaded in html

The stats_agreement.json file is not generated when viewing a graph in the discowug_unc dataset, which causes none of the statistics to render.

there are repeated rows in the uses_all.csv in output of one_for_all notebook

add instance file to the sample data

for example, the test_english project does not have instance files

add unit tests

Remove internal explanations from durel_system/tutorials/

and add sample HTML with a description of how to create samples for integrating a specific language

Possibility of saving html page with the current positioning of nodes

renaming the data_joint.json files and alike to js suffix

The data contained in these files is not JSON but JavaScript code: each file creates a global variable using standard JavaScript syntax. Also, in the code example like:

Optimize filtering in HTML's

the filtering in the output HTML plots is not very efficient according to Lukas.

support for applying WSBM with multiple edge weight lists (for multiple annotators)

Cluster distribution with only one grouping

if there is no grouping split in the usage data, no stats_groupings file is created and thus no information on cluster frequency distributions is exported.
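One possible way to handle this, sketched under assumptions: when a usage has no grouping, it is assigned to a single default grouping so that a cluster frequency distribution is still produced. The function name, the "all" label, and the input shapes are hypothetical and are not taken from the pipeline.

```python
from collections import Counter


def cluster_distributions(clusters, groupings=None, default_grouping="all"):
    """Count cluster frequencies per grouping.

    clusters:  dict mapping usage identifier -> cluster id
    groupings: dict mapping usage identifier -> grouping label (may be missing)
    Usages without a grouping fall into `default_grouping` (hypothetical name).
    """
    groupings = groupings or {}
    counts = {}
    for identifier, cluster in clusters.items():
        grouping = groupings.get(identifier)
        if grouping is None:
            grouping = default_grouping
        counts.setdefault(grouping, Counter())[cluster] += 1
    return {grouping: dict(counter) for grouping, counter in counts.items()}


# No grouping information at all: one distribution is still exported.
clusters = {"use-1": 0, "use-2": 0, "use-3": 1}
single = cluster_distributions(clusters)
print(single)
```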

test_uug dataset not usable

Please try to upload both the uses file and the instances file into the system and make sure they can be uploaded. Currently, both files have problems.

document run_system.sh scripts

Remove graph aggregation and cleaning options from clustering and plotting scripts for system pipeline

Flexible non-judgment value

Currently, the pipeline interprets judgments of 0.0 as non-valid. For label ranges including 0 as valid label, such as cosine similarity ranging from -1 to 1, this leads to wrongly treating 0.0 as non-valid label. Thus, we need to make the non-valid label a parameter.
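A minimal sketch of what such a parameter could look like. The function and parameter names are hypothetical; the default of 0.0 mirrors the current behaviour described above, and passing None is assumed to mean that every value is valid.

```python
def valid_judgments(judgments, non_value=0.0):
    """Return the judgments that are not equal to the non-valid label.

    non_value=None (an assumption for this sketch) disables filtering,
    which suits scales such as cosine similarity where 0.0 is a real value.
    """
    if non_value is None:
        return list(judgments)
    return [j for j in judgments if j != non_value]


# Ordinal scale where 0.0 marks "cannot decide":
print(valid_judgments([1.0, 0.0, 4.0]))

# Cosine similarities where 0.0 is a legitimate label:
print(valid_judgments([-0.4, 0.0, 0.9], non_value=None))
```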

missing js bug in spring template

Try to figure out how to incorporate the new mechanism from lucas's template into the spring template.

Timestamps for WUG pipeline

It would be helpful to have timestamps for the errors that occur in the WUG pipeline to improve the traceability of these errors when the pipeline is executed multiple times.
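A sketch of one way to get timestamps, assuming the Python scripts could report errors through the standard logging module. The logger name and the message are made up for illustration; shell-level errors would need separate handling.

```python
import io
import logging


def make_logger(stream, name="wug_pipeline"):
    """Return a logger that prefixes every message with a timestamp."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(handler)
    return logger


stream = io.StringIO()           # stands in for stderr or a log file
log = make_logger(stream)
log.error("clustering failed for lemma %s", "adventure")
line = stream.getvalue()
print(line, end="")
```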

Running the pipeline with Python 3.11

When running the pipeline in an environment with Python 3.11, the following error occurs:

Traceback (most recent call last):
  File "/home/arbeit/Desktop/DURel/durel_system/WUGs/scripts/data2join.py", line 46, in <module>
    w = csv.DictWriter(f, ['identifier1', 'identifier2', 'judgment', 'comment', 'annotator', 'lemma'], delimiter='\t', quoting = csv.QUOTE_NONE, quotechar='')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/csv.py", line 139, in __init__
    self.writer = writer(f, dialect, *args, **kwds)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

This error (and comparable ones) occurs in three scripts: data2join, data2agr and data2annotator. I was able to 'solve' it by removing the quotechar argument (quotechar='' in the traceback above) in all three files. It should be tested whether this change breaks anything in the pipeline.
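An alternative that may be worth testing, sketched with made-up data: pass quotechar=None together with csv.QUOTE_NONE instead of dropping the argument. With the argument dropped, the default quote character is a double quote, and on some Python versions the writer then refuses fields that contain one. The sketch only reports what happens in the running interpreter; it does not claim a result for any particular version.

```python
import csv
import io

fields = ['identifier1', 'identifier2', 'judgment', 'comment', 'annotator', 'lemma']

# Hypothetical row; the comment contains double quotes on purpose.
row = {
    'identifier1': 'use-1', 'identifier2': 'use-2', 'judgment': '3',
    'comment': 'unsure about "bank"', 'annotator': 'a1', 'lemma': 'bank',
}


def write_tsv(rows, **fmt):
    """Write rows as unquoted TSV and return the text."""
    buf = io.StringIO()
    w = csv.DictWriter(buf, fields, delimiter='\t', quoting=csv.QUOTE_NONE,
                       lineterminator='\n', **fmt)
    w.writeheader()
    w.writerows(rows)
    return buf.getvalue()


# quotechar=None means "no quote character", so quotes pass through as-is.
text = write_tsv([row], quotechar=None)
print(text)

# Dropping the argument falls back to the default quote character.
try:
    write_tsv([row])
    print("default quotechar: written without error")
except csv.Error as err:
    print("default quotechar:", err)
```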

suppress warnings in commandline output

There are too many and they make it hard to find important output.
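A small sketch of how selected warning categories could be silenced around a pipeline step with the standard warnings module. Which categories are safe to hide is an assumption; the helper and the noisy step are invented for illustration.

```python
import warnings


def run_quietly(func, *args, categories=(DeprecationWarning, FutureWarning), **kwargs):
    """Call func while ignoring the given warning categories."""
    with warnings.catch_warnings():
        for category in categories:
            warnings.simplefilter("ignore", category)
        return func(*args, **kwargs)


def noisy_step():
    """Hypothetical pipeline step that emits a warning."""
    warnings.warn("this API will change", FutureWarning)
    return 42


result = run_quietly(noisy_step)
print(result)
```

The filters are restored when the with block exits, so warnings outside the wrapped call remain visible.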

Installing pygraphviz with pip in conda (and elsewhere)?

conda version 22.11.1
python version 3.9.13.final.0

When pip-installing requirements.txt from inside a conda environment, an error can occur with pygraphviz. Installing pygraphviz directly through conda works: conda install --channel conda-forge pygraphviz. The error may occur because pygraphviz cannot find the graphviz headers when it is built by pip inside conda.

Collecting pygraphviz
  Using cached pygraphviz-1.10.zip (120 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pygraphviz
  Building wheel for pygraphviz (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [55 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-39
      creating build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/scraper.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/__init__.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/agraph.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/testing.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/graphviz.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      creating build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_unicode.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_attribute_defaults.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_string.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_node_attributes.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_subgraph.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/__init__.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_html.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_scraper.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_repr_mimebundle.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_clear.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_drawing.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_edge_attributes.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_graph.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_layout.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_close.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_readwrite.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      running egg_info
      writing pygraphviz.egg-info/PKG-INFO
      writing dependency_links to pygraphviz.egg-info/dependency_links.txt
      writing top-level names to pygraphviz.egg-info/top_level.txt
      reading manifest file 'pygraphviz.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching '*.png' under directory 'doc'
      warning: no files found matching '*.txt' under directory 'doc'
      warning: no files found matching '*.css' under directory 'doc'
      warning: no previously-included files matching '*~' found anywhere in distribution
      warning: no previously-included files matching '*.pyc' found anywhere in distribution
      warning: no previously-included files matching '.svn' found anywhere in distribution
      no previously-included directories found matching 'doc/build'
      adding license file 'LICENSE'
      writing manifest file 'pygraphviz.egg-info/SOURCES.txt'
      copying pygraphviz/graphviz.i -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/graphviz_wrap.c -> build/lib.linux-x86_64-cpython-39/pygraphviz
      running build_ext
      building 'pygraphviz._graphviz' extension
      creating build/temp.linux-x86_64-cpython-39
      creating build/temp.linux-x86_64-cpython-39/pygraphviz
      gcc -pthread -B /home/line/anaconda3/envs/DURel/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/line/anaconda3/envs/DURel/include -I/home/line/anaconda3/envs/DURel/include -fPIC -O2 -isystem /home/line/anaconda3/envs/DURel/include -fPIC -DSWIG_PYTHON_STRICT_BYTE_CHAR -I/home/line/anaconda3/envs/DURel/include/python3.9 -c pygraphviz/graphviz_wrap.c -o build/temp.linux-x86_64-cpython-39/pygraphviz/graphviz_wrap.o
      pygraphviz/graphviz_wrap.c:2711:10: fatal error: graphviz/cgraph.h: No such file or directory
       2711 | #include "graphviz/cgraph.h"
            |          ^~~~~~~~~~~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pygraphviz
  Running setup.py clean for pygraphviz
Failed to build pygraphviz
Installing collected packages: pygraphviz
  Running setup.py install for pygraphviz ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for pygraphviz did not run successfully.
  │ exit code: 1
  ╰─> [57 lines of output]
      running install
      /home/line/anaconda3/envs/DURel/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-39
      creating build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/scraper.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/__init__.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/agraph.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/testing.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/graphviz.py -> build/lib.linux-x86_64-cpython-39/pygraphviz
      creating build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_unicode.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_attribute_defaults.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_string.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_node_attributes.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_subgraph.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/__init__.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_html.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_scraper.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_repr_mimebundle.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_clear.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_drawing.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_edge_attributes.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_graph.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_layout.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_close.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      copying pygraphviz/tests/test_readwrite.py -> build/lib.linux-x86_64-cpython-39/pygraphviz/tests
      running egg_info
      writing pygraphviz.egg-info/PKG-INFO
      writing dependency_links to pygraphviz.egg-info/dependency_links.txt
      writing top-level names to pygraphviz.egg-info/top_level.txt
      reading manifest file 'pygraphviz.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no files found matching '*.png' under directory 'doc'
      warning: no files found matching '*.txt' under directory 'doc'
      warning: no files found matching '*.css' under directory 'doc'
      warning: no previously-included files matching '*~' found anywhere in distribution
      warning: no previously-included files matching '*.pyc' found anywhere in distribution
      warning: no previously-included files matching '.svn' found anywhere in distribution
      no previously-included directories found matching 'doc/build'
      adding license file 'LICENSE'
      writing manifest file 'pygraphviz.egg-info/SOURCES.txt'
      copying pygraphviz/graphviz.i -> build/lib.linux-x86_64-cpython-39/pygraphviz
      copying pygraphviz/graphviz_wrap.c -> build/lib.linux-x86_64-cpython-39/pygraphviz
      running build_ext
      building 'pygraphviz._graphviz' extension
      creating build/temp.linux-x86_64-cpython-39
      creating build/temp.linux-x86_64-cpython-39/pygraphviz
      gcc -pthread -B /home/line/anaconda3/envs/DURel/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/line/anaconda3/envs/DURel/include -I/home/line/anaconda3/envs/DURel/include -fPIC -O2 -isystem /home/line/anaconda3/envs/DURel/include -fPIC -DSWIG_PYTHON_STRICT_BYTE_CHAR -I/home/line/anaconda3/envs/DURel/include/python3.9 -c pygraphviz/graphviz_wrap.c -o build/temp.linux-x86_64-cpython-39/pygraphviz/graphviz_wrap.o
      pygraphviz/graphviz_wrap.c:2711:10: fatal error: graphviz/cgraph.h: No such file or directory
       2711 | #include "graphviz/cgraph.h"
            |          ^~~~~~~~~~~~~~~~~~~
      compilation terminated.
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pygraphviz

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

fix kri2 for case of non-complete value domain

switch to Anaconda for installation

Instead of providing a requirements.txt, we should provide installation instructions for Anaconda, because this will allow us to use the graph-tool library.

aggregation step

reorganize system2 pipeline to have separate, modular data aggregation step

Slow filtering / efficiency problem in JavaScript of WUG filter template

Depending on the complexity of the WUG, and on the filtering task, filtering can be slow. I tried to mitigate this, for example by using a map for annotations when calculating the sub-graph for the annotators filter, which made the filter a little faster than the earlier implementation.

If filtering is still too slow to be usable in some cases, there could be more time-efficient ways to access data, e.g. by going around the implementation of vis.js (possibly collecting all nodes and edges into collections when loading the page and having your own iterators for that data, or using a library that iterates over datasets faster, for example a C-based library).
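The idea of collecting the data once at load time can be sketched independently of vis.js. The version below is in Python rather than JavaScript and is only meant to show the data structure: an index from annotator to the edges they judged, built once, so that each filter call becomes a set union instead of a scan over all annotations. The field names follow the judgment columns used elsewhere in the repository; the function names are hypothetical.

```python
from collections import defaultdict


def index_by_annotator(annotations):
    """Build annotator -> set of undirected edges, once, at load time."""
    index = defaultdict(set)
    for a in annotations:
        edge = frozenset((a["identifier1"], a["identifier2"]))
        index[a["annotator"]].add(edge)
    return index


def subgraph_edges(index, annotators):
    """Return all edges judged by at least one of the selected annotators."""
    edges = set()
    for name in annotators:
        edges |= index.get(name, set())
    return edges


annotations = [
    {"identifier1": "u1", "identifier2": "u2", "annotator": "a1"},
    {"identifier1": "u2", "identifier2": "u3", "annotator": "a2"},
    {"identifier1": "u2", "identifier2": "u1", "annotator": "a2"},
]
index = index_by_annotator(annotations)
print(subgraph_edges(index, ["a1"]))
```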

integers on judgment plots

Currently, annotator judgments on judgment plots are cast to integers. This should be changed to floats for non-ordinal scales.
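A tiny illustration of the information lost by the integer cast, with a hypothetical helper that formats a plot label depending on the scale type. The ordinal flag and the two-decimal format are assumptions.

```python
def judgment_label(value, ordinal=True):
    """Format a judgment for display on a plot."""
    if ordinal:
        return str(int(value))   # ordinal labels such as 1-4
    return f"{value:.2f}"        # keep decimals for continuous scales


print(int(0.73))                              # the cast drops the fraction
print(judgment_label(3.0))                    # ordinal scale
print(judgment_label(0.73, ordinal=False))    # continuous scale
```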

Usim

Please check the folder 'neat'. In the 5th row (identifier: neat-105), indices_target_sentence_tokenized covers only half of the target sentence. In the 9th row, values are missing for indices_target_token/sentence. Most of the other instances return correct indices in manual checks.

remove instances from testwug_en

There is no need to upload instances with source data. Instead, instances should be created in the process of label aggregation from source judgments; see here:

https://github.com/Garrafao/durel_system_annotators/blob/master/tests/data.py

The number of excluded nodes

It is my understanding that the exclude_nodes.py script removes from the graph the nodes which either have more than 50% of 0 judgments or which do not have valid judgments at all.
The nodes here are usages (sentences containing target words).

Based on this understanding, for each target word I would expect the number of excluded nodes plus the number of preserved nodes to be equal to the number of the original nodes (number of usages). However, this is not the case. With my data, after the processing is over, I am looking at the files in the stats subdirectory and observe cases like this:

Excluded nodes (from excluded_nodes.csv): 15
Preserved nodes (from stats_grouping.csv): 10

However, there were 22 unique sentences for this word in the original data. I see in the graphs and in the clusters, that there are indeed 10 sentences taken into account, so 12 sentences were discarded, not 15. I observe this for more than one target word. In extreme cases, the number of excluded nodes actually is higher than the total number of sentences for a word.

I have not yet looked deeply into the code, but maybe I am just misunderstanding something?
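The expectation described above can be written down as a small check. This is a sketch of the rule as understood in this issue (more than 50% of judgments are 0, or no valid judgment at all), not the actual logic of exclude_nodes.py; all names and the sample data are hypothetical. Under this reading, every usage lands in exactly one of the two sets, so the counts must add up.

```python
def split_nodes(judgments_by_node, all_nodes, non_value=0.0, max_share=0.5):
    """Partition nodes into excluded and preserved sets.

    A node is excluded if it has no valid judgment, or if the share of
    non-valid judgments among its judgments exceeds max_share.
    """
    excluded, preserved = set(), set()
    for node in all_nodes:
        values = judgments_by_node.get(node, [])
        valid = [v for v in values if v != non_value]
        share = (len(values) - len(valid)) / len(values) if values else 1.0
        if not valid or share > max_share:
            excluded.add(node)
        else:
            preserved.add(node)
    return excluded, preserved


judgments = {
    "s1": [0.0, 0.0, 3.0],   # two of three judgments are 0
    "s2": [4.0, 0.0, 2.0],   # one of three judgments is 0
    "s3": [0.0, 0.0],        # no valid judgment
}
all_nodes = ["s1", "s2", "s3", "s4"]   # s4 was never judged

excluded, preserved = split_nodes(judgments, all_nodes)
print(sorted(excluded), sorted(preserved))
```

If the counts in excluded_nodes.csv and stats_grouping.csv do not add up like this, one thing to check is whether the same usage is written to the excluded list more than once.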

mlrose requirement sklearn

The latest pip version of mlrose depends on the sklearn package instead of scikit-learn; the sklearn package name is now deprecated: https://github.com/scikit-learn/sklearn-pypi-package

However, the developers have modified their github repo (if not the pip package): https://github.com/gkhayes/mlrose

This means that to fix the issue, you can change the requirements.txt and replace mlrose==1.3.0 with git+https://github.com/gkhayes/mlrose (git has to be installed in the environment for this to work).

Another option is to check if there is a newer version of mlrose packaged by someone else. I didn't do this.
