meghdadfar / wordview Goto Github PK

View Code? Open in Web Editor NEW

11.0 3.0 1.0 41.11 MB

A Python package for Exploratory Data Analysis (EDA) for text-based data.

License: MIT License

Python 90.39% Makefile 0.68% Batchfile 0.72% HTML 8.20%

multiword-expressions exploratory-data-analysis feature-extraction text-analytics

wordview's People

Contributors

Stargazers

Watchers

Forkers

jgregoriods

wordview's Issues

Updating platformdirs (3.5.1 -> 3.8.0): Failed

Updating/installing new dependencies via poetry is generating several errors, primarily:

Installing dependencies from the lock file

Package operations: 54 installs, 21 updates, 0 removals

  • Updating platformdirs (3.5.1 -> 3.8.0): Failed

  AttributeError
  HTTPResponse object has no attribute 'strict'

  at .venv/lib/python3.10/site-packages/cachecontrol/serialize.py:54 in dumps
       50│                 ),
       51│                 u"status": response.status,
       52│                 u"version": response.version,
       53│                 u"reason": text_type(response.reason),
    →  54│                 u"strict": response.strict,
       55│                 u"decode_content": response.decode_content,
       56│             }
       57│         }
       58│

This could be resolved by pip install <PACKAGE>.

This might be a compatibility problem with Python 3.10, but it might be a problem with poetry.

Replace TypeError with ValueError in `MWE.init()` and move TypeError to `am.calculate_am()`

Since in MWE.__init__() in effect data validation is carried out (in particular, checking that the arguments passed by the user are valid) which is not quite the same as type-checking. Hence ValueError is a better error. Instead, type-checking and a possibleTypeError should be thrown where mwe_types is tried to be iterated over.

E.g.

def calculate_am(count_data: dict, am: str, mwe_types: List[str]) -> Dict[str, Dict]:
    res = {}
    num_words = sum(count_data["WORDS"].values())
    if am == "pmi":
        try:
            for mt in mwe_types:
                ...
        except TypeError:
            logger.error("err == TypeError: 'int' object is not iterable")

Feature Request: Use Natural Language to explore, analyze and describe data

Description:

Wordview can extract a wide range of statistics and information from text data. These statistics are shown in plots, word clouds, and tables. The goal of this feature is to implement functionalities to describe data in natural language. The user should be able to pass a describe_in_nl=True argument to Exploratory Data Analysis (EDA) functions to get the description of the analysis in natural language. For instance, in the function TextStatsPlots.show_stats(), the user should be able to pass TextStatsPlots.show_stats(describe_in_nl=True). Or analogously, when calling TextStatsPlots.show_distplot(plot='doc_len') the user should be able to pass describe_in_nl=True. describe_in_nl should be implemented for all EDA functions.

Use Case:

It's more intuitive to get a natural language description of your data which only highlights the most important points, than following all the details in plots, distributions, and word clouds.
Sometimes the user just wants a quick description of what all these numbers in the plots mean.
This feature can be quite useful for less technical users

Possible Implementation (optional):

TBD

Investigate the consistency of use and impacts of lower() across Wordview

Test clustering's sentencetransfomer vectorization model

Extract and show more stats and plots

tfs
dfs
idfs

Bug Report: Sometimes, MWEs are uncommon phrases

Description

In many cases, the extracted top MWEs are very uncommon. That's because both the MWE and some or all of their components have a very low frequency leading the PMI to be large.

To Reproduce

Steps to reproduce the behavior:
Simply run MWE extraction and check the results.

Expected behavior

Top MWE results should be common expressions not very rare and unknown.

Examples

whip these ninjas

Possible Solutions

Add a frequency threshold (as a parameter) that defaults to 1. MWE candidates that were observed below this threshold are discarded.

`TextStatsPlots.show_stats()` should call `TextStatsPlots.return_stats()`

Improvement:`TextStatsPlots.show_stats()` should call `TextStatsPlots.return_stats()`

Description:

We have created a new function TextStatsPlots.return_stats() for Datachat. It does most of what is done inside TextStatsPlots.show_stats() but instead of printing the stats in a tabulate table, it returns a json. The goal of this issue is to update TextStatsPlots.show_stats() so that it calls TextStatsPlots.return_stats() and then properly format the return value in a list so that it can be passed to tabulate. This will improve the consistency and allows us to get the stats only in one place i.e. TextStatsPlots.return_stats().

To count tokens, use a word tokenizer in `wordview.text_analysis.core.do_txt_analysis`

Description

Currently in wordview.text_analysis.core.do_txt_analysis tokens are extracted by splitting the text around space. Improve this by using a tokenizer. E.g. nltk word tokenizer.

Solution:

for text in tqdm(df["review"]):
    try:
        sentences = sent_tokenize(text.lower())
        for sentence in sentences:
            sentence_tokens = word_tokenize(sentence)
            num_tokens += len(sentence_tokens)
    except Exception as e:
        print("Processing entry --- %s --- lead to exception: %s" % (text, e.args[0]))
        continue

Higher order MWEs

Currently, only two-word MWEs are supported.
Include higher-order Ngrams.

Add more clustering methods

Change the logging

Feature Request: Analysis of sequence-level labels

Description:

One of the most widely used types of labels in NLP are sequence-level labels (e.g. Named Entity tags such as PER, LOC). The goal here is to implement this as a sub-feature in LabelStatsPlots class.

For instance, for the following paragraph

the following analysis must be carried out and visualized in a suitable plot (or table)

Use Case:

Learn about the distribution of various sequence-level labels
Learn which sequence-level labels are over or under-represented
In most text analysis tools and even labeling tools, the statistics of sequence level labels are widely ignored.

Feature Request: Better exploratory analysis of POS tags

Feature Request: Better exploratory analysis of POS tags #82

Description

Right now, POS tags are shown in the form of word clouds.

TextStatsPlots.show_word_clouds(type="VB")
# To see nouns
TextStatsPlots.show_word_clouds(type="NN")
# To see adjectives
TextStatsPlots.show_word_clouds(type="JJ")

This should be improved in the two following ways:

More POS tags should be included. E.g. RB, PRP, etc. See the complete list of Penn POS tags here.
POS tags should become more specific. E.g. Instead of only NN --> NN, NNS, NNP, NNPS, etc.
More options for visualizing the POS tags should be included. The user should be able to choose between a wordcloud, a bar plot, and a distribution plot.

Feature Request: [Bias Analysis]

Description:

Include functionality to identify bias in the corpus, so that it can be handled before it makes its way into the model training.

Use Case:

Before a corpus is used for model training, biases can be detected and eliminated to prevent the training of a biased model.

Possible Implementation (optional):

WEAT. The user can write a query with a bunch of keywords and association of those keywords are identified via WEAT. Note that first an Embedding model should be trained on this corpus. But this can be done fast, especially using packages like gensim.

Supporting Screenshots/Mockups:

...

Additional Context:

...

Bug Report: For English MWEs "Noun Noun Noun Compounds" and "Adjective Adjective Noun Compounds" seem to be always empty.

Description

For English MWEs "Noun Noun Noun Compounds" and "Adjective Adjective Noun Compounds" seem to be always empty. Investigate why this is happening. It might not be a bug, but investigation is needed.

To Reproduce

Run MWEs according to the docs and see the results either in the stored file or via rendering the mwe table.

Expected behavior

At least some "Noun Noun Noun Compounds" and "Adjective Adjective Noun Compounds" should be returned.

Feature Request: Statistics, plots & word clouds for syntactic N-grams

Description:

Currently, Wordview does not provide any statistics in any form e.g. plots, word clouds, tables, etc. for N-grams, although there are already functionalities to extract syntactic N-grams and their counts for MWEs. It would be useful if N-gram counts are presented to the user in some shape or form. E.g. in word cloud, bar plots, etc.

Why syntactic N-grams and not just any n-grams? Because e.g. in English, N-grams with no value such is a, it is etc. are very frequent. These are not semantic units and hence will represent no meaning. Syntactic N-grams adds a syntactic restriction to enforce the acceptance of only N-grams which form a semantic unit. E.g. a noun phrase.

Use Case:

Most frequent ngram analysis is frequently used for reports by publishers, news networks, education.

Feature Request: Extend chat to support all Wordview components

Description:

Currently, Wordview chat is implemented within the TextStatsPlos class. The goal of this feature is to include the same chat function in two other major Wordview components/classes, namely:

BiasDetector
MWE

Possible Implementation

The same code from TextStatsPlos.chat() can be used, however, the following notes should be taken into consideration:

Handle the situation when the user decides to open several chats (server port, thread, etc.)
Use the same chat.html which means this file should be moved elsewhere and be used by all chat functions.

Find alternative for fasttext.

Currently, fasttext dependency that is used in worldview for language identification cannot be installed due to a pybind11 issue. fasttext is also not maintained and not compatible with newer Python versions, hence, we have to look for an alternative.

Bug Report: Some MWEs appear in multiple classes

Describe the bug

Some multiword expressions appear in multiple (close) classes. For instance, femme fatale appears both in Noun Noun and Adjective Noun compounds. This can be due to POS tagging inaccuracies and is also a known problem. But it could also be due to some other issues. It has to be investigated and fixed. When the issue is with POS tags, we have to decide in which category we place it and put it only there in that category, and avoid presenting one MWE in two categories.

To Reproduce

Follow docs to generate MWEs.

Expected behaviour

A clear and concise description of what you expected to happen.

Screenshots

Refactor MWE module.

Situation

Some methods in MWE class produce IO side effects, which could potentially:
1. Affect client applications' performance.
2. Generate storage volume integration problems in cloud environments (e.g. running as lambda functions).

Objective

Split methods that do more than one thing (at least unrelated things), such as a calculation alongside a read/write I/O operations.

Justification

This decoupling makes sense because the pipelines requiring IO would significantly slow down an application using wordview in a streaming architecture.
The tradeoff is that any user/application using the non-IO pipeline would need to make sure they've enough memory on their instances, or even better, build their own pipelines in case they want to use they're own caching systems.

Downstream Work

We can add additional "pipelining" methods that connect several granular methods into either a without-IO or with-IO "pipelines".

Add more anomaly models

Currently, there is only one anomaly detection model (i.e. NormalDistAnomalies). This model is limited in the sense that:

It only accepts items with a one-dimensional feature representation.
If the distribution of this feature cannot be Gaussanized, the model does not work.

The goal is to include more anomaly detection models to address both of the above issues.

Testing of clustering fails on ubuntu due to CUDA issue.

Right now clustering tests, in particular, tests/clustering/test_clustering.py fails in GitHub Actions, throwing the following error:

lib/python3.10/site-packages/torch/__init__.py:168: in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/ctypes/__init__.py:374: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: libcurand.so.10: cannot open shared object file: No such file or directory

Since the tests run without any problem on Mac OS 11.7.6, this seems to be a CUDA issue on ubuntu-latest (which is the machine on which the Python environment is created and the application is being tested).

cuda is one of the dependencies of torch and torch is a dependency of sentence-transformers.

Feature Request: Sentence Length Distribution

Description:

Add an option and implement the corresponding function to show to the distribution of sentence lengths in TextStatsPlots.show_distplot().

Use Case:

Helps the user makes better decisions about semantic composition functions and sentence embeddings.

Implementation

This feature can be implemented in the same way as: TextStatsPlots._create_doc_len_plot.

Make sure user manually, or library automatically, runs nltk.download('punkt').

Situation

It is necessary to have downloaded nltk's punkt in order for nltk.tokenize to work at this entry point.
If the user has not downloaded nltk's punkt, the program will just catch the exception without informing that the nltk punkt dependency is missing.

Objective

Either:

Include instruction to user/collaborator in readme to run the following before using wordview:
```
import nltk
nltk.download('punkt')
```
Configure poetry to automatically download nltk punkt, if in fact poetry can handle such type of downloads.

Bug Report: API docs are not correctly rendered

Description

API docs are not correctly rendered in documentation pages:
https://meghdadfar.github.io/wordview/api.html

Investigate and fix the issue.

Screenshots

Feature Request: Implement more anomaly detection models

Description:

Currently, only one anomaly detection model i.e. NormalDistAnomalies is implemented.
This model is limited in the sense that:

It only accepts items with a one-dimensional feature representation (e.g. frequency).
If the distribution of this one-dimensional feature cannot be Gaussanized, the model does not work.

The goal is to implement more anomaly detection models to address both of the above issues.

Possible Implementation (optional):

TBD

Bug Report: Sometimes, MWEs are wrong or misspelled phrases

Description

In many cases, when the corpus contains misspelled or foreign words and phrases, top MWEs end up being those very rare misspelled expressions. This is a known problem when measuring PMI.

To Reproduce

Steps to reproduce the behavior:
Simply run MWE extraction and check the results.

Expected behavior

Top MWE results should be common expressions consisting of correct words.

Examples

Light Verb Constructions: LOCK THE DOOOOR

Possible Solutions

The proposed solution is to check the components of MWEs against a lexicon of the selected language to ensure they are actual words and not made-up words.

Enable User-Defined POS Pattern, Variable-Length MWE Candidate Generation

Situation

Current MWE candidate generation by mwe_utils.extract_mwes_from_sent() only generates bigram candidates for collocations of the following sequences:
- NC: [["NN", "NNS"], ["NN", "NNS"]]
- JNC: [["JJ"], ["NN", "NNS"]]
If we test the text segment New York State, we would get the following results:
- mwe_type == "NC" you capture York State (which is a false positive).
- mwe_type == "JNC" you capture New York (which is also a false positive).

Solution

Enable mwe_utils.extract_mwes_from_sent() to extract:
- variable-length MWE candidates,
- using a user-defined POS pattern for syntactical structure.
Possibly rename method to:
- mwe_utils.generate_candidate_mwe()
- mwe_utils.generate_candidate_collocations()
- etc.

User-defined POS Pattern for Syntactical Structure

Modify the mwe_type parameter or generate an additional optional parameter in the mwe_utils.extract_mwes_from_sent() method to be of type list[list[str]].
The user can provide an argument for a sequence of POS tags for the MWE he/she wants to extract.
- Alternative, we can enable POS-sequence aliasing, where aliases can be passed as arguments to mwe_type or the POS-sequence passed to the new parameter of type list[list[str]].
  - e.g.: Currently, NC is an alias for [["NN", "NNS"], ["NN", "NNS"]] and JNC for [["JJ"], ["NN", "NNS"]].
Every element of the outer list can be a list of candidate POS tags, enabling:
- Flexibility in the POS-sequence pattern.
- Allow the use of POS Taggers that return not just the top probable POS tag, but several.
- Allow the use of more than one POS Tagger for pooling POS tagging classification.
- Compensate for POS Tagger misclassification errors.

Variable-Length MWE

Enable the use of wildcard, quantifier, and set-negation characters in user-defined POS sequences.
- Wildcards: .
  - e.g.: "[.]?" would allow any POS tag, zero or one time.
    - e.g.: "[[DT][.]?[NN]]" would catch the bus as well as the blue bus.
- Quantifiers: ?, +, *, {n}, {n,}, {n,m}
  - Any outer list element without a quantifier essentially means "match one and only one occurrence", i.e. [<POS>]{1}
- Character Set Negation: [^]
  - e.g. to negate bigram that don't start with a determiner: "[[^DT][.]]"
User can set a max-length sequence.
- A default maximum length must be defined to stem complexity beyond a certain POS sequence length (quantifier-expanded or not).
- A warning must be logged to console when user sets a max-length above some computational threshold (TBD), or the expanded POS sequence exceeds this threshold.
Raise errors when quantifier-expanded POS sequences surpass the max-length argument.

Objective

Provide users with configurability of MWE patterns to extract based on their specific use cases.
Provide users with flexibility for dealing with POS Tagger misclassification errors.
Provide users with the ability for passing top-n POS tag candidates generated by POS Tagger instead of only top-1 POS tag candidate (as it is currently).
Provide users with the ability to use results from more than one POS Tagger, incorporating misalignment between POS Taggers (i.e. more than one candidate POS tag per token).
Provide users with the ability to extract MWE ignoring POS tag structure of collocations (i.e. generate strictly statistical collocations).
- i.e. to generate all possible trigrams: "[[.][.][.]]"

Justification

Current mwe_type argument options (i.e. NC and JNC) significantly restricts wordview's utility to users (e.g. NLP Engineers).
Current mwe_type argument options (i.e. NC and JNC) significantly restricts real-world, enterprise use cases/problems wordview can solve.

Downstream Work

Enable anchors, i.e. ^ and $ in POS sequences relative to a sentence or user-defined text segment.
Enable positive and negative lookaheads (?=)/(?!) and lookbehinds (?<=)/(?<!) in POS sequences.
Enable option to select from multiple POS Taggers wrapped by the library.
Enable option to select more than one POS Taggers wrapped by the library.
Enable option for user to provide his/her own (bespoke) POS-tagged tokenised text.

NB

This implementation is basically the implementation of the library tracer for tagging syntactical sequences of higher-order concepts (e.g. controlled vocabularies/mappings), but applied strictly to POS tags (i.e. lower-order syntactical concepts/mappings).

meghdadfar / wordview Goto Github PK

wordview's People

Contributors

Stargazers

Watchers

Forkers

wordview's Issues

Feature Request: Use Natural Language to explore, analyze and describe data

Description:

Use Case:

Possible Implementation (optional):

Description

To Reproduce

Expected behavior

Examples

Possible Solutions

Improvement:TextStatsPlots.show_stats() should call TextStatsPlots.return_stats()

Description:

Description

Solution:

Feature Request: Analysis of sequence-level labels

Description:

Use Case:

Feature Request: Better exploratory analysis of POS tags #82

Description

Feature Request: [Bias Analysis]

Description:

Use Case:

Possible Implementation (optional):

Supporting Screenshots/Mockups:

Additional Context:

Description

To Reproduce

Expected behavior

Feature Request: Statistics, plots & word clouds for syntactic N-grams

Description:

Use Case:

Feature Request: Extend chat to support all Wordview components

Description:

Possible Implementation

Describe the bug

To Reproduce

Expected behaviour

Screenshots

Refactor MWE module.

Situation

Objective

Justification

Downstream Work

Feature Request: Sentence Length Distribution

Description:

Use Case:

Implementation

Situation

Objective

Description

Screenshots

Feature Request: Implement more anomaly detection models

Description:

Possible Implementation (optional):

Description

To Reproduce

Expected behavior

Examples

Possible Solutions

Situation

Solution

User-defined POS Pattern for Syntactical Structure

Variable-Length MWE

Objective

Justification

Downstream Work

NB

Recommend Projects

Recommend Topics

Recommend Org

Improvement:`TextStatsPlots.show_stats()` should call `TextStatsPlots.return_stats()`