meghdadfar / wordview
A Python package for Exploratory Data Analysis (EDA) for text-based data.
License: MIT License
Updating/installing new dependencies via poetry generates several errors, primarily:
Installing dependencies from the lock file
Package operations: 54 installs, 21 updates, 0 removals
• Updating platformdirs (3.5.1 -> 3.8.0): Failed
AttributeError
HTTPResponse object has no attribute 'strict'
at .venv/lib/python3.10/site-packages/cachecontrol/serialize.py:54 in dumps
50│ ),
51│ u"status": response.status,
52│ u"version": response.version,
53│ u"reason": text_type(response.reason),
→ 54│ u"strict": response.strict,
55│ u"decode_content": response.decode_content,
56│ }
57│ }
58│
This could be resolved by pip install <PACKAGE>.
This might be a compatibility problem with Python 3.10, but it might also be a problem with poetry itself.
Since MWE.__init__() in effect carries out data validation (in particular, checking that the arguments passed by the user are valid), which is not quite the same as type-checking, ValueError is the better error there. Type-checking, with a possible TypeError, should instead happen where an attempt is made to iterate over mwe_types.
E.g.
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

def calculate_am(count_data: dict, am: str, mwe_types: List[str]) -> Dict[str, Dict]:
    res = {}
    num_words = sum(count_data["WORDS"].values())
    if am == "pmi":
        try:
            for mt in mwe_types:
                ...
        except TypeError:
            logger.error("err == TypeError: 'int' object is not iterable")
    return res
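A minimal sketch of the distinction, using a hypothetical MWE signature (the real class takes more arguments): __init__ raises TypeError when an argument has the wrong type, and ValueError when the type is right but the values are invalid:

```python
from typing import List


class MWE:
    """Minimal sketch with a hypothetical signature; the real class differs.

    __init__ performs data validation: wrong *types* raise TypeError,
    wrong *values* raise ValueError.
    """

    VALID_MWE_TYPES = {"NC", "JNC"}

    def __init__(self, mwe_types: List[str]) -> None:
        if not isinstance(mwe_types, list):
            # Type-checking: the argument has the wrong type altogether.
            raise TypeError(f"mwe_types must be a list, got {type(mwe_types).__name__}")
        invalid = set(mwe_types) - self.VALID_MWE_TYPES
        if invalid:
            # Data validation: right type, invalid values.
            raise ValueError(f"Invalid MWE types: {sorted(invalid)}")
        self.mwe_types = mwe_types
```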
Wordview can extract a wide range of statistics and information from text data. These statistics are shown in plots, word clouds, and tables. The goal of this feature is to implement functionality that describes the data in natural language. The user should be able to pass a describe_in_nl=True argument to Exploratory Data Analysis (EDA) functions to get the description of the analysis in natural language. For instance, the user should be able to call TextStatsPlots.show_stats(describe_in_nl=True), or analogously, to pass describe_in_nl=True when calling TextStatsPlots.show_distplot(plot='doc_len'). describe_in_nl should be implemented for all EDA functions.
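As a sketch of what the feature could produce, here is a hypothetical helper (the name and the stat keys are illustrative, not Wordview's actual API) that turns a stats mapping into a natural-language description:

```python
def describe_stats_in_nl(stats: dict) -> str:
    """Render basic corpus statistics as a natural-language description.

    Hypothetical helper; the keys below are illustrative only.
    """
    return (
        f"The corpus contains {stats['documents']:,} documents with a total of "
        f"{stats['words']:,} words, of which {stats['unique_words']:,} are unique. "
        f"The median document length is {stats['median_doc_len']} words."
    )
```

show_stats(describe_in_nl=True) could then print such a string alongside (or instead of) the table.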
TBD
In many cases, the extracted top MWEs are very uncommon. That is because both the MWE and some or all of its components have a very low frequency, which makes the PMI large.
Steps to reproduce the behavior:
Simply run MWE extraction and check the results.
Top MWE results should be common expressions, not very rare and unknown ones such as:
whip these ninjas
Add a frequency threshold (as a parameter) that defaults to 1. MWE candidates that were observed below this threshold are discarded.
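The proposed fix can be sketched as follows (function and argument names are hypothetical; wordview's real counting structures differ): rank bigram candidates by PMI, but discard any candidate observed fewer than min_count times:

```python
import math
from typing import Dict, List, Tuple


def top_pmi_mwes(
    word_counts: Dict[str, int],
    bigram_counts: Dict[Tuple[str, str], int],
    min_count: int = 1,
) -> List[Tuple[Tuple[str, str], float]]:
    """Rank bigram MWE candidates by PMI, dropping rare candidates."""
    n_words = sum(word_counts.values())
    n_bigrams = sum(bigram_counts.values())
    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # frequency threshold: rare candidates inflate PMI
        p_xy = count / n_bigrams
        p_x = word_counts[w1] / n_words
        p_y = word_counts[w2] / n_words
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With min_count=1 (the proposed default) behaviour is unchanged; raising it removes exactly the one-off expressions described above.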
TextStatsPlots.show_stats() should call TextStatsPlots.return_stats()
We have created a new function TextStatsPlots.return_stats() for Datachat. It does most of what is done inside TextStatsPlots.show_stats(), but instead of printing the stats in a tabulate table, it returns a JSON. The goal of this issue is to update TextStatsPlots.show_stats() so that it calls TextStatsPlots.return_stats() and then properly formats the return value into a list that can be passed to tabulate. This improves consistency and means the stats are produced in only one place, i.e. TextStatsPlots.return_stats().
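A rough sketch of the refactor (return_stats() is stubbed to return JSON, as described; the row-building step is what this issue adds, and the tabulate call is left as a comment to keep the sketch dependency-free):

```python
import json
from typing import List


def return_stats() -> str:
    """Stub standing in for TextStatsPlots.return_stats(): stats as JSON."""
    return json.dumps({"Documents": 3, "Words": 120, "Unique Words": 80})


def stats_to_rows(stats_json: str) -> List[list]:
    """Reshape the JSON from return_stats() into rows for tabulate."""
    return [[key, value] for key, value in json.loads(stats_json).items()]


def show_stats() -> None:
    """Proposed show_stats(): a thin wrapper over return_stats()."""
    rows = stats_to_rows(return_stats())
    # print(tabulate(rows, headers=["Statistic", "Value"]))
    for key, value in rows:
        print(f"{key}: {value}")
```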
Currently, in wordview.text_analysis.core.do_txt_analysis, tokens are extracted by splitting the text around spaces. Improve this by using a tokenizer, e.g. the NLTK word tokenizer:
from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm

# df and num_tokens come from the surrounding analysis code
for text in tqdm(df["review"]):
    try:
        sentences = sent_tokenize(text.lower())
        for sentence in sentences:
            sentence_tokens = word_tokenize(sentence)
            num_tokens += len(sentence_tokens)
    except Exception as e:
        print("Processing entry --- %s --- led to exception: %s" % (text, e.args[0]))
        continue
Currently, only two-word MWEs are supported.
Include higher-order Ngrams.
One of the most widely used types of labels in NLP is sequence-level labels (e.g. Named Entity tags such as PER, LOC). The goal here is to implement this as a sub-feature in the LabelStatsPlots class.
For instance, for the following paragraph
the following analysis must be carried out and visualized in a suitable plot (or table)
Right now, POS tags are shown in the form of word clouds.
# To see verbs
TextStatsPlots.show_word_clouds(type="VB")
# To see nouns
TextStatsPlots.show_word_clouds(type="NN")
# To see adjectives
TextStatsPlots.show_word_clouds(type="JJ")
This should be improved in the two following ways:
Include functionality to identify bias in the corpus, so that it can be handled before it makes its way into the model training.
WEAT. The user can write a query with a set of keywords, and associations of those keywords are identified via WEAT. Note that an embedding model first has to be trained on the corpus, but this can be done quickly, especially using packages like gensim.
...
...
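To illustrate the core computation, here is a toy WEAT association score in plain Python (in practice the vectors would come from an embedding model trained on the corpus, e.g. with gensim; all names here are hypothetical):

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def weat_association(
    word: str,
    attrs_a: List[str],
    attrs_b: List[str],
    vectors: Dict[str, Sequence[float]],
) -> float:
    """s(w, A, B): mean similarity of `word` to attribute set A minus its
    mean similarity to attribute set B. Positive means closer to A."""
    sim_a = sum(cosine(vectors[word], vectors[a]) for a in attrs_a) / len(attrs_a)
    sim_b = sum(cosine(vectors[word], vectors[b]) for b in attrs_b) / len(attrs_b)
    return sim_a - sim_b
```

A word that scores strongly toward one attribute set across many queries is a signal of bias in the corpus.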
For English MWEs "Noun Noun Noun Compounds" and "Adjective Adjective Noun Compounds" seem to be always empty. Investigate why this is happening. It might not be a bug, but investigation is needed.
Run MWEs according to the docs and see the results either in the stored file or via rendering the mwe table.
At least some "Noun Noun Noun Compounds" and "Adjective Adjective Noun Compounds" should be returned.
Currently, Wordview does not provide any statistics in any form (e.g. plots, word clouds, tables) for N-grams, although there are already functionalities to extract syntactic N-grams and their counts for MWEs. It would be useful if N-gram counts were presented to the user in some shape or form, e.g. word clouds or bar plots.
Why syntactic N-grams and not just any N-grams? Because, e.g. in English, N-grams with no semantic value such as "is a" or "it is" are very frequent. These are not semantic units and hence represent no meaning. Syntactic N-grams add a syntactic restriction so that only N-grams which form a semantic unit (e.g. a noun phrase) are accepted.
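This restriction can be sketched as a filter over POS-tagged bigrams (the accepted tag pairs below are only an example; wordview's MWE code defines its own sequences):

```python
from collections import Counter
from typing import List, Tuple


def syntactic_bigram_counts(tagged: List[Tuple[str, str]]) -> Counter:
    """Count only bigrams whose POS sequence can form a semantic unit
    (illustrative tag pairs: adjective+noun and noun+noun), so that
    filler bigrams such as "is a" or "it is" are never counted."""
    accepted = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"), ("NN", "NNS")}
    counts: Counter = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in accepted:
            counts[(w1, w2)] += 1
    return counts
```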
Currently, Wordview chat is implemented within the TextStatsPlots class. The goal of this feature is to include the same chat function in two other major Wordview components/classes, namely:
The same code from TextStatsPlots.chat() can be used; however, the following notes should be taken into consideration:
Currently, the fasttext dependency that is used in wordview for language identification cannot be installed due to a pybind11 issue. fasttext is also unmaintained and not compatible with newer Python versions; hence, we have to look for an alternative.
Some multiword expressions appear in multiple (close) classes. For instance, femme fatale appears both in Noun Noun and Adjective Noun compounds. This can be due to POS-tagging inaccuracies, which is a known problem, but it could also be due to some other issue. It has to be investigated and fixed. When the cause is the POS tags, we have to decide which category the MWE belongs to, place it only in that category, and avoid presenting one MWE in two categories.
Follow docs to generate MWEs.
Each MWE should appear in exactly one category.
The MWE class produces I/O side effects, which could potentially complicate using wordview in a streaming architecture.
Currently, there is only one anomaly detection model (i.e. NormalDistAnomalies). This model is limited in the sense that:
It only accepts items with a one-dimensional feature representation.
If the distribution of this feature cannot be Gaussanized, the model does not work.
The goal is to include more anomaly detection models to address both of the above issues.
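One distribution-free direction, sketched below with hypothetical names: an IQR-based detector that needs no Gaussianization (the one-dimensional limitation would still require a separate model, e.g. a density- or tree-based one):

```python
from typing import List


def iqr_anomalies(values: List[float], k: float = 1.5) -> List[float]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Distribution-free sketch: unlike NormalDistAnomalies, it makes no
    normality assumption, so the feature need not be Gaussianized.
    """
    s = sorted(values)

    def quantile(p: float) -> float:
        # Linear interpolation between closest ranks.
        i = p * (len(s) - 1)
        lo = int(i)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    spread = k * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]
```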
Right now the clustering tests, in particular tests/clustering/test_clustering.py, fail in GitHub Actions, throwing the following error:
lib/python3.10/site-packages/torch/__init__.py:168: in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/ctypes/__init__.py:374: in __init__
self._handle = _dlopen(self._name, mode)
E OSError: libcurand.so.10: cannot open shared object file: No such file or directory
Since the tests run without any problem on macOS 11.7.6, this seems to be a CUDA issue on ubuntu-latest (which is the machine on which the Python environment is created and the application is tested). cuda is one of the dependencies of torch, and torch is a dependency of sentence-transformers.
Add an option, and implement the corresponding function, to show the distribution of sentence lengths in TextStatsPlots.show_distplot().
This feature can be implemented in the same way as TextStatsPlots._create_doc_len_plot.
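The data behind such a plot could be gathered as follows (a sketch with naive sentence splitting; wordview would use its own tokenizer, and the plotting itself would mirror TextStatsPlots._create_doc_len_plot):

```python
from typing import Iterable, List


def sentence_lengths(documents: Iterable[str]) -> List[int]:
    """Collect per-sentence token counts across a corpus.

    Sketch only: sentences are split on '.', tokens on whitespace.
    """
    lengths = []
    for doc in documents:
        for sentence in doc.split("."):
            tokens = sentence.split()
            if tokens:
                lengths.append(len(tokens))
    return lengths
```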
Users must download punkt in order for nltk.tokenize to work at this entry point. Without punkt, the program just catches the exception without informing the user that the nltk punkt dependency is missing. Either:
Include instructions in the readme telling the user/collaborator to run the following before using wordview:
import nltk
nltk.download('punkt')
Or configure poetry to automatically download nltk punkt, if poetry can in fact handle this type of download.
API docs are not correctly rendered in documentation pages:
https://meghdadfar.github.io/wordview/api.html
Investigate and fix the issue.
TBD
In many cases, when the corpus contains misspelled or foreign words and phrases, the top MWEs end up being those very rare misspelled expressions. This is a known problem when measuring PMI.
Steps to reproduce the behavior:
Simply run MWE extraction and check the results.
Top MWE results should be common expressions consisting of correct words, not results like:
Light Verb Constructions: LOCK THE DOOOOR
The proposed solution is to check the components of MWEs against a lexicon of the selected language to ensure they are actual words and not made-up words.
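The check can be sketched like this (hypothetical helper; in practice the lexicon would come from a wordlist for the selected language, e.g. NLTK's words corpus for English):

```python
from typing import Set


def is_valid_mwe(mwe: str, lexicon: Set[str]) -> bool:
    """Keep an MWE candidate only if every one of its components is an
    actual word of the selected language according to `lexicon`."""
    return all(token.lower() in lexicon for token in mwe.split())
```

Candidates failing the check would be discarded before PMI ranking.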
mwe_utils.extract_mwes_from_sent() only generates bigram candidates for collocations of the following sequences:
[["NN", "NNS"], ["NN", "NNS"]]
[["JJ"], ["NN", "NNS"]]
E.g. for the trigram New York State, we would get the following results:
With mwe_type == "NC" you capture York State (which is a false positive).
With mwe_type == "JNC" you capture New York (which is also a false positive).
Improve mwe_utils.extract_mwes_from_sent() to extract:
mwe_utils.generate_candidate_mwe()
mwe_utils.generate_candidate_collocations()
Extend the mwe_type parameter, or add an additional optional parameter to the mwe_utils.extract_mwes_from_sent() method, of type list[list[str]]. Candidates are then generated according to mwe_type, or to the POS sequence passed to the new parameter of type list[list[str]].
NC is an alias for [["NN", "NNS"], ["NN", "NNS"]] and JNC for [["JJ"], ["NN", "NNS"]].
"[.]?" would allow any POS tag, zero or one time.
"[[DT][.]?[NN]]" would catch the bus as well as the blue bus.
Quantifiers: ?, +, *, {n}, {n,}, {n,m}, e.g. [<POS>]{1}.
Negation via [^], e.g. "[[^DT][.]]".
Wildcards, e.g. "[[.][.][.]]" for any sequence of three POS tags.
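To show the idea is implementable, here is a sketch that compiles a subset of such a syntax ([TAG], the wildcard [.], negation [^TAG], and a trailing quantifier) down to an ordinary Python regex over a space-delimited tag string; every name here is hypothetical:

```python
import re
from typing import List, Pattern

# One bracketed atom: optional negation, a tag or the '.' wildcard,
# and an optional trailing quantifier (?, *, +, {n}, {n,}, {n,m}).
ATOM = re.compile(r"\[(\^?)([A-Z$]+|\.)\](\?|\*|\+|\{\d+(?:,\d*)?\})?")


def pos_pattern_to_regex(pattern: str) -> Pattern[str]:
    """Compile a POS pattern such as "[DT][.]?[NN]" into a regex matched
    against the sentence's tags joined by spaces (each tag followed by one space)."""
    parts = [r"(?<!\S)"]  # only start matching at a tag boundary
    for negated, tag, quantifier in ATOM.findall(pattern):
        if tag == ".":
            atom = r"(?:\S+ )"              # any single POS tag
        elif negated:
            atom = rf"(?!{tag} )(?:\S+ )"   # any tag except `tag`
        else:
            atom = rf"(?:{tag} )"
        parts.append(atom + quantifier)
    return re.compile("".join(parts))


def matches(pos_regex: Pattern[str], tags: List[str]) -> bool:
    return pos_regex.search(" ".join(tags) + " ") is not None
```

With this, "[DT][.]?[NN]" matches both the DT NN of "the bus" and the DT JJ NN of "the blue bus", as proposed above.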
The limited mwe_type argument options (i.e. NC and JNC) significantly restrict wordview's utility to users (e.g. NLP engineers) and the real-world, enterprise use cases/problems wordview can solve.
Support anchors ^ and $ in POS sequences, relative to a sentence or a user-defined text segment.
Support lookaheads (?=)/(?!) and lookbehinds (?<=)/(?<!) in POS sequences.
Something like tracer for tagging syntactical sequences of higher-order concepts (e.g. controlled vocabularies/mappings), but applied strictly to POS tags (i.e. lower-order syntactical concepts/mappings).