wayfair-incubator / extra-model
Code to run the ExtRA algorithm for unsupervised topic/aspect extraction on English texts.
License: MIT License
Installing extra-model from PyPI fails with the following error message:
Downloading https://files.pythonhosted.org/packages/3e/fb/5899a59ee8d0f02202c1f02fe47671e0c93d1812b1deb2491505718473da/cymem-2.0.5.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/wayfair/tmp/pip-build-88wt78b9/cymem/setup.py", line 10, in <module>
from Cython.Build import cythonize
ModuleNotFoundError: No module named 'Cython'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /wayfair/tmp/pip-build-88wt78b9/cymem/
System information:
OS: centos 7.5
Python version: 3.6.0
Command run: pip install --index-url https://pypi.org/simple extra-model
Running pip install cython manually before attempting to install extra-model fixed this. It might be an issue with our setup.py file: SciTools/cf-units#106
The only example that I could find: extra-model/extra_model/_models.py, line 56 in 40fc8e2.
This file is the only one with 13 lines that are not covered, so it would be great to find a way to test them as well.
In #352 it became clear that supporting 3.10 would not be a simple upgrade because of the pycld3 dependency. The most obvious solution is to find an alternative: the last update to pycld3 was more than a year ago, and it is unlikely to be updated to support 3.10.
Right now, extra-model can run almost to the end and then fail because CommentId is named incorrectly.
We should at least validate the column names before we start extra-model to avoid situations like this.
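A minimal sketch of such an up-front check, using only the standard library to read the header row of the input file. CommentId is the column named in this issue; the second required column, Comments, and the function names are assumptions made for illustration and are not taken from the extra-model codebase.

```python
import csv
import io

# "CommentId" is named in the issue; "Comments" is an assumed second column.
REQUIRED_COLUMNS = ("CommentId", "Comments")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


def validate_columns(csv_file, required=REQUIRED_COLUMNS):
    """Raise ValueError before any processing if a required column is missing."""
    missing = missing_columns(csv_file, required)
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")


# Usage: a misspelled "CommentID" is reported before any work is done.
bad_input = io.StringIO("CommentID,Comments\n1,Great sofa\n")
print(missing_columns(bad_input))  # → ['CommentId']
```

Only the header row is read, so the check costs almost nothing compared with a full run.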
The only argument we have is input_filename; everything else is an option. Having it as an argument makes it very clunky to supply some options (e.g., output_filename) but not others.
Hi extra-modelers!
When running docker-compose run --rm setup, I get the following error:
Formatting file. This will take approximately 10 minutes.
loading projection weights from /usr/local/lib/python3.8/site-packages/gensim/test/test_data/embeddings/glove.840B.300d.txt
Traceback (most recent call last):
File "/usr/local/bin/extra-model-setup", line 33, in <module>
sys.exit(load_entry_point('extra-model', 'console_scripts', 'extra-model-setup')())
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/package/extra_model/_cli.py", line 65, in entrypoint_setup
setup(output_path)
File "/package/extra_model/_setup.py", line 41, in setup
format_file(file_unzipped, output_path)
File "/package/extra_model/_setup.py", line 84, in format_file
_ = glove2word2vec(glove_file, tmp_file)
File "/usr/local/lib/python3.8/site-packages/gensim/utils.py", line 1519, in new_func1
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/gensim/scripts/glove2word2vec.py", line 109, in glove2word2vec
glovekv = KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True)
File "/usr/local/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 1630, in load_word2vec_format
return _load_word2vec_format(
File "/usr/local/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 1892, in _load_word2vec_format
with utils.open(fname, 'rb') as fin:
File "/usr/local/lib/python3.8/site-packages/smart_open/smart_open_lib.py", line 180, in open
fobj = _shortcut_open(
File "/usr/local/lib/python3.8/site-packages/smart_open/smart_open_lib.py", line 287, in _shortcut_open
return _builtin_open(local_path, mode, buffering=buffering, **open_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.8/site-packages/gensim/test/test_data/embeddings/glove.840B.300d.txt'
ERROR: 1
I believe this occurs because the glove_file argument being passed into glove2word2vec here is a relative path. I haven't had a chance to experiment yet, but I think changing it to an absolute path will resolve this error.
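A minimal sketch of the proposed change, assuming the diagnosis above is right: the path is made absolute before it is handed to any library call. The helper name to_absolute and the example file location are hypothetical, not taken from _setup.py.

```python
from pathlib import Path


def to_absolute(path):
    """Return an absolute version of `path`, resolved against the current working directory."""
    return str(Path(path).expanduser().resolve())


# Hypothetical relative path; the real value comes from the setup step.
glove_file = to_absolute("embeddings/glove.840B.300d.txt")
print(Path(glove_file).is_absolute())  # → True
```

The resulting string could then be passed as the input path in place of the relative one.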
For example, with something like this - https://github.com/anko/txm
Right now all of the interfaces we expose assume a workflow like this: csv file in -> csv file out.
It would be great to expose another interface that takes a pandas df and returns a pandas df as output. We do this internally anyway, so nothing super difficult.
There is one test failing with the new spacy version, and "Not sturdy table" is not handled correctly, so there is still room for improvement in this area.
Currently, extra-model has dependencies on pycld2==0.31, cytoolz==0.9.0, and spacy==2.0.18: these packages either directly or indirectly use C extensions that are not shipped as a wheel. As a result, gcc is a requirement of extra-model so that these dependencies can be built from source.
The best case is to eliminate any dependencies on gcc. If so, images deployed to production will (1) be smaller, (2) build faster, and (3) be more secure. Additionally, users of extra-model are less likely to encounter installation errors because of missing C libraries.
It should be possible to eliminate the dependency on gcc with the following changes:
cytoolz: Neither cytoolz nor toolz is used in the codebase (perhaps an old dependency that was never cleaned up?). We can remove this package from the requirements file.
pycld2: This project hasn't been updated since 2019. If upgrading to use cld3 would be acceptable (difference between cld2 and cld3), we could use pycld3 as a drop-in replacement. pycld3 provides wheels for compatibility and is actively maintained.
spacy: Newer releases of spacy eliminate the offending dependencies. There is already a PR (#54) that updates spacy to a compatible version.
Once these changes are made, we can start using the slim-buster docker image instead of buster. The slim version is substantially smaller (112MB vs. 875MB) and doesn't contain gcc, which replicates a desirable production environment.
I'll put up a draft PR to demonstrate, and once #54 is merged I will update the PR to use the slim image.
The link in the README to https://www.aclweb.org/anthology/papers/D18-1384/d18-1384 ends in a 404. Assuming this is not just a problem on my end or a temporary problem with aclweb.org, we should update the link.
We have a how-to on how to set up extra-model, but it would be good to have a tutorial that goes end-to-end on some real(istic) example.
There seems to be a discrepancy between _cli.py and _run.py: entrypoint_setup() in _cli.py exposes an output_path argument to specify where to write the embeddings, but run() in _run.py does not offer the option to pass that as an argument; there it's hardcoded to ./embeddings: https://github.com/wayfair-incubator/extra-model/blob/main/extra_model/_run.py#L8
I think we should either
add an embeddings_path argument to run(), or
remove the output_path argument from entrypoint_setup()
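A rough sketch of the first option, with the embeddings location as a keyword parameter whose default matches the currently hardcoded ./embeddings. The function run_sketch and its other parameters are hypothetical stand-ins; the actual signature of run() in _run.py may differ.

```python
from pathlib import Path

# Default mirrors the hardcoded location mentioned in the issue.
DEFAULT_EMBEDDINGS_PATH = Path("./embeddings")


def run_sketch(input_path, output_path, embeddings_path=DEFAULT_EMBEDDINGS_PATH):
    """Hypothetical stand-in for run(); it only shows how the path is threaded through."""
    embeddings_path = Path(embeddings_path)
    # A real implementation would load the embeddings from this directory here.
    return {
        "input": Path(input_path),
        "output": Path(output_path),
        "embeddings": embeddings_path,
    }


# Existing callers keep the old behaviour; new callers can point elsewhere.
print(run_sketch("comments.csv", "out")["embeddings"])
print(run_sketch("comments.csv", "out", embeddings_path="/io/embeddings")["embeddings"])
```

Keeping the old value as the default means existing callers would not need to change.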
We test on 3.7-3.9, so it makes sense to add 3.7 to the classifiers.
We should also add 3.6 to the matrix to check whether tests still pass there and, if so, add it to the classifiers as well.
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
docker-compose.yaml
docker/Dockerfile
python 3.10-slim-buster
python 3.10-slim-buster
.github/workflows/main.yml
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
actions/checkout v4
actions/setup-python v5.1.0
.github/workflows/publish_release.yml
actions/checkout v4
actions/setup-python v5.1.0
pypa/gh-action-pypi-publish 3fbcf7ccf443305955ce16db9de8401f7dc1c7dd
pyproject.toml
requirements-test.txt
bandit ==1.7.8
black ==24.4.2
bump2version ==1.0.1
flake8 ==7.0.0
isort ==5.13.2
pytest ==8.2.0
mypy ==1.10.0
pytest-cov ==5.0.0
pytest-mock ==3.14.0
pdbpp ==0.10.3
pydocstyle ==6.3.0
mkdocs ==1.6.0
mkdocstrings ==0.25.0
mkdocs-material ==9.5.20
mkdocstrings-python ==1.10.0
requirements.txt
click ==8.1.7
numpy ==1.26.4
nltk ==3.8.1
scikit-learn ==1.4.2
vaderSentiment ==3.3.2
pandas ==2.2.2
langdetect ==1.0.9
networkx ==3.2.1
gensim ==4.3.2
scipy ==1.12.0
spacy ==3.7.4
setup.cfg
click ==8.1.7
numpy ==1.26.4
nltk ==3.8.1
scikit-learn ==1.4.2
vaderSentiment ==3.3.2
pandas ==2.2.2
langdetect ==1.0.9
networkx ==3.2.1
gensim ==4.3.2
scipy ==1.12.0
spacy ==3.7.4
E.g., using this - https://github.com/ekalinin/github-markdown-toc.go
Currently, extra-model/extra_model/_cli.py (line 15 in c5de631) has /app/output as a default and in the help. Since we mount extra-model code to /io/, both of these should say /io/output instead.
Two functions differ only in one variable, which could be passed in as a parameter.