wayfair-incubator / extra-model

Code to run the ExtRA algorithm for unsupervised topic/aspect extraction on English texts.

License: MIT License

Dockerfile 0.01% Shell 0.02% Python 1.76% HTML 97.56% CSS 0.01% JavaScript 0.37% Jupyter Notebook 0.26%
nlp nlp-keywords-extraction nlp-library machine-learning-algorithms python3 python python-library aspect-based-sentiment-analysis aspect-extraction

extra-model's People

Contributors

0xrushi, atruslow, chrisantonellis, dependabot[bot], gwenyyh, jashparekh, khairajani, mmozerwayfair, natalisucks, renovate[bot], romatik, sanrehmo, subhash686

extra-model's Issues

Installing from pypi fails

Installing extra-model from PyPI fails with the following error message:

  Downloading https://files.pythonhosted.org/packages/3e/fb/5899a59ee8d0f02202c1f02fe47671e0c93d1812b1deb2491505718473da/cymem-2.0.5.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/wayfair/tmp/pip-build-88wt78b9/cymem/setup.py", line 10, in <module>
        from Cython.Build import cythonize
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /wayfair/tmp/pip-build-88wt78b9/cymem/

System information:

OS: centos 7.5
Python version: 3.6.0
Command run: pip install --index-url https://pypi.org/simple extra-model

Running `pip install cython` manually before installing extra-model fixed this. It might be an issue with our setup.py file; see SciTools/cf-units#106.

Find a way to support python 3.10

In #352 it became clear that supporting 3.10 will not be a simple upgrade because of the pycld3 dependency. The most obvious solution is to find an alternative: pycld3 was last updated more than a year ago, and it is unlikely to be updated to support 3.10.

Add input validation before running `extra-model`

Currently, extra-model can run almost to the end and then fail because the CommentId column is named incorrectly.

We should at least validate the column names before starting extra-model to avoid situations like this.
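
A minimal sketch of such a pre-flight check, using only the standard library. Only `CommentId` is named in this issue; the second required column (`Comments`) and the function name `validate_columns` are assumptions made for illustration, not part of the existing codebase.

```python
import csv
import io

# Assumption: "CommentId" comes from the issue; "Comments" is a hypothetical second column.
REQUIRED_COLUMNS = ("CommentId", "Comments")


def validate_columns(csv_file, required=REQUIRED_COLUMNS):
    """Read only the header row and raise early if a required column is missing."""
    header = next(csv.reader(csv_file), [])
    missing = [name for name in required if name not in header]
    if missing:
        raise ValueError(f"Input is missing required column(s): {', '.join(missing)}")
    return header


good = io.StringIO("CommentId,Comments\n1,Sturdy table\n")
validate_columns(good)  # passes silently

bad = io.StringIO("CommentID,Comments\n1,Sturdy table\n")  # wrong capitalisation
try:
    validate_columns(bad)
except ValueError as err:
    print(err)
```

Because only the header row is read, the check costs almost nothing compared with a full run and can be called as the first step of any entry point.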

"No such file or directory" error when running setup docker-compose service

Hi extra-modelers 👋!

When running `docker-compose run --rm setup`, I get the following error:

  Formatting file. This will take approximately 10 minutes.
  loading projection weights from /usr/local/lib/python3.8/site-packages/gensim/test/test_data/embeddings/glove.840B.300d.txt
Traceback (most recent call last):
  File "/usr/local/bin/extra-model-setup", line 33, in <module>
    sys.exit(load_entry_point('extra-model', 'console_scripts', 'extra-model-setup')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/package/extra_model/_cli.py", line 65, in entrypoint_setup
    setup(output_path)
  File "/package/extra_model/_setup.py", line 41, in setup
    format_file(file_unzipped, output_path)
  File "/package/extra_model/_setup.py", line 84, in format_file
    _ = glove2word2vec(glove_file, tmp_file)
  File "/usr/local/lib/python3.8/site-packages/gensim/utils.py", line 1519, in new_func1
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/gensim/scripts/glove2word2vec.py", line 109, in glove2word2vec
    glovekv = KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True)
  File "/usr/local/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 1630, in load_word2vec_format
    return _load_word2vec_format(
  File "/usr/local/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 1892, in _load_word2vec_format
    with utils.open(fname, 'rb') as fin:
  File "/usr/local/lib/python3.8/site-packages/smart_open/smart_open_lib.py", line 180, in open
    fobj = _shortcut_open(
  File "/usr/local/lib/python3.8/site-packages/smart_open/smart_open_lib.py", line 287, in _shortcut_open
    return _builtin_open(local_path, mode, buffering=buffering, **open_kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.8/site-packages/gensim/test/test_data/embeddings/glove.840B.300d.txt'
ERROR: 1

I believe this occurs because the glove_file argument passed into glove2word2vec is a relative path. I haven't had a chance to experiment yet, but I think changing it to an absolute path will resolve this error.
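
A minimal sketch of the proposed fix, assuming the path only needs to be made absolute before it is handed on. The helper name `to_absolute` and the relative location `embeddings/` are hypothetical, and the call into gensim is deliberately left out because the fix itself is untested.

```python
from pathlib import Path


def to_absolute(path):
    """Return an absolute version of `path`, anchored at the current working directory."""
    return Path(path).expanduser().resolve()


# Hypothetical relative location of the unzipped GloVe file.
glove_file = to_absolute("embeddings/glove.840B.300d.txt")
print(glove_file.is_absolute())
```

`resolve()` does not require the file to exist, so the conversion can happen before the download step has finished.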

Add an interface to use `extra-model` inside of Python

Right now all of the interfaces we expose assume a workflow like this: CSV file in -> CSV file out.
It would be great to expose another interface that takes a pandas DataFrame as input and returns a pandas DataFrame as output. We do this internally anyway, so it should not be difficult.

Re-work adjective negation

There is one test failing with the new spacy version, and "Not sturdy table" is not handled correctly, so there is still room for improvement in this area.

Eliminate non-pure Python dependencies without wheels

Currently, extra-model has dependencies on pycld2==0.31, cytoolz==0.9.0, and spacy==2.0.18. These packages either directly or indirectly use C extensions that are not shipped as wheels. As a result, gcc is a requirement of extra-model so that these dependencies can be built from source.

The best case is to eliminate any dependencies on gcc. If so, images deployed to production will (1) be smaller, (2) build faster, and (3) be more secure. Additionally, users of extra-model are less likely to encounter installation errors because of missing C libraries.

It should be possible to eliminate the dependency on gcc with the following changes:

  1. cytoolz: Neither cytoolz nor toolz is used in the codebase (perhaps an old dependency that was never cleaned up?). We can remove this package from the requirements file.
  2. pycld2: This project hasn't been updated since 2019. If upgrading to cld3 is acceptable (difference between cld2 and cld3), we could use pycld3 as a drop-in replacement. pycld3 provides wheels for compatibility and is actively maintained.
  3. spacy: Newer releases of spacy eliminate the offending dependencies. There is already a PR (#54) that updates spacy to a compatible version.

Once these changes are made, we can start using the slim-buster docker image instead of buster. The slim version is substantially smaller (112MB vs. 875MB) and doesn't contain gcc, which replicates a desirable production environment.

I'll put up a draft PR to demonstrate, and once #54 is merged I will update the PR to use the slim image.

Discrepancy between CLI entrypoint and run script

There seems to be a discrepancy between _cli.py and _run.py: entrypoint_setup() in _cli.py exposes an output_path argument that specifies where to write the embeddings, but run() in _run.py offers no way to pass that location as an argument. There it is hardcoded to ./embeddings: https://github.com/wayfair-incubator/extra-model/blob/main/extra_model/_run.py#L8

I think we should either

  • add an embeddings_path argument to run() or
  • remove the output_path argument from entrypoint_setup()
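
The first option could look roughly like the sketch below. The signature is hypothetical: the parameter names `input_path` and `output_path` and the returned dictionary are assumptions for illustration and do not reflect the actual contents of _run.py.

```python
from pathlib import Path

# The location the issue says is hardcoded today, kept as the default.
DEFAULT_EMBEDDINGS_PATH = Path("./embeddings")


def run(input_path, output_path, embeddings_path=DEFAULT_EMBEDDINGS_PATH):
    """Hypothetical signature: collect all three locations; the real pipeline is omitted."""
    resolved = {
        "input": Path(input_path),
        "output": Path(output_path),
        "embeddings": Path(embeddings_path),
    }
    # ... load embeddings from resolved["embeddings"] and run the model here ...
    return resolved
```

Keeping ./embeddings as the default would leave existing callers unaffected while letting users who ran the setup command with a custom output_path point run() at the same directory.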

Add 3.6 and 3.7 to classifiers

We test on 3.7-3.9, so it makes sense to add 3.7 to the classifiers.
We should also add 3.6 to the test matrix to check whether the tests still pass there and, if so, add it to the classifiers as well.

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Detected dependencies

docker-compose
docker-compose.yaml
dockerfile
docker/Dockerfile
  • python 3.10-slim-buster
  • python 3.10-slim-buster
github-actions
.github/workflows/main.yml
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • actions/checkout v4
  • actions/setup-python v5.1.0
.github/workflows/publish_release.yml
  • actions/checkout v4
  • actions/setup-python v5.1.0
  • pypa/gh-action-pypi-publish 3fbcf7ccf443305955ce16db9de8401f7dc1c7dd
pep621
pyproject.toml
pip_requirements
requirements-test.txt
  • bandit ==1.7.8
  • black ==24.4.2
  • bump2version ==1.0.1
  • flake8 ==7.0.0
  • isort ==5.13.2
  • pytest ==8.2.0
  • mypy ==1.10.0
  • pytest-cov ==5.0.0
  • pytest-mock ==3.14.0
  • pdbpp ==0.10.3
  • pydocstyle ==6.3.0
  • mkdocs ==1.6.0
  • mkdocstrings ==0.25.0
  • mkdocs-material ==9.5.20
  • mkdocstrings-python ==1.10.0
requirements.txt
  • click ==8.1.7
  • numpy ==1.26.4
  • nltk ==3.8.1
  • scikit-learn ==1.4.2
  • vaderSentiment ==3.3.2
  • pandas ==2.2.2
  • langdetect ==1.0.9
  • networkx ==3.2.1
  • gensim ==4.3.2
  • scipy ==1.12.0
  • spacy ==3.7.4
setup-cfg
setup.cfg
  • click ==8.1.7
  • numpy ==1.26.4
  • nltk ==3.8.1
  • scikit-learn ==1.4.2
  • vaderSentiment ==3.3.2
  • pandas ==2.2.2
  • langdetect ==1.0.9
  • networkx ==3.2.1
  • gensim ==4.3.2
  • scipy ==1.12.0
  • spacy ==3.7.4
