neuml / codequestion
🔎 Semantic search for developers
License: Apache License 2.0
Since the original release in January 2020, there has been a lot of progress! sentence-transformers models now perform better than the models currently in codequestion, with similar speed (even on CPUs!).
codequestion 2.0 should move its models to sentence-transformers.
Hello,
Thank you very much for the project. I have one small issue: when you run python -m codequestion.download, it downloads a configuration file that codequestion uses to load the model. The path to the model seems hardcoded to /home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude
How can we tell codequestion to use our home directory, or modify the config file?
Best
txtai 5.0 was recently released and much has happened since the last version of codequestion!
The next release of codequestion should replace questions.db with storing content directly in the index. Topics and path traversal should also be added via semantic graphs.
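One possible shape for that change, sketched as a txtai configuration dict; the model name and the graph option names are assumptions, not the final settings:

```python
# Hedged sketch of the proposed configuration. content=True stores text
# in the embeddings index itself, replacing the separate questions.db;
# the graph section is a placeholder for semantic graph topic options.
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # assumed model
    "content": True,
    "graph": {
        "topics": {},
    },
}
```

With content stored in the index, query results can be returned directly from the embeddings database rather than joined against a separate SQLite file.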
I have 10 million text documents.
# Build embeddings index
embeddings.index(Index.stream(texts))
This part of the code runs too slowly. How can I make it run faster?
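One low-risk step is to stream documents lazily instead of building large in-memory lists before indexing. A minimal sketch, where the tuple layout mirrors txtai's (id, data, tags) convention:

```python
def stream(texts):
    """Lazily yield (id, text, tags) tuples so 10M documents are never
    materialized as one giant list before indexing."""
    for uid, text in enumerate(texts):
        yield uid, text, None

# embeddings.index(stream(texts))  # same call as before, now streaming
rows = list(stream(["first doc", "second doc"]))
```

Streaming keeps memory flat; the bigger throughput win at this scale is usually running the vector model on a GPU, since vectorization dominates indexing time.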
The original mdv project is no longer maintained and there is an active fork: https://pypi.org/project/mdv3/
Switch to the rich console library.
Python 3.7 is now EOL. The minimum supported version of Python for releases moving forward should be 3.8+
Pull latest data down and build new models
Add the following standard processes and procedures.
~/miniconda3/envs/deepl/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
codequestion query shell
Received the warning above when launching codequestion shell after a fresh install.
System details:
pytorch 1.6
Currently, there is duplicate logic for building embeddings in this project. Migrate this project to use txtai.
❯ python3.10 -m pip install codequestion
sformers, torch, txtai, codequestion
Successfully installed MarkupSafe-2.1.2 codequestion-2.0.0 faiss-cpu-1.7.3 html2markdown-0.1.7 huggingface-hub-0.13.4 jinja2-3.1.2 mpmath-1.3.0 networkx-3.1 python-louvain-0.16 scipy-1.10.1 sympy-1.11.1 tokenizers-0.13.3 torch-2.0.0 transformers-4.28.1 txtai-5.5.0
❯ python3.10 -m codequestion.download
Downloading model from https://github.com/neuml/codequestion/releases/download/v2.0.0/cqmodel.zip to /var/folders/4b/fykz7dvx2fj550ml_6t5qkww0000gn/T/cqmodel.zip
100%|
Decompressing model to /Users/tonis/.codequestion
Download complete
❯ codequestion
Loading model from /Users/tonis/.codequestion/models/stackexchange
[1] 58256 segmentation fault codequestion
I also tried in a venv, but I'm not a Python expert.
root@0497bd526f2b:/# codequestion
Loading model from /root/.codequestion/models/stackexchange
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.22.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/bin/codequestion", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 35, in main
Shell().cmdloop()
File "/usr/local/lib/python3.8/cmd.py", line 105, in cmdloop
self.preloop()
File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 21, in preloop
self.embeddings, self.db = Query.load()
File "/usr/local/lib/python3.8/site-packages/codequestion/query.py", line 127, in load
embeddings.load(path)
File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 332, in load
self.vectors = self.loadVectors(self.config["path"])
File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 104, in loadVectors
raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: '/home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude'
It would be nice to be able to list a coding language preference, so you are more likely to get answers relevant to you when you don't specify the language.
Currently, the model uses BM25 + fastText for indexing, but how can I use the SE 300d - BM25 and ParaNMT - BM25 models? I want to evaluate them.
Update to txtai 3.2 which now has some components as optional.
Build embeddings via sentence transformers and compare accuracy/speed vs BM25-fastText.
Similar to neuml/txtai#12 - allow serving codequestion models via FastAPI. This will help speed up calls via the command line (#4), along with the possibility of remote service calls.
The example in the readme simply has python -m codequestion.evaluate, but that gives an error about a missing -s {SOME SOURCE} or --source {SOME SOURCE} argument.
I was able to run it with python -m codequestion.evaluate -s test (assuming it ran after following the rest of the steps).
System: Windows 10 (x64) running Python 3.8.1 and pip 20.2.3.
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)
(env) D:\code\codequestion>python -m pip install --upgrade pip
Collecting pip
Using cached https://files.pythonhosted.org/packages/4e/5f/528232275f6509b1fff703c9280e58951a81abe24640905de621c9f81839/pip-20.2.3-py2.py3-none-any.whl
Installing collected packages: pip
Found existing installation: pip 19.2.3
Uninstalling pip-19.2.3:
Successfully uninstalled pip-19.2.3
Successfully installed pip-20.2.3
Here's what the full run looks like.
(env) D:\code\codequestion>pip install codequestion
Collecting codequestion
Using cached codequestion-1.1.0-py3-none-any.whl (17 kB)
Collecting tqdm==4.48.0
Using cached tqdm-4.48.0-py2.py3-none-any.whl (67 kB)
Collecting txtai>=1.2.0
Using cached txtai-1.2.0-py3-none-any.whl (20 kB)
Collecting mdv>=1.7.4
Using cached mdv-1.7.4.tar.gz (54 kB)
Collecting html2text>=2020.1.16
Using cached html2text-2020.1.16-py3-none-any.whl (32 kB)
Collecting numpy>=1.18.4
Downloading numpy-1.19.2-cp38-cp38-win_amd64.whl (13.0 MB)
|████████████████████████████████| 13.0 MB 6.4 MB/s
Collecting annoy>=1.16.3
Downloading annoy-1.16.3.tar.gz (644 kB)
|████████████████████████████████| 644 kB 6.4 MB/s
Collecting pymagnitude-lite>=0.1.43
Downloading pymagnitude_lite-0.1.143-py3-none-any.whl (34 kB)
Collecting nltk>=3.5
Using cached nltk-3.5.zip (1.4 MB)
Collecting sentence-transformers>=0.3.3
Using cached sentence-transformers-0.3.6.tar.gz (62 kB)
Collecting fasttext>=0.9.2
Downloading fasttext-0.9.2.tar.gz (68 kB)
|████████████████████████████████| 68 kB 4.8 MB/s
Collecting hnswlib>=0.4.0
Downloading hnswlib-0.4.0.tar.gz (17 kB)
Collecting scikit-learn>=0.23.1
Downloading scikit_learn-0.23.2-cp38-cp38-win_amd64.whl (6.8 MB)
|████████████████████████████████| 6.8 MB 3.3 MB/s
Collecting regex>=2020.5.14
Using cached regex-2020.7.14-cp38-cp38-win_amd64.whl (264 kB)
Collecting transformers==3.0.2
Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
|████████████████████████████████| 769 kB 6.4 MB/s
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)
Hi,
I get the following error when running python -m paperai.index
raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\\Users\\x\\.cord19\\vectors\\cord19-300d.magnitude'
PS. I am quite new to all this; so, apologies if the mistake is on my end.
Update to txtai 2.0
I'm able to get codequestion running but as soon as I begin to query it crashes with the following error:
OMP: Error #15: Initializing libomp140.x86_64.dll, but found libiomp5md.dll already initialized.
I'm running in Windows 10, venv, Python 3.11.3, Windows Powershell.
I already tried uninstalling numpy and torch and reinstalling them, but got the same results.
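A commonly cited workaround for this OpenMP runtime clash is to allow the duplicate library before the conflicting packages load; use it with care, since Intel's own warning notes it can degrade performance or cause incorrect results.

```python
import os

# Must be set before the first import of torch/faiss so the second
# OpenMP runtime does not abort the process with OMP Error #15.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
```

A cleaner long-term fix is ensuring only one OpenMP runtime is installed in the environment, e.g. by reinstalling the conflicting packages from a single source.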
It would be nice to be able to use it more like a CLI tool, where you could just type: codequestion "how do I iterate a list in python".
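A minimal sketch of such a one-shot mode; the function name is illustrative, and a real implementation would pass the joined query to codequestion's search path instead of returning it.

```python
def run(argv):
    """Join everything after the program name into a single query string;
    with no arguments, signal a fallback to the interactive shell."""
    if len(argv) > 1:
        return " ".join(argv[1:])
    return None  # no arguments: fall back to the interactive shell

query = run(["codequestion", "how", "do", "I", "iterate", "a", "list", "in", "python"])
```

This keeps the existing interactive shell as the default while allowing scripted, single-query invocations.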
This change will update the minimum dependency for codequestion to txtai 6.0.
The main code change needed here is in the scoring package. With the addition of term indexing, checks need to be added to determine whether a scoring index is used for term indexing or for word vector weighting.
Trying to configure on Windows 10, I seem to have gotten everything installed but get this traceback when I run it:
(keras-gpu-2) C:\Users\bbate>codequestion
The system cannot find the path specified.
2020-09-15 13:57:44.515137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Loading model from C:\Users\bbate\.codequestion\models\stackexchange
Traceback (most recent call last):
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\bbate\Miniconda3\envs\keras-gpu-2\Scripts\codequestion.exe\__main__.py", line 7, in <module>
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 48, in main
Shell().cmdloop()
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\cmd.py", line 105, in cmdloop
self.preloop()
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 22, in preloop
self.embeddings, self.db = Query.load()
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\query.py", line 127, in load
embeddings.load(path)
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\embeddings.py", line 258, in load
self.embeddings = ANN.create(self.config)
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\ann.py", line 51, in create
raise ImportError("Faiss library is not installed")
ImportError: Faiss library is not installed