neuml / codequestion
🔎 Semantic search for developers
License: Apache License 2.0
Since the original release in January 2020, there has been a lot of progress! sentence-transformers models now perform better than the models currently in codequestion, with similar speed (even on CPUs!).
codequestion 2.0 should move its models to sentence-transformers.
Hello,
Thank you very much for the project. I have one small issue: when you run python -m codequestion.download, it downloads a configuration file that codequestion uses to load the model. The path to the model seems hardcoded to /home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude
How can we tell codequestion to use our home directory, or modify the config file?
Best
txtai 5.0 was recently released and much has happened since the last version of codequestion!
The next release of codequestion should replace questions.db with storing content directly in the index. Topics and path traversal should also be added via semantic graphs.
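One possible shape for that change, sketched as a txtai configuration dict; the model name and the graph option names are assumptions, not the final settings:

```python
# Hedged sketch of the proposed configuration. content=True stores text
# in the embeddings index itself, replacing the separate questions.db;
# the graph section is a placeholder for semantic graph topic options.
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # assumed model
    "content": True,
    "graph": {
        "topics": {},
    },
}
```

With content stored in the index, query results can be returned directly from the embeddings database rather than joined against a separate SQLite file.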
I have 10 million text documents.
# Build embeddings index
embeddings.index(Index.stream(texts))
This part of the code runs too slowly. How can I make it run faster?
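One low-risk step is to stream documents lazily instead of building large in-memory lists before indexing. A minimal sketch, where the tuple layout mirrors txtai's (id, data, tags) convention:

```python
def stream(texts):
    """Lazily yield (id, text, tags) tuples so 10M documents are never
    materialized as one giant list before indexing."""
    for uid, text in enumerate(texts):
        yield uid, text, None

# embeddings.index(stream(texts))  # same call as before, now streaming
rows = list(stream(["first doc", "second doc"]))
```

Streaming keeps memory flat; the bigger throughput win at this scale is usually running the vector model on a GPU, since vectorization dominates indexing time.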
The original mdv project is no longer maintained and there is an active fork: https://pypi.org/project/mdv3/
Switch to the rich console library.
Python 3.7 is now EOL. The minimum supported version of Python for releases moving forward should be 3.8+
Pull latest data down and build new models
Add the following standard processes and procedures.
~/miniconda3/envs/deepl/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
codequestion query shell
Received the warning above when launching codequestion shell after a fresh install.
System details:
pytorch 1.6
Currently, there is duplicate logic for building embeddings in this project. Migrate this project to use txtai.
❯ python3.10 -m pip install codequestion
sformers, torch, txtai, codequestion
Successfully installed MarkupSafe-2.1.2 codequestion-2.0.0 faiss-cpu-1.7.3 html2markdown-0.1.7 huggingface-hub-0.13.4 jinja2-3.1.2 mpmath-1.3.0 networkx-3.1 python-louvain-0.16 scipy-1.10.1 sympy-1.11.1 tokenizers-0.13.3 torch-2.0.0 transformers-4.28.1 txtai-5.5.0
❯ python3.10 -m codequestion.download
Downloading model from https://github.com/neuml/codequestion/releases/download/v2.0.0/cqmodel.zip to /var/folders/4b/fykz7dvx2fj550ml_6t5qkww0000gn/T/cqmodel.zip
100%|
Decompressing model to /Users/tonis/.codequestion
Download complete
❯ codequestion
Loading model from /Users/tonis/.codequestion/models/stackexchange
[1] 58256 segmentation fault codequestion
I also tried in a venv, but I'm not a Python expert.
root@0497bd526f2b:/# codequestion
Loading model from /root/.codequestion/models/stackexchange
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.22.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/bin/codequestion", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 35, in main
Shell().cmdloop()
File "/usr/local/lib/python3.8/cmd.py", line 105, in cmdloop
self.preloop()
File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 21, in preloop
self.embeddings, self.db = Query.load()
File "/usr/local/lib/python3.8/site-packages/codequestion/query.py", line 127, in load
embeddings.load(path)
File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 332, in load
self.vectors = self.loadVectors(self.config["path"])
File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 104, in loadVectors
raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: '/home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude'
It would be nice to be able to list a coding language preference, so you are more likely to get answers relevant to you when you don't specify the language.
Currently, the model uses BM25 + fastText for indexing, but how can I use the SE 300d - BM25 and ParaNMT - BM25 models? I want to evaluate them.
Update to txtai 3.2 which now has some components as optional.
Build embeddings via sentence transformers and compare accuracy/speed vs BM25-fastText.
Similar to neuml/txtai#12 - allow serving codequestion models via FastAPI. This will help speed up calls via the command line (#4), along with the possibility of remote service calls.
The example in the readme simply has python -m codequestion.evaluate, but that gives an error about a missing -s {SOME SOURCE} or --source {SOME SOURCE} argument.
I was able to run it with python -m codequestion.evaluate -s test (assuming it ran after following the rest of the steps).
System: Windows 10 (x64) running Python 3.8.1 and pip 20.2.3.
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)
(env) D:\code\codequestion>python -m pip install --upgrade pip
Collecting pip
Using cached https://files.pythonhosted.org/packages/4e/5f/528232275f6509b1fff703c9280e58951a81abe24640905de621c9f81839/pip-20.2.3-py2.py3-none-any.whl
Installing collected packages: pip
Found existing installation: pip 19.2.3
Uninstalling pip-19.2.3:
Successfully uninstalled pip-19.2.3
Successfully installed pip-20.2.3
Here's what the full run looks like.
(env) D:\code\codequestion>pip install codequestion
Collecting codequestion
Using cached codequestion-1.1.0-py3-none-any.whl (17 kB)
Collecting tqdm==4.48.0
Using cached tqdm-4.48.0-py2.py3-none-any.whl (67 kB)
Collecting txtai>=1.2.0
Using cached txtai-1.2.0-py3-none-any.whl (20 kB)
Collecting mdv>=1.7.4
Using cached mdv-1.7.4.tar.gz (54 kB)
Collecting html2text>=2020.1.16
Using cached html2text-2020.1.16-py3-none-any.whl (32 kB)
Collecting numpy>=1.18.4
Downloading numpy-1.19.2-cp38-cp38-win_amd64.whl (13.0 MB)
|████████████████████████████████| 13.0 MB 6.4 MB/s
Collecting annoy>=1.16.3
Downloading annoy-1.16.3.tar.gz (644 kB)
|████████████████████████████████| 644 kB 6.4 MB/s
Collecting pymagnitude-lite>=0.1.43
Downloading pymagnitude_lite-0.1.143-py3-none-any.whl (34 kB)
Collecting nltk>=3.5
Using cached nltk-3.5.zip (1.4 MB)
Collecting sentence-transformers>=0.3.3
Using cached sentence-transformers-0.3.6.tar.gz (62 kB)
Collecting fasttext>=0.9.2
Downloading fasttext-0.9.2.tar.gz (68 kB)
|████████████████████████████████| 68 kB 4.8 MB/s
Collecting hnswlib>=0.4.0
Downloading hnswlib-0.4.0.tar.gz (17 kB)
Collecting scikit-learn>=0.23.1
Downloading scikit_learn-0.23.2-cp38-cp38-win_amd64.whl (6.8 MB)
|████████████████████████████████| 6.8 MB 3.3 MB/s
Collecting regex>=2020.5.14
Using cached regex-2020.7.14-cp38-cp38-win_amd64.whl (264 kB)
Collecting transformers==3.0.2
Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
|████████████████████████████████| 769 kB 6.4 MB/s
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)
Hi,
I get the following error when running python -m paperai.index
raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\\Users\\x\\.cord19\\vectors\\cord19-300d.magnitude'
PS. I am quite new to all this; so, apologies if the mistake is on my end.
Update to txtai 2.0
I'm able to get codequestion running but as soon as I begin to query it crashes with the following error:
OMP: Error #15: Initializing libomp140.x86_64.dll, but found libiomp5md.dll already initialized.
I'm running in Windows 10, venv, Python 3.11.3, Windows Powershell.
I already tried uninstalling numpy and torch and reinstalling them, but got the same results.
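A commonly cited workaround for this OpenMP runtime clash is to allow the duplicate library before the conflicting packages load; use it with care, since Intel's own warning notes it can degrade performance or cause incorrect results.

```python
import os

# Must be set before the first import of torch/faiss so the second
# OpenMP runtime does not abort the process with OMP Error #15.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
```

A cleaner long-term fix is ensuring only one OpenMP runtime is installed in the environment, e.g. by reinstalling the conflicting packages from a single source.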
It would be nice to be able to use it more like a CLI tool, where you could just type: codequestion "how do I iterate a list in python".
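A minimal sketch of such a one-shot mode; the function name is illustrative, and a real implementation would pass the joined query to codequestion's search path instead of returning it.

```python
def run(argv):
    """Join everything after the program name into a single query string;
    with no arguments, signal a fallback to the interactive shell."""
    if len(argv) > 1:
        return " ".join(argv[1:])
    return None  # no arguments: fall back to the interactive shell

query = run(["codequestion", "how", "do", "I", "iterate", "a", "list", "in", "python"])
```

This keeps the existing interactive shell as the default while allowing scripted, single-query invocations.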
This change will update the minimum dependency for codequestion to txtai 6.0.
The main code change needed here is in the scoring package. With the addition of term indexing, checks need to be added to determine whether a scoring index is used for term indexing or for word vector weighting.
Trying to configure on Windows 10, I seem to have gotten everything installed but get this traceback when I run it:
(keras-gpu-2) C:\Users\bbate>codequestion
The system cannot find the path specified.
2020-09-15 13:57:44.515137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Loading model from C:\Users\bbate\.codequestion\models\stackexchange
Traceback (most recent call last):
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\bbate\Miniconda3\envs\keras-gpu-2\Scripts\codequestion.exe\__main__.py", line 7, in <module>
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 48, in main
Shell().cmdloop()
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\cmd.py", line 105, in cmdloop
self.preloop()
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 22, in preloop
self.embeddings, self.db = Query.load()
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\query.py", line 127, in load
embeddings.load(path)
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\embeddings.py", line 258, in load
self.embeddings = ANN.create(self.config)
File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\ann.py", line 51, in create
raise ImportError("Faiss library is not installed")
ImportError: Faiss library is not installed