
denspi's Issues

How to handle short sentences/contexts

Hi @seominjoon @jhyuklee

  • The default model performs well on the SQuAD v1.1 dataset (where contexts are ~700 characters long),
    but it fails when I index my custom data, whose paragraphs/contexts are short (~100-150 characters).
    • The problem is that, irrespective of the query, the same (wrong) result is returned.
    • Most of the time, the result is a single random character such as ? or . (the end of the context).
    • I have debugged this and found that the problem lies in the start vectors we generate from the model output.

Questions:

  1. May I know why this scenario occurs?
  2. What is the solution?

Setting:
All the results are obtained using the commands mentioned in the README.
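Until the root cause in the start vectors is fixed, one possible stopgap is to post-filter degenerate outputs before returning them. A minimal stdlib sketch (the function name and filtering rule are hypothetical, not part of denspi):

```python
def filter_degenerate(answers):
    """Drop answers that are empty, a single character, or contain no
    alphanumeric characters (e.g. the lone '?' or '.' described above)."""
    return [a for a in answers
            if len(a.strip()) > 1 and any(c.isalnum() for c in a)]

print(filter_degenerate(["?", ".", "Barack Obama"]))  # only the real span survives
```

This only hides the symptom; the underlying start-vector behaviour on short contexts would still need investigation.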

The choice of faiss index

Hi, thanks for open-sourcing the project. Great work!

I have a few questions about the choice of faiss index; I'd really appreciate it if you could find time to clarify:

  1. Could you please share the detailed procedure for how you index Wikipedia?

  2. Is IVF1048576_HNSW32_SQ8 with search at nprobe=64 an accurate summary of your choice?

  3. I see that open/build_index.py has a function named merge_indexes. Did you build multiple sub-indexes and then merge them, or not? I suspect this choice may affect performance.

  4. To be more specific about Q1: the index-building process in your code seems quite complicated. By default, it goes through
    https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L121
    https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L126-L131
    https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L134-L137
    then
    https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L148
    https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L164

Does the following encode the same idea?

import faiss  # assuming faiss is installed

index = faiss.index_factory(d, "IVF1048576_HNSW32,SQ8")  # note the comma in the factory string
index.train(data)  # data: float32 array of shape (n, d)
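For background, the SQ8 suffix means each vector dimension is stored as a single byte. A toy numpy sketch of 8-bit scalar quantization (this illustrates the idea only; it is not faiss's exact codec):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 16)).astype(np.float32)

# Per-dimension min/max learned at "train" time, as in a scalar-quantizer codec.
lo, hi = data.min(axis=0), data.max(axis=0)
codes = np.round((data - lo) / (hi - lo) * 255).astype(np.uint8)  # 1 byte per dim

# Decoding reconstructs an approximation of the original vectors.
recon = codes.astype(np.float32) / 255 * (hi - lo) + lo
print(float(np.abs(recon - data).max()))  # bounded by half a quantization step
```

The IVF1048576_HNSW32 part then partitions the quantized vectors into ~1M inverted lists whose centroids are searched with an HNSW graph.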

thanks!

Sparse-first search and hybrid search not working

Hello,
I am facing an issue with sparse-first search and hybrid search:

Dense-first search works fine, but when I select the other options it gives the following error:
KeyError: "Unable to open object (object '3580546' doesn't exist)"

I used the pretrained model and then created a custom phrase index for "dev-v1.1".

ERROR:flask.app:Exception on /api [GET]
Traceback (most recent call last):
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask_cors/extension.py", line 161, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
raise value
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "open/run_demo.py", line 128, in api
doc_top_k=5)
File "open/run_demo.py", line 94, in search
search_strategy=search_strategy, doc_top_k=5)
File "/root/denspi/open/mips_sparse.py", line 291, in search
doc_top_k=5)
File "/root/denspi/open/mips_sparse.py", line 218, in search_start
(doc_idxs, start_idxs), start_scores = self.search_sparse(query_start, doc_scores, doc_top_k)
File "/root/denspi/open/mips_sparse.py", line 168, in search_sparse
doc_group = self.get_doc_group(doc_idx)
File "/root/denspi/open/mips.py", line 121, in get_doc_group
if len(self.phrase_dumps) == 1:
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/h5py/_hl/group.py", line 262, in getitem
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '3580546' doesn't exist)"
ERROR:tornado.access:500 GET /api?strat=sparse_first&query=pharmacy%20department%20and%20specialised%20areas%20 (127.0.0.1) 419.57ms
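In case it helps triage: the doc id in the error (3580546) looks like a full-Wikipedia document id, while a dump built only from dev-v1.1 covers far fewer documents, so the sparse/hybrid paths may be producing ids that no loaded dump contains. A stdlib sketch of that failure mode (the file name and id range here are hypothetical):

```python
# Hypothetical: each phrase dump covers a contiguous range of document ids.
dumps = {"0-1.hdf5": range(0, 2067)}  # a small custom dump (made-up range)

def get_dump_for(doc_idx, dumps):
    """Return the dump file whose id range contains doc_idx."""
    for name, ids in dumps.items():
        if doc_idx in ids:
            return name
    raise KeyError(f"doc id {doc_idx} is in no loaded dump; was the sparse "
                   f"index built against a different corpus than the dumps?")

print(get_dump_for(100, dumps))
# get_dump_for(3580546, dumps) raises KeyError, mirroring the traceback above
```

If this is the cause, the sparse document index and the phrase dumps would need to be built from the same corpus.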

Issues in setting up demo for SQuAD 1.1 data

Hello there,
I am facing an issue setting up this code. Here is what I did:

I downloaded the pretrained model by running "gsutil cp -r gs://denspi/v1-0/model ." and then created the custom phrase index for "dev-v1.1" with the command below:
python run_piqa.py --do_dump --filter_threshold -2 --save_dir SAVE3_DIR/ --load_dir ROOT_DIR/model --metadata_dir ROOT_DIR/bert --data_dir ROOT_DIR/data/dev-v1.1 --predict_file 0:2 --output_dir ROOT_DIR/your_dump/phrase --dump_file 0-1.hdf5

After that, I served the API and ran the demo with the following commands:
python run_piqa.py --do_serve --load_dir ROOT_DIR/model --metadata_dir ROOT_DIR/bert --do_load --parallel --port 8000
python open/run_demo.py ROOT_DIR/dump ROOT_DIR/wikipedia --api_port 8000 --port 3000 --index_name 64_flat_SQ8 --sparse_type p

But the demo is not working properly. I tested it with questions from the SQuAD 1.1 dataset, but instead of the expected answers it appears to return random answers.

I cannot understand why it is not producing accurate answers. Is there something I have missed or am doing wrong?

Is it compulsory to train the model ourselves, or will the pretrained model provided at gs://denspi/v1-0/model work instead?

Create one-command index->pred->eval routine

Enable a one-command routine for indexing, prediction, and evaluation.
This will go into `open/run_index_pred_eval.py`.

The entire evaluation process will then consist of three stages:

  1. train model
  2. dump vectors
  3. index-pred-eval

Neg training code is different from the paper

Currently, the negative-training routine (--train_neg in run_piqa.py) differs from what is described in the paper.

In the paper, we use the 'no answer' logit to train on negative examples, so there is no separate neg-training routine. In the code, the neg-training routine instead attaches to each positive example a negative example (one whose question embedding is similar) after normal training.

The code also uses several kinds of noise injection.

In practice, the strategy in the current code works better than the one in the paper (the no-answer logit). The paper will be updated soon, and this issue will be resolved then.
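The attachment step described above can be sketched in a few lines of numpy (toy embeddings, not denspi's actual training code):

```python
import numpy as np

# For each positive example, find another example whose question embedding is
# most similar; that example's context supplies the attached negative.
rng = np.random.default_rng(0)
q_emb = rng.standard_normal((5, 8)).astype(np.float32)  # 5 questions, dim 8

# Cosine similarity between all question pairs.
normed = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
sim = normed @ normed.T
np.fill_diagonal(sim, -np.inf)  # an example is not its own negative

neg_idx = sim.argmax(axis=1)  # index of the attached negative per example
print(neg_idx)
```

The idea is that near-duplicate questions make the hardest negatives, which is presumably why this outperforms the plain no-answer logit.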

Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

After all the installations (faiss, DrQA, and the two requirements.txt files from this repo), run_index_pred_eval.py fails with the error below:

$ python open/run_index_pred_eval.py
/home/jinhyuk/github/kernel-sparse/dense
/data_nfs/camist002/data/dev-3.json
--para
--no_od
sampling from:
/home/jinhyuk/github/kernel-sparse/dense/phrase.hdf5
100%|██████████| 1/1 [00:00<00:00, 36.63it/s]
100%|██████████| 9/9 [00:00<00:00, 341.87it/s]
WARNING clustering 788 points to 256 centroids: please provide at least 9984 training points
Clustering 788 points in 481D to 256 clusters, redo 1 times, 10 iterations
Preprocessing in 0.00 s
INTEL MKL ERROR: /home/jinhyuk/miniconda3/envs/kesper/lib/python3.6/site-packages/faiss/../../../libmkl_avx2.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

Following the recommendation from here, running conda install nomkl numpy scipy scikit-learn numexpr shows conflicts between package versions:

$ conda install nomkl numpy scipy scikit-learn numexpr
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package libopenblas conflicts for:
scikit-learn -> numpy[version='>=1.11.3,<2.0a0'] -> libopenblas[version='>=0.3.2,<0.3.3.0a0']
Package blas conflicts for:
mkl_fft -> numpy-base[version='>=1.0.6,<2.0a0'] -> blas[version='*|1.0',build=openblas]
blas
scikit-learn -> blas[version='*|*|1.0',build='mkl|openblas|mkl|openblas']
nomkl -> blas=*[build=openblas]
mkl_fft -> blas[version='*|1.0',build=mkl]
numexpr -> blas[version='*|*|1.0',build='mkl|openblas|mkl|openblas']
numpy -> blas[version='*|*|1.0',build='mkl|openblas|mkl|openblas']
mkl_random -> blas[version='*|1.0',build=mkl]
faiss-cpu=1.5.2 -> numpy[version='>=1.11'] -> numpy-base==1.16.0=py36hde5b4d6_1 -> blas[version='*|1.0',build=openblas]
scipy -> blas[version='*|*|1.0',build='mkl|openblas|mkl|openblas']
mkl_random -> numpy-base[version='>=1.0.2,<2.0a0'] -> blas[version='*|1.0',build=openblas]
numpy-base -> blas[version='*|*|1.0',build='mkl|openblas|mkl|openblas']
faiss-cpu=1.5.2 -> blas=*[build=mkl]
faiss-cpu=1.5.2 -> numpy[version='>=1.11'] -> blas==1.0=mkl

Any idea how to resolve this?

How could I reproduce the result for SQuAD 1.1?

Hi,

Thanks for your great work. I would like to reproduce the SQuAD 1.1 result (as shown in Table 1 of the paper), but I am having some trouble.

First, I downloaded the pretrained model from gs://denspi/v1-0/model and then tried to evaluate on dev-v1.1 using: python run_piqa.py --do_predict --output_dir tmp --do_load --load_dir model --predict_file dev-v1.1.json --do_eval --gt_file dev-v1.1.json --metadata_dir bert

The predicted answers appear to be random spans, resulting in metrics like {"exact_match": 0.47303689687795647, "f1": 4.43806570152543}. An EM of 0.47% means something is completely wrong.

I wonder whether I did it correctly.
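For reference, SQuAD-style exact match normalizes both strings before comparing, so a near-zero EM really does indicate wrong spans rather than a formatting mismatch. A stdlib sketch that mirrors the official evaluation script's normalization:

```python
import re
import string

def normalize_answer(s):
    """Lowercase, remove punctuation and articles, collapse whitespace
    (mirrors the official SQuAD v1.1 evaluation script)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # casing/punctuation ignored
```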

And if I want to train a model myself to reproduce the result, since I cannot get the pretrained model to work, is it enough to run just the first step of the training section (i.e. python run_piqa.py --train_batch_size 12 --do_train --freeze_word_emb --save_dir $SAVE1_DIR)?

Thanks, and I hope to get your advice.
