
ElasticTransformers

Semantic Elasticsearch with Sentence Transformers. We will use the power of Elastic and the magic of BERT to index a million articles and perform lexical and semantic search on them.

The purpose is to provide an easy way to set up your own Elasticsearch instance with near state-of-the-art contextual-embedding / semantic-search capabilities using NLP transformers.

Overview

The setup works as follows:

  • Set up an Elasticsearch server with Docker
  • Collect the dataset
  • Use sentence-transformers to index the articles into Elasticsearch (takes about 3 hrs on 4 CPU cores)
  • Look at some comparison examples between lexical and semantic search

Setup

Set up your environment

My environment is called et and I use conda for this. Navigate into the project directory and run:

conda create --name et python=3.7  
conda install -n et nb_conda_kernels
conda activate et
pip install -r requirements.txt

Get the data

For this tutorial I use the A Million News Headlines dataset by Rohk; place it in the data folder inside the project directory:

    elastic_transformers/
    ├── data/

You will find that the steps are otherwise fairly abstracted, so you can also do this with a dataset of your choice. If you want to try things on a small slice first, see the sketch below.
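
The examples below index data/tiny_sample.csv. To try the pipeline on a small slice before committing to the full multi-hour run, here is a minimal sketch for producing one (assuming the Kaggle download is named abcnews-date-text.csv; adjust to your actual filename):

import pandas as pd

# Take the first 1,000 headlines as a small sample for experimentation.
# 'abcnews-date-text.csv' is the assumed filename of the Kaggle download.
df = pd.read_csv('data/abcnews-date-text.csv')
df.head(1000).to_csv('data/tiny_sample.csv', index=False)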

Elasticsearch with Docker

Follow the instructions on setting up Elasticsearch with Docker from Elastic's page. For this tutorial, you only need to run the two steps: pulling the image and starting a single-node container.
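
Following Elastic's single-node Docker documentation, those two steps look like this (the version tag is illustrative; use the current one from Elastic's page):

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2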

Features

The repo introduces the ElasticTransformers class: utilities that help create, index, and query Elasticsearch indices that include embeddings.

Initialize the connection with the server URL and, optionally, the name of the index to work with:

et = ElasticTransformers(url='http://localhost:9300', index_name='et-tiny')

create_index_spec - defines the mapping for the index. Lists of relevant fields can be provided for keyword search or semantic (dense-vector) search. It also takes the size of the dense vectors, since that varies by embedding model.

create_index - uses the spec created earlier to create an index ready for search.

et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)
et.create_index()
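
For reference, the spec above corresponds roughly to an Elasticsearch mapping like the following (a sketch; the exact structure create_index_spec generates may differ):

mapping = {
    "mappings": {
        "properties": {
            "publish_date": {"type": "text"},
            "headline_text": {"type": "text"},
            "headline_text_embedding": {
                "type": "dense_vector",  # Elasticsearch's vector field type
                "dims": 768              # must match the embedding model's output size
            }
        }
    }
}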

write_large_csv - breaks a large csv file into chunks, iteratively uses a predefined embedding utility to create the embeddings for each chunk, and feeds the results to the index (a sketch of such an embedding function follows the example below):

et.write_large_csv('data/tiny_sample.csv',
                   chunksize=1000,
                   embedder=embed_wrapper,
                   field_to_embed='headline_text')
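
Both write_large_csv and search expect an embedding function. Here is a minimal sketch of embed_wrapper, reconstructed from its usage in the issue report further down; the model name is an assumption, and any sentence-transformers model whose output dimension matches dense_fields_dim (768 here) will do:

from sentence_transformers import SentenceTransformer

# Assumed model: any sentence-transformers model with 768-dim output
# matches the dense_fields_dim used in the index spec above.
bert_embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

def embed_wrapper(ls):
    """Embed a list of texts and return plain Python lists so the
    vectors serialize cleanly to JSON for indexing."""
    # list(...) makes sure encode() receives a plain list of strings
    # rather than a numpy array.
    results = bert_embedder.encode(list(ls), convert_to_tensor=True)
    return [r.tolist() for r in results]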

search - lets you select either keyword search ('match' in Elastic) or semantic search ('dense' in Elastic); a dense example follows the keyword one below. Notably, dense search requires the same embedding function used in write_large_csv.

et.search(query='search these terms',
          field='headline_text',
          type='match',
          embedder=embed_wrapper,
          size=1000)
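
For semantic search, switch type to 'dense' and query in natural language; the query is embedded with the same embed_wrapper and compared against the stored vectors. A sketch, mirroring the field convention of the keyword example (the class's exact handling of dense field names may differ):

et.search(query='economic fallout from natural disasters',
          field='headline_text',
          type='dense',
          embedder=embed_wrapper,
          size=1000)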

Usage

After successful setup, use the following notebooks to make this all work: Setting_up_ElasticTransformers.ipynb and Searching_with_ElasticTransformers.ipynb.

References

This repo combines the following amazing works by brilliant people. Please check out their work if you haven't done so yet...

The ML part: sentence-transformers and BERT contextual embeddings

The engineering part: Elasticsearch dense-vector search, run with Docker

Contributors

abinayam02, md-experiments

Issues

Error running sample notebook Setting_up_ElasticTransformers.ipynb

Error when running

et.write_large_csv('data/tiny_sample.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')

Gives:

0it [00:00, ?it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-0855efb94892> in <module>
      2                   chunksize=1000,
      3                   embedder=embed_wrapper,
----> 4                   field_to_embed='headline_text')

~/notebooks/elastic_transformers/src/database.py in write_large_csv(self, file_path, index_name, chunksize, embedder, field_to_embed, index_field)
    176         for chunk in tqdm.tqdm(df_chunk):
    177             if embedder:
--> 178                 chunk[f'{field_to_embed}_embedding']=embedder(chunk[field_to_embed].values)
    179             chunk_ls=json.loads(chunk.to_json(orient='records'))
    180             self.write(chunk_ls,index_name,index_field=index_field)

<ipython-input-6-08e4eb605545> in embed_wrapper(ls)
      3     Helper function which simplifies the embedding call and helps lading data into elastic easier
      4     """
----> 5     results=bert_embedder.encode(ls, convert_to_tensor=True)
      6     results = [r.tolist() for r in results]
      7     return results

~/anaconda3/envs/et/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, is_pretokenized, device, num_workers)
    150 
    151             with torch.no_grad():
--> 152                 out_features = self.forward(features)
    153                 embeddings = out_features[output_value]
    154 

~/anaconda3/envs/et/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
    115     def forward(self, input):
    116         for module in self:
--> 117             input = module(input)
    118         return input
    119 

~/anaconda3/envs/et/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/anaconda3/envs/et/lib/python3.7/site-packages/sentence_transformers/models/BERT.py in forward(self, features)
     31     def forward(self, features):
     32         """Returns token_embeddings, cls_token"""
---> 33         output_states = self.bert(**features)
     34         output_tokens = output_states[0]
     35         cls_tokens = output_tokens[:, 0, :]  # CLS token is first token

~/anaconda3/envs/et/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/anaconda3/envs/et/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    802         # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
    803         # ourselves in which case we just need to make it broadcastable to all heads.
--> 804         extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
    805 
    806         # If a 2D ou 3D attention mask is provided for the cross-attention

~/anaconda3/envs/et/lib/python3.7/site-packages/transformers/modeling_utils.py in get_extended_attention_mask(self, attention_mask, input_shape, device)
    260             raise ValueError(
    261                 "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
--> 262                     input_shape, attention_mask.shape
    263                 )
    264             )

ValueError: Wrong shape for input_ids (shape torch.Size([288])) or attention_mask (shape torch.Size([288]))

RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')

When I run Searching_with_ElasticTransformers.ipynb, this bug shows up:
import logging
logging.getLogger().setLevel(logging.DEBUG)

query='4G智能移动单兵'

print('CONTEXTUAL SEARCH RESULTS...')
df1=et.search(query,'Title',type='dense',embedder=embed_wrapper, size = 1000)
display(select_search_results(df1))

[screenshot of the resulting RequestError]
What is the cause of this bug?
