
ElasticTransformers

Semantic Elasticsearch with Sentence Transformers. We will use the power of Elastic and the magic of BERT to index a million articles and perform lexical and semantic search on them.

The purpose is to provide an easy way to set up your own Elasticsearch instance with near state-of-the-art contextual-embedding / semantic-search capabilities using NLP transformers.

Overview

The setup works as follows:

  • Set up an Elasticsearch server with Docker
  • Collect the dataset
  • Use sentence-transformers to index the articles into Elasticsearch (takes about 3 hrs on 4 CPU cores)
  • Look at some comparison examples between lexical and semantic search

Setup

Set up your environment

My environment is called et and I use conda for this. Navigate into the project directory and run:

conda create --name et python=3.7  
conda install -n et nb_conda_kernels
conda activate et
pip install -r requirements.txt

Get the data

For this tutorial I use the A Million News Headlines dataset by Rohk; place it in the data folder inside the project directory:

    elastic_transformers/
    ├── data/

You will find that the steps are otherwise fairly abstracted, so you can also do this with a dataset of your choice. If you want to try things on a small slice first, see the sketch below.
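
The examples below index data/tiny_sample.csv. To try the pipeline on a small slice before committing to the full multi-hour run, here is a minimal sketch for producing one (assuming the Kaggle download is named abcnews-date-text.csv; adjust to your actual filename):

import pandas as pd

# Take the first 1,000 headlines as a small sample for experimentation.
# 'abcnews-date-text.csv' is the assumed filename of the Kaggle download.
df = pd.read_csv('data/abcnews-date-text.csv')
df.head(1000).to_csv('data/tiny_sample.csv', index=False)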

Elasticsearch with Docker

Follow the instructions on setting up Elasticsearch with Docker from Elastic's page. For this tutorial, you only need to run the two steps: pulling the image and starting a single-node container.
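
Following Elastic's single-node Docker documentation, those two steps look like this (the version tag is illustrative; use the current one from Elastic's page):

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2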

Features

The repo introduces the ElasticTransformers class: utilities that help create, index, and query Elasticsearch indices that include embeddings.

Initialize the connection with the server URL and, optionally, the name of the index to work with:

et = ElasticTransformers(url='http://localhost:9300', index_name='et-tiny')

create_index_spec - defines the mapping for the index. Lists of relevant fields can be provided for keyword search or semantic (dense-vector) search. It also takes the size of the dense vectors, since that varies by embedding model.

create_index - uses the spec created earlier to create an index ready for search.

et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)
et.create_index()
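
For reference, the spec above corresponds roughly to an Elasticsearch mapping like the following (a sketch; the exact structure create_index_spec generates may differ):

mapping = {
    "mappings": {
        "properties": {
            "publish_date": {"type": "text"},
            "headline_text": {"type": "text"},
            "headline_text_embedding": {
                "type": "dense_vector",  # Elasticsearch's vector field type
                "dims": 768              # must match the embedding model's output size
            }
        }
    }
}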

write_large_csv - breaks a large csv file into chunks, iteratively uses a predefined embedding utility to create the embeddings for each chunk, and feeds the results to the index (a sketch of such an embedding function follows the example below):

et.write_large_csv('data/tiny_sample.csv',
                   chunksize=1000,
                   embedder=embed_wrapper,
                   field_to_embed='headline_text')
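
Both write_large_csv and search expect an embedding function. Here is a minimal sketch of embed_wrapper, reconstructed from its usage in the issue report further down; the model name is an assumption, and any sentence-transformers model whose output dimension matches dense_fields_dim (768 here) will do:

from sentence_transformers import SentenceTransformer

# Assumed model: any sentence-transformers model with 768-dim output
# matches the dense_fields_dim used in the index spec above.
bert_embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

def embed_wrapper(ls):
    """Embed a list of texts and return plain Python lists so the
    vectors serialize cleanly to JSON for indexing."""
    # list(...) makes sure encode() receives a plain list of strings
    # rather than a numpy array.
    results = bert_embedder.encode(list(ls), convert_to_tensor=True)
    return [r.tolist() for r in results]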

search - lets you select either keyword search ('match' in Elastic) or semantic search ('dense' in Elastic); a dense example follows the keyword one below. Notably, dense search requires the same embedding function used in write_large_csv.

et.search(query='search these terms',
          field='headline_text',
          type='match',
          embedder=embed_wrapper,
          size=1000)
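
For semantic search, switch type to 'dense' and query in natural language; the query is embedded with the same embed_wrapper and compared against the stored vectors. A sketch, mirroring the field convention of the keyword example (the class's exact handling of dense field names may differ):

et.search(query='economic fallout from natural disasters',
          field='headline_text',
          type='dense',
          embedder=embed_wrapper,
          size=1000)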

Usage

After successful setup, use the following notebooks to make this all work: Setting_up_ElasticTransformers.ipynb and Searching_with_ElasticTransformers.ipynb.

References

This repo combines the following amazing works by brilliant people. Please check out their work if you haven't done so yet...

The ML part: sentence-transformers and BERT contextual embeddings

The engineering part: Elasticsearch dense-vector search, run with Docker

Contributors

abinayam02, md-experiments

Issues

Error running sample notebook Setting_up_ElasticTransformers.ipynb

Error when running

et.write_large_csv('data/tiny_sample.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')

Gives:

0it [00:00, ?it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-0855efb94892> in <module>
      2                   chunksize=1000,
      3                   embedder=embed_wrapper,
----> 4                   field_to_embed='headline_text')

~/notebooks/elastic_transformers/src/database.py in write_large_csv(self, file_path, index_name, chunksize, embedder, field_to_embed, index_field)
    176         for chunk in tqdm.tqdm(df_chunk):
    177             if embedder:
--> 178                 chunk[f'{field_to_embed}_embedding']=embedder(chunk[field_to_embed].values)
    179             chunk_ls=json.loads(chunk.to_json(orient='records'))
    180             self.write(chunk_ls,index_name,index_field=index_field)

<ipython-input-6-08e4eb605545> in embed_wrapper(ls)
      3     Helper function which simplifies the embedding call and helps lading data into elastic easier
      4     """
----> 5     results=bert_embedder.encode(ls, convert_to_tensor=True)
      6     results = [r.tolist() for r in results]
      7     return results

~/anaconda3/envs/et/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, is_pretokenized, device, num_workers)
    150 
    151             with torch.no_grad():
--> 152                 out_features = self.forward(features)
    153                 embeddings = out_features[output_value]
    154 

~/anaconda3/envs/et/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
    115     def forward(self, input):
    116         for module in self:
--> 117             input = module(input)
    118         return input
    119 

~/anaconda3/envs/et/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/anaconda3/envs/et/lib/python3.7/site-packages/sentence_transformers/models/BERT.py in forward(self, features)
     31     def forward(self, features):
     32         """Returns token_embeddings, cls_token"""
---> 33         output_states = self.bert(**features)
     34         output_tokens = output_states[0]
     35         cls_tokens = output_tokens[:, 0, :]  # CLS token is first token

~/anaconda3/envs/et/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~/anaconda3/envs/et/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    802         # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
    803         # ourselves in which case we just need to make it broadcastable to all heads.
--> 804         extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
    805 
    806         # If a 2D ou 3D attention mask is provided for the cross-attention

~/anaconda3/envs/et/lib/python3.7/site-packages/transformers/modeling_utils.py in get_extended_attention_mask(self, attention_mask, input_shape, device)
    260             raise ValueError(
    261                 "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
--> 262                     input_shape, attention_mask.shape
    263                 )
    264             )

ValueError: Wrong shape for input_ids (shape torch.Size([288])) or attention_mask (shape torch.Size([288]))

RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error')

When I run Searching_with_ElasticTransformers.ipynb, this bug shows up:
import logging
logging.getLogger().setLevel(logging.DEBUG)

query='4G智能移动单兵'

print('CONTEXTUAL SEARCH RESULTS...')
df1=et.search(query,'Title',type='dense',embedder=embed_wrapper, size = 1000)
display(select_search_results(df1))

[screenshot of the resulting RequestError]
What is the cause of this bug?
