haystack-tutorials's Introduction

Haystack Tutorials

Green logo of a stylized white 'H' with the text 'Haystack, by deepset. Haystack 2.0 is live 🎉' Abstract green and yellow diagrams in the background.

Haystack is an open source framework by deepset for building production-ready LLM applications, retrieval-augmented generative pipelines and state-of-the-art search systems that work intelligently over large document collections. It lets you quickly try out the latest models in natural language processing (NLP) while being flexible and easy to use.

This is the repository where we keep all the Haystack tutorials 📓 👇 These tutorials are also published to the Haystack Website.

To contribute to the tutorials, please check out our Contributing Guidelines.

Tutorials

Haystack 1.x

  • Build Your First Question Answering System (Colab)
  • Fine Tune a Model on Your Data (Colab)
  • Build a Scalable Question Answering System (Colab)
  • FAQ Style QA (Colab)
  • Evaluation (Colab)
  • Better Retrieval via Embedding Retrieval (Colab)
  • [OUTDATED] RAG Generator (Colab)
  • Preprocessing (Colab)
  • DPR Training (Colab)
  • [OUTDATED] Knowledge Graph (Colab)
  • Pipelines (Colab)
  • [OUTDATED] Seq2SeqGenerator (Colab)
  • Question Generation (Colab)
  • Query Classifier (Colab)
  • Table QA (Colab)
  • Document Classifier at Index Time (Colab)
  • Make Your QA Pipelines Talk! (Colab)
  • Generative Pseudo Labeling (Colab)
  • Text-to-Image search (Colab)
  • Using Haystack with REST API (Download)
  • Customizing PromptNode (Colab)
  • Generative QA Pipeline with Retrieval-Augmentation (Colab)
  • Answering Complex Questions with Agents (Colab)
  • Building a Conversational Chat App (Colab)
  • Customizing Agent to Chat with Your Documents (Colab)
  • Creating a Hybrid Retrieval Pipeline (Colab)

Haystack 2.0

  • Your First QA Pipeline with Retrieval-Augmentation (Colab)
  • Generating Structured Output with Loop-Based Auto-Correction (Colab)
  • Serializing Pipelines (Colab)
  • Preprocessing Different File Types (Colab)
  • Metadata Filtering (Colab)
  • Classifying Documents & Queries by Language (Colab)
  • Creating a Hybrid Retrieval Pipeline (Colab)
  • Build an Extractive QA Pipeline (Colab)
  • Evaluating RAG Pipelines (Colab)
  • Building Pipelines with Conditional Routing (Colab)
  • Simplifying Pipeline Inputs with Multiplexer (Colab)
  • Embedding Metadata for Improved Retrieval (Colab)
  • Building a Chat Application with Function Calling (Colab)

haystack-tutorials's People

Contributors

agnieszka-m, ai-ahmed, anakin87, annthurium, bilgeyucel, bogdankostic, brandenchan, davidgerva, ju-gu, julian-risch, kolk, lalitpagaria, masci, mayankjobanputra, michelbartels, mkkuemmel, raphaelmerx, robpasternak, ryanrussell, seanryankeegan, seduerr91, silvanocerza, sjrl, tanaysoni, tholor, timoeller, tstadel, tuanacelik, vblagoje, zansara

haystack-tutorials's Issues

Fix formatting on all tutorials

Formatting of some tutorials is off.

Some examples:
image
image

This is a heading issue and will probably be fixed by using the correct heading levels. Only h1 and h2 can appear in the ToC.

It is better to fix this issue after PR #44 is ready and the new format for tutorials is settled.

Audio Tutorial tests failing due to dependencies

The Audio tutorial (id=17) is failing in the nightlies, but for a different reason than the issue we have on Colab.
However, the audio node is moving out to the haystack-extras repo and will be installed via a different package, so let's fix this tutorial or the test in conjunction with that move.
@ZanSara I'll let you make the call on whether we should skip this test for now.
We will need help from you and possibly @silvanocerza to fix the tutorial once the package for haystack-extras is ready.

Update test on tutorials

This should be done once this PR deepset-ai/haystack#5028 is merged and we have base images with every release.

  • Add metadata to each tutorial to set on which image it should be tested
  • Run nightly test on these versions
  • Run nightly tests on main as well to be cautious about upcoming changes
  • Release images should be used when a tutorial is created or updated (on PRs)

Any other ideas @silvanocerza?

Tutorials 4, 11, 15, 16, 17 stuck at setting up Elasticsearch on Colab

Describe the issue
When running tutorial 11, 4, or 15 in a GPU environment, the notebook does not finish running the cell where Elasticsearch is set up. Tutorials 16 and 17 should have the same problem judging by the code, but I didn't test them.

%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch -d

Tutorials 1, 2, 3, 6, 7, 8 worked for me though.

To Reproduce
Run tutorial 11 or 4 on colab with GPU environment (environment probably doesn't make a difference).

Expected behavior
Notebook should successfully set up elasticsearch service on colab and continue execution with the next cell.
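
One way to keep a notebook from hanging or racing ahead is to poll the service with a bounded timeout instead of waiting indefinitely. The following is a minimal sketch, not the tutorial's actual code: the helper names `http_ready` and `wait_until_ready` are hypothetical, and it assumes Elasticsearch listens on its default address, http://localhost:9200.

```python
import time
import urllib.request
from typing import Callable


def http_ready(url: str) -> bool:
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections and socket timeouts are OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float = 120.0, interval_s: float = 2.0
) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A notebook cell could then call `wait_until_ready(lambda: http_ready("http://localhost:9200"))` and raise a readable error when it returns False, so a failed startup is reported instead of blocking the run.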

What environment did you try to run the tutorial on?:

  • Colab

Additional context
No changes made to the tutorial code.

EvaluationResult.load() produces SyntaxError (Tutorial 5)

Describe the issue
When using EvaluationResult.load(), you get a SyntaxError:

Traceback (most recent call last):

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-21-bc256748e1c5>", line 1, in <module>
    saved_eval_result = EvaluationResult.load("../")

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/haystack/schema.py", line 1631, in load
    node_results = {file.stem: pd.read_csv(file, **read_csv_kwargs) for file in csv_files}

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/haystack/schema.py", line 1631, in <dictcomp>
    node_results = {file.stem: pd.read_csv(file, **read_csv_kwargs) for file in csv_files}

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 460, in _read
    data = parser.read(nrows)

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 1198, in read
    ret = self._engine.read(nrows)

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/site-packages/pandas/io/parsers.py", line 2157, in read
    data = self._reader.read(nrows)

  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read

  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory

  File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows

  File "pandas/_libs/parsers.pyx", line 1051, in pandas._libs.parsers.TextReader._convert_column_data

  File "pandas/_libs/parsers.pyx", line 2139, in pandas._libs.parsers._apply_converter

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/ast.py", line 62, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')

  File "/opt/homebrew/Caskroom/miniforge/base/envs/unipy/lib/python3.9/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,

  File "<unknown>", line unknown
    
    ^
SyntaxError: unexpected EOF while parsing

To Reproduce
Steps to reproduce the behavior: Run Tutorial 5

Expected behavior
Loading the previously saved evaluation result.
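
The traceback ends in ast.literal_eval being applied to a CSV cell by a pandas converter. One plausible cause (an assumption, not confirmed in this report) is an empty cell: literal_eval raises SyntaxError on an empty string. The sketch below shows that behavior and a tolerant converter; `safe_literal_eval` is a hypothetical helper, not part of Haystack.

```python
import ast
from typing import Any, Optional


def safe_literal_eval(cell: Optional[str]) -> Any:
    """Parse a CSV cell holding a Python literal; return None for empty cells."""
    if cell is None or not cell.strip():
        return None
    return ast.literal_eval(cell)


# An empty cell makes the plain converter fail, as in the traceback above.
try:
    ast.literal_eval("")
except SyntaxError as exc:
    print("empty cell fails with", type(exc).__name__)

print(safe_literal_eval(""))        # None
print(safe_literal_eval("[1, 2]"))  # [1, 2]
```

Such a function could be passed to pandas.read_csv through its `converters` argument for the affected columns; which columns those are would have to be checked against the saved CSV files.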

What environment did you try to run the tutorial on?:

image

FAISSDocumentStore issue when running Tutorial 6 and Tutorial 7 in the same Colab

Describe the bug
Running Tutorial 6 and Tutorial 7 in sequence on the same Colab runtime results in a FAISSDocumentStore error because the faiss_document_store.db file created by the first tutorial already exists.

Related to this issue deepset-ai/haystack#1903

Error message
FAISSDocumentStore: number of documents present in the SQL database does not match the number of embeddings in FAISS

Expected behavior
Either a note in the notebook mentioning that this might be expected behaviour, and/or additional if/else code that handles this automatically.
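
The automatic handling could be as small as deleting the leftover database file before the store is created. This is a minimal sketch under that assumption; `remove_stale_files` is a hypothetical helper, and deleting the file discards whatever the earlier tutorial indexed.

```python
from pathlib import Path
from typing import Iterable, List


def remove_stale_files(paths: Iterable[str]) -> List[Path]:
    """Delete the given files if they exist and return the ones removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_stale_files(["faiss_document_store.db"])` at the top of the second tutorial would give it a clean SQL database, so the document count and the number of FAISS embeddings start out in sync.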

Additional context
None

To Reproduce
Run Tutorial 6, then copy the relevant bits of code from Tutorial 7 into the same Colab.

FAQ Check

System: Google Colab

  • OS:
  • GPU/CPU: VT-100
  • Haystack version (commit or version number): 1.6.1rc0
  • DocumentStore: FAISSDocumentStore
  • Reader:
  • Retriever:

Run tests on all tutorials

Some tutorials (02_finetune_a_model_on_your_data.ipynb etc) are excluded from the tests. What's the reason for this? Can we add them to nightly tests as well?

About us section is missing in agent tutorial

The Agent tutorial has no "About us" section, but the other tutorials do. We should add such a section to the tutorial. Something like:

About us
This Haystack notebook was made with love by deepset in Berlin, Germany

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.

Some of our other work:

German BERT
GermanQuAD and GermanDPR
Get in touch: Twitter | LinkedIn | Discord | GitHub Discussions | Website

By the way: we're hiring!

New tutorial for: WhisperTranscriber

Describe the tutorial you would like to see here
A tutorial around WhisperTranscriber. One idea could be to transcribe YouTube videos and summarize them.
Haystack has WhisperTranscriber starting from v1.15

[x] I've checked the existing tutorials

Tutorial 6 - Make MilvusDocumentStore warning prominent

Describe the issue
Tutorial "Better Retrieval with Embedding Retrieval" gives an error when run with MilvusDocumentStore

---------------------------------------------------------------------------
ContextualVersionConflict                 Traceback (most recent call last)
<ipython-input-4-b2f9ac6965f8> in <module>
      6 
      7 from haystack.utils import launch_milvus
----> 8 from haystack.document_stores import MilvusDocumentStore
      9 
     10 launch_milvus()

9 frames
/usr/local/lib/python3.9/dist-packages/pkg_resources/__init__.py in resolve(self, requirements, env, installer, replace_conflicting, extras)
    798                 # Oops, the "best" so far conflicts with a dependency
    799                 dependent_req = required_by[req]
--> 800                 raise VersionConflict(dist, req).with_context(dependent_req)
    801 
    802             # push the new requirements onto the stack

ContextualVersionConflict: (grpcio 1.51.3 (/usr/local/lib/python3.9/dist-packages), Requirement.parse('grpcio<=1.48.0,>=1.47.0'), {'pymilvus'})

To Reproduce
Run "Better Retrieval with Embedding Retrieval"

Expected behavior
No error
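
The conflict in the traceback is a plain range check: pymilvus requires grpcio between 1.47.0 and 1.48.0, while 1.51.3 is installed. The sketch below reproduces that check with a deliberately naive parser that only handles purely numeric versions (a pre-release such as 1.15.0rc would need a real version parser); `parse_version` and `in_range` are hypothetical helpers.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.51.3' into (1, 51, 3); numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def in_range(installed: str, minimum: str, maximum: str) -> bool:
    """Return True if minimum <= installed <= maximum."""
    return parse_version(minimum) <= parse_version(installed) <= parse_version(maximum)


# Values taken from the ContextualVersionConflict message above.
print(in_range("1.51.3", "1.47.0", "1.48.0"))  # False
```

In a notebook, the installed version can be read with `importlib.metadata.version("grpcio")` and checked this way before importing MilvusDocumentStore, which would let the tutorial show a prominent warning instead of a stack trace.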

What environment did you try to run the tutorial on?:

  • OS: Colab
  • Browser: Chrome
  • Haystack Version: 1.15.0rc

Additional context
N/A

Tutorial 14 is quite complicated to follow

As discussed with @mkkuemmel - Tutorial 14 is quite complicated to follow and explains a few things in parallel that could possibly be made simpler by splitting the tutorial into a few:

  • one to show the Keyword vs Question/Statement classifier
  • another to show the Question vs Statement classifier

Or another idea would be to extend the markdown explanations in the current tutorial.

Framework

Hi! The docs look great! I was wondering if you'd be open enough to let us know what documentation framework you're using?

New tutorial for: How to use PromptNode in a pipeline

Describe the tutorial you would like to see here
As a follow up to the tutorial mentioned in #112 I suggest we have a second one where we:

  • Use the PromptNode as a node in a full pipeline
  • As an example, build a retrieval-augmented QA pipeline
  • Display how you would use the Shaper

[x] I've checked the existing tutorials

Colab button missing from tutorials on GitHub

Describe the issue
With PR #40 the Colab button is no longer visible when looking at the tutorial files on GitHub. In my opinion, we should add it again.

Here is the old view with the button:
https://github.com/deepset-ai/haystack-tutorials/blob/1b592d5791e711489a6d25a4ff8f0a7160b174d1/tutorials/01_Basic_QA_Pipeline.ipynb

And the new one without the button:
https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb

@TuanaCelik if we don't add the button linking to colab I'd assume that fewer users will try running the code.

As an alternative to adding the button, maybe we could at least add a link?

Add Evaluation tutorials for other Nodes

We have Tutorial 5 on Evaluation, working with QA and passage search eval.

The pipeline.eval is structured to work with other nodes as well, but it would be good to have examples that people can base their work on, e.g. for generative QA or tableQA.

Other methods like FAQ matching, query + doc classifiers, summarization and translation need different labels (I think). We should create dedicated tutorials for those. FAQ matching would be a good next candidate.

Add time to complete and 'created at' date to the index for each tutorial

We will need to include the 'time to complete' and the 'date created' in the frontmatter of each tutorial if we want to display them as tags on the tutorial overview page. I suggest adding these as fields for each tutorial in index.toml and then writing them to the frontmatter in generate_markdowns.py.
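
As a rough illustration of the suggestion, the sketch below renders one index entry as a frontmatter block. The field names `completion_time` and `created_at` and the function `build_frontmatter` are hypothetical; the real generate_markdowns.py may structure this differently, and values containing double quotes would need escaping.

```python
from typing import Dict


def build_frontmatter(entry: Dict[str, object]) -> str:
    """Render a tutorial's index entry as a quoted key/value frontmatter block."""
    lines = ["---"]
    for key, value in entry.items():
        lines.append(f'{key}: "{value}"')
    lines.append("---")
    return "\n".join(lines)


# Hypothetical entry, as it might look after being read from index.toml.
entry = {
    "title": "Build Your First Question Answering System",
    "completion_time": "15 min",
    "created_at": "2021-08-12",
}
print(build_frontmatter(entry))
```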

New tutorial for: Retriever training

Describe the tutorial you would like to see here

While we provide tutorials on GPL and DPR-training (-> unsupervised and supervised Retriever training), there is no tutorial on how to utilize the train method of the EmbeddingRetriever.
I think for many users, training the EmbeddingRetriever with their annotated data could be the most straightforward idea on how to improve performance (instead of using GPL or switching to DPR).

[x] I've checked the existing tutorials

Tutorial 09: Update to EmbeddingRetriever Training

Overview

With deepset-ai/haystack#2887, we replaced DPR with EmbeddingRetriever in Tutorial 06.

Now, we might want to do the same for Tutorial 09 which covers training (or fine-tuning) a DPR Retriever model.

Q1. Should we go ahead with this switch? Any reason keeping DPR might be better?

Alternatively, we could create one for each. I guess depends on which we want to demonstrate plus what we think might be valuable for users.

Training EmbeddingRetriever

Only the sentence-encoder variant of EmbeddingRetriever can be trained.

Its train method does some data setup and then calls the fit method on SentenceTransformer (from the sentence_transformers package).

Input data format is:

[
{"question": …, "pos_doc": …, "neg_doc": …, "score": …},
... 
]

It uses MarginMSELoss (as part of the GPL procedure).
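
For readers unfamiliar with the loss: a margin MSE objective trains the model so that the gap between its query/positive and query/negative similarities matches a teacher-provided margin. The sketch below computes it in plain Python with dot-product similarity. It assumes that the "score" field holds that teacher margin, and the keys `q`, `pos`, and `neg` stand in for already-computed embeddings; neither is confirmed by this issue.

```python
from typing import Any, Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_mse(examples: List[Dict[str, Any]]) -> float:
    """Mean squared error between predicted margins and teacher margins."""
    errors = []
    for ex in examples:
        predicted = dot(ex["q"], ex["pos"]) - dot(ex["q"], ex["neg"])
        errors.append((predicted - ex["score"]) ** 2)
    return sum(errors) / len(errors)


batch = [
    {"q": [1.0, 0.0], "pos": [2.0, 0.0], "neg": [1.0, 0.0], "score": 1.0},
    {"q": [1.0, 0.0], "pos": [2.0, 0.0], "neg": [1.0, 0.0], "score": 3.0},
]
print(margin_mse(batch))  # 2.0
```

The first example already matches its margin (2 - 1 = 1) and contributes nothing; the second is off by 2 and contributes 4, giving a mean of 2.0.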

Q2. If we were to demonstrate its training, which data could be best to use? GPL et al. seem to use MSMARCO but then we need cross-encoder scores for the score above, right? So there doesn't seem to be a download-and-use form of dataset available?

RFC: @brandenchan @vblagoje @agnieszka-m (please loop in anyone else if necessary)
cc: @mkkuemmel

Broken Link in Tutorial 11

There is a broken link in tutorial 11 (pipeline tutorial) in the section about the TranslationWrapperPipeline:

translated search (TranslationWrapperPipeline) To find out more about these pipelines, have a look at our documentation

image

"Preprocessing" tutorial fails with `PIL` error

Describe the issue
Running the tutorial I get the following error:

AttributeError: module 'PIL.Image' has no attribute 'Resampling'
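
This error is consistent with an installed Pillow older than 9.1.0, the release that introduced the Image.Resampling enum. A version-tolerant lookup could look like the sketch below. `get_resampling` is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for the PIL.Image module so that the example runs without Pillow installed.

```python
from types import SimpleNamespace
from typing import Any


def get_resampling(image_module: Any) -> Any:
    """Return the namespace that holds the resampling filters.

    Newer Pillow exposes them on Image.Resampling; older releases keep
    them as constants on the Image module itself.
    """
    return getattr(image_module, "Resampling", image_module)


# Stand-ins for PIL.Image before and after the enum was introduced.
old_image = SimpleNamespace(LANCZOS=1)
new_image = SimpleNamespace(Resampling=SimpleNamespace(LANCZOS=1))

print(get_resampling(old_image).LANCZOS)  # 1
print(get_resampling(new_image).LANCZOS)  # 1
```

With real Pillow the call would be `get_resampling(PIL.Image).LANCZOS`. Upgrading Pillow in the Colab runtime is the other obvious route.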

To Reproduce
Run this tutorial in Colab and run the cells in order until you see the error.

Expected behavior
No error :)

What environment did you try to run the tutorial on?:

  • OSX
  • Safari
  • Haystack main

02_Finetune_a_model_on_your_data.ipynb: A wrong file download location crashes the process

Describe the issue
Fine-tuning a Model on Your Own Data

Part 2:
Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)

s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

A file download error occurs because squad_small.json could not be downloaded.

To Reproduce
Run the notebook in Jupyter; the error occurs.

Expected behavior
Should execute without error.

What environment did you try to run the tutorial on?:

  • OS: Ubuntu 22.04 + miniconda + jupyter-lab
  • Browser: Chrome
  • Haystack Version: main stream

Additional context
The code should be changed to:

s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
fetch_archive_from_http(url=s3_url, output_dir="data/temp")
!cp data/temp/*.json .

Lab 16: Explain use of Elastic in Index section

Describe the tutorial you would like to see here
At this point in the tutorial we state that we're going to start Elastic but not why we're doing this. What's the objective of this Index section of the lab?

[x] I've checked the existing tutorials

Skip colab button

Provide an option to not have a Colab button on top of the tutorial.

image

ERROR - haystack.modeling.model.predictions - Invalid end offset on tutorial notebook

Describe the issue

Hello. I'm testing the first tutorial as it is, with around 5000 text files; some are 1 page long and some are 15 pages long.
When the answer is printed, I get this error:
ERROR - haystack.modeling.model.predictions - Invalid end offset:



prediction = pipe.run(

    query="how many people attended the last concert?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}

)

Inferencing Samples: 100%|██████████| 46/46 [00:38<00:00, 1.20 Batches/s]
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-26524, -26520) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-6105, -6102) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-24692, -24689) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-32411, -32404) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-27332, -27325) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-32379, -32373) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-30646, -30628) with a span answer.
ERROR - haystack.modeling.model.predictions - Invalid end offset:
(-30307, -30297) with a span answer.



To Reproduce
Run Tutorial 1 with around 5000 text files, some 1 page long and some 15 pages long.

Expected behavior
With the default Game of Thrones dataset I didn't see this issue. Can you please help me fix this? Many thanks.

What environment did you try to run the tutorial on?:

  • OS: Ubuntu 20
  • Browser: Firefox

import haystack

haystack.__version__
'1.11.0rc0'

Additional context
I suspect the problem is not specific to the first notebook, and that it is why some unusual content gets printed as the result.

Query Classifier Tutorial (#14) cannot be done

Describe the issue
Tutorial: https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/14_Query_Classifier.ipynb
An exception is thrown at lines with keyword_classifier.run(query=query)

...
File ~/miniconda3/envs/haystack/lib/python3.9/site-packages/haystack/pipelines/base.py:529, in Pipeline.run(self, query, file_paths, labels, documents, meta, params, debug)
    525 except Exception as e:
    526     # The input might be a really large object with thousands of embeddings.
    527     # If you really want to see it, raise the log level.
    528     logger.debug("Exception while running node '%s' with input %s", node_id, node_input)
--> 529     raise Exception(
    530         f"Exception while running node '{node_id}': {e}\nEnable debug logging to see the data that was passed when the pipeline failed."
    531     ) from e
    532 queue.pop(node_id)
    533 #

Exception: Exception while running node 'QueryClassifier': 'GradientBoostingClassifier' object has no attribute '_loss'

What environment did you try to run the tutorial on?:

  • OS: Ubuntu 22.04.1 LTS
  • Browser: chrome, brave
  • Haystack Version: 1.12.0rc0

Run GPU tutorials nightly

Problem

Currently only tutorials that can run on CPU are executed nightly. However, we noted that the excluded tutorials are very important, as they execute code areas that are not covered by any test (#2885, deepset-ai/haystack#2881, deepset-ai/haystack#2886).

We should set up self-hosted runners with GPUs that are capable of running such tutorials, to ensure the same level of confidence as the other tutorials already have.

Related:

Tutorial 2 - Finetune a model : `fetch_archive_from_http` called twice with same output_dir

Describe the issue
In Tutorial 2, Fine-tuning a Model on Your Own Data, the utility fetch_archive_from_http is called twice with the same output_dir. This can't work, because the function only downloads if the folder is empty (see the implementation and warning).

Moreover, the paths are incorrect when calling augment_squad.py.

I will push a PR with a fix.
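
The failure mode can be shown without any network access. The sketch below uses a stand-in, `fetch_if_empty`, that mimics only the behavior described here (skip the download when the target folder already has content); it is not Haystack's fetch_archive_from_http, and the `download` callback is hypothetical.

```python
from pathlib import Path
from typing import Callable


def fetch_if_empty(url: str, output_dir: Path, download: Callable[[str, Path], None]) -> bool:
    """Download into output_dir only if it is empty; return True if a download ran."""
    output_dir.mkdir(parents=True, exist_ok=True)
    if any(output_dir.iterdir()):
        return False  # directory already has content, so the download is skipped
    download(url, output_dir)
    return True
```

With a shared folder the second call returns False and its archive never arrives; giving each archive its own output_dir makes both calls download.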

To Reproduce
Steps to reproduce the behavior:

  • Simply run the code in Distill your model, it won't work without modifying it

Expected behavior

  • it should download glove vectors AND squad_small.json

What environment did you try to run the tutorial on?:

  • OS: All
  • Browser : Chrome
  • Haystack Version : Latest

Upgrade torch to 1.13.0

Describe the issue
Let's upgrade to torch 1.13.0 (and to 1.13.1 as soon as it is released) to prepare for speed improvements coming with version 2.0.

@mayankjobanputra tracked changes between 1.13.1 and 2.0. It seems that the changes from there aren't big. If we upgrade to 1.13 now it will hopefully allow us to make the jump to 2.0 quickly.

MissingIDFieldWarning in Tutorial 18

Describe the issue
When the generate_markdowns.py script is run, Tutorial 18 gives this warning:

/opt/homebrew/Caskroom/miniforge/base/lib/python3.10/site-packages/nbformat/__init__.py:92: MissingIDFieldWarning: Code cell is missing an id field, this will become a hard error in future nbformat versions. You may want to use `normalize()` on your notebooks before validations (available since nbformat 5.1.4). Previous versions of nbformat are fixing this issue transparently, and will stop doing so in the future.
  validate(nb)

I encountered this only with Tutorial 18.

To Reproduce
Run python scripts/generate_markdowns.py --index index.toml --notebooks tutorials/18_GPL.ipynb --output markdowns/

Expected behavior
Although I couldn't figure out why this happens, this warning might be something important. We should check it.
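
The warning itself points to nbformat's normalize() as the intended remedy. For illustration only, the sketch below does the core of that by hand on the notebook's JSON structure: it gives every cell that lacks an id a short random one. `add_missing_cell_ids` is a hypothetical helper, not an nbformat function.

```python
import uuid
from typing import Any, Dict


def add_missing_cell_ids(notebook: Dict[str, Any]) -> int:
    """Assign an id to each cell that has none; return how many were added."""
    added = 0
    for cell in notebook.get("cells", []):
        if "id" not in cell:
            cell["id"] = uuid.uuid4().hex[:8]
            added += 1
    return added
```

The notebook can be loaded and saved with the standard json module around this call. Whether the notebook's nbformat_minor value also needs updating is something to check against the nbformat schema.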

What environment did you try to run the tutorial on?:

  • OS: macOS, VSCode terminal
  • pip install --upgrade pip and pip install -r requirements.txt are run

New tutorial for: `PromptNode` Basics

Describe the tutorial you would like to see here
I will create a second issue for a more advanced tutorial. However, for the PromptNode basics I would suggest we start with:

  • Initializing a PromptNode with various models
  • The default PromptTemplates
  • Defining your own PromptTemplate

@agnieszka-m please add any input about how this can be 'task' oriented.

Additional context
I think using the PromptNode in a pipeline is slightly more advanced, and covering it in an intro tutorial could be too distracting. So I suggest a separate tutorial for that.

[x] I've checked the existing tutorials

Create Tutorial on GermanQuAD and GermanDPR

We recently created a German Question Answering Dataset and also a German Dense Passage Retrieval dataset, along with trained models for each.

It would be great to have a tutorial (something along the lines of Tutorial 1) that allows users to start playing around with these models!

Failed loading pipeline component 'Retriever'. See the stacktrace above for more information.

Describe the issue
I'm trying to launch the REST API with MilvusDocumentStore and DPR retrieval, but I'm not able to run the query endpoint.
I'm using a separate script for indexing.

When I load my_pipeline.yml with Python and call pipeline.run(query), it works fine. But when I launch the REST API, the query endpoint returns this:
File "/opt/venv/lib/python3.10/site-packages/rest_api/controller/search.py", line 67, in _process_request
result = pipeline.run(query=request.query, params=params, debug=request.debug)
AttributeError: 'NoneType' object has no attribute 'run'

To Reproduce
pipeline.yml

components:
  - name: MilvusDocumentStore
    params:
      embedding_dim: 384
      index: custom_index
    type: MilvusDocumentStore
  - name: Retriever
    params:
      document_store: MilvusDocumentStore
      passage_embedding_model: sentence-transformers/all-MiniLM-L6-v2
      query_embedding_model: sentence-transformers/all-MiniLM-L6-v2
    type: DensePassageRetriever
  - name: Reader
    params:
      model_name_or_path: deepset/roberta-base-squad2
    type: TransformersReader
pipelines:
  - name: query
    nodes:
      - inputs:
          - Query
        name: Retriever
      - inputs:
          - Retriever
        name: Reader
version: 1.14.0

My docker-compose.yml for the REST API:

version: '3.5'
services:
  haystack-api:
    #build:
    #  context: .
    #  dockerfile: ./haystack.Dockerfile
    image: "deepset/haystack:cpu"
    volumes:
      - ./rest_api/rest_api/pipeline:/opt/pipelines
    ports:
      - 8000:8000
    restart: on-failure
    environment:
      - PIPELINE_YAML_PATH=/opt/pipelines/my_pipeline.yml
      - TOKENIZERS_PARALLELISM=false
      - HAYSTACK_TELEMETRY_ENABLED=false

  • Haystack Version [1.4]

Additional context
The initialized endpoint returns 200 with True. Maybe I have something missing in my docker-compose.yml or pipeline.yml, but as I said, the pipeline.yml works fine with a Python script.
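
The AttributeError suggests the server's pipeline object was None when the request arrived, which fits the component-loading failure named in the title. A guard like the sketch below would surface that cause directly instead of failing inside the request handler. `run_query` is a hypothetical helper, not the REST API's actual code.

```python
from typing import Any, Optional


def run_query(pipeline: Optional[Any], query: str) -> Any:
    """Run the query, failing with a readable message if no pipeline was loaded."""
    if pipeline is None:
        raise RuntimeError(
            "Query pipeline was not loaded; check the startup logs and PIPELINE_YAML_PATH."
        )
    return pipeline.run(query=query)
```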

Create Tutorial on how to generate labels

Create a tutorial that guides users on different ways to generate their own annotations using the annotation tool, and also in evaluation mode in the streamlit.ui. This might take the form of a blog post or video that should be linked to from the repository.

FAQ tutorial - mismatch between code sample and instruction

Describe the issue
In the FAQ style QA tutorial, in "Init the document store", it says you need to specify the name of "text_field" in elasticsearch, but then in the code sample this field is not listed.

To Reproduce
Open the tutorial and go to section "Initiate the DocumentStore".

Expected behavior
The code sample matches the description.

What environment did you try to run the tutorial on?:

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Haystack Version [e.g. 22]

Additional context
Message from Julian: The parameter is optional. It's called content_field and its default value is content. Here is a link to the code.
elasticsearch.py
:param content_field: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text").
https://github.com/deepset-ai/haystack

Julian Risch: I'd suggest we leave out the bullet point "specify the name of our text_field in Elasticsearch that we want to return as an answer" from the tutorial and then it's okay.

evaluation

In the tutorial 09_DPR_training.ipynb, how can I add evaluation metrics when fine-tuning DPR? Is there any code provided, for example for accuracy, F1, or loss?

Move tutorial datasets to new S3 bucket

With the new S3 bucket https://core-engineering.s3.eu-central-1.amazonaws.com/public/ and its public folder, we should move and possibly also rename all datasets used in the tutorials.

There are individual copies of some datasets for each tutorial to facilitate telemetry. We need to decide on a naming scheme. I would be okay with a number as a suffix, just like we have done until now, but maybe we can come up with an alternative? The downside of the number is that it might not stay in sync with the order of the tutorials on our website and the separation into beginner/intermediate/advanced tutorials.

This is how it's currently done: https://github.com/deepset-ai/haystack/blob/ddeaf2c98c157af1e26c637bcb563c6ea52fdcb7/haystack/telemetry.py#L187
"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip": "1",

What do you think? @brandenchan @bilgeyucel

Changes are needed in Haystack to make sure telemetry continues working. There is an issue for that in Haystack here: deepset-ai/haystack#3634
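
To see why moving or renaming a dataset affects telemetry, consider a lookup shaped like the mapping quoted above, where the dataset URL is the key and the tutorial id is the value. The sketch below is illustrative only: `DATASET_TO_TUTORIAL` and `tutorial_id_for` are hypothetical names, and the single entry is the one shown in this issue.

```python
from typing import Dict, Optional

# URL-to-tutorial mapping; the entry is copied from the issue text above.
DATASET_TO_TUTORIAL: Dict[str, str] = {
    "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip": "1",
}


def tutorial_id_for(url: str) -> Optional[str]:
    """Return the tutorial id registered for a dataset URL, or None if unknown."""
    return DATASET_TO_TUTORIAL.get(url)
```

Any URL that is not in the mapping resolves to None, so a dataset moved to the new bucket stops being attributed to its tutorial until the mapping is updated as well.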
