gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library

Jupyter Notebook 94.25% Python 5.29% Shell 0.46%

langchain-tutorials's Issues

Additional questions on the summarisation tutorial

Hey there

Thanks for putting this together. I reached the same conclusion about summarising a large document: split it, embed the chunks, rank the sections, and choose the most relevant ones for a map_reduce.

However, I've been scouring the net and racking my brains to find a splitter that works by theme (e.g. keyword density) or that can identify chapter/section breaks without my having to pre-define what the markup looks like.

Is there a Python tool or form of analysis that can segment a text document into smaller parts more intelligently than a character-length breakpoint?

Thanks :)
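
One way to picture a theme-based split is to compare the vocabulary of adjacent paragraphs and start a new section where the overlap drops. The sketch below is only an illustration of that idea, not something from the tutorial: the function names, the Jaccard measure, and the 0.1 threshold are all assumptions, and real text would need stop-word removal before the overlap means much. Established methods in this family (for example TextTiling, which NLTK ships as a tokenizer) are considerably more robust.

```python
import re


def word_set(paragraph):
    """Lower-cased set of words in a paragraph (no stop-word filtering)."""
    return set(re.findall(r"[a-z']+", paragraph.lower()))


def jaccard(a, b):
    """Share of words two sets have in common, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def split_by_topic(text, threshold=0.1):
    """Group blank-line-separated paragraphs; break where overlap is low."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    sections = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if jaccard(word_set(prev), word_set(cur)) < threshold:
            sections.append([cur])  # vocabulary changed: new section
        else:
            sections[-1].append(cur)  # same theme: extend current section
    return ["\n\n".join(section) for section in sections]


sample = (
    "Cats purr when happy.\n\n"
    "Happy cats purr and sleep.\n\n"
    "Stock markets fell sharply today.\n\n"
    "Markets fell as stock traders sold."
)
sections = split_by_topic(sample)
print(len(sections))  # 2
```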

NameError: name 'chain' is not defined

Hello! I am receiving this NameError after this line: chain.run(input_documents=docs, question=query)

NameError Traceback (most recent call last)
Cell In[32], line 1
----> 1 chain.run(input_documents=docs, question=query)

NameError: name 'chain' is not defined

In GPT4 I get this answer:

The NameError indicates that the interpreter is unable to find a defined variable or function named chain. This error occurs when the name is not defined in the current scope, or there is a typo in the name.

To fix the error, you need to ensure that the variable chain is defined in the current scope. Check to see if you have defined chain earlier in the code or in a different module that you may have forgotten to import.

In some cases, running the cells from the beginning after selecting "Restart & Clear Output" may resolve the issue [1]. It may also be helpful to review the documentation for the package or module you are using to ensure that you are using it correctly. Checking for any typos in the variable name may also help resolve the issue.

Overall, the NameError indicates that the interpreter is unable to find a defined variable or function. To fix the error, ensure that the name is defined in the current scope or imported from another module, and check for any typos in the name.

Could not get the answer 61, the largest prime number that is smaller than their age

When I run the notebook https://github.com/gkamradt/langchain-tutorials/blob/main/getting_started/Quickstart%20Guide.ipynb

Chains: Combine LLMs and prompts in multi-step workflows

`
#!pip install google-search-results
import os

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.llms import OpenAI

# Load the model
llm = OpenAI(temperature=0)

# Load in some tools to use
os.environ["SERPAPI_API_KEY"] = ""
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# Finally, let's initialize an agent with:
# 1. The tools
# 2. The language model
# 3. The type of agent we want to use.
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

# See list of agent types here

# Now let's test it out!
agent.run("Who is the current leader of Japan? What is the largest prime number that is smaller than their age?")

Here is the output:

Entering new AgentExecutor chain...
I need to find out who the leader of Japan is and then calculate the largest prime number that is smaller than their age.
Action: Search
Action Input: "current leader of Japan"
Observation: Fumio Kishida
Thought: I need to find out the age of the leader of Japan
Action: Search
Action Input: "age of Fumio Kishida"
Observation: 65 years
Thought: I need to calculate the largest prime number that is smaller than 65
Action: Calculator
Action Input: 65

ValueError Traceback (most recent call last)
Cell In[18], line 2
1 # Now let's test it out!
----> 2 agent.run("Who is the current leader of Japan? What is the largest prime number that is smaller than their age?")

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/chains/base.py:213, in Chain.run(self, *args, **kwargs)
211 if len(args) != 1:
212 raise ValueError("run supports only one positional argument.")
--> 213 return self(args[0])[self.output_keys[0]]
215 if kwargs and not args:
216 return self(kwargs)[self.output_keys[0]]

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/chains/base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
114 except (KeyboardInterrupt, Exception) as e:
115 self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116 raise e
117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
118 return self.prep_outputs(inputs, outputs, return_only_outputs)

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/chains/base.py:113, in Chain.__call__(self, inputs, return_only_outputs)
107 self.callback_manager.on_chain_start(
108 {"name": self.__class__.__name__},
109 inputs,
110 verbose=self.verbose,
111 )
112 try:
--> 113 outputs = self._call(inputs)
114 except (KeyboardInterrupt, Exception) as e:
115 self.callback_manager.on_chain_error(e, verbose=self.verbose)

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/agents/agent.py:792, in AgentExecutor._call(self, inputs)
790 # We now enter the agent loop (until it returns something).
791 while self._should_continue(iterations, time_elapsed):
--> 792 next_step_output = self._take_next_step(
793 name_to_tool_map, color_mapping, inputs, intermediate_steps
794 )
795 if isinstance(next_step_output, AgentFinish):
796 return self._return(next_step_output, intermediate_steps)

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/agents/agent.py:695, in AgentExecutor._take_next_step(self, name_to_tool_map, color_mapping, inputs, intermediate_steps)
693 tool_run_kwargs["llm_prefix"] = ""
694 # We then call the tool on the tool input to get an observation
--> 695 observation = tool.run(
696 agent_action.tool_input,
697 verbose=self.verbose,
698 color=color,
699 **tool_run_kwargs,
700 )
701 else:
702 tool_run_kwargs = self.agent.tool_run_logging_kwargs()

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/tools/base.py:107, in BaseTool.run(self, tool_input, verbose, start_color, color, **kwargs)
105 except (Exception, KeyboardInterrupt) as e:
106 self.callback_manager.on_tool_error(e, verbose=verbose_)
--> 107 raise e
108 self.callback_manager.on_tool_end(
109 observation, verbose=verbose_, color=color, name=self.name, **kwargs
110 )
111 return observation

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/tools/base.py:104, in BaseTool.run(self, tool_input, verbose, start_color, color, **kwargs)
102 try:
103 tool_args, tool_kwargs = _to_args_and_kwargs(tool_input)
--> 104 observation = self._run(*tool_args, **tool_kwargs)
105 except (Exception, KeyboardInterrupt) as e:
106 self.callback_manager.on_tool_error(e, verbose=verbose_)

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/agents/tools.py:31, in Tool._run(self, *args, **kwargs)
29 def _run(self, *args: Any, **kwargs: Any) -> str:
30 """Use the tool."""
---> 31 return self.func(*args, **kwargs)

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/chains/llm_math/base.py:130, in LLMMathChain._call(self, inputs)
126 self.callback_manager.on_text(inputs[self.input_key], verbose=self.verbose)
127 llm_output = llm_executor.predict(
128 question=inputs[self.input_key], stop=["```output"]
129 )
--> 130 return self._process_llm_result(llm_output)

File /opt/miniconda3/envs/py38_langchain/lib/python3.8/site-packages/langchain/chains/llm_math/base.py:86, in LLMMathChain._process_llm_result(self, llm_output)
84 answer = "Answer: " + llm_output.split("Answer:")[-1]
85 else:
---> 86 raise ValueError(f"unknown format from LLM: {llm_output}")
87 return {self.output_key: answer}

ValueError: unknown format from LLM: This is not a math problem and cannot be translated into an expression that can be executed using Python's numexpr library.`
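
In the trace above, the agent handed only `65` to the Calculator tool, and the math chain rejected it as not being a math problem. As a sanity check on the answer the issue title expects, the arithmetic can be done in plain Python. This is a minimal sketch that takes the age of 65 from the Observation above; the function names are hypothetical and are not part of LangChain.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small numbers such as an age."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def largest_prime_below(n: int):
    """Largest prime strictly smaller than n, or None if there is none."""
    for candidate in range(n - 1, 1, -1):
        if is_prime(candidate):
            return candidate
    return None


print(largest_prime_below(65))  # 61
```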

Pinecone is now closed to free users: can you show how to use this with an alternative system?

They have just moved Pinecone's free tier to a waiting list (and the next level is $70/month). Can someone explain how to get this working with an alternative to Pinecone?

Which Python version do these programs run on?

It would be nice to pin the versions in the requirements file, e.g. via pip freeze, because LangChain is changing so much and sometimes breaks things.

predict_and_parse is deprecated

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')
output = chain.predict_and_parse(text="please add 15 more units sold to 2023")['data']

printOutput(output)

Running this code block throws a TypeError : initial_value must be str or None, not dict.

I was able to get output in json format using:

chain.predict(text=text)["data"]

Ask question that can retrieve answer from multiple chunks

Suppose I have multiple chunks and I want to build an application where I ask questions that require fetching across multiple chunks. For example, I have detailed experience reports of a trek from 100 people, and I want to query how many of them went prepared with a first aid kit and how many of them needed to use it. What type of chunking and retrieval is most appropriate for this?

Langchain Cookbook Part 1: The VectorStore object not used in the VectorStores section

Thanks for the cookbook. Pretty insightful.

In the section for VectorStores (under Indexes), the embeddings of the text are created using
embeddings.embed_documents()
but the vectorstore class (FAISS) is imported and never used, for example as:
db = FAISS.from_documents(texts, embeddings)

Maybe the section should include creating the vectorstore and using it.

Feature Request: Add a source/citation

I know we can print the documents that match the question via Pinecone, but it would be great to be able to print a citation or source that was used to determine the final answer if this can be added as a feature? Great work btw.

How to control the length of final summary using map_reduce?

Sometimes I find the final summary is too short. How can I make it longer?
Thanks!

Add 'pip install' step to top of notebooks

It would be helpful to explicitly add a !pip install langchain openai cell to the top of the LangChain Cookbooks. Otherwise, users have to play whack-a-mole with installing some packages as they work down the notebook.

TypeError: issubclass() arg 1 must be a class

I try to run cookbook 1 and get this error: TypeError: issubclass() arg 1 must be a class

Any idea how to solve this?

How to use map_reduce in RetrievalQA chain

I want to use a RetrievalQA chain for question answering over a PDF, but I don't know why my response is always split, even though the total is only about 1800 or 2500 tokens, such as

So I want to use map_reduce, but it gives me an error,

and my code is
`
top_matches = vector_db.similarity_search_with_score(query=question, k=int(top_k))
top_matches_contexts = " "
print("top_matches",top_matches)

print("top_k",top_k)
for i, val in enumerate(top_matches):
    top_matches_contexts += "{}、{}\n".format(i+1, val[0].page_content)
if not top_matches:  # no matching chunks were retrieved
    return "Please describe your question in detail"

top_matches_contexts = remove_spaces_and_newlines(top_matches_contexts)



global language
query = prompt.format(query=question, reference=top_matches_contexts,key_word=key_word,language=language)


# chat_history,top_matches_contexts = deal_total_token(chat_history,query)


qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.0,openai_api_key="XXXXX"), chain_type="map_reduce",
                                       retriever=vector_db.as_retriever())

result = qa_chain({"query": query,"chat_history": chat_history})
count_tokens(qa_chain,query)

Mongo Loader

Hi. I have a quick question. Do you have an example using the SimpleMongo DB loader? Have you tested this connector?
I am having difficulty connecting to MongoDB with a username and password, since the documentation only uses host and port. Also, can I use the HTML reader to read a website and use a LangChain agent to store the content locally?

Code Understanding Use case, running in limit of max tokens

openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4114 tokens (3858 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.

Not sure how to reduce the max_tokens, or prompt size.
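
The numbers in the error message already say how far over the limit the request was. The sketch below just works through that arithmetic using the figures quoted above (4097, 3858, 256); the helper name is hypothetical. It shows two ways out: trim the prompt by the overflow, or request a smaller completion.

```python
def completion_budget(context_limit: int, prompt_tokens: int) -> int:
    """Tokens left for the completion once the prompt is counted."""
    return max(context_limit - prompt_tokens, 0)


context_limit = 4097         # model's maximum context length (from the error)
prompt_tokens = 3858         # tokens in the prompt (from the error)
requested_completion = 256   # tokens requested for the completion

# How many tokens the request exceeds the limit by
overflow = prompt_tokens + requested_completion - context_limit
print(overflow)  # 17

# Largest completion that would fit with the prompt unchanged
print(completion_budget(context_limit, prompt_tokens))  # 239
```

So either the prompt needs to shrink by at least 17 tokens (for example, smaller chunks or fewer retrieved documents), or the completion cap needs to drop to 239 or below, assuming the LLM wrapper in use exposes a setting such as `max_tokens` for that.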

Unclear input and passed variables

Hi, I am going through your email tutorial and I like it a lot. However, one thing remains unclear and is left without any comment. Perhaps it would be good to clarify?

The input variables are 'input_documents', 'company', etc., but the map template uses 'text', and the reduce template uses 'text' as well.

Which text is it? Is it the same value in both? I guess not, but it is not mentioned, and this is the only part of your code that leaves me with questions.

Is it recognized by position?

Embed_with_retry in 4.0 seconds as it raised RateLimitError following "Ask A Book Questions.ipynb"

Running the exact same code as "Ask A Book Questions.ipynb", I ran into the following error:

Diagnosis (they all return the above error):

I topped up my OpenAI account for $20. The request didn't charge a single dollar from my account. I guess it's not my account's problem.
I reduced the size of the PDF to one sentence "This is a stupidly small file". I guess it's not due to the size of the texts.

What is the problem?

If I already have some mapping data, how do I let the LLM know about and use the existing data?

In the Clean and Standardize Data section: I already have some mapping data, and the amount exceeds the token limit, so I cannot put it in the prompt as examples. Is there any method to let the LLM know about this data and use it?

ValidationError: 1 validation error for FewShotPromptTemplate example_selector instance of BaseExampleSelector expected (type=type_error.arbitrary_type; expected_arbitrary_type=BaseExampleSelector)

Anyone know how I fix this error?

ValidationError: 1 validation error for FewShotPromptTemplate
example_selector
instance of BaseExampleSelector expected (type=type_error.arbitrary_type; expected_arbitrary_type=BaseExampleSelector)

SSL Error in example Ask A Book Questions

Hi there, thanks for solving my issue about loading PDFs. I came across another issue and suspect it may be related to the versions of some Python packages.

I am trying the Ask A Book Questions tutorial and get the error below when executing this line: docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

Traceback (most recent call last):
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/connection.py", line 362, in connect
    self.sock = ssl_wrap_socket(
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/util/ssl_.py", line 386, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: UNEXPECTED_RECORD] unexpected record (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/urllib3/util/retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by SSLError(SSLError(1, '[SSL: UNEXPECTED_RECORD] unexpected record (_ssl.c:1129)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/serena/Documents/langchain-tutorials/data_generation/chatPDF.py", line 33, in <module>
    docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/vectorstores/pinecone.py", line 235, in from_texts
    embeds = embedding.embed_documents(lines_batch)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/embeddings/openai.py", line 269, in embed_documents
    return self._get_len_safe_embeddings(texts, engine=self.deployment)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/embeddings/openai.py", line 188, in _get_len_safe_embeddings
    encoding = tiktoken.model.encoding_for_model(self.model)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/tiktoken/model.py", line 75, in encoding_for_model
    return get_encoding(encoding_name)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/tiktoken/registry.py", line 63, in get_encoding
    enc = Encoding(**constructor())
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/tiktoken_ext/openai_public.py", line 64, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/tiktoken/load.py", line 114, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/tiktoken/load.py", line 46, in read_file_cached
    contents = read_file(blobpath)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/tiktoken/load.py", line 24, in read_file
    return requests.get(blobpath).content
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by SSLError(SSLError(1, '[SSL: UNEXPECTED_RECORD] unexpected record (_ssl.c:1129)')))

Appreciate your help in advance!

AttributeError: 'tuple' object has no attribute 'page_content' when running a `load_summarize_chain` on my Document generated from PyPDF Loader

Code:

loader_book = PyPDFLoader("D:/PaperPal/langchain-tutorials/data/The Attention Merchants_ The Epic Scramble to Get Inside Our Heads ( PDFDrive ) (1).pdf")
test = loader_book.load()
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
chain.run(test[0])

I get the following error even when the test[0] is a Document object

> Entering new MapReduceDocumentsChain chain...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
d:\PaperPal\langchain-tutorials\chains\Chain Types.ipynb Cell 19 in ()
----> 1 chain.run(test[0])

File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:213, in Chain.run(self, *args, **kwargs)
    211     if len(args) != 1:
    212         raise ValueError("`run` supports only one positional argument.")
--> 213     return self(args[0])[self.output_keys[0]]
    215 if kwargs and not args:
    216     return self(kwargs)[self.output_keys[0]]

File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:116, in Chain.__call__(self, inputs, return_only_outputs)
    114 except (KeyboardInterrupt, Exception) as e:
    115     self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 116     raise e
    117 self.callback_manager.on_chain_end(outputs, verbose=self.verbose)
    118 return self.prep_outputs(inputs, outputs, return_only_outputs)

File c:\Users\mail2\anaconda3\lib\site-packages\langchain\chains\base.py:113, in Chain.__call__(self, inputs, return_only_outputs)
    107 self.callback_manager.on_chain_start(
    108     {"name": self.__class__.__name__},
    109     inputs,
    110     verbose=self.verbose,
    111 )
...
--> 141         [{**{self.document_variable_name: d.page_content}, **kwargs} for d in docs]
    142     )
    143     return self._process_results(results, docs, token_max, **kwargs)

AttributeError: 'tuple' object has no attribute 'page_content'
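
One possible reading of this traceback: the last frame loops `for d in docs`, so the chain expects a list of documents, and `chain.run(test[0])` hands it a single Document instead. If Document is a pydantic-style model whose iteration yields `(field name, value)` tuples (an assumption about the installed version), the loop would then see tuples rather than documents. The sketch below reproduces that mechanism with a stand-in class; `FakeDocument` and `collect_page_contents` are hypothetical and only mimic the behaviour.

```python
class FakeDocument:
    """Stand-in for a Document; iterating it yields (field, value) tuples."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}

    def __iter__(self):
        yield ("page_content", self.page_content)
        yield ("metadata", self.metadata)


def collect_page_contents(docs):
    """Mimics the failing line: reads .page_content from each item in docs."""
    return [d.page_content for d in docs]


doc = FakeDocument("some text")

try:
    collect_page_contents(doc)  # a single document, not a list
except AttributeError as err:
    message = str(err)
    print(message)  # 'tuple' object has no attribute 'page_content'

print(collect_page_contents([doc]))  # ['some text']
```

If that reading holds, passing a list, such as `chain.run([test[0]])` or `chain.run(test)`, would be the first thing to try.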

NewConnectionError

MaxRetryError: HTTPSConnectionPool(host='controller.pinecone_api_env.pinecone.io', port=443): Max retries exceeded with url: /databases (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe2b3ed8250>: Failed to establish a new connection: [Errno -2] Name or service not known'))

ValidationError: 1 validation error for SQLDatabaseToolkit

I got this error. The version of langchain is 0.0.169. I want to know how to fix this error.

ValidationError Traceback (most recent call last)
Cell In[2], line 3
1 # db = SQLDatabase.from_uri("sqlite:///../../../../notebooks/Chinook.db")
2 db = SQLDatabase.from_uri("sqlite:///../../notebooks/Chinook.db")
----> 3 toolkit = SQLDatabaseToolkit(db=db)
4 agent_executor = create_sql_agent(
5 llm = OpenAI(temperature=0, openai_api_key="sk-xxx"),
6 toolkit=toolkit,
7 verbose=True
8 )

File ~/miniconda3/envs/aigc/lib/python3.10/site-packages/pydantic/main.py:342, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for SQLDatabaseToolkit
llm
field required (type=value_error.missing)

Ask a Book Questions - errors in Pinecone integration

https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Ask%20A%20Book%20Questions.ipynb

The notebook is out of sync with the current version of Pinecone. Here are some thoughts:

from langchain.vectorstores import Chroma, Pinecone: I think it's better to install langchain_pinecone and use from langchain_pinecone import PineconeVectorStore: https://python.langchain.com/docs/integrations/vectorstores/pinecone
Using the latest Pinecone client installed by pip install pinecone-client, you'll need to change import pinecone to from pinecone import Pinecone. When creating an index, you'll also need to import the ServerlessSpec or PodSpec class.
With an up-to-date client, initialization is a little different and doesn't include environment. Instead of:

pinecone.init(
     api_key=PINECONE_API_KEY,  # find at app.pinecone.io
     environment=PINECONE_API_ENV  # next to api key in console
)

you use:

pc = Pinecone(api_key=PINECONE_API_KEY)

Also, the notebook seems to assume a Pinecone index already exists. If you want to point people at guidance on creating indexes, you could use https://docs.pinecone.io/docs/manage-indexes#create-a-serverless-index.

fix spelling mistake in README.md

TextSplitter in different languages

https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/5%20Levels%20Of%20Summarization%20-%20Novice%20To%20Expert.ipynb

For the summarization methods above level 3, the best practice is to use TokenTextSplitter rather than RecursiveCharacterTextSplitter, because the number of tokens in a string of a given character length varies greatly from language to language.

text_splitter_by_char = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)
text_splitter_by_token = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)

If this is not taken into account, errors exceeding the max token count are likely to occur when processing text in multiple languages.

I have tested the number of tokens used for the same family of patents, in different languages:

English (US10901237B2)=21823 (100%)
Simplified Chinese (CN112904591A)=30901 (142%)
Traditional Chinese (TW201940135A)=36530 (167%)
Korean (KR20190089752A)=42644 (195%)
Japanese (JP2019128599A)=51430 (236%)
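
The percentages above are each language's token count relative to the English count. A small sketch that recomputes them from the figures reported in this issue (the dictionary layout and variable names are just for illustration):

```python
# Token counts for the same patent family, as reported above
token_counts = {
    "English (US10901237B2)": 21823,
    "Simplified Chinese (CN112904591A)": 30901,
    "Traditional Chinese (TW201940135A)": 36530,
    "Korean (KR20190089752A)": 42644,
    "Japanese (JP2019128599A)": 51430,
}

baseline = token_counts["English (US10901237B2)"]

# Each count as a rounded percentage of the English baseline
ratios = {lang: round(100 * n / baseline) for lang, n in token_counts.items()}

for lang, pct in ratios.items():
    print(f"{lang}: {pct}%")
```

Run as written, this prints 100%, 142%, 167%, 195% and 236%, matching the list above.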

Question on Pinecone index in 'Ask A Book Questions.ipynb'

Hey Gregory,

Thank you for the great series on YouTube. I have a question regarding the notebook 'Ask A Book Questions.ipynb' that you used to demonstrate querying some custom knowledge from PDF files.

In the 11th cell, you used a code to load the vectors into Pinecone:

docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

Subsequently, you used docsearch again in your query:

query = "What are examples of good data science teams?"
docs = docsearch.similarity_search(query, include_metadata=True)

My question is would this be using the index from Pinecone? In your example here, you've loaded the vector into Pinecone earlier so the data is already in docsearch but for a use case where you would want to read the index directly without loading any documents from Pinecone, would you use from_existing_index instead? E.g.:

docsearch = Pinecone.from_existing_index(pinecone_index_name, embeddings)

'UnstructuredPDFLoader' is not defined

Hello,
Whenever I try to execute the tutorial, either on Colab or locally, I get the following error:
NameError: name 'UnstructuredPDFLoader' is not defined
even if I install all the packages as shown at https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/unstructured_file.html

UnstructuredPDFLoader zipfile.BadZipFile: File is not a zip file

Hi there, I was trying the Ask A Book Questions tutorial. However, I got stuck on the third line,
data = loader.load().
Do you have any idea why it says my document is not a zip file? It is actually loading a PDF.
Here is the stack trace:

Traceback (most recent call last):
  File "/Users/serena/Documents/langchain-tutorials/data_generation/chatPDF.py", line 5, in <module>
    data = loader.load()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/unstructured.py", line 61, in load
    elements = self._get_elements()
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/langchain/document_loaders/pdf.py", line 27, in _get_elements
    from unstructured.partition.pdf import partition_pdf
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/pdf.py", line 19, in <module>
    from unstructured.partition.text import partition_text
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/text.py", line 16, in <module>
    from unstructured.partition.text_type import (
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/partition/text_type.py", line 21, in <module>
    from unstructured.nlp.tokenize import pos_tag, sent_tokenize, word_tokenize
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/nlp/tokenize.py", line 32, in <module>
    _download_nltk_package_if_not_present(package_name, package_category)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 555, in find
    return find(modified_name, paths)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 542, in find
    return ZipFilePathPointer(p, zipentry)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 394, in __init__
    zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/serena/Library/Python/3.9/lib/python/site-packages/nltk/data.py", line 935, in __init__
    zipfile.ZipFile.__init__(self, filename)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

The class CallbackManager has moved from langchain.callbacks.base to langchain.callbacks.manager

When I ran the code for 'With Streaming' in ChatAPI + LangChain Basics.ipynb, I encountered an error: 'cannot import name 'CallbackManager' from 'langchain.callbacks.base'.'
Upon further investigation in the LangChain documentation, I discovered that the package containing CallbackManager has been modified.
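
For notebooks that need to run against both older and newer releases, one option is to try the new module path first and fall back to the old one. The helper below is a generic sketch, not something from LangChain itself: `import_first` is a hypothetical name, and the demonstration uses standard-library modules so it runs without LangChain installed.

```python
import importlib


def import_first(module_paths, attr_name):
    """Return attr_name from the first module in module_paths that has it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this installation; try the next
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(f"{attr_name} not found in any of {module_paths}")


# With LangChain installed, one might call (new location first, then old):
# CallbackManager = import_first(
#     ["langchain.callbacks.manager", "langchain.callbacks.base"],
#     "CallbackManager",
# )

# Demonstration using only the standard library
OrderedDict = import_first(["no_such_module_xyz", "collections"], "OrderedDict")
print(OrderedDict.__name__)  # OrderedDict
```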

Typo in the Cookbook part 1

Your vectorstore store your embeddings (☝️) and make "the" easily searchable
I guess it should be: "Your vectorstore store your embeddings (☝️) and make "them" easily searchable" :)
Thanks

Error when running the Level 3 map_reduce code in 5 Levels of Summarization

When I run the Level 3 map_reduce code, I get an error like this:

ValueError: OpenAIChat currently only supports single prompt, got

I think it comes from this line:

output = summary_chain.run(docs)

I have searched the internet and still have not found a solution. How can I solve this?

my local environment:

Python 3.9.5 (default, May 18 2021, 12:31:01)
langchain 0.0.167
openai 0.27.6
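
The error message says the model wrapper accepts only one prompt per call, while a map_reduce chain sends one prompt per chunk. The sketch below illustrates that mismatch with a hypothetical stand-in (`single_prompt_model` is not a LangChain class) and shows the map step issued one prompt at a time. In the LangChain versions from this period, the error was commonly reported when a chat model such as gpt-3.5-turbo was loaded through the completion-style `OpenAI` wrapper; that is an assumption about this setup, since the issue does not show how the LLM was created.

```python
from typing import List


def single_prompt_model(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for a wrapper that rejects batched prompts."""
    if len(prompts) != 1:
        raise ValueError(f"only supports single prompt, got {len(prompts)}")
    # Fake "summary" so the example runs without an API key.
    return [f"summary({len(prompts[0])} chars)"]


def map_reduce_one_at_a_time(chunks: List[str]) -> str:
    """Map: summarize each chunk in its own call. Reduce: combine the results."""
    partial = [single_prompt_model([f"Summarize: {c}"])[0] for c in chunks]
    return single_prompt_model(["Combine: " + " | ".join(partial)])[0]


if __name__ == "__main__":
    print(map_reduce_one_at_a_time(["first chunk", "second chunk"]))
```

If the assumption above holds, creating the model with a chat model class instead of the completion-style wrapper may avoid the error without changing the chain.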

ValidationError: 1 validation error for LLMChain

I have had this model working and it is great, but now I'm getting different error messages.

ValidationError Traceback (most recent call last)
Cell In[72], line 2
1 llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
----> 2 chain = load_qa_chain(llm, chain_type="stuff")

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/question_answering/__init__.py:218, in load_qa_chain(llm, chain_type, verbose, callback_manager, **kwargs)
213 if chain_type not in loader_mapping:
214 raise ValueError(
215 f"Got unsupported chain type: {chain_type}. "
216 f"Should be one of {loader_mapping.keys()}"
217 )
--> 218 return loader_mapping[chain_type](
219 llm, verbose=verbose, callback_manager=callback_manager, **kwargs
220 )

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/langchain/chains/question_answering/__init__.py:63, in _load_stuff_chain(llm, prompt, document_variable_name, verbose, callback_manager, **kwargs)
54 def _load_stuff_chain(
55 llm: BaseLanguageModel,
56 prompt: Optional[BasePromptTemplate] = None,
(...)
60 **kwargs: Any,
61 ) -> StuffDocumentsChain:
62 _prompt = prompt or stuff_prompt.PROMPT_SELECTOR.get_prompt(llm)
---> 63 llm_chain = LLMChain(
64 llm=llm, prompt=prompt, verbose=verbose, callback_manager=callback_manager
65 )
66 # TODO: document prompt
67 return StuffDocumentsChain(
68 llm_chain=llm_chain,
69 document_variable_name=document_variable_name,
(...)
72 **kwargs,
73 )

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pydantic/main.py:342, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for LLMChain
prompt
none is not an allowed value (type=type_error.none.not_allowed)

ChatGPT4 response:

The error message indicates a ValidationError caused by an invalid value for the prompt argument in the LLMChain constructor: None is not an allowed value for prompt. This can happen when prompt is not specified properly while creating an LLMChain instance.

To fix it, pass a valid, non-None value for prompt when creating the LLMChain. Consult the documentation or source code of the LLMChain class to determine which values are valid.
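
Reading the traceback itself, line 62 of `_load_stuff_chain` computes a fallback `_prompt`, but line 64 passes the original `prompt` (still None when the caller gave none) to `LLMChain`. That looks like the likely cause, though it is an inference from the pasted frames and not a confirmed diagnosis. The sketch below reproduces the pattern with hypothetical stand-ins (`StandInChain`, `DEFAULT_PROMPT`, and both loader functions are illustrative, not LangChain code).

```python
from typing import Optional

# Illustrative default; the real default prompt is chosen by the library.
DEFAULT_PROMPT = "Context: {context}\nQuestion: {question}"


class StandInChain:
    """Hypothetical stand-in for a validated class that rejects prompt=None."""

    def __init__(self, prompt: Optional[str]):
        if prompt is None:
            raise ValueError("prompt: none is not an allowed value")
        self.prompt = prompt


def load_chain_buggy(prompt: Optional[str] = None) -> StandInChain:
    _prompt = prompt or DEFAULT_PROMPT  # fallback is computed...
    return StandInChain(prompt=prompt)  # ...but the original value is passed


def load_chain_fixed(prompt: Optional[str] = None) -> StandInChain:
    _prompt = prompt or DEFAULT_PROMPT
    return StandInChain(prompt=_prompt)  # pass the resolved prompt
```

If this reading is right, passing an explicit `prompt` to `load_qa_chain` or installing a different langchain version may work around the problem.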

Setup "Ask A Book Questions" for Google Colab

Hello,
First of all, you have created a great tutorial on how to build a Q&A engine for any PDF-based knowledge base. I love it!
I set up a Google Colab notebook to replicate your tutorial and ran into quite a few issues during the environment setup.
The following screenshot shows all the setup tasks needed to make it run successfully on Google Colab. I hope other readers find this useful.

Minor markdown change in Twitter Reply Bot ipynb file

How to report a mistake in documentation?

Hello lang-chain devs,

How do I report a mistake in documentation? I was not sure if this is the right forum for pointing this out.

The sentence in documentation should probably read:

A text embedding model takes a piece of text as input and returns a numerical representation of that text in the form of a list of floats.

The original documentation is missing "returns a".
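
To make the corrected sentence concrete, here is a toy illustration of the "text in, list of floats out" shape. It uses a hash, not a trained model, so the numbers carry no semantic meaning; `toy_embed` and its `dim` parameter are hypothetical names chosen for this sketch.

```python
import hashlib
from typing import List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Map text to a fixed-length list of floats in [0, 1] (toy example only)."""
    # SHA-256 yields 32 bytes, so dim should be at most 32.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


if __name__ == "__main__":
    vector = toy_embed("Hi! It's time for the beach")
    print(len(vector), vector)
```

A real embedding model returns a much longer vector, and texts with similar meaning end up close together, which this toy version does not attempt.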