
chatpdf-minimal-demo's Introduction

chatpdf-minimal-demo

A minimal implementation of chatpdf: chat with an article (an MVP of chatpdf)

This project aims to study how chatpdf works and to learn how to build an app like it.


Process flow

  • Split the article into pieces (paragraphs)
  • Convert each piece into an embedding with OpenAI's embedding API
  • Convert the user's question into an embedding
  • Compare the question embedding with every piece embedding to get a similarity score, then sort the pieces by it
  • Use the one or few pieces closest in meaning to the question as context, and ask OpenAI's chat API (ChatGPT) for the final answer
                   article/pdf                                    question
    /            /                        \                           |
piece1         piece2  ...........       piece(N)                     |
   |             |                         |                          |
embedding1     embedding2     ......     embeddingN                   |
   |             |                         |                          |
 --X-------------X---------.....-----------X-----------------   question_embedding  
   |             |                         |                          |
question       question                  question                     |
distance1      distance2                 distance(N)                  |
                 |                                                    |
               pick nearest piece                                     |
                  \                                                   /
                           \                          /
                               construct prompt
                                     |
                              get answer from ChatGPT
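
The retrieval part of the flow above (split, embed, compare, pick the nearest piece) can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual code: `split_into_pieces`, `rank_pieces`, and `toy_embed` are hypothetical names, and `toy_embed` is a keyword-counting stand-in used only so the example runs without an API key. In a real setup the `embed` argument would wrap a call to an embedding API.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def split_into_pieces(article: str) -> List[str]:
    """Split on blank lines; each non-empty paragraph becomes one piece."""
    return [p.strip() for p in re.split(r"\n\s*\n", article) if p.strip()]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pieces(
    question: str,
    pieces: List[str],
    embed: Callable[[str], Vector],
) -> List[Tuple[float, str]]:
    """Score every piece against the question, most similar first."""
    question_embedding = embed(question)
    scored = [(cosine_similarity(question_embedding, embed(p)), p) for p in pieces]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding API: counts of three keywords."""
    vocab = ["cat", "dog", "price"]
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in vocab]


article = (
    "The cat sleeps all day.\n\n"
    "The dog barks at night.\n\n"
    "The price is ten dollars."
)
pieces = split_into_pieces(article)
ranked = rank_pieces("What does the dog do?", pieces, toy_embed)
nearest = ranked[0][1]  # the piece that would be used as context in the prompt
```

Cosine similarity is used here as the comparison; the source only says the embeddings are compared and sorted, so the exact distance measure is an assumption.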

Usage

docker compose up

chatpdf-minimal-demo's People

Contributors

kadgang, postor


chatpdf-minimal-demo's Issues

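Error when processing a 45k file

The text file is 45k and was split into 12 pieces, each under 4k. After the upload succeeded and the app entered chat mode, asking a question failed with the following error: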

chatpdf-minimal-demo-app-1  | Error: Command failed with exit code 1: python3 py/process-article.py ask embeddings/57e7926127a3aca90f890cd63981d4a4 What is the name and label of the button with the text "联系客服"?
chatpdf-minimal-demo-app-1  | Traceback (most recent call last):
chatpdf-minimal-demo-app-1  |   File "/app/py/process-article.py", line 153, in <module>
chatpdf-minimal-demo-app-1  |     [prompt,answer] = ask(question,obj["embeddings"],obj["sources"])
chatpdf-minimal-demo-app-1  | ValueError: not enough values to unpack (expected 2, got 0)
chatpdf-minimal-demo-app-1  |     at makeError (file:///app/node_modules/execa/lib/error.js:59:11)
chatpdf-minimal-demo-app-1  |     at handlePromise (file:///app/node_modules/execa/index.js:119:26)
chatpdf-minimal-demo-app-1  |     at processTicksAndRejections (node:internal/process/task_queues:96:5)
chatpdf-minimal-demo-app-1  |     at async file:///app/node_modules/@shack-js/runner-express/esm/index.js:56:37 {
chatpdf-minimal-demo-app-1  |   shortMessage: 'Command failed with exit code 1: python3 py/process-article.py ask embeddings/57e7926127a3aca90f890cd63981d4a4 What is the name and label of the button with the text "联系客服"?',
chatpdf-minimal-demo-app-1  |   command: 'python3 py/process-article.py ask embeddings/57e7926127a3aca90f890cd63981d4a4 What is the name and label of the button with the text "联系客服"?',
chatpdf-minimal-demo-app-1  |   escapedCommand: 'python3 "py/process-article.py" ask "embeddings/57e7926127a3aca90f890cd63981d4a4" "What is the name and label of the button with the text \\"联系客服\\"?"',
chatpdf-minimal-demo-app-1  |   exitCode: 1,
chatpdf-minimal-demo-app-1  |   signal: undefined,
chatpdf-minimal-demo-app-1  |   signalDescription: undefined,
chatpdf-minimal-demo-app-1  |   stdout: '',
chatpdf-minimal-demo-app-1  |   stderr: 'Traceback (most recent call last):\n' +
chatpdf-minimal-demo-app-1  |     '  File "/app/py/process-article.py", line 153, in <module>\n' +
chatpdf-minimal-demo-app-1  |     '    [prompt,answer] = ask(question,obj["embeddings"],obj["sources"])\n' +
chatpdf-minimal-demo-app-1  |     'ValueError: not enough values to unpack (expected 2, got 0)',
chatpdf-minimal-demo-app-1  |   failed: true,
chatpdf-minimal-demo-app-1  |   timedOut: false,
chatpdf-minimal-demo-app-1  |   isCanceled: false,
chatpdf-minimal-demo-app-1  |   killed: false
chatpdf-minimal-demo-app-1  | }
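
The traceback shows that `ask(...)` returned an empty sequence, so unpacking it into `[prompt,answer]` failed. The body of `ask` is not shown in this report, so why it returned nothing is unknown. The sketch below only reproduces the failure mode and shows one way a caller could turn it into a clearer error; `unpack_ask_result` is a hypothetical helper, not part of the project.

```python
from typing import List, Sequence


def unpack_ask_result(result: Sequence[str]) -> List[str]:
    """Return [prompt, answer], or raise a descriptive error if the shape is wrong."""
    if len(result) != 2:
        raise RuntimeError(
            f"ask() returned {len(result)} values; expected [prompt, answer]"
        )
    prompt, answer = result
    return [prompt, answer]


# Reproduce the error from the log: unpacking an empty list into two names.
try:
    prompt, answer = []
except ValueError as exc:
    message = str(exc)  # same wording as the ValueError in the traceback
```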

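About chatpdf's answer generation process

Hi, thank you very much for sharing!
I have a question about the last step, though: how does chatpdf generate its answers? As far as I can see, chatpdf's answers stay fairly close to the PDF itself; it does not give answers that are not found in the PDF. Take the question "Why does BERT keep the masked tokens unchanged for 10%" as an example (the PDF I uploaded is the BERT paper). chatpdf's answer is as follows:
[screenshot of chatpdf's answer, 2023-03-09]

ChatGPT, however, gives a comparatively accurate result:
[screenshot of ChatGPT's answer]

My intuition is that, to keep its answers reliable, it does not answer with content that is not in the PDF. This is somewhat like the new Bing's answering strategy.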
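
One plausible way an answer ends up limited to the PDF is a prompt that tells the model to use only the retrieved pieces. chatpdf's real prompt is not public in this thread, so the wording below is an assumption, and `build_prompt` is a hypothetical helper shown only to illustrate the idea.

```python
from typing import List


def build_prompt(question: str, context_pieces: List[str]) -> str:
    """Join the retrieved pieces and instruct the model to stay within them."""
    context = "\n\n".join(context_pieces)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


example_prompt = build_prompt(
    "Why does BERT keep the masked tokens unchanged for 10%",
    ["(nearest piece of the uploaded paper goes here)"],
)
```

With a prompt like this, a question whose answer is missing from the retrieved pieces tends to get a refusal or a narrow answer, while asking ChatGPT directly lets it draw on its general knowledge.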

docker-compose up reports an error

Running docker-compose up reports the following error:
ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services: 'app'

After I deleted the app line from docker-compose.yml it seemed to work and the image started building, but I don't know why.

install scikit-learn error

os: windows11 wsl ubuntu20.04
Running docker compose up fails with the following error:

 => ERROR [ 8/13] RUN pip3 install --no-cache scikit-learn                                                       192.1s
------
 > [ 8/13] RUN pip3 install --no-cache scikit-learn:
#0 1.361 Collecting scikit-learn
#0 1.726   Downloading scikit-learn-1.2.2.tar.gz (7.3 MB)
#0 2.563      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.3/7.3 MB 8.8 MB/s eta 0:00:00
#0 3.370   Installing build dependencies: started
#0 104.7   Installing build dependencies: still running...
#0 169.4   Installing build dependencies: still running...
#0 191.5   Installing build dependencies: finished with status 'error'
#0 191.5   error: subprocess-exited-with-error
#0 191.5
#0 191.5   × pip subprocess to install build dependencies did not run successfully.
#0 191.5   │ exit code: 1
#0 191.5   ╰─> [79 lines of output]
#0 191.5       Ignoring numpy: markers 'python_version == "3.10" and platform_system == "Windows" and platform_python_implementation != "PyPy"' don't match your environment
#0 191.5       Collecting setuptools
#0 191.5         Using cached setuptools-67.6.1-py3-none-any.whl (1.1 MB)
#0 191.5       Collecting wheel
#0 191.5         Using cached wheel-0.40.0-py3-none-any.whl (64 kB)
#0 191.5       Collecting Cython>=0.29.24
#0 191.5         Using cached Cython-0.29.33-cp39-cp39-musllinux_1_1_x86_64.whl (2.1 MB)
#0 191.5       Collecting oldest-supported-numpy
#0 191.5         Using cached oldest_supported_numpy-2022.11.19-py3-none-any.whl (4.9 kB)
#0 191.5       Collecting scipy>=1.3.2
#0 191.5         Downloading scipy-1.10.1.tar.gz (42.4 MB)
#0 191.5            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.4/42.4 MB 9.3 MB/s eta 0:00:00
#0 191.5         Installing build dependencies: started
#0 191.5         Installing build dependencies: still running...
#0 191.5         Installing build dependencies: still running...
#0 191.5         Installing build dependencies: finished with status 'done'
#0 191.5         Getting requirements to build wheel: started
#0 191.5         Getting requirements to build wheel: finished with status 'done'
#0 191.5         Installing backend dependencies: started
#0 191.5         Installing backend dependencies: finished with status 'done'
#0 191.5         Preparing metadata (pyproject.toml): started
#0 191.5         Preparing metadata (pyproject.toml): finished with status 'error'
#0 191.5         error: subprocess-exited-with-error
#0 191.5
#0 191.5         × Preparing metadata (pyproject.toml) did not run successfully.
#0 191.5         │ exit code: 1
#0 191.5         ╰─> [40 lines of output]
#0 191.5             + meson setup --prefix=/usr /tmp/pip-install-rcwoveti/scipy_62fc21d83ad647f3b1cac28d15cb5c4d /tmp/pip-install-rcwoveti/scipy_62fc21d83ad647f3b1cac28d15cb5c4d/.mesonpy-w0ie64vn/build --native-file=/tmp/pip-install-rcwoveti/scipy_62fc21d83ad647f3b1cac28d15cb5c4d/.mesonpy-native-file.ini -Ddebug=false -Doptimization=2
#0 191.5             The Meson build system
#0 191.5             Version: 1.0.1
#0 191.5             Source dir: /tmp/pip-install-rcwoveti/scipy_62fc21d83ad647f3b1cac28d15cb5c4d
#0 191.5             Build dir: /tmp/pip-install-rcwoveti/scipy_62fc21d83ad647f3b1cac28d15cb5c4d/.mesonpy-w0ie64vn/build
#0 191.5             Build type: native build
#0 191.5             Project name: SciPy
#0 191.5             Project version: 1.10.1
#0 191.5             C compiler for the host machine: cc (gcc 10.3.1 "cc (Alpine 10.3.1_git20210424) 10.3.1 20210424")
#0 191.5             C linker for the host machine: cc ld.bfd 2.35.2
#0 191.5             C++ compiler for the host machine: c++ (gcc 10.3.1 "c++ (Alpine 10.3.1_git20210424) 10.3.1 20210424")
#0 191.5             C++ linker for the host machine: c++ ld.bfd 2.35.2
#0 191.5             Cython compiler for the host machine: cython (cython 0.29.33)
#0 191.5             Host machine cpu family: x86_64
#0 191.5             Host machine cpu: x86_64
#0 191.5             Compiler for C supports arguments -Wno-unused-but-set-variable: YES
#0 191.5             Compiler for C supports arguments -Wno-unused-function: YES
#0 191.5             Compiler for C supports arguments -Wno-conversion: YES
#0 191.5             Compiler for C supports arguments -Wno-misleading-indentation: YES
#0 191.5             Compiler for C supports arguments -Wno-incompatible-pointer-types: YES
#0 191.5             Library m found: YES
#0 191.5
#0 191.5             ../../meson.build:63:0: ERROR: Unknown compiler(s): [['gfortran'], ['flang'], ['nvfortran'], ['pgfortran'], ['ifort'], ['ifx'], ['g95']]
#0 191.5             The following exception(s) were encountered:
#0 191.5             Running `gfortran --version` gave "[Errno 2] No such file or directory: 'gfortran'"
#0 191.5             Running `gfortran -V` gave "[Errno 2] No such file or directory: 'gfortran'"
#0 191.5             Running `flang --version` gave "[Errno 2] No such file or directory: 'flang'"
#0 191.5             Running `flang -V` gave "[Errno 2] No such file or directory: 'flang'"
#0 191.5             Running `nvfortran --version` gave "[Errno 2] No such file or directory: 'nvfortran'"
#0 191.5             Running `nvfortran -V` gave "[Errno 2] No such file or directory: 'nvfortran'"
#0 191.5             Running `pgfortran --version` gave "[Errno 2] No such file or directory: 'pgfortran'"
#0 191.5             Running `pgfortran -V` gave "[Errno 2] No such file or directory: 'pgfortran'"
#0 191.5             Running `ifort --version` gave "[Errno 2] No such file or directory: 'ifort'"
#0 191.5             Running `ifort -V` gave "[Errno 2] No such file or directory: 'ifort'"
#0 191.5             Running `ifx --version` gave "[Errno 2] No such file or directory: 'ifx'"
#0 191.5             Running `ifx -V` gave "[Errno 2] No such file or directory: 'ifx'"
#0 191.5             Running `g95 --version` gave "[Errno 2] No such file or directory: 'g95'"
#0 191.5             Running `g95 -V` gave "[Errno 2] No such file or directory: 'g95'"
#0 191.5
#0 191.5             A full log can be found at /tmp/pip-install-rcwoveti/scipy_62fc21d83ad647f3b1cac28d15cb5c4d/.mesonpy-w0ie64vn/build/meson-logs/meson-log.txt
#0 191.5             [end of output]
#0 191.5
#0 191.5         note: This error originates from a subprocess, and is likely not a problem with pip.
#0 191.5       error: metadata-generation-failed
#0 191.5
#0 191.5       × Encountered error while generating package metadata.
#0 191.5       ╰─> See above for output.

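How are tables in a PDF handled?

chatpdf can also answer questions about numeric values in tables accurately. But once a table has been parsed out of the PDF, how should it be split into pieces and vectorized?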
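
One possible approach, which is an assumption and not something the project or chatpdf is known to do, is to turn each table row into a short sentence-like piece that repeats the column names, so that every piece can be embedded and retrieved on its own. `table_to_pieces` is a hypothetical helper, and the table data is made up for illustration.

```python
from typing import List, Sequence


def table_to_pieces(
    caption: str,
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
) -> List[str]:
    """Serialize each row as 'caption | column: value; ...' so it stands alone."""
    pieces = []
    for row in rows:
        cells = "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        pieces.append(f"{caption} | {cells}")
    return pieces


# Made-up example table.
table_pieces = table_to_pieces(
    "Table 1: model sizes",
    ["Model", "Layers", "Hidden size"],
    [["small", "12", "768"], ["large", "24", "1024"]],
)
```

Repeating the caption and column names in every piece costs some tokens, but it keeps a single retrieved row understandable without the rest of the table.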

how to generate question automatically?

We can split articles into pieces and then convert them to embeddings using OpenAI.
However, how can we generate questions automatically?
We only have articles, not questions. Do we have to write them manually?

No reply after sending a question; how can this be fixed? ChatGPT says it is a network problem.

openai.error.APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x70a2dfd4d430>: Failed to establish a new connection: [Errno 110] Operation timed out'))

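This error message indicates that something went wrong while connecting to the OpenAI API, apparently because the connection timed out. Possible causes are a network problem or a problem with OpenAI's servers.

Here are some things you can try:

Check your network connection: make sure your computer is connected to the internet and can reach other websites. If you are using a VPN or proxy server, try disabling it temporarily and running the code again.

Check OpenAI server status: visit the OpenAI website and look for any announcements or notices. If OpenAI's servers are having problems, you may need to wait a while and try again.

Increase the connection timeout: if your network connection is slow or unstable, you may need a longer connection timeout. You can do this by setting a timeout parameter, for example: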
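```
import openai
openai.api_key = "YOUR_API_KEY"
# Assumes the pre-1.0 openai package: text goes in `input`, the timeout in `request_timeout`.
response = openai.Embedding.create(model="text-embedding-ada-002", input="Hello, world!", request_timeout=30)
```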

How did you solve this? Thanks!
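
For transient connection failures, retrying with exponential backoff is a common complement to a longer timeout. The sketch below is generic and does not import the openai package; `call_with_retry` is a hypothetical helper, and the exception types to retry on (for example `openai.error.APIConnectionError` from the traceback above) would have to be passed in via `retry_on`. If api.openai.com is simply unreachable from your network, retrying will not help.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable when retries >= 0")


# Usage with a fake flaky call that fails twice, then succeeds.
attempts = {"count": 0}
delays = []


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


result = call_with_retry(flaky_request, sleep=delays.append)
```

Passing `sleep=delays.append` records the waits instead of actually sleeping, which keeps the example fast.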
