xuwenhao / geektime-ai-course
Jupyter Notebooks for Geektime AI Course
License: MIT License
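Drawing a picture actually makes this much easier to understand.
First, recall the formula for cosine similarity: cos(θ) = (a·b) / (||a|| ||b||)
The formula has two parts: the dot product of the two vectors, divided by the product of their norms.
The product of the norms is easy to understand: in the figure above, it is the length of the red vector a multiplied by the length of the yellow vector b.
The dot product of two vectors is the length of one vector multiplied by the length of the other vector's projection onto it. In the figure above, that is the length of the brown vector (the projection of a onto b) multiplied by the length of the yellow vector b.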
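The projection argument can be written out as a short derivation. This is only a sketch: the symbol p for the signed length of the projection of a onto b is introduced here for illustration and does not appear in the original post.

```latex
\begin{align*}
p &= \|a\| \cos\theta \\
a \cdot b &= p \, \|b\| = \|a\| \, \|b\| \cos\theta \\
\cos\theta &= \frac{a \cdot b}{\|a\| \, \|b\|}
\end{align*}
```

Because p is signed, it becomes negative once θ exceeds 90 degrees, which is where the negative values of cos(θ) come from.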
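If the two vectors point in almost the same direction, as in the figure below:
the length of a and the length of a's projection onto b are almost equal, so by the formula cos(θ) approaches 1.
If the two vectors point in almost opposite directions, as in the figure below:
the length of a and the length of a's projection onto b are again almost equal, but the projection points opposite to b, so the dot product is negative and cos(θ) approaches -1.
If a and b are at 90 degrees, the projection of a onto b is 0, so cos(θ) is 0.
To summarize: the more closely the directions of two vectors agree, the closer cos(θ) is to 1; the more opposed they are, the closer cos(θ) is to -1; at 90 degrees, cos(θ) is 0. This is why cosine similarity can be used to measure how similar two vectors are: the closer it is to 1, the more aligned, and therefore the more similar, the two vectors are; the closer it is to -1, the more opposed they are. A value of 0 means the two vectors are orthogonal, that is, unrelated.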
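The three cases can be checked numerically. The following is a minimal sketch in plain Python; the function name cosine_similarity and the sample vectors are chosen here for illustration and are not taken from the course notebooks.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))      # a . b
    norm_a = math.sqrt(sum(x * x for x in a))   # ||a||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||b||
    return dot / (norm_a * norm_b)


# Illustrative vectors for the three cases described above.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
opposite_direction = cosine_similarity([1.0, 2.0], [-2.0, -4.0])
right_angle = cosine_similarity([1.0, 0.0], [0.0, 3.0])

print(same_direction, opposite_direction, right_angle)
```

The first value should come out close to 1, the second close to -1, and the third equal to 0. The length of each vector does not affect the result, only its direction.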
pip install + remote link.requirements.txt fails
How do I deploy the conda environment on Colab?
The data folder is missing the file toutiao_cat_data_10k_with_embeddings.csv
Running conda env update --file conda-env.yml directly
produces an error like the following:
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
- cffi==1.15.1=py310h6c40b1e_3
- nbformat==5.7.0=py310hecd8cb5_0
- ...
If you use VS Code, you can fix these entries with a find-and-replace (with regex mode enabled).
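The same kind of cleanup can be scripted. The sketch below assumes that the failing entries are the ones pinned to a platform-specific build string (the third field in lines such as cffi==1.15.1=py310h6c40b1e_3) and that dropping this field is the intended fix. The function name strip_build_strings and the regular expression are hypothetical, written for illustration only.

```python
import re

# Assumed entry shape: "- name==version=build" or "- name=version=build".
# Only entries that carry a third "=build" field are rewritten.
BUILD_PIN = re.compile(r"^(\s*-\s*[A-Za-z0-9_.\-]+={1,2}[^=\s]+)=[^=\s]+\s*$")


def strip_build_strings(yaml_text):
    """Drop the trailing build string from each pinned dependency line."""
    cleaned = []
    for line in yaml_text.splitlines():
        match = BUILD_PIN.match(line)
        # Keep name and version, discard the build string; leave other lines as-is.
        cleaned.append(match.group(1) if match else line)
    return "\n".join(cleaned)


sample = "\n".join([
    "dependencies:",
    "  - cffi==1.15.1=py310h6c40b1e_3",
    "  - x264==1!157.20191217=h1de35cc_0",
    "  - appnope=0.1.2",
])
print(strip_build_strings(sample))
```

To apply it to a real file, read conda-env.yml as text, pass the contents through strip_build_strings, and write the result to a new file. Entries without a build string, such as appnope=0.1.2, are left unchanged.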
Even after that, a few packages still fail:
Solving environment: failed
ResolvePackageNotFound:
- appnope=0.1.2
- x264==1!157.20191217=h1de35cc_0
- libgfortran=5.0.0
These have to be handled by hand. In addition, a few packages such as libgfortran still raise an error:
ERROR: No matching distribution found for libgfortran
Could this be fixed so that the environment can be built in one step? :)
@xuwenhao The latest llama-index no longer supports GPTSimpleVectorIndex
https://github.com/xuwenhao/geektime-ai-course/blob/main/10_llama_index_to_read_a_book.ipynb
import openai, os
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader
openai.api_key = os.environ.get("OPENAI_API_KEY")
documents = SimpleDirectoryReader('./data/mr_fujino').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)
index.save_to_disk('index_mr_fujino.json')
The code above raises an error: GPTSimpleVectorIndex has to be changed to GPTVectorStoreIndex.
It needs to be changed to the following code:
import openai, os
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader
openai.api_key = os.environ.get("OPENAI_API_KEY")
documents = SimpleDirectoryReader('./data/mr_fujino').load_data()
index = GPTVectorStoreIndex.from_documents(documents)
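# persist() takes a directory path rather than a file name, so pass a folder for the index files
index.storage_context.persist(persist_dir='./index_mr_fujino')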
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
notebook 7.2.0 requires jupyterlab<4.3,>=4.2.0, but you have jupyterlab 4.0.9 which is incompatible.