emcf / thepipe

Feed PDFs, URLs, Slides, YouTube, GitHub, and more into Vision-Language models with one line of code ⚡

Home Page: https://thepi.pe

License: MIT License

Python 67.26% C++ 0.14% CSS 0.21% C 0.08% Jupyter Notebook 31.72% TypeScript 0.60%

multimodal pdf vision-transformer large-language-models web youtube gpt-4 scrapers

thepipe's Introduction

The Pipe

Feed PDFs, URLs, Slides, YouTube videos, Word docs and more into Vision-Language models with one line of code ⚡

The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that want to support comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a hosted API at thepi.pe, or it can be set up locally if you have the compute.

Features 🌟

Extracts text and visuals from files or web pages 📚
Outputs chunks optimized for multimodal LLMs and RAG frameworks 🖼️
Interprets complex PDFs, web pages, docs, videos, data, and more 🧠
Auto-compresses prompts exceeding your chosen token limit 📦
Works even with missing file extensions and in-memory data streams 💾
Works with codebases, git repos, and custom integrations 🌐
Multi-threaded ⚡️

Getting Started 🚀

The Pipe can read a wide array of file types, and thus has many dependencies that must be installed separately. It also requires a strong machine for good response times. For this reason, we host it as an API that works out-of-the-box.

First, install The Pipe.

pip install thepipe_api

The Pipe is available as a hosted API, or it can be set up locally. An API key is recommended for out-of-the-box functionality (alternatively, see the local installation section). Ensure the THEPIPE_API_KEY environment variable is set. Don't have a key yet? Get one here.
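Before calling the hosted API, it can help to confirm that the key is actually visible to the Python process that will run The Pipe. The following is a minimal sketch using only the standard library; the helper name `require_api_key` is hypothetical and is not part of thepipe_api.

```python
import os


def require_api_key(name: str = "THEPIPE_API_KEY") -> str:
    """Return the key stored in environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; thepipe_api performs its own check.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set in this process; "
            "export it in the shell that launches Python."
        )
    return key
```

Calling `require_api_key()` at the top of a script fails early with a readable message if the variable was set in a different shell or in a file that was never loaded.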

Now you can extract comprehensive text and visuals from any file:

from thepipe_api import thepipe
messages = thepipe.extract("example.pdf")

Or websites:

messages = thepipe.extract("https://example.com")

Then feed it into GPT-4V like so:

import openai  # reads the OPENAI_API_KEY environment variable

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,
)

You can also use The Pipe from the command line. The following command recursively extracts from a directory, matching only files containing a substring (in this example, TypeScript files) and ignoring files containing another substring (in this example, anything in the "tests" folder):

thepipe path/to/folder --match tsx --ignore tests

Supported File Types 📚

Source Type	Input types	Token Compression 🗜️	Image Extraction 👁️	Notes 📌
Directory	Any `/path/to/directory`	✔️	✔️	Extracts from all files in directory, supports match and ignore patterns
Code	`.py`, `.tsx`, `.js`, `.html`, `.css`, `.cpp`, etc	✔️ (varies)	❌	Combines all code files. `.c`, `.cpp`, `.py` are compressible with ctags, others are not
Plaintext	`.txt`, `.md`, `.rtf`, etc	✔️	❌	Regular text files
PDF	`.pdf`	✔️	✔️	Extracts text and images of each page; can use AI for extraction of table data and images within pages
Image	`.jpg`, `.jpeg`, `.png`	❌	✔️	Extracts images, uses OCR if text_only
Spreadsheet	`.csv`, `.xls`, `.xlsx`	✔️	❌	Extracts data from spreadsheets; converts each row into a JSON formatted string
Jupyter Notebook	`.ipynb`	❌	✔️	Extracts code, markdown, and images from Jupyter notebooks
Microsoft Word Document	`.docx`	✔️	✔️	Extracts text and images from Word documents
Microsoft PowerPoint Presentation	`.pptx`	✔️	✔️	Extracts text and images from PowerPoint presentations
Video	`.mp4`, `.mov`, `.wmv`	✔️	✔️	Extracts frames and audio transcript from videos in per-minute chunks.
Audio	`.mp3`, `.wav`	✔️	❌	Extracts text from audio files; supports speech-to-text conversion
Website	URLs (inputs starting with `http`, `https`, `ftp`)	✔️	✔️	Extracts text from web page along with image (or images if scrollable); text-only extraction available
GitHub Repository	GitHub repo URLs (inputs starting with `https://github.com` or `https://www.github.com`)	✔️	✔️	Extracts from GitHub repositories; supports branch specification
YouTube Video	YouTube video URLs (inputs starting with `https://youtube.com` or `https://www.youtube.com`)	✔️	✔️	Extracts frames and transcript from YouTube videos in per-minute chunks.
ZIP File	`.zip`	✔️	✔️	Extracts contents of ZIP files; supports nested directory extraction

How it works 🛠️

The input source is either a file path, a URL, or a directory. The Pipe extracts information from the source and processes it for downstream use with language models, vision transformers, or vision-language models. The output is a list of multimodal messages representing chunks of the extracted information, sized to fit within the context windows of models from gemma-7b to GPT-4. The messages returned should look like this:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
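To make the structure above concrete, here is a minimal sketch that builds one such message by hand from a text chunk and raw JPEG bytes. The function name `build_message` is hypothetical; The Pipe produces these messages for you, so this only illustrates the format.

```python
import base64


def build_message(text: str, jpeg_bytes: bytes) -> dict:
    """Pack one text chunk and one JPEG image into a single user message.

    Hypothetical helper; it mirrors the message layout shown above.
    """
    # Images are embedded inline as a base64 data URL.
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            },
        ],
    }
```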

If you want to feed these messages directly into the model, it is important to be mindful of the token limit. OpenAI does not allow too many images in the prompt (see discussion here), so long files should be extracted with text_only=True to avoid this issue, while long text files should either be compressed or embedded in a RAG framework.
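If the messages have already been extracted, one way to check how many images they carry, and to fall back to text when there are too many, is sketched below. The helpers `count_images` and `drop_images` are hypothetical and assume the message layout shown above; passing text_only=True at extraction time remains the documented route.

```python
def count_images(messages: list) -> int:
    """Count image parts across all messages."""
    return sum(
        1
        for message in messages
        for part in message.get("content", [])
        if isinstance(part, dict) and part.get("type") == "image_url"
    )


def drop_images(messages: list) -> list:
    """Return a copy of the messages with only their text parts kept."""
    stripped = []
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            # Plain-string content has no image parts to remove.
            stripped.append(dict(message))
            continue
        text_parts = [
            part for part in content
            if isinstance(part, dict) and part.get("type") == "text"
        ]
        stripped.append({**message, "content": text_parts})
    return stripped
```

Both helpers leave the input list untouched, so the original messages can still be used for retrieval or display.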

The text and images from these messages may also be prepared for a vector database with thepipe.core.create_chunks_from_messages or for downstream use with RAG frameworks. LiteLLM can be used to easily integrate The Pipe with any LLM provider.
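As a rough illustration of preparing message text for embedding, the sketch below flattens the text parts into fixed-size pieces. It is not the behavior of thepipe.core.create_chunks_from_messages; the function name `messages_to_text_chunks` and the character-based splitting are assumptions made for this example.

```python
def messages_to_text_chunks(messages: list, max_chars: int = 1000) -> list:
    """Collect text parts from the messages and split them into pieces
    of at most `max_chars` characters (hypothetical helper)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            texts = [content]
        else:
            texts = [
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            ]
        for text in texts:
            # Slice each text into consecutive windows of max_chars.
            for start in range(0, len(text), max_chars):
                chunks.append(text[start:start + max_chars])
    return chunks
```

A real pipeline would more likely split on sentence or token boundaries; character windows are used here only to keep the sketch dependency-free.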

The Pipe uses a variety of heuristics for optimal performance with vision-language models, including AI filetype detection, opt-in AI table, equation, and figure extraction, efficient token compression, automatic image encoding, reranking for lost-in-the-middle effects, and more, all pre-built to work out-of-the-box.

Local Installation 🛠️

The Pipe handles a wide array of complex filetypes, and thus requires installation of many different packages to function. It also requires a very capable machine for good response times. For this reason, we host it as an API that works out-of-the-box. To use The Pipe locally for free instead, you will need playwright, ctags, pytesseract, pytube and the remaining local Python requirements, which differ from the more lightweight API requirements:

git clone https://github.com/emcf/thepipe
cd thepipe
pip install -r requirements_local.txt

Tip for Windows users: install the python-libmagic binaries with pip install python-magic-bin. Ensure the tesseract-ocr binaries and the ctags binaries are in your PATH. For YouTube video extraction to function consistently, you will need to modify your pytube installation to send a valid user agent header (I know, it's complicated. See this issue for more).

Now you can use The Pipe with Python:

from thepipe_api import thepipe
chunks = thepipe.extract("example.pdf", local=True)

or from the command line:

thepipe path/to/folder --match .tsx --ignore tests

Arguments are:

source (required): can be a file path, a URL, or a directory path.
local (optional): Use the local version of The Pipe instead of the hosted API.
match (optional): Substring to match files in the directory. Regex is not yet supported.
ignore (optional): Substring to ignore files in the directory. Regex is not yet supported.
limit (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed. This may not work as expected with the API, as it is in active development.
ai_extraction (optional): Extract tables, figures, and math from PDFs using our extractor. Incurs extra costs.
text_only (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.
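The substring semantics of match and ignore can be sketched as follows. This is an illustration under assumptions, not The Pipe's implementation: the function name `select_files` is hypothetical, and the sketch tests the substring against each file's path relative to the source directory.

```python
import os
from typing import List, Optional


def select_files(root: str, match: Optional[str] = None,
                 ignore: Optional[str] = None) -> List[str]:
    """Walk `root` recursively and keep files whose relative path contains
    `match` (if given) and does not contain `ignore` (if given)."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            relative = os.path.relpath(path, root)
            if match and match not in relative:
                continue  # plain substring test, no regex
            if ignore and ignore in relative:
                continue
            selected.append(path)
    return sorted(selected)
```

With this sketch, `select_files("path/to/folder", match=".tsx", ignore="tests")` would keep `src/app.tsx` and skip `tests/app.test.tsx`.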

Sponsors

Thank you to Cal.com for sponsoring this project. Contact [email protected] for sponsorship information.

thepipe's Issues

Add .ino functionality for GitHub repos related to arduino

Feature requests 🔨

Accepting feature requests in this thread, please feel free to suggest!
The roadmap so far includes:

Cloud storage extraction (Google Drive, OneDrive)
E-Commerce platform extraction (Amazon)
Markdown formatted extraction (PDF to Markdown, URL to Markdown, etc)

Some videos (without audio) fail to extract

Error 'NoneType' object has no attribute 'write_audiofile' occurring on the line video.subclip(start_time, end_time).audio.write_audiofile(audio_path, codec='pcm_s16le')

Could probably fix with a simple none check

audio = video.subclip(start_time, end_time).audio
if audio is None:
  transcription = None
else:
  audio.write_audiofile(audio_path, codec='pcm_s16le')
  ...

file type scanning

Thoughts on a scan feature that prints file types of the directory/file selected for Piping without extracting any data? It would be clearer what file types are causing failure if something isn't supported.
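One possible shape for such a scan is sketched below. It is a hypothetical helper, not an existing feature of The Pipe, and it classifies files by extension only, whereas The Pipe can also detect types for files without extensions.

```python
from collections import Counter
from pathlib import Path


def scan_file_types(source: str) -> Counter:
    """Count files by lowercase extension under `source` without reading them.

    `source` may be a single file or a directory (scanned recursively).
    """
    root = Path(source)
    if root.is_file():
        files = [root]
    else:
        files = [path for path in root.rglob("*") if path.is_file()]
    return Counter(path.suffix.lower() or "(no extension)" for path in files)
```

Printing `scan_file_types("path/to/folder").most_common()` gives a quick view of which extensions are present before any extraction is attempted.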

Audio transcript extraction

Looking to support mp3, wav

Audio is not standard in commercial multimodal models today in 2024. Because of this, I am also looking to transcribe audio to text, probably via Whisper.

Video frame + transcript extraction

Looking to support extraction of mp4, mov, webm, avi files as well as youtube for a Vision-Language model (not a video model)

Video and audio are not standard in commercial multimodal models today. Because of this, I am looking to transcribe audio.

`ai_extraction=True` not working locally

Hi! Not sure if this is a bug or a feature, but I'd love to use the ai_extraction option to improve the handling of PDF documents. However, enabling this option overwrites the local=True option.

MWE:

from thepipe.thepipe_api import thepipe 
source = 'example.pdf'
messages = thepipe.extract(source, local=True, verbose=True, ai_extraction=True)

Throws the error:
Failed to extract from example.pdf: No valid API key given. Visit https://thepi.pe/docs to learn more.

It works without enabling ai_extraction, but I don't like that it adds every page as an image to the messages because this massively increases the token count for longer PDFs.
As a workaround, I adapted the extract_pdf function only to extract PDF pages as images if the page contains an image. It would be great to have this as an option. (I know this approach is not optimal as it misses tables and some images containing only SVG objects; maybe a better option is possible only based on the fitz library, but I am no expert in this package).

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz
        # extract text and images of each page from the PDF
        with open(file_path, 'rb') as file:
            doc = fitz.open(file_path)
            for page in doc:
                text = page.get_text()
                image_list = page.get_images(full=True)
                if text_only:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
                elif image_list:
                    pix = page.get_pixmap()
                    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                    chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))

                else:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))

            doc.close()
    return chunks

Swap Whisper Version

I was looking at your pipeline and thought you might be better served by using https://github.com/Vaibhavs10/insanely-fast-whisper, or by allowing a bit of wiggle room in your framework through an optional parameter for feeding in a separate processor for video transcription. This is over an order of magnitude improvement on vanilla Whisper and has CPU/GPU modes. You may want to allow a whole pipeline to be fed in, to future-proof this particular endpoint against new tooling.

No longer working after addition of THEPIPE_API_KEY

I have added the env var THEPIPE_API_KEY to my .env, my .bashrc, and at the command line. It is not being accepted.

{response['error']}")
ValueError: Valid environment variable THEPIPE_API_KEY not found. You may need to restart if you have set your API key. Visit https://thepi.pe/docs to learn more.

add syntax to match multiple patterns with match/ignore functionality.

Currently, ignore only accepts one file type to be ignored.

Error when trying to Pipe Linkedin profile

thepipe https://www.linkedin.com/in/spencer-reitsma-8a3938151/
Extracting from website...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Python312\Scripts\thepipe.exe\__main__.py", line 7, in <module>
File "C:\Users\Spenc\AppData\Roaming\Python\Python312\site-packages\thepipe_api\thepipe.py", line 60, in main
chunks = extractor.extract_from_source(source=args.source, match=args.match, ignore=args.ignore, limit=args.limit, verbose=args.verbose, ai_extraction=args.ai_extraction, text_only=args.text_only, local=args.local)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Spenc\AppData\Roaming\Python\Python312\site-packages\thepipe_api\extractor.py", line 57, in extract_from_source
return extract_url(url=source, text_only=text_only, local=local)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Spenc\AppData\Roaming\Python\Python312\site-packages\thepipe_api\extractor.py", line 292, in extract_url
raise ValueError(f"{response['error']}")
ValueError: Page.evaluate: Execution context was destroyed, most likely because of a navigation
PS D:\Downloads\Project Templates for reference only>

Running "Locally"

Multiple Questions:
What are the resources recommended/required for local extraction?

When running locally can you provide us the option to expose a port and receive POST requests? That way we can have an on prem machine that can work interchangeably with your API for client machines.
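To illustrate what such an endpoint might look like, here is a minimal standard-library sketch that accepts a JSON POST body of the form {"source": "..."} and replies with either "messages" or "error". Everything in it is an assumption for illustration: the request shape, the names `handle_payload` and `make_server`, and the pluggable `extract` callable are hypothetical and do not describe The Pipe's hosted API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_payload(extract, raw: bytes):
    """Run `extract` on the 'source' field of a JSON body.

    Returns (status_code, response_dict); any failure becomes a 400 error.
    """
    try:
        payload = json.loads(raw or b"{}")
        return 200, {"messages": extract(payload["source"])}
    except Exception as error:
        return 400, {"error": str(error)}


def make_server(extract, host: str = "127.0.0.1", port: int = 8000) -> HTTPServer:
    """Build an HTTP server whose POST handler delegates to `extract`."""

    class ExtractHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, body = handle_payload(extract, self.rfile.read(length))
            data = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return HTTPServer((host, port), ExtractHandler)
```

On a machine with the local requirements installed, `extract` could be something like `lambda source: thepipe.extract(source, local=True)`, assuming its return value is JSON-serializable, and `make_server(extract).serve_forever()` would start listening.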

Make docker image

Apologies there is no docker image yet! 😅
I am on the case.
