thepipe's Introduction

Feed PDFs, URLs, Slides, YouTube videos, Word docs and more into Vision-Language models with one line of code ⚡

The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best suited to LLM and RAG applications that need comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a hosted API at thepi.pe, or it can be set up locally if you have the compute.

Features 🌟

  • Extracts text and visuals from files or web pages 📚
  • Outputs chunks optimized for multimodal LLMs and RAG frameworks 🖼️
  • Interprets complex PDFs, web pages, docs, videos, data, and more 🧠
  • Auto-compress prompts exceeding your chosen token limit 📦
  • Works even with missing file extensions, in-memory data streams 💾
  • Works with codebases, git repos, and custom integrations 🌐
  • Multi-threaded ⚡️

Getting Started 🚀

The Pipe can read a wide array of file types, and thus has many dependencies that must be installed separately. It also requires a strong machine for good response times. For this reason, we host it as an API that works out-of-the-box.

First, install The Pipe.

pip install thepipe_api

The Pipe is available as a hosted API, or it can be set up locally. An API key is recommended for out-of-the-box functionality (alternatively, see the local installation section). Ensure the THEPIPE_API_KEY environment variable is set. Don't have a key yet? Get one here.
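One way to fail early when the key is missing is sketched below. The function name `get_api_key` is hypothetical and not part of thepipe_api; only the THEPIPE_API_KEY variable name comes from the text above.

```python
import os
from typing import Mapping


def get_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the key from THEPIPE_API_KEY, or raise if it is unset or blank.

    Hypothetical helper for illustration; thepipe_api does not necessarily
    expose anything like it.
    """
    key = env.get("THEPIPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("THEPIPE_API_KEY is not set; export it before calling the hosted API.")
    return key
```

Calling `get_api_key()` once at startup surfaces a missing key before any extraction request is made.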

Now you can extract comprehensive text and visuals from any file:

from thepipe_api import thepipe
messages = thepipe.extract("example.pdf")

Or websites:

messages = thepipe.extract("https://example.com")

Then feed it into GPT-4V like so:

import openai

response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,
)

You can also use The Pipe from the command line. Here's how to recursively extract from a directory, matching only files whose path contains a substring (in this example, TypeScript files) and ignoring files whose path contains another substring (in this example, anything in the "tests" folder):

thepipe path/to/folder --match tsx --ignore tests

Supported File Types 📚

| Source Type | Input types | Token Compression 🗜️ | Image Extraction 👁️ | Notes 📌 |
|---|---|---|---|---|
| Directory | Any /path/to/directory | ✔️ | ✔️ | Extracts from all files in directory, supports match and ignore patterns |
| Code | .py, .tsx, .js, .html, .css, .cpp, etc | ✔️ (varies) | | Combines all code files. .c, .cpp, .py are compressible with ctags, others are not |
| Plaintext | .txt, .md, .rtf, etc | ✔️ | | Regular text files |
| PDF | .pdf | ✔️ | ✔️ | Extracts text and images of each page; can use AI for extraction of table data and images within pages |
| Image | .jpg, .jpeg, .png | | ✔️ | Extracts images, uses OCR if text_only |
| Spreadsheet | .csv, .xls, .xlsx | ✔️ | | Extracts data from spreadsheets; converts each row into a JSON formatted string |
| Jupyter Notebook | .ipynb | | ✔️ | Extracts code, markdown, and images from Jupyter notebooks |
| Microsoft Word Document | .docx | ✔️ | ✔️ | Extracts text and images from Word documents |
| Microsoft PowerPoint Presentation | .pptx | ✔️ | ✔️ | Extracts text and images from PowerPoint presentations |
| Video | .mp4, .mov, .wmv | ✔️ | ✔️ | Extracts frames and audio transcript from videos in per-minute chunks |
| Audio | .mp3, .wav | ✔️ | | Extracts text from audio files; supports speech-to-text conversion |
| Website | URLs (inputs starting with http, https, ftp) | ✔️ | ✔️ | Extracts text from web page along with image (or images if scrollable); text-only extraction available |
| GitHub Repository | GitHub repo URLs (inputs starting with https://github.com or https://www.github.com) | ✔️ | ✔️ | Extracts from GitHub repositories; supports branch specification |
| YouTube Video | YouTube video URLs (inputs starting with https://youtube.com or https://www.youtube.com) | ✔️ | ✔️ | Extracts frames and transcript from YouTube videos in per-minute chunks |
| ZIP File | .zip | ✔️ | ✔️ | Extracts contents of ZIP files; supports nested directory extraction |
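The Spreadsheet row says each row is converted into a JSON formatted string. A minimal sketch of that idea, using only the standard library, is shown here. It is an illustration and not The Pipe's actual implementation; the function name `csv_rows_to_json` and the sample data are made up.

```python
import csv
import io
import json
from typing import List


def csv_rows_to_json(csv_text: str) -> List[str]:
    """Turn each CSV data row into a JSON string keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(dict(row)) for row in reader]


# Made-up sample data: a header line followed by two rows.
rows = csv_rows_to_json("name,score\nada,10\nbob,7\n")
for row in rows:
    print(row)
```

Each JSON string carries its own column names, so a row stays interpretable even when it ends up in a chunk separate from the header.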

How it works 🛠️

The input source is either a file path, a URL, or a directory. The Pipe extracts information from the source and processes it for downstream use with language models, vision transformers, or vision-language models. The output is a list of multimodal messages representing chunks of the extracted information, sized to fit within the context windows of models from gemma-7b to GPT-4. The messages returned should look like this:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
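To inspect the output before sending it to a model, the text parts can be separated from the image parts. The sketch below assumes only the message shape shown above; the helper name `split_messages` is hypothetical and not part of The Pipe, and the sample data is made up.

```python
from typing import Dict, List, Tuple


def split_messages(messages: List[Dict]) -> Tuple[List[str], List[str]]:
    """Return (texts, image_urls) collected from OpenAI-style multimodal messages."""
    texts: List[str] = []
    image_urls: List[str] = []
    for message in messages:
        # Assumes "content" is a list of parts, as in the listing above.
        for part in message.get("content", []):
            if part.get("type") == "text":
                texts.append(part["text"])
            elif part.get("type") == "image_url":
                image_urls.append(part["image_url"]["url"])
    return texts, image_urls


# Made-up input in the same shape as the listing above.
sample = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Page 1 text"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,AAAA"}},
        ],
    }
]
texts, image_urls = split_messages(sample)
print(len(texts), len(image_urls))
```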

If you want to feed these messages directly into the model, it is important to be mindful of the token limit. OpenAI does not allow too many images in the prompt (see discussion here), so long files should be extracted with text_only=True to avoid this issue, while long text files should either be compressed or embedded in a RAG framework.
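When text_only=True is too blunt, another option is to cap the number of images after extraction. The following is a rough sketch under that assumption: `limit_images` is a hypothetical helper, not part of The Pipe, and the right value for `max_images` depends on the provider's current limits, which are not stated here.

```python
import copy
from typing import Dict, List


def limit_images(messages: List[Dict], max_images: int) -> List[Dict]:
    """Keep every text part but only the first `max_images` image parts.

    Assumes each message's "content" is a list of parts. The input list is
    left unmodified.
    """
    kept = 0
    result = []
    for message in messages:
        new_message = copy.copy(message)  # shallow copy; content is rebuilt below
        new_content = []
        for part in message.get("content", []):
            if part.get("type") == "image_url":
                if kept >= max_images:
                    continue  # drop images beyond the budget
                kept += 1
            new_content.append(part)
        new_message["content"] = new_content
        result.append(new_message)
    return result
```

Passing `max_images=0` strips every image while leaving the text parts in place.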

The text and images from these messages may also be prepared for a vector database with thepipe.core.create_chunks_from_messages or for downstream use with RAG frameworks. LiteLLM can be used to easily integrate The Pipe with any LLM provider.

The Pipe uses a variety of heuristics for optimal performance with vision-language models, including AI-based filetype detection, opt-in AI table, equation, and figure extraction, efficient token compression, automatic image encoding, reranking to counter lost-in-the-middle effects, and more, all pre-built to work out-of-the-box.

Local Installation 🛠️

The Pipe handles a wide array of complex filetypes, and thus requires many different packages to function. It also requires a very capable machine for good response times. For this reason, we host it as an API that works out-of-the-box. To use The Pipe locally for free instead, you will need playwright, ctags, pytesseract, pytube and the remaining local Python requirements, which differ from the more lightweight API requirements:

git clone https://github.com/emcf/thepipe
cd thepipe
pip install -r requirements_local.txt

Tip for Windows users: Install the python-libmagic binaries with pip install python-magic-bin. Ensure the tesseract-ocr binaries and the ctags binaries are in your PATH. For YouTube video extraction to function consistently, you will need to modify your pytube installation to send a valid user agent header (I know, it's complicated. See this issue for more).

Now you can use The Pipe with Python:

from thepipe_api import thepipe
chunks = thepipe.extract("example.pdf", local=True)

or from the command line:

thepipe path/to/folder --match .tsx --ignore tests

Arguments are:

  • source (required): can be a file path, a URL, or a directory path.
  • local (optional): Use the local version of The Pipe instead of the hosted API.
  • match (optional): Substring to match files in the directory. Regex is not yet supported.
  • ignore (optional): Substring to ignore files in the directory. Regex is not yet supported.
  • limit (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed. This may not work as expected with the API, as it is in active development.
  • ai_extraction (optional): Extract tables, figures, and math from PDFs using our extractor. Incurs extra costs.
  • text_only (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.
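The match and ignore arguments are plain substring tests. A minimal sketch of that behaviour follows; it assumes the substring is checked against the whole path (which the "tests" folder example suggests), and `filter_paths` is a hypothetical name rather than a function from The Pipe.

```python
from typing import Iterable, List, Optional


def filter_paths(
    paths: Iterable[str],
    match: Optional[str] = None,
    ignore: Optional[str] = None,
) -> List[str]:
    """Keep paths containing `match` (if given) and not containing `ignore` (if given)."""
    selected = []
    for path in paths:
        if match is not None and match not in path:
            continue  # does not contain the required substring
        if ignore is not None and ignore in path:
            continue  # contains the excluded substring
        selected.append(path)
    return selected


# Made-up paths mirroring the command-line example.
paths = ["src/App.tsx", "src/tests/App.test.tsx", "README.md"]
print(filter_paths(paths, match=".tsx", ignore="tests"))
```

Because these are substring tests and not patterns, `--ignore tests` would also exclude a file such as `contests.tsx`.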

Sponsors

Thank you to Cal.com for sponsoring this project. Contact [email protected] for sponsorship information.

thepipe's People

Contributors

achembarpu, emcf

thepipe's Issues

Feature requests 🔨

Accepting feature requests in this thread; please feel free to suggest!
The roadmap so far includes:

  • Cloud storage extraction (Google Drive, OneDrive)
  • E-Commerce platform extraction (Amazon)
  • Markdown formatted extraction (PDF to Markdown, URL to Markdown, etc)

Some videos (without audio) fail to extract

The error 'NoneType' object has no attribute 'write_audiofile' occurs on the line video.subclip(start_time, end_time).audio.write_audiofile(audio_path, codec='pcm_s16le'), because a clip without an audio track has audio set to None.

Could probably fix with a simple none check

audio = video.subclip(start_time, end_time).audio
if audio is None:
  transcription = None
else:
  audio.write_audiofile(audio_path, codec='pcm_s16le')
  ...

file type scanning

Thoughts on a scan feature that prints file types of the directory/file selected for Piping without extracting any data? It would be clearer what file types are causing failure if something isn't supported.
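A rough sketch of what such a scan could look like is given below. It counts files by extension without reading their contents. The function name `scan_file_types` is hypothetical, and unlike The Pipe it does not detect types for files that lack an extension.

```python
import tempfile
from collections import Counter
from pathlib import Path


def scan_file_types(root: str) -> Counter:
    """Count files under `root` (a file or a directory) by lowercase extension."""
    root_path = Path(root)
    if root_path.is_file():
        files = [root_path]
    else:
        files = [p for p in root_path.rglob("*") if p.is_file()]
    return Counter(p.suffix.lower() or "(no extension)" for p in files)


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.py", "b.PY", "notes.md", "Makefile"):
        (Path(tmp) / name).touch()
    counts = scan_file_types(tmp)

for extension, count in sorted(counts.items()):
    print(extension, count)
```

Comparing the reported extensions against the supported file types table would show which files are likely to fail before any extraction is attempted.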

Audio transcript extraction

Looking to support mp3, wav

Audio is not standard in commercial multimodal models today in 2024. Because of this, I am also looking to transcribe audio to text, probably via Whisper.

Video frame + transcript extraction

Looking to support extraction of mp4, mov, webm, avi files as well as youtube for a Vision-Language model (not a video model)

Video and audio are not standard in commercial multimodal models today. Because of this, I am looking to transcribe audio.

`ai_extraction=True` not working locally

Hi! Not sure if this is a bug or a feature, but I'd love to use the ai_extraction option to improve the handling of PDF documents. However, enabling this option overrides the local=True option.

MWE:

from thepipe.thepipe_api import thepipe 
source = 'example.pdf'
messages = thepipe.extract(source, local=True, verbose=True, ai_extraction=True)

Throws the error:
Failed to extract from example.pdf: No valid API key given. Visit https://thepi.pe/docs to learn more.

It works without enabling ai_extraction, but I don't like that it adds every page as an image to the messages because this massively increases the token count for longer PDFs.
As a workaround, I adapted the extract_pdf function only to extract PDF pages as images if the page contains an image. It would be great to have this as an option. (I know this approach is not optimal as it misses tables and some images containing only SVG objects; maybe a better option is possible only based on the fitz library, but I am no expert in this package).

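# Imports needed by this excerpt. Chunk, SourceTypes, API_URL, THEPIPE_API_KEY and
# create_chunks_from_messages are assumed to come from the surrounding thepipe module.
import json
from typing import List

import requests
from PIL import Image

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]: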
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz
        # Extract the text of each page; render the page as an image only if it embeds one.
        doc = fitz.open(file_path)
        for page in doc:
            text = page.get_text()
            image_list = page.get_images(full=True)
            if text_only or not image_list:
                # Text-only mode, or a page without embedded images: keep the text alone.
                chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
            else:
                pix = page.get_pixmap()
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
        doc.close()
    return chunks

Swap Whisper Version

I was looking at your pipeline and thought you might be better served by using https://github.com/Vaibhavs10/insanely-fast-whisper, or by leaving some wiggle room in your framework for an optional parameter that accepts a separate processor for video transcription. It is over an order of magnitude improvement on vanilla Whisper and has CPU and GPU modes. You may want to allow a whole pipeline to be passed in, to future-proof this endpoint against new tooling.

Error when trying to Pipe Linkedin profile

thepipe https://www.linkedin.com/in/spencer-reitsma-8a3938151/
Extracting from website...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Python312\Scripts\thepipe.exe\__main__.py", line 7, in <module>
File "C:\Users\Spenc\AppData\Roaming\Python\Python312\site-packages\thepipe_api\thepipe.py", line 60, in main
chunks = extractor.extract_from_source(source=args.source, match=args.match, ignore=args.ignore, limit=args.limit, verbose=args.verbose, ai_extraction=args.ai_extraction, text_only=args.text_only, local=args.local)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Spenc\AppData\Roaming\Python\Python312\site-packages\thepipe_api\extractor.py", line 57, in extract_from_source
return extract_url(url=source, text_only=text_only, local=local)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Spenc\AppData\Roaming\Python\Python312\site-packages\thepipe_api\extractor.py", line 292, in extract_url
raise ValueError(f"{response['error']}")
ValueError: Page.evaluate: Execution context was destroyed, most likely because of a navigation
PS D:\Downloads\Project Templates for reference only>

Running "Locally"

Multiple questions:
What are the recommended/required resources for local extraction?

When running locally can you provide us the option to expose a port and receive POST requests? That way we can have an on prem machine that can work interchangeably with your API for client machines.
