cambioml / uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering

LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!

Home Page: https://www.cambioml.com

License: Apache License 2.0

Python 99.61% Shell 0.39%
data-cleaning generative-ai llm

uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering's Introduction

🌊 uniflow


uniflow provides a unified LLM interface to extract and transform raw documents.

  • Document types: Uniflow enables data extraction from PDFs, HTMLs and TXTs.
  • LLM agnostic: Uniflow supports the most commonly used LLMs for text transformation, including OpenAI, Hugging Face, and LMQG models (see the Config table below).

โ“ The Problems to Tackle

Uniflow addresses two key challenges in preparing LLM training data for ML scientists:

  • first, extracting legacy documents like PDFs and Word files into clean text, which LLMs can learn from, is tricky due to complex PDF layouts and missing information during extraction; and
  • second, the labor-intensive process of transforming extracted data into a format suitable for training LLMs, which involves creating datasets with both preferred and rejected answers for each question to support feedback-based learning techniques.

Hence, we built Uniflow, a unified LLM interface to extract and transform raw documents.

🌱 Use Cases

Uniflow aims to help every data scientist generate their own privacy-preserved, ready-to-use training datasets for LLM finetuning, and hence make finetuning LLMs more accessible to everyone 🚀.

Check out Uniflow's hands-on solutions in the example folder.


💻 Installation

Installing uniflow takes about 5-10 minutes if you follow the 3 steps below:

  1. Create a conda environment on your terminal using:

    conda create -n uniflow python=3.10 -y
    conda activate uniflow  # some OSes require `source activate uniflow`
    
  2. Install the compatible pytorch based on your OS.

    • If you are on a GPU instance, install pytorch based on your CUDA version. You can find your CUDA version via nvcc -V.
      pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
      
    • If you are on a CPU instance,
      pip3 install torch
      
  3. Install uniflow:

    pip3 install uniflow
    
    • (Optional) If you are running one of the OpenAI flows, you will have to set up your OpenAI API key. To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:

      OPENAI_API_KEY=YOUR_API_KEY
      
    • (Optional) If you are running the HuggingfaceModelFlow, you will also need to install the transformers, accelerate, bitsandbytes, scipy libraries:

      pip3 install transformers accelerate bitsandbytes scipy
      
    • (Optional) If you are running the LMQGModelFlow, you will also need to install the lmqg and spacy libraries:

      pip3 install lmqg spacy
      

Congrats, you have finished the installation!
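
To confirm that the package is importable, here is a quick sanity check (a minimal sketch; run it from the environment you just created):

import uniflow  # should import without errors once `pip3 install uniflow` has completed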

👨‍💻 Dev Setup

If you are interested in contributing, here is the preliminary development setup.

conda create -n uniflow python=3.10 -y
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root

AWS EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following config:

  • EC2 g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
  • Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
  • EBS: at least 100 GB

API keys

If you are running one of the OpenAI flows, you will have to set up your OpenAI API key.

To do so, create a .env file in your root uniflow folder. Then add the following line to the .env file:

OPENAI_API_KEY=YOUR_API_KEY
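
If you want to verify that the key is visible to your Python process, here is a minimal sketch assuming you use the python-dotenv package (an assumption; uniflow may also load the .env file for you):

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed: pip3 install python-dotenv

load_dotenv()  # reads the .env file from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"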

📜 Uniflow Manual

Overview

To use uniflow, follow three main steps:

  1. Pick a Config
    This determines the LLM and the different configurable parameters.

  2. Construct your Prompts
    Construct the context that you want to use to prompt your model. You can configure custom instructions and examples using the PromptTemplate class.

  3. Run your Flow
    Run the flow on your input data and generate output from your LLM.

Note: We're also building Preprocessing flows to help process data from different sources, such as PDF, HTML, Markdown, and more.
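
Putting the three steps together, here is a minimal end-to-end sketch using the default OpenAI config, instruction, and few-shot examples described below (it assumes TransformOpenAIConfig can be constructed with its defaults and that your OPENAI_API_KEY is set):

from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.op.prompt import Context

# 1. Pick a Config (here, the default OpenAI config).
config = TransformOpenAIConfig()

# 2. Construct your prompts as a list of Context objects.
data = [Context(context="The quick brown fox jumps over the lazy brown dog.")]

# 3. Run the flow.
client = TransformClient(config)
output = client.run(data)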

1. Config

The Config determines which LLM is used and how the input data is serialized and deserialized. It also has parameters that are specific to the LLM.

Here is a table of the different pre-defined configurations you can use and their corresponding LLMs:

| Config | LLM |
| ------ | --- |
| Config | gpt-3.5-turbo-1106 |
| OpenAIConfig | gpt-3.5-turbo-1106 |
| HuggingfaceConfig | mistralai/Mistral-7B-Instruct-v0.1 |
| LMQGConfig | lmqg/t5-base-squad-qg-ae |

You can run each config with the defaults, or you can pass in custom parameters, such as temperature or batch_size to the config for your use case. See the advanced custom configuration section for more details.
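
For example, here is a hedged sketch of running with the defaults versus overriding a model parameter (the keyword names follow the Model Config tables below):

from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig

# Run with the defaults:
config = TransformOpenAIConfig()

# Or override specific model parameters:
config = TransformOpenAIConfig(
    model_config=OpenAIModelConfig(temperature=0.5),
)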

2. Prompting

By default, uniflow is set up to generate questions and answers based on the Context you pass in. To do so, it has a default instruction and few-shot examples that it uses to guide the LLM.

Here is the default instruction:

Generate one question and its corresponding answer based on the last context in the last example. Follow the format of the examples below to include context, question, and answer in the response

Here are the default few-shot examples:

    context="The quick brown fox jumps over the lazy brown dog.",
    question="What is the color of the fox?",
    answer="brown."

    context="The quick brown fox jumps over the lazy black dog.",
    question="What is the color of the dog?",
    answer="black."

To run with these default instructions and examples, all you need to do is pass in a list of Context objects to the flow. uniflow will then generate a custom prompt with the instructions and few-shot examples for each Context object to send to the LLM. See the Running the flow section for more details.

Context

The Context class is used to pass in the context for the LLM prompt. A Context consists of a context property, which is a string of text.

To run uniflow with the default instructions and few-shot examples, you can pass in a list of Context objects to the flow. For example:

from uniflow.op.prompt import Context

data = [
    Context(
        context="The quick brown fox jumps over the lazy brown dog.",
    ),
    ...
]

client.run(data)

For a more detailed overview of running the flow, see the Running the flow section.

PromptTemplate

If you want to run with a custom prompt instruction or few-shot examples, you can use the PromptTemplate object. It has instruction and examples properties.

| Property | Type | Description |
| -------- | ---- | ----------- |
| instruction | str | Detailed instructions for the LLM |
| examples | List[Context] | The few-shot examples |

You can overwrite any of the defaults as needed.

To see an example of how to use the PromptTemplate to run uniflow with a custom instruction, few-shot examples, and custom Context fields to generate a summary, check out the openai_pdf_source_10k_summary notebook.
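
As a smaller illustration, here is a hedged sketch of a custom question-answer PromptTemplate assembled from the default instruction and one of the default few-shot examples shown above (the examples are passed via the few_shot_prompt keyword, as in the code later in this README):

from uniflow.op.prompt import Context, PromptTemplate

qa_prompt = PromptTemplate(
    instruction="Generate one question and its corresponding answer based on the last context in the last example. Follow the format of the examples below to include context, question, and answer in the response",
    few_shot_prompt=[
        Context(
            context="The quick brown fox jumps over the lazy brown dog.",
            question="What is the color of the fox?",
            answer="brown.",
        ),
    ],
)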

Running the Flow

Once you've decided on your Config and prompting strategy, you can run the flow on the input data.

  1. Import the uniflow Client, Config, and Context objects.

    from uniflow.flow.client import TransformClient
    from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig
    from uniflow.op.prompt import Context
    
  2. Preprocess your data into chunks to pass into the flow. In the future we will have Preprocessing flows to help with this step, but for now you can use a library of your choice, like pypdf, to chunk your data.

    raw_input_context = ["It was a sunny day and the sky color is blue.", "My name is bobby and I am a talent software engineer working on AI/ML."]
    
  3. Create a list of Context objects to pass your data into the flow.

    data = [
        Context(context=c)
        for c in raw_input_context
    ]
    
  4. [Optional] If you want to use a customized instruction and/or examples, create a PromptTemplate.

    from uniflow.op.prompt import PromptTemplate
    
    guided_prompt = PromptTemplate(
        instruction="Generate a one sentence summary based on the last context below. Follow the format of the examples below to include context and summary in the response",
        few_shot_prompt=[
            Context(
                context="When you're operating on the maker's schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to remember to go to the meeting. That's no problem for someone on the manager's schedule. There's always something coming on the next hour; the only question is what. But when someone on the maker's schedule has a meeting, they have to think about it.",
                summary="Meetings disrupt the productivity of those following a maker's schedule, dividing their time into impractical segments, while those on a manager's schedule are accustomed to a continuous flow of tasks.",
            ),
        ],
    )
    
  5. Create a Config object to pass into the Client object.

    config = TransformOpenAIConfig(
        prompt_template=guided_prompt,
        model_config=OpenAIModelConfig(
            response_format={"type": "json_object"}
        ),
    )
    client = TransformClient(config)
    
  6. Use the client object to run the flow on the input data.

    output = client.run(data)
    
  7. Process the output data. By default, the LLM output will be a list of output dicts, one for each Context passed into the flow. Each dict has a response property containing the LLM response, as well as an error property with any errors. For example, output[0]['output'][0] would look like this:

    {
        'response': [{'context': 'It was a sunny day and the sky color is blue.',
        'question': 'What was the color of the sky?',
        'answer': 'blue.'}],
        'error': 'No errors.'
    }
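
    For instance, here is a minimal sketch that collects the generated question-answer pairs from the output, assuming the default QA-style responses shown above and no errors:

    qa_pairs = []
    for item in output:                   # one item per Context passed into the flow
        for out in item["output"]:
            for resp in out["response"]:  # the parsed LLM response
                qa_pairs.append((resp["question"], resp["answer"]))
    print(qa_pairs)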
    

Examples

For more examples, see the example folder.

Advanced Custom Configuration

You can also configure the flows by passing custom configurations or arguments to the Config object if you want to further tune specific parameters like the LLM model, the number of threads, the temperature, and more.

Every configuration has the following parameters:

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| prompt_template | PromptTemplate | The template to use for the guided prompt. |
| num_threads | int | The number of threads to use for the flow. |
| model_config | ModelConfig | The configuration to pass to the model. |

You can further configure the model_config by passing in one of the Model Configs with custom parameters.

Model Config

The Model Config is passed to the base Config object. It determines which LLM is used and holds parameters specific to that LLM.

ModelConfig

The base config is called ModelConfig and has the following parameters:

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| model_name | str | gpt-3.5-turbo-1106 | OpenAI site |

OpenAIModelConfig

The OpenAIModelConfig inherits from the ModelConfig and has the following additional parameters:

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| num_calls | int | 1 | The number of calls to make to the OpenAI API. |
| temperature | float | 1.5 | The temperature to use for the OpenAI API. |
| response_format | Dict[str, str] | {"type": "text"} | The response format to use for the OpenAI API. Can be "text" or "json_object". |

HuggingfaceModelConfig

The HuggingfaceModelConfig inherits from the ModelConfig, but overrides the model_name parameter to use the mistralai/Mistral-7B-Instruct-v0.1 model by default.

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| model_name | str | mistralai/Mistral-7B-Instruct-v0.1 | Hugging Face site |
| batch_size | int | 1 | The batch size to use for the Hugging Face API. |
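
Here is a hedged sketch of overriding these parameters; the import paths for TransformHuggingFaceConfig and HuggingfaceModelConfig are assumptions based on the import patterns elsewhere in this README:

from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformHuggingFaceConfig  # assumed location, alongside TransformOpenAIConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig  # assumed location, alongside OpenAIModelConfig

config = TransformHuggingFaceConfig(
    model_config=HuggingfaceModelConfig(
        model_name="mistralai/Mistral-7B-Instruct-v0.1",
        batch_size=2,
    ),
)
client = TransformClient(config)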

LMQGModelConfig

The LMQGModelConfig inherits from the ModelConfig, but overrides the model_name parameter to use the lmqg/t5-base-squad-qg-ae model by default.

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| model_name | str | lmqg/t5-base-squad-qg-ae | Hugging Face site |
| batch_size | int | 1 | The batch size to use for the LMQG API. |

Custom Configuration Example

Here is an example of how to pass in a custom configuration to the Client object:

from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig, OpenAIModelConfig
from uniflow.op.prompt import Context


contexts = ["It was a sunny day and the sky color is blue.", "My name is bobby and I am a talent software engineer working on AI/ML."]

data = [
    Context(
        context=c
    )
    for c in contexts
]

config = TransformOpenAIConfig(
  num_threads=2,
  model_config=OpenAIModelConfig(
    model_name="gpt-4",
    num_calls=2,
    temperature=0.5,
  ),
)
client = TransformClient(config)
output = client.run(data)

As you can see, we pass custom parameters through the OpenAIModelConfig to the TransformOpenAIConfig according to our needs.

uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering's People

Contributors

boqiny, c-sirui, callmenafiy, cambioml, cluckrookie, frank-suwen, goldmermaid, jojortz, llauraa23, panzy-18, riboyuan99, sayazhang, sdddell, seisserenata, vicshi06, zhihanchen03


uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering's Issues

Extract Client run returns error related to "partial" in extract_pdf notebook

Python 3.10
nougat-ocr 0.1.17

Code before error

%reload_ext autoreload
%autoreload 2
import sys
sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")
import os
import pandas as pd
from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig
from uniflow.op.model.model_config import OpenAIModelConfig, NougatModelConfig
from uniflow.op.prompt_schema import GuidedPrompt, Context
dir_cur = os.getcwd()
pdf_file = "1408.5882_page-1.pdf"
input_file = os.path.join(f"{dir_cur}/data/raw_input/", pdf_file)

data = [
    {"pdf": input_file},
]

config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name = "0.1.0-small",
        batch_size = 1 # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    )
)
nougat_client = ExtractClient(config)

output = nougat_client.run(data)
output

output prints [{'error': "name 'partial' is not defined"}]

huggingface_model_neuron failed on inf2.8x with batch_size= 1 or 2

code: uniflow-llm-text-data-cleaning-cluster/example/transform/huggingface_model_neuron.ipynb
Error message
[{'error': '(256, 4)',
'traceback': 'Traceback (most recent call last):\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/flow/server.py", line 159, in _run_flow\n output = f(input_list)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/flow/flow.py", line 36, in call\n nodes = self.run(nodes)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/flow/transform/transform_huggingface_flow.py", line 45, in run\n return self._model_op(nodes)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/op/model/model_op.py", line 40, in call\n value_dict = self._model.run(value_dict)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/op/model/abs_llm_processor.py", line 72, in run\n data = self._model_server(serialized_data)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/op/model/model_server.py", line 469, in call\n data = self._pipeline(data)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/uniflow/op/model/neuron_utils.py", line 283, in neuron_infer\n sample_output = model.generate(\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers_neuronx/generation_utils.py", line 45, in generate\n return super().generate(*args, **kwargs)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context\n return func(*args, **kwargs)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers/generation/utils.py", line 1525, in generate\n return self.sample(\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers/generation/utils.py", line 2622, in sample\n outputs = self(\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl\n return forward_call(*input, **kwargs)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers_neuronx/generation_utils.py", line 33, in forward\n out_logits = self.model(input_ids, cache_ids, start_ids)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl\n return forward_call(*input, **kwargs)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers_neuronx/mistral/model.py", line 144, in forward\n logits = self._forward(hidden, cache_ids, start_ids, last_token_id, curr_window_start, neuron_config=self.neuron_config)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers_neuronx/base.py", line 358, in _forward\n logits = self.context(hidden, *args, neuron_config=neuron_config)\n File "/home/ec2-user/aws_neuron_venv_pytorch/lib64/python3.9/site-packages/transformers_neuronx/base.py", line 188, in context\n model = self.decoder_lm_head_for_context[estimate, batch_size]\nKeyError: (256, 4)\n'},
...

Bug: RecursiveSplitter removes all spaces

๐Ÿ› Describe the bug

If I manually set the chunk_size in RecursiveSplitter to 50, it would remove all blank spaces.

Running this input:

One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer.
You can't understand the world without understanding the concept of superlinear returns. And if you're ambitious you definitely should, because this will be the wave you surf on.
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

I got this output:

[[{'output': [{'text': ["OneofthemostimportantthingsIdidn'tunderstandabout", 'theworldwhenIwasachildisthedegreetowhichthereturns', 'forperformancearesuperlinear.', 'Teachersandcoachesimplicitlytoldusthereturnswere', 'linear', '"Yougetout,"Iheardathousandtimes,"whatyouputin."', 'Theymeantwell,butthisisrarelytrue', "Ifyourproductisonlyhalfasgoodasyourcompetitor's,", "youdon'tgethalfasmanycustomers", 'You get no customers, and you go out of business.', "It'sobviouslytruethatthereturnsforperformanceare", 'superlinearinbusiness', 'Somethinkthisisaflawofcapitalism,andthatifwe', 'changedtherulesitwouldstopbeingtrue', 'Butsuperlinearreturnsforperformanceareafeatureof', "theworld,notanartifactofruleswe'veinvented", 'Weseethesamepatterninfame,power,militaryvictories,', 'knowledge,andevenbenefittohumanity', 'In all of these, the rich get richer.', "Youcan'tunderstandtheworldwithoutunderstandingthe", 'conceptofsuperlinearreturns', "Andifyou'reambitiousyoudefinitelyshould,because", 'thiswillbethewaveyousurfon.', "OneofthemostimportantthingsIdidn'tunderstandabout", 'theworldwhenIwasachildisthedegreetowhichthereturns', 'forperformancearesuperlinear.', 'Teachersandcoachesimplicitlytoldusthereturnswere', 'linear', '"Yougetout,"Iheardathousandtimes,"whatyouputin."', 'Theymeantwell,butthisisrarelytrue', "Ifyourproductisonlyhalfasgoodasyourcompetitor's,", "youdon'tgethalfasmanycustomers", 'You get no customers, and you go out of business.']}]}]]

Versions

0.0.30

Request: Customized Chunk Size for RecursiveSplitter

🚀 The feature, motivation and pitch

Currently, there is no way to customize the chunk size variable for RecursiveSplitter. The default is 1024 characters. If we are able to supply a customized chunk size when defining config, that would be great.

Alternatives

No response

Additional context

No response

Majority vote gives wrong label due to model inconsistency.

In example/rater/generated_answer.ipynb, for an input whose true label is equivalent, the model sometimes generates accept or reject, so the majority vote can give the wrong label.

Input:

("Vitamin C (also known as ascorbic acid and ascorbate) is a water-soluble vitamin found in citrus and other fruits, berries and vegetables, also sold as a dietary supplement and as a topical serum ingredient to treat melasma (dark pigment spots) and wrinkles on the face.",
"Is Vitamin C water-soluble?",
"Yes, Vitamin C is a very water-soluble vitamin.",
"Yes, Vitamin C can be dissolved in water well."), # Equally good

Run:

config2 = RaterForGeneratedAnswerOpenAIGPT3p5Config()
config2.model_config.num_call = 3
config2.model_config.temperature = 0.9

with OpScope(name="TextFlow"):
    client2 = RaterClient(config2)

output = client2.run(data)
pprint.pprint(output)

Output:

{'output': [{'average_score': 0.0,
              'error': 'No errors.',
              'majority_vote': 'reject',
              'response': ['explanation: The grounding answer is better '
                           'because it directly states that Vitamin C is "very '
                           'water-soluble," while the generated answer is more '
                           'vague in saying that it "can be dissolved in water '
                           'well."\n'
                           'label: reject',
                           'explanation: Both the grounding answer and the '
                           'generated answer correctly state that Vitamin C is '
                           'water-soluble, so they are equivalent.\n'
                           'label: equivalent',
                           'explanation: The generated answer is better '
                           'because it accurately states that Vitamin C is '
                           'water-soluble, which aligns with the information '
                           'provided in the context.\n'
                           'label: accept'],
              'scores': [-1.0, 0.0, 1.0],
              'votes': ['reject', 'equivalent', 'accept']}],

Here 'majority_vote': 'reject' is wrong.

In extract_pdf_nougat_qa.ipynb, ExtractClient(config) gets 'NoneType' object error

๐Ÿ› Describe the bug

In extract_pdf_nougat_qa.ipynb (https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/main/example/extract/extract_pdf_nougat_qa.ipynb), running this block gets a 'NoneType' object error:

data = [
    {"filename": input_file},
]

config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name = "facebook/nougat-small",
        batch_size = 2
    ),
    splitter=PARAGRAPH_SPLITTER,
)
nougat_client = ExtractClient(config)

Error:

TypeError Traceback (most recent call last)
Cell In[13], line 12
1 data = [
2 {"filename": input_file},
3 ]
5 config = ExtractPDFConfig(
6 model_config=NougatModelConfig(
7 model_name = "facebook/nougat-small",
(...)
10 splitter=PARAGRAPH_SPLITTER,
11 )
---> 12 nougat_client = ExtractClient(config)
14 # output = nougat_client.run(data)

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/flow/client.py:28, in ExtractClient.__init__(self, config)
21 """Client constructor
22
23 Args:
24 config (Config): Config for the flow
25
26 """
27 self._config = config
---> 28 self._server = ExtractServer(asdict(self._config))

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/flow/server.py:49, in ExtractServer.__init__(self, config)
47 for i in range(self.num_thread):
48 with OpScope(name="thread" + str(i)):
---> 49 self._flow_queue.put(self._flow_cls(**kwargs))

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/flow/extract/extract_pdf_flow.py:33, in ExtractPDFFlow.__init__(self, model_config, splitter)
24 """Extract PDF Flow Constructor.
25
26 Args:
27 model_config (Dict[str, Any]): Model config.
28 splitter (str): Splitter to use. Defaults to "".
29 """
30 super().__init__()
31 self._extract_pdf_op = ExtractPDFOp(
32 name="extract_pdf_op",
---> 33 model=CvModel(
34 model_config=model_config,
35 ),
36 )
37 self._process_pdf_op = ProcessPDFOp(name="process_pdf_op")
38 self._split_op = SplitterOpsFactory.get(splitter)

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/op/model/cv/model.py:28, in CvModel.__init__(self, model_config)
19 def __init__(
20 self,
21 model_config: Dict[str, Any],
22 ) -> None:
23 """Initialize Preprocess Model class.
24
25 Args:
26 model_config (Dict[str, Any]): Model config.
27 """
---> 28 super().__init__(prompt_template=None, model_config=model_config)

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/op/model/abs_model.py:29, in AbsModel.__init__(self, prompt_template, model_config)
22 """Initialize Model class.
23
24 Args:
25 prompt_template (PromptTemplate): Guided prompt template.
26 model_config (Dict[str, Any]): Model config.
27 """
28 model_server_cls = ModelServerFactory.get(model_config["model_server"])
---> 29 self._model_server = model_server_cls(prompt_template, model_config)
30 self._prompt_template = prompt_template
31 self._num_samples = 1

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/uniflow/op/model/cv/model_server.py:36, in NougatModelServer.__init__(self, prompt_template, model_config)
34 self._model_config = NougatModelConfig(**self._model_config)
35 self.dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
---> 36 self.processor = NougatProcessor.from_pretrained(
37 self._model_config.model_name, torch_dtype=self.dtype
38 )
39 self.model = VisionEncoderDecoderModel.from_pretrained(
40 self._model_config.model_name, torch_dtype=self.dtype
41 )
42 self.device = "cuda" if torch.cuda.is_available() else "cpu"

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/transformers/processing_utils.py:465, in ProcessorMixin.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, **kwargs)
462 if token is not None:
463 kwargs["token"] = token
--> 465 args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
466 processor_dict, kwargs = cls.get_processor_dict(pretrained_model_name_or_path, **kwargs)
468 return cls.from_args_and_dict(args, processor_dict, **kwargs)

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/transformers/processing_utils.py:511, in ProcessorMixin._get_arguments_from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
508 else:
509 attribute_class = getattr(transformers_module, class_name)
--> 511 args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
512 return args

File /opt/conda/envs/uniflow/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:409, in AutoImageProcessor.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
407 return image_processor_class.from_dict(config_dict, **kwargs)
408 elif image_processor_class is not None:
--> 409 return image_processor_class.from_dict(config_dict, **kwargs)
410 # Last try: we use the IMAGE_PROCESSOR_MAPPING.
411 elif type(config) in IMAGE_PROCESSOR_MAPPING:

TypeError: 'NoneType' object is not callable


Versions

Collecting environment information...
PyTorch version: 2.4.0.dev20240317+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1055-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 2499.998
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 2 MiB
L3 cache: 35.8 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-triton==3.0.0+989adb9a29
[pip3] torch==2.4.0.dev20240317+cu121
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pytorch-triton 3.0.0+989adb9a29 pypi_0 pypi
[conda] torch 2.4.0.dev20240317+cu121 pypi_0 pypi

Error: Unable to access Inf2 metadata service

TLDR
I am running the script huggingface_model_neuron.ipynb on AWS EC2 Inf2.xlarge, and running client.run() returns "Neuron model can only run on the AWS EC2 inf2 instance series."

Tracing through the neuron_utils.py code, it seems that get_instance_type() does not return 200.


On terminal:

ubuntu@ip-172-31-0-162:~/uniflow$ neuron-ls
instance-type: inf2.xlarge
instance-id: i-0927e14c03abb4ba5
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
+--------+--------+--------+---------+

Run huggingface_model_neuron.ipynb:

config = TransformHuggingFaceConfig(
    prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(batch_size=4, model_name='mistralai/Mistral-7B-Instruct-v0.2', neuron=True))
client = TransformClient(config)

Returns the "Neuron model can only run on the AWS EC2 inf2 instance series" error.
