
llmlingua's Introduction

LLMLingua

LLMLingua Series | Effectively Deliver Information to LLMs via Prompt Compression

| Project Page | LLMLingua | LongLLMLingua | LLMLingua-2 | LLMLingua Demo | LLMLingua-2 Demo |

LLMLingua_demo.mp4

News

  • 🧩 LLMLingua has been integrated into Prompt flow, a streamlined tool framework for LLM-based AI applications.
  • 🦚 We're excited to announce the release of LLMLingua-2, boasting a 3x-6x speed improvement over LLMLingua! For more information, check out our paper, visit the project page, and explore our demo.
  • 👾 LLMLingua has been integrated into LangChain and LlamaIndex, two widely-used RAG frameworks.
  • 🤳 Talk slides from AI Time (Jan 24) are available.
  • 🖥 EMNLP'23 slides are available in Session 5 and BoF-6.
  • 📚 Check out our new blog post discussing RAG benefits and cost savings through prompt compression. See the script example here.
  • 🎈 Visit our project page for real-world case studies in RAG, Online Meetings, CoT, and Code.
  • 👨‍🦯 Explore our './examples' directory for practical applications, including LLMLingua-2, RAG, Online Meeting, CoT, Code, and RAG using LlamaIndex.

TL;DR

LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.

LongLLMLingua mitigates the 'lost in the middle' issue in LLMs, enhancing long-context information processing. It reduces costs and boosts efficiency with prompt compression, improving RAG performance by up to 21.4% using only 1/4 of the tokens.

LLMLingua-2, a small-size yet powerful prompt compression method trained via data distillation from GPT-4 for token classification with a BERT-level encoder, excels in task-agnostic compression. It surpasses LLMLingua in handling out-of-domain data, offering 3x-6x faster performance.

🎥 Overview

Background

  • Ever encountered the token limit when asking ChatGPT to summarize lengthy texts?
  • Frustrated with ChatGPT forgetting previous instructions during extended interactions?
  • Experienced high costs using GPT3.5/4 API for experiments despite excellent results?

While Large Language Models like ChatGPT and GPT-4 excel in generalization and reasoning, they often face challenges like prompt length limits and prompt-based pricing schemes.

Motivation for LLMLingua

Now you can use LLMLingua, LongLLMLingua, and LLMLingua-2!

These tools offer an efficient solution to compress prompts by up to 20x, enhancing the utility of LLMs.

  • 💰 Cost Savings: Reduces both prompt and generation lengths with minimal overhead.
  • 📝 Extended Context Support: Enhances support for longer contexts, mitigates the "lost in the middle" issue, and boosts overall performance.
  • ⚖️ Robustness: No additional training needed for LLMs.
  • 🕵️ Knowledge Retention: Maintains original prompt information like ICL and reasoning.
  • 📜 KV-Cache Compression: Accelerates inference process.
  • 🪃 Comprehensive Recovery: GPT-4 can recover all key information from compressed prompts.

Framework of LLMLingua

Framework of LongLLMLingua

Framework of LLMLingua-2

PS: This demo is based on the alt-gpt project. Special thanks to @Livshitz for their valuable contribution.

If you find this repo helpful, please cite the following papers:

@inproceedings{jiang-etal-2023-llmlingua,
    title = "{LLML}ingua: Compressing Prompts for Accelerated Inference of Large Language Models",
    author = "Huiqiang Jiang and Qianhui Wu and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.825",
    doi = "10.18653/v1/2023.emnlp-main.825",
    pages = "13358--13376",
}
@article{jiang-etal-2023-longllmlingua,
    title = "{L}ong{LLML}ingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression",
    author = "Huiqiang Jiang and Qianhui Wu and and Xufang Luo and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    url = "https://arxiv.org/abs/2310.06839",
    journal = "ArXiv preprint",
    volume = "abs/2310.06839",
    year = "2023",
}
@article{wu2024llmlingua2,
    title = "{LLML}ingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression",
    author = "Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Ruhle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang",
    url = "https://arxiv.org/abs/2403.12968",
    journal = "ArXiv preprint",
    volume = "abs/2403.12968",
    year = "2024",
}

🎯 Quick Start

1. Installing LLMLingua:

To get started with LLMLingua, simply install it using pip:

pip install llmlingua

2. Using LLMLingua Series Methods for Prompt Compression:

With LLMLingua, you can easily compress your prompts. Here’s how you can do it:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)

# > {'compressed_prompt': 'Question: Sam bought a dozen boxes, each with 30 highlighter pens inside, for $10 each box. He reanged five of boxes into packages of sixlters each and sold them $3 per. He sold the rest theters separately at the of three pens $2. How much did make in total, dollars?\nLets think step step\nSam bought 1 boxes x00 oflters.\nHe bought 12 * 300ters in total\nSam then took 5 boxes 6ters0ters.\nHe sold these boxes for 5 *5\nAfterelling these  boxes there were 3030 highlighters remaining.\nThese form 330 / 3 = 110 groups of three pens.\nHe sold each of these groups for $2 each, so made 110 * 2 = $220 from them.\nIn total, then, he earned $220 + $15 = $235.\nSince his original cost was $120, he earned $235 - $120 = $115 in profit.\nThe answer is 115',
#  'origin_tokens': 2365,
#  'compressed_tokens': 211,
#  'ratio': '11.2x',
#  'saving': ', Saving $0.1 in GPT-4.'}

## Or use the phi-2 model,
llm_lingua = PromptCompressor("microsoft/phi-2")

## Or use a quantized model, such as TheBloke/Llama-2-7b-Chat-GPTQ, which only needs <8GB of GPU memory.
## Before that, you need to pip install optimum auto-gptq
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

To try LongLLMLingua in your scenarios, you can use

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(
    prompt_list,
    question=question,
    rate=0.55,
    # Set the special parameter for LongLLMLingua
    condition_in_question="after_condition",
    reorder_context="sort",
    dynamic_context_compression_ratio=0.3, # or 0.4
    condition_compare=True,
    context_budget="+100",
    rank_method="longllmlingua",
)

To try LLMLingua-2 in your scenarios, you can use

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True, # Whether to use llmlingua-2
)
compressed_prompt = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens = ['\n', '?'])

## Or use LLMLingua-2-small model
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True, # Whether to use llmlingua-2
)

3. Advanced usage - Structured Prompt Compression:

Split the text into sections and decide, for each section, whether to compress it and at what rate. Use <llmlingua></llmlingua> tags for context segmentation, with optional rate and compress parameters.

structured_prompt = """<llmlingua, compress=False>Speaker 4:</llmlingua><llmlingua, rate=0.4> Thank you. And can we do the functions for content? Items I believe are 11, three, 14, 16 and 28, I believe.</llmlingua><llmlingua, compress=False>
Speaker 0:</llmlingua><llmlingua, rate=0.4> Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group in the City Manager Department by $200 to provide a contribution to the Friends of the Long Beach Public Library. Item 12 is communication from Councilman Super Now. Recommendation to increase appropriation in the special advertising and promotion fund group and the city manager's department by $10,000 to provide support for the end of summer celebration. Item 13 is a communication from Councilman Austin. Recommendation to increase appropriation in the general fund group in the city manager department by $500 to provide a donation to the Jazz Angels . Item 14 is a communication from Councilman Austin. Recommendation to increase appropriation in the general fund group in the City Manager department by $300 to provide a donation to the Little Lion Foundation. Item 16 is a communication from Councilman Allen recommendation to increase appropriation in the general fund group in the city manager department by $1,020 to provide contribution to Casa Korero, Sew Feria Business Association, Friends of Long Beach Public Library and Dave Van Patten. Item 28 is a communication. Communication from Vice Mayor Richardson and Council Member Muranga. Recommendation to increase appropriation in the general fund group in the City Manager Department by $1,000 to provide a donation to Ron Palmer Summit. Basketball and Academic Camp.</llmlingua><llmlingua, compress=False>
Speaker 4:</llmlingua><llmlingua, rate=0.6> We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.</llmlingua>"""
compressed_prompt = llm_lingua.structured_compress_prompt(structured_prompt, instruction="", question="", rate=0.5)
print(compressed_prompt['compressed_prompt'])

# > Speaker 4:. And can we do the functions for content? Items I believe are11,,116 28,.
# Speaker 0: a from Council on Price to increase the fund group the Manager0 provide a the the1 is Councilman Super Now. the special group the provide the summerman a the Jazzels a communication from Councilman Austin. Recommendation to increase appropriation in the general fund group in the City Manager department by $300 to provide a donation to the Little Lion Foundation. Item 16 is a communication from Councilman Allen recommendation to increase appropriation in the general fund group in the city manager department by $1,020 to provide contribution to Casa Korero, Sew Feria Business Association, Friends of Long Beach Public Library and Dave Van Patten. Item 28 is a communication. Communication from Vice Mayor Richardson and Council Member Muranga. Recommendation to increase appropriation in the general fund group in the City Manager Department by $1,000 to provide a donation to Ron Palmer Summit. Basketball and Academic Camp.
# Speaker 4: We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.

4. Learning More:

To understand how to apply LLMLingua and LongLLMLingua in real-world scenarios like RAG, Online Meetings, CoT, and Code, please refer to our examples. For detailed guidance, the documentation provides extensive recommendations on effectively utilizing LLMLingua.

5. Data collection and model training of LLMLingua-2:

To train the compressor on your custom data, please refer to our data_collection and model_training.

Frequently Asked Questions

For more insights and answers, visit our FAQ section.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

llmlingua's People

Contributors

bobchao, davidberenstein1957, eltociear, gmaliar, grcharles, iofu728, kexplo, microsoft-github-operations[bot], microsoftopensource, qianhuiwu, siyunzhao, speuce, wlsdml1114, yasyf


llmlingua's Issues

Params to use for compressing Dialogues

Hi,

Thanks for this amazing piece of work. I was trying to use this framework to compress a prompt, which has a dialogue between two people as context & I was trying to compress the dialogue alone. I leave the instruction & question uncompressed.

So far, even with low compression ratios like 0.1-0.15, I'm seeing significant deviation in outputs for the compressed prompt compared to the original, uncompressed prompt. In fact, the compressed prompt that comes out also tends to be unintelligible in quite a few places. I was using the same params as you do here, although I'm not entirely sure what context_budget does exactly.

Also, I currently pass the dialogue in as a str. Would it make any difference to segment it line by line and pass it in as a List[str]?

The dialogue has speaker roles like Agent: & Customer: that are sometimes dropped after compression. Is there a way I can make sure certain tokens are never dropped? I'm guessing that's what force_context_ids is for? Does this param take input_ids after tokenizing? I'm confused.

Do you have any suggestions on what would be a good/optimal param setting to compress dialogues?
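
One possible setup, as a minimal sketch: split the dialogue into turns and pass it as a List[str], so each turn becomes its own context and context-level filtering works per turn. The dialogue below is a made-up placeholder, and the force_context_ids values are assumed to be indices into that context list (not tokenizer input_ids), which matches how it is discussed above.

from llmlingua import PromptCompressor

# Placeholder dialogue; each element is one turn.
dialogue = [
    "Agent: Thank you for calling, how can I help you today?",
    "Customer: My last invoice looks wrong, I think I was charged twice.",
    "Agent: Sorry about that, let me pull up your account.",
    "Customer: Thanks, the account number is 123456.",
]

llm_lingua = PromptCompressor()
compressed = llm_lingua.compress_prompt(
    dialogue,                          # List[str] instead of a single str
    instruction="Summarize the customer's issue.",
    question="What problem is the customer reporting?",
    rate=0.5,
    force_context_ids=[0, 1],          # assumed: context-list indices to always keep
)
print(compressed["compressed_prompt"])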

[design] Interface Design

Purpose

Engine Interface

# Example of engine interface
  • get_ppl: Includes Perplexity (PPL) and Contrastive PPL calculation, supports KV-cache.
  • get_relevance_rank: Returns the relevance ranking between context and question.

Core Interface

  • coarse_level_compression_in_document: Compresses the document/demonstration at a coarse level, allocating the budget and dynamic compression ratio.

  • coarse_level_compression_in_sentence: Performs coarse-level compression of sentences.
  • iterative_token_level_compression: Compresses the prompt at the token level.
  • subsequence_recover: Recovers based on the subsequence relationship.

Wrapper Interface

  • compress_prompt: Returns the compressed prompt.
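
A rough Python sketch of how these three layers could be stubbed out, purely for illustration: the method names follow the lists above, but the signatures, types, and budget value are assumptions, not the actual implementation.

from typing import List, Tuple


class Engine:
    """Lowest layer: raw small-LM calls."""

    def get_ppl(self, text: str, past_key_values=None) -> Tuple[List[float], object]:
        """Return per-token (contrastive) perplexities, reusing a KV-cache when given."""
        raise NotImplementedError

    def get_relevance_rank(self, contexts: List[str], question: str) -> List[int]:
        """Return context indices ordered by relevance to the question."""
        raise NotImplementedError


class Core:
    """Middle layer: compression passes built on top of the engine."""

    def __init__(self, engine: Engine):
        self.engine = engine

    def coarse_level_compression_in_document(self, contexts: List[str], budget: int) -> List[str]:
        raise NotImplementedError  # allocate budget / dynamic ratio across documents

    def coarse_level_compression_in_sentence(self, contexts: List[str], budget: int) -> List[str]:
        raise NotImplementedError  # coarse-level compression of sentences

    def iterative_token_level_compression(self, text: str, rate: float) -> str:
        raise NotImplementedError  # token-level pruning

    def subsequence_recover(self, original: str, response: str) -> str:
        raise NotImplementedError  # recover spans via the subsequence relationship


class Wrapper:
    """Top layer: the user-facing call."""

    def __init__(self, core: Core):
        self.core = core

    def compress_prompt(self, contexts: List[str], question: str, rate: float) -> dict:
        kept = self.core.coarse_level_compression_in_document(contexts, budget=2000)
        compressed = self.core.iterative_token_level_compression("\n".join(kept), rate)
        return {"compressed_prompt": compressed}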

Support for remote LLM through API

Hi team,

Due to the computing resources needed to run this, it would be nice if you could also add an option where the user can provide a url_endpoint and api_key for a remote REST API, instead of downloading the model from Hugging Face.

llama_index and LLMLingua PromptCompressor inconsistency

Hi,

When calling the LongLLMLinguaPostprocessor function in the example https://github.com/microsoft/LLMLingua/blob/main/examples/RAGLlamaIndex.ipynb, I got an error:

TypeError: PromptCompressor.__init__() got an unexpected keyword argument 'use_auth_token'

The call got to this line
https://github.com/run-llama/llama_index/blob/4c922a1ff7e2d204c13bb926e1596669a937916a/llama_index/postprocessor/longllmlingua.py#L56
to call

self._llm_lingua = PromptCompressor(
    model_name=model_name,
    device_map=device_map,
    use_auth_token=use_auth_token,
    open_api_config=open_api_config,
)

which uses use_auth_token but the function PromptCompressor on this line

model_config: dict = {},

does not accept use_auth_token.

There seems to be some inconsistency between llama_index and LLMLingua.
Is there anything I could help with?

No improvement when applying LongLLMLingua after retrieval

In my situation, I have a retrieved list, and each item in the list contains a positive context and 19 negative contexts. After obtaining the list, I want to use LongLLMLingua for reranking. However, I didn't see any improvements, meaning that MRR@n and recall@n remain the same. Could you give some advice on improving the reranking performance?
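
For reference, a minimal sketch of how one might invoke the LongLLMLingua question-aware ranking on such a retrieved list; the passages are placeholders, and this only shows the call pattern, not a recipe guaranteed to improve MRR/recall.

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()

# Placeholder for 1 positive + 19 negative retrieved passages.
retrieved_contexts = [f"Passage {i}: some retrieved text ..." for i in range(20)]
question = "who got the first nobel prize in physics"

result = llm_lingua.compress_prompt(
    retrieved_contexts,
    question=question,
    rate=0.5,
    rank_method="longllmlingua",            # question-conditioned relevance ranking
    condition_in_question="after_condition",
    condition_compare=True,
    reorder_context="sort",                 # place higher-ranked contexts first
)
print(result["compressed_prompt"])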

Some questions about parameters?

Great work on prompt compression! But I have some questions about the parameters in the code and the paper.

1. Where is the granular control coefficient parameter 'k' from the LLMLingua paper in this code? I couldn't find it; I guess 'context_budget' (default '+100') has the same meaning. Is that right?

2. By the way, I also couldn't find the pre-defined compression rates for the instruction and question from the LLMLingua paper in this code.

3. In this code, 'token_budget_ratio' is the 'Budget ratio for sentence-level Prompt Compression' (default 1.4), but I cannot find this parameter (or the value 1.4) in the LLMLingua paper.

Thank you!

"IndexError: list index out of range" when compressing prompt

Hello,

I have been testing this library to be potentially included in our product & have constructed this small example to demonstrate the error I've repeatedly been getting:

from llmlingua import PromptCompressor

# Running with 'cpu'; on Mac without GPU
llm_lingua = PromptCompressor(device_map="cpu")

instruction = "You are a chatbot designed to answer questions about AI & ethics"
prompt = "Provide an in-depth exploration of the role of AI in enhancing the security and effectiveness of educational technology. Discuss the potential risks, such as data breaches and biased algorithms, the ethical considerations in using AI to monitor and assess student performance, and the best practices for safeguarding student data and privacy while leveraging AI to personalize and enhance the learning experience."

compressed_prompt = llm_lingua.compress_prompt(
    context = [],
    instruction = instruction,
    question = prompt
)

print (compressed_prompt)

I get the following error:

Traceback (most recent call last):
  File "/Users/PATH/research/test_prompt_compression.py", line 9, in <module>

  File "/Users/PATH/Library/Python/3.9/lib/python/site-packages/llmlingua/prompt_compressor.py", line 224, in compress_prompt
    context = self.iterative_compress_prompt(
  File "/Users/elangert/Library/Python/3.9/lib/python/site-packages/llmlingua/prompt_compressor.py", line 776, in iterative_compress_prompt
    for delta_end, ratio in iterative_ratios[idx]:
IndexError: list index out of range

I have tested this with prompts up to 200 tokens (& as low as 10 tokens) & have tried setting iterative_size as low as single digits - all settings result in the same error above.

Please let me know any additional information you may need from me to run this down.

Thanks!

Which version of openai should be installed to reproduce gsm8k with llmlingua?

I installed version 0.27.4 to run the code in examples/CoT.ipynb.
An error was raised when running the following lines:

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\n\n",
}
response = openai.Completion.create(
    model="gpt-3.5-turbo-0301",
    **request_data,
)
print(json.dumps(response, indent=4))
openai.error.InvalidRequestError: This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?

I updated openai to version 0.28.1, but the error persists; updating to an even newer version doesn't work either.
So I changed the code according to the error. It seems gpt-3.5-turbo-0301 can only be used with ChatCompletion.

request_data = {
    "messages": [{"role": "system", "content": ""}, {"role": "user", "content": prompt}],
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\n\n",
}
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    **request_data,
)

The final result is 0.439, far from 0.78+.

num_q 1319 correct 579 ratio 0.4390

Do you have any suggestions for me?

  1. Should I use an older or a newer openai version?
  2. Older openai versions may not support newer models, so I would prefer the newer openai (for broader testing) to reproduce the result. But the result with gpt-3.5-turbo-0301 does not look good.

Using web-hosted model for inference

Currently the NousResearch/Llama-2-7b-chat-hf model appears to be running locally on my machine, which can take quite a while for long prompts. I'd like to use more AI-optimized hardware to speed this process up.

Is it possible to use a web-hosted version of the model, or use a different web-hosted model entirely?

autogen compressible agent integration

Hello!

I put LLM Lingua into Autogen as part of a compressible agent microsoft/autogen#1005

It's basically functional, but too slow on my MacBook with Llama 2 to really test.

I figured I'd try phi-2, but it didn't return past_key_values; I have no clue whether that's a dead end or fixable :-)

Would appreciate any input on effectively using LLM Lingua as the compressor for gpt agents.

Thanks!

The specific parameter settings in the compressor for reproduce NQ

Very nice work! I am trying to replicate the LongLLMLingua results on the Natural Questions dataset, but there may be some discrepancies between my results and those in the paper because it is unclear what value should be set for each parameter of the compressor. Could you share the specific parameter settings used in the compressor?

Error when running locally

I run it locally and get an error during "Loading checkpoint shards:" (see the attached screenshot, Screen Shot 2024-01-21 4:48:08 PM).

Support for llama.cpp or exl2

Hi, this is an interesting project. I would like to use this with llama.cpp (llama-cpp-python more specifically), but when I had a look at the code I wasn't able to switch out the model loaders (I got stuck on the attention mask).
Are you planning on officially integrating more model loaders/formats?

How can I use it in LangChain?

My code looks like this:

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
kc = RetrievalQA.from_llm(llm=qwllm, retriever=compression_retriever, prompt=prompt)
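
For what it's worth, recent langchain_community releases ship an LLMLinguaCompressor document compressor that can be plugged into a ContextualCompressionRetriever; below is a minimal sketch (class names and availability depend on your installed LangChain version, and the documents/query are placeholders).

from langchain_core.documents import Document
from langchain_community.document_compressors import LLMLinguaCompressor

compressor = LLMLinguaCompressor(model_name="NousResearch/Llama-2-7b-hf", device_map="cpu")

docs = [
    Document(page_content="The Philadelphia Eagles won Super Bowl LII after the 2017 season."),
    Document(page_content="The Eagles also won NFL titles in 1948, 1949, and 1960."),
]
compressed_docs = compressor.compress_documents(
    docs, query="when did the Eagles last win the Super Bowl"
)

# To use it in a RAG chain like the snippet above, wrap an existing retriever:
# from langchain.retrievers import ContextualCompressionRetriever
# compression_retriever = ContextualCompressionRetriever(
#     base_compressor=compressor, base_retriever=your_retriever,
# )
# kc = RetrievalQA.from_llm(llm=qwllm, retriever=compression_retriever, prompt=prompt)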

How to set up LLMLingua with localhost?

Hello, how do I set up LLMLingua with a self-hosted localhost server? Is there a tutorial? Thanks.
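
There is no official tutorial for this in the repo as far as I can tell, but one hypothetical way to self-host it is to wrap PromptCompressor in a tiny HTTP service and call it from your application; the FastAPI wrapper and endpoint below are made up for illustration.

# server.py -- run with: uvicorn server:app --host 127.0.0.1 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from llmlingua import PromptCompressor

app = FastAPI()
llm_lingua = PromptCompressor(device_map="cpu")  # or "cuda" if a GPU is available


class CompressRequest(BaseModel):
    prompt: str
    rate: float = 0.5


@app.post("/compress")
def compress(req: CompressRequest):
    # Compress the incoming prompt and return only the compressed text.
    result = llm_lingua.compress_prompt(req.prompt, rate=req.rate)
    return {"compressed_prompt": result["compressed_prompt"]}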


Question about past_key_values

Are the manipulations with past_key_values there only to increase speed?
Can I remove the code that operates on past_key_values without lowering performance?

Can you provide the NaturalQuestions test dataset?

LongLLMLingua evaluates performance on "NaturalQuestions", but I cannot find any source to download it. Can you provide this test dataset?

(I tried to contact the authors of "Lost in the Middle", but I haven't received a reply yet.)

When I use a Chinese Llama, the compressed prompt contains garbled text

model name: Llama2-Chinese-7b-Chat

instruction = "请你对以下文本进行摘要"
question = ""
input:依达拉奉右莰醇是一种新型的神经保 护剂,包括依达拉奉和右莰醇,一种在动物 缺血性卒中模型中具有抗炎作用的食品添加 剂。这项研究旨在评估与依达拉奉相比,依 达拉奉右莰醇醇在治疗急性缺血性卒中 (AIS)患者中的安全性和有效性。\n方法 在这项多中心,随机,双盲,多剂量, 主动对照的II期临床试验中,卒中发作后48小 时内AIS患者被随机分配(1:1:1:1)低剂 量( 12.5毫克),中剂量(37.5毫克)或高剂 量(62.5毫克)依达拉奉右莰醇组,以及一个 活动对照组,每12小时静脉输注30毫克依达 拉奉,连续14天。主要疗效结果是改良的 Rankin量表(mRS)分数在90天时≤1的比例 以及美国国立卫生研究院卒中量表(NIHSS) 评分从基线到随机分组后14天的变化。安全 结果包括治疗后90天内的任何不良事件。\n结果 纳入疗效分析的385例患者中,随机分为 低剂量组94例,中剂量组97例,高剂量组98 例,对照组96例。在90天的mRS评分 (mRS≤1,p = 0.4054)或14天的NIHSS评分 变化(p = 0.6799)的四组之间没有观察到显 著差异。但是,在中等剂量组(69.39%)和 高剂量组(65.63%)中,在90天时mRS评分 ≤1的患者所观察到的百分比高于对照组 (60.64%)。在四组中,没有发现严重不良事件的显著差异(p = 0.3815)。结论 与单独使用依达拉奉相比,依达拉奉 右莰醇在所有剂量下均安全且耐受性良好, 尽管在90天时未观察到功能性转归明显改 善。', 'paragraph': '摘要\n背景 依达拉奉右莰醇是一种新型的神经保 护剂,包括依达拉奉和右莰醇,一种在动物 缺血性卒中模型中具有抗炎作用的食品添加 剂。这项研究旨在评估与依达拉奉相比,依 达拉奉右莰醇醇在治疗急性缺血性卒中 (AIS)患者中的安全性和有效性。\n方法 在这项多中心,随机,双盲,多剂量, 主动对照的II期临床试验中,卒中发作后48小 时内AIS患者被随机分配(1:1:1:1)低剂 量( 12.5毫克),中剂量(37.5毫克)或高剂 量(62.5毫克)依达拉奉右莰醇组,以及一个 活动对照组,每12小时静脉输注30毫克依达 拉奉,连续14天。主要疗效结果是改良的 Rankin量表(mRS)分数在90天时≤1的比例 以及美国国立卫生研究院卒中量表(NIHSS) 评分从基线到随机分组后14天的变化。安全 结果包括治疗后90天内的任何不良事件。\n结果 纳入疗效分析的385例患者中,随机分为 低剂量组94例,中剂量组97例,高剂量组98 例,对照组96例。在90天的mRS评分 (mRS≤1,p = 0.4054)或14天的NIHSS评分 变化(p = 0.6799)的四组之间没有观察到显 著差异。但是,在中等剂量组(69.39%)和 高剂量组(65.63%)中,在90天时mRS评分 ≤1的患者所观察到的百分比高于对照组 (60.64%)。在四组中,没有发现严重不良事件的显著差异(p = 0.3815)。结论 与单独使用依达拉奉相比,依达拉奉 右莰醇在所有剂量下均安全且耐受性良好, 尽管在90天时未观察到功能性转归明显改 善。
介绍\n卒中是导致死亡和成年后残疾的主要原因, 在**产生巨大的经济负担。但是,许多神 经保护剂在治疗急性缺血性卒中(AIS)的患 者中没有显示出任何益处,因此寻找新的方 法势在必行。依达拉奉是一种有效的自由基 清除剂,被中日两国卒中护理指南推荐用于 AIS治疗。依达拉奉清除自由基,例如羟基 自由基(·OH),一氧化氮自由基(NO·)和 过氧亚硝酸盐阴离子(ONOO-)以此缓解脑 水肿并抑制迟发性神经元死亡。然而,脑缺 血性损伤极为复杂,涉及自由基和炎症反应。 右莰醇可抑制炎症相关蛋白的产生或表达, 并防止脑损伤或损伤。依达拉奉右莰醇是一 种新型神经保护剂,它以4:1的比例包含依 达拉奉和右莰醇,可能具有更好的治疗效果。 依达拉奉和右2-樟脑酚之间存在互补性。 依达拉奉联合右莰醇的药理研究表明,与单 独使用依达拉奉相比,依达拉奉右莰醇具有 协同作用和更长的治疗时间窗,表明依达拉 奉右莰醇对脑缺血的保护作用优于市售的依 达拉奉。当前的多中心,随机,主动对照, 双盲研究旨在验证依达拉奉右莰醇对AIS患者的疗效和安全性。', 'paragraph': '介绍\n卒中是导致死亡和成年后残疾的主要原因, 在**产生巨大的经济负担。但是,许多神 经保护剂在治疗急性缺血性卒中(AIS)的患 者中没有显示出任何益处,因此寻找新的方 法势在必行。依达拉奉是一种有效的自由基 清除剂,被中日两国卒中护理指南推荐用于 AIS治疗。依达拉奉清除自由基,例如羟基 自由基(·OH),一氧化氮自由基(NO·)和 过氧亚硝酸盐阴离子(ONOO-)以此缓解脑 水肿并抑制迟发性神经元死亡。然而,脑缺 血性损伤极为复杂,涉及自由基和炎症反应。 右莰醇可抑制炎症相关蛋白的产生或表达, 并防止脑损伤或损伤。依达拉奉右莰醇是一 种新型神经保护剂,它以4:1的比例包含依 达拉奉和右莰醇,可能具有更好的治疗效果。 依达拉奉和右2-樟脑酚之间存在互补性。 依达拉奉联合右莰醇的药理研究表明,与单 独使用依达拉奉相比,依达拉奉右莰醇具有 协同作用和更长的治疗时间窗,表明依达拉 奉右莰醇对脑缺血的保护作用优于市售的依 达拉奉。当前的多中心,随机,主动对照, 双盲研究旨在验证依达拉奉右莰醇对AIS患者的疗效和安全性。
方法\n研究设计\n 该研究被设计为2013年5月至2015年2月在**28个中心 进行的II期,多中心,随机,双盲,多剂量,主动对照的 临床试验。患者在获得知情同意后被分配治疗。
output:
{'compressed_prompt': "请你对以上文本进行摘要\n\n��达������种的保,,种在物 �中的这�����中)\n法 项�,�, 动�的II�后小内A者配)�(()�����组及 活动每0克�,�是改良的in(m)数0时1的例 及院�中 分从到。�后 入,分 量组8的分RSp.)或HSS 9)的有在6在时 者的在 与达�� �在�时到�。 '要�达�保,, 的 达��在中\n项 小内配)��的in时及 分从。4的分RS,的的有者的在� �在�到�改\n多�被,�以�� 生��一和右��单有间于 � '�神于右。\n方法\n研究设计\n 该研究被设计为2013年5月至2015年2月在**28个中心 进行的II期,多中心,随机,双盲,多剂量,主动对照的 临床试验。患者在获得知情同意后被分配治疗。\n\n", 'origin_tokens': 2787, 'compressed_tokens': 306, 'ratio': '9.1x', 'saving': ', Saving $0.1 in GPT-4.'}


Why is there no integration with LangChain yet?

I can see there are no docs available for the community on using LLMLingua with LangChain. We have them for LlamaIndex but not LangChain. Is there any example of how one can use this in LangChain for RAG use cases?

Speed Up Compression

First of all, thank you for this fantastic project. I was wondering if there are any parameters that help with the speed of compression. I am currently using TheBloke/Llama-2-7b-Chat-GPTQ, but it seems slow with the default parameters even for text that is not really that long.
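
One knob that may help, sketched below: swap the 7B GPTQ compressor for the LLMLingua-2 BERT-base classifier, which is typically much faster for task-agnostic compression; increasing iterative_size can also reduce the number of forward passes on the LLMLingua-1 path. The prompt here is a placeholder.

from llmlingua import PromptCompressor

prompt = "Some long prompt text ... " * 100  # placeholder

# LLMLingua-2 small model: a BERT-base token classifier, usually much faster
# than a 7B causal LM for compression.
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)
compressed = llm_lingua.compress_prompt(prompt, rate=0.33, force_tokens=["\n", "?"])
print(compressed["ratio"])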

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customer, partners, etc.) must be migrated to GitHub inside Microsoft a.k.a GitHub Enterprise Cloud with Enterprise Managed User (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result to your repository getting automatically archived.🔒

Instructions

Reply with a comment on this issue containing one of the following optin or optout command options below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : This repository will ship as Open Source or go public
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

Need more help? 🖐️

How to run on a Linux machine (CPU, without GPU)

Is there any way to run LLMLingua on a CPU-only Linux machine? I am trying to load it using:
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor(device_map="mps")

but it takes a very long time to load.
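
Note that "mps" is the Apple GPU backend, so on a CPU-only Linux box device_map="cpu" is the setting to try; a smaller base model also loads much faster than a 7B model, at some cost in compression quality. A minimal sketch (the prompt is a placeholder):

from llmlingua import PromptCompressor

# CPU-only: use device_map="cpu" and a small LM (e.g., GPT-2) as the compressor.
llm_lingua = PromptCompressor(model_name="gpt2", device_map="cpu")
compressed = llm_lingua.compress_prompt("Some long prompt text ... " * 100, target_token=200)
print(compressed["compressed_tokens"])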

LLMLingua and LongLLMLingua parameters question

I read issues #7, #12 and #49.
I guess the right parameters that LLMLingua uses are:
prompt = compressor.compress_prompt(context=xxx, instruction=xxx, question=xxx, ratio=0.75, iterative_size=100, context_budget="*2")
and for LongLLMLingua:
prompt = compressor.compress_prompt(context=xxx, instruction=xxx, question=xxx, ratio=0.75, iterative_size=200, condition_compare=True, condition_in_question='after_condition', rank_method='longllmlingua', reorder_context='sort', dynamic_context_compression_ratio=0.3, context_budget="*2")
I have some questions:

  1. In #7 you said context_budget should be *1.3 or +300 for LongLLMLingua, and in #12 you said it should be +200, so I am confused about how to set context_budget; meanwhile, in the LLMLingua and LongLLMLingua papers, context_budget seems to be *2 (per #49). So I want to know how to set context_budget in LLMLingua and LongLLMLingua.
  2. In #49 you said context_budget and token_budget_ratio can be considered part of the control coefficient parameter k. Can I assume I only need to control context_budget, given that in #7 and #12 you do not change token_budget_ratio?
  3. What does the dynamic_context_compression_ratio parameter correspond to in the LongLLMLingua paper? I cannot find it in the implementation details.
  4. The most important thing I want to know is whether the parameters I describe above are correct. I would like to get the actual LLMLingua and LongLLMLingua parameters used in the papers, so I can run my experiments with the true settings.

Please forgive the long list of questions; your LLMLingua and LongLLMLingua work is very interesting to me!
Looking forward to your reply.

Is the code for LongLLMLingua out?

When I look through two of the examples, it seems like regular LLMLingua is being used. Is LongLLMLingua out yet? I only see the README and paper updates.

Sorry if I've missed anything obvious!

Exploring the Possibility of Porting LLMLingua to JVM Languages (Java/Kotlin)

Hello guys, awesome project here!

I'm reaching out to inquire about the possibility of porting LLMLingua to JVM languages, particularly Kotlin. Are there any plans for supporting JVM languages in the near future? I believe that making LLMLingua available in these languages could significantly broaden its reach and utility.

I understand that porting a library, especially one as sophisticated as LLMLingua, can be a substantial undertaking. In this regard, I have a few questions:

  1. How tightly coupled is the current implementation with Python-specific libraries and features?
  2. Are there aspects of LLMLingua that you foresee being particularly challenging to port to a JVM language?
  3. Would you be open to community contributions for this "portability"?

I am very interested in contributing to this effort. My background includes working with both Python and Kotlin, and I am keen to assist in making LLMLingua accessible to a wider range of developers.

compress_prompt Reports Error: AttributeError: 'NoneType' object has no attribute 'device'

llm_lingua = PromptCompressor(model_name="Baichuan2-main/Baichuan2-7B-Chat", model_config=model_config)
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'prompt' is not defined
compressed_prompt = llm_lingua.compress_prompt('你好', instruction="", question="", target_token=200)
/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
File "", line 1, in
File "/PDF-OCR/AIGC/EngineeringQA/Baichuan2-main/LLMLingua/llmlingua/prompt_compressor.py", line 252, in compress_prompt
context = self.iterative_compress_prompt(
File "/PDF-OCR/AIGC/EngineeringQA/Baichuan2-main/LLMLingua/llmlingua/prompt_compressor.py", line 734, in iterative_compress_prompt
loss, past_key_values = self.get_ppl(
File "/PDF-OCR/AIGC/EngineeringQA/Baichuan2-main/LLMLingua/llmlingua/prompt_compressor.py", line 105, in get_ppl
response = self.model(
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 719, in forward
outputs = self.model(
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 494, in forward
layer_outputs = decoder_layer(
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 306, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 238, in forward
proj = self.W_pack(hidden_states)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 441, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 563, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 344, in forward
state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2080, in transform
prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

The script for LongChat to reproduce the LongLLMLingua

I really appreciate your awesome work and the effort put into the easy-to-use code.

The provided example uses OpenAI's GPT3.5 with the OpenAI API. Is there a plan to provide the evaluation script using longchat-13b-16k to reproduce LongLLMLingua?

PromptCompressor error - OpenAIGPTLMHeadModel.forward() got an unexpected keyword argument 'past_key_values'

I am trying to use the OpenAI GPT-2 model for prompt compression. However, I am getting the error "OpenAIGPTLMHeadModel.forward() got an unexpected keyword argument 'past_key_values'". Has anyone faced a similar issue, and how can this be fixed? Thanks.

!pip install llmlingua
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
model_name = "openai-gpt",
device_map="cpu",
)
prompt = """
The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.
Human: Hello, who are you?
AI: I am an AI created by OpenAI. How can I help you today?
Human: I'd like to cancel my subscription.
AI: I'm sorry to hear that. What is your subscription number?
Human: 123456
AI: Thank you. Your subscription has been cancelled.
Human: Thank you. Goodbye!
AI: Goodbye!
"""
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="", target_token=200)

AssertionError: Torch not compiled with CUDA enabled

Hi, I tried to run LLMLingua using dolphin-2.6-phi-2 but I got:
AssertionError: Torch not compiled with CUDA enabled

PS C:\Users\DefaultUser> python "C:\Users\Public\Coding\LLMLingua\LLMLingua_test1.py"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards:   0%|                                                                 | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Public\Coding\LLMLingua\LLMLingua_test1.py", line 12, in <module>
    llm_lingua = LocalPromptCompressor()
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 27, in __init__
    self.load_model(model_name, device_map, model_config)
  File "C:\Program Files\Lib\site-packages\llmlingua\local_prompt_compressor.py", line 57, in load_model
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\transformers\modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\transformers\modeling_utils.py", line 4116, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\transformers\modeling_utils.py", line 778, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "C:\Program Files\Lib\site-packages\accelerate\utils\modeling.py", line 347, in set_module_tensor_to_device
    new_value = value.to(device)
                ^^^^^^^^^^^^^^^^
  File "C:\Program Files\Lib\site-packages\torch\cuda\__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

force_context_ids parameter not behaving as expected

Hello,
I've recently created a small service that intercepts calls from my frontend to my LLM backend to compress the incoming prompts. To retain some crucial parts, I wanted to make use of the force_context_ids, but as it turns out, I'm getting an error I can't further investigate.

File "C:\Users\micro\miniconda3\envs\LLMLinguaMITM\Lib\site-packages\llmlingua\prompt_compressor.py", line 200, in compress_prompt
context, dynamic_ratio = self.control_context_budget(
^^^^^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)

I use this method to search the list of strings passed as the context parameter for matches against the regex, and I add the matching positions to a list of integers. That list of integers is then handed to the compress_prompt() call.

forced_context_ids = []

for i, string in enumerate(split_prompts):
    if re.match(config['force_context'], string):
        forced_context_ids.append(i)

The full code can be found here: https://github.com/Dakraid/LLMLinguaMITM/blob/main/main.py

During debugging, the list of integers seemed fine and matched the expected positions in the list of strings for the context. Any further help or insight would be appreciated.

CUDA out of memory

I have 4 RTX A5000 GPUs with 24GB of memory each, but when I run the example code:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

I get the error:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

It does not seem able to run on multiple GPUs.
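
A minimal sketch of one thing to try, under the assumption that device_map is forwarded to Hugging Face transformers/accelerate: device_map="auto" lets accelerate decide the placement, or CUDA_VISIBLE_DEVICES can pin the (already <8GB) GPTQ model to a single free GPU.

import os

# Optionally pin to one free GPU before importing torch/llmlingua.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

from llmlingua import PromptCompressor

# Assumption: device_map is passed through to transformers/accelerate,
# so "auto" chooses the placement instead of the default device.
llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="auto",
    model_config={"revision": "main"},
)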

Some problem about code

Very nice work! I am reading the code. May I ask if the distribution alignment section is missing from the code? Did you directly use the NousResearch/Llama-2-7b-chat-hf model as the compressor? Can this model be considered aligned with GPT-3.5 and LongChat?

version 0.2.0 iteration plan

Estimated Release Date: 3/12
Release Manager: @suiguoxin
Schedule:

  • Design Review: 1/19
  • Coding: 3/5
  • Testing: 3/12

Features

  • P0 Feature Planning @iofu728 @lunaqiu ETA: 1.16
  • P0 Interface Definition: Engine < Core < Wrapper < Applications #52 @SiyunZhao ETA: 1.16
  • P0 Layered refactor @SiyunZhao @iofu728
  • fixed/customized/accurate/target/max compression ratio
  • P1 Support customized compression spec, such as user specified segment boundary and compression ratio
    • P0 Support using <llmlingua ratio=?? compress=??> </llmlingua> to identify compression segment boundaries
    • Support preserving essential characters
  • bug fix TBD After Interface Refactoring
  • P1 Support more models, small LMs e.g., Phi2 ETA: 2 days #67
  • P0 Support pure json interface & doc @SiyunZhao #120
  • PR (Ch, 1000 words) @lunaqiu @iofu728 1.17

Backlog

  • P1 exp: target comp ratio v.s. real comp ratio on specific data
  • >token level, < sentence level, list different mappings and design interface P1 word level compression #4
  • P1 Support more / faster engines #41, including llama_cpp, FasterTransformer, vLLM ETA: TBD
    • survey which engines to support
  • P2 Documentation and examples
    • Supported models and experiment results (with compressor throughput) after a faster engine supported

keyError 'llama' when trying to running PromptCompressor()

Here is the stack trace. I can't for the life of me figure out the source of this error.
{
"name": "KeyError",
"message": "'llama'",
"stack": "---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[2], line 4
1 # testing prompt compression using llmlingua
2 from llmlingua import PromptCompressor
----> 4 compressor = PromptCompressor(device_map="cpu")
5 compressed_prompt = compressor.compressed_prompt(prompt= text, instructions="identify the last unanswered question the trascript", question="",tartget_token =350)

File c:\Users\conrad.liburd\Anaconda3\envs\tapas\lib\site-packages\llmlingua\prompt_compressor.py:27, in PromptCompressor.init(self, model_name, device_map, model_config, open_api_config)
20 def init(
21 self,
22 model_name: str = "NousResearch/Llama-2-7b-hf",
(...)
25 open_api_config: dict = {},
26 ):
---> 27 self.load_model(model_name, device_map, model_config)
28 self.retrieval_model = None
29 self.retrieval_model_name = None

File c:\Users\conrad.liburd\Anaconda3\envs\tapas\lib\site-packages\llmlingua\prompt_compressor.py:40, in PromptCompressor.load_model(self, model_name, device_map, model_config)
38 if "trust_remote_code" not in model_config:
39 model_config["trust_remote_code"] = trust_remote_code
---> 40 config = AutoConfig.from_pretrained(
41 model_name, trust_remote_code=trust_remote_code
42 )
43 tokenizer = AutoTokenizer.from_pretrained(
44 model_name, trust_remote_code=trust_remote_code
45 )
46 if model_config.get("pad_to_left", True):

File c:\Users\conrad.liburd\Anaconda3\envs\tapas\lib\site-packages\transformers\models\auto\configuration_auto.py:917, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
915 return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
916 elif "model_type" in config_dict:
--> 917 config_class = CONFIG_MAPPING[config_dict["model_type"]]
918 return config_class.from_dict(config_dict, **unused_kwargs)
919 else:
920 # Fallback: use pattern matching on the string.
921 # We go from longer names to shorter names to catch roberta before bert (for instance)

File c:\Users\conrad.liburd\Anaconda3\envs\tapas\lib\site-packages\transformers\models\auto\configuration_auto.py:623, in _LazyConfigMapping.getitem(self, key)
621 return self._extra_content[key]
622 if key not in self._mapping:
--> 623 raise KeyError(key)
624 value = self._mapping[key]
625 module_name = model_type_to_module_name(key)

KeyError: 'llama'"
}

Understanding the interplay between `ratio` and `iterative_size`

Thank you for the interesting work and for making the code easily accessible. I have some confusion about the relationship between the ratio and iterative_size parameters.

In the case I am interested in, there is a single demonstration that I want to compress using only the token-level compression approach. I've noticed that, in general, the final ratio between the compressed and original lengths can vary quite a bit for large enough 'ratio' values. However, when I make the iterative_size parameter small, e.g. 10, the final compression ratio is more faithful to the value specified for the ratio parameter.

I'm confused as to why this is the case. From the paper, my understanding was that the \gamma_j threshold for segment s_j (whose length is defined by the iterative_size parameter) was based primarily on the ratio parameter, meaning that, regardless of iterative_size, LLMLingua would always prune a ratio fraction of the tokens in that segment.

Any clarifications of this would be useful, including where in the code \gamma_j is computed.
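
A small experiment along these lines, sketched under the assumption that the installed release accepts ratio (newer releases call it rate) and exposes the per-level filter switches used elsewhere in these issues: compress the same placeholder demonstration with only token-level compression at several iterative_size values and compare the reported ratios.

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()
demonstration = "Step-by-step reasoning sentence for the demonstration. " * 200  # placeholder

for it_size in (200, 100, 10):
    out = llm_lingua.compress_prompt(
        demonstration,
        ratio=0.5,                       # "rate=0.5" on newer releases
        iterative_size=it_size,
        use_context_level_filter=False,  # token-level compression only
        use_sentence_level_filter=False,
    )
    print(it_size, out["origin_tokens"], out["compressed_tokens"], out["ratio"])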

Output for High Token Languages like Japanese

While the concept is promising, especially for high-token languages like Japanese, I've encountered a significant encoding issue.

Steps to Reproduce:
Input a Japanese text prompt into LLMLingua for compression.
Observe the output, which should be a compressed version of the original prompt.
Expected Behavior:
The compressed output should retain the original Japanese characters without any encoding errors.

Actual Behavior:
The output contains a mix of unrecognized characters along with some correct Japanese script. This mixed encoding makes the compressed prompt unusable when passed into GPT-4.

llama instead of gpt

Just a few questions about using LLMLingua.

  1. How do I adjust the code so that I am using Llama instead of GPT?
  2. The reason I am using Llama instead of GPT is that I don't want my data to be sent to any other company's server. When using Llama, is my prompt or data being sent to any server?
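
A brief sketch of the fully local path, with the caveat that the serving side is just an example: the compressor itself is a local Hugging Face model, so nothing leaves the machine at the compression step, and the compressed prompt can then be fed to a locally downloaded Llama chat model instead of the GPT API. Model names and generation settings below are illustrative.

from llmlingua import PromptCompressor
from transformers import pipeline

# 1) Compression runs locally with an open LLaMA-family model as the small LM.
llm_lingua = PromptCompressor(model_name="NousResearch/Llama-2-7b-hf", device_map="cuda")
compressed = llm_lingua.compress_prompt("Some long prompt text ... " * 100, target_token=300)

# 2) Generation with a locally hosted Llama chat model; no external API calls.
generator = pipeline(
    "text-generation",
    model="NousResearch/Llama-2-7b-chat-hf",
    device_map="auto",
)
answer = generator(compressed["compressed_prompt"], max_new_tokens=256)[0]["generated_text"]
print(answer)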

Question about LongLLMLingua token-level compression

Thanks for sharing your interesting research.
I'm reproducing LongLLMLingua, and I'm reaching out to ask about token-level prompt compression.
I understand that after sorting the 20 documents, pruning based on the context budget and token-level compression based on the dynamic ratio are performed. However, it seems that for the last document, e.g., the 13th in the sorted order 5-16-10-13, token-level compression starts and then suddenly stops. I suspect a related cause: when I pass only one document, set sentence-level and context-level filtering to False, and compress, it sometimes does not perform any compression regardless of the target_token parameter. I wonder if this is intended.

Here's an example of the input and output.

Input command

compressed_prompt = llm_lingua.compress_prompt(  
        demonstration_str.split("\n"),   
        instruction=instruction,
        question=question, 
        target_token=500,  
        condition_compare=True,   
        condition_in_question='after',    
        rank_method='longllmlingua',
        use_sentence_level_filter=False,  
        context_budget="+100",    
        dynamic_context_compression_ratio=0.4,    
        reorder_context="sort")

Here, the demonstration, instruction and question are from the 1st example of the test set where the question is "who got the first nobel prize in physics."

And here is the compressed output I got.


Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [6](Title: Nobel Prize in Physics) rendered by the discovery of the remarkable rays (or x-rays). This award is administered by the Nobel Foundation and widely regarded as the most prestigious award that a scientist can receive in physics. It is presented in Stockholm at an annual ceremony on 10 December, the anniversary of Nobel's death. Through 2018, a total of 209 individuals have been awarded the prize. Only three women (1.4% of laureates) have won the Nobel Prize in Physics: Marie Curie in 1903, Maria Goeppert Mayer in 1963, and Donna Strickland in 2018. Alfred Nobel, in his last will and testament, stated that his

Document [16](Title in Physics) death (1833–1896' portrait also appears on the obverse of Peace Prize and the Medal for the Prize Economics slightly different design. The on the reverse of var according to awarding the prize. The sides of Nobel Prize medals Chemistry and the same Nature, as a Goddess, whose veil is held up by the Genius of Science. These medals the ones for Physiology/Medicine and Literature designed by Erik Lindberg in 1902 laure receive a dipl directly from the
1: ofates in) The Nobel Prize in in 1 to Wilhelm Rö Germany,5 SEK is 770 SEK December 207 John Bardeen la twicein95 9. Skłod-Curie won Priz for103 and chem11. William Bragg was, until0, the young the195 at . women won the prize: MariappertM (963 of 207, the
1 Prize) A group including writers, against, having. Some, including Burton Feldman, have criticised this prize because they consider Prudhomme a mediocre poet. Feldman's explanation is that most of the Academy members preferred Victorian literature and thus selected a Victorian poet. The first Physiology or Medicine Prize went to the German physiologist and microbiologist Emil von Behring. During the 1890s, von Behring developed an antitoxin to treat diphtheria, which until then was causing thousands of deaths each year. The first Nobel Peace Prize went to the Swiss

Question: who got the first nobel prize in physics
Answer:


The bold italic part denotes the last document that survived, which is where I think the largest compression ratio should be applied.



Below is the second example when compressing only one document.

This is my input command.

compressed_prompt = llm_lingua.compress_prompt(
        demonstration_str.split("\n"),
        instruction=instruction,
        question=question,
        target_token=20,
        condition_compare=True,
        condition_in_question='after',
        use_sentence_level_filter=False,
        use_context_level_filter=False, )

Output:

[Original prompt]
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1](Title: Philadelphia Eagles) The Philadelphia Eagles are a professional American football franchise based in Philadelphia, Pennsylvania. The Eagles compete in the National Football League (NFL) as a member club of the league's National Football Conference (NFC) East division. They are Super Bowl champions, having won Super Bowl LII, their fourth NFL title, after winning in 1948, 1949, and 1960.

Question: when is the last time the philadelphia won the superbowl
Answer:
############################################################################
[Compressed prompt]
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1](Title: Philadelphia Eagles) The Philadelphia Eagles are a professional American football franchise based in Philadelphia, Pennsylvania. The Eagles compete in the National Football League (NFL) as a member club of the league's National Football Conference (NFC) East division. They are Super Bowl champions, having won Super Bowl LII, their fourth NFL title, after winning in 1948, 1949, and 1960.

Question: when is the last time the philadelphia won the superbowl
Answer:

-----------
Statistics: {'origin_tokens': 127, 'compressed_tokens': 127, 'ratio': '1.0x', 'origin_tokens_context': 87, 'compressed_tokens_context': 87, 'ratio_context': '1.0x'}

In the second example, changing the target_token parameter yielded the same results.

What parameter settings are needed to reproduce LongLLMLingua?

The parameter names in the paper do not match the parameter names in the compress_prompt API call, so what parameter settings are needed to reproduce LongLLMLingua?

The following are the parameter settings I speculate:

prompt  = compressor.compress_prompt(
    context=documents,
    instruction=instruction,
    question=question,
    ratio=0.75, # for 4x speedup
    iterative_size=200,
    condition_compare=True,
    condition_in_question='after',
    rank_method='longllmlingua',
    reorder_context='two_stage',
    dynamic_context_compression_ratio=0.25,
    context_budget="*2.0",
)

Out of Memory error with llm_lingua

Using the following code

llm_lingua = PromptCompressor(
    model_name="meta-llama/Llama-2-7b-hf",
    device_map="cuda:0",
    use_auth_token=False,
    open_api_config={},
)
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question=question, target_token=200)

However, I am getting an out-of-memory error (see the attached screenshot).

Since the model is loading onto only 1 of the GPUs and not using all 4, what changes should I make?

RuntimeError: The expanded size of the tensor (181) must match the existing size (211) at non-singleton dimension 0

I use Qwen-7B, then I get this error:
Traceback (most recent call last):
File "/qwen/test.py", line 22, in
compressed_prompt = llm_lingua.compress_prompt(
File "/usr/local/lib/python3.10/dist-packages/llmlingua/prompt_compressor.py", line 253, in compress_prompt
context = self.iterative_compress_prompt(
File "/usr/local/lib/python3.10/dist-packages/llmlingua/prompt_compressor.py", line 749, in iterative_compress_prompt
past_loss[ready_end : end - 1] = loss
RuntimeError: The expanded size of the tensor (181) must match the existing size (211) at non-singleton dimension 0. Target sizes: [181]. Tensor sizes: [211]

Out of Memory Error with Llama-2-7b

I am using following command to test it:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-hf",
    device_map="cuda:0",
    use_auth_token=False,
    open_api_config={},
)
compressed_prompt = llm_lingua.compress_prompt(prompt, instruction="", question="some question", target_token=200)

However, it is giving me an out-of-memory error:
Screenshot from 2023-10-17 16-09-39

The probable reason is that the model is getting loaded onto only 1 GPU partition and not all of them. Llama-2-7b-hf otherwise runs perfectly on this machine; it just does not work with this library. What changes would you suggest?
Screenshot from 2023-10-17 16-07-30
