
loogle's Issues

Is there a problem with the groundtruth logic?

Hello, is there a bug in the get_pred() method in pred_gpt_models.py?

def get_pred(model, data_instance, tokenizer, max_length, max_gen, prompt_format, device):
    ans, groundtruth = [], []
    preds = {}
    raw_inputs = data_instance['input']
    if data_instance['qa_pairs'] == 'none':
        preds['qa_pairs'] = data_instance['qa_pairs']
        json_obj = {'input': raw_inputs}

        prompt = prompt_format.format(**json_obj)
        tokenized_prompt = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
        if len(tokenized_prompt) > max_length:
            half = int(max_length / 2)
            prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True) + tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)

        input_ids = tokenizer(prompt, truncation=True, return_tensors="pt").input_ids.to(device)
        context_length = input_ids.shape[-1]
        with torch.no_grad():
            output = model.generate(input_ids, max_new_tokens=max_gen, temperature=1.0, num_beams=1, do_sample=False, repetition_penalty=float(2))[0]
        pred = tokenizer.decode(output[context_length:], skip_special_tokens=True)

        ans.append(pred)
        groundtruth.append(raw_inputs)

Why is raw_inputs appended to groundtruth here?
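If I understand correctly, the expected behavior would be something like the following sketch (assuming each instance stores its reference answer in an 'output' field, which is a guess on my part; the actual LooGLE data schema may use a different name):

# Hypothetical sketch: append the reference answer rather than the model input.
# Assumes data_instance['output'] holds the gold answer; the real field name may differ.
ans.append(pred)
groundtruth.append(data_instance['output'])  # instead of raw_inputs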

Prompt format for different models

Hi! I have read the code for the open-source model evaluation. I noticed that, unlike some existing benchmarks such as LongBench or L-Eval, there is no prompt customization for different models (e.g., the prompt format of the Vicuna series differs from the original LLaMA-2 chat format). For a fair comparison, do you think such customization should be added to the code?
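For illustration, a minimal sketch of the kind of per-model prompt wrapping I have in mind (the template strings below are simplified placeholders, not the exact Vicuna or LLaMA-2 chat formats):

# Hypothetical per-model prompt wrapping; the template strings are illustrative only
# and should be replaced with each model's documented chat format.
MODEL_PROMPT_TEMPLATES = {
    "llama2-chat": "[INST] {prompt} [/INST]",   # simplified LLaMA-2 chat style
    "vicuna": "USER: {prompt} ASSISTANT:",      # simplified Vicuna style
    "default": "{prompt}",                      # base models: use the prompt unchanged
}

def build_prompt(model_name, task_prompt):
    # Fall back to the unmodified prompt when no template is registered.
    template = MODEL_PROMPT_TEMPLATES.get(model_name, MODEL_PROMPT_TEMPLATES["default"])
    return template.format(prompt=task_prompt)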

Insufficient A100 memory for overly long context.

Thank you for your outstanding work, but I ran into the following problem during testing.
A single A100 (80 GB) runs out of memory when generating predictions over an overly long context. I am very curious how you solved this tricky problem.
I saw the following in your code:

if len(tokenized_prompt) > max_length:
    half = int(max_length/2)
    prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True) + tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)
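To make the behavior concrete, here is a self-contained toy version of this head-plus-tail truncation (a plain list of ints stands in for the real tokenized prompt):

# Toy illustration: when the prompt exceeds max_length tokens,
# keep the first half and the last half and drop the middle.
def truncate_middle(token_ids, max_length):
    if len(token_ids) <= max_length:
        return token_ids
    half = int(max_length / 2)
    return token_ids[:half] + token_ids[-half:]

tokens = list(range(100))           # stand-in for tokenized_prompt
print(truncate_middle(tokens, 10))  # [0, 1, 2, 3, 4, 95, 96, 97, 98, 99]

This bounds the number of tokens fed to the model at max_length, so memory use stays fixed, at the cost of discarding the middle of the context.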

Does this approach affect the accuracy of the evaluation?
Thank you again.

Question about model selection

I see in the paper that you chose a LLaMA model extended to a 32k context length for the summarization evaluation, while some of the other models (LongLLaMA, the GPT models, etc.) have likely had some degree of instruction tuning and therefore already understand the corresponding tasks. I am not sure whether the llama32k you chose had its context length extended purely as a language model; if so, how do you ensure the comparison is fair?
Also, have you considered adding the llama-chat variants, or other instruction-tuned and length-extended LLaMA models, to the evaluation?
