
[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning

This repo holds the code, demos, and log files for Reflexion: Language Agents with Verbal Reinforcement Learning by Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao.

(Figure: Reflexion RL diagram)

(Figure: Reflexion tasks)

We have released the LeetcodeHardGym here.

To Run: reasoning (HotPotQA)

We have provided a set of notebooks to easily run, explore, and interact with the results of the reasoning experiments. Each experiment consists of a random sample of 100 questions from the HotPotQA distractor dataset. Each question in the sample is attempted by an agent with a specific type and reflexion strategy.

Setup

To get started:

  1. Clone this repo and move to the HotPotQA directory:
git clone https://github.com/noahshinn/reflexion && cd ./hotpotqa_runs
  2. Install the module dependencies into your environment:
pip install -r requirements.txt
  3. Set the OPENAI_API_KEY environment variable to your OpenAI API key:
export OPENAI_API_KEY=<your key>
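
As a quick sanity check before launching a notebook, you can confirm the key is visible to Python (a small sketch; the repo itself does not ship this check):

import os

# Fails fast if the key was not exported in this shell session.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"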

Agent Types

Agent type is determined by the notebook you choose to run. The available agent types include:

  • ReAct - ReAct Agent

  • CoT_context - CoT Agent given supporting context about the question

  • CoT_no_context - CoT Agent given no supporting context about the question

The notebook for each agent type is located in the ./hotpotqa_runs/notebooks directory.

Reflexion Strategies

Each notebook allows you to specify the reflexion strategy to be used by the agents. The available reflexion strategies, which are defined in an Enum, include the following (a sketch of the Enum follows this list):

  • ReflexionStrategy.NONE - The agent is not given any information about its last attempt.

  • ReflexionStrategy.LAST_ATTEMPT - The agent is given its reasoning trace from its last attempt on the question as context.

  • ReflexionStrategy.REFLEXION - The agent is given its self-reflection on the last attempt as context.

  • ReflexionStrategy.LAST_ATTEMPT_AND_REFLEXION - The agent is given both its reasoning trace and self-reflection on the last attempt as context.
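
For concreteness, here is a minimal sketch of what such an Enum could look like. The member names match the list above; the string values are illustrative placeholders rather than the repo's exact definition.

from enum import Enum

class ReflexionStrategy(Enum):
    # No information about the previous attempt is provided.
    NONE = "base"
    # The previous reasoning trace is provided as context.
    LAST_ATTEMPT = "last_trial"
    # A self-reflection on the previous attempt is provided as context.
    REFLEXION = "reflexion"
    # Both the previous trace and the self-reflection are provided.
    LAST_ATTEMPT_AND_REFLEXION = "last_trial_and_reflexion"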

To Run: decision-making (AlfWorld)

Clone this repo and move to the AlfWorld directory

git clone https://github.com/noahshinn/reflexion && cd ./alfworld_runs

Specify the run parameters in ./run_reflexion.sh:

  • num_trials: number of iterative learning steps

  • num_envs: number of task-environment pairs per trial

  • run_name: the name for this run

  • use_memory: use persisting memory to store self-reflections (turn off to run a baseline)

  • is_resume: use a logging directory to resume a previous run

  • resume_dir: the logging directory from which to resume the previous run

  • start_trial_num: if resuming a run, the trial number at which to start

Run the trial

./run_reflexion.sh

The logs will be sent to ./root/<run_name>.

Another Note

Due to the nature of these experiments, it may not be feasible for individual developers to rerun the results, as GPT-4 access is limited and API charges are significant. All runs from the paper and additional results are logged in ./alfworld_runs/root for decision-making, ./hotpotqa_runs/root for reasoning, and ./programming_runs/root for programming.
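
If you want to tally a logged run yourself, a minimal sketch along these lines may help. It assumes the programming logs are JSONL files with an is_solved field (as the issues below suggest); the path here is a placeholder.

import json

# Placeholder path; substitute a real results file from ./programming_runs/root.
log_path = "programming_runs/root/<run_name>/results.jsonl"

with open(log_path) as f:
    records = [json.loads(line) for line in f if line.strip()]

# Count the entries that the framework marked as solved.
solved = sum(1 for r in records if r.get("is_solved"))
print(f"{solved}/{len(records)} solved")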

Other Notes

Check out the original code here

Read a blog post here

Check out an interesting type-prediction implementation here: OpenTau

For all questions, contact [email protected]

Cite

@misc{shinn2023reflexion,
      title={Reflexion: Language Agents with Verbal Reinforcement Learning}, 
      author={Noah Shinn and Federico Cassano and Edward Berman and Ashwin Gopinath and Karthik Narasimhan and Shunyu Yao},
      year={2023},
      eprint={2303.11366},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

reflexion's People

Contributors

becklabs, cassanof, edwardberman, noahshinn, noahshinn024, ysymyth


reflexion's Issues

Potential unexpected behavior in the executor (which uses `globals()`)

I was running the script https://github.com/noahshinn/reflexion/blob/main/programming_runs/run_reflexion_codellama_multi.sh
with the CodeLLaMA model, simply changing codellama to codellama-7b:

CUDA_VISIBLE_DEVICES=$1 python main.py \
  --run_name "reflexion_codellama_$1" \
  --root_dir "root" \
  --dataset_path ./benchmarks/humaneval-py.jsonl \
  --strategy "reflexion" \
  --language "py" \
  --model "codellama-7b" \
  --pass_at_k "1" \
  --max_iters "2" \
  --verbose | tee ./logs/reflexion_codellama_$1

The performance is around 30-40%, though the run gets killed in the middle of the experiment.

Observation

Some predictions show unexpected behavior even though they are marked as correct (is_solved = True). In the following prediction (which is also attached here):

  1. The first program generated actually fails the generated test cases (NOTE: these test cases were generated by CodeLLaMA-7B), although this program eventually passes the ground-truth test cases.
  2. After reflection, the model actually generates nothing (i.e., None).
  3. So, presumably, we should take "" as our final output, right?
  4. But "is_solved" is actually assigned True.

(screenshot attached)

Reason

Because the executor always uses the function globals(), even though the program after reflexion is empty, globals() already contains the program from the first attempt.

Thus, since the information in globals() is not cleared each time, this could potentially lead to unexpected behavior.
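
A minimal standalone sketch of the pitfall being described (the function name is hypothetical; this is not the repo's executor):

# Attempt 1 defines a function in the module's globals().
exec("def add(a, b): return a + b", globals())

# Attempt 2 produces an empty program, but the old definition survives,
# so tests against `add` still pass, misleadingly.
exec("", globals())
assert add(1, 2) == 3

# Using a fresh namespace per attempt avoids the leakage.
ns = {}
exec("", ns)
assert "add" not in ns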

I have been a bit confused here; I hope the authors can clarify this further.

About the action limitation on Webshop

Hello, this is brilliant work!
May I ask about this code in https://github.com/noahshinn024/reflexion/blob/49485efc8fa2514dd98edfb8427ed27ad06fb391/webshop_runs/webshop_trial.py#L168

            elif button == 'Next >':
                assert False # ad hoc page limitation
                assert self.sessions[session]['page_type'] == 'search'
                self.sessions[session]['page_num'] += 1
            elif button == '< Prev':
                assert self.sessions[session]['page_type'] in ['search', 'item_sub', 'item']
                if self.sessions[session]['page_type'] == 'search':
                    assert False

The original WebShop env seems to implement the [Next] and [Prev] buttons, so why is assert False set here?
I would appreciate it if you could give me an answer.

Always fails to execute the action, and it responds with 'nothing happens'.

For example, like this. This is from my trial_0.log:

go to cabinet 3
On the cabinet 3, you see a cup 1.
think: I found a cup in cabinet 3. I can use it to clean the plate. Next, I need to put the plate in countertop 1.
OK.
【> take cup 1 from cabinet 3
Nothing happens.】
It seems there is no cup in cabinet 3. Let's try another way to clean the plate.
Nothing happens.
go to cabinet 4
On the cabinet 4, you see nothing.

go to drawer 2
Nothing happens.
go to drawer 2
On the drawer 2, you see a pencil 1.
think: I need to keep looking for the bowl. Let's try drawer 3.
Nothing happens.
【go to drawer 3
Nothing happens.
go to drawer 3
Nothing happens.】

I would like to know if I am doing something wrong, and I would be very grateful if you could help me solve this issue.

About the prompt for reflection

Thanks for your great work!

I have a question regarding the code. In prompts.py, the prompt for reflection (https://github.com/noahshinn024/reflexion/blob/612e616603650397d4060117de4578658626deb1/hotpotqa_runs/prompts.py#L117C1-L124C15) is:
REFLECT_INSTRUCTION = """You are an advanced reasoning agent that can improve based on self refection. You will be given a previous reasoning trial in which you were given access to an Docstore API environment and a question to answer. You were unsuccessful in answering the question either because you guessed the wrong answer with Finish[], or you used up your set number of reasoning steps. In a few sentences, Diagnose a possible reason for failure and devise a new, concise, high level plan that aims to mitigate the same failure. Use complete sentences.
Here are some examples:
{examples}
Previous trial:
Question: {question}{scratchpad}
Reflection:"""

There are a few things in the prompt that confuse me:
(1) It seems that this prompt is designed for unsuccessful trials (in the prompt: "You were unsuccessful in answering the question either because ..."). What about successful trials? Does this system reflect on successful trials and summarize the successful experiences?
(2) For the {examples} part, I think it comes from REFLECTIONS in fewshots.py (https://github.com/noahshinn024/reflexion/blob/612e616603650397d4060117de4578658626deb1/hotpotqa_runs/fewshots.py#L68C1-L106C4). However, REFLECTIONS only shows the Question, Actions, Thoughts, and the corresponding Reflection; it doesn't show the results of these actions. That means the LLM doesn't know whether the actions were successful or not. Then how can the LLM reflect on them?
(3) In the prompt, there is nothing after "Previous trial:", and both {question} and {scratchpad} follow "Question:" (https://github.com/noahshinn024/reflexion/blob/612e616603650397d4060117de4578658626deb1/hotpotqa_runs/prompts.py#L121C1-L122C33). Is this format correct?

[Feature Request]: Gymnasium compatibility

Hi, I was wondering if it would be possible to add compatibility with Gymnasium. It is a maintained fork of openai gym and is designed as a drop-in replacement (import gym -> import gymnasium as gym), which maintains backwards compatibility. Beyond just bugfixes, many RL environments have updated or are only available in Gymnasium (see Third Party Environments) as well as most RL training libraries (rllib, tianshou, CleanRL, stable-baselines3).

It looks like you currently use Gym v26, which is the most recent version, so it should be very simple to update to Gymnasium. I see the Leetcode-Hard Gym environment uses Gym as well, but from a quick look it doesn't seem like it would be difficult to update either.

We would love to add an official tutorial using Reflexion as a way of interacting with Gymnasium environments (or multi-agent PettingZoo envs, as we already have a LangChain tutorial for example).

I would also personally be interested in helping add examples to this repo testing out standard Gymnasium or PettingZoo environments. I don't think it would take much modification, but I'm making this issue first to see if it's something you'd be interested in having contributors help with.

For information about upgrading and compatibility, see the migration guide and gym compatibility docs. The main difference between v26 and Gymnasium, I believe, is that env.reset() no longer takes the return_info argument, but it seems your codebases don't use that anyway.
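
For reference, a minimal sketch of the drop-in migration the request describes, using a standard Gymnasium environment (not code from this repo):

# Before: import gym
import gymnasium as gym  # drop-in replacement

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)  # reset() returns (obs, info) in Gymnasium

terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
env.close()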

I'm not getting any results in the native webshop environment. Can you please help me understand what might be wrong?

For the webshop_runs, I set up a native WebShop env. After I run run_reflexion.sh, I get zero results for all instructions.
(Screenshots attached: the data directory and the web_agent_site.utils.py file.)

I followed the WebShop (https://github.com/princeton-nlp/WebShop) setup.
In step 5, I executed ./setup.sh -d all.
After setup, I ran ./run_dev.sh and the server started. Then I ran web_shop_runs/run_reflexion.sh (in which I substituted the URL in webshop_trial.py with my local URL).

Here is one line of my user_session_logs:
{"page": "search_results", "url": "http://10.28.34.92:6666/search_results/fixed_0/bed/1", "goal": {"asin": "B09KCMTNQP", "category": "beauty", "query": "hair extensions, wigs & accessories", "name": "Long Curly Wave Clip in Hair Extensions for Women 20" Synthetic Natural Looking Colorful Fashion Hair Styling Accessories", "product_category": "Beauty & Personal Care \u203a Hair Care \u203a Hair Extensions, Wigs & Accessories \u203a Hairpieces", "instruction_text": "i need a long clip-in hair extension which is natural looking, and price lower than 30.00 dollars", "attributes": ["hair extensions"], "price_upper": 30.0, "goal_options": [], "weight": 1}, "content": {"keywords": ["bed"], "search_result_asins": [], "page": 1}}
In all my search results, the size of search_result_asins is zero.

Can you tell me what I did wrong? Thank you!

Add support for MultiPL-E

MultiPL-E translates HumanEval to multiple other programming languages. It allows more diverse evaluation and has lower chances of being contained in training sets.

label leaks may happen?

Hi Noah, I'm reproducing your work. Generally, I view Reflexion as a kind of in-context few-shot SFT/RL, which requires supervised signals (either from the environment or from labels). However, in your code, the evaluation on HotPotQA seems to directly use the validation-set label as this supervised signal, which means label leakage happens. I'm pretty confused here.
Did you run experiments on whether reflection on training samples generalizes to the validation samples? Or did I understand your idea correctly?

Actor in webshop does not take in the memory and reflexion

hi there,

I am a bit confused about the reflexion for webshop.
In the code here, at line 245 of
https://github.com/noahshinn/reflexion/blob/main/webshop_runs/webshop_trial.py
the LLM actor only takes in base_prompt + prompt, which is the trajectory at the current step (the left yellow block in the figure).
(figure attached)

However, it seems that the LLM actor does not take in env_history (the right yellow block in the figure), which contains the memory and reflexion from previous trials (if trial > 1).

May I know if I am missing something?

If this LLM actor really does not take in the memory, could that explain why there is no gain on the webshop task, as reported in the paper?
thanks.

HotpotQA oracle evaluator

Hi, I appreciate your excellent work! I want to ask about the hotpotQA setting. It appears that your framework relies on the oracle evaluator (gold label) to determine whether self-reflection should occur. I am curious about how this framework operates during inference time when the gold label is unavailable. Thank you once again.

About reflexion temperature(for HotpotQA)

1. For the HotpotQA task, you used temperature=0 to compare ReAct-only against ReAct + Reflexion.
Is the temperature 0 because this is a question-answering task?
At the same time, does this guarantee that each ReAct-only trial will produce a definite, unchanged answer?

2. Have you compared the effect when the temperature is not 0 (such as 0.7 or even 1)?
I currently suspect that, for example, with Trial = 5, the effect of ReAct-only might exceed that of ReAct + Reflexion.
What does the author think, or have similar experiments been done?

Clarification on ReflexionStrategy Implementation

I hope this message finds you well. I am currently exploring the ReflexionStrategy in your project located at https://github.com/noahshinn024/reflexion in the hotpotqa_runs/notebook section.
Specifically, I am interested in understanding the behavior of the strategy when it comes to the "LAST_ATTEMPT_AND_REFLEXION" mode.
According to the code at print(ReflexionStrategy.__doc__), the description for "LAST_ATTEMPT_AND_REFLEXION" is as follows:
LAST_ATTEMPT_AND_REFLEXION: Use last reasoning trace in context and apply reflexion to the next reasoning trace. However, upon examining the implementation in agent.py at line 116, it appears that the strategy does not combine the results of reflexion for each iteration but only considers the last reflexion results of the previous trace.
Could you kindly clarify if this is indeed the intended behavior, or if there may be a misunderstanding in the description? It seems to me that the strategy should be named "LAST_ATTEMPT_AND_LAST_REFLEXION" if it only takes the last reflexion results of the previous trace into account.
I appreciate your assistance in clarifying this aspect of the ReflexionStrategy implementation. Thank you for your time and support.

Interpreting results files

Hi, super interesting work here! I was wondering how to interpret the results files in root. Initially I thought is_solved meant correct or not, and that would indeed give 87.8% for Reflexion with GPT-4; but then the equivalent file without Reflexion gets 81.7, when my impression is that it should be 67. How should I be interpreting the columns, and if is_solved != correct, how do I check correctness? Also, I see the reflections for the runs where is_solved is False, but not the predicted (incorrect) solutions; how can I see those?

Prompt for Llama-2-7b-chat-hf model

I tried the Llama-2-7b-chat-hf model on the programming tasks using the prompt for the CodeLlama model. The dataset I used is mbpp-py.jsonl. But it seems the prompt doesn't suit the Llama-2-7b-chat-hf model very well, and it gives me an error:

GENERATED FUNC BODY
None
Traceback (most recent call last):
File "/home/user/botao/reflexion-main/programming_runs/my_main_llama2.py", line 128, in
main(args)
File "/home/user/botao/reflexion-main/programming_runs/my_main_llama2.py", line 111, in main
run_strategy(
File "/home/user/botao/reflexion-main/programming_runs/my_main_llama2.py", line 51, in kwargs_wrapper
return func(**kwargs)
File "/home/user/botao/reflexion-main/programming_runs/reflexion.py", line 75, in run_reflexion
assert isinstance(cur_func_impl, str)
AssertionError

Is there a better prompt for this model?

what's the difference between COT_INSTRUCTION & COT_AGENT_REFLECT_INSTRUCTION in prompts.py

in prompts.py:

COT_INSTRUCTION = """Solve a question answering task by having a Thought, then Finish with your answer. Thought can reason about the current situation. Finish[answer] returns the answer and finishes the task. You will be given context that you should use to help you answer the question.
Here are some examples:
{examples}
(END OF EXAMPLES)
{reflections}
Relevant Context: {context} 
Question: {question}{scratchpad}"""
COT_AGENT_REFLECT_INSTRUCTION = """Solve a question answering task by having a Thought, then Finish with your answer. Thought can reason about the current situation. Finish[answer] returns the answer and finishes the task. You will be given context that you should use to help you answer the question.
Here are some examples:
{examples}
(END OF EXAMPLES)

{reflections}

Relevant Context: {context}
Question: {question}{scratchpad}"""

They are almost identical. Even the usages of the two prompts are the same:

cot_agent_prompt = PromptTemplate(
                        input_variables=["examples", "reflections", "context", "question", "scratchpad"],
                        template = COT_INSTRUCTION,
                        )
cot_reflect_agent_prompt = PromptTemplate(
                        input_variables=["examples", "reflections", "context", "question", "scratchpad"],
                        template = COT_AGENT_REFLECT_INSTRUCTION,
                        )

So what's the difference between COT_INSTRUCTION & COT_AGENT_REFLECT_INSTRUCTION?

ModuleNotFoundError: No module named 'alfworld'

C:\Users\20834\anaconda3\envs\reflexion\python.exe F:\reflexion\alfworld_runs\main.py --num_trials 10 --num_envs 134 --run_name reflexion_run_logs --use_memory --model gpt-3.5-turbo
Traceback (most recent call last):
File "F:\reflexion\alfworld_runs\main.py", line 5, in
from alfworld_trial import run_trial
File "F:\reflexion\alfworld_runs\alfworld_trial.py", line 9, in
import alfworld
ModuleNotFoundError: No module named 'alfworld'

Which commit matches your published work (for HumanEval-rs)?

I'd like to run your code as it was implemented for the original paper. Would it be possible to add a link in the README that points to the specific commit representing the code used to achieve your results as published in the Reflexion paper? Thanks!

ValueError: AlfWorld Environment [expert_plan, expert_type]

    -----
    Starting run with the following parameters:
    Run name: reflexion_run_logs
    Number of trials: 1
    Number of environments: 1
    Use memory: True

    Sending all logs to `reflexion_run_logs`
    -----
    
Initializing AlfredTWEnv...
Checking for solvable games...
Overall we have 134 games
Evaluating with 134 games
Traceback (most recent call last):
  File "/reflexion/alfworld_runs/main.py", line 119, in <module>
    main(args)
  File "/reflexion/alfworld_runs/main.py", line 100, in main
    run_trial(trial_log_path, world_log_path, trial_idx, env_configs, args.use_memory, args.model)
  File "/reflexion/alfworld_runs/alfworld_trial.py", line 99, in run_trial
    env = env.init_env(batch_size=1)
  File "/reflexion/env_alfworld/lib/python3.9/site-packages/alfworld/agents/environment/alfred_tw_env.py", line 224, in init_env
    infos = textworld.EnvInfos(won=True, admissible_commands=True, expert_type=expert_type, expert_plan=expert_plan, extras=["gamefile"])
  File "/reflexion/env_alfworld/lib/python3.9/site-packages/textworld/core.py", line 109, in __init__
    raise ValueError(msg)
ValueError: Unknown information requested: ['expert_plan', 'expert_type']. Available information are: ['admissible_commands', 'command_templates', 'description', 'entities', 'extras', 'facts', 'fail_facts', 'feedback', 'game', 'intermediate_reward', 'inventory', 'last_action', 'last_command', 'location', 'lost', 'max_score', 'moves', 'objective', 'policy_commands', 'score', 'verbs', 'win_facts', 'won']

How to run the code

I want to run this project; please let me know how to run it and give me some guidance.

Content filtering using gpt3.5-turbo-16k

When I input the prompt below, GPT filters my content and nothing happens:

Interact with a household to solve a task. Here are two examples.
You are in the middle of a room. Looking quickly around you, you see a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 3, a diningtable 2, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a sidetable 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.
Your task is to: clean some apple and put it in sidetable.

think: To solve the task, I need to find and take an apple, then clean it with sinkbasin, then put it in sidetable.
OK.
think: First I need to find an apple. An apple is more likely to appear in fridges (1), diningtable (1-3), sidetable (1), countertop (1), sinkbasin (1), garbagecan (1). I can check one by one, starting with fridge 1.
OK.
go to fridge 1
The fridge 1 is closed.
open fridge 1
You open the fridge 1. The fridge 1 is open. In it, you see a lettuce 2, a mug 2, a potato 2, and a tomato 1.
go to diningtable 1
On the diningtable 1, you see nothing.
go to diningtable 2
On the diningtable 2, you see a pineapple 1.
go to diningtable 3
On the diningtable 3, you see a bowl 2, a bowl 1, a bread 2, and a bread 1.
go to sidetable 1
On the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.
go to countertop 1
On the countertop 1, you see a butterknife 1, a dishsponge 1, and a soapbottle 1.
go to sinkbasin 1
On the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.
go to garbagecan 1
On the garbagecan 1, you see a apple 3, and a egg 3.
think: Now I find an apple (3). Next, I need to take it.
OK.
take apple 3 from garbagecan 1
You pick up the apple 3 from the garbagecan 1.
think: Now I take an apple (3). Next, I need to go to sinkbasin (1) and clean it.
OK.
go to sinkbasin 1
On the sinkbasin 1, you see a cup 2, a egg 1, a mug 1, and a potato 3.
clean apple 3 with sinkbasin 1
You clean the apple 3 using the sinkbasin 1.
think: Now I clean an apple (3). Next, I need to put it in/on sidetable 1.
OK.
go to sidetable 1
On the sidetable 1, you see a cup 1, a lettuce 1, a peppershaker 3, a potato 1, and a saltshaker 1.
put apple 3 in/on sidetable 1
You put the apple 3 in/on the sidetable 1.
You are in the middle of a room. Looking quickly around you, you see a cabinet 13, a cabinet 12, a cabinet 11, a cabinet 10, a cabinet 9, a cabinet 8, a cabinet 7, a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 1, a diningtable 1, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.
Your task is to: put a clean lettuce in diningtable.
think: To solve the task, I need to find and take a lettuce, then clean it with sinkbasin, then put it in diningtable.
OK.
think: First I need to find a lettuce. A lettuce is more likely to appear in fridge (1), diningtable (1), sinkbasin (1), stoveburner (1-3), cabinet (1-13). I can check one by one, starting with fridge 1.
OK.
go to fridge 1
The fridge 1 is closed.
open fridge 1
You open the fridge 1. The fridge 1 is open. In it, you see a cup 3, a egg 2, a potato 3, and a potato 2.
go to diningtable 1
On the diningtable 1, you see a apple 1, a bread 1, a butterknife 2, a cup 2, a fork 2, a knife 2, a knife 1, a ladle 1, a lettuce 1, a mug 2, a mug 1, a pan 2, a peppershaker 1, a spatula 3, a tomato 2, and a tomato 1.
think: Now I find a lettuce (1). Next, I need to take it.
OK.
take lettuce 1 from diningtable 1
You pick up the lettuce 1 from the diningtable 1.
think: Now I take a lettuce (1). Next, I need to go to sinkbasin (1) and clean it.
OK.
go to sinkbasin 1
On the sinkbasin 1, you see a apple 2, a ladle 2, a spoon 1, and a tomato 3.
clean lettuce 1 with sinkbasin 1
You clean the lettuce 1 using the sinkbasin 1.
think: Now I clean a lettuce (1). Next, I need to put it in/on diningtable 1.
OK.
go to diningtable 1
On the diningtable 1, you see a apple 1, a bread 1, a butterknife 2, a cup 2, a fork 2, a knife 2, a knife 1, a ladle 1, a mug 2, a mug 1, a pan 2, a peppershaker 1, a spatula 3, a tomato 2, and a tomato 1.
put lettuce 1 in/on diningtable 1
You put the lettuce 1 in/on the diningtable 1.

Here is the task:
You are in the middle of a room. Looking quickly around you, you see a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1.
Your task is to: clean some spatula and put it in drawer.

think: To solve the task, I need to find and take a spatula, then clean it with the sinkbasin, and finally put it in the drawer.
OK.
think: First, I need to find a spatula. A spatula is more likely to appear on a countertop (1-3), in a drawer (1-3), or on a shelf (1-3). I can check one by one, starting with the countertop.
OK.
go to countertop 1
On the countertop 1, you see a bowl 1, a cup 1, a egg 2, a mug 2, and a peppershaker 1.
go to countertop 2
On the countertop 2, you see a creditcard 2, a dishsponge 1, a glassbottle 1, a knife 1, a lettuce 1, and a statue 1.
go to countertop 3
On the countertop 3, you see a bowl 2, a bread 1, a butterknife 2, a cellphone 1, a houseplant 1, a knife 2, a soapbottle 2, a spatula 1, a tomato 2, a vase 3, and a vase 2.
go to drawer 1
The drawer 1 is closed.
open drawer 1
You open the drawer 1. The drawer 1 is open. In it, you see a fork 1.
go to drawer 2
The drawer 2 is closed.
open drawer 2
You open the drawer 2. The drawer 2 is open. In it, you see a butterknife 1.
take butterknife 1 from drawer 2
You pick up the butterknife 1 from the drawer 2.
think: I found a butterknife (1), but I need a spatula. I should continue searching.
OK.
go to drawer 3
The drawer 3 is closed.
open drawer 3
You open the drawer 3. The drawer 3 is open. In it, you see a creditcard 1, and a saltshaker 1.
go to shelf 1
On the shelf 1, you see a dishsponge 2, and a peppershaker 3.
go to shelf 2
On the shelf 2, you see a creditcard 3, a cup 2, a glassbottle 2, and a saltshaker 2. >

WARNING:root:OpenAIError: The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766.

Script for leetcode results

Hi,

Thanks for your excellent work and code! I have the following two questions in hope to get your clarification.

  1. May I know the correct way to reproduce the reflexion results for the leetcode-hard-gym benchmark? Specifically, I tried run_reflexion_py_leet.sh, but I'm not sure what dataset_path to set. The default path does not exist, and the closest one is ./executors/leetcode_env/leetcode_dataset/build/leetcode-hard-uncontaminated-python3.jsonl, which throws an error "KeyError: 'visible_tests'".
  2. What does reflexion_ucs stand for?

Any help is very much appreciated!

Best,
Shenao

How to run a specific alfworld environment?

It seems like the ./run_simple.sh and ./run_reflexion.sh scripts run through all the environments one by one. It is not clear, however, where in the code the particular environment is selected. Does it happen in alfworld_trial.py line 109:

ob, info = env.reset()

?

I.e., how can we, for example, run only envs 5, 23, and 28?

Also, what splits are available, and how do they influence the environments?

Thank you.

Combine with Parsel for improving the state of the art?

Hi @noahshinn024, this paper's result is fascinating. If we look at the HumanEval leaderboard, it appears the second-best paper (Parsel) uses a combination of GPT-4 with the prior best paper on codegen, CodeT, which had previously brought a 20% absolute accuracy gain:
https://paperswithcode.com/paper/codet-code-generation-with-generated-tests

Therefore, my question about synergies is twofold:

  1. Could you, like Parsel, combine Reflexion with CodeT for even better accuracy?
  2. Could you combine Parsel with Reflexion for potentially even better accuracy than 1)?

Edit: I have skimmed the paper; you do mention CodeT and say that Reflexion also relies on test generation. Therefore my point might be moot, but 1) CodeT might generate tests in a better way than your in-house solution (unlikely), and 2) the innovation brought by Parsel is not mentioned in the paper and could allow record accuracy.

TL;DR: I believe Reflexion could go beyond 91% with Parsel.

Reproducing Alfworld Results

Hi,

Thanks for the great work. Unfortunately, we are unable to reproduce your results for ReAct / Reflexion on Alfworld.

E.g. Env0 & Env1 are successful for you, however, we always get failures on our end. (Other Envs are successful though, so it does work sometimes).

@noahshinn

"is_solved": false on all results

Do we need to adjust some parameters or anything? I tried running both run_reflexion.sh and run_simple.sh, but all of the tests fail.
