princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
Home Page: https://intercode-benchmark.github.io/
License: MIT License
Hi John,
Thanks for the great work. I'm following the steps for building from source and hit the following error when I run python run_demo.py sql:
Traceback (most recent call last):
File "/net/nfs2.mosaic/yuchenl/intercode/run_demo.py", line 7, in <module>
from experiments.policies import HumanPolicy
File "/net/nfs2.mosaic/yuchenl/intercode/experiments/policies.py", line 5, in <module>
from .utils import ACTION_PARSER_MAP, PROMPT_MAP, CompletionGPT, ChatGPT, PalmChat, PalmCompletion, HFChat
File "/net/nfs2.mosaic/yuchenl/intercode/experiments/utils/__init__.py", line 16, in <module>
from .palm_api import (
File "/net/nfs2.mosaic/yuchenl/intercode/experiments/utils/palm_api.py", line 15, in <module>
assert(api_key != None)
AssertionError
By the way, I also found that rpyc is not listed in the requirements, so I had to install it manually (a minor issue).
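For what it's worth, the assertion fires before any request is made, so the fix is just exporting the key the module expects. A minimal sketch of that fail-fast pattern (the variable name PALM_API_KEY and the helper are my assumptions, not necessarily what palm_api.py actually reads):

```python
import os

def load_api_key(var: str = "PALM_API_KEY") -> str:
    # Raise a descriptive error instead of a bare AssertionError.
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running run_demo.py")
    return key

os.environ["PALM_API_KEY"] = "dummy-key"  # illustration only; use a real key
print(load_api_key())
```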
Hi,
I was trying to run the tests using pytest and realized that many of the data files the tests depend on do not exist in the repository. Would it be possible to include them as well?
For example: ./data/test/bash_queries.json, which is required by tests/test_env_bash.py.
I am planning to use InterCode as an isolated execution environment for my cybersecurity-related competition. It would be very helpful if you could include the test data files in the repo.
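Until those data files ship, one stopgap (a sketch; the fixture path comes from the issue above, the test class and method names are hypothetical) is to skip data-dependent tests whenever the fixture is absent, e.g. with stdlib unittest:

```python
import json
import pathlib
import unittest

DATA = pathlib.Path("./data/test/bash_queries.json")

class TestEnvBash(unittest.TestCase):
    @unittest.skipUnless(DATA.exists(), f"missing fixture: {DATA}")
    def test_queries_load(self):
        # Only runs when the fixture exists in the checkout.
        queries = json.loads(DATA.read_text())
        self.assertGreater(len(queries), 0)

# Run with: python -m unittest <module name>
```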
I ran eval_n_turn.py to reproduce the single-turn handicap SQL results:
python -m experiments.eval_n_turn \
--data_path ./data/sql/spider/ic_spider_dev.json \
--dialogue_limit 5 \
--env sql \
--image_name docker-env-sql \
--log_dir logs/experiments \
--max_turns 1 \
--policy chat \
--template game_sql \
--model gpt-3.5-turbo \
--handicap \
--verbose
I use this script to compute the success rate:
import json

result_file_path = './logs/experiments/ic_sql_multiturn_gpt-3.5-turbo_1_turns.json'
with open(result_file_path, 'r') as f:
    data = json.load(f)

result = {key: {'success': 0, 'total': 0} for key in ['easy', 'medium', 'hard', 'extra', 'all']}
for index in data.keys():
    if data[index]['summary']['max_reward'] == 1.0:
        result[data[index]['hardness']]['success'] += 1
        result['all']['success'] += 1
    result[data[index]['hardness']]['total'] += 1
    result['all']['total'] += 1

for key in result.keys():
    success = result[key]['success']
    total = result[key]['total']
    print(f"{key} Success rate: {success}/{total} ({success/total:.2%})")
and get this result:
easy Success rate: 202/248 (81.45%)
medium Success rate: 281/446 (63.00%)
hard Success rate: 75/174 (43.10%)
extra Success rate: 37/166 (22.29%)
all Success rate: 595/1034 (57.54%)
It is lower than the result in the paper.
Did I do something wrong?
I also ran eval_n_turn.py to reproduce the single-turn SQL results (without the handicap).
python -m experiments.eval_n_turn \
--data_path ./data/sql/spider/ic_spider_dev.json \
--dialogue_limit 5 \
--env sql \
--image_name docker-env-sql \
--log_dir logs/experiments \
--max_turns 1 \
--policy chat \
--template game_sql \
--model gpt-3.5-turbo
Result is here:
easy Success rate: 41/248 (16.53%)
medium Success rate: 28/446 (6.28%)
hard Success rate: 3/174 (1.72%)
extra Success rate: 2/166 (1.20%)
all Success rate: 74/1034 (7.16%)
Did I do something wrong?
In run_sql.py, you initialize the policy with policy = HumanPolicy().
Shouldn't it be ChatGPTPolicy?
@john-b-yang I just wanted to know before I invest time checking whether it's still up to date and spinning up a Hugging Face endpoint to eval a model.
Thanks!
Hi,
Thanks for your contribution and the excellent benchmark! However, the results from running eval_n_turn with GPT-3.5 and n=10 don't match the Try Again baseline reported in Table 2 of your paper. I tested in the Bash environment, and the success rate in Bash 2 is 15%. Is this due to internal changes in the OpenAI API? eval_n_turn with GPT-3.5 and n=20 roughly matches the results your paper reports for n=10.
Best,
Shenao
Thanks for releasing the environment code; this is really nice work!
During my re-implementation of the experiments in your paper, I was not sure what system prompt you use for the NL2Bash single-turn code generation.
When I run the following script (SQL call):
python -m experiments.eval_n_turn \
--data_path ./data/sql/spider/ic_spider_dev.json \
--dialogue_limit 5 \
--env sql \
--image_name docker-env-sql \
--log_dir logs/experiments \
--max_turns 10 \
--policy chat \
--template game_sql \
--model gpt-3.5-turbo \
--handicap \
--verbose
It throws an exception telling me the docker image is not found:
`(intercode) user@ubuntu:~/botao/intercode-master$ python main.py
Traceback (most recent call last):
File "/home/user/botao/intercode-master/main.py", line 11, in <module>
from experiments.policies import (
File "/home/user/botao/intercode-master/experiments/policies.py", line 5, in <module>
from .utils import ACTION_PARSER_MAP, PROMPT_MAP, CompletionGPT, ChatGPT, PalmChat, PalmCompletion, HFChat
File "/home/user/botao/intercode-master/experiments/utils/__init__.py", line 20, in <module>
from .open_api import (
File "/home/user/botao/intercode-master/experiments/utils/open_api.py", line 13, in <module>
assert(access_token)
AssertionError
(intercode) user@ubuntu:~/botao/intercode-master$ vim experiments/utils/open_api.py
(intercode) user@ubuntu:~/botao/intercode-master$ python main.py
Traceback (most recent call last):
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/api/client.py", line 265, in _raise_for_status
response.raise_for_status()
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.43/images/docker-env-sql/json
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/botao/intercode-master/main.py", line 215, in <module>
expr_wrapper = ExperimentWrapper(args)
File "/home/user/botao/intercode-master/main.py", line 55, in __init__
self.env = SqlEnv(image_name=args.image_name,
File "/home/user/botao/intercode-master/intercode/envs/sql/sql_env.py", line 27, in __init__
super(SqlEnv, self).__init__(image_name, **kwargs)
File "/home/user/botao/intercode-master/intercode/envs/ic_env.py", line 77, in __init__
self.container = get_container(self.container_name, self.image_name, **kwargs)
File "/home/user/botao/intercode-master/intercode/utils/utils.py", line 48, in get_container
image = client.images.get(image_name)
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/models/images.py", line 333, in get
return self.prepare_model(self.client.api.inspect_image(name))
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/api/image.py", line 251, in inspect_image
return self._result(
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/api/client.py", line 271, in _result
self._raise_for_status(response)
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/api/client.py", line 267, in _raise_for_status
raise create_api_error_from_http_exception(e) from e
File "/home/user/anaconda3/envs/intercode/lib/python3.9/site-packages/docker/errors.py", line 39, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation) from e
docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.43/images/docker-env-sql/json: Not Found ("No such image: docker-env-sql:latest")`
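The ImageNotFound error means the docker-env-sql image was never built locally. A quick way to check, as a hedged sketch (the helper name is mine; it shells out to the docker CLI, and the setup-script name in the comment is an assumption about the repo layout):

```python
import subprocess

def image_exists(name: str) -> bool:
    """Return True if a local Docker image with this tag exists (uses the docker CLI)."""
    try:
        proc = subprocess.run(
            ["docker", "image", "inspect", name],
            capture_output=True,
        )
    except FileNotFoundError:  # docker CLI not installed
        return False
    return proc.returncode == 0

# Usage: rebuild the environment images if the tag is missing, e.g.
# if not image_exists("docker-env-sql"):
#     subprocess.run(["bash", "setup.sh"], check=True)  # script name assumed
```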
When I run this script to test NL2Bash, I get the error: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown
python -m experiments.eval_n_turn \
--data_path ./data/nl2bash/nl2bash_fs_2.json \
--dialogue_limit 7 \
--env bash \
--image_name intercode-nl2bash \
--log_dir logs/experiments \
--max_turns 10 \
--policy chat \
--template v2 \
--model gpt-3.5-turbo \
--verbose
This is the log
Query 37: Recursively find the latest modified file in the /system directory
- Turn 0
-- Action: find /system -type f -printf '%t@ %p\n' | sort -n | tail -1 | cut -d' ' -f2
-- Observation: find: '/system': No such file or directory
- Turn 1
-- Action: find / -type d -name "system" 2>/dev/null | xargs -I {} find {} -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2
-- Observation: Command timed out
- Turn 2
-- Action: find / -type f -name "*" -printf '%T@ %p\n' 2>/dev/null | sort -n | tail -1 | cut -d' ' -f2
-- Observation:
ERRO[2024-03-19T22:23:28.934840930+08:00] Error running exec 4f326e717202bfe9efbd2065082594160d5013167216f0c6616b67b8548fb118 in container: exec a
ttach failed: error attaching stdout stream: write unix /var/tmp/xdg_runtime_dir_zhoupeng/docker.sock->@: write: broken pipe
- Turn 3
-- Action: find /system -type f -printf '%T@ %p\n' 2>/dev/null | sort -n | tail -1 | cut -d' ' -f2
-- Observation:
- Turn 4
-- Action: find /system -type f -exec stat --format='%Y : %n' {} + 2>/dev/null | sort -n | tail -1 | cut -d' ' -f3
-- Observation:
- Turn 5
-- Action: find /system -type f -exec stat --format='%Y : %n' {} + 2>/dev/null | sort -n | tail -1 | cut -d' ' -f3
-- Observation:
- Turn 6
-- Action: find /system -type f -exec stat --format='%Y : %n' {} + 2>/dev/null | sort -n | tail -n 1 | cut -d' ' -f3
-- Observation:
- Turn 7
-- Action: find /system -type f -exec stat --format='%Y : %n' {} + 2>/dev/null | sort -nr | head -n 1 | cut -d' ' -f3
-- Observation:
- Turn 8
-- Action: find /system -type f -printf '%T@ %p\n' 2>/dev/null | sort -n | tail -n 1 | cut -d' ' -f2-
-- Observation:
ERRO[2024-03-19T22:23:40.527309951+08:00] Error running exec 3c3630d421ca422ba88b6ed33fe82fea20d32c761077d6c5a82ad64a078d31a8 in container: exec a
ttach failed: error attaching stderr stream: write unix /var/tmp/xdg_runtime_dir_zhoupeng/docker.sock->@: write: broken pipe
ERRO[2024-03-19T22:23:40.905832218+08:00] stream copy error: reading from a closed fifo
ERRO[2024-03-19T22:23:40.905902049+08:00] stream copy error: reading from a closed fifo
ERRO[2024-03-19T22:23:40.907188057+08:00] Error running exec 5b84a6bf90ddd458529cb7e96f52eec3b8091e42d36e2e4288adf6737c4bbfb2 in container: OCI ru
ntime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown
ERRO[2024-03-19T22:23:41.059118502+08:00] stream copy error: reading from a closed fifo
ERRO[2024-03-19T22:23:41.063127206+08:00] stream copy error: reading from a closed fifo
ERRO[2024-03-19T22:23:41.064658406+08:00] Error running exec 3722625ff81b2a28f3c2735dd61f113f7e44093acbdb535d27d201b2f3e99400 in container: OCI ru
ntime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown
- Turn 9
-- Action: find /system -type f -exec stat --format='%Y : %n' {} + 2>/dev/null | sort -nr | head -n 1 | cut -d' ' -f3
-- Observation: OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or
directory: unknown
Query 37 Finished
-Reward: 0.8200000000000001 -Turns: 10
ERRO[2024-03-19T22:23:41.211589420+08:00] stream copy error: reading from a closed fifo
ERRO[2024-03-19T22:23:41.211596422+08:00] stream copy error: reading from a closed fifo
ERRO[2024-03-19T22:23:41.213089845+08:00] Error running exec 285b36c54baa3d5cc84ab194a215470986d9915ee9f3d4c251db947a4e48ae5d in container: OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown
INFO[2024-03-19T22:23:51.243086638+08:00] Container failed to exit within 10s of signal 15 - using the force container=56653d248d010916f88eb3562c
bd5e261191b32415f8fbf37b3ad13a1453c992
ERRO[2024-03-19T22:23:51.275239748+08:00] Error running exec a19b9f9275762b661d1703acdfff36073cf3be170b8a9442d85df9af7dac33a2 in container: exec a
ttach failed: error attaching stderr stream: write unix /var/tmp/xdg_runtime_dir_zhoupeng/docker.sock->@: write: broken pipe
ERRO[2024-03-19T22:24:01.273464072+08:00] Container failed to exit within 10s of kill - trying direct SIGKILL container=56653d248d010916f88eb3562
cbd5e261191b32415f8fbf37b3ad13a1453c992 error="context deadline exceeded"
ERRO[2024-03-19T22:24:04.460413534+08:00] Error running exec 8aac40fe9cf45cfed74caaf8fc7ac7a5f343d17ee0164e9c13ec9e0da19d4b71 in container: exec attach failed: error attaching stderr stream: write unix /var/tmp/xdg_runtime_dir_zhoupeng/docker.sock->@: write: broken pipe
INFO[2024-03-19T22:24:04.477007128+08:00] ignoring event container=56653d248d010916f88eb3562cbd5e261191b32415f8fbf3
7b3ad13a1453c992 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2024-03-19T22:24:04.476982691+08:00] shim disconnected id=56653d248d010916f88eb3562cbd5e261191b32415f8fbf37b3ad13
a1453c992 namespace=moby
WARN[2024-03-19T22:24:04.477138964+08:00] cleaning up after shim disconnected id=56653d248d010916f88eb3562cbd5e261191b32415f8fbf37b3ad13a1453c992 namespace=moby
INFO[2024-03-19T22:24:04.477158816+08:00] cleaning up dead shim namespace=moby
INFO[2024-03-19T22:24:14.728170504+08:00] Container failed to exit within 10s of signal 15 - using the force container=9ad3cbf1df1f6950a2e24fd172
a4a047ed593fd08bc94ac31cb023a93810d149
INFO[2024-03-19T22:24:14.825408778+08:00] ignoring event container=9ad3cbf1df1f6950a2e24fd172a4a047ed593fd08bc94ac3
1cb023a93810d149 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2024-03-19T22:24:14.825971676+08:00] shim disconnected id=9ad3cbf1df1f6950a2e24fd172a4a047ed593fd08bc94ac31cb023a93810d149 namespace=moby
WARN[2024-03-19T22:24:14.826073440+08:00] cleaning up after shim disconnected id=9ad3cbf1df1f6950a2e24fd172a4a047ed593fd08bc94ac31cb023a
93810d149 namespace=moby
INFO[2024-03-19T22:24:14.826121839+08:00] cleaning up dead shim namespace=moby
WARN[2024-03-19T22:24:15.498477176+08:00] cleanup warnings time="2024-03-19T22:24:15+08:00" level=warning msg="failed to remove runc container" er
ror="runc did not terminate successfully: exit status 255: " runtime=io.containerd.runc.v2 namespace=moby
Traceback (most recent call last):
File "/home/zhoupeng/miniconda3/envs/intercode/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zhoupeng/miniconda3/envs/intercode/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/zhoupeng/project/LLM/Code_LLM_Survey/intercode/experiments/eval_n_turn.py", line 212, in <module>
expr_wrapper.run_expr()
File "/home/zhoupeng/project/LLM/Code_LLM_Survey/intercode/experiments/eval_n_turn.py", line 94, in run_expr
self.env.reset(idx)
File "/home/zhoupeng/project/LLM/Code_LLM_Survey/intercode/intercode/envs/ic_env.py", line 142, in reset
self.reset_container()
File "/home/zhoupeng/project/LLM/Code_LLM_Survey/intercode/intercode/envs/bash/bash_env.py", line 37, in reset_container
raise RuntimeError(f"Failed to reset `{self.container_name}` container successfully: {output}")
RuntimeError: Failed to reset `intercode-nl2bash_ic_ctr` container successfully: b'OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory: unknown\r\n'
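The "stat /bin/bash: no such file or directory" failures suggest the intercode-nl2bash image does not ship bash (or the wrong image got built). One way to probe which shell an image actually contains, as a sketch (helper name mine; requires the docker CLI):

```python
import subprocess

def container_shell(image: str):
    """Return the first shell present in the image, or None if neither is found."""
    for shell in ("/bin/bash", "/bin/sh"):
        try:
            proc = subprocess.run(
                ["docker", "run", "--rm", image, "ls", shell],
                capture_output=True,
            )
        except FileNotFoundError:  # docker CLI not installed
            return None
        if proc.returncode == 0:
            return shell
    return None

# Usage: container_shell("intercode-nl2bash") should return "/bin/bash"
# if the image was built correctly.
```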
Currently the latest release on PyPI is from June 2023:
https://pypi.org/project/intercode-bench/0.1.22/
Can you upload the new 1.0.1 release containing the rewritten CTF environment?
Hi Authors,
Thanks for building this environment, that's a really great contribution.
I was wondering if it's possible to extend the codebase and either get rid of the dependency on Docker or make it compatible with other container technologies (such as Apptainer)?
The reason I'm asking is that Docker is not available on the Canadian cluster (and probably on some other clusters too) because of its security risks (https://docs.alliancecan.ca/wiki/Apptainer#Other_Linux_container_technologies).
Thank you.
Hi,
Thanks for building this environment, that's a really great contribution!
I am just wondering why there are no MBPP results in the paper and on the leaderboard?
Best,
Jiyang Zhang
Hello, thanks for your outstanding work!
When I run scripts/expr_react.sh:
python -m experiments.eval_react \
--data_path ./data/sql/spider/ic_spider_dev.json \
--env sql \
--image_name docker-env-sql \
--log_dir logs/experiments/sql_gpt3test \
--max_turns 5 \
--verbose
I encounter the error:
Traceback (most recent call last):
File "/data/home/miniconda3/envs/intercode/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/home/miniconda3/envs/intercode/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/home/data_creation/intercode/experiments/eval_react.py", line 195, in <module>
expr_wrapper.run_expr()
File "/data/home/data_creation/intercode/experiments/eval_react.py", line 101, in run_expr
self.env.reset(idx)
File "/data/home/data_creation/intercode/intercode/envs/ic_env.py", line 150, in reset
raise RuntimeError(f"Preprocess command failed to execute successfully: {self.preprocess(self.record)}")
RuntimeError: Preprocess command failed to execute successfully: ['use poker_player']
It seems that the problem is caused by function:
def preprocess_sql(record: Dict) -> List:
    db = record["db"]
    print(f"db {db}")
    return [f"use {db}"]
And the error is raised in
if self.preprocess is not None:
    preprocess_cmds = self.preprocess(self.record)
    for cmd in preprocess_cmds:
        self.exec_action(cmd)
        if not self.info[ACTION_EXEC]:
            raise RuntimeError(f"Preprocess command failed to execute successfully: {self.preprocess(self.record)}")
Additionally, I find that the function exec_action is not implemented, and that self.info ends up as {'action_executed': False}:
if self.preprocess is not None:
    # self.info is {}
    preprocess_cmds = self.preprocess(self.record)
    for cmd in preprocess_cmds:
        self.exec_action(cmd)
        # self.info is {'action_executed': False}
        if not self.info[ACTION_EXEC]:
            raise RuntimeError(f"Preprocess command failed to execute successfully: {self.preprocess(self.record)}")
So how can I solve the problem, or is there anything I missed? I'd appreciate your response. :-)
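To pinpoint which preprocess command fails and why, one option is to surface the per-command result instead of re-running self.preprocess in the error message. A debugging sketch against the ic_env.py interface quoted above (the function name is mine, and env.observation is my assumption for where the container's output lands):

```python
def run_preprocess(env):
    """Execute each preprocess command and report the first failure verbosely.

    Sketch only: assumes env.preprocess, env.record, env.exec_action, and
    env.info['action_executed'] behave as in the ic_env.py snippet above.
    """
    for cmd in env.preprocess(env.record):
        env.exec_action(cmd)
        if not env.info.get("action_executed", False):
            raise RuntimeError(
                f"Preprocess command failed: {cmd!r}; "
                f"observation: {getattr(env, 'observation', None)!r}"
            )
```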
You have LangChain listed in the requirements (i.e. environment.yml), but I cannot find any use of it in the code. Do you use it anywhere? Also, is there any way to integrate InterCode with LangChain? If so, please describe how you would do this.
I want to access the JSON results of trajectories on the Bash dataset.
However, the ./data/results directory, which should contain the experiment results, is missing, and the Google Drive link does not contain files for the Bash dataset. The ./data/docker/bash_scripts/setup_nl2b_fs_*.sh scripts, which should contain the file system definitions, are missing too :(
Would you please provide me with the JSON trajectory results for the Bash dataset, or the setup_nl2b_fs_*.sh scripts, which would help me reproduce the results?