ruixiangcui / agieval
License: MIT License
I've noticed that the current code uses the expression demo + question. However, I believe the correct expression should be demo + question_input. With demo + question, the previously defined question_input is never used, and some multiple-choice questions may lack options in the prompt. Please consider updating the code to reflect this change for proper functionality. Thank you!
https://github.com/microsoft/AGIEval/blob/main/src/dataset_loader.py#L215
In the sat-math corpus, some questions are incomplete, which can leave too little information to solve them. For example:
{"passage": "", "question": "Which of the following is equivalent to the expression above?" ...
The few-shot prompts in the gaokao-geography dataset look like this:
{'passage': None, 'question': '在某城市中心,一种创新型绿色建筑一垂直森林高层住宅落成面世。它是在建筑的垂直方向上,覆盖满本地乔木、灌木和草本等植物,为每层住户营造“空中花园”,形成具有森林效应的生态居住群落。与传统设计相比,“垂直森林”在居住空间设计上变化最大的地方是( )', 'options': ['A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房'], 'label': 'A', 'answer': None, 'other': {'source': '2022年湖北省高考地理试题'}}
It should be
'options': ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']
The README only gives the average score for qwen1.5-14b. Could you please provide the detailed eval results?
The options are wrong in this record:
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L108
{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是( )", "options": ["錀 第七周期", "镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}
It should be
{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是( )", "options": ["(A)錀 第七周期", "(B)镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}
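Records like the one above could be found automatically. The following is a minimal sketch, assuming every option is expected to start with "(A)", "(B)", ... in positional order; the helper name mislabeled_option_indices is hypothetical and not part of the repository.

```python
import re


def mislabeled_option_indices(options):
    """Return indices of options that lack the expected '(A)', '(B)', ... prefix."""
    bad = []
    for i, option in enumerate(options):
        expected = chr(ord("A") + i)  # position 0 -> 'A', 1 -> 'B', ...
        match = re.match(r"\(([A-Z])\)", option)
        if match is None or match.group(1) != expected:
            bad.append(i)
    return bad


# The record reported above: the first two options have no label prefix.
reported = ["錀 第七周期", "镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"]
print(mislabeled_option_indices(reported))
```

Running the same check over each line of a .jsonl file would list every record whose options need attention.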
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L75
{"passage": null, "question": "水溶液呈酸性的是( $)$", "options": ["(A)$\\mathrm{NaCl}$", "(B)$\\mathrm{NaHSO}_{4}$", "(C)HCOONa", "(D)$\mathrm{NaHCO}_{3}"], "label": "B", "answer": null, "other": {"source": "2020年浙江省高考化学【7月】"}}
Option D is missing a backslash (\mathrm instead of \\mathrm).
If you inspect aqua-rat.jsonl (and other datasets), there are unicode escape sequences throughout the data.
{"passage": null, "question": "A car is being driven, in a straight line and at a uniform speed, towards the base of a vertical tower. The top of the tower is observed from the car and, in the process, it takes 10 minutes for the angle of elevation to change from 45\u00b0 to 60\u00b0. After how much more time will this car reach the base of the tower?", "options": ["(A)5(\u221a3 + 1)", "(B)6(\u221a3 + \u221a2)", "(C)7(\u221a3 \u2013 1)", "(D)8(\u221a3 \u2013 2)", "(E)None of these"],
This can be prevented by going back to the script that originally wrote out the data and passing ensure_ascii=False to json.dumps, then encoding the result as UTF-8 before writing it to a file opened in binary mode, like so:
f.write((json.dumps(row, ensure_ascii=False) + '\n').encode('utf8'))
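An alternative that avoids manual encoding is to open the output file in text mode with an explicit UTF-8 encoding. This is a minimal sketch, not the script actually used to build the dataset; write_jsonl and the sample row are illustrative assumptions. Note that the escaped form (\u221a and so on) is still valid JSON and decodes to the same strings, so the change only affects how readable the raw file is.

```python
import json


def write_jsonl(rows, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps characters such as '√' and '°' as-is
            # instead of emitting \uXXXX escape sequences.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Reading the file back with json.loads on each line returns the same rows either way.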
There are several problems in logiqa-zh, e.g.
[ "A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术" ]
and it should be
[ "(A)没有党参", "(B)没有首乌", "(C)有白术", "(D)没有白术" ]
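A clean-up pass for options like these could look as follows. It is a rough sketch under stated assumptions: options are already in the correct order, tab-joined entries should be split, and any leading "(A)", "A.", "A、" or "A " marker can be replaced by a positional "(A)" label. The name normalize_options is hypothetical. The bare-letter form is ambiguous for English text (an option beginning with "A car ..." would lose its first word), so this is only meant for the Chinese-language sets.

```python
import re

# Leading label forms: "(A)", "A." / "A．" / "A、", or a bare letter followed by whitespace.
_LABEL = re.compile(r"^\s*(?:\([A-Z]\)|[A-Z][.．、]|[A-Z](?=\s))\s*")


def normalize_options(options):
    """Split tab-joined options and relabel them as '(A)...', '(B)...' by position."""
    parts = []
    for entry in options:
        parts.extend(p for p in entry.split("\t") if p.strip())
    normalized = []
    for i, part in enumerate(parts):
        text = _LABEL.sub("", part.strip(), count=1)  # drop any existing label
        normalized.append(f"({chr(ord('A') + i)}){text}")
    return normalized


print(normalize_options(["A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术"]))
```

Because labels are assigned by position, the function would silently mislabel a record whose options are out of order, so its output should still be spot-checked.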
https://github.com/ruixiangcui/AGIEval/blob/624021ed76ddb82046b97803ae95d0cb90c0738d/src/dataset_loader.py#L57C1-L57C44
prefix = "该问题为单选题,所有选项中必有一个正确答案,且只有一个正确答案。\n"
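This prefix (which tells the model that the question is single-choice and has exactly one correct answer) is never used.
I also found that jec-qa and gk-physics output only one choice with the chatglm2 model when running the multiple-choice task.
What kind of prompt makes the model output multiple choices?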
The correct answer should be ["A", "D"] rather than ["B", "C"], as stated in https://edu.sina.cn/gaokao/qsbk/2022-06-01/detail-imizirau6064163.d.html
How do I get the custum_api_name? Why am I getting the errors below?
multi-thread n = 3
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
The fifth problem in the gaokao-biology dataset has only 3 options, while the explanation refers to 4.
The model output is:
"model_output": "选项 (B)
Because gaokao-physics has multi-answer questions, the parser collects every uppercase letter in the output, so a correct answer can be scored as wrong.
import re

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # Matches any uppercase letter, with or without surrounding parentheses.
    pattern = r"\(*([A-Z])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []
Maybe we could restrict matches to a candidate answer list such as ["A", "B", "C", "D", "E", "F"] to reduce the probability of errors?
import re

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    # Only letters A-F are accepted as answer candidates.
    pattern = r"\(*([A-F])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []
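Restricting the character class still lets a capital letter inside an English word (for example the "A" in "Answer") count as an answer. One possible tightening is sketched below, assuming answers appear either in parentheses or as a standalone run of option letters. The function name parse_multiple_answer and the CANDIDATES constant are illustrative, not part of the repository, and a standalone English "A" (as in "A car") would still be a false positive.

```python
import re

CANDIDATES = "ABCDEF"  # assumed upper bound on option letters


def parse_multiple_answer(text, candidates=CANDIDATES):
    """Extract answer letters that are not part of a longer ASCII word."""
    # A run of candidate letters that is neither preceded nor followed by
    # another ASCII letter, e.g. "(B)", "AD", "答案是AD".
    pattern = rf"(?<![A-Za-z])[{candidates}]+(?![A-Za-z])"
    letters = []
    for run in re.findall(pattern, text):
        for ch in run:
            if ch not in letters:  # de-duplicate, keep first-seen order
                letters.append(ch)
    return letters


print(parse_multiple_answer("The answer is (A) and (D)"))
```

The lookarounds are what keep the "T" in "The" and the "A" in "Answer" out of the result; de-duplication avoids counting a letter twice when the model repeats its answer.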
Where do the Gaokao and SAT datasets come from?
I am interested in the human evaluation results, but there are only 4 pictures. Will the results (detailed or overall numeric results) be made public?
Thanks for your awesome work! I notice that Gaokao is an important part of your dataset, but most Gaokao papers are not freely available online. Could you please explain how you collected the Gaokao dataset? Thanks in advance :)
I have already downloaded the JEC-QA data, so how can I generate the post-processed data from it? Could you provide the official processing scripts?
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
There are about 7 multiple-answer questions in the gaokao-mathqa dataset, e.g.
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-mathqa.jsonl#L149
{"passage": null, "question": "函数 $f(x)=\\sin (2 x+\\varphi)(0<\\varphi<\\pi)$ 的图象以 $\\left(\\frac{2 \\pi}{3}, 0\\right)$ 中心对称, 则 ($\\quad$)\\\\\n", "options": ["(A)$y=f(x)$ 在 $\\left(0, \\frac{5 \\pi}{12}\\right)$ 单调递减", "(B)$y=f(x)$ 在 $\\left( -\\frac{\\pi}{12}, \\frac{11 \\pi}{12}\\right)$ 有 $2$ 个极值点", "(C)直线 $x= \\frac{7 \\pi}{6} $ 是一条对称轴", "(D)直线 $y= \\frac{\\sqrt{3}}{2} - x $ 是一条切线"], "label": "AD", "answer": null, "other": {"source": "2022年全国新高考II卷数学"}}
which doesn't match the label format in gaokao-physics, i.e. ["A", "D"].
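Until the data is made consistent, an evaluation script could accept both label forms. The following is a small sketch, assuming labels are either a string of letters ("AD") or a list of letters (["A", "D"]); normalize_label is a hypothetical helper name.

```python
def normalize_label(label):
    """Return the answer letters as a sorted list, e.g. 'AD' -> ['A', 'D']."""
    if label is None:
        return []
    # Iterating a string yields its characters; iterating a list yields its
    # items, whose characters are then collected the same way.
    letters = {ch for item in label for ch in item if "A" <= ch <= "Z"}
    return sorted(letters)


print(normalize_label("AD"), normalize_label(["D", "A"]))
```

Comparing normalize_label(prediction) with normalize_label(label) then treats "AD" and ["A", "D"] as the same answer.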
The gaokao-english dataset contains a dirty record.
The question is:
The engineer Camillo Oliver was 40 years old when he started the company in 1908. At his factory in Ivrea, he designed and produced the first Italian typewriter. Today the company's head office s still in Ivrea, near Turin, but the company is much larger than it was in those days and there are offices all around the world.By 1930 there was a staff of 700 and the company turned out 13,000 machines a year. Some went to customers in Italy, but Olivetti exported more typewriters to other countries.Camillo's son, Adriano, started working for the company in 1924 and later he became the boss. He introduced a standard speed for the production line and he employed technology and design specialists. The company developed new and better typewriters and then calculators(计算机). In 1959 it produced the ELEA computer system. This was the first mainframe(主机)computer designed and made in Italy.After Adriano died in 1960, the company had a period of financial problems. Other companies, especially the Japanese, made faster progress in electronic technology than the Italian company. In 1978, Carlo de Benedetti became the new boss. Olivetti increased its marking and service networks and made agreements with other companies to design and produce more advanced office equipment. Soon it became one of the world's leading companies in information technology and communications. There are now five independent companies in the Olivetti group—one for personal computers, one for Systems and services, and two for telecommunications.
The options look like:
['(A)It produced the best typewriter in the world. ', '(B)It designed the world’s firs![](data:image/png;base64,iVBORw0KGgo...[base64 PNG payload truncated]...)t mainframe computer.', '(C)It exported more typewriters than other companies.', '(D)It has five independent companies with its head office in Ivrea.']
Option B contains a dirty string: a base64-encoded image is embedded in the middle of the text.
Hi, when I parsed the dataset's options, I found abnormal records in which the length of options differs from the other records in the same subcategory.
In gaokao-chemistry.jsonl, line 190's options include invalid entries (which are actually the question's analysis). The length of options is 7, not 4.
After option "D", there is a fifth option.
Missing options.
In sat-en-without-passage.jsonl, line 17's options are missing option D, which should be "They may increase in value as those same resources become rare on Earth." (reference)
In sat-en-without-passage.jsonl, line 57's options are missing option D, which should be "No, because the data do not indicate whether the honeybees had been infected with mites.", while the label is "D". (reference)
In sat-en-without-passage.jsonl, line 98's options are missing option D, which should be "Published theories of scientists who developed earlier models of the Venus flytrap". See question 11 in the reference.
The same goes for sat-en.jsonl at lines 17, 57 and 98.
In jec-qa-kd.jsonl, line 212's label
is empty. The content is also dirty.
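Problems of this kind (wrong option count, empty label, label pointing at a missing option) could be caught with a simple scan. The sketch below assumes one JSON object per line with "options" and "label" fields as in the records quoted above; find_suspect_records and the default of 4 expected options are illustrative assumptions, and some datasets (such as aqua-rat) use 5 options.

```python
import json


def find_suspect_records(lines, expected_options=4):
    """Return (line_number, problem) pairs for records that look malformed."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue
        record = json.loads(raw)
        options = record.get("options") or []
        if not options:
            continue  # records without options (e.g. cloze-style) are skipped
        if len(options) != expected_options:
            problems.append((lineno, f"{len(options)} options"))
        label = record.get("label")
        if not label:
            problems.append((lineno, "empty label"))
            continue
        for letter in label:  # works for "AD" and for ["A", "D"]
            out_of_range = (
                len(letter) != 1
                or not "A" <= letter <= "Z"
                or ord(letter) - ord("A") >= len(options)
            )
            if out_of_range:
                problems.append((lineno, f"label {letter} has no matching option"))
    return problems
```

It can be run on a file with find_suspect_records(open(path, encoding="utf-8")), adjusting expected_options per dataset.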