agieval's People

Contributors

eureka6174, microsoft-github-operations[bot], microsoft-github-policy-service[bot], microsoftopensource, ruixiangcui

agieval's Issues

Bug in Dataset Loader for Few-Shot Multiple Choice Questions

I've noticed that the current code uses the expression demo + question. However, I believe the correct expression should be demo + question_input. By using demo + question, the previously defined question_input is not being utilized and some multiple-choice questions may lack options in the prompt. Please consider updating the code to reflect this change for proper functionality. Thank you!

https://github.com/microsoft/AGIEval/blob/main/src/dataset_loader.py#L215
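A minimal sketch of the difference between the two expressions. The function names and the prompt layout below are hypothetical and are not taken from dataset_loader.py; only the variable names demo, question, and question_input come from the report above.

```python
def build_question_input(question, options):
    """Hypothetical helper: append the options to the question stem."""
    return question + "\n" + " ".join(options)


def build_prompt(demo, question, options, use_question_input=True):
    """Concatenate the few-shot demonstrations with the test item."""
    question_input = build_question_input(question, options)
    # demo + question drops the options; demo + question_input keeps them.
    return demo + (question_input if use_question_input else question)


demo = "Q: 1 + 1 = ?\n(A)1 (B)2\nA: B\n\n"
question = "Q: 2 + 2 = ?"
options = ["(A)3", "(B)4"]

buggy = build_prompt(demo, question, options, use_question_input=False)
fixed = build_prompt(demo, question, options)
print("(B)4" in buggy)  # False
print("(B)4" in fixed)  # True
```

With demo + question, the model sees the demonstrations' options but not the options of the question it is actually asked to answer.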

SAT-Math corpus includes incomplete data

The sat-math corpus contains incomplete questions that cannot be solved from the information given. For example, the following entry refers to an "expression above" that is not included in the data:

{"passage": "", "question": "Which of the following is equivalent to the expression above?" ...

The few-shot prompt format is different in the gaokao-geography dataset

The few-shot prompts in the gaokao-geography dataset look like this:

{'passage': None, 'question': '在某城市中心,一种创新型绿色建筑一垂直森林高层住宅落成面世。它是在建筑的垂直方向上,覆盖满本地乔木、灌木和草本等植物,为每层住户营造“空中花园”,形成具有森林效应的生态居住群落。与传统设计相比,“垂直森林”在居住空间设计上变化最大的地方是( )', 'options': ['A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房'], 'label': 'A', 'answer': None, 'other': {'source': '2022年湖北省高考地理试题'}}

It should be

'options': ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']
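One possible way to repair such entries when loading the data is sketched below. The function name split_packed_options is hypothetical, and the prefix rule is an assumption that only suits options written as "A. text"; an option whose real text begins with a capital letter followed by a space or period would be rewritten by mistake.

```python
import re


def split_packed_options(options):
    """Split options packed into one tab-separated string and
    rewrite 'A. text' prefixes as '(A)text'."""
    normalized = []
    for entry in options:
        for part in entry.split("\t"):
            part = part.strip()
            if not part:
                continue
            # Turn a leading "A." or "A " into "(A)"; other text is left alone.
            normalized.append(re.sub(r"^([A-Z])[.\s]\s*", r"(\1)", part))
    return normalized


packed = ["A. 阳台\tB. 客厅\tC. 卧室\tD. 厨房"]
print(split_packed_options(packed))
# ['(A)阳台', '(B)客厅', '(C)卧室', '(D)厨房']
```

Options that already start with "(A)" do not match the pattern and pass through unchanged.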

Error in gaokao-chemistry dataset

The first two options in this entry are missing their (A) and (B) prefixes:
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L108

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["錀   第七周期", "镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

It should be

{"passage": null, "question": "2007年3月21日,我国公布了111号元素Rg的中文名称.该元素名称及所在周期是(  )", "options": ["(A)錀   第七周期", "(B)镭 第七周期", "(C)铼 第六周期", "(D)氡 第六周期"], "label": "A", "answer": null, "other": {"source": "2007年天津高考化学试题"}}

The following entry contains a format error that can cause a failure when parsing the JSON. More generally, I strongly recommend cleaning the data so that users get higher-quality evaluation data.

https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-chemistry.jsonl#L75
{"passage": null, "question": "水溶液呈酸性的是( $)$", "options": ["(A)$\\mathrm{NaCl}$", "(B)$\\mathrm{NaHSO}_{4}$", "(C)HCOONa", "(D)$\mathrm{NaHCO}_{3}"], "label": "B", "answer": null, "other": {"source": "2020年浙江省高考化学【7月】"}}
Option D is missing a backslash: \mathrm should be written as \\mathrm, because \m is not a valid JSON escape sequence.
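Lines like this can be located with a small check such as the sketch below. The function name find_invalid_json_lines is hypothetical; the sample strings are shortened stand-ins for the entry above, not copies of the dataset.

```python
import json


def find_invalid_json_lines(lines):
    """Return (line_number, error_message) for each line that is not valid JSON."""
    errors = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_number, exc.msg))
    return errors


# Raw strings keep the backslashes exactly as they would appear in the file.
good = r'{"options": ["(D)$\\mathrm{NaHCO}_{3}$"]}'
bad = r'{"options": ["(D)$\mathrm{NaHCO}_{3}$"]}'
print(find_invalid_json_lines([good, bad]))
```

Because the function accepts any iterable of strings, an open file handle (opened with encoding="utf-8") can be passed directly.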

Unicode escape sequences in the json data

If you inspect aqua-rat.jsonl (and other datasets), there are unicode escape sequences throughout the data.

{"passage": null, "question": "A car is being driven, in a straight line and at a uniform speed, towards the base of a vertical tower. The top of the tower is observed from the car and, in the process, it takes 10 minutes for the angle of elevation to change from 45\u00b0 to 60\u00b0. After how much more time will this car reach the base of the tower?", "options": ["(A)5(\u221a3 + 1)", "(B)6(\u221a3 + \u221a2)", "(C)7(\u221a3 \u2013 1)", "(D)8(\u221a3 \u2013 2)", "(E)None of these"], 

This can be prevented by going back to the script that originally wrote out the data, passing ensure_ascii=False to json.dumps, and opening the output file with encoding='utf-8', like so:

f.write(json.dumps(row, ensure_ascii=False) + '\n')
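The effect of ensure_ascii can be seen in a short sketch. The row below is a shortened, made-up example in the style of the entry above. Note that both forms are valid JSON and decode to the same object, so the change affects readability of the file rather than correctness.

```python
import json

row = {"question": "from 45° to 60°", "options": ["(A)5(√3 + 1)"]}

escaped = json.dumps(row)                       # default: ensure_ascii=True
readable = json.dumps(row, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)
print(readable)

# Both forms decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == row
```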

Several problems in logiqa-zh

There are several problems in logiqa-zh, e.g.

[ "A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术" ]

and it should be

[ "(A)没有党参", "(B)没有首乌", "(C)有白术", "(D)没有白术" ]
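To find every entry with this kind of problem, a check along these lines could be run over each file. This is only a sketch: the function name nonstandard_options is hypothetical, and it assumes that options are expected to be labeled "(A)", "(B)", ... in order.

```python
import re

PREFIX = re.compile(r"^\(([A-Z])\)")


def nonstandard_options(options):
    """Return (index, option) pairs whose prefix is not the expected '(A)', '(B)', ..."""
    problems = []
    for index, option in enumerate(options):
        expected = chr(ord("A") + index)
        match = PREFIX.match(option)
        if match is None or match.group(1) != expected:
            problems.append((index, option))
    return problems


print(nonstandard_options(["A 没有党参", "B 没有首乌", "C 有白术", "D 没有白术"]))
print(nonstandard_options(["(A)没有党参", "(B)没有首乌", "(C)有白术", "(D)没有白术"]))  # []
```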

About API_dic

How do I get the custum_api_name? Why am I getting the following errors?

multi-thread n = 3
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1
found error: Error communicating with OpenAI: HTTPSConnectionPool(host='test.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/davinci-003/chat/completions?api-version=2023-03-15-preview (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))
multi-thread n = 1

parse_result error in gaokao-physics-zero-shot

The model output is:
"model_output": "选项 (B) $3.3\mathrm{MeV}$。", "parse_result": ["B", "M", "V"], "label": "B", "is_correct": false
Because gaokao-physics allows multiple answers, the parser collects every uppercase letter in the output, so the "M" and "V" in "MeV" are treated as answers and a correct response is marked as wrong.

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)
    pattern = "\(*([A-Z])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []

Maybe we could restrict matches to a list of candidate answers such as ["A", "B", "C", "D", "E", "F"] to reduce the chance of this error?

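import re

def parse_qa_multiple_answer(string, setting_name):
    if setting_name == "few-shot-CoT":
        string = extract_last_line(string)  # existing helper used by the original function
    # Only accept letters that can be option labels (A-F), not any capital letter.
    pattern = r"\(*([A-F])\)*"
    match = re.findall(pattern, string)
    if match:
        return match
    return []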
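Restricting the range to A-F still matches any capital A-F that appears inside other words or units. A stricter variant is sketched below as one possible direction; the function name parse_choice_letters and its matching rules are assumptions, not part of the repository, and the few-shot-CoT last-line handling is left out. It can still misfire on ordinary text, for example on the English article "A".

```python
import re


def parse_choice_letters(text, candidates="ABCDEF"):
    """Hypothetical variant: extract answer letters, preferring '(X)' forms."""
    letters = re.escape(candidates)
    # First choice: letters wrapped in parentheses, e.g. "(B)".
    found = re.findall(rf"\(([{letters}])\)", text)
    if not found:
        # Fallback: runs of candidate letters that are not adjacent to any
        # other ASCII letter, so the "M" and "V" in "MeV" are skipped.
        runs = re.findall(rf"(?<![A-Za-z])[{letters}]+(?![A-Za-z])", text)
        found = [letter for run in runs for letter in run]
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(found))


print(parse_choice_letters(r"选项 (B) $3.3\mathrm{MeV}$。"))  # ['B']
print(parse_choice_letters("答案是 AD"))  # ['A', 'D']
```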

Will human evaluation results be public?

I am interested in the human evaluation results, but only four figures are provided. Will the results (detailed or overall numeric results) be made public?

Details about the data collection

Thanks for your awesome work! I notice that Gaokao is an important part of your dataset, but most Gaokao papers are not freely available online. Could you please explain how you collected the Gaokao data? Thanks in advance :)

Multiple choice in gaokao-mathqa dataset

There are about 7 questions with more than one correct answer in the gaokao-mathqa dataset, e.g.
https://github.com/ruixiangcui/AGIEval/blob/main/data/v1/gaokao-mathqa.jsonl#L149

{"passage": null, "question": "函数 $f(x)=\\sin (2 x+\\varphi)(0<\\varphi<\\pi)$ 的图象以 $\\left(\\frac{2 \\pi}{3}, 0\\right)$ 中心对称, 则 ($\\quad$)\\\\\n", "options": ["(A)$y=f(x)$ 在 $\\left(0, \\frac{5 \\pi}{12}\\right)$ 单调递减", "(B)$y=f(x)$ 在 $\\left( -\\frac{\\pi}{12}, \\frac{11 \\pi}{12}\\right)$ 有 $2$ 个极值点", "(C)直线 $x= \\frac{7 \\pi}{6} $ 是一条对称轴", "(D)直线 $y= \\frac{\\sqrt{3}}{2} - x $ 是一条切线"], "label": "AD", "answer": null, "other": {"source": "2022年全国新高考II卷数学"}}

Here the label is given as the string "AD", which doesn't match the list format used in gaokao-physics, i.e. ["A", "D"].
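Until the data is made consistent, an evaluation script could normalize both label formats before comparing them. A minimal sketch, with the hypothetical helper name normalize_label:

```python
def normalize_label(label):
    """Return the label as a list of single letters.

    Accepts either a string such as "AD" or a list such as ["A", "D"].
    """
    if label is None:
        return []
    if isinstance(label, str):
        # Keep letters only, so separators such as "A,D" are tolerated.
        return [ch for ch in label if ch.isalpha()]
    return [str(item) for item in label]


print(normalize_label("AD"))        # ['A', 'D']
print(normalize_label(["A", "D"]))  # ['A', 'D']
print(normalize_label("B"))         # ['B']
```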

gaokao-english dirty data

The gaokao-english dataset contains a dirty entry.

The question is

The engineer Camillo Oliver was 40 years old when he started the company in 1908. At his factory in Ivrea, he designed and produced the first Italian typewriter. Today the company's head office s still in Ivrea, near Turin, but the company is much larger than it was in those days and there are offices all around the world.By 1930 there was a staff of 700 and the company turned out 13,000 machines a year. Some went to customers in Italy, but Olivetti exported more typewriters to other countries.Camillo's son, Adriano, started working for the company in 1924 and later he became the boss. He introduced a standard speed for the production line and he employed technology and design specialists. The company developed new and better typewriters and then calculators(计算机). In 1959 it produced the ELEA computer system. This was the first mainframe(主机)computer designed and made in Italy.After Adriano died in 1960, the company had a period of financial problems. Other companies, especially the Japanese, made faster progress in electronic technology than the Italian company. In 1978, Carlo de Benedetti became the new boss. Olivetti increased its marking and service networks and made agreements with other companies to design and produce more advanced office equipment. Soon it became one of the world's leading companies in information technology and communications. There are now five independent companies in the Olivetti group—one for personal computers, one for Systems and services, and two for telecommunications.

The options are:

['(A)It produced the best typewriter in the world.     ', '(B)It designed the world’s firs![]()t mainframe computer.', '(C)It exported more typewriters than other companies.', '(D)It has five independent companies with its head office in Ivrea.']

Option B contains a stray string: ![]() is embedded in the word "first".

Dirty data in the dataset.

Hi, while parsing the options in the dataset, I found abnormal entries whose number of options differs from that of other entries in the same subcategory.

  1. In gaokao-chemistry.jsonl, line 190's options include invalid entries (which are actually the analysis of the question). The options list has 7 entries instead of 4. After option "D", there is a fifth option.

  2. Missing options.

  • In sat-en-without-passage.jsonl, line 17's options are missing option D, which should be "They may increase in value as those same resources become rare on Earth." (reference)

  • In sat-en-without-passage.jsonl, line 57's options are missing option D, which should be "No, because the data do not indicate whether the honeybees had been infected with mites.", while the label is "D". (reference)

  • In sat-en-without-passage.jsonl, line 98's options are missing option D, which should be "Published theories of scientists who developed earlier models of the Venus flytrap". See question 11 in the reference.

The same applies to lines 17, 57, and 98 of sat-en.jsonl.

  3. In jec-qa-kd.jsonl, line 212's label is empty. The content is also dirty.
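Checks like the ones above can be automated. The sketch below flags records whose option count differs from the most common count in the file, and records with an empty label. The function name find_option_anomalies is hypothetical, and the sample records are made up for illustration. It assumes every record in a file should have the same number of options, so files without options or with labels stored elsewhere would need different rules.

```python
import json
from collections import Counter


def find_option_anomalies(lines):
    """Flag records with an unusual option count or an empty label.

    Returns (line_number, message) pairs; line numbers are 1-based.
    """
    records = [
        (number, json.loads(line))
        for number, line in enumerate(lines, start=1)
        if line.strip()
    ]
    counts = Counter(len(record.get("options") or []) for _, record in records)
    if not counts:
        return []
    typical = counts.most_common(1)[0][0]  # most frequent option count

    problems = []
    for number, record in records:
        size = len(record.get("options") or [])
        if size != typical:
            problems.append((number, f"{size} options, expected {typical}"))
        if not record.get("label"):
            problems.append((number, "empty label"))
    return problems


sample = [
    '{"options": ["(A)a", "(B)b", "(C)c", "(D)d"], "label": "A"}',
    '{"options": ["(A)a", "(B)b", "(C)c"], "label": "D"}',
    '{"options": ["(A)a", "(B)b", "(C)c", "(D)d"], "label": ""}',
]
print(find_option_anomalies(sample))
# [(2, '3 options, expected 4'), (3, 'empty label')]
```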
