
Comments (11)

dan9thsense commented on June 3, 2024

I modified testDocker.py to print the number of steps executed prior to receiving a value of True for the done parameter (see code below). I found that after an episode times out, the next episode is done after a single step (see results below the code). It seems that a single timeout generates two failed episodes. Even with this bug, I typically get successes in about 24 of 30 runs on 1-Food.yaml, yet when the agent is submitted the score is under 15.

num_completed_runs = 0
for k in range(30):
    cumulated_reward = 0
    num_steps = 0
    print('Episode {} starting'.format(k))
    try:
        for i in range(arena_config_in.arenas[0].t):
            num_steps += 1
            action = submitted_agent.step(obs, reward, done, info)
            obs, reward, done, info = env.step(action)
            cumulated_reward += reward
            if done:
                print("in testDocker, received done after {0:2d} steps".format(num_steps))
                num_completed_runs += 1
                break
    except Exception as e:
        print('Episode {} failed'.format(k))
        raise e
    print('Episode {0:2d} completed after {1:2d} steps, reward {2:.2f}'.format(k, num_steps, cumulated_reward))

Results:

Episode  5 completed after 250 steps, reward -1.00
Episode 6 starting
in testDocker, received done after  1 steps
Episode  6 completed after  1 steps, reward -0.00

Episode 16 completed after 250 steps, reward -1.00
Episode 17 starting
in testDocker, received done after  1 steps
Episode 17 completed after  1 steps, reward -0.00


mdcrosby commented on June 3, 2024

C1, which I assume is like the sample trial 1-Food.yaml

1-Food.yaml is just a suggested config file. The actual tests go beyond this and can include any type of food. Details about the test setup can be found at https://www.mdcrosby.com/blog/animalaieval.html (and more in the other blog posts).

I think that the score in any one category is the number of runs that received a reward out of a total of 30 runs. Is that correct?

Not exactly. There are 30 tests per category. The tests are hand-crafted and may go beyond those given in the training set (but not beyond the constraints in the linked information).

You get a pass on a test if you beat the threshold value for that test, not simply if you receive any reward. The threshold is set individually for each test and can even be negative. It is set at the point where we would reasonably assume that the animal/agent has demonstrated the ability to pass the test. So, for example, with just one food in the environment, in most cases the threshold would be set so that a pass means getting the food (even at the last possible step).
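
As a concrete reading of the rule above, here is a minimal sketch of threshold-based scoring; the test names, the threshold values, and the choice of ">=" as the comparison are assumptions for illustration only:

# Illustrative sketch: the real thresholds are hand-set per test by the
# organisers (and can be negative); the names and numbers below are made up.
category_tests = [
    {"name": "test_01", "threshold": 0.5},   # pass roughly means "reached the food"
    {"name": "test_02", "threshold": -0.8},  # negative threshold: avoiding the worst outcome suffices
    # ... 30 hand-crafted tests per category in the real evaluation
]

def category_score(final_rewards, tests=category_tests):
    """Count passed tests: a test is passed if its final reward beats the threshold."""
    return sum(
        1
        for reward, test in zip(final_rewards, tests)
        if reward >= test["threshold"]   # "beats" taken as >= here; an assumption
    )

# e.g. category_score([0.7, -0.5]) == 2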

In that case, my agent just runs around outside the area, which seems like correct behavior. How is that scored?

If this was a test it would have a negative threshold value and be scored as a pass.

Also, when there are gold rewards, ones that do not reset the sim until they are all captured, along with green rewards (e.g., the 2-Preferences example), what are the success criteria?

See above and linked information.

It is not possible to get all the rewards, because the sim resets either after collecting the green one or after collecting all the gold ones.

Technically, the environment only resets upon collecting all the gold ones if there is no other (positive reward) food in the arena. This is a feature of the environment and is taken into account by the threshold values for the tests.
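
For concreteness, that reset rule could be restated roughly as follows; the function and argument names are illustrative, not the environment's actual code:

def episode_ends(green_collected, gold_remaining, other_positive_food_remaining):
    """Rough restatement of the reset rule described above (not the real implementation)."""
    # Collecting a green (terminal) reward ends the episode.
    if green_collected:
        return True
    # Collecting the last gold reward only ends the episode if no other
    # positive-reward food is left in the arena.
    if gold_remaining == 0 and not other_positive_food_remaining:
        return True
    return False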


beyretb commented on June 3, 2024

@dan9thsense regarding your second message

I modified testDocker.py to print the number of steps executed prior to receiving a value of True for the done parameter (see code below). I found that after an episode times out, the next episode is done after a single step (see results below the code)...

There was actually a bug in the testDocker.py file: the agent should take a no-op step ([0,0]) after each environment reset in order to collect the observations from the new arena. We've updated the file in the submission folder. FYI, this bug is not present in the actual evaluation script.
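
For reference, the corrected pattern looks roughly like this; it is a sketch based on the description above rather than the exact updated file, and it assumes env, submitted_agent and arena_config_in are set up as in the earlier snippet:

# After each environment reset, take a single no-op action [0, 0] so that
# obs/reward/done/info come from the new arena before the agent acts.
for k in range(30):
    obs, reward, done, info = env.step([0, 0])  # no-op step to collect fresh observations
    cumulated_reward = reward  # include the reward from the no-op step
    for i in range(arena_config_in.arenas[0].t):
        action = submitted_agent.step(obs, reward, done, info)
        obs, reward, done, info = env.step(action)
        cumulated_reward += reward
        if done:
            break
    print('Episode {0:2d} finished, reward {1:.2f}'.format(k, cumulated_reward))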


dan9thsense commented on June 3, 2024

This all makes sense and is very helpful. For C1, I still have the question of why my agent succeeds so well on the 1-Food.yaml example but fails so often in the submission.

So, for example, with just one food in the environment in most cases the threshold would be set so that a pass is getting the food (even at the last possible step).

As far as I can tell, I am giving it the hardest cases (t = 250, random size, random location) and succeeding at finding the food nearly every time here, but less than half the time in a submission.

Is there some other possible complexity to C1?


mdcrosby commented on June 3, 2024

Is there some other possible complexity to C1?

Yes. Even solving the hardest cases with just one food in the environment doesn't cover all cases where there is food (and no other types of object) in the environment.


dan9thsense commented on June 3, 2024

Yes. Even solving the hardest cases with just one food in the environment doesn't cover all cases where there is food (and no other types of object) in the environment.

That is a very cryptic answer. I think that, during the testing phase and with the simplest category, it would be good to make it crystal clear what we are trying to get the agent to do. Instead, this has the feel of: "I'm hiding some clever aspect of the test that complies with a strict reading of its description but is something you wouldn't think of right away, and if you can figure out what the actual test is, then you get a big advantage."

Why not just make it clear what the test is? If it is well-designed, then it will require a good AI to solve it. With the current situation, if a clever human figures out these hidden factors, then even with a poor AI, their agent may do well.


giadefa commented on June 3, 2024


mdcrosby commented on June 3, 2024

Hey,

Hope you don't mind a slightly longer reply to your questions, as this touches on something important we are trying to achieve with the competition.

Why not just make it clear what the test is?

Please see the linked blog posts about why hidden tests are a key part of the competition. As a quick summary: in animal cognition it is always important to minimise familiarisation with the actual tests, so as to rule out solutions that can be ascribed purely to repeating previously successful behaviour. The point is to compare against animal cognition tests, so we want to follow their paradigms as much as possible.

This does make the competition non-standard compared to other ML paradigms, but that is part of the goal behind the competition. We want to move the research agenda from solving tasks by whatever means possible (usually by following some greedy search in research-space) to solving tasks by creating agents with capabilities similar to those shown in animal intelligence.

If it is well-designed, then it will require a good AI to solve it.

I don't think it's possible to have a single test so well designed that it can only be solved by a "good AI". For example, it is possible to solve all tests in our environment by writing down a list of actions to perform (as is true in any deterministic fully-observable problem-space). This approach could theoretically get 100%, but would not count as "good AI".
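
To illustrate that point, an "agent" that simply replays a pre-written action list already fits the step interface used in the testDocker.py snippet earlier; the specific action values below are arbitrary placeholders:

class ScriptedAgent:
    """Replays a hard-coded action list and ignores every observation.

    In a deterministic, fully-observable setting such a script could in
    principle complete any single known test, which is exactly why solving a
    test this way would not count as "good AI".
    """

    def __init__(self, action_list):
        self.action_list = action_list
        self.index = 0

    def step(self, obs, reward, done, info):
        # Same call signature as submitted_agent.step(...) in the earlier loop.
        if self.index < len(self.action_list):
            action = self.action_list[self.index]
            self.index += 1
            return action
        return [0, 0]  # no-op once the script runs out

# e.g. an agent scripted to repeat one action 20 times; the two-element format
# matches the no-op [0, 0] mentioned in this thread, but the values are
# illustrative only.
agent = ScriptedAgent([[1, 0]] * 20)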

If you're interested, I think this paper, and the surrounding literature, has some really interesting discussion about this issue arising in animal cognition tests. Obviously, the referenced experiment is much more complicated than the ones we are using in our competition, but it is a good example of it not being trivial to design tasks that measure what we hope they'd measure, especially when they involve maximising some reward (in this case - like our competition - retrieving food).

With the current situation, if a clever human figures out these hidden factors, then even with a poor AI, their agent may do well.

This is true to some extent, and something we've tried to minimise the impact of by having a large number of tests. Ultimately, the best way to do well in the competition will be to submit an agent capable of robust food retrieval behaviour that acts in accordance with the properties of the environment. "Figuring out" some of the tests would be of little benefit overall with a poor AI that can only solve exactly those tests. Research time that goes into creating a more robust agent that understands more properties of the environment will be much more beneficial and will automatically lead to it solving the tests anyway.

That is a very cryptic answer.

I don't think the answer was particularly cryptic. Though, obviously, I wasn't just spelling out exactly what is in the tests (for the reasons given above), so it is cryptic if that was what you were expecting. If your agent can maximise reward on all possible configurations of food in the arena, then it will get 100% on category 1 very easily. The most complex task in C1 is much simpler than the most complex possible task using only food items.

In fact, I don't think it's possible to create an agent that maximises theoretical reward on the set of all possible tasks using only food items. This is because the set includes many cases where it is ambiguous what the best solution is. For example, what should the agent do if it sees a small green food ahead of it? Should it first check behind it for bigger food? If it does so it will lose time (and therefore reward) in the case there is food ahead. Obviously, it would be bad design to include any tests like this so we have been careful to avoid this kind of situation (and also set the pass thresholds low wherever possible).

I'm hiding some clever aspect of the test that complies with a strict reading of its description but is something you wouldn't think of right away

There's not meant to be any clever aspect hidden in any of the tests (apart from anything clever, in terms of design, that is there because animal cognition researchers include it in their tests). There is no intention to include any 'tricks', and the tests have been carefully designed so that they are solvable without any extra knowledge. In all cases, tests were made simpler whenever possible to avoid the possibility of "unfair" inclusions. We want as many tests as possible to be solved during the competition.

Finally, the competition is our only chance to see how well people can do without knowledge of the tests themselves. As soon as it is over, we will be publishing all the information, and while we hope many will still take up the challenge of testing their agents without training directly on the tests, it will be impossible to avoid some knowledge of the tests creeping into the research decisions. So that is why we're being (perhaps too) careful with the information about the tests at this stage.

Hope that all makes sense and gives you some further insight into the design decisions for the competition.


mdcrosby commented on June 3, 2024

@giadefa Please see the answer above, which covers most of this. But to give more direct responses:

Is the goal to guess the right test scenario?

No. This should only provide minimal benefit and would be almost impossible to do for all tests. The goal is to find a way to train/create an agent that is capable of retrieving food in its environment in as wide a range of situations as possible. The category information is there to indicate the kinds of abilities we are testing for and to make this a bit easier: we're not just performing random tests, but tests designed to probe for specific cognitive abilities.

I was expecting to receive an environment that is just impossible to solve if taken as it is.

I'm not sure exactly what you mean by this. Could you clarify if I haven't answered your question here or above?

The only way to progress would be to test frequently and therefore over-fit the test.

If, for example, you find a way to create an agent that understands object permanence, then it should be able to pass many of the tests from Category 8 (or even all of them, if it also has some of the previously tested-for capabilities). Designing a relevant DRL architecture, building a useful curriculum of environment configurations to train on, or using a learning method well suited to encoding this kind of capability should be more beneficial than trying to overfit with the little information available from submitting once a day (and not employing any extra techniques). Indeed, the competition is designed so that real progress won't be possible using this overfitting strategy alone.
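
To make the "curriculum of environment configurations" idea concrete, one rough sketch could cycle through progressively harder arena configurations; the file names below are hypothetical placeholders, not configs shipped with the competition, and train_one_stage stands in for whatever single-configuration training routine you already have:

# Hypothetical curriculum, ordered from easy to hard.
curriculum = [
    "stage_1_visible_food.yaml",         # placeholder: food always in view
    "stage_2_food_behind_wall.yaml",     # placeholder: food occluded, must be searched for
    "stage_3_food_briefly_hidden.yaml",  # placeholder: food disappears from view (object permanence)
]

def train_with_curriculum(train_one_stage, stages=curriculum, episodes_per_stage=1000):
    """Run the supplied training routine on each stage in order, easiest first."""
    for stage in stages:
        train_one_stage(config_path=stage, episodes=episodes_per_stage)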


dan9thsense commented on June 3, 2024

Thanks for the detailed answer, now I understand the goals of the competition much better.


giadefa commented on June 3, 2024

Thanks for the clarification. I have added some further questions in another issue, as this one was closed.
