wpi-mmr / gym_solo
A custom OpenAI Gym environment for Solo experimentation.
License: MIT License
Since we are making this open to the general public, we should think about creating a detailed README that walks through the steps to create new observations, rewards, and terminations. We should also briefly explain the package layout so it is easy to understand why we have our code organized in this particular way.
Just piggybacking onto the end of this PR 'cause it'll be easier to merge in. Create an `AdditiveReward` and a `MultiplicitiveReward` to make reward function creation easier. Note that this means that `RewardFactory` should be factored out soon.
Originally posted by @agupta231 in #51 (comment)
`RewardsFactory` is basically just a fancy `AdditiveReward`. The `RewardsFactory` should probably be taken out in favor of `AdditiveReward`, and maybe a new class like `RewardWrapper` could be implemented (that `AdditiveReward` and `MultiplicitiveReward` inherit from) to handle all of the PyBullet passthrough business.
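A minimal sketch of what that hierarchy could look like, using the class names from this comment (the `Reward` interface and the passthrough details are assumptions, not the repo's actual code):

```python
from abc import ABC, abstractmethod


class Reward(ABC):
    """Stand-in for gym_solo.core.rewards.Reward."""
    @abstractmethod
    def compute(self) -> float:
        ...


class RewardWrapper(Reward):
    """Hypothetical base class that would own the PyBullet passthrough
    (client handles, robot ids, etc.) for every wrapped reward."""
    def __init__(self, *rewards: Reward):
        self._rewards = rewards

    def client_setup(self, client):
        # Push the shared PyBullet client down to each wrapped reward.
        for r in self._rewards:
            r.client = client


class AdditiveReward(RewardWrapper):
    def compute(self) -> float:
        return sum(r.compute() for r in self._rewards)


class MultiplicitiveReward(RewardWrapper):
    def compute(self) -> float:
        out = 1.0
        for r in self._rewards:
            out *= r.compute()
        return out
```

With this split, `RewardsFactory` really does collapse into an `AdditiveReward` plus whatever weighting scheme we bolt on.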
I guess that we didn't know how PyBullet worked when we started. When trying to train the models in one env and then creating a new GUI-based environment for rendering, we ran into a lot of issues.
The only way to circumvent them was to restart the kernel, which also happens to be the only way to terminate the PyBullet physics server. I'm gonna conjecture that when we create a second environment, it causes conflicts because there is already a simulation running on the PyBullet server. I believe if we forced each environment to maintain its own physics client, then we should be a-ok.
It's either that or find a way to train without having to render and be able to start rendering mid-PyBullet session.
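The "one physics client per env" idea can be sketched like this. The `connect` callable is injected so the pattern is visible without PyBullet installed; in the real env it would be something like `lambda: bullet_client.BulletClient(connection_mode=pybullet.DIRECT)` from `pybullet_utils` (class and env names here are hypothetical):

```python
class IsolatedClientEnv:
    """Sketch: every env instance owns its own physics client, so a
    training env and a GUI rendering env never share one global
    simulation (and can't clobber each other's state)."""

    def __init__(self, connect):
        # `connect` creates a *private* client, e.g. a BulletClient;
        # all subsequent pybullet calls go through self.client.
        self.client = connect()

    def close(self):
        # Tears down only this env's simulation, not a shared server.
        self.client.disconnect()
```

Two envs built this way each get their own simulation, so creating a second GUI env mid-session should no longer require a kernel restart.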
gym_solo/gym_solo/envs/solo8v2vanilla.py
Lines 95 to 98 in 88d9d64
It's just been neglected for a while... now that the observation factory is done, we just need to plop in some tests.
Seems that the motor encoder observation is trying to get information from PyBullet and isn't getting back real numbers. This actually needs to be fixed relatively soon, as it otherwise breaks the entire normalization behavior of the observation factory.
With that being said, `stable-baselines` has an automatically-normalizing environment wrapper, so I'm going to be using that for the time being. However, that performs the normalization via rms standardization, which is just a hack when we should be using min-max normalization.
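For reference, the min-max normalization we actually want (as opposed to the running-mean/std that stable-baselines' `VecNormalize` wrapper does) is just a rescale into [0, 1] using the observation space's bounds. A tiny sketch, names hypothetical:

```python
class MinMaxNormalizer:
    """Min-max normalize observations into [0, 1] given per-dimension
    bounds, e.g. an observation space's `low` / `high` arrays."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __call__(self, obs):
        # (x - low) / (high - low), element-wise.
        return [(o - lo) / (hi - lo)
                for o, lo, hi in zip(obs, self.low, self.high)]
```

This only works once the observations report sane bounds, which is exactly why the broken motor encoder numbers block it.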
Similar to the observation factory, the reward factory should be able to take in `Reward` objects (similar to `Observation` objects), evaluate the rewards for the state, and combine them.
I'm thinking of making the final reward a linear combination of the registered rewards. With that in mind, consider the following example:
r1 = Reward()
r2 = Reward()
r3 = Reward()
rf = RewardFactory()
rf.register_reward(-1, r1)
rf.register_reward(.1, r2)
rf.register_reward(.9, r3)
Notice that `register_reward()` has two args: `weight: float` and `reward: gym_solo.core.rewards.Reward`. Thus, the final reward would evaluate to `-r1() + 0.1 * r2() + 0.9 * r3()`. Figure that if you need functionality more elaborate than a linear combination, you should be offloading that processing into a `Reward` class.
@mahajanrevant thoughts on this? lmk if you think we need anything stronger than linear combinations in the `RewardFactory`.
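The linear-combination factory described above fits in a few lines. A sketch (the real `gym_solo` interface may differ; here rewards are just zero-arg callables returning floats):

```python
from typing import Callable, List, Tuple


class RewardFactory:
    """Sketch of the proposed factory: the final reward is a linear
    combination of the registered (weight, reward) pairs."""

    def __init__(self):
        self._rewards: List[Tuple[float, Callable[[], float]]] = []

    def register_reward(self, weight: float,
                        reward: Callable[[], float]) -> None:
        self._rewards.append((weight, reward))

    def get_reward(self) -> float:
        # e.g. -1 * r1() + 0.1 * r2() + 0.9 * r3() for the example above.
        return sum(w * r() for w, r in self._rewards)
```

Anything fancier than this weighted sum would live inside an individual `Reward`, per the comment above.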
It seems as if the Solo Env is still sending out a hard-coded reward--this needs to be switched to the dynamic model asap:
gym_solo/gym_solo/envs/solo8v2vanilla.py
Line 77 in 88d9d64
^^ Here the 0.0 needs to be dynamically computed for the reward.
The motors were initially made to be torque-controlled. Since the Arduino is going to implement a PID controller, we can abstract it out to position control.
We have a lot of "testing objects", such as `core.test_obs_factory.CompliantObs` and `envs.test_solo8v2vanilla.SimpleReward`, that can be used across tests. We should move everything to a separate module and/or a util module.
The `Solo8VanillaEnv` was never meant to be an all-in-one package. It's intended to be a two-part system, where there is a `BaseEnv` that does a lot of the duplicated work, and `Solo8VanillaEnv` would just be a subclass of that. Additionally, any new models and/or envs could inherit from `BaseEnv` and it should (in theory) just work.
Unfortunately, when there was only one model, it was easier to debug by putting everything together. This should get resolved before we start heavy design iterations, however.
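The two-part split could look something like this (the `load_model` hook and its return value are illustrative assumptions, not the repo's actual API):

```python
import abc


class Solo8BaseEnv(abc.ABC):
    """Sketch: the shared plumbing (client setup, factories, the
    step/reset loop) lives here once instead of per-env."""

    def __init__(self):
        # Common init path; model-specific loading is deferred to the
        # subclass hook below.
        self.robot = self.load_model()

    @abc.abstractmethod
    def load_model(self):
        """Load this env's robot model; the only model-specific part."""


class Solo8VanillaEnv(Solo8BaseEnv):
    def load_model(self):
        # Placeholder for the real URDF load via the PyBullet client.
        return 'solo8v2vanilla.urdf'
```

A new model then only has to implement `load_model()` (plus whatever other hooks we pull out) and inherits the rest for free.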
This should be an implementation of #19.
Figure that this would be our "basic" stopping criterion--after n steps, terminate the episode.
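The step-count criterion is about as small as a termination can get. A sketch, with the class name and interface being assumptions about the eventual `Termination` API:

```python
class TimeBasedTermination:
    """Hypothetical 'basic' stopping criterion: terminate the episode
    after max_steps calls."""

    def __init__(self, max_steps: int):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> None:
        # Called at the start of every episode.
        self.steps = 0

    def is_terminated(self) -> bool:
        # Called once per env step; flips to True on step max_steps.
        self.steps += 1
        return self.steps >= self.max_steps
```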
Basically just an extension of #18. In the `Observation` class, there is a PyBullet client that can be set at `Observation` instantiation time:
Lines 70 to 82 in 0ee3f1b
This basically just needs to extend to `Rewards` for the same reasons.
Note that we might need to extend this to `Terminations` in the future. However, all of our `Termination`s are just time-based and this is a relatively painless change to implement, so I think it might be wise to leave it off until we know we're actually going to need it.
Currently the environment is reset by removing all of the bodies in the simulation (via `removeBody()`) and reloading all of the bodies again.
Running it over and over again causes the bodies to gradually disappear--leading me to believe that there is a VRAM memory leak in PyBullet's implementation.
Instead, I basically copied and pasted all of the env initiation code and plopped it right after an `env.resetSimulation()`. This logic should be encompassed in `Solo8BaseEnv`, so it should be moved there.
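Once the load logic is shared, the copy-paste disappears: reset just wipes the simulation and re-runs the same load path used at construction. A sketch with an injected client standing in for the per-env PyBullet client (names hypothetical):

```python
class ResettableEnv:
    """Sketch: reset() calls resetSimulation() and re-runs the one
    shared load routine, instead of removeBody()-ing everything."""

    def __init__(self, client):
        self.client = client
        self._load()

    def _load(self):
        # The single load path (would live in Solo8BaseEnv): loadURDF,
        # gravity, joint configuration, etc.
        self.robot = self.client.loadURDF('solo8.urdf')

    def reset(self):
        # Wipe the whole sim, then rebuild it identically.
        self.client.resetSimulation()
        self._load()
```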
Lines 205 to 206 in e4bcb11
It doesn't really make sense that they would report one value in radians and one in degrees??
Implementation of #19.
From the most basic RL testing, it seems as if the model pretty quickly just ends up on the ground twitching. Obviously, once it's in this state, it can be very difficult to get out of. Figure that it could be interesting to track the derivative of position over time and terminate the episode when it approaches 0.
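The "terminate when the derivative of position approaches 0" idea could be sketched as a finite-difference check with a small patience window (class name, threshold, and patience are all hypothetical tuning knobs):

```python
class StuckTermination:
    """Hypothetical criterion: end the episode once the robot's base
    position has been (approximately) unchanged for `patience`
    consecutive steps, i.e. it's on the ground twitching."""

    def __init__(self, threshold: float = 1e-3, patience: int = 10):
        self.threshold = threshold
        self.patience = patience
        self._last_pos = None
        self._still = 0

    def is_terminated(self, pos) -> bool:
        if self._last_pos is not None:
            # Finite-difference the base position; ~0 means stuck.
            speed = max(abs(a - b)
                        for a, b in zip(pos, self._last_pos))
            self._still = self._still + 1 if speed < self.threshold else 0
        self._last_pos = pos
        return self._still >= self.patience
```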
Currently the file is named `test_rewards_factory.py`. Since it contains more tests than just the factory, it should probably be renamed to something like `test_rewards.py`.
Need to implement a `gym_solo.core.obs.Observation` that will return the motor encoder values for the Solo8v2 Vanilla model (already implemented in the environment).
At the very least, these should return the positions of the motor encoders (labelled for each joint name, of course). According to the documentation, it seems that we can get the position, velocity, and applied angular forces on all of the joints.
I'm not really sure if we need the velocity / applied angular forces in the observation -- @andrew103 or @mahajanrevant, could one of you pitch in and give some insight into what will be realistic in terms of the robot? Worst comes to worst, we can add those as another observation, 'cause I think that information could be useful in the RL aspect of the project.
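The positions-only version of this observation is a thin wrapper over `getJointStates`, which in PyBullet returns a `(position, velocity, reaction forces, applied torque)` tuple per joint. A sketch with an injected client so it stays testable (class name and constructor are assumptions):

```python
class MotorEncoderObs:
    """Sketch of the proposed motor encoder Observation. `client` is a
    per-env PyBullet client (or anything with its getJointStates API)."""

    def __init__(self, client, body_id, joint_ids):
        self.client = client
        self.body_id = body_id
        self.joint_ids = joint_ids

    def compute(self):
        states = self.client.getJointStates(self.body_id, self.joint_ids)
        # Positions only for now; s[1] (velocity) and s[3] (applied
        # torque) could become a second observation if they turn out to
        # be realistic to sense on the real robot.
        return [s[0] for s in states]
```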
After running the first couple of simulations, the robot usually ends up on the ground twitching. This is an artifact of RL, and we need a way to early-terminate an episode so that the agent can start fresh.
I figure that the interface to this factory would be similar to our `ObservationFactory` and `RewardsFactory`: create your env, register your stopping criteria, and you're off to the races!
Was never added.
It would be easier to create a reward that is geared towards getting the robot from flat on the ground to the home position.
The home position is when all the legs are straight and the body is parallel to the ground. In theory, this is a much easier task than getting the robot to stand up completely on two legs starting from flat on the ground.
At the very least, we know that our reward will need to be somewhat correlated with how upright the robot is. I'm reaching out to a roommate rn to verify the math, but I feel like the cosine similarity between the torso IMU vector and straight upright should basically be it, correct? @andrew103 and @mahajanrevant. Choosing cosine similarity over the inner product because it's automatically normalized to be in [-1, 1].
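For the record, the cosine-similarity reward is just a normalized dot product between the torso's up vector and world up (function name and the world-up convention are assumptions):

```python
import math


def upright_reward(torso_up, world_up=(0.0, 0.0, 1.0)) -> float:
    """Cosine similarity between the torso's up vector (from the IMU)
    and world up: 1 when fully upright, -1 when upside down, 0 when
    lying on its side. Magnitude-invariant, so raw IMU vectors are fine."""
    dot = sum(a * b for a, b in zip(torso_up, world_up))
    norm = (math.sqrt(sum(a * a for a in torso_up))
            * math.sqrt(sum(b * b for b in world_up)))
    return dot / norm
```

Since the output is already in [-1, 1], it can plug straight into a linear-combination reward without extra normalization.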