Random agent for each episode. The graph shows that this is not an effective approach: the agent learns nothing from the states and actions it observes. It takes completely random actions, so any success it achieves is purely a matter of chance.
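A minimal sketch of such a random agent, assuming a hypothetical toy environment (the actual environment is not specified in the text): a 5-state chain where action 1 moves right, action 0 moves left, and reaching the last state counts as success.

```python
import random

# Hypothetical toy environment for illustration only (not the author's):
# states 0..4, actions 0/1, reaching state 4 ends the episode with reward 1.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, action 0 moves left (floored at 0)
        self.state = min(self.state + 1, 4) if action == 1 else max(self.state - 1, 0)
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_random_episode(env, max_steps=50):
    """Take uniformly random actions; success is purely down to chance."""
    env.reset()
    for _ in range(max_steps):
        action = random.choice([0, 1])
        _, reward, done = env.step(action)
        if done:
            return reward
    return 0.0

random.seed(0)
successes = sum(run_random_episode(ToyEnv()) for _ in range(1000))
print(f"success rate: {successes / 1000:.2f}")
```

Because the action choice ignores the state entirely, the success rate stays flat across episodes, which is what a flat, noisy reward curve in the graph reflects.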
The agent follows a predefined policy.
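Concretely, a policy is just a mapping from states to actions. A brief sketch, with both the policies below being hypothetical examples rather than the author's actual policy:

```python
import random

# A deterministic policy: the same action for a given state, every time.
# Here it is hand-coded to always move right (action 1) in a chain world.
def fixed_policy(state):
    return 1

# A stochastic policy instead samples an action from a distribution
# conditioned on the state (here: 80% right, 20% left, for every state).
def stochastic_policy(state):
    return random.choices([0, 1], weights=[0.2, 0.8])[0]

print(fixed_policy(0), fixed_policy(3))
```

An agent that only follows such a hand-designed policy is only ever as good as that policy; unlike the learning agent below, nothing in it improves with experience.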
This agent uses the Bellman optimality equation to update a Q-table, which stores the Q-value associated with each state-action pair. This is a form of temporal-difference (TD) learning: the longer the algorithm runs, the more it learns. You can see in the graph that the agent's performance steadily increases over time, even though it fluctuates.
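The update described above can be sketched as tabular Q-learning. The environment here is a hypothetical 5-state chain (the author's environment and hyperparameters are not given), but the update rule is the standard one derived from the Bellman optimality equation:

```python
import random

# Toy chain world (assumption, for illustration): action 1 moves right,
# action 0 moves left; reaching the last state ends the episode with reward 1.
N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def env_step(state, action):
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

# Q-table: one Q-value per (state, action) pair, initialised to zero
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

random.seed(1)
for episode in range(300):
    state, done = 0, False
    for _ in range(100):  # cap episode length
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)  # explore
        else:
            best = max(q_table[state])            # exploit, random tie-break
            action = random.choice(
                [a for a in range(N_ACTIONS) if q_table[state][a] == best])
        nxt, reward, done = env_step(state, action)
        # TD update from the Bellman optimality equation:
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = reward + GAMMA * max(q_table[nxt])
        q_table[state][action] += ALPHA * (td_target - q_table[state][action])
        state = nxt
        if done:
            break

greedy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a])
          for s in range(N_STATES - 1)]
print("greedy policy (1 = move right):", greedy)
```

The epsilon-greedy choice is why the learning curve fluctuates: the agent keeps taking occasional random actions even as its average performance trends upward.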