Let’s consider a problem where an agent can be in various states and can choose an action from a set of actions. Such type of problems are called Sequential Decision Problems. An mdpis the mathematical framework which captures such a fully observable, non-deterministic environment with Markovian Transition Model and additive rewards in which the agent acts. The solution to an MDP is an optimal policy which refers to the choice of action for every state that maximizes overall cumulative reward. Thus, the transition model that represents an agent’s environment(when the environment is known) and the optimal policy which decides what action the agent needs to perform in each state are required elements for training the agent learn a specific behavior.