In this blog we will discuss Q learning, a type of reinforcement learning. Let's start.

What is Reinforcement Learning?

Have you ever seen a dog trainer giving some food to the dog for every successful task? The food is a reward that tells the brain, "Well done, you are going in the right direction." Our brains crave rewards too: when something makes us happy, the brain tries to repeat it again and again to earn more and more reward.

Similarly, we can use this technique to train our robots or models. This kind of autonomous learning is called reinforcement learning, and Q learning is one of its algorithms.

Using the Q learning algorithm, an agent tries to figure out the quality of its actions based on the rewards it receives, so that it can decide which sequence of actions to perform to get the maximum reward in the long term.

In reinforcement learning there are five main elements:

Reinforcement Q learning Elements

  • Agent

  • Environment

  • Reward

  • State

  • Action

Agent : In reinforcement Q learning, the agent is the one who makes decisions based on rewards and punishments.

Environment : The environment is the task or simulation that the agent interacts with and tries to solve. The agent itself is an AI algorithm.

Reward : A reward in RL is part of the feedback from the environment. When an agent interacts with the environment, it can observe the change in state (if any) and the reward signal that result from its actions. It can then use this reward signal (positive for a good action, negative for a bad action) to draw conclusions about how to behave in a state. The goal, in general, is to solve a given task with the maximum reward possible. That is why many tasks are set up with a very small negative reward for each action the agent takes, to encourage it to solve the task as fast as possible.

State : The state describes the agent's current situation in the environment.

Actions : The set of actions which the agent can perform.
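To make the five elements concrete, here is a minimal sketch in Python. The `Corridor` environment, its reward values (+10 at the goal, -1 per step) and the random "agent" are all hypothetical, chosen only for illustration; they are not part of any library.

```python
import random


class Corridor:
    """Hypothetical environment: a corridor of 5 cells, goal in the last one."""

    ACTIONS = (0, 1)  # action 0 = move left, action 1 = move right

    def __init__(self, length=5):
        self.length = length
        self.state = 0  # the state is simply the index of the current cell

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        move = 1 if action == 1 else -1
        # Walls at both ends: the agent cannot leave the corridor.
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        # Assumed reward scheme: +10 at the goal, -1 for every other step.
        reward = 10.0 if done else -1.0
        return self.state, reward, done


env = Corridor()
state = env.reset()
rng = random.Random(0)
total_reward, steps, done = 0.0, 0, False
while not done and steps < 1000:
    action = rng.choice(Corridor.ACTIONS)    # the agent: purely random for now
    state, reward, done = env.step(action)   # the environment responds
    total_reward += reward
    steps += 1
print("steps:", steps, "total reward:", total_reward)
```

A random agent like this one wastes many steps and collects a poor total reward, which is exactly what a learning algorithm is meant to improve on.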

Introduction to Q Learning

Q learning involves an AI agent operating in an environment with states and rewards (inputs) and actions (outputs). Q learning is model free: the AI agent is not seeking to learn an underlying mathematical model or probability distribution of the environment. Instead, the AI agent attempts to construct an optimal policy directly by interacting with the environment. Q learning uses a trial and error based approach. The AI agent repeatedly tries to solve the problem using varied approaches and continuously updates its policy as it learns more and more about its environment.

The characteristics of the Q learning model are input and output systems, rewards, an environment, Markov decision processes, and training and inference. There are two additional characteristics. The first is that the number of possible states is limited: the AI agent is always in one of a fixed number of possible situations. The second is that the AI agent always needs to choose from among a fixed number of possible actions.

What are Q values and the Q table in Q learning?

In Q learning, Q stands for quality. Quality in this case represents how useful a given action is in gaining some future reward. It is the quality of a particular action in a given state, written Q(state, action).

The Q value estimates how much additional reward we can accumulate through all remaining steps in the current episode if the AI agent is in a given state and takes a given action. Q values increase as the AI agent gets closer to the highest reward.

Q Table

Q values are stored in a table known as the Q table, which has one row for each possible state and one column for each possible action. An optimal Q table contains values that allow the AI agent to take the best action in any possible state, thus providing the agent with the optimal path to the highest reward. The Q table represents the AI agent's policy for acting in the current environment.
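As a small illustration of the row and column layout, the sketch below builds a Q table with NumPy and reads the best action for one state. The table size and the values written into row 3 are made up for the example.

```python
import numpy as np

n_states, n_actions = 5, 2                   # hypothetical sizes
q_table = np.zeros((n_states, n_actions))    # one row per state, one column per action

# Pretend some learning has already happened for state 3.
q_table[3] = [2.5, 9.0]                      # Q(3, action 0), Q(3, action 1)

state = 3
best_action = int(np.argmax(q_table[state]))       # column with the highest Q value
best_value = float(q_table[state, best_action])
print("best action in state", state, "is", best_action, "with Q =", best_value)
```

Reading the policy out of the table is just this lookup repeated for every state: pick the column with the largest value in the row.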

Temporal Difference

Temporal difference (TD) provides us with a method of determining how the Q value for the action taken in the previous state should be changed based on what the AI agent has learned about the Q value for the current state's actions.


  • The temporal difference for the action taken in the previous state. Our AI agent will be able to benefit from this information whenever it finds itself in that state again in the future.

  • The reward received for the action taken in the previous state. This is the immediate reward, which is different from the Q value: the Q value represents our current estimate of the sum of all future rewards if we were to take a particular action in a particular state.

  • γ (gamma) is the discount factor, between 0 and 1 (0 ≤ γ ≤ 1). It determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it short-sighted.

  • max Q (st+1,a) : The largest Q value available for any action in the current state (the largest predicted sum of future rewards). We calculate the temporal difference in order to figure out by how much we should adjust the Q value for the previous action. The Q values for the current state represent our best estimate of the sum of all future rewards for each possible action that we might take in the current state.

  • Q(st,at): The Q value for the action taken in the previous state. We subtract the Q value for the AI agent's most recent action yielding the temporal difference value for the most recent action.
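Putting the terms listed above together, the temporal difference is commonly written as follows. The subscript on the reward is a notational assumption here; some texts write r_t for the same quantity.

```latex
TD(s_t, a_t) = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)
```

For example, with an immediate reward of -1, γ = 0.9, a largest Q value of 10 in the current state, and an old Q value of 5 for the previous action, the temporal difference is -1 + 0.9 × 10 - 5 = 3.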

Bellman Equation

The Bellman equation gives the new value to use as the Q value for the action taken in the previous state. It relies on both the old Q value for that action and what has been learned after moving to the next state. It also includes the learning rate, which defines how quickly Q values are adjusted.


  • The new Q value for the action taken in the previous state. This new Q value for the most recently taken action is the sum of the previous Q value for that action and the temporal difference scaled by the learning rate.

  • This is the old Q value for the action taken in the previous state.

  • α (alpha) is the learning rate (0 ≤ α ≤ 1). Just like in supervised learning settings, α is the extent to which our Q values are updated in every iteration.
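The update described by these terms can be sketched in a few lines of Python. The function names and the sample numbers are hypothetical; the arithmetic follows the usual form of the Q learning update, new Q = old Q + α × temporal difference.

```python
def td_error(reward, gamma, next_q_values, old_q):
    """Temporal difference: reward + gamma * max Q(next state, a) - Q(state, action)."""
    return reward + gamma * max(next_q_values) - old_q


def bellman_update(old_q, alpha, td):
    """New Q value: the old Q value plus the learning rate times the temporal difference."""
    return old_q + alpha * td


# Made-up numbers: reward -1, gamma 0.9, Q values [4, 10] in the next state,
# old Q value 5 for the action just taken, learning rate 0.5.
td = td_error(reward=-1.0, gamma=0.9, next_q_values=[4.0, 10.0], old_q=5.0)
new_q = bellman_update(old_q=5.0, alpha=0.5, td=td)
print("temporal difference:", td, "new Q value:", new_q)
```

With α = 0.5 the Q value moves halfway toward the new estimate. With α = 1 it would jump all the way there, and with α = 0 it would never change.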

Q-Learning Algorithms

How does Q learning work? (The Q learning algorithm process)

Step 1 : Initialize Q Table

The process begins with the initialization of the Q table. The Q table represents the AI agent's policy for how to behave in an environment. Any value can be used to initialize the Q table; zeros are a common choice.

Example of Q table Initializations
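As a rough illustration of such an initialization, the sketch below sets up a Q table for a hypothetical 4 x 4 grid world with four moves. The grid size, the action names and the range of the random values are assumptions made for the example.

```python
import numpy as np

n_rows, n_cols = 4, 4                      # hypothetical grid world
n_states = n_rows * n_cols                 # one state per grid cell
actions = ["up", "right", "down", "left"]  # hypothetical action names

# Option 1: start every Q value at zero.
q_zeros = np.zeros((n_states, len(actions)))

# Option 2: start with small random values instead.
rng = np.random.default_rng(seed=0)
q_random = rng.uniform(low=-0.01, high=0.01, size=(n_states, len(actions)))

print(q_zeros.shape)   # 16 rows (states) by 4 columns (actions)
```

Either table has the same shape; only the starting values differ, and training overwrites them as the agent gathers experience.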

Step 2 : Choose Action

We choose an action from the Q table for the current state. The AI agent might always choose the action with the highest Q value, but then it would never explore alternatives. A common strategy is to use an epsilon greedy algorithm: the AI agent chooses the action with the highest Q value most of the time (for example, 90% of the time) and chooses an action at random the rest of the time (the remaining 10%).
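A minimal sketch of epsilon greedy selection, assuming epsilon = 0.1 to match the 90% / 10% split. The function name `choose_action` and the sample Q values are hypothetical.

```python
import random


def choose_action(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                    # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])   # exploit


rng = random.Random(42)
q_row = [0.2, 1.5, -0.3]   # made-up Q values for one state; action 1 is the best
picks = [choose_action(q_row, 0.1, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print("share of greedy picks:", greedy_share)
```

The measured share of greedy picks comes out a little above 90%, because the random branch can also land on the best action by chance.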

Step 3 : Perform Action

After the AI agent has chosen which action to take, it takes that action and transitions to the next state.

Step 4 : Receive reward

We then receive our reward for taking the most recent action and we use that reward in conjunction with the knowledge that we have learned about our new state to compute the temporal difference value for our previous action.

Step 5 : Update Q table

We use the temporal difference value and the Bellman equation to update the Q value for our most recent action. We then loop back to the beginning by once again choosing an action for the current state. The process continues until we reach a terminal state, such as the target location.
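The five steps can be put together in one short training loop. This is only a sketch under stated assumptions: the environment is a hypothetical 5-cell corridor with the goal in the last cell, the rewards are +10 at the goal and -1 per step, and the values of alpha, gamma and epsilon are picked arbitrarily.

```python
import random

N_STATES, GOAL = 5, 4                    # hypothetical corridor, goal in the last cell
ACTIONS = (0, 1)                         # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1    # assumed hyperparameters


def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = nxt == GOAL
    return nxt, (10.0 if done else -1.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q table (here with zeros).
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Step 2: choose an action with epsilon greedy selection.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            # Steps 3 and 4: perform the action and receive the reward.
            nxt, reward, done = step(state, action)
            # Step 5: temporal difference, then the Q value update.
            future = 0.0 if done else max(q[nxt])   # no future reward after the goal
            td = reward + GAMMA * future - q[state][action]
            q[state][action] += ALPHA * td
            state = nxt
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print("learned policy (1 = move right):", policy)
```

After training, the highest Q value in every non-goal cell belongs to the "move right" action, so following the table leads straight to the goal.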
