A Review of Q-Learning (Is this Q*? AGI?)

In the midst of recent upheavals at OpenAI, marked by the temporary dismissal and subsequent rehiring of co-founder and CEO Sam Altman, the spotlight has inadvertently shifted towards a mysterious algorithm known as Q*. OpenAI has maintained a shroud of secrecy around Q*, leaving the tech community and enthusiasts intrigued and eager for details.

The intrigue surrounding Q* is not without reason. Rumors and speculation have circulated, suggesting that this enigmatic algorithm represents a groundbreaking advancement in artificial intelligence. Reports claim that Q* possesses the capability to solve mathematical problems that were previously deemed unsolvable by conventional AI methods.

Interestingly, within the speculative discussions, a connection to the well-established reinforcement learning algorithm, Q-Learning, has been hinted at by certain influencers and marketers. However, it is crucial to emphasize that OpenAI has not confirmed any relationship between Q* and Q-Learning. The details of Q* remain a closely guarded secret, adding an air of mystery to the ongoing narrative.

In light of this renewed interest and speculation, it becomes pertinent to revisit the fundamentals of Q-Learning, a proven reinforcement learning algorithm with a distinct legacy. By understanding the core principles of Q-Learning, we can gain valuable insights into its potential influence on Q* and its significance in the broader landscape of artificial intelligence.

Reinforcement Learning

Reinforcement Learning (RL) is a subfield of machine learning that focuses on training agents to make decisions by interacting with an environment. Q-Learning and its extension, Deep Q-Learning, are popular algorithms in RL, enabling agents to learn optimal strategies in complex and dynamic environments.


Q-Learning is a model-free, off-policy RL algorithm that falls under the category of Temporal Difference (TD) learning. The central idea behind Q-Learning is to learn a Q-function, denoted Q(s, a), which represents the expected cumulative discounted reward of taking action a in state s and following the optimal policy thereafter. The Q-value is updated iteratively using the following rule:

$$ Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) $$

Here, α is the learning rate, r is the immediate reward, γ is the discount factor, s′ is the next state, and a′ ranges over the actions available in s′, so the max picks the best estimated value reachable from the next state. The algorithm aims to find the optimal Q-function, guiding the agent to make the best decisions in any given state.
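The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the dictionary-based table, the function name `q_update`, the toy state and action names, and the `done` flag (which zeroes the bootstrap term at the end of an episode) are all assumptions made for the example.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Best estimated value from the next state; assumed to be zero once the episode ends.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    # Blend the old estimate with the new target, weighted by the learning rate.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

# Toy example (hypothetical states and actions): unseen entries default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# 0.5 * 0 + 0.5 * (1 + 0.9 * 2.0), which is approximately 1.4
new_value = q_update(Q, "s0", "right", r=1.0, s_next="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
print(new_value)
```

Because the max is taken over all actions in the next state rather than the action the agent actually takes next, the update learns about the greedy policy even while the agent behaves differently.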

Deep Q-Learning (DQN)

While Q-Learning is effective for problems with a relatively small state and action space, it struggles with high-dimensional or continuous state spaces, where a table with one entry per state-action pair becomes infeasible. Deep Q-Learning addresses this limitation by incorporating neural networks to approximate the Q-function.

The Deep Q-Network (DQN) architecture, introduced by researchers at DeepMind, employs a neural network to estimate Q-values. The network takes the environment state as input and outputs Q-values for each possible action. The loss function for training the DQN is derived from the temporal difference error:

$$ L(\theta) = \mathbb{E} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s,a; \theta) \right)^2 \right] $$

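Here, \( \theta \) represents the parameters of the neural network, and \( \theta^- \) denotes the parameters of a target network that is only periodically updated to match \( \theta \). Holding the targets fixed between these updates stabilizes training: it reduces the correlation between the Q-values being updated and the targets they are regressed toward, so the network is not chasing a target that moves with every gradient step.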
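To make the loss concrete, here is a small sketch that estimates it as a batch average. It makes several assumptions for illustration: the online and target networks are stood in for by plain Python callables (`q_online`, `q_target`) that return a list of Q-values per action, transitions carry a `done` flag, and the numbers are invented. A real DQN would use a deep learning framework and backpropagate only through \( Q(s,a;\theta) \).

```python
def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next, done) transitions."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The target comes from the frozen target network (theta^-).
        bootstrap = 0.0 if done else max(q_target(s_next))
        target = r + gamma * bootstrap
        # The prediction comes from the online network (theta).
        td_error = target - q_online(s)[a]
        total += td_error ** 2
    return total / len(batch)

# Hypothetical stand-ins for the two networks: two actions, constant outputs.
q_online = lambda s: [0.0, 1.0]
q_target = lambda s: [0.5, 2.0]

batch = [
    (0, 1, 1.0, 1, False),  # target 1 + 0.9 * 2.0 = 2.8, prediction 1.0
    (1, 0, 0.0, 2, True),   # terminal: target 0.0, prediction 0.0
]
loss = dqn_loss(batch, q_online, q_target, gamma=0.9)
print(loss)  # approximately 1.62
```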

Experience Replay

To further enhance stability and efficiency, DQN employs a technique called experience replay. Instead of updating the neural network with the latest experiences in sequential order, DQN stores experiences in a replay buffer and samples batches randomly during the training process. This mitigates the issues associated with correlated and non-stationary data, improving the learning process.
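A replay buffer of this kind can be sketched with the standard library alone. The class name `ReplayBuffer`, the fixed capacity with oldest-first eviction, and the tuple layout of a transition are assumptions for this example rather than details taken from any particular DQN implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal ordering.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy usage: transitions are (state, action, reward, next_state, done) tuples.
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.push((i, 0, 0.0, i + 1, False))

batch = buffer.sample(2)  # two distinct transitions drawn from the last three pushed
```

Sampling uniformly is the simplest choice; it treats every stored transition as equally useful, which keeps the sketch short.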

Exploration-Exploitation Dilemma

Balancing exploration (trying new actions) and exploitation (choosing actions with known high rewards) is crucial for RL agents. The ϵ-greedy strategy is often used in Q-learning and DQN: with probability 1−ϵ the agent selects the action with the highest Q-value, and with probability ϵ it explores by picking an action uniformly at random.
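As a rough sketch, ϵ-greedy action selection fits in one small function. The function name, the list-of-Q-values input, and the tie-breaking rule (first maximum wins) are assumptions for this illustration.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # always index 1 here
random_action = epsilon_greedy(q_values, epsilon=1.0)   # any of 0, 1, 2
```

In practice ϵ is often started high and decayed over training, so the agent explores widely at first and exploits more as its estimates improve.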

Fun Historical Note

Currently, there is a lot of hype surrounding LLMs, and more recently Q*. Everybody is saying that this might be the key to AGI. However, it is important to note that everybody said the same thing about Deep Q-Learning when that was new, too. People love speculating about whether or not the sky is falling, Armageddon, and the end of the world. In the old days, we called this “religion”…


Q-Learning and Deep Q-Learning are powerful techniques in reinforcement learning, allowing agents to learn optimal strategies in diverse environments. While tabular Q-Learning is suitable for problems with small, discrete state and action spaces, Deep Q-Learning extends this capability to high-dimensional and continuous state spaces through the use of neural networks (the action space remains discrete, since the network outputs one Q-value per action). With advancements like experience replay and target networks, Deep Q-Learning has proven to be a robust and effective approach in training intelligent agents for various applications, from playing games to controlling robotic systems. Importantly, we don’t know if Q* has anything to do with Q-Learning!

Learn more about Q-Learning

Learn more about Deep Q-Learning (note: the above course is a prerequisite to this one, so don’t skip it)