DRL - Getting Started with Deep Reinforcement Learning Can Be a Beast, Here's a Way to Frame It
A lot of primers on DRL can be confusing and needlessly high level. There's a simpler way to begin your journey.
Q-learning in action
There's a wonderful metaphor in fantasy fiction that applies to learning computer science, artificial intelligence, and machine learning. Often, the 'magic users' of a fantasy novel act as gatekeepers to knowledge, making every lesson needlessly complicated for those who want to learn magic. The point is to hoard that knowledge and power for themselves. I'm not trying to make every CS professor out to be an evil sorcerer, but... just kidding. I just think it's really, really difficult to teach computer science, especially if you never wanted to be a teacher in the first place. I'm far from an expert, but I would like to make these concepts a bit easier for everyone to understand. So here's how to frame deep reinforcement learning in your mind.
The best way to understand and learn programming is to envision a project and figure out how to complete it. That means piecing together what kind of tools you'll need to finish it, and how to use those tools. In this example, I've created a fictional character named Marco. Marco's pretty good with the 'magical language' (i.e. Python), understands how to use the spells other people made for his own purposes (i.e. packages), and is beginning to generalize how the 'magic world' (computing) works. Like all excellent wizards, Marco is lazy AF. His goal is to create a non-sentient agent that can do all his housework for him. But he has a major problem: creating spells takes a while! He's been working on his automated agent for some time and has realized that having a non-sentient butler means he has to personally create a unique spell for every single action in every single environment. Marco doesn't have time for this; he just wants a simple answer, and he thought magic was supposed to make things easier! Marco wants his servant to be able to draw on Marco's own human experience, but he doesn't want to just create another human. He wants the agent to master everything he knows and more in just an afternoon.

Marco draws on his skills and remembers that he can create mini-worlds where he controls everything, including time. Even though Marco isn't powerful enough to influence time in the real world, he can in this mini-world. But that doesn't help him, right? He needs to be able to use his agent in the real world, and having one that's limited to his fake world is useless. Marco spends the rest of the morning thinking and comes up with a solution! What if he takes the 'mind' of the agent and puts it in the fake world to gain experience, AND THEN takes the trained mind back out and puts it back into his servant? He'd get exactly what he wants!
Marco's just identified the foundations of reinforcement learning, a subset of AI. In this fictional world where magic replaced programming, Marco realized that in order to create an automated creature that can do advanced tasks, he needs it to gain experience. But gaining experience took Marco twenty-four years, and he still doesn't know everything! He doesn't have time to wait around for that. He needs to replicate the experience process at an immense scale. So he finds a workaround: make a fake world where he controls all the variables, set the variables to mimic the real world with some shortcuts (e.g. a single real second can stand in for a fake hour, day, or year), and put the brain into the fake world. Once it's learned just as much as, or more than, Marco, he takes it out and has a ready-made agent willing and able to do his housework for him! Now that we have a picture of what Marco's doing, let's abandon the magic metaphor and get into some of the real-world applications of what I just described.
In the next three sections, I want to cover the main aspects of deep reinforcement learning. The first is model construction. There are four components that are essential to understand:
- The Goal
- The World
- The Action
- The Reward
These factors are pretty much all you need to build a deep reinforcement learning model. I'll go into each of them in a second, but I want to communicate an essential topic about data before we begin.
Marco had it easy: with magic, all he had to do to create his fake world was draw on some mystical energy and think really hard about what he wanted to create. For us real-worlders, the parameters are different. Data is the most important one. There are three types of data out there: numerical, categorical, and both! Numerical data is the easiest to work with, since computers can pick up on patterns quickly using mathematical operations. Categorical data is trickier. How can you communicate to a computer that there's a qualitative difference between two different objects? How can you abstract this? NUMBERS. Feature engineering is a huge part of deep reinforcement learning, and it's how a lot of the AI 'magic' happens. We just convert everything we find into numbers using a few use-specific methods. Feature engineering is super important, so I'll cover it in a whole post later. The important thing to understand is that everything will eventually be represented by numbers, and that's how computers can pick up on patterns.
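To make that concrete, here's a minimal sketch (plain Python, with made-up room names) of one common way to turn a categorical feature into numbers, called one-hot encoding:

```python
# A minimal sketch of turning categories into numbers (one-hot encoding).
# The room names here are invented purely for illustration.

rooms = ["kitchen", "bedroom", "garage"]

def one_hot(room):
    """Represent a room as a vector of 0s and 1s the model can do math on."""
    return [1.0 if room == r else 0.0 for r in rooms]

print(one_hot("bedroom"))  # [0.0, 1.0, 0.0]
```

The computer never needs to know what a 'bedroom' is; it only needs a numeric pattern it can compare against other numeric patterns.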
With this in mind, let's talk about the goal. This is variable and can get SUPER complicated. Just look at self-driving cars. Your goal can be to get from point A to point B, but the car also needs to not destroy itself, stay on the road, keep to a certain speed, avoid pedestrians, keep the passengers safe, etc. etc. Objective functions are the most important part of any deep learning model, and partially why it gets complicated. So let's start with something simple: a thermostat in a house. The goal is easy: keep the house temperature at 70 degrees Fahrenheit when people are home. The hard part then becomes defining the state (i.e. the world).
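Just as a sketch of what that goal could look like in code (a toy function with made-up names, not anything from a real library), here's the thermostat goal written as a reward: the closer the house is to 70°F while people are home, the better.

```python
# A toy reward for the thermostat example: closer to 70°F is better,
# and we only care when someone is home. All names here are made up.

TARGET_F = 70.0

def reward(temperature_f, people_home):
    if not people_home:
        return 0.0                          # nobody home, nothing to optimize
    return -abs(temperature_f - TARGET_F)   # 0 is best, more negative is worse

print(reward(65.0, True))   # -5.0
print(reward(70.0, True))   #  0.0
print(reward(40.0, False))  #  0.0
```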
The state space is the virtual world. There are three broad flavors of reinforcement learning you can use to explore it: model-based, value-based, and policy-based (I'll stick to the first two here). The simplest to reason about is model-based reinforcement learning. MB-RL applies to games, especially ones with fixed rules like chess. The 'model' is the game's rules, and everything takes place in a special environment (i.e. the chessboard). There are a limited number of things that can happen on the board (still an obnoxiously big number of possibilities for a human to work with), and that's why this is the easiest setting to work with. LEARNING CURVE ALERT: professors will confuse you by bringing up a concept called 'sample efficiency,' and it involves a LOT of math. To avoid going too far into decision trees and pruning, just know that there's a mathematical way to mimic how professional chess or Go players can 'feel' what a good move is, and then the computers do it even better. The fact that the computer doesn't have to plan out every single permutation of what could happen makes it 'efficient.'

The next step is value-based reinforcement learning. VB-RL tries to estimate how good states and actions are, and this is where the terms 'on policy' and 'off policy' come up. Basically, VB-RL guesses whether an action is likely to be good or bad for the world and its goal. These methods don't directly learn a strategy of action; they learn how good a course of action is and try to pick the actions that maximize that estimate. Occasionally, they'll flip a coin and try something new instead. Off policy means the agent learns about the best possible strategy even while it acts on a different, more exploratory one (Q-learning works this way). On policy means the agent evaluates and improves the same strategy it's actually using to act. Either way, this tends to be 'sample inefficient,' because the model needs a lot of examples before it's able to do anything useful. One of the most prominent value-based algorithms (I'll get to rewards in a second) is called Q-learning, named for Q, the function that estimates the quality of taking a particular action in a particular state.
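If the on-policy/off-policy distinction still feels fuzzy, here's a minimal sketch of the two update rules side by side (my own toy code, assuming a dictionary Q of state-action values); the only difference is which next action the agent learns from. The second one, SARSA, isn't covered in this post, but it's the classic on-policy counterpart to Q-learning:

```python
# A toy comparison of off-policy vs on-policy value updates.
# Q maps (state, action) pairs to estimated values; alpha is the learning
# rate and gamma is the discount on future reward.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off policy: learn from the BEST action in the next state,
    # regardless of what the agent actually does next.
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On policy: learn from the action the agent will ACTUALLY take next.
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```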
With that, let's move on to the action. The action step will probably be more familiar: at its core, it's just an if statement. An action says that if certain conditions are met, then the program will do something. Q-learning is a state-action function (i.e. it involves variables from both the state and the action) that tries to maximize its reward. The reward helps produce the output (wait a minute, I'll get to rewards after this). Q-learning is an algorithm that basically says: use any strategy to gather experience, and use that experience to estimate the Q values (i.e. how good each action is) that maximize the future reward. This is independent of the strategy you gave it at the beginning. The Q algorithm is a bit tricky to understand, but it's best to think of it programmatically. It's basically a loop: after you've initialized the world, take the current state, pick an action, perform it, measure the reward, and update Q. Keep doing this until you've reached the optimal state, then terminate the loop.
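Here's that loop written out as a rough tabular Q-learning sketch. The thermostat environment and all the names are made up for illustration; the update rule itself is the standard Q-learning one.

```python
import random
from collections import defaultdict

# A tabular Q-learning sketch on a toy thermostat world. The environment is
# invented for illustration; the update rule is the standard one:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

ACTIONS = ["heat", "cool", "nothing"]

def step(temp, action):
    """The world's rules: the chosen action nudges the temperature."""
    if action == "heat":
        temp += 1
    elif action == "cool":
        temp -= 1
    temp = max(50, min(90, temp))      # clamp to a sane range
    reward = -abs(temp - 70)           # the goal: stay near 70°F
    done = (temp == 70)
    return temp, reward, done

def q_learn(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)             # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        temp = random.randint(50, 90)  # initialize the world
        for _ in range(100):           # cap the episode length
            # pick an action: mostly the best known one, sometimes a coin flip
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(temp, a)])
            # perform the action, measure the reward
            new_temp, reward, done = step(temp, action)
            # update Q toward the best future value from the new state
            best_next = max(Q[(new_temp, a)] for a in ACTIONS)
            Q[(temp, action)] += alpha * (reward + gamma * best_next - Q[(temp, action)])
            temp = new_temp
            if done:                   # reached the optimal state
                break
    return Q

Q = q_learn()
print(max(ACTIONS, key=lambda a: Q[(65, a)]))  # most likely "heat"
```

Nothing in the environment tells the agent what 'heat' means; it only sees numbers come back as rewards, and the table of Q values slowly encodes which action pays off in which state.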
Finally, the reward function: it can get super wonky. Sometimes, especially if your data is all numeric, it can be really easy to define. You've got a variable, 'x', and you just try to push it as high as it can go. In the real world, there are often many more variables involved, and they all affect each other (i.e. why it's a function). Many people like to bring up the Cobra Effect. The Cobra Effect refers to a colonial-era policy in Delhi, where the government offered money for venomous snakes that people brought in. People immediately started breeding the dangerous snakes to game the system. You do not want to train your system to adopt a solution that's worse than the problem. This is an edge case - the consequences of your system hopefully aren't dire - but this exact scenario can be abstracted to represent all the negative outcomes you don't want from a DRL model. A better example might be a robot arm being trained to stack blocks on top of each other. The author of the code thought it'd be a good idea to make the 'reward' a measure of distance - move the cube as far as you can (the author thinking the robot arm would stretch all the way out and place the cubes neatly on top of one another). The robot ended up learning how to chuck things as hard as it could against a wall. Designing rewards is a complicated task, and it often isn't as easy as simply maximizing a single variable. As I was writing this, I realized I was going a bit off on a tangent, so I'd like to cover rewards more in-depth in a future post.
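As a tiny, made-up illustration of how much the reward definition steers behavior, here's a naive 'reward the distance' function next to one that also checks whether the block actually landed on the stack:

```python
# A toy illustration of why the reward definition matters. The block-stacking
# setup and numbers are invented; the point is that the naive reward is
# maximized by throwing, while the second one isn't.

def naive_reward(distance_traveled):
    # "Move the block as far as possible" -> throwing it scores highest.
    return distance_traveled

def better_reward(distance_traveled, landed_on_stack, impact_speed):
    if not landed_on_stack:
        return -1.0                                 # missing the stack is bad
    return distance_traveled - 0.5 * impact_speed   # gentle placement wins

print(naive_reward(10.0))               # 10.0 - the throw looks great
print(better_reward(10.0, False, 9.0))  # -1.0 - the throw is punished
print(better_reward(1.0, True, 0.2))    #  0.9 - careful placement wins
```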
Those are the basics behind deep reinforcement learning: it's all about creating a world, crafting an agent, giving it a set of actions it can take, and teaching it to maximize an arbitrary reward so it gains experience and an idea of which actions lead to which results. It's a fascinating idea, and hopefully I've managed to convey the basics well. If you have any questions, let me know!