Learning from Richer Supervision
Robots have long captivated the human imagination. Numerous works of science fiction have promised that the era of robots is approaching and that the day androids walk shoulder to shoulder with us is not far away. If this were to happen, the intelligent systems that populate our environment would have to adapt and learn fast, possibly from humans too. For such agents, human teachers present a narrow but critical learning environment. Even between humans, the interactive teacher-student relationship is known to be effective. Endowing robots with human-like learning capabilities would facilitate a similar human-machine interaction, allowing the robot to make effective use of human skills. An added attraction of the field is the generality of advice, i.e., it can be applied to a variety of tasks with minimal modification or specialization. There have been various research initiatives in the field, and the attempts can broadly be classified as learning through imitation, learning from demonstration, active learning, interactive shaping, teachable agents, bootstrapped learning, etc.
Reinforcement learning [] is a machine learning technique that falls under neither supervised nor unsupervised learning. In supervised learning, there is a predefined input-output pair representation, i.e., for a given state (input), a supervised learner is told what action to take next (output). Reinforcement learners are not bound by any such input-output representation. Also, unlike unsupervised learning, there is a notion of feedback in the form of rewards and the next state. Initially, an RL agent chooses randomly from the action space, receiving a reward from the environment for every action selected. Based on the experience gained over repeated runs, the agent estimates the desirability of choosing a particular action. The agent seeks to maximize the reward accumulated over time, i.e., it chooses actions that promise higher returns in the long run. The return is defined as a time-discounted sum of rewards. The value of a state is defined as the expected return starting from that state. Thus, the expected return of selecting an action is the expected immediate reward plus the discounted value of the next state, averaged over the possible next states.
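The two definitions above, the return and the expected return of an action, can be made concrete with a minimal sketch. The function names, the discount factor `gamma`, and the representation of outcomes as (probability, reward, next state) tuples are assumptions made for illustration and do not come from the text.

```python
def discounted_return(rewards, gamma):
    """Time-discounted sum of rewards: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def action_value(transitions, state_values, gamma):
    """Expected return of an action, given its possible outcomes.

    transitions: list of (probability, reward, next_state) tuples (hypothetical format).
    state_values: dict mapping each state to its current value estimate.
    """
    return sum(p * (r + gamma * state_values[s_next])
               for p, r, s_next in transitions)


if __name__ == "__main__":
    print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
    outcomes = [(0.5, 0.0, "a"), (0.5, 2.0, "b")]
    print(action_value(outcomes, {"a": 4.0, "b": 0.0}, gamma=0.5))  # 2.0
```

With `gamma` close to 1 the agent weighs distant rewards almost as heavily as immediate ones; with `gamma` close to 0 it becomes short-sighted.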
Reinforcement learning is very similar to the natural way in which humans learn. RL agents employ a trial-and-error approach to interacting with the world and, in the process, build a preference model. They differ from human learners, though, in that they do not possess any prior knowledge of the domain or past experience, nor do they enjoy the presence of a teacher to learn from, all of which are important contributors to the natural human way of learning. This results in inordinate amounts of exploration and long learning periods. Many methods have been proposed to speed up learning, including reduction of the state space by use of deixis, hierarchical learning, and learning from humans. Learning from humans has in turn been approached through imitation, demonstration, active learning, interactive shaping, teachable agents, bootstrapped learning, and learning through instructions.
Rewards have been considered the most important channel for understanding an environment’s dynamics and have been widely used as a feedback mechanism for an agent’s control algorithm. Recently, however, there have been interesting forays into other modes of understanding the environment. Instructions are one such alternative and, if exploited properly, can bring rich information about the world of interest into the learning process.
Any external input to the control algorithm that could be used by the agent to take decisions about and modify the progress of its exploration or strengthen its belief in a policy can be called advice.
For example, consider the advice "Do not drive fast". We would not generally drive fast, but sometimes we have to in order to reach a destination on time. We can choose to violate the advice, much as we often choose to ignore the advice of our elders.
An instruction is advice that is absolute, i.e., it cannot be overridden by the agent. For example, an agent learning to cut vegetables with a knife can be instructed, “Use the sharp edge.” Similarly, on hearing “Get the cup in that room,” the listener immediately sets out to get the cup.
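The distinction between advice and instruction can be sketched in an agent's action-selection step. This is only one possible reading: modelling advice as a fixed bonus (`advice_bonus`) added to the advised action's value is an assumption, as are the function name and the epsilon-greedy exploration scheme.

```python
import random


def select_action(q_values, advice=None, instruction=None,
                  advice_bonus=1.0, epsilon=0.1, rng=random):
    """Pick an action from q_values (dict: action -> estimated value).

    instruction: an action the agent must take; it is never overridden.
    advice: an action the agent is nudged toward but may still reject.
    """
    if instruction is not None:
        return instruction  # absolute: the agent's own estimates are ignored
    if rng.random() < epsilon:
        return rng.choice(list(q_values))  # ordinary exploration
    scores = dict(q_values)
    if advice is not None:
        # Advice only tilts the comparison; a much better action still wins.
        scores[advice] = scores.get(advice, 0.0) + advice_bonus
    return max(scores, key=scores.get)


if __name__ == "__main__":
    late = {"slow": 1.0, "fast": 5.0}    # running late: driving fast pays off
    relaxed = {"slow": 1.0, "fast": 1.5}
    print(select_action(late, advice="slow", epsilon=0.0))        # fast
    print(select_action(relaxed, advice="slow", epsilon=0.0))     # slow
    print(select_action(late, instruction="slow", epsilon=0.0))   # slow
```

In the first call the advice is violated because the agent's own estimate strongly favours driving fast; in the last call the same preference is overruled because the input is an instruction.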
Instructions could be of various forms:
- The next action or an option (temporally extended action), e.g., “Turn Right” or “Throw the ball.”
- A binding in the form of state or region of state space to visit next, e.g., “Look for a knife in the closet” or “Get close to the table.”
- A target object to define a task goal and ground policies, e.g., “That tea cup.”
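The three forms of instruction listed above could be represented as distinct types, so that an agent can tell which part of its decision an instruction fixes. The class names, field names, and the `constrains` helper are hypothetical and serve only to illustrate the taxonomy.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ActionInstruction:
    """Names the next action or option, e.g. "Turn Right"."""
    action: str


@dataclass(frozen=True)
class RegionInstruction:
    """Names a state or region to visit next, e.g. "Get close to the table"."""
    region: str


@dataclass(frozen=True)
class TargetInstruction:
    """Names the object that defines the task goal, e.g. "That tea cup"."""
    target: str


Instruction = Union[ActionInstruction, RegionInstruction, TargetInstruction]


def constrains(instruction: Instruction) -> str:
    """Report which part of the agent's decision the instruction fixes."""
    if isinstance(instruction, ActionInstruction):
        return "action"
    if isinstance(instruction, RegionInstruction):
        return "subgoal"
    if isinstance(instruction, TargetInstruction):
        return "goal object"
    raise TypeError(f"unknown instruction type: {type(instruction).__name__}")


if __name__ == "__main__":
    for item in (ActionInstruction("turn_right"),
                 RegionInstruction("near_table"),
                 TargetInstruction("tea_cup")):
        print(item, "->", constrains(item))
```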
It would be very useful to set up a generic framework that translates such instructions into a form useful to the agent and incorporates them into the agent’s learning mechanism. The translation could depend on the task at hand, the agent’s capabilities, or the instruction itself. For example, suppose a robot is instructed to use a screwdriver. A handyman robot would probably use the screwdriver to tighten screws, whereas a robot trying to keep a door ajar might use it to block the door from swinging shut. The following example sheds light on the role instructions play in the learning process. Consider a domestic robot learning to wash cups. Even though the robot’s task is to learn to wash cups, the instruction “those cups”, referring to soiled cups, gives the robot an opportunity to learn that the relevant characteristic of the cups that need washing is that they are soiled.
Simple Experiments with Instructions
A few experiments were performed to show that instructions can be exploited to accelerate learning; one of them is explained below. The world shown in Figure 1 was solved by two agents: a simple Q-learning-based RL agent, and an agent that additionally had access to the instruction “Head south-east. When you find a bridge, cross it.” The results of the experiment are plotted in Figure. The instruction was given for the first 30 runs. It can be clearly seen from the graph that the agent converged to the shortest path sooner with instructions.
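The setup of this experiment can be sketched as follows. This is not the world of Figure 1: a hypothetical 5x5 open grid with the goal in the south-east corner stands in for it, the bridge is not modelled, and the reward of -1 per step, the learning rate, the discount factor, and the exploration rate are all assumed values. Only the idea that instructed actions replace the agent's own choice during the first 30 runs is taken from the text.

```python
import random

SIZE = 5  # hypothetical grid size
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def step(state, action):
    """Move within the grid; every step costs -1 (an assumed reward)."""
    row = min(max(state[0] + MOVES[action][0], 0), SIZE - 1)
    col = min(max(state[1] + MOVES[action][1], 0), SIZE - 1)
    next_state = (row, col)
    return next_state, -1.0, next_state == GOAL


def instructed_action(state, rng):
    """'Head south-east': pick S or E among the moves that change the state."""
    options = [a for a in ("S", "E") if step(state, a)[0] != state]
    return rng.choice(options)


def run(episodes=100, instructed_runs=30, alpha=0.5, gamma=0.95,
        epsilon=0.1, max_steps=500, seed=0):
    """Run Q-learning and return the number of steps taken in each episode."""
    rng = random.Random(seed)
    q = {}  # (state, action) -> estimated value, 0.0 if unseen
    lengths = []
    for episode in range(episodes):
        state, steps, done = START, 0, False
        while not done and steps < max_steps:
            if episode < instructed_runs:
                action = instructed_action(state, rng)  # follow the instruction
            elif rng.random() < epsilon:
                action = rng.choice(list(MOVES))        # explore
            else:
                action = max(MOVES, key=lambda a: q.get((state, a), 0.0))
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(
                q.get((next_state, a), 0.0) for a in MOVES)
            old = q.get((state, action), 0.0)
            # Standard Q-learning update, applied to instructed steps as well.
            q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state, steps = next_state, steps + 1
        lengths.append(steps)
    return lengths


if __name__ == "__main__":
    with_instructions = run(instructed_runs=30)
    without_instructions = run(instructed_runs=0)
    print("with instructions:   ", with_instructions[:5], "...")
    print("without instructions:", without_instructions[:5], "...")
```

Plotting the two returned lists against the episode index gives a learning curve comparable in kind to the one described above. The sketch makes no claim about reproducing the reported numbers.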