[2024/8/21] Crash Course on Decision Process (II): Reinforcement Learning

Author: Pilwon Hur, with ChatGPT

On August 21, 2024, I ran a crash course on Reinforcement Learning (RL) with the students in our lab. RL extends the Markov Decision Process (MDP) framework and can be applied in areas such as robot control, human-robot interaction, and wearable robots. About 15 students attended, and the lecture was given in English. This post briefly summarizes the key concepts of RL and the content of the lecture.

The lecture materials and scripts are available on GitHub: https://github.com/pilwonhur/hurgroupccmdp

Overview of Reinforcement Learning

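Reinforcement learning is the process by which an agent learns optimal behavior by interacting with an environment. In a given state, the agent selects an action, receives a reward as a result, and learns from that feedback. As in an MDP, the notions of state, action, state transition probability, and reward are all central. What sets RL apart is that the agent learns on its own from experience.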
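
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names invented for this post, not code from the lecture materials.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a biased coin, reward 1 if correct."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # a single dummy state

    def step(self, action):
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return 0, reward, done  # next state, reward, episode-finished flag


def run_episode(env, policy):
    """Run one episode: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


# An agent that always guesses heads (action 1).
episode_return = run_episode(CoinFlipEnv(), lambda state: 1)
```

The agent here never sees `p_heads`; it only sees rewards, which is the situation RL is designed for.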

Extending MDP to RL

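The MDP provides the basic framework for reinforcement learning. Unlike in an MDP, however, an RL agent has to learn and adapt to the dynamics of the environment. The lecture explained the differences between MDP and RL, focusing on how the agent searches for an optimal policy when the state transition probabilities are not known exactly. For this reason, balancing exploration and exploitation becomes a central challenge in RL.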

Balancing Exploration and Exploitation

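In RL, the agent has to explore new states and actions while also exploiting what it has learned so far to maximize reward. The "epsilon-greedy" policy was introduced as a representative way to strike this balance: with probability epsilon the agent tries a randomly chosen action, and otherwise it takes the action that currently looks best. The lecture covered several strategies for balancing exploration and exploitation.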
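
A minimal sketch of epsilon-greedy action selection is shown below. The function name `epsilon_greedy` and the representation of action values as a plain list are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon, explore with a uniformly random action;
    otherwise exploit the action with the highest current estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

A common variation is to start with a large epsilon and decay it over time, so that the agent explores a lot early on and exploits more as its estimates improve.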

Value Functions and Policies

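The lecture explained two functions that are central to RL: the value function and the action value function. They give the expected cumulative reward the agent can obtain from a given state (or from a given state and action), and they play a key role in choosing the best action. In particular, the lecture covered how to update the value function with the Bellman equation and how to learn an optimal policy from it.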
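
For reference, the Bellman equations in their standard textbook form are written out below. The notation is an assumption of this post rather than the lecture's: $\pi$ is the policy, $P$ the transition probability, $R$ the reward, and $\gamma \in [0, 1)$ the discount factor.

```latex
% Bellman expectation equation for the state value function of a policy \pi
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]

% Bellman optimality equation for the action value function
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)
              \left[ R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]
```

Given $Q^{*}$, an optimal policy simply picks $\arg\max_{a} Q^{*}(s, a)$ in every state.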

Key Algorithms

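The later part of the lecture covered the main algorithms used in RL. The two representative methods are SARSA and Q-Learning, and understanding how they differ is important for understanding RL in depth. SARSA is an "on-policy" method: it learns the value of the policy the agent is actually following. Q-Learning is an "off-policy" method: it learns the value of the optimal (greedy) policy even while the agent follows a different, exploratory policy. The lecture used examples to show how an agent can learn with each of these methods.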
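
The difference between the two methods shows up in a single line of the update rule. The sketch below uses the standard tabular updates; the function names, the dictionary-based Q table, and the default values of `alpha` (learning rate) and `gamma` (discount factor) are assumptions for illustration, and terminal-state handling is left out for brevity.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a_next the agent will actually take."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action in s_next, whatever is taken next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Q table keyed by (state, action), with unseen entries defaulting to 0.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

When the behavior policy is purely greedy the two targets coincide. They differ whenever the agent takes an exploratory action in the next state.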

Deep Reinforcement Learning

One of the areas of RL receiving the most attention recently is deep reinforcement learning, which uses neural networks so that an agent can learn even in complex environments. The lecture introduced several deep RL techniques, including the Deep Q-Network (DQN), and explained how they overcome the limitations of conventional RL. In particular, it covered how techniques such as experience replay and target networks improve the stability of learning.
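
The two stabilizing techniques can be sketched without any deep learning library. The names below (`ReplayBuffer`, `td_target`, `sync_target`) and the representation of network parameters as a dictionary of lists are hypothetical stand-ins. A real DQN would use a neural network framework for the networks themselves.

```python
import random
from collections import deque


class ReplayBuffer:
    """Experience replay: store transitions and sample them at random,
    which breaks the correlation between consecutive training samples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Regression target for the online network. next_q_values are assumed
    to come from the slowly changing target network, not the online one."""
    return reward if done else reward + gamma * max(next_q_values)


def sync_target(online_params, target_params):
    """Hard update: copy the online parameters into the target network."""
    for name, value in online_params.items():
        target_params[name] = list(value)
```

In a typical training loop, the online network is updated on every sampled batch, while `sync_target` is called only every so many steps so that the targets do not shift under the learner too quickly.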

Policy Gradient and Actor-Critic

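Finally, the lecture covered policy gradient and actor-critic algorithms. Policy gradient methods have the agent learn the policy directly, rather than deriving it from a value function, and they are useful in complex environments. Actor-critic algorithms learn a policy (the actor) and a value function (the critic) together, which helps make learning more efficient. These methods can be powerful tools for solving complex robot control problems.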
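
To make the idea concrete, here is a minimal policy gradient sketch on a multi-armed bandit, a problem with a single state. The softmax preferences play the role of the actor and a running reward estimate serves as a very simple critic (a baseline). The function names, the bandit setup, and all hyperparameters are assumptions for illustration. The lecture's own examples may differ.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_means, steps=2000, alpha=0.1, beta=0.1,
                 noise_std=1.0, seed=0):
    """Policy gradient with a learned baseline on a bandit problem."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)  # actor: action preferences
    baseline = 0.0                  # critic: estimate of the expected reward
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], noise_std)
        advantage = reward - baseline   # how much better than expected
        baseline += beta * advantage    # critic update
        for a in range(len(prefs)):
            # Gradient of log softmax probability of the chosen action.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += alpha * advantage * grad_log  # actor update
    return softmax(prefs)


final_probs = train_bandit([0.0, 2.0])
```

Actions that turn out better than the critic expected have their probability increased, and actions that turn out worse have it decreased. Full actor-critic methods apply the same idea with a state-dependent value function as the critic.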

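I hope this lecture gave the students a chance to understand RL from its basic concepts through to more advanced algorithms. The next Crash Course will cover the Partially Observable Markov Decision Process (POMDP). I plan to keep offering these sessions so that students can continue to deepen their understanding.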

Below is a summary of the contents of the lecture video.

00:00:00 📚 Lecture Preparation and Challenges

00:04:52 📚 Overview of Reinforcement Learning and Examples

00:08:35 📘 Overview of Markov Decision Processes (MDP)

00:10:08 🧠 Understanding Transition Probability in MDP vs. RL

00:11:40 🤖 Understanding Agents and Environments in RL

00:13:39 🤖 Understanding Markov Decision Processes and Reinforcement Learning

00:16:33 ♻️ Understanding the Bellman Equation in MDP

00:18:10 ⚙️ Dynamic Programming in MDP

00:19:37 ⚙️ Transition from MDP to RL

00:20:56 🤖 Challenges in Modeling and Using MDP

00:22:40 🔍 Challenges in Solving High Degree of Freedom Problems

00:24:25 🌍 Balancing Exploration and Exploitation in Reinforcement Learning

00:26:01 🌀 Reinforcement Learning vs. Markov Decision Processes

00:29:05 🤸‍♂️ Understanding Energy Gain in Motion and MDP Application

00:31:21 🌀 Understanding Phase Portraits and Energy in Motion

00:35:13 🌀 Understanding Angular Dynamics in Programming

00:37:10 🎯 Pendulum Control with RL and MDP

00:41:50 🔍 Understanding Deterministic Models in MDP

00:45:29 🔍 Understanding Uncertainty in Systems

00:47:55 ⚙️ Understanding State Changes in Discretization

00:49:48 🔍 Understanding State Transitions and Transition Probability Functions

00:53:45 🔍 Understanding State Transition Probability in RL

00:56:21 🔍 Key Concepts in Grid and Delta T Adjustments

00:59:43 🤔 Challenges and Considerations in MDP

01:02:57 ⚙️ Strategies in Reinforcement Learning for Optimal Control

01:07:21 🍽️ Cooking with Repeated Ingredients

01:10:56 🤖 Challenges and Solutions in Reinforcement Learning for Robotics

01:16:04 🌟 Reinforcement Learning in Healthcare and Basic Concepts

01:17:54 🤖 Key Concepts of Reinforcement Learning

01:19:08 📘 Importance of Value and Action Value Functions in RL

01:21:55 🔍 Understanding MDP and Reinforcement Learning

01:23:44 🌟 Key Concepts in Exploration and Exploitation

01:25:05 🎯 Balancing Exploration and Exploitation in Reinforcement Learning

01:28:05 🔍 Key Algorithms in Reinforcement Learning

01:30:29 🎯 Key Methods in Reinforcement Learning

01:32:18 📘 Temporal Difference Learning in Reinforcement Learning

01:33:37 🔍 Understanding Temporal Difference Learning in RL

01:35:09 🔍 Key Differences Between On-Policy and Off-Policy in Reinforcement Learning

01:36:51 🔍 Understanding SARSA and Q-Learning

01:38:28 🚀 Key Concepts in Reinforcement Learning Algorithms

01:42:18 🔍 Estimating Areas and Value Functions

01:45:27 🎯 Understanding Episode Returns in Reinforcement Learning

01:47:50 🔍 Understanding the Update Equation in RL

01:49:07 🔍 Understanding the Update Equation in Reinforcement Learning

01:50:47 🧠 Understanding the Update Rule and Convergence in RL

01:52:55 🚀 Importance of Exploration in RL

01:54:31 🔄 Temporal Difference in Reinforcement Learning

01:56:21 🥾 Understanding Bootstrapping in Computer Science

02:00:46 🔍 Bootstrapping and Temporal Difference in Reinforcement Learning

02:04:14 📚 Understanding Deep Reinforcement Learning and Temporal Difference

02:08:46 📌 Project Timeline and Testing Strategy

02:11:10 📝 Discussion on X and V in Matrix Operations

02:16:49 📚 Introduction to Algorithms in Reinforcement Learning

02:19:14 🔄 Generalized Policy Iteration in Reinforcement Learning

02:20:37 🎯 Understanding Temporal Difference Update in RL

02:22:50 🤖 Understanding Greedy and Epsilon-Greedy Policies

02:25:17 🎯 Action Value vs. State Value in Reinforcement Learning

02:26:54 🎯 Epsilon Greedy Policy in Reinforcement Learning

02:29:29 🎯 Understanding Sarsa in Reinforcement Learning

02:32:12 🎯 Understanding the Action and Reward Process in Reinforcement Learning

02:34:42 ⚙️ Understanding Q-Learning and Convergence

02:36:31 🚀 Q-Learning vs. Exploration Challenges

02:39:38 💡 Differences Between SARSA and Q-Learning

02:43:55 🔍 Key Concepts in Q-learning

02:46:32 🚀 Scalability and Combining Deep Learning with Reinforcement Learning

02:47:50 🚀 Using Deep Learning in Q-Learning

02:50:20 🚀 Importance of Deep Learning in Reinforcement Learning

02:51:50 📊 End-to-End Learning and Deep Learning Basics

02:52:58 🤖 Neural Network Layers and Activation Functions

02:54:41 ⚙️ Understanding Backpropagation and Gradient Descent

02:56:20 🔍 Effects of Small Slopes in Activation Functions

02:57:38 🔍 Understanding Loss Functions and Optimization in RL

03:02:14 📊 Understanding Predicted and True Values in Networks

03:04:40 🚀 Implementing Reinforcement Learning with Python Libraries

03:07:49 🍱 Lunch Break and Preferences

03:13:39 🔍 Understanding Q Function and Training in Reinforcement Learning

03:16:57 ⚙️ Q-Learning Update and Epsilon-Greedy Strategy

03:19:23 🚀 Key Concepts in Q-function and Temporal Difference

03:22:02 ⚙️ Understanding Deep Q-Networks (DQN)

03:26:31 📚 Understanding Experience Replay and Target Networks

03:30:43 🎯 Stability in Reinforcement Learning

03:33:36 📚 Reinforcement Learning Models and Methods

03:37:36 🚀 Function Approximators and Network Updates in RL

03:42:15 📊 Reinforcement Learning Concepts: Policy Gradient and Value Functions

03:44:16 📊 Understanding the Gradient Function in Reinforcement Learning

03:46:23 🔍 Understanding Policy Gradient Method and Reinforce Algorithm

03:48:32 🎯 Actor-Critic Algorithms in Reinforcement Learning

03:50:30 🎯 Understanding Policy Gradient and Reinforce Algorithm

03:51:51 🎯 Actor-Critic Method in Reinforcement Learning

03:55:15 🎯 Understanding Policy Networks and Cross-Entropy

03:57:15 🎭 Actor-Critic and Advantage Functions

03:59:14 🔍 Overview of Reinforcement Learning Algorithms

04:01:06 🎓 Hybrid Methods in Reinforcement Learning

04:02:22 🔍 Understanding Safe Reinforcement Learning and Its Methods

04:06:56 🚀 Overview of Dynamic Grid World and Algorithms

04:09:52 🤔 Planning for Friday

CC BY-SA 4.0 Pilwon Hur. Last modified: October 08, 2024. Website built with Franklin.jl and the Julia programming language.
