1
00:00:12,120 --> 00:00:16,630
In this lecture we are going to begin discussing the code for our reinforcement learning

2
00:00:16,650 --> 00:00:22,980
stock trader. You will notice that we are not using Google Colab for this, which I found performs very

3
00:00:22,980 --> 00:00:25,620
slowly compared to running it locally.

4
00:00:25,620 --> 00:00:28,050
For this reason I'm running the scripts locally.

5
00:00:28,110 --> 00:00:34,050
But as usual, if you prefer to run this code in Google Colab, all you would need to do is paste the code

6
00:00:34,080 --> 00:00:40,650
and modify the data source. The relevant file for this lecture is rl_trader.py, which can be found

7
00:00:40,650 --> 00:00:41,910
in the course repository.

8
00:00:44,460 --> 00:00:45,210
In this lecture,

9
00:00:45,210 --> 00:00:49,910
we're going to describe just a few initial bits and pieces of our program.

10
00:00:49,950 --> 00:00:51,930
The first function we have is get_data.

11
00:00:52,800 --> 00:00:56,180
Luckily, I've already included the data in the course repository,

12
00:00:56,220 --> 00:00:58,510
so all we have to do is import it.

13
00:00:58,920 --> 00:01:03,350
The data is stored in a CSV, which we can load into a pandas DataFrame.

14
00:01:03,420 --> 00:01:10,490
But of course, what we are interested in are the values, so we can call .values to get a NumPy array.

15
00:01:10,560 --> 00:01:15,700
Note that in order to simplify this problem, we're going to use the close price only.

16
00:01:15,810 --> 00:01:19,290
You're welcome to look at the CSV for yourself if you want to know what's in it.

17
00:01:20,420 --> 00:01:26,630
As usual, this returns a T by D sequence, where T is the number of time steps and D the feature

18
00:01:26,630 --> 00:01:33,200
dimensionality, and in this case that's the number of stocks, three. Here are the three stocks we'll be

19
00:01:33,200 --> 00:01:35,670
looking at, in case you're interested.

20
00:01:35,720 --> 00:01:42,710
First we have Apple, that's AAPL. Next we have Motorola, that's MSI. And finally we have Starbucks, that's

21
00:01:42,710 --> 00:01:48,370
SBUX. Note that each row corresponds to the same date.

22
00:01:48,440 --> 00:02:00,380
So for each stock we have data from about February 2013 to February 2018.
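
The loading step described above can be sketched roughly as follows. This is only a minimal sketch: the inline CSV text, its column names, and its prices are made-up stand-ins for the course file, and get_data takes a source argument here only so the example runs without that file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course CSV: one close-price column per stock.
# The column names and prices below are invented for illustration.
csv_text = """AAPL,MSI,SBUX
67.85,60.30,28.18
68.56,60.90,28.07
66.84,60.83,28.13
"""


def get_data(source):
    # Load the CSV into a DataFrame, then return the raw values as a NumPy array
    df = pd.read_csv(source)
    return df.values


data = get_data(io.StringIO(csv_text))
print(data.shape)  # (T, D): T time steps (rows) by D stocks (columns)
```

Each row is one time step and each column is one stock, which is the T by D layout mentioned above.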

23
00:02:00,410 --> 00:02:02,580
Next we have the replay buffer.

24
00:02:02,780 --> 00:02:03,980
This has three functions:

25
00:02:03,980 --> 00:02:08,730
the constructor, store, and sample_batch. In the constructor,

26
00:02:08,750 --> 00:02:12,230
we initialize our array buffers and pointers.

27
00:02:12,230 --> 00:02:18,500
First we have obs1_buf and obs2_buf, which store the states and next states respectively.

28
00:02:18,500 --> 00:02:21,950
Next we have acts_buf, which stores our actions.

29
00:02:21,950 --> 00:02:28,580
These are represented by integers from zero up to 26 inclusive, so we can use uint8 to represent

30
00:02:28,580 --> 00:02:30,320
them.

31
00:02:30,450 --> 00:02:36,180
Next we have the rewards buffer, which stores our rewards, and we have the done buffer, which stores the

32
00:02:36,180 --> 00:02:37,530
done flag.

33
00:02:37,530 --> 00:02:42,630
This can only be 0 or 1, so again it can be a uint8.

34
00:02:42,630 --> 00:02:48,300
Lastly, we have a pointer, which starts at zero. The current size of the buffer is zero, and the max size

35
00:02:48,300 --> 00:02:50,960
of the buffer is specified as the size argument.

36
00:02:54,470 --> 00:02:56,500
Next we have the store function.

37
00:02:56,600 --> 00:02:58,970
This stores the state, action, reward,

38
00:02:58,970 --> 00:03:05,510
next state, and done flag in their respective buffers at the index self.ptr.

39
00:03:05,510 --> 00:03:10,790
After this, we increment the pointer so that the next time we call the store function, all the values

40
00:03:10,790 --> 00:03:13,570
will be stored in the next position.

41
00:03:13,670 --> 00:03:15,530
And remember, this is circular,

42
00:03:15,530 --> 00:03:22,460
so we make use of the modulo operation: when incrementing the pointer would go up to max size, it's set

43
00:03:22,460 --> 00:03:24,910
back down to zero.

44
00:03:24,920 --> 00:03:30,590
Lastly, we set the current size of our memory, which is equal to the minimum of the previous size plus

45
00:03:30,590 --> 00:03:31,340
one,

46
00:03:31,340 --> 00:03:34,310
since we just added one more value, or max size,

47
00:03:34,310 --> 00:03:35,810
in the case the buffer is full.

48
00:03:39,020 --> 00:03:41,660
Finally, we have the sample_batch function.

49
00:03:41,840 --> 00:03:46,080
This chooses random indices from 0 up to the size of the buffer.

50
00:03:46,400 --> 00:03:53,030
Then we return a dictionary containing the states, actions, rewards, and so forth, indexed by those indices.
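
Putting the three functions together, a circular replay buffer along these lines might look like the sketch below. The names rews_buf and done_buf, the constructor arguments, and the dictionary keys are assumptions for illustration and may not match the course file exactly.

```python
import numpy as np


class ReplayBuffer:
    def __init__(self, obs_dim, size):
        # Pre-allocate fixed-size arrays for each part of a transition
        self.obs1_buf = np.zeros((size, obs_dim), dtype=np.float32)  # states
        self.obs2_buf = np.zeros((size, obs_dim), dtype=np.float32)  # next states
        self.acts_buf = np.zeros(size, dtype=np.uint8)    # actions are 0..26
        self.rews_buf = np.zeros(size, dtype=np.float32)  # rewards (assumed name)
        self.done_buf = np.zeros(size, dtype=np.uint8)    # done flag, 0 or 1 (assumed name)
        self.ptr, self.size, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        # Write the transition at the current pointer position
        self.obs1_buf[self.ptr] = obs
        self.obs2_buf[self.ptr] = next_obs
        self.acts_buf[self.ptr] = act
        self.rews_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        # Circular increment: wraps back to 0 once it reaches max_size
        self.ptr = (self.ptr + 1) % self.max_size
        # The size grows by one per store until the buffer is full
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size=32):
        # Random indices drawn from the filled part of the buffer
        idxs = np.random.randint(0, self.size, size=batch_size)
        return dict(
            s=self.obs1_buf[idxs],
            s2=self.obs2_buf[idxs],
            a=self.acts_buf[idxs],
            r=self.rews_buf[idxs],
            d=self.done_buf[idxs],
        )


# Store 4 transitions in a buffer of size 3: the 4th overwrites slot 0
buf = ReplayBuffer(obs_dim=4, size=3)
for i in range(4):
    obs = np.full(4, i, dtype=np.float32)
    buf.store(obs, act=i, rew=float(i), next_obs=obs + 1, done=0)
batch = buf.sample_batch(batch_size=2)
```

After the loop the pointer has wrapped around to 1 and the size stays capped at 3, which is the circular behavior described above.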

51
00:03:59,110 --> 00:04:04,540
The next thing we're going to look at is the get_scaler function. This takes in an environment object,

52
00:04:04,600 --> 00:04:07,930
since that will be used in fitting our scaler.

53
00:04:07,930 --> 00:04:11,520
The idea is, in order to get the right parameters for a scaler,

54
00:04:11,650 --> 00:04:15,140
we must have some data. In order to get this data,

55
00:04:15,430 --> 00:04:21,150
it suffices to just play an episode randomly and store each of the states we encounter.

56
00:04:21,150 --> 00:04:26,010
There is no need to have an agent, because such an agent wouldn't be trained anyway.

57
00:04:26,080 --> 00:04:31,870
So you can see that when we choose an action to perform, the only thing we need to do is sample a value

58
00:04:31,870 --> 00:04:34,000
from the action space.

59
00:04:34,210 --> 00:04:41,260
When we're done, we create a StandardScaler object and fit it to the states we encountered.

60
00:04:41,320 --> 00:04:45,730
Now one thing you could do to make this more accurate is run it for multiple episodes.
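
As a rough sketch of this idea, the code below plays one random episode and fits a scaler to the visited states. The ToyEnv class is a hypothetical stand-in (the course's trading environment is not covered in this lecture), and its reset, step, and action_space interface is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


class ToyEnv:
    """Hypothetical stand-in environment that emits random states."""

    def __init__(self, n_steps=50, state_dim=7, n_actions=27):
        self.n_steps = n_steps
        self.state_dim = state_dim
        self.action_space = np.arange(n_actions)
        self.t = 0

    def reset(self):
        self.t = 0
        return np.random.randn(self.state_dim)

    def step(self, action):
        self.t += 1
        state = np.random.randn(self.state_dim)
        done = self.t >= self.n_steps
        return state, 0.0, done, {}


def get_scaler(env):
    # Play one episode with random actions and record every state seen
    states = [env.reset()]
    done = False
    while not done:
        action = np.random.choice(env.action_space)  # no agent needed
        state, reward, done, info = env.step(action)
        states.append(state)
    # Fit the scaler's mean and variance to the collected states
    scaler = StandardScaler()
    scaler.fit(states)
    return scaler


scaler = get_scaler(ToyEnv())
scaled = scaler.transform([np.zeros(7)])
```

Running more than one episode would only mean wrapping the loop in an outer loop and collecting all the states before calling fit.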

61
00:04:50,030 --> 00:04:53,140
Next we have a function called maybe_make_dir.

62
00:04:53,150 --> 00:04:55,220
This is more of a utility function.

63
00:04:55,280 --> 00:05:01,520
It checks if a particular directory exists and if it doesn't then it creates the directory.

64
00:05:01,550 --> 00:05:06,650
The reason we need this is because we're going to store our trained model and the rewards we encounter

65
00:05:06,860 --> 00:05:15,180
to separate files so that we can plot them afterwards.
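
A utility like this can be written in a few lines. The sketch below is one plausible version, and the folder name is made up for the example.

```python
import os
import tempfile


def maybe_make_dir(directory):
    # Create the directory only if it does not already exist
    if not os.path.exists(directory):
        os.makedirs(directory)


base = tempfile.mkdtemp()
target = os.path.join(base, "rl_trader_models")  # hypothetical folder name
maybe_make_dir(target)
maybe_make_dir(target)  # calling it again is harmless
```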

66
00:05:15,280 --> 00:05:21,790
Next we have the MLP class, which is going to represent our Q-function approximator. As you can see,

67
00:05:21,790 --> 00:05:26,040
this is just a standard ANN with an arbitrary number of hidden layers.

68
00:05:26,350 --> 00:05:31,840
So as input to the constructor, we take in the number of inputs, the number of actions, the number of hidden

69
00:05:31,840 --> 00:05:34,930
layers and the number of hidden units.

70
00:05:34,930 --> 00:05:41,860
Next we create a variable M, initialized to be the number of inputs, and a list to store our layers.

71
00:05:41,860 --> 00:05:45,170
Next we enter a loop so we can add each of our layers.

72
00:05:45,310 --> 00:05:52,930
So n_hidden_layers times, I'm going to create a linear layer with size M by hidden_dim, I'm going to

73
00:05:52,930 --> 00:05:55,870
update M to be hidden_dim, the input size of the next layer,

74
00:05:56,680 --> 00:06:02,500
I'm going to append the layer to my list of layers, and then I'm going to add a ReLU nonlinearity.

75
00:06:03,430 --> 00:06:04,250
Outside the loop,

76
00:06:04,270 --> 00:06:10,600
I'm going to add one final layer with output size equal to n_action, and then I'm going to pass all these

77
00:06:10,600 --> 00:06:14,980
layers into a Sequential object and then store that in the layers attribute.

78
00:06:17,850 --> 00:06:23,040
Next we have the forward function which simply passes the data through our previously created sequential

79
00:06:23,040 --> 00:06:25,630
object.

80
00:06:25,900 --> 00:06:31,120
Next we have some special functions that are not normally part of a PyTorch model, for saving and loading.

81
00:06:31,900 --> 00:06:32,560
For saving,

82
00:06:32,560 --> 00:06:37,930
I just call torch.save, pass in the state dict belonging to the current object, and then the

83
00:06:37,930 --> 00:06:41,950
path to save the file to. For loading,

84
00:06:41,950 --> 00:06:43,670
I call the load_state_dict function

85
00:06:43,690 --> 00:06:48,160
belonging to this class, passing in the output from torch.load(path).
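
To make the structure concrete, here is a sketch of what such a class could look like. The default values and the method names save_weights and load_weights are assumptions for illustration, not necessarily what the course file uses.

```python
import os
import tempfile

import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, n_inputs, n_action, n_hidden_layers=1, hidden_dim=32):
        super().__init__()
        M = n_inputs  # input size of the layer about to be created
        layers = []
        for _ in range(n_hidden_layers):
            layers.append(nn.Linear(M, hidden_dim))
            M = hidden_dim  # the next layer takes hidden_dim inputs
            layers.append(nn.ReLU())
        # Final layer: one output per action
        layers.append(nn.Linear(M, n_action))
        self.layers = nn.Sequential(*layers)

    def forward(self, X):
        return self.layers(X)

    def save_weights(self, path):  # assumed method name
        torch.save(self.state_dict(), path)

    def load_weights(self, path):  # assumed method name
        self.load_state_dict(torch.load(path))


model = MLP(n_inputs=7, n_action=27)
x = torch.zeros(4, 7)
out = model(x)  # one row of 27 values per input row

# Round trip: a second model loads the first model's weights
path = os.path.join(tempfile.mkdtemp(), "mlp_weights.pt")  # hypothetical file name
model.save_weights(path)
model2 = MLP(n_inputs=7, n_action=27)
model2.load_weights(path)
```

After loading, both models should produce identical outputs for the same input, since only the weights differ between freshly constructed instances.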

86
00:06:56,880 --> 00:06:57,420
Next,

87
00:06:57,420 --> 00:07:01,850
we have a few convenience functions to interact with the rest of the script.

88
00:07:01,890 --> 00:07:05,370
I like the rest of the script to be library agnostic.

89
00:07:05,370 --> 00:07:10,080
What I mean by that is I don't want the rest of the script to have to know anything about PyTorch at

90
00:07:10,080 --> 00:07:10,980
all.

91
00:07:10,980 --> 00:07:15,780
It should work no matter how I implemented my neural network, whether that's with PyTorch, TensorFlow,

92
00:07:15,790 --> 00:07:17,730
JAX, or any other library.

93
00:07:17,880 --> 00:07:24,420
So to that end, we basically need two kinds of functions that scikit-learn would normally provide: a training

94
00:07:24,420 --> 00:07:28,410
function and a prediction function. And the only interface to

95
00:07:28,410 --> 00:07:32,560
these should be NumPy arrays or other basic Python types.

96
00:07:32,700 --> 00:07:36,260
There should be no PyTorch-specific inputs or outputs.

97
00:07:36,540 --> 00:07:40,810
So for the predict function we take in a model and a set of states.

98
00:07:41,100 --> 00:07:48,240
We can assume that the set of states has the typical N by D data shape. Inside the function, we use

99
00:07:48,240 --> 00:07:51,760
torch.no_grad, since this is just a prediction.

100
00:07:51,840 --> 00:07:57,550
First we convert the states to float32 and then convert them into a torch tensor.

101
00:07:57,660 --> 00:08:03,510
Then we pass that into the model and retrieve the output. Once we have the output, we call .numpy()

102
00:08:03,810 --> 00:08:05,880
to return the prediction as a NumPy array.
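
A prediction helper in this spirit might look like the sketch below. The argument name np_states is an assumption, and a single nn.Linear layer stands in for the real model so the example is self-contained.

```python
import numpy as np
import torch
import torch.nn as nn


def predict(model, np_states):
    # No gradients are needed, since this is inference only
    with torch.no_grad():
        inputs = torch.from_numpy(np_states.astype(np.float32))
        output = model(inputs)
        # Hand the result back as a plain NumPy array
        return output.numpy()


model = nn.Linear(7, 27)  # stand-in for the Q-function model
q_values = predict(model, np.random.randn(5, 7))  # N=5 states, D=7 features
```

The caller passes in a NumPy array and gets a NumPy array back, so nothing outside this function needs to know that PyTorch is involved.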

103
00:08:11,380 --> 00:08:14,020
Next we have the train_one_step function.

104
00:08:14,020 --> 00:08:19,440
I call it this because we're going to do one step of our optimizer for every step we take in the environment.

105
00:08:20,500 --> 00:08:27,030
As input we take in the model, the criterion, the optimizer, and a set of inputs and targets.

106
00:08:27,040 --> 00:08:32,320
Now you might be like, "Wait a minute, I thought you said you don't want any PyTorch-specific inputs."

107
00:08:32,320 --> 00:08:34,110
So I lied a little bit.

108
00:08:34,120 --> 00:08:38,260
Well, you'll see later that the agent object will have no PyTorch-specific arguments,

109
00:08:38,320 --> 00:08:40,060
so that's kind of the true interface.

110
00:08:40,780 --> 00:08:46,980
Alternatively, you could create the criterion and the optimizer as global variables. Otherwise we'd have

111
00:08:46,980 --> 00:08:52,560
to make this function part of the model, including the criterion and the optimizer and so forth, which

112
00:08:52,560 --> 00:08:54,630
would be very unconventional PyTorch,

113
00:08:54,780 --> 00:08:56,650
so we don't want to do that.

114
00:08:56,670 --> 00:09:02,800
So anyway, the first thing we do is convert the inputs and targets into float32 and turn them into torch

115
00:09:02,840 --> 00:09:04,140
tensors.

116
00:09:04,170 --> 00:09:06,430
Next we zero the gradients.

117
00:09:06,450 --> 00:09:09,710
Next we pass the inputs through the model to get the outputs.

118
00:09:09,780 --> 00:09:11,440
And next we calculate the loss.

119
00:09:11,490 --> 00:09:15,170
And as usual, we call loss.backward() and optimizer.step(),

120
00:09:15,180 --> 00:09:17,010
and this does one step of gradient descent.
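
The steps just listed can be sketched as follows. Returning the loss value is an addition made here so the effect is visible. The linear model, MSE loss, SGD optimizer, and learning rate are assumptions for the example, not necessarily what the course script uses.

```python
import numpy as np
import torch
import torch.nn as nn


def train_one_step(model, criterion, optimizer, inputs, targets):
    # Convert the NumPy inputs and targets to float32 torch tensors
    inputs = torch.from_numpy(inputs.astype(np.float32))
    targets = torch.from_numpy(targets.astype(np.float32))
    # Zero the gradients left over from the previous step
    optimizer.zero_grad()
    # Forward pass and loss
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # Backward pass, then one optimizer update
    loss.backward()
    optimizer.step()
    return loss.item()  # returned only for illustration


np.random.seed(0)
torch.manual_seed(0)
model = nn.Linear(3, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

X = np.random.randn(8, 3)
Y = np.zeros((8, 2))
losses = [train_one_step(model, criterion, optimizer, X, Y) for _ in range(50)]
```

Calling the function repeatedly on the same batch should drive the loss down, which is a quick sanity check that each call really performs one gradient step.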
