1
00:00:11,110 --> 00:00:16,780
OK, so in this lecture, we are going to start looking at some code. Although the code I provided

2
00:00:16,780 --> 00:00:21,250
to you is in a Colab notebook, you can see that I'm running this locally in a Jupyter notebook.

3
00:00:21,730 --> 00:00:26,360
This is one of those rare situations where using a Jupyter notebook actually makes sense.

4
00:00:26,980 --> 00:00:28,180
So why do I say this?

5
00:00:28,780 --> 00:00:34,660
Firstly, recall that I've stored my AWS credentials in a local file, which I do not have on Google

6
00:00:34,660 --> 00:00:35,290
Colab.

7
00:00:35,710 --> 00:00:39,940
So if you'd like to use Google Colab, then make sure that you store your credentials in the correct

8
00:00:39,940 --> 00:00:40,640
location.

9
00:00:41,590 --> 00:00:46,330
For me, it makes sense to run the code locally because that's where my AWS credentials are stored.

10
00:00:47,200 --> 00:00:52,420
The second major reason a notebook is good for this exercise is that, as you recall, some of these

11
00:00:52,420 --> 00:00:57,370
steps are going to take a very long time, sometimes hours, as you'll see very shortly.

12
00:00:57,380 --> 00:01:02,350
Everything you do is a job that you launch and then you check the status of your job to see whether

13
00:01:02,350 --> 00:01:03,270
or not it's done.

14
00:01:03,880 --> 00:01:06,430
Only when it's done can you move on to the next step.

15
00:01:07,330 --> 00:01:12,130
So basically, you're going to run some code to launch various jobs and then you're going to run more

16
00:01:12,130 --> 00:01:14,230
code to check the status of those jobs.

17
00:01:14,680 --> 00:01:18,820
But you're going to be checking the status repeatedly until those jobs are finished.

18
00:01:19,300 --> 00:01:24,100
This basically means you're hitting the run button over and over again until the job is done, which

19
00:01:24,100 --> 00:01:26,200
is nice because you can just click a button.
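
The launch-then-check pattern described above can also be written as a small polling loop. This is a minimal sketch, not code from the lecture notebook: `get_status` is a hypothetical callable standing in for whatever status call you use, and the status strings are assumptions for illustration.

```python
import time


def wait_for_job(get_status, poll_seconds=60, max_polls=None):
    """Call get_status() repeatedly until the job reports a finished state.

    get_status is a hypothetical callable that returns a status string.
    The status names used here are assumptions for illustration.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        # Stop as soon as the job has either succeeded or failed.
        if status in ("ACTIVE", "CREATE_FAILED"):
            return status
        # Optionally give up after a fixed number of checks.
        if max_polls is not None and polls >= max_polls:
            return status
        time.sleep(poll_seconds)


# Simulated job that finishes on the third check (no real service involved).
statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
final_status = wait_for_job(lambda: next(statuses), poll_seconds=0)
print(final_status)
```

In a notebook, re-running a single status cell by hand does the same job; a loop like this is only useful if you want the check to be unattended.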

20
00:01:27,190 --> 00:01:32,320
Another reason is that when it comes to deleting a job, you'll have to go backwards in the notebook.

21
00:01:32,710 --> 00:01:36,490
You'll see what I mean if you ever try to delete one of the things we're about to create.

22
00:01:37,210 --> 00:01:42,610
Basically, it's a long chain of dependencies, so you can't delete your data set without first deleting

23
00:01:42,610 --> 00:01:48,160
your predictor, but you can't delete your predictor without first deleting your forecast and so on

24
00:01:48,160 --> 00:01:48,970
and so forth.

25
00:01:49,480 --> 00:01:52,780
In any case, you'll see what each of these items is later in this section.

26
00:01:53,860 --> 00:01:57,790
So because of this, it's very helpful to be able to scroll through the previous code

27
00:01:57,790 --> 00:02:00,790
you've run and then rerun the relevant blocks when you need to.

28
00:02:01,480 --> 00:02:03,720
Now, of course, you can do all that in Colab too.

29
00:02:03,730 --> 00:02:09,970
So this is more like why you might not want to do this in a regular Python script or in the Python console,

30
00:02:10,150 --> 00:02:11,570
which I frequently use.

31
00:02:12,490 --> 00:02:17,890
One reason you might want to avoid a Colab notebook is that, remember, it's online and you're using

32
00:02:17,890 --> 00:02:19,030
shared resources.

33
00:02:19,360 --> 00:02:24,580
If you stay idle for too long or your Internet disconnects, Colab simply destroys your session and

34
00:02:24,580 --> 00:02:25,660
you have to start again.

35
00:02:26,290 --> 00:02:31,240
So if your model takes five hours to train, it's likely your notebook won't be running anymore by the

36
00:02:31,240 --> 00:02:32,060
time it's done.

37
00:02:32,830 --> 00:02:38,410
So my personal preference, if I were you, would be to just download the Colab notebook locally as

38
00:02:38,410 --> 00:02:41,170
a Jupyter notebook and run it on your own computer.

39
00:02:41,740 --> 00:02:45,450
Colab is just a convenient way for me to share the notebook with you.

40
00:02:50,240 --> 00:02:56,900
So the basic idea is we're going to start by creating some CSVs that will be read by AWS Forecast.

41
00:02:57,440 --> 00:03:03,200
The next step is to upload these CSVs to S3, which is Amazon's Simple Storage Service.

42
00:03:03,900 --> 00:03:08,260
AWS Forecast will then read the files from S3 and train a model from them.

43
00:03:08,990 --> 00:03:14,840
Once we've trained our model, we'll generate a forecast and then compare that to the true time series.

44
00:03:19,420 --> 00:03:26,080
So let's start by importing pandas, yfinance, boto3, and datetime. In the first block of code, we'll

45
00:03:26,080 --> 00:03:29,590
download and transform the data we'll be using for the forecast.

46
00:03:30,070 --> 00:03:35,560
So here we call yf.download, and we'll be using stock prices from the S&P 500.

47
00:03:36,100 --> 00:03:42,280
We use data from January 1, 2018 up to April 17, 2021.

48
00:03:43,930 --> 00:03:48,250
Next, we call df.head() just to remind ourselves what the data looks like.

49
00:03:48,720 --> 00:03:54,430
As you recall, the date column is used as an index, and then we have open, high, low, close, adjusted

50
00:03:54,430 --> 00:03:55,600
close, and volume.

51
00:03:56,710 --> 00:04:00,040
So we'll be using the close column instead of the adjusted close column.

52
00:04:00,730 --> 00:04:06,370
The reason for this is if you look at the open, high, and low prices, notice that these are not adjusted.

53
00:04:06,800 --> 00:04:10,140
So in the first row, the adjusted close is about 253.

54
00:04:10,480 --> 00:04:15,610
But the open, high, low, and close prices are all around 267 to 269.

55
00:04:19,620 --> 00:04:23,560
In the next block, we're going to start working on filling in the missing data.

56
00:04:24,240 --> 00:04:29,700
The first thing we need to do in this process is generate a data frame with all the necessary dates.

57
00:04:30,450 --> 00:04:36,510
To do this, we use the date_range function. For the start date, we'll pass in the first date from our

58
00:04:36,510 --> 00:04:43,470
data, at index zero. For the end date, we'll pass in the last date from our data, at index minus one.
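
A minimal sketch of this step, using a tiny made-up data frame in place of the downloaded prices (the values and dates are invented for illustration). With no frequency given, `pd.date_range` defaults to calendar days, which is what makes the result contiguous.

```python
import pandas as pd

# Toy stand-in for the downloaded prices: trading days only, weekend missing.
df = pd.DataFrame(
    {"Close": [268.77, 270.47, 271.61]},
    index=pd.to_datetime(["2018-01-04", "2018-01-05", "2018-01-08"]),
)

# First and last dates of the data define the full calendar range.
dates = pd.date_range(df.index[0], df.index[-1])
print(dates)
```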

59
00:04:44,870 --> 00:04:50,630
When we print out the result, you can see that all the dates are now contiguous. By the way, note

60
00:04:50,630 --> 00:04:55,670
that for this lecture, it's essentially necessary to have run this script before the actual lecture

61
00:04:55,670 --> 00:04:59,130
recording, since some steps can take several hours to run.

62
00:04:59,600 --> 00:05:03,460
So that's why I'm not clicking around the notebook live, since it wouldn't make a lot of sense.

63
00:05:07,610 --> 00:05:14,320
The next step is to create a new data frame with our new date range as an index. If we do a .head(),

64
00:05:14,510 --> 00:05:18,980
we can see that it's an empty data frame with no columns except for the index.

65
00:05:22,920 --> 00:05:29,100
Next, we call the join function, joining our new data frame with the existing data. We use an outer

66
00:05:29,100 --> 00:05:30,960
join so that all the rows are kept.

67
00:05:31,620 --> 00:05:37,440
You can see from the output that our data frame now has NaNs for the days where there was no data before.
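
The empty data frame and the outer join can be sketched as follows. The prices here are made-up toy values, and the variable names are assumptions rather than the ones used in the lecture notebook.

```python
import pandas as pd

# Toy prices with a weekend gap between Jan 5 and Jan 8.
prices = pd.DataFrame(
    {"Close": [1.0, 2.0, 3.0], "Volume": [100, 200, 300]},
    index=pd.to_datetime(["2018-01-04", "2018-01-05", "2018-01-08"]),
)

# Empty data frame: no columns, only the full calendar index.
dates = pd.date_range(prices.index[0], prices.index[-1])
empty = pd.DataFrame(index=dates)

# Outer join keeps every date; days without prices become NaN.
joined = empty.join(prices, how="outer")
print(joined)
```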

68
00:05:41,090 --> 00:05:47,040
The next step is to fill in the missing data. As you recall, for stock prices we use forward filling.

69
00:05:47,840 --> 00:05:52,640
Remember that we don't want to do something like interpolation, since that requires you to look into

70
00:05:52,640 --> 00:05:53,330
the future.

71
00:05:54,440 --> 00:05:56,790
The only exception to this is the volume column.

72
00:05:57,440 --> 00:06:02,510
We know that for days where there is missing data, those are non-trading days and the trading volume

73
00:06:02,510 --> 00:06:04,780
for those days is obviously zero.

74
00:06:05,730 --> 00:06:09,320
Once we've done this, we do another .head() to check the output.

75
00:06:11,360 --> 00:06:16,610
So we can see here that for the days where we previously had missing data, the prices are now carried

76
00:06:16,610 --> 00:06:18,980
forward, whereas the volume is zero.
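
A minimal sketch of the filling step on toy data (values invented for illustration). The volume column is set to zero first, so that the forward fill afterwards only affects the price columns.

```python
import numpy as np
import pandas as pd

# Toy frame with two non-trading days in the middle.
df = pd.DataFrame(
    {
        "Close": [1.0, 2.0, np.nan, np.nan, 3.0],
        "Volume": [100.0, 200.0, np.nan, np.nan, 300.0],
    },
    index=pd.date_range("2018-01-04", periods=5),
)

# Non-trading days have zero volume.
df["Volume"] = df["Volume"].fillna(0)

# Prices are carried forward from the last known value (no look-ahead).
df = df.ffill()
print(df)
```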

77
00:06:23,270 --> 00:06:29,300
Next, we're going to save our data frame to a CSV just in case we need it later. Since this script

78
00:06:29,300 --> 00:06:31,790
is meant to be run from start to end in one go,

79
00:06:32,090 --> 00:06:33,850
this probably won't be necessary.

80
00:06:34,280 --> 00:06:39,350
However, saving it lets you return to the script at another time or continue the work in a different

81
00:06:39,350 --> 00:06:39,940
script.

82
00:06:42,910 --> 00:06:48,070
So I always think it's helpful to actually look at the output of what we've done. So here I'm using

83
00:06:48,070 --> 00:06:50,810
the head command to check what our CSV looks like.

84
00:06:52,350 --> 00:06:57,140
As you can see, it is indeed a CSV of our data frame, including the date column.

85
00:06:57,930 --> 00:07:01,080
Note that the date column doesn't have a name, which is fine.

86
00:07:04,990 --> 00:07:11,830
Next, as you recall, AWS Forecast requires that we have a column called item_id. Therefore,

87
00:07:11,830 --> 00:07:14,630
I've created a column called item_id.

88
00:07:15,220 --> 00:07:19,830
It's going to take on the value SPY, since that's the ticker for our data set.

89
00:07:20,530 --> 00:07:25,340
In practice, what you might want to do in your own experiments is a vector autoregression.

90
00:07:25,900 --> 00:07:31,300
So, for example, you can include all the stocks in the S&P 500 and each of them would have their own

91
00:07:31,300 --> 00:07:32,170
item_id.

92
00:07:33,250 --> 00:07:39,400
As a side note, these can also be stored as separate CSVs, since AWS Forecast has the ability to

93
00:07:39,400 --> 00:07:41,620
look at more than one CSV at once.

94
00:07:45,030 --> 00:07:51,330
The next step is to define the forecast length, which I've set to 30, so our forecasts will be approximately

95
00:07:51,330 --> 00:07:52,850
one month into the future.

96
00:07:53,640 --> 00:07:59,790
Using this, we can now define our train set, which is the data frame df, up to the last 30 data

97
00:07:59,790 --> 00:08:00,440
points.
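
The item_id column and the train split can be sketched like this. The data is a made-up placeholder, and the names `FORECAST_LENGTH` and `train` are assumptions for illustration.

```python
import pandas as pd

FORECAST_LENGTH = 30  # number of days held out for the forecast

# Toy daily series standing in for the filled price data.
df = pd.DataFrame(
    {"Close": range(100)},
    index=pd.date_range("2018-01-01", periods=100),
)

# Every row belongs to the same item, identified by its ticker.
df["item_id"] = "SPY"

# Train on everything except the last FORECAST_LENGTH rows.
train = df.iloc[:-FORECAST_LENGTH]
print(len(df), len(train))
```

The held-out rows, `df.iloc[-FORECAST_LENGTH:]`, are what the forecast is later compared against.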

98
00:08:03,540 --> 00:08:09,720
The next step is to separate out the target time series and the related time series. We discussed

99
00:08:09,720 --> 00:08:13,170
the differences between these data sets earlier in this section.

100
00:08:14,660 --> 00:08:20,240
The target series is the series whose value we want to forecast. Since that's the close price, we've

101
00:08:20,240 --> 00:08:21,290
chosen the close column.

102
00:08:22,910 --> 00:08:27,680
We also need the item_id column, since that's required by AWS Forecast.

103
00:08:28,430 --> 00:08:33,530
The related series is a series of features that can be used to predict the target series.

104
00:08:34,610 --> 00:08:38,870
Technically, we don't know that these would be useful, but we are making the assumption that they

105
00:08:38,870 --> 00:08:39,250
are.

106
00:08:39,830 --> 00:08:45,020
So for the related series, we choose open, high, low, and volume. As before,

107
00:08:45,170 --> 00:08:46,760
we also need the item_id.
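
Splitting the train set into the two series amounts to selecting columns. This is a sketch on toy values; the column names follow the usual yfinance layout, which is an assumption here.

```python
import pandas as pd

# Toy train set with the usual price columns plus item_id.
train = pd.DataFrame(
    {
        "Open": [1.0, 2.0],
        "High": [1.5, 2.5],
        "Low": [0.5, 1.5],
        "Close": [1.2, 2.2],
        "Adj Close": [1.1, 2.1],
        "Volume": [100, 0],
    },
    index=pd.date_range("2018-01-04", periods=2),
)
train["item_id"] = "SPY"

# Target series: the value to forecast, plus the item identifier.
target = train[["Close", "item_id"]]

# Related series: features assumed to help predict the target.
related = train[["Open", "High", "Low", "Volume", "item_id"]]
print(target.columns.tolist(), related.columns.tolist())
```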

108
00:08:50,900 --> 00:08:58,160
Next, we're going to write both of these new data frames to CSV files using the to_csv function. We

109
00:08:58,160 --> 00:09:03,650
set header equals None, since AWS Forecast does not require headers in your CSVs.
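
A minimal sketch of writing a headerless CSV. It writes to an in-memory buffer instead of a file so the result can be inspected directly, and it uses `header=False`, the documented boolean form, where the lecture describes `header=None`. The data is made up for illustration.

```python
import io

import pandas as pd

# Toy target series: date index, close price, item_id.
target = pd.DataFrame(
    {"Close": [1.2, 2.2], "item_id": ["SPY", "SPY"]},
    index=pd.date_range("2018-01-04", periods=2),
)

# Write without a header row; the date index is still written as the first field.
buffer = io.StringIO()
target.to_csv(buffer, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each row comes out as date, price, ticker, which matches the layout checked with the head command in the lecture.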

110
00:09:07,570 --> 00:09:13,090
And again, it's always nice to check the output of your code to make sure it's what we expect, so

111
00:09:13,090 --> 00:09:17,760
we call the head command and look at the first five lines of our target series CSV.

112
00:09:18,430 --> 00:09:23,910
So each row here is the date, followed by the price, followed by the ticker, which is the item_id.

113
00:09:24,370 --> 00:09:25,660
So this looks correct.

114
00:09:29,360 --> 00:09:35,750
Next, we do the same thing for the related time series. Here we see the date, followed by a few prices,

115
00:09:35,870 --> 00:09:40,200
followed by a really big number, which represents the volume followed by the ticker.

116
00:09:40,550 --> 00:09:41,780
So that looks correct.

117
00:09:43,130 --> 00:09:45,350
OK, so that's everything for this lecture.

118
00:09:45,740 --> 00:09:51,140
Now that our data is stored in the format we want, the next step will be to upload this data to S3

119
00:09:51,380 --> 00:09:53,930
to be ingested by AWS Forecast.