1
00:00:11,690 --> 00:00:15,070
In this lecture, we are going to do a few more tests in Colab.

2
00:00:15,110 --> 00:00:20,240
Specifically, we're going to look at some ways to upload your own dataset to Colab.

3
00:00:21,140 --> 00:00:27,170
Let's say, for example, your client or employer gives you a CSV file, or you downloaded a CSV from

4
00:00:27,170 --> 00:00:28,310
Kaggle.

5
00:00:28,310 --> 00:00:32,990
How can we then make this file accessible from our Colab notebook?

6
00:00:32,990 --> 00:00:36,650
In this lecture, we are going to discuss a few different ways of doing this.

7
00:00:40,010 --> 00:00:46,400
The first method we're going to look at is just to use the classic Linux command wget. As mentioned

8
00:00:46,400 --> 00:00:52,940
previously, you can run command-line commands by preceding the command with the bang symbol, or exclamation

9
00:00:52,940 --> 00:00:53,390
mark.

10
00:00:53,900 --> 00:01:06,850
So let's go ahead and download the arrhythmia dataset.

11
00:01:07,260 --> 00:01:13,500
Now we want to check where the data went, so let's use !ls to see if the data is in our current

12
00:01:13,500 --> 00:01:14,130
directory.

13
00:01:18,150 --> 00:01:18,570
OK.

14
00:01:18,610 --> 00:01:26,100
It looks like it is. Now let's use the head command to see the first few lines of the data file and also

15
00:01:26,100 --> 00:01:34,140
to check whether or not the file has a header row.
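
As a rough sketch, the same check can be done from Python instead of the shell. The helper name `head`, the file name, and the rows below are made up for illustration; in Colab you would pass the name of the file you actually downloaded.

```python
import itertools
import os
import tempfile


def head(path, n=5):
    """Return the first n lines of a text file, similar to the shell `head` command."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Stand-in data file with made-up rows (not the real dataset).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.data")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("75,0,190,80\n56,1,165,64\n54,0,172,95\n")

first_lines = head(demo_path, n=2)
print(first_lines)  # if the first line is already numeric data, there is no header row
```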

16
00:01:34,220 --> 00:01:36,470
So it looks like it does not have a header row.

17
00:01:39,780 --> 00:01:43,320
Next, let's try to load in the data using pandas.

18
00:01:43,320 --> 00:01:47,970
We're going to pass in header=None, since we know that the data does not have a header.

19
00:01:53,330 --> 00:01:56,090
Next, since the data has many columns,

20
00:01:56,090 --> 00:01:58,510
we're just going to take the first few.

21
00:01:58,550 --> 00:02:04,970
We're also going to rename the columns because they're currently just integer values. As usual, since

22
00:02:04,970 --> 00:02:08,620
this data is from the UCI Machine Learning Repository,

23
00:02:08,690 --> 00:02:13,910
you can just check the documentation if you want to know more about the data, like what each column is.

24
00:02:15,910 --> 00:02:16,990
So let's run this.
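
The loading steps just described might look roughly like this. It is only a sketch: the rows are made up and read from an in-memory string so it runs anywhere, and the column names are my own guesses for illustration, so check the dataset documentation for the real ones.

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded file (no header row).
raw = io.StringIO(
    "75,0,190,80,91,193,371\n"
    "56,1,165,64,81,174,401\n"
    "54,0,172,95,138,163,386\n"
)

# header=None tells pandas the first line is data, so columns are named 0, 1, 2, ...
df = pd.read_csv(raw, header=None)

# Keep only the first few columns and give them readable (hypothetical) names.
data = df[[0, 1, 2, 3, 4, 5]].copy()
data.columns = ["age", "sex", "height", "weight", "qrs_duration", "pr_interval"]

print(data.head())
```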

25
00:02:19,820 --> 00:02:26,120
Next, let's create a histogram of these data columns. Since the notebook by default makes the plot pretty

26
00:02:26,120 --> 00:02:26,630
small,

27
00:02:26,630 --> 00:02:31,910
we're going to import matplotlib and change the figure size. Once we've done that,

28
00:02:31,970 --> 00:02:36,890
we can call df.hist() to create a histogram for each column.

29
00:02:37,190 --> 00:02:43,310
Note that I've added a semicolon to the end of df.hist(), because if you don't, then the notebook will

30
00:02:43,310 --> 00:02:48,020
print out the last returned value like it usually does, which we don't want right now.
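
A minimal sketch of the histogram step, assuming the figure size is changed through `plt.rcParams` (one common way to do it; the exact size and the made-up data are illustrative only):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Make figures larger than the notebook default (the size here is an arbitrary choice).
plt.rcParams["figure.figsize"] = [10, 10]

# Made-up values standing in for the selected columns.
data = pd.DataFrame(
    {
        "age": [75, 56, 54, 55, 13],
        "height": [190, 165, 172, 175, 169],
        "weight": [80, 64, 95, 94, 51],
    }
)

# One histogram per column; the trailing semicolon keeps a notebook cell
# from echoing the returned array of axes.
data.hist();
```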

31
00:02:53,690 --> 00:02:56,480
So here are some nice histograms for you to look at.

32
00:03:00,440 --> 00:03:03,500
Next, let's create a common plot for data analysis.

33
00:03:03,500 --> 00:03:05,420
The scatter matrix.

34
00:03:05,420 --> 00:03:11,360
This does a scatter plot between each feature and every other feature. Along the diagonal, it just plots

35
00:03:11,360 --> 00:03:14,450
a histogram of each feature, which we've already seen.
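
For reference, pandas ships a scatter matrix helper in `pandas.plotting`. The sketch below uses made-up values; with three columns it produces a 3 x 3 grid of plots, with histograms on the diagonal by default.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up values standing in for the selected columns.
data = pd.DataFrame(
    {
        "age": [75, 56, 54, 55, 13],
        "height": [190, 165, 172, 175, 169],
        "weight": [80, 64, 95, 94, 51],
    }
)

# Returns a 2-D array of axes: one row and one column per feature.
axes = scatter_matrix(data)
print(axes.shape)
```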

36
00:03:21,880 --> 00:03:22,220
All right.

37
00:03:22,250 --> 00:03:23,990
So, pretty standard so far.

38
00:03:29,110 --> 00:03:35,590
Next, let's look at the second method of loading in data, which also applies when you have a URL.

39
00:03:35,590 --> 00:03:41,940
This is to use TensorFlow directly, specifically the Keras get_file function.

40
00:03:41,950 --> 00:03:46,510
Now you might be thinking: why would we use TensorFlow if this course is about PyTorch?

41
00:03:46,780 --> 00:03:49,500
That's a great question and I would understand if you were hesitant.

42
00:03:50,140 --> 00:03:54,130
However, it's important to realize that these are all just Python libraries.

43
00:03:54,130 --> 00:03:56,710
You're not restricted to using only one or the other.

44
00:03:56,770 --> 00:03:58,240
That would be silly.

45
00:03:58,240 --> 00:04:02,520
As an example, there are several libraries for making HTTP requests.

46
00:04:02,530 --> 00:04:03,820
One is called requests.

47
00:04:04,000 --> 00:04:05,740
Another is called urllib.

48
00:04:06,040 --> 00:04:11,930
I don't think anyone would hesitate to use both in their Python code. If you need more convincing,

49
00:04:11,950 --> 00:04:13,170
it's actually common,

50
00:04:13,270 --> 00:04:18,550
even if you're working in PyTorch, to make use of the text preprocessing functionality that comes with

51
00:04:18,550 --> 00:04:19,650
Keras.

52
00:04:19,750 --> 00:04:25,290
So in fact, data scientists and machine learning engineers have already been doing this for years.

53
00:04:25,360 --> 00:04:30,430
Don't think that just because I'm using PyTorch, I'm not allowed to use the utilities from any other deep

54
00:04:30,430 --> 00:04:37,000
learning library. Let's start by assigning the URL to a variable called url.

55
00:04:37,640 --> 00:04:42,830
We're going to be using the Auto MPG dataset, although it doesn't really matter what you use for this

56
00:04:42,830 --> 00:04:43,430
example,

57
00:04:43,430 --> 00:04:46,950
as long as you can access it directly via URL.

58
00:04:49,620 --> 00:04:57,890
Let's run this. Next, we're going to make sure we have TensorFlow 2.0 installed, so we're going to run

59
00:04:57,920 --> 00:05:02,890
pip install tensorflow and then print out the version to make sure that we have the correct one.

60
00:05:07,320 --> 00:05:12,870
Note that when I first created this lecture, TensorFlow 2 was not yet released. Today, you could just

61
00:05:12,870 --> 00:05:20,010
do import tensorflow as tf, and it'll automatically load the latest release, which should be 2.0 or higher.

62
00:05:20,010 --> 00:05:25,920
In addition, remember that this uses the Keras API, so an equivalent alternative to this would be to just

63
00:05:25,920 --> 00:05:31,850
use Keras itself. Whenever you see tf.keras, you can replace that with just keras.

64
00:05:32,430 --> 00:05:39,880
So tf.keras.utils.get_file becomes keras.utils.get_file, presuming you imported keras earlier. Although

65
00:05:39,900 --> 00:05:42,840
remember that a Keras installation still requires a backend.

66
00:05:42,870 --> 00:05:50,110
So ultimately, you would probably already have TensorFlow installed. Next, we're going to call the Keras

67
00:05:50,180 --> 00:05:52,760
get_file function.

68
00:05:52,780 --> 00:05:54,670
The first argument is the file path

69
00:05:54,670 --> 00:05:56,090
we want to save to,

70
00:05:56,210 --> 00:06:04,670
and the second argument is the file source. So let's run this. Note that it's possible to save the file

71
00:06:04,670 --> 00:06:11,030
to a different directory, but we'll be saving it to Keras's default folder. So you can see from the printout

72
00:06:11,030 --> 00:06:19,720
that the file ends up in /root/.keras/datasets. Next, let's call the head command so

73
00:06:19,720 --> 00:06:22,180
that we can see the first few lines of the file.
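
The download step described above can be sketched as a small wrapper. The wrapper name `fetch_with_keras` is made up, TensorFlow is imported inside the function so the definition itself does not require it, and the URL and file name are left as parameters because the lecture does not spell them out here. By default, `get_file` caches the download under the Keras datasets folder and returns the local path.

```python
def fetch_with_keras(url, fname):
    """Download `url` into Keras's cache as `fname` and return the local file path.

    Hypothetical helper: requires TensorFlow and network access when called.
    """
    import tensorflow as tf  # deferred so this sketch can be defined without TensorFlow

    # First argument: the file name to save as; second argument: where to download from.
    return tf.keras.utils.get_file(fname, url)


# Example call (not executed here; substitute your own URL):
# local_path = fetch_with_keras(url, "my-data.data")
```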

74
00:06:28,220 --> 00:06:31,300
As you can see, it's not exactly a CSV.

75
00:06:31,430 --> 00:06:38,770
Instead, each column is separated by whitespace and there is no header row. So in order to load this data,

76
00:06:38,800 --> 00:06:46,750
we can still use the pandas read_csv function, but we have to pass in two extra arguments. The first

77
00:06:46,750 --> 00:06:50,340
argument is to say that there's no header row, so header=None.

78
00:06:50,470 --> 00:06:55,300
And the second extra argument is to tell pandas that the delimiter is whitespace.

79
00:06:55,300 --> 00:06:58,420
So we set delim_whitespace equal to True.
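
Here is a small sketch of reading whitespace-separated data with no header row. The rows are made up. The lecture passes `delim_whitespace=True`; this sketch uses `sep=r"\s+"` instead, which splits on runs of whitespace in the same way and, as far as I know, is the spelling recent pandas releases recommend since they deprecate `delim_whitespace`.

```python
import io

import pandas as pd

# Made-up rows: columns separated by varying amounts of whitespace, no header row.
raw = io.StringIO(
    "18.0   8   307.0   3504.\n"
    "15.0   8   350.0   3693.\n"
)

# header=None: the first line is data; sep=r"\s+": split on any run of whitespace.
df = pd.read_csv(raw, header=None, sep=r"\s+")

print(df.head())
```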

80
00:07:01,360 --> 00:07:10,140
Next, we call df.head() to make sure everything works as expected.

81
00:07:10,190 --> 00:07:15,800
So as you can see, the result appears to be in the right format, and from here you can process this data

82
00:07:15,830 --> 00:07:21,580
using Python code as you normally would.

83
00:07:21,610 --> 00:07:27,340
The third method we're going to look at in order to add your own files to Colab is to upload the file

84
00:07:27,340 --> 00:07:29,650
directly. In order to do this,

85
00:07:29,650 --> 00:07:32,110
we have to run a special Colab function.

86
00:07:36,130 --> 00:07:47,420
So we say from google.colab import files, then we call files.upload(). So let's run this. So you

87
00:07:47,420 --> 00:07:52,630
see that this creates an upload button, which we can click and then choose a file from the local file

88
00:07:52,630 --> 00:07:53,170
system.

89
00:07:57,570 --> 00:08:00,410
So I'm going to choose daily minimum temperatures.

90
00:08:03,710 --> 00:08:10,070
And if we print out the returned value, you can see that it's a dictionary where the file name is the

91
00:08:10,070 --> 00:08:12,860
key and the value is the file contents.
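
To make the shape of that return value concrete, here is a sketch that works on a stand-in dictionary rather than a real upload, so it runs outside Colab too. The helper name `describe_upload`, the file name, and the contents are all made up; I am assuming the values are raw bytes, which is how Colab returns uploaded contents as far as I know.

```python
def describe_upload(uploaded):
    """Map each uploaded file name to the size of its contents in bytes."""
    return {name: len(content) for name, content in uploaded.items()}


# Stand-in for the dictionary returned by files.upload(): {file name: file contents as bytes}.
fake_upload = {"temperatures.csv": b"Date,Temp\n1981-01-01,20.7\n"}

sizes = describe_upload(fake_upload)
print(sizes)

# The contents are bytes, so decode them before treating them as text.
first_line = fake_upload["temperatures.csv"].decode("utf-8").splitlines()[0]
print(first_line)
```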

92
00:08:16,490 --> 00:08:22,760
If we use the command !ls, we can see that the file has been uploaded to the working directory.

93
00:08:26,310 --> 00:08:33,140
Next, let's read in the file using pandas to make sure we get what we expect. Now, this file has some garbage

94
00:08:33,140 --> 00:08:33,960
lines near the end,

95
00:08:33,980 --> 00:08:39,110
so I've accounted for that by setting the argument error_bad_lines equal to False.

96
00:08:39,110 --> 00:08:42,080
This ignores errors but prints them out as they are encountered.
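
A sketch of skipping malformed lines while still reporting them. One caveat: `error_bad_lines` was removed in pandas 2.0, so this sketch uses the newer `on_bad_lines="warn"` argument (available from pandas 1.3), which I believe gives the same skip-and-report behavior. The rows below are made up, including the garbage line with too many fields.

```python
import io

import pandas as pd

# Made-up file contents with one malformed line (too many fields).
raw = io.StringIO(
    "Date,Temp\n"
    "1981-01-01,20.7\n"
    "1981-01-02,17.9\n"
    "this,line,has,too,many,fields\n"
    "1981-01-03,18.8\n"
)

# Malformed lines are skipped and reported as warnings instead of raising an error.
df = pd.read_csv(raw, on_bad_lines="warn")

print(df)
```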

97
00:08:48,530 --> 00:08:51,320
As you can see, the file is loaded in successfully.

98
00:08:57,540 --> 00:09:02,280
To follow up this example, we're going to look at a variation on what we just did.

99
00:09:02,550 --> 00:09:08,010
You may recall that when you're writing code in Python, sometimes it's useful to split your code amongst

100
00:09:08,070 --> 00:09:09,260
several files.

101
00:09:10,530 --> 00:09:15,690
This helps to organize your code and keep similar things all in one place while keeping different things

102
00:09:15,690 --> 00:09:19,120
separate. As a simple example,

103
00:09:19,120 --> 00:09:24,490
sometimes we'll learn about multiple algorithms in one course, but we'll test all those algorithms on

104
00:09:24,490 --> 00:09:26,150
the same dataset.

105
00:09:26,170 --> 00:09:31,670
So there's no point in rewriting the code to load in the dataset multiple different times.

106
00:09:31,690 --> 00:09:37,540
Instead, we can write the data loading code once and then import it from each file.

107
00:09:37,540 --> 00:09:42,730
Now you might wonder: since we're working in Colab, how can you import a function from a Python

108
00:09:42,730 --> 00:09:43,460
script

109
00:09:43,540 --> 00:09:49,420
if that Python script is on your local hard drive? Luckily, we can take the same approach we already have

110
00:09:49,420 --> 00:09:53,110
been using, which is to upload that file to Google Colab.

111
00:09:53,110 --> 00:09:58,900
So here I'm going to call files.upload() again, and this time I'm uploading the Python script

112
00:09:58,900 --> 00:10:00,280
fake_util.py.

113
00:10:13,900 --> 00:10:20,590
So fake_util.py contains only one function, called my_useful_function, and all it does is print

114
00:10:20,590 --> 00:10:22,570
out "hello world".

115
00:10:22,570 --> 00:10:28,240
So once you've uploaded the file, you can see that we can import it just like we would if we were working

116
00:10:28,240 --> 00:10:28,970
locally.

117
00:10:29,380 --> 00:10:31,520
So I can say from fake_util

118
00:10:31,750 --> 00:10:34,090
import my_useful_function.

119
00:10:34,090 --> 00:10:40,170
Then when I call my_useful_function(), you can see that "hello world" is printed out, just like we expect.
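
The import step above can be reproduced outside Colab with a short sketch. Instead of uploading, it writes a stand-in fake_util.py to a temporary folder and adds that folder to `sys.path`. In Colab the uploaded file lands in the working directory, so the `sys.path` line should not be needed there.

```python
import importlib
import os
import sys
import tempfile

# Write a stand-in for the uploaded script into a temporary folder.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "fake_util.py"), "w", encoding="utf-8") as f:
    f.write('def my_useful_function():\n    print("hello world")\n')

# Make the folder importable (only needed because the file is not in the working directory).
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

from fake_util import my_useful_function

my_useful_function()  # prints: hello world
```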

120
00:10:46,080 --> 00:10:52,170
And by the way, you might be wondering, as I did, what the path of the current directory actually is. To

121
00:10:52,170 --> 00:10:52,860
determine this,

122
00:10:52,860 --> 00:10:57,450
you can just run the usual Linux command pwd, and that prints out

123
00:11:01,300 --> 00:11:05,920
/content. So /content is our current working directory.

124
00:11:11,170 --> 00:11:15,100
The last thing I want to cover is something you're probably all wondering.

125
00:11:15,340 --> 00:11:17,480
Google Drive is for storing files.

126
00:11:17,500 --> 00:11:22,310
So is it possible to access files on your Google Drive?

127
00:11:22,420 --> 00:11:24,580
And of course the answer is yes.

128
00:11:24,640 --> 00:11:33,400
So in order to do this, we have to import drive from google.colab. Then we have to mount the drive by calling

129
00:11:33,400 --> 00:11:47,050
drive.mount() and specifying the path /content/gdrive.
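
A sketch of the mounting step, wrapped in a helper so that it degrades gracefully when run outside Colab. The helper name `mount_drive` is made up for illustration; inside Colab, the call to `drive.mount` is what triggers the authorization prompt.

```python
def mount_drive(mount_point="/content/gdrive"):
    """Mount Google Drive at `mount_point` when running in Colab.

    Returns the mount point on success, or None when google.colab is unavailable.
    """
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        print("Not running in Colab; nothing was mounted.")
        return None

    drive.mount(mount_point)  # prompts for authorization in the notebook
    return mount_point


mounted_at = mount_drive()
```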

130
00:11:47,060 --> 00:11:49,770
So this is going to give you an authorization code.

131
00:11:50,000 --> 00:11:52,610
So you go to the URL in your browser.

132
00:11:56,560 --> 00:12:07,420
It asks you to sign in and accept the terms, and then it gives you a code.

133
00:12:07,470 --> 00:12:12,940
You copy this code and you put it back into this box.

134
00:12:12,940 --> 00:12:13,900
You hit Enter.

135
00:12:21,610 --> 00:12:23,560
Okay, so that works.

136
00:12:23,570 --> 00:12:28,730
So after we've done this, we can call ls again to check what's now in the current directory.

137
00:12:33,790 --> 00:12:36,150
We can see that there is now an extra thing here:

138
00:12:36,190 --> 00:12:37,410
gdrive.

139
00:12:37,600 --> 00:12:43,670
So let's ls gdrive and see what that gives us.

140
00:12:43,950 --> 00:12:44,190
All right.

141
00:12:44,220 --> 00:12:47,820
So it looks like we now have a thing called Google Drive.

142
00:12:47,820 --> 00:12:50,030
Once again, I'll ls this.

143
00:12:50,340 --> 00:12:54,060
And remember that you have to add quotes if your path contains whitespace.
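
If you build shell commands from Python, the standard library can add those quotes for you. In this sketch the folder name is hypothetical; `shlex.quote` wraps a path in quotes whenever it contains characters such as spaces that the shell would otherwise split on.

```python
import shlex

# Hypothetical path containing a space.
path = "gdrive/My Drive"

# Quote the path so the shell treats it as a single argument.
command = "ls " + shlex.quote(path)
print(command)  # ls 'gdrive/My Drive'

# Splitting the command the way a shell would gives the path back in one piece.
print(shlex.split(command))
```

Plain Python functions such as `os.listdir` take the path as an ordinary string, so no extra quoting is needed there.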

144
00:12:59,050 --> 00:13:04,570
And now we can see a bunch of files that are in my Google Drive, which is essentially a bunch of VIP

145
00:13:04,570 --> 00:13:07,810
content for the VIP versions of my courses.
