WEBVTT

0
00:00.290 --> 00:01.270
All right.

1
00:01.280 --> 00:06.140
I want us to now start talking a bit more about the actual encoding itself.

2
00:07.120 --> 00:10.030
Firstly, what is and what is not a legal 

3
00:10.060 --> 00:16.970
URL is formally defined in RFC 3986.

4
00:16.990 --> 00:20.150
But as we're going to see a bit later, there are others.

5
00:20.170 --> 00:23.140
And why is this spec so important?

6
00:23.180 --> 00:30.580
Well, it's important because the W3C URI Specification has accepted this as being the formal spec that

7
00:30.580 --> 00:31.960
should govern your URLs.

8
00:31.990 --> 00:34.630
What are the main conclusions we can draw from the spec?

9
00:34.780 --> 00:35.730
It's not that difficult.

10
00:35.740 --> 00:38.320
There are just two broad categories of characters.

11
00:38.380 --> 00:42.220
Reserved characters and unreserved characters.

12
00:42.400 --> 00:45.320
Reserved characters have a special meaning.

13
00:45.340 --> 00:47.170
They are reserved, right?

14
00:47.680 --> 00:53.290
Like a forward slash or a question mark that represents the start of a query string.

15
00:53.410 --> 00:58.660
We've already seen some examples of these. So, those are the reserved characters and those kind of have

16
00:58.660 --> 01:01.810
to be treated in a slightly different way, as you would expect.

17
01:01.810 --> 01:07.580
And then on the flip side, we've got these unreserved characters and these have no special meaning.

18
01:07.580 --> 01:12.950
And because they have no special meaning, they are allowed to be in the URI itself.
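
NOTE
Editor's sketch, not part of the lecture: one way to express the two categories in Python, assuming the character lists given in RFC 3986 (sections 2.2 and 2.3). The names UNRESERVED, RESERVED and classify are hypothetical helpers chosen for illustration.
```python
import string
# Unreserved set per RFC 3986 section 2.3: letters, digits, and - . _ ~
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Reserved set per RFC 3986 section 2.2: gen-delims plus sub-delims
RESERVED = set(":/?#[]@" + "!$&'()*+,;=")
def classify(ch: str) -> str:
    """Return which category a single character falls into."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    return "other"  # e.g. a space or '%', which are in neither list
for ch in "a/?~ ":
    print(repr(ch), classify(ch))
```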

19
01:12.980 --> 01:15.470
Let me just get rid of all this noise.

20
01:16.520 --> 01:22.160
Like I mentioned, we've got reserved characters and unreserved characters, and reserved characters

21
01:22.160 --> 01:23.990
are very special.

22
01:24.020 --> 01:32.030
The specification has chosen these characters to mean something very specific when it comes to URLs.

23
01:32.240 --> 01:40.160
And because of this, if you want to use a reserved character as ordinary data in your URL, then it has to be encoded.

24
01:40.700 --> 01:44.150
For example, we know that question mark is the start of a query string.

25
01:44.150 --> 01:50.660
So if you want to use a question mark in your URL in some area, how does the browser know whether that's

26
01:50.660 --> 01:53.450
now a query string or whether it's just your character?

27
01:53.450 --> 01:55.130
And that's why it has to be encoded.

28
01:55.130 --> 01:56.390
It makes sense.
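
NOTE
Editor's sketch, not part of the lecture: a small Python illustration of the question mark ambiguity, using the standard urllib.parse module. The host example.com and the file name are made-up values for demonstration only.
```python
from urllib.parse import quote, urlsplit
name = "what?.txt"  # hypothetical file name containing a literal question mark
raw = "https://example.com/files/" + name
safe = "https://example.com/files/" + quote(name)
# Unencoded: the parser treats '?' as the start of the query string
print(urlsplit(raw).path, "|", urlsplit(raw).query)
# Encoded: '?' became %3F, so the whole name stays in the path
print(urlsplit(safe).path, "|", urlsplit(safe).query)
```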

29
01:57.360 --> 02:03.300
Unreserved characters traditionally, as mentioned, are a limited subset of the ASCII character set.

30
02:03.300 --> 02:10.860
But I've read the spec and in my opinion the spec does not explicitly state that this list is entirely

31
02:10.860 --> 02:11.770
exhaustive.

32
02:11.790 --> 02:17.460
I'm saying that with a bit of tongue in cheek because behind the scenes they may still be URL encoding,

33
02:17.460 --> 02:20.330
but visually I'm talking about what the user sees,

34
02:20.340 --> 02:21.640
that is not the case.

35
02:21.660 --> 02:24.110
But don't get lost in all the detail.

36
02:24.120 --> 02:30.810
Encoding is particularly important for encoding characters that are not permitted to be in a URL.

37
02:31.140 --> 02:34.230
That's all that URL encoding is trying to do.

38
02:34.260 --> 02:40.530
It's trying to just take nonconforming characters and transform them in a way that's safe to transmit

39
02:40.530 --> 02:41.210
over the web.

40
02:41.220 --> 02:47.490
So all of these special characters - spaces, percentage signs, tabs, colons, the equal sign and a

41
02:47.490 --> 02:54.840
whole bunch of others - these need to be treated very specifically in the URL, and if we use them in

42
02:54.840 --> 03:00.580
a URL, they have to be encoded to distinguish them from the reserved set itself.

43
03:00.940 --> 03:02.740
Is that kind of making sense?

44
03:03.760 --> 03:04.360
I hope so.

45
03:04.360 --> 03:07.450
But I'm sure you still have other questions.

46
03:07.450 --> 03:12.610
I mean, you might be asking, you know, why does the URL not permit certain characters in the first place?

47
03:12.610 --> 03:14.650
Why can't we just have whatever we want?

48
03:14.950 --> 03:19.780
Well, your browser's just trying to make sure that all the characters you want to send with a GET request

49
03:19.810 --> 03:26.650
can arrive at the other end at the destination, and the browser has to encode some characters like

50
03:26.650 --> 03:29.230
unprintable ones, spaces for example.

51
03:29.230 --> 03:35.650
And as I just mentioned, it has to encode characters with special meaning because if it doesn't, how's

52
03:35.650 --> 03:42.070
it going to know whether that query string is a query string or whether it's just a question mark from

53
03:42.070 --> 03:42.610
your side?

54
03:42.610 --> 03:47.920
So it just logically makes sense that the URL has to not permit all characters.

55
03:47.920 --> 03:54.550
In other words, it makes logical sense that URL encoding takes place in some situations.

56
03:58.030 --> 04:03.850
But let me clarify here, my dear students, encryption is not the same as encoding your URL.

57
04:03.860 --> 04:07.250
Encoding is all about sending your data over a network.

58
04:07.280 --> 04:09.440
It's about transport.

59
04:09.680 --> 04:13.040
It doesn't make your data safe in any way.

60
04:13.190 --> 04:14.790
I just wanted to clarify that.
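
NOTE
Editor's sketch, not part of the lecture: a short Python check of the point that encoding is not encryption. The sample string is invented; the idea is only that anyone can reverse percent-encoding without a key.
```python
from urllib.parse import quote, unquote
secret = "pin=1234&user=alice"  # hypothetical sensitive data
encoded = quote(secret, safe="")
print(encoded)
# Anyone can reverse the encoding; there is no key involved
decoded = unquote(encoded)
print(decoded == secret)
```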

61
04:14.810 --> 04:15.280
Cool.

62
04:15.290 --> 04:16.040
Have you got it?

63
04:16.070 --> 04:16.680
Good.

64
04:16.700 --> 04:21.440
But now how does URL encoding actually take place?

65
04:21.470 --> 04:23.180
What does it do?

66
04:23.810 --> 04:32.420
URL encoding replaces nonconforming characters with a percentage symbol (%), followed by two hexadecimal

67
04:32.420 --> 04:33.050
digits.
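
NOTE
Editor's sketch, not part of the lecture: a hand-rolled version of the rule just described, written in Python. It assumes the text is first turned into UTF-8 bytes and that only the RFC 3986 unreserved characters are left alone; percent_encode is a hypothetical helper, compared against the standard library's quote as a sanity check.
```python
from urllib.parse import quote
import string
UNRESERVED = string.ascii_letters + string.digits + "-._~"
def percent_encode(text: str) -> str:
    """Encode every character outside the unreserved set as %XX."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)  # conforming character, kept as-is
        else:
            out.append("%" + format(byte, "02X"))  # '%' plus two hex digits
    return "".join(out)
sample = "a b?c=d"
print(percent_encode(sample))
print(percent_encode(sample) == quote(sample, safe=""))
```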

68
04:33.050 --> 04:35.470
But what is this nonconforming character?

69
04:35.480 --> 04:37.810
What do I mean by nonconforming?

70
04:37.820 --> 04:41.570
Well, it's those reserved characters we were just talking about, isn't it?

71
04:41.600 --> 04:47.960
If you want to use a question mark in your URL, it has to be URL encoded because it will be nonconforming.

72
04:47.960 --> 04:50.520
And what is that percentage symbol?

73
04:50.540 --> 04:51.740
What's that all about?

74
04:52.340 --> 04:56.630
Well, from the previous lecture you'll know that it's just the hex prefix.

75
04:56.630 --> 05:01.130
And after that percentage symbol, we have two hexadecimal digits.

76
05:01.160 --> 05:02.570
Hexadecimal 🤷‍♀️?

77
05:03.230 --> 05:07.850
Well, the good news is we spoke about that in the previous lecture, so it should all start kind of

78
05:07.850 --> 05:09.200
gelling together right now.

79
05:09.200 --> 05:12.140
You should start understanding the structure, what these things mean.

80
05:12.140 --> 05:17.000
And before we move on, I just want to discuss one of the most frequent URL encoded characters you'll

81
05:17.000 --> 05:20.300
come across, and that is the character space.

82
05:20.600 --> 05:25.640
The space character is quite special because it's unprintable and therefore it makes sense that it has

83
05:25.640 --> 05:27.320
to be URL encoded.

84
05:27.680 --> 05:31.190
Well, the value of a space in decimal form is 32.

85
05:31.190 --> 05:33.590
But we don't care about decimal forms, do we?

86
05:33.590 --> 05:41.870
Because URL encoding works in hex, and in hex the value 20 is assigned to the character space.

87
05:41.900 --> 05:45.710
Another one I want to look at actually is the plus sign as well.

88
05:45.710 --> 05:51.890
That's a common one you're going to see and that is represented by 2B in hex.

89
05:52.040 --> 05:57.020
So often in URLs you're going to be seeing %20 and %2B.
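
NOTE
Editor's sketch, not part of the lecture: a quick Python check of the two values mentioned, showing the decimal code, the hex form, and the percent-encoded result for a space and a plus sign. The sample string passed to unquote is made up.
```python
from urllib.parse import quote, unquote
for ch in (" ", "+"):
    code = ord(ch)  # decimal code point: 32 for space, 43 for plus
    print(repr(ch), code, format(code, "02X"), quote(ch, safe=""))
# Decoding turns the escapes back into the original characters
print(unquote("a%20b%2Bc"))
```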

90
05:57.470 --> 05:57.740
Woo!

91
05:57.830 --> 06:03.890
We are cruising through this. And it may seem quite daunting, but don't feel overwhelmed.

92
06:03.890 --> 06:10.190
In fact, the sets of reserved and unreserved characters are constantly changing with each revision

93
06:10.190 --> 06:12.980
of the specs that govern your URLs.

94
06:12.980 --> 06:20.360
So it can be very confusing. But don't worry, once you understand, once you grasp the fundamentals

95
06:20.360 --> 06:25.370
of URL encoding, then it really doesn't matter what's reserved or unreserved, who cares?

96
06:25.370 --> 06:29.630
Because we know exactly what's happening and we can always adapt as developers.

97
06:29.660 --> 06:30.170
All right.

98
06:30.170 --> 06:31.510
So I hope it's starting to gel.

99
06:31.520 --> 06:32.840
I hope it's starting to make sense.

100
06:32.840 --> 06:38.000
But in the next lecture, I really want to jump into these international characters.

101
06:38.000 --> 06:42.370
Remember that example where we used the Japanese characters and we could see it in the URL?

102
06:42.380 --> 06:49.520
Well, how is that possible when the RFC 3986 defines URLs as only containing a limited subset of ASCII

103
06:49.520 --> 06:50.300
characters?

104
06:50.300 --> 06:51.620
It's so weird.

105
06:51.650 --> 06:55.340
Well, the good news is we're going to jump into it right now.