WEBVTT

00:00.060 --> 00:00.300
Okay.

00:00.330 --> 00:02.330
Now we'll use that notification channel.

00:02.340 --> 00:05.700
We'll simulate some errors and get an alert for that.

00:05.710 --> 00:09.390
I'll use the devices that I set up in the InfluxDB section.

00:09.390 --> 00:15.810
So when one of these devices stops working, or at least its SNMP daemon stops, we'll get an alert.

00:15.990 --> 00:18.390
So when creating alerts, there are many things to consider.

00:18.420 --> 00:21.120
Now Grafana here will be polling InfluxDB

00:21.120 --> 00:27.450
for information, and these Telegrafs are pushing information about the SNMP daemons to InfluxDB.

00:27.450 --> 00:32.910
So what's actually breaking here in this scenario is one of these daemons.

00:32.910 --> 00:35.730
So there are two things in between: InfluxDB and Telegraf.

00:35.790 --> 00:39.060
So the InfluxDB database isn't going to be broken, and neither is Telegraf.

00:39.060 --> 00:40.960
So how would you test for that particular scenario?

00:40.960 --> 00:43.470
In Grafana, it's going to be different for every situation.

00:43.470 --> 00:45.090
But anyway, there are many ways to solve a problem.

00:45.120 --> 00:46.590
I'll show you one way of doing it.

00:46.740 --> 00:53.400
Okay, so I'll now set up an alert, starting with my MySQL server's SNMP daemon, so that we know if we have lost

00:53.400 --> 00:56.100
connection to that, that could be because the process stopped.

00:56.130 --> 00:56.270
Okay.

00:56.340 --> 01:01.020
So in the Alerting, Alert rules page, we have an option to create a new alert rule.

01:01.050 --> 01:09.030
I'm going to call this SNMPD Down MySQL. Rule type: Grafana managed alert, because we're

01:09.030 --> 01:15.930
managing this alert in Grafana. Folder: this version of Grafana, 8.3, says you must create a folder.

01:16.020 --> 01:20.340
I haven't created a folder in any of the previous sections so far, but before I can save it, this is going to

01:20.340 --> 01:22.140
want a folder, so I'm just going to create a new folder.

01:22.350 --> 01:24.750
SNMPD. I'm going to create that. Nice.

01:24.750 --> 01:28.170
That folder is now created, and you'll find folders underneath Dashboards.

01:28.170 --> 01:32.820
So that's a way for us to organize our dashboards into folders, and that will do for the moment.

01:33.060 --> 01:34.320
Okay, so that is compulsory.

01:34.320 --> 01:36.300
It might not be in later versions of Grafana.

01:36.300 --> 01:36.870
It's hard to know.

01:37.470 --> 01:38.940
Okay, we need to create a query.

01:40.020 --> 01:40.530
InfluxDB.

01:40.530 --> 01:47.520
Now the alert manager will be running queries periodically, looking at the results, whether

01:47.520 --> 01:51.240
they exist within a range or whether they don't exist at all.

01:51.270 --> 01:53.850
i.e. no data. So we need to create a query.

01:53.880 --> 01:55.880
I'm going to go into InfluxDB and prepare my query.

01:56.310 --> 01:59.280
So in the Explore tab, I'm going to look at SNMP.

01:59.310 --> 02:00.680
There is SNMP.

02:00.690 --> 02:05.340
Uptime is a good property for knowing whether SNMP is working or not.

02:05.370 --> 02:10.500
I'm not going to select agent_host, I'm going to pick source, and I'm going to select just one of them,

02:10.500 --> 02:11.880
being the MySQL server first.

02:12.060 --> 02:12.420
Okay.

02:12.420 --> 02:13.620
So script editor.

02:13.650 --> 02:13.830
Okay.

02:13.830 --> 02:14.650
That's my query.

02:14.670 --> 02:15.100
Excellent.

02:15.120 --> 02:19.920
Put that query into InfluxDB, click out, and we have an updated graph.

02:20.040 --> 02:24.960
Now, for this particular case, I want to know when the SNMP daemon is not working.

02:24.970 --> 02:26.490
We're actually going to stop getting data.

02:26.490 --> 02:31.990
So I need to think about what is a good time range for me to decide that I'm not getting data.

02:32.010 --> 02:33.210
So for this InfluxDB query,

02:33.210 --> 02:40.380
I'm going to change that down to now minus one minute and just rerun that query.

02:40.590 --> 02:40.920
Okay.

02:40.920 --> 02:45.830
So now I'm looking at one minute's worth of data, so I'm going to decide that if I'm not getting any data

02:45.840 --> 02:49.770
for one minute from my SNMP daemon, then I'm going to consider that down.

02:49.800 --> 02:53.850
Now, one thing about reading data from SNMP daemons is that it uses the UDP protocol.

02:53.880 --> 03:00.720
UDP can be unreliable by design, so there could actually be a chance that your SNMP daemon is running.

03:00.720 --> 03:04.650
But there's a problem on the network preventing Telegraf from querying the data.

03:04.650 --> 03:10.190
So I'm just bringing that to your attention just so that you understand that doing this query in Grafana

03:10.260 --> 03:16.200
is not the most reliable way of testing whether the SNMP daemon is actually working, but it's a good example

03:16.200 --> 03:19.530
to show anyway because it's quite a complex task to try and achieve in Grafana.

03:19.650 --> 03:26.940
So anyway, I have a query, I have one minute's worth of data, and I'm checking whether I have results

03:26.940 --> 03:27.900
in that one minute.

03:28.440 --> 03:35.550
So the condition here under B, expression, Conditions: when the average, or last, that doesn't really matter

03:35.550 --> 03:38.550
in this particular case, of A has no value.

03:38.820 --> 03:44.760
So what I'm checking for is if there are no values in that query, from one minute ago to now. I could even change

03:44.760 --> 03:48.240
that to now minus 30 seconds or even 10 seconds, but one minute is

03:48.400 --> 03:48.690
Okay.

03:49.210 --> 03:49.470
Okay.

03:49.500 --> 03:54.030
So that query is done so we can just run that again and well, I can see it's still working.

03:54.030 --> 03:56.250
I still have results in that time period.

03:56.940 --> 03:58.680
Three: define the alert condition.

03:58.680 --> 04:00.960
So it's looking at the query B there.

04:01.410 --> 04:02.820
So it needs to be looking at the expression.

04:03.600 --> 04:06.330
We're going to evaluate this query every one minute.

04:06.360 --> 04:13.080
Now, in this particular example, this For property has no effect. In older versions of Grafana, before

04:13.080 --> 04:13.950
8.3,

04:13.980 --> 04:18.990
this would cause the alert to go into a pending state for this particular problem I'm trying to solve

04:18.990 --> 04:20.970
and then go into a firing state.

04:20.970 --> 04:25.170
And when it's in the firing state, we then get the alerts, for example, through our email channel.

04:25.170 --> 04:27.870
In this version of Grafana, 8.3.3,

04:27.900 --> 04:35.520
this particular property has no effect for this kind of query where I'm looking for no data. That may

04:35.530 --> 04:37.830
change in future versions of Grafana.

04:37.860 --> 04:44.010
The intended purpose of this field is that, once a condition is breached, for example, when in one minute

04:44.010 --> 04:45.660
it's decided that there's no data,

04:45.660 --> 04:47.880
the alert will go into a pending state.

04:47.880 --> 04:52.440
If it's pending for longer than that For value, being 5 minutes, it will become a firing alert.

04:52.470 --> 04:57.510
Now, the way I've set this up is that it's going to go into firing state as soon as it's decided

04:57.510 --> 04:59.640
there is no data in the last minute.

04:59.640 --> 05:01.500
And that's because I have one minute set up there.

05:02.010 --> 05:02.400
Okay.

05:02.970 --> 05:08.250
These options here for summary, description, and runbook URL, these are up to you.

05:08.250 --> 05:14.010
I'm not going to use any of those. They're extra things that you can include when it goes into firing state.

05:14.100 --> 05:14.460
Okay.

05:14.460 --> 05:20.580
So let's preview what the state of the problem, or the alert, would be. Now it's saying Normal.

05:20.640 --> 05:20.880
Okay.

05:20.880 --> 05:25.980
So looking at this graph, it appears normal because I'm doing a condition where that condition returns

05:25.980 --> 05:26.490
false.

05:26.520 --> 05:31.980
The average of A has no value? Well, there are values for the average of A, and that's the A query there.

05:32.130 --> 05:35.840
Now, you can actually rename your queries from the default if you want to; it's no problem to leave it as A.

05:35.970 --> 05:39.360
Okay, so going down that looks pretty good to me.

05:39.390 --> 05:40.770
Let's save and exit that.

05:40.860 --> 05:41.700
Okay, that's done.

05:41.760 --> 05:48.000
Now in the alerting panel, we've got the rules section. Under Grafana, I have the folder SNMPD, and in that I

05:48.000 --> 05:55.650
have one rule, SNMPD Down MySQL, state Normal, and we can view that. There we go, and that's what the

05:55.650 --> 06:01.680
query is checking against: anything in the last one minute. The query, remember, is asking: is there

06:01.680 --> 06:03.060
nothing in that last one minute?

06:03.060 --> 06:06.900
Then it would consider that as true and it actually would be writing a number one down there.

06:07.020 --> 06:07.860
So right now,

06:07.860 --> 06:09.510
it's saying zero, which is not a problem.

06:09.870 --> 06:11.130
The state is considered normal.

06:11.250 --> 06:16.440
So when everything's normal like that, we're not getting anything being highlighted on that screen.

06:16.440 --> 06:20.760
But what I'll do now is actually turn off that SNMP daemon on that MySQL server.

06:20.790 --> 06:26.580
Okay, so on my MySQL server: sudo service snmpd stop.

06:26.630 --> 06:27.180
Stopped.

06:28.240 --> 06:28.660
Okay.

06:28.870 --> 06:33.280
Now, in Grafana, if we look at this graph view,

06:34.000 --> 06:35.710
we'll start to see that

06:36.610 --> 06:40.030
these data points will just disappear out of the query range.

06:40.420 --> 06:44.180
So if I just refresh that, it's right over here now.

06:44.180 --> 06:45.280
So,

06:45.280 --> 06:48.130
going back to the alerting rules page:

06:48.280 --> 06:49.870
Nothing's showing up there.

06:50.230 --> 06:54.100
I'll fast forward the video until we get something showing up.

06:54.430 --> 06:55.030
There's a problem.

06:56.600 --> 06:56.990
Okay.

06:56.990 --> 06:58.370
So one minute has passed.

06:58.430 --> 07:04.790
It's now considered to be firing; there's one error. I can filter by firing, normal, or pending.

07:04.880 --> 07:08.690
If I wanted to, I can clear the filter. Let's look at this firing rule here.

07:08.720 --> 07:11.150
Firing: SNMPD Down MySQL.

07:11.360 --> 07:12.610
Let's look at it

07:12.620 --> 07:13.520
in more detail.

07:13.640 --> 07:19.250
If we view it, we get this error up here: failed to evaluate queries and expressions, failed to execute

07:19.250 --> 07:23.210
conditions, etc. That is because there's no data being returned.

07:24.080 --> 07:24.500
Okay.

07:24.500 --> 07:30.560
So while InfluxDB is still working and Telegraf is still working, snmpd is not working.

07:30.570 --> 07:35.630
So Telegraf is updating InfluxDB, and Grafana is only looking at that one particular thing.

07:35.630 --> 07:41.630
There's no data coming from the SNMPD in this case, so we're not seeing anything in here, but it's telling

07:41.630 --> 07:43.620
us that it's firing and alerting anyway.

07:43.640 --> 07:44.840
I'm gonna check my emails now.

07:44.900 --> 07:45.200
Okay.

07:45.200 --> 07:49.130
So if I check my inbox, I'm actually not getting any alert at the moment.

07:49.190 --> 07:57.920
What we need to do is set a notification policy. It says all alerts go to the default contact

07:57.920 --> 07:58.250
point

07:58.250 --> 08:04.010
unless you set additional matchers in the specific routing area. We can use a specific policy for

08:04.010 --> 08:11.120
that, or we can just change our default contact point. So I'm going to edit that and select email as my default

08:11.120 --> 08:13.550
contact point and then press save.

08:14.030 --> 08:15.290
Okay, alert rules.

08:16.160 --> 08:16.540
Okay.

08:16.550 --> 08:21.920
I'm going to check my emails again, and I've got SNMPD Down MySQL.

08:21.950 --> 08:22.310
Okay.

08:22.310 --> 08:23.420
So that's worked.

08:23.540 --> 08:23.960
Okay.

08:23.960 --> 08:26.690
So I'll turn the SNMP daemon back on.

08:26.700 --> 08:30.050
That is using start like so.

08:31.260 --> 08:34.710
And looking back at this alerting rules page.

08:37.200 --> 08:37.520
Okay.

08:37.530 --> 08:42.030
After some time, it should trigger that we are now getting data again.

08:43.380 --> 08:44.190
And it says, okay.

08:44.190 --> 08:48.440
Now, that took actually about a minute to actually realize that the system's back up.

08:48.450 --> 08:50.640
In fact, the system was up actually much sooner than that.

08:50.670 --> 08:57.270
That's just one of the things that you have to deal with when having your monitoring alert rules so

08:57.270 --> 09:01.380
far back from the source of the problem you're trying to look for.

09:01.410 --> 09:04.380
That's just the reality of monitoring in every system, actually.

09:04.410 --> 09:08.160
So anyway, as you can see, it's quite a lot of steps for a simple check.

09:08.190 --> 09:12.630
Now, another thing to consider, too, is that I've only done this SNMPD

09:12.660 --> 09:14.700
check on my MySQL server.

09:14.730 --> 09:18.760
Okay, let's just say I wanted to check all of my SNMP daemons at the same time.

09:18.780 --> 09:19.020
Okay.

09:19.020 --> 09:23.220
So I've got an SNMP daemon on that server and also on my InfluxDB server.

09:23.250 --> 09:30.300
One option you have is to go back into your rule here, press edit, and you could have multiple queries,

09:30.300 --> 09:36.390
so I could create another query, identical to this one up here.

09:37.350 --> 09:38.170
Down here.

09:38.190 --> 09:40.200
But just name it InfluxDB,

09:40.650 --> 09:42.180
for example.

09:42.870 --> 09:44.340
Okay, so it's pulling data.

09:44.400 --> 09:50.370
I'll change it to one minute like I did before to run the query.

09:51.350 --> 09:53.270
The condition here.

09:53.330 --> 09:59.030
Now, I can add a second condition with the average of B, or C actually, because this next one is called

09:59.030 --> 10:04.340
C, has no value, and I can keep adding those for every single server.

10:04.490 --> 10:07.970
But the problem then would be the AND condition.

10:08.000 --> 10:15.320
This way of checking would only return true, that it was failing, if both of those servers were

10:15.320 --> 10:15.650
down.

10:16.070 --> 10:18.740
So this is not really the ideal way of doing it.

10:18.890 --> 10:20.570
Instead, cancel that.

10:20.630 --> 10:22.580
Okay, let's just look at that.

10:22.910 --> 10:26.990
We don't have an option to duplicate at the moment, but that might come in the future.

10:27.320 --> 10:34.160
We need to create a new alert rule: SNMPD Down InfluxDB

10:34.160 --> 10:41.480
this time. Grafana managed. Choose SNMPD, that folder's now created. InfluxDB.

10:42.880 --> 10:43.690
My query.

10:45.180 --> 10:46.260
InfluxDB.

10:47.980 --> 10:48.520
Now minus

10:48.670 --> 10:50.020
one minute.

10:51.560 --> 10:53.630
Average or last, in this particular case,

10:53.630 --> 10:54.170
doesn't matter.

10:54.260 --> 10:56.060
A has no value.

10:56.360 --> 10:57.350
That's from the query.

10:57.950 --> 10:58.520
Very good.

10:58.660 --> 11:00.070
Define alert condition.

11:00.080 --> 11:01.840
Every one minute makes no difference.

11:01.850 --> 11:02.220
Again.

11:02.390 --> 11:04.460
Preview alerts: state is Normal.

11:04.970 --> 11:09.710
I'm not fussed about this extra information. Save and exit. And now

11:10.160 --> 11:13.880
I have two rules, and either of those could trigger at any time.

11:13.880 --> 11:18.170
And I could do the same with the SNMP daemon running on my Grafana server as well, which I'll do

11:18.170 --> 11:18.680
really quickly.

11:25.080 --> 11:27.900
Very good. Okay, so now when one of those goes down, I'll get an email alert.

11:27.900 --> 11:33.240
And also note that you also get an email alert when it comes back up: Resolved.

11:34.050 --> 11:34.440
Excellent.