Explainable AI Whiteboard Technical Series: Mutual Information

AI & ML
A still from the video: a whiteboard diagram labeled "Y: time to connect" and "X: network feature," showing the quantities H(Y), H(Y|X), H(X|Y), H(X), and MI(X;Y).

Technical Whiteboard Series: Mutual Information

In this video, we delve into how the Juniper AI-Native Networking Platform utilizes mutual information (MI) to identify key network features—such as mobile device type, client OS, or access point—that are most informative for predicting the success or failure of SLE client metrics.



You’ll learn

  • How MI is defined and used to identify the most informative network features for predicting SLE client metrics

  • How entropy measures uncertainty and how it applies to network features and SLE metrics

  • How the Pearson correlation coefficient (PCC) helps determine the direction and strength of predictions, complementing MI

Who is this for?

Network Professionals, Business Leaders

Transcript

0:14 Today we're talking about how the Juniper Mist AI-Native platform uses mutual information to help you understand which network features, such as mobile device type, client OS, or access point, have the most information for predicting failure or success in your SLE client metrics.

0:31 Let's start with a definition of mutual information. Mutual information is defined in terms of the entropy between random variables. Mathematically, the equation for mutual information is defined as the entropy of random variable X minus the conditional entropy of X given Y.

0:47 Now, what does this mean? Let me give you an example. Let's say Y is one of our random variables that we want to predict; it represents the SLE metric time to connect and can take one of two possible values, pass or fail. Next we have another random variable X that represents a network feature and can have a possible value of present or not present. An example of a network feature can be a device type, OS type, time interval, or even a user or an AP; any possible feature of the network can be represented by the random variable.
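For reference, the relationship described above can be written out explicitly; mutual information is symmetric, so the form used later in the video, H(Y) - H(Y|X), gives the same quantity:

```latex
% Mutual information expressed through entropy and conditional entropy
\mathrm{MI}(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
```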

1:23 Next we'll look at what we mean by entropy. For most people, when they hear the term entropy, they think of the universe and entropy always increasing as the universe tends toward less order and more randomness, or uncertainty. So entropy represents the uncertainty of a random variable, and the classic example is a coin toss. If I have a fair coin and I want to flip that coin, the entropy of that random variable is given by the negative sum of the probability of x_i times the log base 2 of the probability of x_i. For that fair coin, the probability is 50% heads plus 50% tails, and the entropy is equal to one, the maximum entropy possible. When we have maximum uncertainty, the random variable will have maximum entropy.

2:12 If we take an example where we don't have a fair coin, say some hustler out there is using a loaded coin, let's say the probability of heads is 70% and the probability of tails is 30%. In this case the entropy is going to be 0.88, so you can see that as the uncertainty goes down, your entropy will trend toward zero. If you were at zero entropy, that would mean no uncertainty, and the coin flip would always be heads or always tails.
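A minimal Python sketch of the coin-toss numbers above (the helper name is ours, for illustration):

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy H(X) = -sum_i p(x_i) * log2 p(x_i), in bits."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

print(round(entropy_bits([0.5, 0.5]), 3))  # fair coin   -> 1.0, maximum uncertainty
print(round(entropy_bits([0.7, 0.3]), 3))  # loaded coin -> 0.881, less uncertainty
```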

2:41 Now let's go back and see how mutual information works with our SLE metrics. Graphically, what does this equation look like? Let's say this circle here represents the entropy of my SLE metric Y, and this circle is the entropy of my feature random variable X. If you look at our equation, the conditional entropy of random variable Y given the network feature X is this area here; if I subtract that out, what we're left with is this middle segment. This represents the mutual information of these two random variables, and it gives you an indication of how well your network feature provides information about your SLE metric random variable Y. If the network feature tells you everything about the SLE metric, then the mutual information is maximum; if it tells you nothing about the SLE metric, then the mutual information between X and Y is zero.
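To make the H(Y) - H(Y|X) picture concrete, here is a small sketch that estimates mutual information from paired samples; the data and the feature values are made up for illustration:

```python
import numpy as np

def label_entropy_bits(labels):
    """H(Y) in bits, estimated from an array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information_bits(x, y):
    """MI(X;Y) = H(Y) - H(Y|X), estimated from paired samples of two discrete variables."""
    x, y = np.asarray(x), np.asarray(y)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = x == value
        h_y_given_x += mask.mean() * label_entropy_bits(y[mask])
    return label_entropy_bits(y) - h_y_given_x

# Toy samples: X = whether a hypothetical network feature is present for a client,
# Y = the time-to-connect SLE result for that client.
x = ["present", "present", "present", "absent", "absent", "absent", "absent", "absent"]
y = ["fail",    "fail",    "fail",    "pass",   "pass",   "pass",   "pass",   "fail"]

print(round(mutual_information_bits(x, y), 2))  # ~0.55 bits: the feature is informative about Y
```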

3:31 Now, mutual information tells you how much information a network feature random variable X gives you about the SLE metric time to connect, but it doesn't tell you whether the network feature is better at predicting failure or success of the SLE metric. For that we need something called the Pearson correlation. If you look at the picture of the correlation, it tells us a couple of things: one is the amount of correlation, with a range from -1 to 1; the other is the sign, negative or positive, which is a predictor of pass or fail.

4:03 So now we have these two things: first is the magnitude, indicating how correlated the two random variables are; second is the sign, which indicates failure or success. If the correlation is negative, the network feature is good at predicting failure; if it's positive, it's good at predicting pass. If the Pearson correlation is zero, it means there is no linear correlation between the variables, but there could still be mutual information between the two. The Pearson correlation also does not tell us the importance of the network feature, or whether there is enough data to make an inference between the network feature random variable and the SLE metric random variable.
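A short sketch of the sign interpretation, assuming a simple 0/1 encoding (feature absent/present, SLE fail/pass) chosen here purely for illustration:

```python
import numpy as np

# Hypothetical toy data: 1 = feature present / SLE pass, 0 = feature absent / SLE fail.
feature = np.array([1, 1, 1, 0, 0, 0, 0, 0])
sle_pass = np.array([0, 0, 0, 1, 1, 1, 1, 0])

r = np.corrcoef(feature, sle_pass)[0, 1]  # Pearson correlation coefficient, in [-1, 1]
print(round(r, 2))                        # -0.77: negative, so this feature leans toward failure

if r < 0:
    print("feature is a predictor of FAIL")
elif r > 0:
    print("feature is a predictor of PASS")
else:
    print("no linear correlation (mutual information may still be non-zero)")
```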

4:41 Going back to our graphic of the circles, there may be one case where I have very high entropy for both variables, but there may be another case where I have much smaller entropy on one of those variables. Both of these examples may be highly correlated, with a high Pearson's value, but the mutual information will be much higher in the first case, which means that random variable has much more importance in predicting success or failure of the SLE metric. I hope this gives a little more insight into the AI we created at Mist; if you look at the Mist dashboard, the result of this process is demonstrated by our virtual assistant.
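Putting the two measures together, here is one way to sketch that kind of feature ranking; the dataset, feature names, and scoring loop are hypothetical and are not Juniper's implementation:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mutual_info_score

# Hypothetical toy dataset: one row per client connection attempt.
# 1 = feature present / SLE pass, 0 = feature absent / SLE fail.
sle_pass = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])
features = {
    "device_type=phone_x": np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0]),
    "ap=ap_lobby_1":       np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1]),
}

for name, present in features.items():
    mi_bits = mutual_info_score(present, sle_pass) / np.log(2)  # sklearn returns nats; convert to bits
    r, _ = pearsonr(present, sle_pass)
    direction = "fail" if r < 0 else "pass"
    print(f"{name}: MI={mi_bits:.2f} bits, PCC={r:+.2f}, leans toward {direction}")
```

In this toy run the first feature has both high mutual information and a strongly negative correlation, so it is important and points at failures, while the second has a nonzero correlation but very little mutual information, so it matters far less for explaining the SLE.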
