Understanding Radio Resource Management
Get an in-depth look at Juniper Mist AI RRM
Radio Resource Management (RRM) is a key tool for large multi-site organizations to efficiently manage their radio frequency (RF) spectrum. This video provides a technical deep dive into the problems Juniper Mist™ AI RRM solves by taking a user-experience approach.
You’ll learn
How Juniper Mist uses RRM to manage the RF spectrum and maximize the Wi-Fi end-user experience
Why picking static values and letting the system run isn't feasible or scalable
Transcript
0:00 Radio resource management, or RRM, is a key tool for large multi-site organizations to efficiently manage their RF spectrum.
0:09 Legacy controller-based implementations build their channel plan on how the APs hear each other, usually late at night, and decisions on channel and power are then made and implemented.
0:21 The frustration we hear from our large customers is that these systems focus solely on channel reuse, don't take into account changing conditions during the day, and then overreact for no clear reason. Mist listened.
0:36 About two years ago we completely redesigned our RRM. Instead of just following the how-the-APs-hear-each-other vector, we wanted to take the user experience into account.
0:47 We already had the capacity SLE (service level expectation), which is an actual measurement of every user minute: did the user have enough usable RF capacity available, taking into account client count, client usage (a.k.a. bandwidth hogs), and Wi-Fi and non-Wi-Fi interference.
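The exact SLE math isn't public, but the idea of grading every user minute as pass or fail and reporting the passing percentage can be sketched roughly like this; the inputs and thresholds below are invented for illustration and are not Mist's formula:

```python
# Illustrative sketch only -- not Mist's actual capacity SLE formula.
from dataclasses import dataclass

@dataclass
class UserMinute:
    client_count: int          # clients sharing the radio during this minute
    airtime_busy_pct: float    # Wi-Fi + non-Wi-Fi interference airtime, 0-100
    top_talker_share: float    # throughput share of the heaviest client, 0-1

def minute_has_capacity(m: UserMinute,
                        max_clients: int = 40,
                        max_busy_pct: float = 60.0,
                        max_hog_share: float = 0.8) -> bool:
    """A user minute 'passes' if none of the capacity limiters are exceeded.
    Thresholds here are made up; the real system auto-baselines per site."""
    return (m.client_count <= max_clients
            and m.airtime_busy_pct <= max_busy_pct
            and m.top_talker_share <= max_hog_share)

def capacity_sle(minutes: list[UserMinute]) -> float:
    """Percentage of user minutes that had enough usable RF capacity."""
    if not minutes:
        return 100.0
    passed = sum(minute_has_capacity(m) for m in minutes)
    return 100.0 * passed / len(minutes)
```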
1:08 So we implemented a reinforcement-learning-based feedback model. We monitor the capacity SLE to see if a channel change and/or power change actually made things better for the users, or if it didn't have any impact.
1:21 We trained the system on these types of changes and validated them with the capacity SLE to make sure there was a measurable improvement. This auto-tuning continues on an ongoing basis.
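As a rough illustration of that feedback shape, here is a minimal bandit-style sketch in which the reward is the change in capacity SLE measured after an action; this is a generic textbook update, not Mist's actual model:

```python
# Generic value-based feedback loop sketch -- not Mist's implementation.
import random

class ChannelPolicy:
    """Remembers how much each action (a channel or power change) improved the
    capacity SLE in the past and prefers the best-performing one."""
    def __init__(self, actions, epsilon: float = 0.1, lr: float = 0.2):
        self.values = {a: 0.0 for a in actions}
        self.epsilon, self.lr = epsilon, lr

    def choose(self):
        if random.random() < self.epsilon:            # occasional exploration
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)  # exploit best-known action

    def update(self, action, sle_before: float, sle_after: float) -> None:
        reward = sle_after - sle_before               # did users actually benefit?
        self.values[action] += self.lr * (reward - self.values[action])
```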
1:36 Rather than setting 50 or more different thresholds based on the raw metrics available from some vendors' controller-based systems, we know from experience that there is no perfect value that works across all environments.
1:51 Each environment is different, and probably not even consistent over the course of a single day. Picking static values and letting the system just run isn't feasible and won't scale.
2:00 If the capacity SLE is showing 90 percent, then there isn't much to gain by making changes.
2:08 The client usage classifier tracks excess bandwidth hogging by certain clients. So if we see a two-sigma deviation in bandwidth usage among clients, the higher-usage clients get flagged in the client usage classifier.
2:22 If the bandwidth usage is pretty much uniform across all clients, then the client count classifier is where that would be counted.
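A two-sigma outlier check of that kind can be sketched in a few lines; the data and threshold below are illustrative only:

```python
# Illustrative two-sigma "bandwidth hog" check on per-client usage counters.
from statistics import mean, stdev

def flag_bandwidth_hogs(usage_bytes: dict[str, int]) -> list[str]:
    """Return clients whose usage exceeds mean + 2 standard deviations."""
    if len(usage_bytes) < 3:
        return []
    values = list(usage_bytes.values())
    threshold = mean(values) + 2 * stdev(values)
    return [client for client, used in usage_bytes.items() if used > threshold]

# Nine clients behave normally; one moves far more data and gets flagged.
usage = {f"client-{i}": 100_000_000 for i in range(9)}
usage["client-9"] = 2_400_000_000
print(flag_bandwidth_hogs(usage))   # -> ['client-9']
```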
2:34 These two events would not cause a channel change, but they would be visible in Marvis. If the capacity SLE is taking a hit based not on client usage but on Wi-Fi or non-Wi-Fi interference, then your end-user experience is taking a hit.
2:50 Our system is agile and dynamic. Rather than just setting min/max ranges and being purely focused on channel reuse, we can let the system learn and adapt based on what the end users are experiencing.
3:06 This is the underlying architecture for Mist AI-driven RRM. Let's take a look at the available configuration options.
3:14 You can choose the power range and your list of channels. These are the only things exposed, as everything else is auto-baselined, so you don't need to set a bunch of thresholds on each of your different sites; the system will self-learn per site based on the capacity SLE.
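Those two knobs map to the radio settings in an RF template, which can also be managed through the Mist API. The sketch below is from memory: the endpoint, token, IDs, and field names are placeholders to verify against the current API documentation.

```python
# Sketch: create an org RF template exposing only the RRM knobs discussed above
# (power range and allowed channel list). Verify paths/fields in the Mist API docs.
import requests

API = "https://api.mist.com/api/v1"
HEADERS = {"Authorization": "Token <your-api-token>"}   # placeholder token
ORG_ID = "<org-id>"                                     # placeholder org ID

rf_template = {
    "name": "warehouse-default",
    "band_5": {                      # 5 GHz radio settings
        "power_min": 8,              # dBm floor RRM may assign
        "power_max": 17,             # dBm ceiling RRM may assign
        "channels": [36, 40, 44, 48, 149, 153, 157, 161],
        "bandwidth": 40,
    },
    "band_24": {                     # 2.4 GHz radio settings
        "power_min": 5,
        "power_max": 14,
        "channels": [1, 6, 11],
        "bandwidth": 20,
    },
}

resp = requests.post(f"{API}/orgs/{ORG_ID}/rftemplates",
                     headers=HEADERS, json=rf_template, timeout=30)
resp.raise_for_status()
print(resp.json())
```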
3:32 Mist has implemented RRM as a two-tier model. The first tier is global optimization, which runs once a day: it collects data throughout the day on an ongoing basis and creates a long-term trend baseline, and then every day around two or three a.m. local time it will make changes if those changes are warranted.
3:51 The second tier is event-driven RRM, or as we call it internally, local RRM. This is monitored by the capacity SLE and will act immediately upon any deviation from baseline, so both of these run in parallel.
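In pseudocode terms, the two tiers look roughly like the sketch below; the function names, tolerance value, and time window are assumptions for illustration, not Mist internals.

```python
# Two-tier RRM sketch: a nightly global pass plus an event-driven local path.
import datetime as dt

def global_optimization(site_id: str) -> None:
    """Tier 1: nightly run (~2-3 a.m. site-local) that rebuilds the channel/power
    plan from the long-term trend baseline, only if changes are warranted."""
    ...

def local_rrm(site_id: str, sle_now: float, sle_baseline: float,
              tolerance: float = 5.0) -> None:
    """Tier 2: event-driven path that acts immediately when the capacity SLE
    deviates from its learned baseline by more than the tolerance."""
    if sle_baseline - sle_now > tolerance:
        ...  # change channel/power for the affected radio right away

def nightly_window(now: dt.datetime) -> bool:
    """Both tiers run in parallel; the global tier only fires in this window."""
    return now.hour in (2, 3)
```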
4:06 Conventional systems aren't able to leverage the compute available in the cloud to constantly crunch the long-term trend data, nor the ability to cross-pollinate information from all your different sites, different client types, and different RF environments.
4:19 An example would be buildings around an airport, where we have seen radar hits triggering DFS events. The cloud learns the geolocation and the specific frequencies of these events, and then cross-pollinates that learning to other sites that may also be close to that airport.
4:36 Existing systems have no memory and no concept of long-term trend data; they just make changes once a day. Here you can see events happening throughout the day: all of the events with a description are event-driven, and the scheduled ones are the optimizations that happen at night.
4:54 Some systems try to implement a pseudo-local, event-type RRM, usually interference-based, but the problem we run into over time is drift. As there's no learning going on, eventually you'll need to manually rebalance the system, clear the drift, and start all over again. The reason for this is that there is no memory of what happened, nor the compute space to understand context and learn from it.
5:21 Mist RRM might also try to make a similar channel change, but first we're going to go back and look at the last 30 days. Even though these three available channels look great now, we know one has had multiple issues in the past, so we move that one to the bottom of the pecking order. This makes our global RRM less disruptive than any legacy implementation.
5:45 Using DFS as an example: clients don't respond well to DFS hits. They might not scan certain channels, and they might make poor AP choices. In our implementation we reorder the channels in a pecking order based on what we've seen in that environment over time, so certain channels are automatically prioritized.
6:07 So you might see channels that appear to be a good choice based on current channel and spectrum utilization, but we know there is a high degree of risk of DFS hits based on what we've learned over time, so those channels are de-prioritized. This is truly a self-driving system, and it's not solely focused on channel reuse.
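One way to picture that pecking order is a ranking that starts from the current channel scores and demotes channels with recent DFS history; the scores, penalty weight, and data shapes below are made up for illustration.

```python
# Illustrative pecking-order sketch: demote channels with 30-day DFS history.
def rank_channels(current_score: dict[int, float],
                  dfs_events_30d: dict[int, int],
                  dfs_penalty: float = 10.0) -> list[int]:
    """Return channels best-first; each past DFS event costs `dfs_penalty` points."""
    def effective(ch: int) -> float:
        return current_score.get(ch, 0.0) - dfs_penalty * dfs_events_30d.get(ch, 0)
    return sorted(current_score, key=effective, reverse=True)

# Channel 100 looks best right now, but its radar history pushes it to the bottom.
scores = {36: 78.0, 100: 92.0, 149: 85.0}
dfs_history = {100: 3}        # three radar-triggered DFS events in the last 30 days
print(rank_channels(scores, dfs_history))   # -> [149, 36, 100]
```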
6:29 Stepping back, legacy RRM systems lack the tools to measure whether things actually got better for your users. With Mist, the capacity SLE is exactly that measurement that you've never had.
6:42 If the capacity SLE takes a hit, it's due to Wi-Fi or non-Wi-Fi interference, and RRM is not able to make any changes, then you know there's something in your environment you need to take a look at. Or if RRM is making changes and things are not getting better, then you have some other issue that needs to be addressed. But at least you know.
7:07 Being able to quantify that the system is getting better is super important, especially once you start deploying a lot more devices. Today's requirements may not warrant this level of sophistication, but once you start throwing a lot of IoT devices and other unsophisticated RF devices on the network, our system will learn to accommodate them.
7:27 To see the channel distribution, you can take a look at this graph.
7:32 This is from our office, and it's not the perfect RF environment. This graph shows you what the channel distribution looks like, but when you have hundreds of thousands of APs and thousands of sites, you need automations that baseline and monitor using metrics that you trust.
7:50 What we've done is add this top-level metric into RRM. So instead of polling all of your APs and manually inspecting channel assignments, you can simply use our API to pull a single metric.
8:04 We have a distribution score and a density score, plus an average number of co-channel neighbors and an average number of neighbors. So if you have a standard deployment policy that an installer did not follow, you will immediately see that the site isn't in compliance based on these values.
8:21 You can pull this from the API and create a post-deployment report.
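A post-deployment compliance check built on those site-level values might look something like the sketch below; the endpoint path, response field names, and thresholds are placeholders to confirm against the Mist API documentation rather than a verified contract.

```python
# Sketch of a post-deployment compliance report from site-level RRM metrics.
# Endpoint path and field names are assumptions -- check the Mist API docs.
import requests

API = "https://api.mist.com/api/v1"
HEADERS = {"Authorization": "Token <your-api-token>"}   # placeholder token

def site_rrm_summary(site_id: str) -> dict:
    resp = requests.get(f"{API}/sites/{site_id}/rrm/current-channel-planning",
                        headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def check_compliance(summary: dict,
                     max_co_channel_neighbors: float = 2.0,
                     min_distribution_score: float = 0.8) -> list[str]:
    """Flag deviations from a (hypothetical) standard deployment policy."""
    problems = []
    if summary.get("avg_co_channel_neighbors", 0) > max_co_channel_neighbors:
        problems.append("too many co-channel neighbors")
    if summary.get("distribution_score", 1.0) < min_distribution_score:
        problems.append("poor channel distribution")
    return problems
```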
8:26 So if any of these metrics are deviating, you will know exactly where to focus. These SLEs and metrics are available on an ongoing basis.
8:36 Compare this with existing vendors, where you would have to pull raw metrics and create your own formula to see if you need to take any action. We don't want you to pull raw data; we just want you to use site-level metrics. But if you want to maintain your own reports, we have already done the dedup and aggregation for you.
8:57 From a deep troubleshooting perspective, "Why is this AP on a particular channel?" is a common question when chasing an RF issue that you suspect is due to Wi-Fi interference.
9:05 Each Mist AP has a dedicated radio that scans all the channels all the time and continually maintains a score for each of the channels it scans. This is the data that RRM uses to score the channels, so whenever it gets a trigger from the capacity SLE to make a change, it uses this AP and site score to determine the channel to assign.
9:28 If an AP is on a channel that doesn't seem optimal, you can look right here and then at the capacity SLE to see if the decision-making makes sense. If the SLE doesn't show a user hit, that explains why the AP hasn't changed channel yet; it will defer to the global plan and make the change at night.
9:49 If there were user impact, the system would have made a change right away.
9:53 In short, we have a self-driving, reinforcement-learning-based RRM implementation.
9:59 At the same time, we're also providing you with visibility into the decision-making process, so you can validate decisions made by RRM.
10:07 You also have the ability to pull information at scale via our APIs and maintain baseline and trend data for all your sites. This is valuable if you're asked to deploy a bunch of new devices and the question comes up: hey, do we have the capacity to support this? With the baseline and trend information you can make informed decisions without having to pull all kinds of raw data and make a guess.
10:31 Typically you want to make adjustments in 2-3 dBm increments so you have enough wiggle room. Unlike Cisco and Meraki, we will go up and down in increments of one, so there's more granularity. But as best practices suggest, we always give it a range of plus or minus 3 dBm from a median value, typically the target used by your site-survey predictive design.
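As a worked example of that guidance (the 14 dBm target below is just an assumed design value, not from the video):

```python
# +/- 3 dBm around the predictive-design target gives RRM its working window.
design_target_dbm = 14                    # assumed target from the site survey
power_min, power_max = design_target_dbm - 3, design_target_dbm + 3
print(power_min, power_max)               # -> 11 17
```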
10:57 We had one customer ask us why their coverage SLE was 99 percent when they had excellent coverage in their warehouse, which was full of Wi-Fi-driven robots. In the past, when there was a robot problem, the client team would inevitably blame the infrastructure team; the infrastructure team would request detailed logs from the client team, and most of the time that led to no action.
11:17 When Mist was installed and we saw the 99 percent coverage SLE, we looked at the affected clients, and it always seemed to be the same robot. When they asked the client team about it, they said, "Yeah, that robot has always been a little quirky." When they took the robot apart, they found a damaged antenna cable.
11:38 This was eye-opening to this customer, and their quote to us was, "You guys solved the needle-in-the-haystack problem."
11:46 The coverage SLE is a powerful tool. At another customer, a driver update was pushed to some of their older laptops. They have over a hundred thousand employees, so they did do a slow rollout, but they started getting Wi-Fi complaints almost right away.
12:03 Their laptops are configured with 5 GHz and 2.4 GHz profiles already installed, because each of their sites is a little different in its capabilities.
12:14 What this update did was cause laptops to choose 2.4 GHz when they normally would have chosen 5 GHz, so the SLEs immediately showed a significant deviation from baseline that correlated to those specific device types and the sites that were having the problem.
12:31 They immediately stopped the push because the correlation was obvious. This customer told us that in the past they would have asked a user to reproduce the problem so they could collect the telemetry they needed to diagnose it.
12:44 Now they realize Mist already has the telemetry needed to tell them they have a growing problem, what that problem was, and to save them a ton of time.
12:52 That is the power of Mist AI RRM.