Cloud Field Day 20: Automated Congestion Management in the AI Data Center with Juniper Networks
To maximize throughput and minimize packet loss, Ethernet fabrics use the DCQCN congestion management scheme, but DCQCN introduces significant operational complexity for human operators. Learn how Juniper Apstra takes this new challenge in stride, automatically optimizing for maximum throughput and the "right amount" of packet loss.
You’ll learn
The importance of adjusting congestion parameters based on real-time data rather than static, manual settings
How Apstra’s DCQCN auto-tune app monitors network KPIs and adjusts configs in real time
Transcript
0:10 Hi, my name is Vikram Singh. I'm a PLM on the AI data center team, and I'm going to talk about automated congestion management in AI/ML backend fabrics.

0:21 If I may draw an analogy for what we're going to present today: we're all familiar with these metering lights, which are primarily used to manage congestion on our freeways during rush hour. The way this works is that cameras or sensors detect how much traffic is flowing, or how much congestion there is, on the main freeway, and then they change the duration of time the light stays red, which in practice controls how much traffic you're allowing in. If you allow too much, it adds to the congestion and just worsens the situation. If the lights stay red for longer, you're actually creating congestion at a different point and not utilizing the whole bandwidth available. So what we're trying to do is give you that flexibility of fine-tuning, depending on when congestion happens.
1:16 Just to describe the challenge, let's put ourselves in the shoes of a network admin. You're managing this backend cluster, which Praful will talk about, and there are two types of complexity. One is monitoring complexity: there are so many entities that you have to monitor. For example, here you see 256 GPUs. Even in a small cluster like that, you may have 12 switches with 64 ports each, so there are 768 ports, and each port can have six queues, so you quickly get to 5,000-plus entities that you have to monitor, and you may need a historical view, as Jay was showing.

2:00 From a provisioning standpoint, there are only a few dials available to tune congestion, and they're listed at the bottom right. But today it's a manual process: it's tedious and error-prone. Somehow the network admin is supposed to know the optimal values for this network without knowing anything about what traffic will come from these machine learning workloads. Sometimes, as was said, you may need to tweak certain things for a particular model. More importantly, in a multi-tenant training network, where multiple jobs are running and traffic is flowing through the same fabric, by the time you get a ticket, or somebody tells you "hey, my machine learning job is slow," you may not be able to recreate the other job that was interfering at that time, so you don't know what to tweak. These are some of the challenges today, and that's what we're trying to solve through automation.

2:58 So let's see how, with Apstra and all the things you have seen, Terraform and the probes, we can build something to solve it. What we have done here is write this DCQCN auto-tune app. It leverages everything Jay showed you with the probes: we are continuously monitoring key KPIs that tell you when congestion happens, as well as performance KPIs (we'll do a demo soon, so you'll see them). Once we detect that the intended config is not doing what it's supposed to do, we make a decision, and then we do closed-loop automation: we use Terraform to actually tweak that configuration throughout your fabric, and then we continue to monitor the effect of that, until you have an optimal set of these congestion configurations.
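In outline, the loop described here looks something like this minimal Python sketch; the helper functions are placeholders standing in for the Apstra probe queries and the Terraform push, not the actual app's code:

```python
import time

POLL_SECONDS = 30  # assumption: how often the app samples the Apstra probes

def read_congestion_kpis() -> dict:
    """Placeholder: query Apstra probes for PFC pause frames, ECN marks, tail drops."""
    raise NotImplementedError

def apply_dcqcn_config(kmin: int, kmax: int) -> None:
    """Placeholder: push new ECN thresholds to the fabric, e.g. via terraform apply."""
    raise NotImplementedError

def auto_tune(kmin: int = 60, kmax: int = 90) -> None:
    # Closed loop: observe, decide, act, then observe the effect of the change.
    while True:
        kpis = read_congestion_kpis()
        if kpis["pfc_pause_frames"] > 0 or kpis["tail_drops"] > 0:
            # The intended config is not holding: trigger ECN marking earlier.
            kmin, kmax = kmin - 10, kmax - 10
            apply_dcqcn_config(kmin, kmax)
        time.sleep(POLL_SECONDS)
```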
3:51 There was a question about how we are creating congestion. In this demo, this is our lab, which you'll be touring shortly, and we use Ixia to generate RoCEv2 traffic. We're actually generating so much traffic that it overwhelms the receiving NICs. It goes through four leaf and spine switches, and that congestion manifests at the egress leaves. That's where we are filling up the buffers, and that's where we are applying the algorithm that I'm going to show you. We monitor the ECN marks, PFC frames, and tail drops via Apstra, and then we basically make a decision and push out the config. Now, there are a few options here. You can say, "wherever the congestion is happening, only tweak the congestion parameters on that switch," or, as Jay showed you, Apstra has this concept of leaf and spine, so if it's happening on a leaf you can choose to apply the change on all the other leaves, because it may happen on those as well, or on the spines. We also have ServiceNow integration, so every decision this app makes gets recorded in ServiceNow.
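As an illustration of that ServiceNow hook (a sketch, not the code used in the demo), here is one way to update an incident through ServiceNow's standard Table API; the instance name, credentials, and sys_id are placeholders:

```python
import requests

def log_decision(instance: str, user: str, password: str,
                 incident_sys_id: str, note: str) -> None:
    """Append a work note to an existing incident via the ServiceNow Table API."""
    url = f"https://{instance}.service-now.com/api/now/table/incident/{incident_sys_id}"
    resp = requests.patch(
        url,
        auth=(user, password),
        headers={"Content-Type": "application/json"},
        json={"work_notes": note},
    )
    resp.raise_for_status()

# Example: log_decision("dev12345", "admin", "secret", "1c74...",
#                       "auto-tune: shifted ECN window left to Kmin=40, Kmax=70")
```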
5:04 Praful went through this, and I'm going to go into a little more detail before I show you the algorithm we are using. As you know, with DCQCN there are two primary mechanisms in Ethernet to control congestion. The first one is PFC. It came first, and it's more of a brute-force, hammer approach. In this case, if you look at the spine and the red link, let's say that link is congested. The switch will monitor its buffer, and if it's around 90% full (it's configurable), if it's getting to the point where packet drops are imminent because the buffers will fill up, it sends a pause frame to the neighboring switch downstream, saying "stop sending me traffic for this particular amount of time, for this whole priority." The way priority works is that the NICs will usually mark a DSCP value on all RDMA traffic (there is a default value for NVIDIA and others), and then you take that marking and map the traffic into a queue where you apply both of these techniques.

6:11 Eventually, because of this pause, the other switch gets into trouble as its buffers fill up, the pause frames reach the NICs, and the NICs stop. The goal is to make Ethernet lossless, and this is what makes it lossless. But the problem with this approach is that it doesn't have granularity. Let's say one link is blocked: it's going to block all traffic in that priority from the neighboring switch, which has 62 other ports where the traffic could be fine. That's what is troublesome about this technique. But then there is ECN.
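To make the priority mapping concrete, here is a hypothetical sketch of such a classification table. The specific values (DSCP 26 into lossless queue 3 for RoCEv2 data, DSCP 48 for congestion notification packets) are a common deployment convention, not numbers quoted in the session:

```python
# Hypothetical mapping: RDMA traffic carrying a well-known DSCP value is
# steered into one lossless queue where both PFC and ECN are applied.
DSCP_TO_QUEUE = {
    26: {"queue": 3, "pfc": True, "ecn": True},    # RoCEv2 data (common default)
    48: {"queue": 6, "pfc": False, "ecn": False},  # congestion notification packets
}

def classify(dscp: int) -> dict:
    # Unmarked traffic lands in a best-effort queue with neither technique.
    return DSCP_TO_QUEUE.get(dscp, {"queue": 0, "pfc": False, "ecn": False})
```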
6:48 With ECN, take the same example: one of the spine's interfaces is congested. You have these two knobs you can configure, and the switch probabilistically starts marking packets. These are one-way flows: one GPU is trying to write some memory chunk into another GPU's memory, and all the NICs are continuously looking for this marking. In the IP header, next to the DSCP bits, there is a bit that gets flipped, so the receiver detects that there is congestion somewhere in the fabric, and it sends an explicit congestion notification packet back to the sender. The sender then backs off the flow: if it was sending at 400, it steps down to 200, and so on. This is more effective because, (a) it's granular: you're only impacting the flows that pass through the particular congested queue, and (b) it actually remediates the problem, because as the flows back off, the congestion goes away, at least temporarily. DCQCN judiciously uses both of these techniques, or sometimes, as Praful was saying, you can use only ECN as your primary method. That is DCQCN.
8:08 So, you mentioned that PFC causes all the traffic at that priority to stop?

8:16 Yes, from the neighboring switch, on that interface.
8:23 Okay, so let's see what we have done. I'm simplifying here a little bit, but this way it's easy to understand. The x-axis here is the buffer occupancy, in percent. Whenever congestion happens, a switch is congested at some queue on some port, and this is what you apply there. There are two configuration parameters for ECN, Kmin and Kmax. As the buffers fill up and hit the Kmin lower watermark, the switch will probabilistically start marking packets: if, let's say, you are somewhere at the 10% point of that range, the switch will start marking one in ten packets with ECN. If ECN doesn't kick in fast enough and you keep filling the buffers and cross Kmax, that's when all the packets in the queue will be marked ECN, which means all the flows will be impacted.
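In equation form (assuming the linear 0-to-100% marking curve described later in the session), the probability of marking a packet at queue occupancy $q$ is:

$$
p(q) =
\begin{cases}
0, & q < K_{\min} \\
P_{\max}\,\dfrac{q - K_{\min}}{K_{\max} - K_{\min}}, & K_{\min} \le q \le K_{\max} \\
1, & q > K_{\max}
\end{cases}
$$

With $P_{\max} = 100\%$, a queue that is 10% of the way from $K_{\min}$ to $K_{\max}$ marks roughly one packet in ten.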
9:23 And then you have PFC. Sending these control packets back to the sender takes time, and if there is some delay in there, ECN may still not act fast enough. That's when you have PFC as insurance, to kick in and buy some time for ECN to eventually take effect and slow things down. So this is the set of values that a network admin somehow has to know in advance: "what should my pattern be?" And the common practice is that they all shift very far right, because you don't want to slow things down unnecessarily. But with that approach, you don't have much buffer left to actually prevent drops.
10:06 So what we have done here is what you see on the top right: how do we adjust left? Everybody starts with a very relaxed ECN config, so that all the applications can breathe and run at the maximum possible rate. Your load balancing is your first line of defense, but if congestion still happens, we keep monitoring the buffer occupancy by checking whether any PFCs are being triggered or any tail drops are happening. If they are, that indicates that your ECN is not reacting fast enough with whatever config you set, and we need to move a little bit left, to trigger ECN sooner, so that the eventuality of dropping packets is alleviated. So we make a decision and move left, and then we see the impact of that. If PFCs are still being triggered, or packet drops are happening, we keep moving left until the system is stable.

11:07 And once we do that, how do we move right? When you have avoided the boogeyman, the situation which may cause packet loss, we start saying: okay, if the buffer occupancy stays between your low and high watermarks, that indicates that my ECN is perfectly tuned and the round-trip times for these CNPs are right, so let me actually allow the applications to run at higher bandwidth, to breathe more, and move right. It will help you find the most optimal values based on whatever workloads come in, in real time.
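A sketch of that decision rule in Python; the step sizes follow the demo later in the session (steps of 10 while first backing away from drops, steps of 5 once the edge has been located), and the Boolean inputs stand in for the KPIs read from the Apstra probes:

```python
def next_window(kmin: int, kmax: int,
                pfc_seen: bool, tail_drops_seen: bool,
                edge_found: bool) -> tuple[int, int]:
    """Return the next ECN window (Kmin, Kmax), moved in lock step."""
    step = 5 if edge_found else 10  # finer steps once the danger zone is known
    if pfc_seen or tail_drops_seen:
        # ECN reacted too late: trigger marking sooner by shifting left.
        return kmin - step, kmax - step
    # System is stable: let applications breathe more and creep right.
    return kmin + step, kmax + step
```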
11:42 So the metric it's trying to drive is packet drops, rather than job completion time or something like that, which is a higher-level metric?

11:51 You're right. It doesn't have that notion; job completion time can only be measured by the application. But congestion and packet losses can slow that job completion time down, so it's trying to avoid that by saying: let me slow the applications, not to a full crawl, but let me slow them down reasonably, so they're still running. That may help the job completion time, compared with packet losses happening and retransmissions being triggered.

12:21 I was just going to add to what Vikram said: this is an application that's running on top of Apstra, using the APIs. This is one implementation we've done, but at the end you can take it and do what you want. And you're right that if you have access to job completion time, you can absolutely make it part of the algorithm that you implement. We're going to experiment with a bunch of things, and we're going to learn along with our customers in terms of how to make this as effective as possible, so this is just one example implementation of what you can do.
12:51 But the application people are never going to tell the network engineer about their jobs.

12:55 Yes and no. What we've seen, in terms of the divide, is that they're never going to allow us to configure anything in the application. It's kind of like the integrations we have with VMware: we don't go and configure VMware, but we do get visibility into it, we get access to that telemetry. At the end of the day, though, we're controlling the network, if they allow us.

13:25 Yeah, and it'll be even harder with the AI/ML applications.
13:28 Matias, I know you had a question earlier about how many drops and so on. Once we know that, we can tweak it: how many drops do you want before we move this further? You could do that. But I just wanted to say, about the future: this is one idea we are testing, and as we do more experiments we are going to find more opportunities, what else we can tune, what other parameters, load balancing, other things, and we will enhance this. So Apstra now allows you to not only deploy and operate, but also helps you fine-tune some of these performance KPIs, to help you run this optimally. With that, I will let Raj do the demo.
14:11 Just to set the stage, like Vikram said: I think of packet drops as an edge. You want to get away from the edge as quickly as possible, and then, because the view is really nice from the edge, you want to creep closer and closer and find the spot where you're safe. That was kind of my mental image when I wrote this code. This application basically uses Apstra to manage the fabric. Jay spoke about configlets, which are basically config snippets that you can push out through Apstra, and Jay spoke about probes, which basically tell you what's happening. So I'll just let it go.
15:04 We'll run it at two times speed, because we like fast. The situation we see here is that there are both ECN and PFC anomalies happening on the fabric, because the Ixia is pushing traffic out. Whenever we see PFCs or tail drops, we start moving left, and that's basically what's going to happen. What you also see here is that, because it's so much fun to watch log messages, I put everything up on ServiceNow, so it basically updates an incident ticket. The general idea is that we want to make the network admin's life easy: the network admin would much rather see what happened than be told "something really bad happened." So as this application sees PFCs, it's going to start shifting left, basically shifting that window left, hopefully slowing the traffic down, and it keeps updating ServiceNow: "hey, I just dropped it down, I dropped it down some more, I dropped it down some more," and so on. That's basically what we're seeing here. While this is going on: this was written in Python, using the REST API that Jay spoke about. And you can already see that it started at 60/90, it's already gone to 40/70, so two cycles have happened, and then it went to 30/60.
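In other words, the app is shifting (Kmin, Kmax) left in lock-step steps of 10, presumably passing through 50/80 between the two cycles shown: 60/90 → 50/80 → 40/70 → 30/60.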
16:43 Now, at this point, in a couple of seconds it's going to notice that there are no PFCs. The PFCs are gone; there are still ECN marks, so basically we've moved far away from the edge. But we still want the application to breathe, and we know we are close to the edge, so now we start moving right, to actually find the edge. So we go right, and we find the edge. It basically knows now that if it sees PFCs again, that's the danger zone. (If I were a singer I'd start singing the "Danger Zone" song, but I'm not, so everyone is safe.) So it sees the edge, it knows the edge is there, and now it wants to find a spot where it can still breathe but isn't going to fall over. So now it's going to start moving left slowly: when we moved initially, we moved in steps of 10; now we're going to move in steps of 5, just to find the edge. So from 40/70 it went to 35/65; it moved a little bit. At this point I'm willing to take bets on whether it's going to stop here; I know what's going to happen.
18:02 Will it stop here, or go further? So, yeah, you have about 80 seconds for your bets. The general idea is that because you're kind of safe here, you pause here longer, so that the changes can propagate and the system can settle down for a bit. And once it settles down in a particular zone for long enough, this particular application will just stop: it'll basically say, "I think I found the safe spot," and it'll stop at that point. It could also potentially just keep running: it could say, "I don't see any PFCs, I've not seen any PFCs in the last minute, minute and a half, maybe I can move again, maybe I can go to a better spot." So it can basically run in infinite mode, a while(1) loop that never stops, or it can say, "I'm done, this is good, this is where I'm going to stop." It's basically up to you, and that's kind of the power of Apstra: you can make your application like this. This whole concept did not exist three or four months ago, and now it's a thing, and we're talking to the Apstra PLM, so it's going to become a thing in Apstra. That's kind of where we want to go here: we want to use this as a general framework where anybody can develop their own applications with Apstra. This code is going to go out on GitHub, like everything else, and anybody can do whatever they want with it. You want to make config backups? Make config backups, be happy. Use our code, be happy. That's really what we want. And as you can see, 30/60: if anybody bet 30/60, you won. At 30/60 everything is happy, and it's stopped.
20:11 That's 30/60. There's thesis-quality research to be done here, to be honest, because we are doing something very simple and straightforward. The window is only 30 wide, and we move the window in lock step; we could expand or contract the window. We could also change the percentages: right now the marking probability goes from 0 to 100, so before you hit the window the probability is zero, and across the window it's a linear probability. You could play with that. Somebody asked whether we can change this based on completion time; of course we can. This is a pattern: the pattern in which this window was moved is something that, if you can match it against completion times over a period of time for different kinds of loads, you can figure out what the ideal pattern is, and then, well, I wish I had a joke about learning, but I don't. This application can be smart: it can know that this kind of workload would probably need this kind of window, and automatically do that.

21:31 This Python application, you mean?

21:33 Yeah, I just wrote Python code.

21:35 Well, you're describing an ability of the networking software engineer, using whichever platform they prefer (you prefer Python), to do all these kinds of things.

21:47 Yeah, exactly, that's what I meant.

21:49 But what we would like to see is that Apstra has an AI in it that determines...

21:54 No, we don't... we don't want to say that at this moment.
22:00 My question on the Python: is there a good Python library for Apstra, or are we directly making REST API calls inside the Python code?

22:09 We're coming up with a brand-new API library. I used an older version, but the latest, like a really fancy SDK, is coming out, so you'll have that for sure.
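For flavor, here is a minimal sketch of driving Apstra directly over REST with the `requests` library, along the lines described; treat the endpoint paths and payload shapes as illustrative assumptions rather than the documented Apstra API:

```python
import requests

class ApstraClient:
    """Tiny illustrative wrapper; endpoint paths are assumptions, not the
    documented Apstra API surface."""

    def __init__(self, server: str, username: str, password: str):
        self.base = f"https://{server}/api"
        self.session = requests.Session()
        self.session.verify = False  # typical for a lab with self-signed certs
        resp = self.session.post(
            f"{self.base}/user/login",
            json={"username": username, "password": password},
        )
        resp.raise_for_status()
        self.session.headers["AuthToken"] = resp.json()["token"]

    def probe_state(self, blueprint_id: str, probe_id: str) -> dict:
        # Read the current state of an analytics probe (e.g. PFC counters).
        url = f"{self.base}/blueprints/{blueprint_id}/probes/{probe_id}"
        resp = self.session.get(url)
        resp.raise_for_status()
        return resp.json()
```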
22:29 You didn't talk about flowlets in the optimization, but that's another potential thing that you could turn on or off based on how things are going.

22:37 Yes. I mean, I don't make those decisions, but I would get some grad students for this work: they get a thesis, we get an application, everyone's happy. But you're right: we took the hardest problem first, congestion management in Ethernet, and now we are looking at others, and the one you pointed out is on our list.
23:07 To expand on that, there was an earlier question on SmartNICs as well. One of the ideas we had, and we've started building it (call this one app one, this is app two), is to look at congestion metrics not just on the switches but also on the SmartNICs. If you see, for example, out-of-order packets on the SmartNIC, then you know that your DLB (dynamic load balancing) is not configured properly, so you can go tweak your DLB timeout on the switches to respond to congestion that you see on the SmartNIC. That's an app that is going to follow very soon after this app as well.
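A hedged sketch of the decision that second app implies; the threshold, the telemetry reader, and the switch-side setter are all hypothetical stand-ins:

```python
OOO_THRESHOLD = 1000  # assumed out-of-order packets per sampling window

def read_nic_out_of_order() -> int:
    """Placeholder: pull out-of-order packet counters from SmartNIC telemetry."""
    raise NotImplementedError

def set_dlb_inactivity_timeout(microseconds: int) -> None:
    """Placeholder: push a new DLB inactivity timeout to the switches."""
    raise NotImplementedError

def tune_dlb(current_timeout_us: int = 16) -> int:
    # Heavy reordering at the NIC suggests DLB is re-pathing flows too eagerly;
    # a longer inactivity timeout keeps flowlets on one path longer.
    if read_nic_out_of_order() > OOO_THRESHOLD:
        current_timeout_us *= 2
        set_dlb_inactivity_timeout(current_timeout_us)
    return current_timeout_us
```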
23:37 Yeah, thanks, Praful, I'd forgotten: Michael from Juniper is working on that, and the next thing I'm going to do is integrate his work with this code, so that we can have more smartness.