Congestion Management in the AI Data Center
Juniper’s Arun Gandhi, Mahesh Subramaniam, and Michal Styszynski discuss the cutting-edge congestion management technologies needed for lossless RoCEv2 transport networks when connecting GPU clusters.
You’ll learn
How to practice proactive vs. reactive congestion management
The main building blocks of congestion management
Nuances of these techniques in AI data centers
Who is this for?
Host
Arun Gandhi
Guest speakers
Mahesh Subramaniam, Michal Styszynski
Transcript
0:00 [Music]
0:05 Hello everyone, welcome to the third episode of our video series on AI data centers. In the last episode with Himanshu and Mahesh, we discussed how load balancing is enabled in the AI data center fabric, with features such as dynamic load balancing and global load balancing improving the efficiency of the fabric. To kick off our discussion today I'm joined by my special guests and good friends, Mahesh Subramaniam and Michal Styszynski. Today we will discuss the cutting-edge congestion management technologies needed for lossless RoCEv2 transport networks when connecting GPU clusters. Mahesh and Michal, welcome to episode three.

0:49 Hey Arun. Hey Mahesh. Hey Arun, great to see you again. Hi Arun, happy to be in your video again, and it looks like today I did not get the memo.

1:01 So let's start with you, Mahesh. In the previous video we discussed the advanced load balancing techniques in AI data center clusters. These backend AI clusters are often built with a 1:1 oversubscription ratio or have very efficient load balancing in place, so why is congestion management still needed in AI clusters?
1:27 Yes, Arun. Congestion management really starts from efficient and effective load balancing. Everybody knows that in a GPU cluster infrastructure the overall goal is a reduced job completion time; that's the key KPI. But in the data center fabric itself, our key KPI is a lossless fabric, because during an RDMA transmission from one GPU to another, even a single drop can make the job cycle run again.

2:00 So the real game here is how we handle congestion proactively versus reactively. What I mean is that in a GPU cluster fabric we have to control congestion proactively using efficient load balancing. In a little more detail: in the proactive method, if congestion happens on a particular link or path, we switch that flow from the congested link to a less congested link. There are various methods; in the last video we discussed dynamic load balancing (DLB) and selective DLB, which identifies elephant flows from the BTH header and sprays packets accordingly. But at any point in time, if a link gets congested due to a microburst, we have to jump in immediately with reactive methods. That's where a proper congestion management scheme comes into the picture, via ECN and PFC.
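To make the proactive half of this concrete, here is a minimal, vendor-neutral sketch (not Juniper code) of the kind of decision a DLB-style mechanism makes: each ECMP member link keeps a simple quality score built from its recent utilization and queue depth, and a new flowlet is steered to the least-congested member. The link names, weights, and thresholds below are illustrative assumptions only.

```python
# Hypothetical sketch of DLB-style member selection for one ECMP group.
# Link names, weights, and thresholds are illustrative, not a real ASIC API.

from dataclasses import dataclass

@dataclass
class MemberLink:
    name: str
    utilization: float   # 0.0-1.0, moving average of link load
    queue_depth: int     # cells currently queued on the egress port

def quality(link: MemberLink, max_queue_cells: int = 4096) -> float:
    """Lower is better: combine link load and queue occupancy into one score."""
    return 0.5 * link.utilization + 0.5 * (link.queue_depth / max_queue_cells)

def pick_member(members: list[MemberLink]) -> MemberLink:
    """Proactive step: steer a new flowlet to the least-congested member."""
    return min(members, key=quality)

if __name__ == "__main__":
    ecmp_group = [
        MemberLink("et-0/0/1", utilization=0.92, queue_depth=3800),  # congested
        MemberLink("et-0/0/2", utilization=0.35, queue_depth=200),
        MemberLink("et-0/0/3", utilization=0.55, queue_depth=900),
    ]
    best = pick_member(ecmp_group)
    print(f"new flowlet pinned to {best.name}")  # expected: et-0/0/2
```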
3:01 This is an interesting start, Mahesh. Can you share a few scenarios for the reactive approach?
3:07 Yes, there are multiple scenarios, but first I want to go back to a bit of the basics. In any data center we have different congestion points, and we call them incast congestion points. What does incast mean? Basically, multiple inputs converging on only one output will create congestion. This incast congestion generally happens at three points: GPU cluster to leaf, that's point one; leaf to spine, that's point two; or spine to leaf, that's point three. So how are we going to handle congestion in the switch, or in the switch fabric? That's where we need proper congestion management methods.

3:49 Since you asked about scenarios, take an example. Every switch ASIC has an MMU, a memory management unit. This memory management unit has two parts, the ingress traffic managers ITM0 and ITM1. The ITM allocates buffer to the switch; it can be dedicated buffer or shared buffer, depending on the ASIC and switch you are using. When a queue in that buffer fills up, you get congestion, and we need to avoid that congestion. That's where we start using a key congestion mechanism for RoCEv2 traffic called DCQCN, Data Center Quantized Congestion Notification. DCQCN is nothing but the combination of ECN plus PFC, and that's what we are going to look at in this video in a detailed way.
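As an illustration of the switch-side half of DCQCN that Mahesh outlines, the sketch below shows WRED-style ECN marking on a lossless queue: once the queue's occupancy crosses a minimum threshold, packets are marked with increasing probability instead of being dropped. The thresholds and the simplified queue model are assumptions for illustration, not an actual MMU's behavior.

```python
# Simplified WRED/ECN marking decision for one lossless queue.
# Thresholds are illustrative; real MMUs work in cells and per-ITM pools.

import random

ECN_MIN_KB = 150    # start probabilistic marking above this occupancy
ECN_MAX_KB = 1500   # mark every packet at or above this occupancy
MAX_MARK_PROB = 0.9

def should_mark_ecn(queue_kb: float) -> bool:
    """Return True if the packet's ECN field should be set to CE (11)."""
    if queue_kb <= ECN_MIN_KB:
        return False
    if queue_kb >= ECN_MAX_KB:
        return True
    # Linear ramp of marking probability between the two thresholds.
    prob = MAX_MARK_PROB * (queue_kb - ECN_MIN_KB) / (ECN_MAX_KB - ECN_MIN_KB)
    return random.random() < prob

if __name__ == "__main__":
    for occupancy in (100, 300, 800, 1600):
        marks = sum(should_mark_ecn(occupancy) for _ in range(10_000))
        print(f"queue={occupancy:>4} KB -> ~{marks / 100:.0f}% of packets marked CE")
```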
4:50 This is very helpful, and I want to move to Michal here. Michal, can you remind us what the main building blocks of congestion management are?
5:01 are two methods right for congestion
5:03 management right the explicit congestion
5:05 notification is in fact the the most
5:08 popular uh portion of the congestion
5:10 management and there is obviously the
5:12 second mechanism which is the priority
5:14 flow control using the
5:17 dscp both of these mechanism run on the
5:20 native IP Fabrics where you have Leaf
5:22 spine super spine topology but in some
5:25 scenarios only the ecn the explicit
5:28 congestion notification is used as a
5:30 primary one which simply informs the
5:34 destination uh server about the the
5:38 congestion right so of course there is a
5:40 prerequisite to put this functionality
5:43 on the switches and without uh having
5:47 that functionality on the ni cards we
5:49 can't really uh rely on that congestion
5:52 management so that's the portion which
5:54 is very important making sure that the
5:56 congestion management we are triggering
5:58 on the switches is also Al supported on
6:00 the ni card itself right so the driver
6:03 of the server have to has to be uh
6:06 capable of of interpretation of the ecn
6:10 messages received from from the from the
6:13 network right and so the ecn is is
6:17 simply mechanism which will inform the
6:20 destination server about the congestion
6:22 when actually it reaches the threshold
6:25 set at the perq level on the switches
6:28 right so for example set in your uh
6:31 spine devices on a specific queue on the
6:33 lossless queue that the when you reach
6:37 that threshold you will start marking
6:39 your pocket with one one bat which is
6:42 the most significant beats of the to uh
6:46 on the IP and you will inform the
6:48 destination server about the the the
6:50 scenario of the uh of the congestion
6:52 right in this case the server will get
6:55 that information and will react to it
6:57 right so it will send back the
6:59 information to the originating server
7:02 and it will just mark it with specific
7:04 values to inform it that he needs to
7:07 slow down a little bit for very short
7:09 time and then you know reduce the
7:11 congestion or simply uh eliminate the
7:13 congestion right so when it it sends
7:16 back the the CNP message to the
7:18 originating server the source of the of
7:21 the of the rocky V2 packet then it will
7:24 get this information with the uh Des CER
7:29 information so that we know exactly for
7:31 which session it needs to reduce the the
7:33 rate and also the partition key is part
7:36 of the information right when it sends
7:38 the CNP packet back to the originator
7:41 right so then the partition uh
7:43 information The Logical information is
7:45 also leveraged so that the originating
7:47 server knows exactly on which of the
7:50 session it needs to reduce the uh the
7:52 rate in order to prevent uh from of of
7:56 of having this congestion inside the
7:57 network right and then the second
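A rough model of the sender-side reaction described here: when the source NIC receives a CNP for a given queue pair, a DCQCN-like algorithm cuts that flow's rate multiplicatively and then recovers it gradually while no further CNPs arrive. The constants and the recovery step are illustrative assumptions; real NIC firmware keeps more state (alpha, byte counters, timers).

```python
# Toy DCQCN-style sender rate control for one RoCEv2 queue pair.
# Constants are illustrative; real NICs track alpha and use timed/byte stages.

class QueuePairRate:
    def __init__(self, line_rate_gbps: float):
        self.line_rate = line_rate_gbps
        self.current = line_rate_gbps
        self.target = line_rate_gbps

    def on_cnp(self) -> None:
        """CNP received for this QP: remember the old rate, cut the current one."""
        self.target = self.current
        self.current = max(self.current / 2, 0.1)   # multiplicative decrease

    def on_recovery_tick(self) -> None:
        """No CNP in the last timer period: move halfway back toward the target."""
        self.current = min((self.current + self.target) / 2, self.line_rate)

if __name__ == "__main__":
    qp = QueuePairRate(line_rate_gbps=400.0)
    qp.on_cnp()
    print(f"after CNP: {qp.current:.0f} Gbps")        # 200 Gbps
    for _ in range(4):
        qp.on_recovery_tick()
    print(f"after recovery: {qp.current:.0f} Gbps")   # approaching 400 Gbps
```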
8:00 Then the second mechanism, PFC, is actually triggered after explicit congestion notification, when the congestion still persists. If we trigger ECN and ECN actually reduces the congestion, then PFC is sometimes not even triggered back toward the originating server. And instead of a situation where the switch sends both PFC and ECN toward the destination, PFC is sent toward the originating server directly from the switch, hop by hop. It acts on each segment and can additionally tell each segment to slow down, reduce the rate, and again prevent a congestion scenario inside the network. Both of these mechanisms are supported on the switching side, but I will repeat that it's also important to have the right interpretation of these values on the NIC card. The good thing is that these are the two most widely supported congestion mechanisms in the industry. There are other ideas about how to run congestion management on the fabric, but the reality is that you need both of these pieces, so this combination is really the most popular one and is available across multiple vendors. You can build your RoCEv2 AI/ML fabric using different vendors, and they can all support both of these standardized congestion management mechanisms.
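To illustrate the hop-by-hop nature of PFC contrasted with ECN above, the sketch below models one ingress priority that sends XOFF to its upstream neighbor when buffer occupancy crosses a high watermark and XON once it drains below a low watermark. The watermark values and the simplified state machine are assumptions; real PFC frames carry per-priority pause quanta, and the thresholds are derived from headroom calculations.

```python
# Simplified per-priority PFC pause/resume decision on one ingress port.
# Watermarks are illustrative; real switches derive them from headroom and alpha.

from typing import Optional

XOFF_KB = 900   # pause the upstream hop when this priority's buffer use crosses this
XON_KB = 600    # resume once it drains back below this

class PfcPriorityState:
    def __init__(self) -> None:
        self.paused_upstream = False

    def update(self, buffer_kb: float) -> Optional[str]:
        """Return 'XOFF'/'XON' when a pause/resume frame should go upstream."""
        if not self.paused_upstream and buffer_kb >= XOFF_KB:
            self.paused_upstream = True
            return "XOFF"
        if self.paused_upstream and buffer_kb <= XON_KB:
            self.paused_upstream = False
            return "XON"
        return None

if __name__ == "__main__":
    prio3 = PfcPriorityState()   # e.g. the lossless RoCEv2 priority/queue
    for occupancy in (200, 700, 950, 980, 650, 550):
        action = prio3.update(occupancy)
        if action:
            print(f"buffer={occupancy} KB -> send {action} to the upstream hop")
```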
9:29 I'm glad you mentioned that both ECN and PFC are very well-known methods. So are there any nuances in these techniques for AI data centers?
9:42 That's a good point. Arun, last time we discussed some more advanced functionality, for example how to manage PFC using the concept of alpha values. You can set different alpha values and thereby provision the way XOFF is triggered to pause the streams toward the originating server. These values can be enabled at the per-queue level, and then you can decide, for example, that for a specific large language model you will provision a little more buffer, so the XOFF PFC messages will be triggered with lower probability. So that's one thing.
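The alpha value mentioned here typically feeds a dynamic-threshold calculation: the XOFF point for a queue is not a fixed number but a fraction (alpha) of the shared buffer that is currently free, so a queue is allowed to grow larger when the switch is lightly loaded. The sketch below shows that relationship with made-up pool sizes; the exact formula and units differ per ASIC.

```python
# Dynamic-threshold sketch: the XOFF point scales with the free shared buffer.
# Pool sizes and alpha values are illustrative assumptions.

def dynamic_xoff_threshold_kb(alpha: float, shared_pool_kb: float,
                              shared_in_use_kb: float) -> float:
    """A queue may grow up to alpha * (free shared buffer) before XOFF fires."""
    free_kb = max(shared_pool_kb - shared_in_use_kb, 0.0)
    return alpha * free_kb

if __name__ == "__main__":
    SHARED_POOL_KB = 64_000
    for alpha in (1 / 8, 1 / 2, 2.0):          # conservative .. permissive
        for in_use in (8_000, 48_000):          # lightly vs. heavily loaded switch
            thr = dynamic_xoff_threshold_kb(alpha, SHARED_POOL_KB, in_use)
            print(f"alpha={alpha:<5} shared in use={in_use:>6} KB "
                  f"-> XOFF at ~{thr:,.0f} KB per queue")
```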
10:35 The other thing we discussed last time was the PFC watchdog, quite a popular functionality across different vendors. If you rely on PFC in your network, you can control situations where the network is experiencing abnormal triggering of PFC, where pause frames are received too often. On the switch you can say: if I get too many of these PFC back-pressure frames within a given window of time, I will simply ignore them and stop pushing them down toward the originating servers.
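A minimal sketch of the watchdog behavior just described, under the assumption that the switch counts pause frames received on a priority within a sliding window and, past a limit, stops honoring them for a while (implementations may also drain or flush the stuck queue). The window length, limit, and action are illustrative; vendor implementations differ.

```python
# Toy PFC watchdog: ignore a priority's pause storms past a per-window limit.
# Window, limit, and resulting action are illustrative assumptions.

from collections import deque

WINDOW_SEC = 0.2     # sliding window for counting received PFC pause frames
MAX_PAUSES = 50      # more than this within the window is treated as abnormal

class PfcWatchdog:
    def __init__(self) -> None:
        self.pause_times: deque[float] = deque()

    def on_pause_frame(self, now_sec: float) -> bool:
        """Return True if the pause should be honored, False if ignored."""
        self.pause_times.append(now_sec)
        while self.pause_times and now_sec - self.pause_times[0] > WINDOW_SEC:
            self.pause_times.popleft()
        if len(self.pause_times) > MAX_PAUSES:
            # Abnormal pause storm: stop propagating back-pressure downstream.
            return False
        return True

if __name__ == "__main__":
    wd = PfcWatchdog()
    # Simulate a storm: one pause frame every millisecond.
    decisions = [wd.on_pause_frame(now_sec=i / 1000) for i in range(200)]
    print(f"honored {decisions.count(True)} pauses, ignored {decisions.count(False)}")
```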
11:12 But you asked me about nuances, and there are others. One of them is simply how you handle your buffers at the switch level. You can of course handle this at the per-queue level for specific congestion management mechanisms, as I said, using the alpha values for PFC. But on the switch itself you can also manage these buffers and reallocate the dedicated buffers from interfaces you are not using: for example, reallocate that buffer memory to the shared pool, giving the shared pool a little more buffer to use in a congestion situation. That would be another area to explore. Whenever you have a leaf/spine/super-spine IP fabric topology, make sure you have consistency at each level in how your buffers are treated. The buffers associated with the congestion management we've discussed so far are very important, so make sure the switch is provisioned in the right way.
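As a back-of-the-envelope illustration of the buffer reprovisioning described above, the sketch below reclaims the dedicated buffer of unused ports and counts it toward the shared pool that congested lossless queues can draw from. The per-port and total sizes are invented for the example; real buffer carving is done per ITM, in cells, with headroom set aside for PFC.

```python
# Back-of-the-envelope buffer carving: reclaim unused ports' dedicated buffer
# into the shared pool. All sizes are illustrative, expressed in KB.

DEDICATED_PER_PORT_KB = 96

def rebalance_shared_pool(total_buffer_kb: int, ports_total: int,
                          ports_in_use: int) -> dict:
    unused_ports = ports_total - ports_in_use
    reclaimed_kb = unused_ports * DEDICATED_PER_PORT_KB
    dedicated_kb = ports_in_use * DEDICATED_PER_PORT_KB
    shared_kb = total_buffer_kb - dedicated_kb   # everything else is shareable
    return {"dedicated_kb": dedicated_kb,
            "shared_kb": shared_kb,
            "reclaimed_from_unused_ports_kb": reclaimed_kb}

if __name__ == "__main__":
    # e.g. a 64-port switch with only 32 ports cabled toward GPUs and spines
    result = rebalance_shared_pool(total_buffer_kb=132_000,
                                   ports_total=64, ports_in_use=32)
    print(result)
```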
12:19 Great points, Michal. I think this has been a great session so far, but before we close I must ask this question, and I'll go to Mahesh here. Mahesh, what other technologies are being enhanced to support AI data centers?
12:36 Arun, you asked about the data center from the fabric perspective. There are two areas, congestion management and load balancing, and a lot is happening in both. Let me start with congestion management. There is work on how we handle congestion and queuing at the edge, with EQDS, the edge-queued datagram service, and of course you know Amazon has already started using SRD, Scalable Reliable Datagram, at the transport level. In addition, in the Ultra Ethernet Consortium, which we are very much part of, there is SFC, source flow control, in the evolving IEEE 802.1Qdw standard, along with drop congestion notification and congestion signaling: how you handle the congestion tag and how you reflect it, what we call congestion tags and congestion-reflection tags. These are the things happening in the congestion management area. And I forgot to mention one more thing: credit-based flow control and scheduling is also one of the key efforts happening in the Ultra Ethernet Consortium. Michal, one of my colleagues, and I have already started doing a lot of research on it, and once it's available it will definitely be incorporated in our switches as well. That's for congestion management.

14:01 On the load balancing side, many things are happening again. One of the key features under cognitive routing is GLB, the global load balancing mechanism Broadcom is introducing. Dynamic load balancing has only local significance, whereas GLB can communicate the quality table and link queue depths from one switch to another using a proprietary mechanism, and we have already started supporting that in our Juniper switch portfolio. The second is in-network collectives, which is soon going to be the talk of the town: basically, how the collectives communicate from one server to another through the fabric switches. That's what in-network collectives are about, and the goal is to reduce data copies and data movement; as a consequence we reduce link utilization, and job completion time automatically comes down. These are the two areas we are working on, and of course thermal management and power efficiency are also key parts of our strategy, which we will do a deep dive on in future videos as well.
15:15 It's been a great learning experience, and I'm sure our viewers are enjoying listening to both of you as well. With that, we conclude today's session. Thank you both, and to our viewers: stay tuned to learn more. We'll be coming up with a new episode shortly.
15:37 [Music]