Load Balancing in the AI Data Center
AI/ML workloads in data centers generate a distinctive traffic pattern known as “elephant flows”: large volumes of remote direct memory access (RDMA) traffic, typically produced by the graphics processing units (GPUs) in AI servers. It is essential that fabric bandwidth be used efficiently even with low-entropy workloads. Juniper’s Arun Gandhi, Mahesh Subramaniam, and Himanshu Tambakuwala discuss efficient load balancing techniques, and their pros and cons, within the AI data center fabric.
You’ll learn
The pros and cons of various load balancing techniques within the AI data center fabric
How “elephant flows” from large GPU clusters are handled by Juniper’s AI data center load balancing
Transcript
0:05 Arun: Hello everyone, welcome to the second episode of our video series on AI data centers. In the last episode, with Mikel, we discussed the popularity and adoption of RoCEv2 and the advanced options it supports for proper congestion control. We also discussed how RoCEv2 components coordinate inside the DC fabric, particularly PFC and ECN inside the IP Clos fabric, and briefly touched on the advanced RoCEv2 options for getting the congestion control settings right. With a few load balancing efficiencies emerging in the industry, today we will discuss cutting-edge technologies for load balancing, and techniques for balancing flows from GPU clusters. To kick off our discussion, I'm joined by two more special guests and good friends, Mahesh and Himanshu. Mahesh and Himanshu, welcome to episode two.

1:04 Mahesh: Thank you, Arun. I'm glad to be in the hot seat with you.

1:09 Himanshu: Thanks, it's great to be part of this discussion with you and Mahesh.

1:13 Arun: So Mahesh, I'll kick off with my first question to you. Load balancing is not a new concept; it has long been a key feature for improving application performance by improving response times and reducing network latency. But why has it become a hot topic in AI infrastructure today?

1:33 Mahesh: The short answer: elephant flows with low entropy are what make us focus on efficient load balancing in the data center fabric, Arun.

1:49 To elaborate: any AI infrastructure has two phases, training and inference. Specifically for training, we connect all the GPUs together (we call it a GPU cluster) to train the model. In the GPU cluster, a GPU moves memory chunks to other GPUs; we call such a chunk a gradient, and at the end of the day it is the result of the training. Within a server, for example an NVIDIA server, the memory chunks move between the GPUs via NCCL, the NVIDIA Collective Communications Library (other vendors have their equivalents, such as RCCL). But if you're moving a memory chunk from one server to another server, that's where you need a fabric. Moving the memory chunk across the fabric is what we technically call RDMA traffic. That RDMA traffic is crucial, because we need to synchronize those results across all the GPUs in the cluster, and it is also sensitive, because RDMA, remote direct memory access, moves memory from one place to another using queue pairs.

3:08 The RDMA traffic is also large, with low entropy. Entropy means differentiation in the packets: with good differentiation in the packet headers, we can easily segregate the traffic and spray it, that is, load balance it, across the parallel links of a given switch. Without much differentiation, that is, with low entropy, it is very difficult to load balance the traffic.

3:38 And if you don't have proper load balancing in the fabric, everyone knows what follows: a high probability of congestion in the fabric. With congestion there will be packet drops, and with packet drops your job completion time automatically goes up. So to have a lossless, congestion-free fabric, load balancing has become prominent in the AI data center cluster, specifically in the training cluster. We must have proper, efficient load balancing in the AI data center fabric.
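To make the low-entropy point concrete, here is a minimal Python sketch of static ECMP hashing. The hash function and link count are illustrative assumptions, not Juniper's implementation; RoCEv2's use of UDP destination port 4791 is standard. With thousands of distinct flows the hash spreads traffic evenly, but a handful of elephant RDMA flows yield only a handful of hash values, so some links congest while others sit idle.

```python
import hashlib
import random
from collections import Counter

NUM_LINKS = 8

def ecmp_link(src_ip, dst_ip, src_port, dst_port, proto=17):
    """Hash the 5-tuple and pick one of the parallel links (static ECMP)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LINKS

# High entropy: thousands of flows with varied source ports spread out evenly.
high = Counter(
    ecmp_link("10.0.0.1", "10.0.1.1", random.randint(1024, 65535), 4791)
    for _ in range(10_000)
)

# Low entropy: four elephant RDMA flows (RoCEv2 rides UDP destination port
# 4791) produce only four hash values, so at most four of the eight links
# ever carry traffic, and two elephants may even collide on the same link.
low = Counter(
    ecmp_link("10.0.0.1", "10.0.1.1", sport, 4791)
    for sport in (49152, 49153, 49154, 49155)
)

print("flows per link, high entropy:", dict(high))
print("flows per link, low entropy: ", dict(low))
```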
4:18 Arun: Now that we understand the importance of load balancing in AI data centers, Himanshu, that brings us to my next question: what load balancing methods are prevailing in the industry and supported by Junos, and why are they, or aren't they, a good fit for AI data centers?

4:38 Himanshu: Our Juniper QFX 5K switches currently support two load balancing mechanisms. The first is static hash-based load balancing (SLB), and the second is dynamic load balancing (DLB).

4:53 Whenever a packet comes into a switch, the switch looks into the packet header and computes a hash over it. It then looks the hash up in the flow table to see whether an entry already exists. If there is an entry, the flow is considered already present and active, so the switch takes the outgoing interface mapped to that entry and forwards the packet out of that link. If there is no entry, the packet is treated as a new flow, so the switch has to create an entry in the flow table, and for that it needs to find a good outgoing interface. With SLB, it looks at the flows mapped to each outgoing interface, finds the link with the least number of flows mapped to it, and assigns the new flow to that interface. This works out well in most cases, where there are many flows and the bandwidth within each flow is not huge; with elephant flows, though, it does not result in efficient load balancing.
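Here is a minimal sketch of the SLB behavior just described, assuming a simple flow table keyed by the packet hash; it is illustrative, not the QFX data-plane code. Note that assignment counts flows rather than bytes, which is exactly why one elephant can saturate a link that looks lightly loaded.

```python
from collections import defaultdict

class StaticLoadBalancer:
    """Flow-count-based link assignment, as described above (sketch only)."""

    def __init__(self, links):
        self.links = links
        self.flow_table = {}                   # flow hash -> outgoing link
        self.flows_per_link = defaultdict(int)

    def forward(self, five_tuple):
        h = hash(five_tuple)
        if h in self.flow_table:
            # Existing, active flow: keep using its assigned link.
            return self.flow_table[h]
        # New flow: assign it to the link with the fewest flows mapped to it.
        best = min(self.links, key=lambda l: self.flows_per_link[l])
        self.flow_table[h] = best
        self.flows_per_link[best] += 1
        return best

slb = StaticLoadBalancer(["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3"])
print(slb.forward(("10.0.0.1", "10.0.1.1", 49152, 4791, "udp")))
```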
6:08 Himanshu: That's where dynamic load balancing comes into the picture. DLB runs an algorithm that takes the link utilization and the queue utilization into account and assigns a quality band to each outgoing interface. The quality band ranges from 0 to 7, seven being the best quality link and zero the worst. Looking at the quality bands, the switch assigns the flow to a link, and the traffic starts flowing.
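As a sketch of the quality-band idea, here is one way link and queue utilization could be folded into a 0-7 band. The weighting and scaling are assumptions made for illustration; the real algorithm runs inside the switch ASIC.

```python
def quality_band(link_util, queue_util, weight=0.5):
    """Map utilizations in [0.0, 1.0] to a band: 7 = best link, 0 = worst.
    The 50/50 weighting and linear scaling are illustrative assumptions."""
    load = weight * link_util + (1 - weight) * queue_util
    return 7 - min(7, int(load * 8))

# (link utilization, queue utilization) per outgoing interface
links = {
    "et-0/0/0": (0.10, 0.05),
    "et-0/0/1": (0.55, 0.40),
    "et-0/0/2": (0.90, 0.70),
}
bands = {name: quality_band(u, q) for name, (u, q) in links.items()}
best = max(bands, key=bands.get)   # a new flow goes to the highest band
print(bands, "->", best)           # et-0/0/0 wins with band 7
```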
6:47 Himanshu: DLB has three modes of operation. The first is assigned-flow mode: once the switch assigns a flow to a link, it keeps using that link for as long as the flow is active. The second mode is flowlet mode: once the switch assigns a flow to a link, it also keeps monitoring the flow, checking for pauses. If a pause in the flow is longer than the configured inactivity timer, the flow is considered over, and the next packet is treated as a new flow, which again goes through the process of identifying the best link and being assigned to it. This is how flowlet mode keeps rebalancing, based on the pauses in the flow.
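A minimal sketch of flowlet-mode rebalancing follows, with an assumed inactivity timer of 100 microseconds (the value is configurable on real switches; this one is illustrative).

```python
INACTIVITY_TIMER_US = 100          # illustrative; configurable in practice

flow_state = {}                    # flow hash -> (assigned link, last seen us)

def forward(flow_hash, now_us, pick_best_link):
    """Stick to the assigned link within a flowlet; re-pick after a pause."""
    entry = flow_state.get(flow_hash)
    if entry is not None:
        link, last_seen = entry
        if now_us - last_seen <= INACTIVITY_TIMER_US:
            flow_state[flow_hash] = (link, now_us)   # same flowlet: stay put
            return link
    # New flow, or the pause exceeded the inactivity timer: the next packet
    # starts a new flowlet and is assigned the currently best link.
    link = pick_best_link()
    flow_state[flow_hash] = (link, now_us)
    return link

best = lambda: "et-0/0/1"
print(forward(0xBEEF, 0, best))        # new flow -> et-0/0/1
print(forward(0xBEEF, 50, best))       # within the timer -> sticks
print(forward(0xBEEF, 1_000, best))    # pause > 100 us -> rebalanced
```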
7:35 Himanshu: The third mode is per-packet mode, in which the packets within a flow are balanced across multiple links based on link utilization. But as we know, per-packet mode causes reordering at the receiving end, so the destination NIC must have the capability to handle that reordering; that is the challenge with per-packet mode. From an AI/ML data center perspective, where, as Mahesh explained, a lot of elephant flows come into the picture, we think DLB has good potential compared to SLB.

8:09 Arun: Thanks, Himanshu. That brings me to my next question, for Mahesh: what's next with load balancing?
8:17 Mahesh: Yeah. As Himanshu explained, we can call SLB option one: static load balancing. Static load balancing doesn't have the intelligence to go and check link utilization or link health, and that's the reason we go to option two, DLB, dynamic load balancing, whose algorithm has the intelligence to understand link health as well as queue depth; we call the result a quality band.

8:51 Since you ask what's next: the problem with DLB is that the quality band, the quality information, always stays in the local switch; it is not propagated to the node on the other side, leaf or spine. That's where option three comes into the picture: global load balancing (GLB). GLB comes from Broadcom; the Tomahawk 5 ASIC supports it, and we are implementing it in our products as well. What it does is build the quality band, understanding link utilization and queue depth, and then propagate that information, which DLB cannot do: GLB can pass the quality information from the local switch to the remote switch, whether leaf or spine. The advantage is that you know not only the local link health but the quality of the whole path, so you can spray, or load balance, the traffic accordingly, specifically the elephant flows of RDMA traffic, efficiently.
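Here is a sketch of the idea behind GLB's advantage: combine the locally computed quality bands with the bands the remote switches advertise. The message format and the min() combination rule are assumptions made for illustration, not Broadcom's specification.

```python
# Local view: quality bands this leaf computed for its own uplinks.
local_bands  = {"spine1": 7, "spine2": 6}
# Remote view: bands advertised by the spines for the far side of the path,
# which plain DLB never sees.
remote_bands = {"spine1": 1, "spine2": 5}

def path_quality(uplink):
    # Treat a path as only as good as its worst hop (combination rule assumed).
    return min(local_bands[uplink], remote_bands[uplink])

best = max(local_bands, key=path_quality)
print({u: path_quality(u) for u in local_bands}, "->", best)
# DLB alone would pick spine1 (local band 7); GLB picks spine2 because
# spine1's far-side congestion is now visible to the leaf.
```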
9:54 Mahesh: We have option four as well, which we call DLB version two, or, as I would say, selective load balancing: selective DLB. What does it do? As I mentioned earlier in the discussion, the elephant flow is the key, and it's crucial to look for it. Selective DLB looks inside the RDMA traffic, at the BTH header, for the elephant flows coming out of the RDMA traffic: the RDMA WRITE verbs, which are nothing but the elephant flows. It identifies those elephant flows and sprays the traffic accordingly. So to summarize: option one is SLB, option two is DLB, option three is GLB, and option four is selective DLB.
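A sketch of the selective-DLB classification step: peek at the BTH opcode of RoCEv2 packets and give the RDMA WRITE verbs, the likely elephants, the dynamic treatment. The opcode values come from the InfiniBand RC opcode space; the classification rule itself is an assumption for illustration, not the documented ASIC behavior.

```python
ROCEV2_UDP_PORT = 4791
# InfiniBand RC opcodes for RDMA WRITE (FIRST, MIDDLE, LAST, LAST_WITH_IMM,
# ONLY, ONLY_WITH_IMM). Classifying on exactly this set is our assumption.
RDMA_WRITE_OPCODES = {0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B}

def is_elephant_candidate(udp_dst_port: int, bth_opcode: int) -> bool:
    """RoCEv2 packet whose BTH opcode marks an RDMA WRITE verb."""
    return udp_dst_port == ROCEV2_UDP_PORT and bth_opcode in RDMA_WRITE_OPCODES

def choose_link(pkt, dlb_pick, hash_pick):
    if is_elephant_candidate(pkt["udp_dst"], pkt["bth_opcode"]):
        return dlb_pick(pkt)      # elephants get the dynamic treatment
    return hash_pick(pkt)         # everything else keeps the static hash

pkt = {"udp_dst": 4791, "bth_opcode": 0x0A}   # RDMA WRITE ONLY
print(choose_link(pkt, lambda p: "dlb-link", lambda p: "hash-link"))
```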
10:44 Mahesh: But one point I want to make clear: across all these load balancing options we support, reordering is another key criterion we need to look at, namely the reordering of packets at the NIC. Right now only some NIC vendors support reordering, so we need to handle this carefully: all the spraying can happen in the fabric, but the NIC has to handle the reordering. That's the one point I want to make sure of. Second, we are part of the Ultra Ethernet Consortium technical working groups and have already started working on flexible reordering, which will create a new standard. Once flexible reordering emerges, other options will also come into the picture at the fabric level. As of today, we have the four options I mentioned earlier.
11:38 Arun: Fantastic. So Himanshu, is the industry thinking about any further enhancements to DLB, and what is Juniper doing in particular with respect to DLB?

11:50 Himanshu: Broadcom has come up with a set of features called cognitive routing, and among them is a feature called reactive path rebalancing, which is an enhancement of DLB. As we discussed for flowlet mode, DLB watches for the inactivity timer to expire within a flow before rebalancing, but with AI/ML workloads there may not always be a pause in the flow large enough to trigger that. What this feature does is keep monitoring both the flow and the quality of the link the flow is going through; if the link quality is not good, it finds a new link and assigns this particular flow to it. So, in a way, it does the rebalancing even without an inactivity timer, which is going to be very helpful for long-lived flows in particular. Other than that, Juniper is also working on providing configuration options to control the bucket size, or table size, assigned to each ECMP next hop. These are some of the aspects Juniper is working on.
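A minimal sketch contrasting reactive path rebalancing with flowlet mode: the flow is moved as soon as its current link's quality band degrades, with no pause required. The threshold value is an assumption, and note that the mid-stream move can itself cause reordering that the NIC must absorb.

```python
QUALITY_THRESHOLD = 3      # illustrative; below this band the flow is moved

def forward(flow, link_bands, assigned):
    """Keep the flow on its link while that link's band stays healthy;
    otherwise move it immediately, no inactivity pause required."""
    link = assigned.get(flow)
    if link is not None and link_bands[link] >= QUALITY_THRESHOLD:
        return link
    best = max(link_bands, key=link_bands.get)
    assigned[flow] = best
    return best

assigned = {}
print(forward("f1", {"et-0/0/0": 7, "et-0/0/1": 2}, assigned))  # -> et-0/0/0
print(forward("f1", {"et-0/0/0": 1, "et-0/0/1": 6}, assigned))  # moved -> et-0/0/1
```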
13:00 Arun: Fantastic, this has been very insightful, almost a crash course on load balancing for me and for the viewers. But before we wrap, I need to ask: what other technologies do we see being enhanced in support of AI data centers? Mahesh, do you want to take that question? That'd be great.

13:22 Mahesh: Oh yeah. Whenever and wherever we talk about AI data centers, Arun, the keyword is "lossless fabric," right?

13:32 Arun: Correct.

Mahesh: We can build the lossless fabric in different ways, but to categorize them: proactively, we can make the fabric lossless using the efficient load balancing mechanisms we just discussed. The second is the reactive mode: if there is congestion, how are we going to handle it, and what are the mechanisms to control it? That's where RoCEv2 comes into the picture, so the next topic will be congestion management, which will be really helpful for whoever is watching this video to understand how to tune it for the AI fabric. The other one is thermal management: GPUs are greedy, even more than switches, and consume a lot of power, so how efficiently we can do thermal management in the racks inside the data center may be another topic.
14:35 Arun: Fantastic. I think that's a great point to conclude on. Thank you both, Mahesh and Himanshu, for a very insightful discussion. And to all our viewers listening in: stay tuned to learn more about AI data center clusters and all the technologies supported, in the next video. Until then, stay well.

15:02 [Music]