Arun Gandhi, Senior Product Marketing Manager, Juniper Networks

RDMA Over Converged Ethernet Version 2 for AI Data Centers

AI & ML | Data Center

Could RoCE supplant InfiniBand as a go-to solution for AI and ML networking?

AI and ML applications are growing at a fast pace in data centers. Ethernet-based networks are gaining interest as an alternative to InfiniBand in AI data center networking. RDMA over Converged Ethernet version 2 (RoCEv2) encapsulates RDMA/RC protocol packets within UDP packets for transport over Ethernet networks. Juniper’s Arun Gandhi and Michal Styszynski discuss the data transfer protocols and congestion considerations for AI workloads.


You’ll learn

  • How the value of RoCEv2 has increased with the rapid rise of AI and ML models that require lots of parallel processing capacity at scale

  • How RoCEv2 components are coordinated inside the data center fabric

Who is this for?

Network Professionals

Host

Arun Gandhi
Senior Product Marketing Manager, Juniper Networks

Guest speaker

Michal Styszynski
Product Manager, Juniper Networks

Transcript

0:08 Arun: Hi everyone, welcome to another season of the video series where you learn about cutting-edge technologies for your data center. In the last series, almost two years ago, we discussed BGP unnumbered, RDMA over Converged Ethernet version 2, otherwise known as RoCEv2, and IP fabrics for modern data centers. As you know, recent advances in generative AI have captured the imagination of hundreds of millions of people around the world. Data centers are the engines behind AI, and data center networks play a critical role in interconnecting, and maximizing the utilization of, the costly GPU servers that perform the compute-intensive processing in an AI training data center. Today I'm joined by my special guest and good friend, Michal Styszynski. Michal, always a pleasure to sit down with you and discuss cutting-edge technological advances in the data center space.

1:12 Michal: Hi everyone, thanks for having me, Arun. You are absolutely to the point. As a matter of fact, looking back at the technologies we discussed just after the COVID era, we realized that in the last two years RDMA over Converged Ethernet has simply exploded in popularity. While we can say that the baseline of the technology stays the same, the way we use it has changed a little bit, and that is of course in the context of the explosion in popularity of ChatGPT, large language models, artificial intelligence, and machine learning. When building these infrastructures, we need a lot of parallel processing capacity at scale. That means we are using additional components inside the server, such as GPUs, which accelerate the way we can process data in parallel. Instead of the CPU approach, which is a serialized approach to processing data, with the GPUs in the same server we can process the data in parallel and deliver the outcomes to the user quickly and on time. In order to exchange the data, and to have the cycles of data processing occur on different servers, we need a technology to synchronize the data between the servers, and one of those technologies is in fact the RoCEv2 we discussed over two years ago.

2:52 So did anything change? You want to make sure that we are actually up to date on this technology. There are some components that changed. The popularity of the technology increased, and there is a reason for that: the technology simply offers much better resiliency compared to, for example, the centralized model of InfiniBand, where a controller grants access to the resources of the fabric. In the case of Ethernet IP fabrics, where we transport the InfiniBand payloads across the fabric, we have a fully distributed architectural model, so that's one of the advantages. What also changed compared to what used to be is the scale; the requirements are much bigger now. We have, for example, the concept of running dedicated RoCEv2 networks in the back end of AI/ML clusters, where the technology is used in leaf, spine, and super-spine types of deployment, but the oversubscription ratio is 1:1, compared to traditional data center deployments, where we had 1:3, 1:2, sometimes even 1:6. Even where the oversubscription ratio from leaf to spine goes to 1:1, RoCEv2 will still play a significant role in some situations. It's also the case whenever we have a rail-optimized deployment, where we connect the GPUs on the same switch but have interconnectivity between different stripes of leaf devices through the spine; that stripe-to-stripe connectivity may require RoCEv2, which has been available in the industry for quite some time now. So that's what changed: some architectural aspects, and some additional things we'll probably be discussing later on. We definitely see advances which confirm that whoever started working on this technology some time ago made the right choice in the decision-making process.

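To make the oversubscription arithmetic concrete, here is a minimal sketch (in Python, with hypothetical port counts that are not taken from the discussion) of how a leaf's ratio is computed from its server-facing and spine-facing capacity.

```python
def oversubscription(downlink_ports: int, downlink_gbps: int,
                     uplink_ports: int, uplink_gbps: int) -> float:
    """Ratio of server-facing (southbound) to spine-facing (northbound) capacity."""
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# Hypothetical AI back-end leaf: 32 x 400G down to GPUs, 32 x 400G up to spines
print(oversubscription(32, 400, 32, 400))   # 1.0 -> the 1:1, non-blocking case
# Hypothetical traditional leaf: 48 x 100G down to servers, 4 x 400G up
print(oversubscription(48, 100, 4, 400))    # 3.0 -> the classic 1:3 case
```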
5:05 Arun: Fantastic. I'm glad that we now understand how the value of the technology has increased over time, especially as we hear more and more about RoCEv2. That brings me to my next question: how are the RoCEv2 components coordinated inside the DC fabric? I recall two components in particular, PFC and ECN, inside the IP Clos fabric.

5:32 Michal: It's a good point, Arun. In fact, you cited the two big components. I mentioned in the first part of our discussion that we want to transport the InfiniBand payloads across the fabric using UDP encapsulation. Between the leaf devices, from one GPU to another GPU, we are syncing these chunks of data using transport over IP/UDP on Ethernet.

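As a rough illustration of that encapsulation, the sketch below hand-assembles the part of a RoCEv2 packet that rides inside the UDP datagram: an InfiniBand Base Transport Header (BTH) in front of the RDMA payload, addressed to the IANA-assigned RoCEv2 UDP port 4791. The opcode, queue-pair number, and payload are hypothetical values chosen for illustration.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF) -> bytes:
    """InfiniBand Base Transport Header (12 bytes), simplified: SE/M/pad/ack bits zero."""
    word0 = (opcode << 24) | pkey        # opcode | flags+TVer (0) | partition key
    word1 = dest_qp & 0xFFFFFF           # reserved byte (0) | 24-bit destination QP
    word2 = psn & 0xFFFFFF               # ack-request bit (0) | 24-bit packet sequence number
    return struct.pack(">III", word0, word1, word2)

def rocev2_udp_payload(opcode: int, dest_qp: int, psn: int, data: bytes) -> bytes:
    """What a RoCEv2 sender puts inside UDP: BTH + RDMA payload (ICRC omitted here)."""
    return bth(opcode, dest_qp, psn) + data

# Hypothetical RC "RDMA WRITE Only" (opcode 0x0A) toward QP 0x000012, PSN 7
payload = rocev2_udp_payload(0x0A, 0x12, 7, b"chunk-of-gpu-tensor-data")
print(f"send {len(payload)} bytes to UDP port {ROCEV2_UDP_PORT}")
```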
6:01 Michal: The two important aspects here make sure that whenever congestion occurs on the network, we handle that congestion in the right way. You have these two mechanisms: priority flow control, using the DSCP markings, and also ECN. Now the question is, once you get into a congestion situation, which one will kick in first? Quite often it depends on the implementation. In most implementations, ECN kicks in first: the switch sets the ECN bits in the data-plane packets to 11 toward the destination GPU, and the destination GPU realizes, OK, there is information telling me I have some congestion, so I need to react to it, and it sends that information back to the originator of the flow so it slows down a little bit. Then, if congestion is still occurring after some time, only then does PFC kick in on the back end, informing the switches one by one, hop by hop at the point-to-point level, down the road to slow down the speed at which the data is being sent. So there is that level of coordination, at the per-node level, between the two mechanisms that you mentioned.

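As a way to visualize that ordering, here is a toy model (the threshold percentages are hypothetical, and real DCQCN-style behavior is far more nuanced) of a switch queue deciding which mechanism engages as it fills: ECN marking end-to-end first, PFC pause hop-by-hop as the backstop.

```python
# Toy model of the ordering: ECN reacts first end-to-end, PFC is the hop-by-hop
# backstop. Threshold percentages are hypothetical illustration values.
ECN_MARK_THRESHOLD = 70   # queue fill (%) where the switch sets ECN bits to 11 (CE)
PFC_XOFF_THRESHOLD = 95   # queue fill (%) where the switch pauses its upstream neighbor

def switch_reaction(queue_fill_pct: float) -> list:
    actions = []
    if queue_fill_pct >= ECN_MARK_THRESHOLD:
        # Data-plane packets get marked; the destination GPU's NIC notifies the
        # *originator* of the flow so the sender slows down (end-to-end).
        actions.append("mark ECN=11 -> receiver notifies sender -> sender slows down")
    if queue_fill_pct >= PFC_XOFF_THRESHOLD:
        # Only if congestion persists: pause the directly attached upstream switch,
        # hop by hop, on the RoCE priority/queue only.
        actions.append("send PFC Xoff upstream on the RoCE queue")
    return actions

for fill in (50, 80, 97):
    print(f"queue {fill}% full:", switch_reaction(fill) or ["forward normally"])
```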
7:26 Arun: It's good to know that the components of RoCEv2 are really well coordinated inside the fabric, but how can an operator be sure that they're really working?

7:36 Michal: That's actually a challenging question, because, as a matter of fact, let's be precise: the mechanisms that we discussed, ECN and PFC, act at the sub-second level; we are talking about microsecond accuracy here. So it is fundamental that at the switch level, and at the lower level of the ASIC, we also have components capable of tracking the occurrence of congestion. Let's take a simple example. As an operator, I want to make sure that my GPU fabric, my AI/ML cluster fabric, is in a steady state, so I'm checking on my monitoring stations that the occurrence of these congestion events is pretty low. If I see them being triggered on and on, it means that something is wrong in my design, that the settings of my fabric are not good, or maybe that I have oversubscribed my infrastructure. So it's important to have this ECN and PFC telemetry information streamed, at the per-queue level, to a station such as the Apstra fabric manager, where we can visualize that for this or that component of the fabric there are situations I need to fine-tune. As we can see on the slide, we also have the capability of visualizing buffer utilization. That aspect of buffering is key, because as long as I'm utilizing the buffering at the egress ports of my fabric intelligently, in theory I should not trigger any of these PFC or ECN mechanisms to control congestion at all. So monitoring, having telemetry for RoCEv2 per queue, is instrumental to making sure that we have a full understanding of the performance of the fabric.

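In the spirit of that steady-state check, here is a minimal sketch of a per-queue poller. The counter names and the `fetch_queue_counters` stub are hypothetical, not a real Junos or Apstra API; the point is the rate-over-a-window logic an operator would alert on.

```python
import time

def fetch_queue_counters(switch: str, port: str, queue: int) -> dict:
    """Hypothetical stub standing in for a streaming-telemetry collector;
    this is not a real Junos or Apstra API."""
    return {"ecn_marked_pkts": 12, "pfc_pause_tx": 0}

def steady_state_ok(before: dict, after: dict, window_s: float,
                    max_ecn_rate: float = 100.0, max_pfc_rate: float = 1.0) -> bool:
    """Alert when congestion signals fire 'on and on' rather than occasionally."""
    ecn_rate = (after["ecn_marked_pkts"] - before["ecn_marked_pkts"]) / window_s
    pfc_rate = (after["pfc_pause_tx"] - before["pfc_pause_tx"]) / window_s
    return ecn_rate <= max_ecn_rate and pfc_rate <= max_pfc_rate

before = fetch_queue_counters("leaf1", "et-0/0/10", 3)
time.sleep(1.0)
after = fetch_queue_counters("leaf1", "et-0/0/10", 3)
if not steady_state_ok(before, after, window_s=1.0):
    print("persistent congestion: revisit design, buffer settings, or oversubscription")
```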
9:37 Arun: So besides the ECN and PFC monitoring, are there any more advanced RoCEv2 capabilities?

9:45 Michal: Well, yes. We have ways to mitigate situations where the fabric is in a bad state, where congestion keeps occurring, and we want to make sure that the occurrence of congestion inside the fabric does not have repercussions on the rest of the GPU workload exchanges. There is, for example, a functionality called PFC watchdog, pretty popular in the industry. Suppose an avalanche of PFC frames is received on the ingress port. If we consider that avalanche to be a continuous rate of PFC, pushbacks received on and on from a specific upstream switch, then we can conclude that this is not a normal situation and simply ignore these frames: instead of pushing them down to the downstream switches, we simply ignore them. Or, as an option, in order to mitigate the congestion on that specific queue, I can decide to actually drop the packets, to stop the congestion on a specific segment of the fabric. This PFC watchdog implementation goes through three distinct states on a specific node. We have this on the diagram, where spine 1, which is the spine aggregating connectivity from the different leaf devices the GPUs are connected to, was enabled with the PFC watchdog. The detection timer goes through its cycle and checks how many of these PFC messages were received, and if it considers that it received too many within the window of time, it will simply decide that this is probably not a normal situation and ignore them, instead of penalizing all the rest of the fabric with the slowdown. In order to preserve good performance for all the rest of the GPUs, I consider this a function to control how far the reach of PFC spreads across the fabric. It's a pretty good function whenever we rely a lot on PFC; I think that kind of mechanism is worth considering.

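The three states Michal mentions can be sketched as a small state machine. The detection window and storm threshold below are hypothetical; real implementations expose comparable per-queue detection and restoration timers.

```python
import enum

class WatchdogState(enum.Enum):
    MONITOR = "monitor"     # count PFC pause frames per detection window
    MITIGATE = "mitigate"   # storm declared: ignore (or drop on) that queue
    RESTORE = "restore"     # storm subsided: return to normal PFC handling

STORM_THRESHOLD = 1000      # hypothetical: pause frames per window deemed "continuous"

def watchdog_step(state: WatchdogState, pauses_in_window: int) -> WatchdogState:
    if state is WatchdogState.MONITOR and pauses_in_window >= STORM_THRESHOLD:
        return WatchdogState.MITIGATE   # stop honoring Xoff instead of propagating it
    if state is WatchdogState.MITIGATE and pauses_in_window < STORM_THRESHOLD:
        return WatchdogState.RESTORE
    if state is WatchdogState.RESTORE:
        return WatchdogState.MONITOR
    return state

state = WatchdogState.MONITOR
for pauses in (1200, 1500, 40, 0):   # simulated per-window PFC counts
    state = watchdog_step(state, pauses)
    print(f"{pauses:>5} pauses/window -> {state.name}")
```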
12:20 Michal: So that's one, and then there is another one, which tries to help optimize the way we use PFC, the priority flow control: PFC Xon/Xoff. Whenever we discuss RoCEv2 PFC, we also need to think about how buffer utilization happens. Take the topology you can see on my left-hand side, a leaf-spine topology where congestion happened on et-2 of spine 1. In this situation we need to think about the moment at which I should start generating my Xoff messages, the PFC control messages that tell the downstream device, in this case leaf 1, to slow down and buffer the packets a little instead of continuously sending them and causing congestion on et-2 of spine 1. As you can see, we have the opportunity here to control buffer utilization at the per-queue level by setting something called alpha values. By setting different alpha values at the per-queue level, we can control how often the Xoff messages are sent: if, for example, I set the alpha value for my Xoff messages higher, the chances of PFC occurring actually go down. If you look at the left-hand side, we have the typical buffer thresholds; the Xoff, the Xon, and the headroom represent these three thresholds, and as a function of the settings of either the alpha value or the Xon offset, we can control how often the PFC messages are actually triggered.

14:24 To be a little more precise, it's always better to take an example. I have an example with a simple topology: two ports, et-0 and et-1, connected on the same switch, for example a QFX5230, and then an outgoing interface, et-20. In this situation we have two queues on the same outgoing interface, Q0 and Q3, and they are set with two different Xoff alpha values. Let's say we decide to set an alpha value of 9 for Q0 and an alpha value of 7 for Q3. You can see that, depending on the values of these alphas, the outcome is that I'm going to get a different number of cells, a different amount of buffer, for each of the queues. I also share the formula for how we can calculate and fine-tune these buffers. Where exactly would we consider these advanced calculations in reality? We may have situations where a specific large language model is more important in terms of data processing than the others; it needs to get its data synced faster. For that kind of large language model we don't want to slow down the data exchanges, so we would typically allocate higher alpha values. That's one example. So we have these two functionalities I explained: one is the PFC watchdog, the other is the Xoff alpha values, with which the administrator of the fabric can fine-tune the fabric to make sure he gets the best of the bandwidth deployed in it, the best of the 400G and 800G bandwidth deployed in an AI/ML cluster. I think this is important. It's a little more advanced, but this is what the industry is proposing to make sure that these AI/ML cluster deployments perform really well and can scale to larger volumes.

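Since the exact buffer formula lives on the slide, here is only an illustrative reconstruction of how alpha-style dynamic thresholds typically work. The alpha-to-multiplier mapping, cell size, and pool size below are assumptions for the sketch, not the QFX5230's actual values; the takeaway is simply that a higher alpha lets a queue claim more of the shared pool before Xoff fires.

```python
# Illustrative dynamic-threshold sketch. The alpha-to-multiplier mapping, cell
# size, and shared-pool size are assumptions, not QFX5230 specifics.
CELL_BYTES = 256                      # assumed buffer cell size
SHARED_POOL_CELLS = 100_000           # hypothetical shared-pool size

def multiplier(alpha: int) -> float:
    return 2.0 ** (alpha - 7)         # assumed mapping: alpha 7 -> 1x, alpha 9 -> 4x

def xoff_threshold_cells(alpha: int, pool_cells: int) -> int:
    """Queue may fill to m/(1+m) of the shared pool before Xoff is generated."""
    m = multiplier(alpha)
    return int(m / (1 + m) * pool_cells)

for queue, alpha in (("Q0", 9), ("Q3", 7)):
    cells = xoff_threshold_cells(alpha, SHARED_POOL_CELLS)
    print(f"{queue}: alpha={alpha} -> Xoff after ~{cells} cells "
          f"(~{cells * CELL_BYTES // 1024} KiB)")
```

With these assumed numbers, Q0 (alpha 9) may grow to roughly 80,000 cells while Q3 (alpha 7) is cut off at roughly 50,000, which matches the idea of giving the higher-priority LLM traffic more room before PFC pushback begins.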
16:40 Arun: This is interesting, as these advanced options for RoCEv2 help to get the proper congestion control settings. But as the enterprises and the cloud providers are building their AI/ML clusters, are there any other technologies that they must consider, Michal?

16:59 Michal: Good point. The pace of innovation is actually still increasing. In the context of AI/ML clusters, there is even more innovation in terms of software, where instead of traditional load-balancing mechanisms, a lot more efficiency comes into play, especially when you consider AI/ML workloads: the entropy of these workloads, the variation in the characteristics of the flows, is relatively low compared to traditional server-type communication. So, in order to make sure we still use the maximum capacity of our AI/ML cluster fabric, there are a lot of enhancements around load balancing. One of them, for example, is dynamic load balancing, where the characteristics of the flow are not the only input: the real-time bandwidth utilization of the outgoing links in the ECMP groups is also incorporated into the hashing calculation. That's one example. Another is GLB, global load balancing, where we also track the situation on the next-to-next-hop node, in order to check the performance of the next-hop nodes and pick the right path locally on the device. And the last one is traffic engineering inside the fabric, where, as an administrative task, we can decide that, for example, elephant flows and mice flows will always take two diverse paths inside the fabric, and we have control over which of the next hops they are going to use. So a lot of advancements, a lot of innovation around load balancing as well, which is key inside AI/ML cluster networking, Arun.

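To show why flow-hash-only ECMP struggles with low-entropy AI/ML traffic, and what the dynamic variant adds, here is a minimal sketch. The utilization figures and tie-breaking policy are hypothetical simplifications of what switch ASICs actually do.

```python
import hashlib

def static_ecmp(five_tuple: tuple, n_links: int) -> int:
    """Classic hash-based member selection: deterministic per flow."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return digest[0] % n_links

def dynamic_ecmp(five_tuple: tuple, link_util: list) -> int:
    """Also weigh real-time link utilization: steer toward the least-loaded
    members, using the flow hash only to break ties."""
    least = min(link_util)
    candidates = [i for i, u in enumerate(link_util) if u == least]
    return candidates[static_ecmp(five_tuple, len(candidates))]

# A low-entropy RoCEv2 flow (UDP/4791) and hypothetical live uplink utilization
flow = ("10.0.0.1", "10.0.1.1", 17, 49152, 4791)
util = [0.92, 0.15, 0.15, 0.40]
print("static pick :", static_ecmp(flow, len(util)))   # may land on the hot link
print("dynamic pick:", dynamic_ecmp(flow, util))       # avoids the 92%-loaded link
```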
18:52 Arun: Awesome. So once again, thank you, Michal, for patiently answering a lot of questions. This is going to be great, because I'm learning a lot of new things as well as we talk. And thank you to all the viewers; my suggestion is to stay tuned to learn more about AI data center networks in our next set of videos. With that, have a wonderful rest of your day. Thank you!
