Day 2: Operating your AI data center with Juniper Networks

Juniper Networks presented its latest Apstra functionality for AI data center network operations at AI Infrastructure Field Day. The session focused on giving operators the context and tools to manage complex AI networks efficiently. Jeremy Wallace, a Data Center/IP Fabric Architect, emphasized the importance of context: understanding the network's expected behavior so issues can be identified and resolved quickly. Juniper is leveraging existing Apstra capabilities, augmented with new features such as compute agents deployable on NVIDIA servers and enhanced probes and dashboards, to monitor AI networks. The presentation aims to equip operators to maintain optimal performance and minimize downtime in critical infrastructure environments.
The presentation highlighted the evolution of network management for AI data centers, transitioning from traditional methods to a more proactive, data-driven approach. The core of Juniper's solution is telemetry, including data collected from GPU NICs and switches, that provides real-time insight into network performance. This enables operators to monitor key metrics, such as GPU network utilization and traffic patterns, and respond to potential issues swiftly. The Honeycomb view, traffic dashboards, and integration with congestion control mechanisms (ECN and PFC) show how Apstra provides visibility into the network's behavior. The goal is to give operators the context and tools to diagnose and resolve problems faster.
Finally, Wallace gave a live demo of the platform, showcasing features like real-time traffic analysis, heat maps of GPU utilization, and auto-tuning load balancing. The auto-tuning functionality dynamically adjusts parameters such as the DLB inactivity interval to optimize performance and eliminate out-of-sequence packets, increasing the likelihood of successful job completion. These auto-tuning capabilities are delivered as power packs, which are essentially Python scripts; Juniper is actively building more of them and is also working on deeper integration with other vendors for its customers' environments and solutions.
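For readers curious what a power pack looks like in practice, here is a minimal sketch of the pattern: a standalone Python script that authenticates to the Apstra REST API, polls a telemetry counter, and logs it on an interval. The endpoint paths, field names, and credentials below are illustrative assumptions, not Juniper's documented API contract.

```python
"""Minimal sketch of a power-pack-style script (hypothetical endpoints and fields)."""
import time
import requests

APSTRA = "https://apstra.example.net"            # assumed controller address
BLUEPRINT = "ai-cluster-1"                       # assumed blueprint ID
AUTH = {"username": "admin", "password": "***"}  # placeholder credentials


def get_token(session: requests.Session) -> str:
    # Hypothetical login endpoint; verify TLS certificates outside the lab.
    resp = session.post(f"{APSTRA}/api/user/login", json=AUTH, verify=False)
    resp.raise_for_status()
    return resp.json()["token"]


def total_out_of_sequence(session: requests.Session, token: str) -> int:
    # Hypothetical probe endpoint returning per-interface OOS packet counters.
    resp = session.get(
        f"{APSTRA}/api/blueprints/{BLUEPRINT}/probes/oos-packets",
        headers={"AuthToken": token},
        verify=False,
    )
    resp.raise_for_status()
    return sum(item.get("value", 0) for item in resp.json().get("items", []))


if __name__ == "__main__":
    with requests.Session() as s:
        token = get_token(s)
        while True:
            print("out-of-sequence packets:", total_out_of_sequence(s, token))
            time.sleep(30)  # poll interval; a real power pack would act on this data
```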
Presented by Jeremy Wallace, Data Center/IP Fabric Architect, Apstra Product Specialist Team, Juniper Networks. Recorded live in Santa Clara, California, on April 23, 2025, as part of AI Infrastructure Field Day.
You’ll learn
How Apstra Data Center Director assists with ongoing network operations
New features like compute agents deployable on NVIDIA servers and more
Transcript
0:00 I'm Jeremy Wallace and I work with Kyle. I run what we call the product
0:05 specialist engineering team in Apstra and we're kind of like the Apstra janitors, we do just about
0:12 everything so um I get to talk to you guys about taking all this stuff and now
0:17 running one of these networks with it right so we've seen a lot of the theory with the load balancing we've seen a lot
0:23 of the theory with how these AI data centers are built how they run Kyle talked about standing them up and now I get to
0:30 talk about it from the perspective of the operator right what is if I'm sitting in a NOC what am I
0:36 going to be looking at when I'm looking at these things so the thing the first thing that I want
0:43 to there there's a word that I like to get across your minds and it's the word context okay so when we build these
0:49 networks we have the full context of everything that should be happening
0:54 right so we know exactly the way that it's built and because we know the way that it's built we know exactly what
1:01 routes to expect in what locations we know exactly what interfaces should be where so this word context is critical
1:07 right so that's going to come up as a as a repeated theme the other thing that I
1:12 want to impress on your minds is this isn't necessarily new it's a new model of a network for the AI networks but
1:18 it's not new what we're doing we've been pulling this data out of the networks with Apstra for quite a while in different
1:26 kinds of formats whether it's an IP fabric or an EVPN VXLAN network the AI is just a new model right there's some
1:32 new data in there we've got some new agents that we can deploy out on these NICs, the GPU NICs, to
1:40 pull data directly out of the servers there's a little bit of new things but what we're doing isn't actually new so
1:47 with that in mind these are some of the things that we're going to talk about right so if we're operating these
1:53 networks we know that any kind of an issue as in any other network is going
1:58 to be catastrophic we have a small glitch in a fabric or in a in a cluster
2:04 it's going to cause uh immense issues with the training jobs to be run or um
2:09 customer customer utilization whatever it is these issues are going to be catastrophic there's a ton of data
2:15 coming at us it's amazing how much data actually lives in the network in these
2:20 devices in the NICs there's a ton of data coming at us and as operators I
2:26 remember my days as an operator trying to you know sift through the CLI and trying to find where the problems are right it's it's just an immense amount
2:32 of data coming at us all the time and then taking that data and digging down
2:37 and finding what is the actual problem where where am I congested where is my load balancing broken um those those
2:45 kinds of things so again it's all about the context we
2:51 used to take these networks and we'd stand up all these devices all over the place right so we have a whole bunch
2:57 of dots we have a whole bunch of devices and interfaces whatever it is sitting out there well if we can actually look
3:04 at it and say "This is a cat." I now know how to
3:11 interact with this cat right I know that if I go and I touch it in this way it's not going to like
3:18 it but if I stroke it nicely we're going to have a happy cat networks are no different if we go and we push the wrong
3:24 configuration out we're going to have bad reactions we're going to have things go sideways we're going to have these these critical issues it's all about the
3:32 context if we have the context of what's actually supposed to be happening we can
3:37 pull out the right data and get to the root causes of the issues much
3:42 quicker all right so now with that in mind now that we're we know we're looking at a cat right we're looking at
3:48 this AI network we're going to look at some of the slides here and then I'm going to move from I'm going to move
3:54 through these pretty quickly i'm going to move into a live demo i like doing demos better than slides
3:59 this is just a representation of the body of the cat the body of the network right so this is called the honeycomb
4:07 view and this shows all the GPUs that are available in the network we can break it down by different different
4:12 kinds of views so we have the view of the overall network all right we can dig down a little bit
4:19 deeper and we can get down into the pipes into the traffic dashboard so we can look at what's actually flowing
4:26 through the network we can see where uh where there's buffer utilization we can see where there's out of out of sequence
4:32 packets um congestion notification we can talk about that a little bit more in detail as well we can get down deeper
4:40 and pull this information out of the networking devices and out of the GPU NICs
4:46 themselves we can see, skip forward, we can pull data out of the GPU NICs as
4:54 well we have like Kyle mentioned we have an agent that we deploy out on the GPU servers so now we
5:01 can pay attention directly to the NICs themselves and we can correlate that data with what's happening upstream in
5:07 the leaves and then from the leaves into the spines or you know across the rails however the the network is
5:14 built uh we can look at the congestion control stuff so we have and I I have a
5:19 I'll talk about this in just one second but there's different mechanisms for congestion control within an AI network
5:26 right when you're talking about RoCE and someone might ask a question what does RoCE stand for RDMA which you know
5:34 remote direct memory access RDMA over converged Ethernet so it's just doing RDMA calls over our Ethernet network
5:41 within RoCE there's a bunch of congestion mechanisms to help us notify when things are getting
5:49 bottlenecked and helping us to avoid that congestion and and provide back pressure to to slow things down so we
5:56 can go and we can actually configure these thresholds to to meet whatever
6:02 needs we have every workload is going to be a little bit different every network's going to be a little bit different every customer is going to
6:08 want to see slightly different thresholds and um drop at different levels that kind of stuff so we can we
6:16 can we can configure all those different uh thresholds and we can we can look at
6:22 look at it to say I want to create an anomaly if I have data that or if I have
6:28 um to if I have a certain number of drops over 10 minutes create an anomaly if I have one
6:36 drop in one second I might not care about that right or if I if my buffer
6:41 utilization gets to 90% for half a second but drops back down that might not be a
6:47 problem but if it's at 90% for 10 minutes now we've got a problem so all of this is configurable to each
6:54 individual customer's needs and training jobs and networks.
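As an illustration of the rule being described (raise an anomaly only if buffer utilization stays at or above 90% for a sustained window, not for a momentary spike), a sustained-condition check can be sketched in a few lines. This is a generic sketch with invented sample data, not Apstra's probe configuration syntax.

```python
from datetime import datetime, timedelta

THRESHOLD = 90.0                   # percent buffer utilization
HOLD_TIME = timedelta(minutes=10)  # how long the condition must persist


def sustained_anomaly(samples):
    """samples: list of (timestamp, utilization_pct), oldest first.

    Fires only if every sample in the trailing HOLD_TIME window is at or
    above THRESHOLD, so a brief spike to 90% that drops back down does not
    raise an anomaly, but ten solid minutes at 90% does.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - HOLD_TIME
    window = [util for ts, util in samples if ts >= window_start]
    return len(window) > 1 and all(util >= THRESHOLD for util in window)


# A one-minute spike to 95% does not fire; 12 minutes above 90% does.
now = datetime.now()
spike = [(now - timedelta(minutes=m), 95.0 if m == 1 else 40.0) for m in range(15, 0, -1)]
hot = [(now - timedelta(minutes=m), 93.0) for m in range(12, 0, -1)]
print(sustained_anomaly(spike), sustained_anomaly(hot))  # False True
```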
7:03 With the agent we can actually look at the health of the
7:09 servers themselves right so not only are we now looking at the CPUs on the switches and the memory on the switches
7:16 we're paying attention paying attention to the servers as well what's the CPU
7:21 utilization on the servers what's the RAM utilization on the servers right so that we can have all the way out to the
7:28 far reaches of of the environment and have it holistically um
7:35 monitored okay the goal for all of this stuff is
7:40 nothing more than keeping everything moving as fast as possible for as long as possible and the two mechanisms that
7:46 we have for that to help with that are congestion control so we have to make sure we can get cars on the freeway we can get the packets on the wire at the
7:54 appropriate time the appropriate speed make sure they're being delivered to the other end that's the congestion control
7:59 mechanism the other mechanism is the load balancing and they work hand in hand right so the load balancing make
8:05 sure all of our paths are utilized and then the congestion controls making sure the packets are going on at the right
8:11 time and all the stuff is working together to make sure no cars end up flying off the freeway inadvertently you
8:16 know that's when we start losing packets in a job things go sideways we start losing the effectiveness of our of our
8:24 of our expensive servers so two of the
8:30 mechanisms that we use for and this is just a primer we're not going we're not going to go deep into the congestion
8:35 control the load balancing that's been discussed in other field days and in other sessions but I just want to give a
8:42 little bit of a refresher on what these are because we are going to see these in the in an upcoming demo here so the we
8:49 have two different mechanisms here ECN the explicit congestion notification and the priority flow
8:55 control so the ECNs would be like you're on the freeway and it's
9:00 getting congested and now the brake lights start coming on going forward right and you're saying "Hey we might need to start slowing down a little
9:06 bit." And that's going to get to the end point and he's going to send back a congestion notification packet to the
9:13 beginning to the source and he's going to say "Pump the brakes a little bit you might need to you might want to slow down a little bit." Okay so that's good
9:20 ECNs are good that's indicating we're going to start behaving nicely with each other PFC is like I'm coming
9:28 up on a traffic jam and I smash my brakes and everybody else behind me gets backed up right so now I'm telling everybody else behind me you got to slow
9:34 down but it's just a hard jam on the brakes okay so they're two different mechanisms so ideally we have more ECNs
9:41 in the network than we do PFCs. Question: is there also a management view
9:47 specifically about PFC negotiation where I might be able to see PFC negotiation failures between switches or between my
9:53 endpoints um we can we can look at we can see
9:59 which device is initiating the PFCs backwards is that what you're asking right is there
10:04 monitoring specifically yeah yes I'll show you a dashboard specifically for the PFCs being initiated we can see
10:11 exactly which device and which interface is initiating the PFC backward okay as
10:16 well as who's sending the ECN forward okay is the PFC that you're talking
10:22 about here is that part of the data center bridging is that the the same type of uh priority flow control that
10:28 was available through data center bridging yeah same kind of idea yes so this is this is Yes yes short version
10:35 yes okay yeah so I the way you were describing it it's typically you you do
10:41 that by assigning it to a particular VLAN and and you know that's the way
10:46 that you know which has priority so I wasn't sure how you could configure it other than just playing with the VLAN
10:52 yeah so this is we're actually going to see how this actually applies to a queue within the within the AI workload so
11:00 we're looking at this from the RoCE perspective and we can see a specific queue sending the PFCs back
11:06 upstream okay right so we don't have VLANs all through the network it's all layer three it's all BGP signaled it's all
11:12 IP but it's all doing it backwards for the PFCs based on the queue
11:18 and didn't mean to derail you just wanted to make sure that the definition hadn't changed on me
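To make the freeway analogy concrete, here is a toy model of how a RoCE priority queue typically reacts as it fills: ECN marking starts early as the gentle brake-lights signal, and a PFC pause fires only when the buffer is nearly full, which is why a healthy fabric should show far more ECN marks than PFC pauses. The thresholds are invented for illustration; real switches implement this in hardware with configurable WRED/PFC profiles.

```python
# Toy model of per-queue congestion signalling (thresholds invented).
ECN_MARK_THRESHOLD = 0.60   # start ECN-marking packets at 60% queue depth
PFC_PAUSE_THRESHOLD = 0.90  # send a PFC pause frame upstream at 90% depth


def congestion_actions(queue_fill: float) -> list[str]:
    """queue_fill is one priority queue's buffer occupancy, from 0.0 to 1.0."""
    actions = []
    if queue_fill >= ECN_MARK_THRESHOLD:
        # Brake lights: mark packets, the receiver echoes CNPs back to the
        # sender, and the sender ramps its rate down gracefully.
        actions.append("ECN-mark outgoing packets")
    if queue_fill >= PFC_PAUSE_THRESHOLD:
        # Slam the brakes: pause the upstream sender for this priority, which
        # backs traffic up hop by hop; effective but heavy-handed.
        actions.append("send PFC pause upstream")
    return actions or ["forward normally"]


for fill in (0.30, 0.70, 0.95):
    print(f"{fill:.0%} full -> {congestion_actions(fill)}")
```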
11:24 yeah okay so a year ago and I'm not going to redo this demo but a year ago
11:32 um we presented a demonstration on what we called an autotuning DCQCN power
11:37 pack okay so DCQCN that's a crazy acronym data center quantized congestion
11:47 notification i think I got it right um it's just a fancy term for class of
11:52 service, advanced class of service, in the AI data center that's basically all it is so this autotuning power pack
11:59 what it was doing for DCQCN is looking at the PFCs and the ECNs
12:07 and it would move the sliding window for drop profiles left and right
12:12 right so the idea is I'm going to get as fast as I can and as close as I can to the edge of this cliff without jumping
12:19 off the edge of the cliff and dropping packets right so I'm going to back up a little bit and I'm going to move forward and get it exactly to where I need it to
12:26 be okay so this is this is a a tool looking at the APIs pulling data out of
12:32 the APIs looking at these PFCs looking at the ECNs and moving that sliding
12:39 window to get it exactly right so that the AI network is moving exactly where it needs to be
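A rough sketch of the sliding-window idea behind that autotuning DCQCN power pack, using invented names and thresholds: if PFC pauses appear, slide the ECN marking window down so senders back off earlier; if there is no congestion signal at all, slide it up so the fabric runs closer to the edge of the cliff. The real power pack reads these counters from the Apstra API; this shows only the control logic.

```python
from dataclasses import dataclass


@dataclass
class EcnProfile:
    """WRED/ECN marking window on a queue, in KB of buffer fill (invented units)."""
    min_kb: int
    max_kb: int


def autotune(profile: EcnProfile, ecn_marks: int, pfc_pauses: int,
             step_kb: int = 50) -> EcnProfile:
    """Slide the marking window based on the last polling interval's counters."""
    if pfc_pauses > 0:
        # We fell off the cliff: PFC had to fire, so start ECN-marking earlier.
        return EcnProfile(max(profile.min_kb - step_kb, 100),
                          max(profile.max_kb - step_kb, 200))
    if ecn_marks == 0:
        # No congestion signal at all: we can afford to run hotter.
        return EcnProfile(profile.min_kb + step_kb, profile.max_kb + step_kb)
    return profile  # ECN active, no PFC: right where we want to be


window = EcnProfile(min_kb=500, max_kb=1500)
for marks, pauses in [(0, 0), (120, 0), (80, 15)]:
    window = autotune(window, marks, pauses)
    print(marks, pauses, "->", window)
```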
12:46 the reason I bring this up is because we're going to do I'm going to show a video of a of another demo similar to
12:52 this but it's another power pack that we've leveraged for load balancing
12:57 specifically so we talked about load balancing and you'll notice the one that's missing here is the
13:03 um the new the new one the RDMA based load balancing um excuse me
13:11 within these mechanisms of load balancing they kind of get more advanced as you go from you know kind of weave
13:17 through it and there's different places where you're going to use different versions of load balancing so the demo
13:22 that I have specifically shows the dynamic load balancing primarily because the network that it's built on is based
13:30 on Tomahawk 3 ASICs which support the DLB the Tomahawk 5 ASICs support the GLB right
13:38 so when we look at the difference between DLB and GLB the DLB
13:44 like Vic was talking about we're paying attention to the quality of my local links right so it's a hop by hop
13:50 mechanism, local links, look on the spine, my local links. GLB takes a whole bigger
13:55 picture right so I can look at the whole end to end what we're going to do in this demo when I get to that that demo
14:02 for the the DLB is we're going to see that we're paying attention to the entire network go back to that word
14:08 context we're paying attention to the entire network and doing almost the same kind of thing as GLB by looking at the
14:16 whole big picture and being able to move the different um settings the inactivity
14:23 intervals specifically to where we can have optimized load balancing and and
14:29 eliminate out of sequence packets right so that's I that's what I
14:34 just described here this is what we're going to see in the demo is this autotuning another autotuning mechanism
14:40 so if you have you have the autotuning for DCQCN running that's going to manage your congestion move that sliding window
14:47 then you have the autotuning for load balancing running and that's going to change that inactivity interval and as
14:53 those two things are moving together we're going to get that that network moving as close to the edge of the cliff
14:59 as possible and running as hot as it possibly can okay
15:05 any questions about any of that stuff so far great so with that I am going to
15:13 move into the Apstra UI here hopefully this works if I'm kind of angled a
15:19 little bit we got it mirrored so I'm I'm going to do a live demo and this is
15:25 running directly on our AI ML PAC lab that we have living in that building over there okay so this is this is real
15:33 hardware real GPUs real real switches um
15:39 nothing's nothing's virtualized here this is all this is all real stuff and I figured this is this is more interesting
15:45 than just kind of going through some theoretical stuff this is this is where I like to live we're going to try and jam some of
15:52 this stuff into like 15 20 minutes of demo and I recently spent four hours
15:58 with a with a customer going through this stuff because there's so much data coming at it so on that note one thing I
16:05 want to point out jump in here real quick one thing I
16:12 want to show is exactly how much data there is so if I come back over here and
16:17 I look at this this tab here this is this is a representation of our
16:26 cluster built on A100 um Nvidia A100
16:32 servers and it's a relatively small cluster it's going to be eight servers this is the this is the danger
16:39 with live demos it'll get there this is going to
16:44 it's going to it's built on eight servers with 16 leaves and eight GPUs
16:50 per server so you can see it's a relatively small cluster right so there's 64 GPUs um it's all rail-stripe
16:58 optimized so we each each one of these leaves is a rail if I zoom in here we can see that each one of them is a color
17:05 is that can you see that well enough so each one of these each each rail is identified by a color and each um each
17:13 one of these groups is a stripe so you can almost you can almost say a stripe
17:19 is like a pod almost all right it's just just kind of a loose analogy there so
17:25 this is what our network physically looks like we're talking about topologies this is what our network physically looks like now this is
17:31 something that I like to show just to identify how much there is to keep track of when I go into the graph explorer
17:39 this is looking at the actual graph database and we show the full blueprint
17:46 every one of these dots is something on the network and every line in here is a relationship all this
17:52 stuff is a non-zero entity you have to track it somewhere right either in
17:57 your brain in spreadsheets and multiple tools whatever it is this is where the context comes together these are all
18:03 these dots that make up that cat right we're not going to deep dive into the graph database just this is just kind of
18:10 a wow moment to say there's a lot of stuff to track in here right and those
18:15 have properties and stuff on them yes that's what makes it Yes so if I I can come in here and I can hover over this
18:21 and I can get contextual data out of every one of these dots
18:26 the UI for Apstra is nothing more than a first-class-citizen API caller that
18:32 talks to the graph database that's it so everything in this graph database if a
18:38 if there's something that's not in the UI that a customer wants to do they can use uh whatever restful API tool they
18:44 want to use to create their own API calls directly into the graph and we have we have a number of customers that
18:50 do that themselves so and the relationships that can have properties as well yes yes yeah so so everything in
18:58 here has its properties and the the relationships will identify like this one is a link that goes between a leaf 4
19:07 and Ixia I'm glad you can translate that yeah it's kind of it's hidden if I
19:12 click on that I think it'll stay up it's hidden right down in here so there you go it's pretty yeah I like it it's it's
19:20 cool and you know we can drag things around if I want to look at Ah here we go i want to look at this one and then I
19:26 can see where it links into right so we can play around with the graph database we can look at it the real power and
19:32 this is where I am not an expert we have other guys who are is you can go and you can create queries to pull the exact
19:40 data you want out of it then you can take that query throw it into an API call pretty cool all right that's enough
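The pattern he describes, take a graph query and throw it into an API call, might look roughly like this. The endpoint path, header name, and query syntax are assumptions for illustration; consult the Apstra API reference for the exact contract.

```python
"""Sketch: run a graph query against an Apstra blueprint over REST (assumed API shape)."""
import requests

APSTRA = "https://apstra.example.net"
BLUEPRINT_ID = "ai-cluster-1"
TOKEN = "..."  # obtained from the login endpoint

# Find every leaf switch and its links in the blueprint graph (illustrative query).
QUERY = (
    "node('system', role='leaf', name='leaf')"
    ".out('hosted_interfaces').node('interface', name='intf')"
    ".out('link').node('link', name='link')"
)

resp = requests.post(
    f"{APSTRA}/api/blueprints/{BLUEPRINT_ID}/qe",
    headers={"AuthToken": TOKEN},
    json={"query": QUERY},
    verify=False,  # lab only; verify certificates in production
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item.get("leaf", {}).get("hostname"), "->", item.get("link", {}).get("id"))
```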
19:47 four hours to demo this thing how long does it take for somebody to learn to use this mhm
19:54 um it's actually easier than you would than you would think okay and you actually I'm not a graph query expert
20:02 myself and I almost never interact with the graph the graph is just there the
20:07 graph DB is just there picked a pretty thing so the CEO can show pretty pictures
20:13 this is what you spent your money on this is this is the secret room this is the secret sauce run yeah and in reality
20:22 it took Apstra a number of tries to get to the right kind of graph database to
20:28 make this work right so yeah it's it looks and and I like to bring this up
20:34 because this is what you have to keep track of right and it's difficult and if
20:41 you take this and you turn this into an EVPN
20:46 network gets even bigger okay so with that I want to step back and show this
20:54 so now here we have all these different networks that we're managing with a single instance of Apstra we have one
21:00 cluster that is based on EVPN right so this is where our H100s and our
21:07 AMD MI300s live right now they're in this EVPN we're building multi-tenancy
21:12 JVDs, Juniper Validated Designs, based on EVPN VXLAN we've got the different
21:20 other backend fabric clusters so the A100s that I'm going to be showing
21:27 got a front-end fabric we've got a backend storage storage fabric all managed from the same instance we
21:34 understand the context of all these different networks and what's required in all these different
21:39 networks okay all right so when I go into my
21:45 cluster one dashboard here we're presented with what an operator would be looking at the operator would be sitting
21:51 there in his cubicle looking at pretty screens all the time right and ideally
21:58 and I'm going to come back to these probes here i'm not I'm not glossing over that ideally when he's in here looking at this this is what it looks
22:05 like because we have the reference design the best practices we push out
22:10 the correct um configurations and one story that I actually like is one customer called us
22:16 up one time they're like "Why do I do this this i never see anything red anymore i don't have any problems." And
22:22 I was like "Well that's because we pushed out the correct," and this was an EVPN customer this is some years back
22:28 this is an EVPN customer that's because we push out the best practices that you need for your environment they're like
22:36 Ah got it. So that that's a real story by the way that was I I talked to that
22:41 customer personally but this is ideally what you see and we have representations of our BGP connections our cabling this
22:49 is what Kyle talked about the cabling map if we have if someone goes in and and replaces a server and puts a cable
22:55 in the wrong switch ports we'll tell you that hey these cables are switched they're reversed um there's
23:02 ways we can automatically remediate that or we can tell them "No go physically remove it or we can just use an LLDP
23:08 discovery and pull that back in and fix it dynamically." So there's different this this
23:13 this is what we want to see all healthy all green
23:20 right user before I jump into the probes this is the interesting part and I can
23:25 tell you that something is now broken on our Ixia because I've been playing with this enough but this is the honeycomb
23:32 view and I showed you a screenshot of that a little bit ago this represents
23:37 every GPU in the fabric and I can take this and I can group it in in different
23:43 groupings so if I want to look and see what each one of my physical servers itself is doing this is now per server I
23:50 can do that if I want to see what my rails are doing I can look at each rail and I can see that I've got some rails
23:55 misbehaving i'm going to explain these colors here too this is a heat map of our GPUs it's
24:03 not a status indicator so if you think of a heat map the hotter it is the redder it is the better it is this
24:11 indicates how hot our GPUs are actually running ideally we want them all to be
24:16 running at this um this brown like 81% and higher right the the closer we can
24:22 get to 100% the better we're going to be the the the better our training model is going to run the quicker the job
24:28 completion time is going to be i happen to know that the Ixia is broken right now because this is the exact traffic
24:34 pattern that we saw when the Ixia broke couple days ago so the Ixia is doing
24:39 something weird we're sending a bunch of traffic with the Ixia creating a bunch of congestion i'm going to show you where the out of sequence packets are
24:46 and stuff like that so but what this indicates and this is actually cool i actually like this because now that if
24:52 I'm an operator I can look at this and go "Ah crap i've got some of my GPUs are
24:58 not operating at their peak efficiency what's going on?" Um right now I'm sorted by rails i
25:05 can take this and sort it by servers and I can select a specific server i can see
25:12 these if I zoom in here um and each one of these GPU 0 GPU one that is a rail so
25:19 zero rail zero rail one rail two i can see that rail one and rail three there's
25:25 something broken there now I can tell you that rail one happens to be a switch
25:31 that all the Ixia ports are coming in on and rail three is the other switch where all the Ixia ports are coming in on so
25:37 it makes sense that Ixia is dumping traffic in there creating a lot of back pressure a lot of congestion and that's
25:43 sending it back to the servers.
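The honeycomb regrouping he clicks through is essentially a pivot of the same per-GPU utilization telemetry by a different key (server, rail, or the whole network). A small illustration with made-up sample numbers:

```python
from collections import defaultdict
from statistics import mean

# Made-up samples: (server, rail, utilization %). In the lab cluster each
# server has 8 NICs and NIC n on every server lands on rail n / leaf n.
samples = [
    ("srv1", 0, 92), ("srv1", 1, 35), ("srv1", 3, 41), ("srv1", 7, 95),
    ("srv2", 0, 90), ("srv2", 1, 38), ("srv2", 3, 44), ("srv2", 7, 93),
]


def group_utilization(rows, key_index):
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key_index]].append(row[2])
    return {k: round(mean(v), 1) for k, v in sorted(buckets.items())}


print("by server:", group_utilization(samples, 0))
print("by rail:  ", group_utilization(samples, 1))
# Rails 1 and 3 averaging around 35-45% while the rest run above 90% is the
# "two misbehaving rails" pattern the heat map surfaces in the demo.
```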
25:49 Okay but if I'm an operator I can look at that and go dang you know what's going on here so when I come up and I look at my
25:56 probes this is going to give me an example of what's going on
26:03 in the network and this is going to tell me there's going to be a whole bunch of out of sequence packets and a whole
26:08 bunch of uh ECN frames and um things like that going on quick quick question
26:14 about how you're getting that data i'm assuming it's some sort of telemetry subscription and you're just getting the telemetry off the ASIC for that correct
26:21 correct there's a combination of RPC calls out to an agent so that's actually
26:26 a great question and I'll take a step back and explain that in Apstra we have the Apstra... here we go this is the danger
26:33 of real demos there we
26:40 go okay in Apstra we have the Apstra core server and then there's
26:45 agents that live out in in either on the devices themselves or that's an onbox
26:51 agent or we have an off-box agent that sits in a small Docker container with the Apstra server the ideal is to get
26:58 the on-box agents out to the devices and then we have either an RPC call that will go out and pull
27:05 data it's not SNMP just an RPC call right or we'll do gRPC streaming data
27:11 back in on some of the counters all right there we go okay so here we
27:19 have a whole bunch of out of sequence stuff and we can look at this and see here's a system ID and we'll correlate
27:26 this a little bit easier here in just one second because when I look at a serial number that doesn't mean much to me but I can see that on GPU3 E I've got
27:34 a whole bunch of out of sequence packets coming in out of sequence packets is indicative again of poor load balancing
27:41 somewhere right I've also got CNPs going on and the CNPs is indicative of
27:48 congestion so I've got something going on within my network where I've got both out of sequence packets and I've got
27:54 CNPs so I've got bad load balancing and bad congestion happening you know the Ixia is doing all kinds of crazy
28:01 stuff to to make this a mess so if I go back over here to my dashboard just
28:06 going to scroll down and show some of the parameters that we then that an operator would go and look at as he is
28:13 trying to figure out what's going on so we can create we can pull this data
28:19 out of the network and some of these graphs are based on 7-day intervals some of them are based on one hour
28:26 intervals and a user can go and create these graphs based on what they want to see the intervals they want to report
28:33 back up to their VPs right hey over the past seven days we've had this this much bandwidth in our AI network kind of a
28:39 thing so as we scroll down and look through this I'm just going to show you some of these graphs that we've created
28:45 really quickly um we've got spine spine traffic and if you hover over these you
28:50 can actually see the amount of... I'm not supposed to leave the white line... you can
28:56 see the amount of bandwidth on on each interface on the spine right so one of them is operating at like 400 gigabits
29:02 per second that's that's cool that's good that's what we want um and same thing you can go down and here's here's
29:08 our leaves our leaves for stripe one and our stripe two leaf so we can we have this broken out now per stripe we're
29:15 getting just a little bit more granular as we keep scrolling down through here we can look at total traffic over the
29:22 last seven days so you can you've got bandwidth u this you know here's here's
29:28 where gaps where the job stopped running ideally it's all just running really hot
29:33 right and when we look at this if you look at the value it's 4.95 terabytes
29:39 right now if I come down I get a little bit more granular now I'm going to look at my intrastripe so stuff within my
29:47 stripe this is 3.1 terabytes and if I go
29:55 interstripe between my stripes now we have the 1.76 terabytes so
30:02 now we take that 1.76 and the three you know and that adds up to the total traffic right so we can keep getting a
30:08 little bit more granular as we keep moving and these these dashboards can be swizzled and moved around however
30:14 however an operator wants to see it yes hi Denise Donohue can you set it
30:20 up to notify you proactively when things start creeping yes yes there are there
30:26 are different ways to to set that up um yeah in fact in the in the demo that
30:33 we did last year with the congestion notification they actually had that set up with ServiceNow to where the
30:40 congestion notification would pop up and ServiceNow would open a ticket so the users would get that scrolling right
30:47 through ServiceNow and as things changed in the network the ticket in ServiceNow would change
30:52 and when the congestion got relieved or got completely fixed the ticket would get closed in ServiceNow
30:59 do you provide any integration with the other third party uh observability
31:05 stacks or monitoring um we don't necessarily
31:12 provide that from Juniper ourselves but that's all available with
31:17 the rest APIs so there are other customers doing things with other kinds of observability platforms yes we do
31:24 have a Telegraf plugin yeah that we can use and we work with Grafana a ton so
31:30 that that's all there and then and and of course we have the the flow where we
31:36 can get lots of flow stuff and all that can go to different places so yes okay gotcha
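Since the integrations mentioned here (ServiceNow, Telegraf, Grafana) are all driven off the same REST data, the basic pattern is to pull anomalies out of Apstra and push them to an external ticketing or observability endpoint. A minimal sketch, with all URLs and field names invented:

```python
import requests

APSTRA = "https://apstra.example.net"
BLUEPRINT_ID = "ai-cluster-1"
TOKEN = "..."                                         # Apstra auth token
TICKET_URL = "https://itsm.example.net/api/tickets"   # e.g. a ServiceNow-style API


def fetch_anomalies():
    # Hypothetical anomaly endpoint on the Apstra server.
    resp = requests.get(
        f"{APSTRA}/api/blueprints/{BLUEPRINT_ID}/anomalies",
        headers={"AuthToken": TOKEN}, verify=False,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])


def open_or_update_ticket(anomaly):
    # Push the anomaly to the external system; that system can close the
    # ticket when a later poll shows the anomaly has cleared.
    requests.post(TICKET_URL, json={
        "source": "apstra",
        "id": anomaly.get("id"),
        "severity": anomaly.get("severity", "warning"),
        "summary": anomaly.get("anomaly_type", "congestion"),
    }, timeout=10)


for a in fetch_anomalies():
    open_or_update_ticket(a)
```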
31:43 other questions okay yeah maybe a question um I'm looking further as well
31:49 to Mist. One: is there also, because notifications is always a good thing if there's a problem, but wouldn't it be
31:56 better that there's also a recommendation engine saying that hey there is a problem and we advise you to
32:02 do this or that yes so yes push the button and it happens yeah so with the with the cloud
32:10 services with the Mist piece yes um that is going to be more part of that
32:15 where you have an AIOps engine running up
32:22 in Mist and we'll be able to have different kinds of things you can visualize in there we can already do
32:29 that with the EVPN VXLAN stuff where you can look at a service and you can see traffic flows through the
32:36 network with a service you can see what would happen if I it's called a failure analysis what happens if I create a
32:42 break here you can do things like that and it will create or it can it can create recommendations but so that's
32:48 another UI that I need to use then so it's not in here yet correct that's not part of Apstra okay that would be a
32:55 different window that would be running and you can launch Apstra from ACS oh
33:01 okay so you can click on it'll say this instance of Apstra is reporting this issue click here to go do
33:07 this thing okay thanks yes Jim Sprinsky
33:12 CDC the whole concept of a rail I'm still trying to get my head around that and, if you want to elucidate on that,
33:20 is there a specific thing you would monitor a rail for instead of some of the other things here yeah so let me
33:28 just jump back over here so I'll show you I'll show you a
33:34 comparison of what it looks like with a rail versus a non-rail okay so think of
33:39 let me just zoom this in a little bit better so we can get a little closer okay so if
33:46 you if you look at these you can see that you I have I've got four purple links and four blue links right gotcha
33:52 so a rail is nothing more than every server has eight nicks gotcha i've got
33:59 eight leaves ah nick zero goes to leaf zero nick one
34:04 goes to leaf one got it so rail zero is everything going to that one leaf got it
34:10 okay that makes sense that's somewhat simplified you can you can do it where if I have a leaf maybe I have two links
34:16 in a rail you know so there's different things you can do there but that's kind of the idea behind it is the NICs map to the same place.
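The wiring rule he just described, NIC n on every server cables to leaf n, is simple enough to state as code, which also shows why a problem on rail 1 immediately points at one specific leaf. Sizes and names below are illustrative:

```python
SERVERS = [f"srv{i}" for i in range(8)]   # 8 servers (one stripe, sizes illustrative)
NICS_PER_SERVER = 8                       # 8 GPUs/NICs per server

# Rail-optimized wiring: NIC n on every server connects to leaf n,
# so "rail n" is simply the set of links terminating on leaf n.
rails = {
    n: [(srv, f"nic{n}", f"leaf{n}") for srv in SERVERS]
    for n in range(NICS_PER_SERVER)
}

print(rails[1][:2])
# [('srv0', 'nic1', 'leaf1'), ('srv1', 'nic1', 'leaf1')]
# Every member of rail 1 lands on leaf1, so congestion on rail 1 in the
# heat map immediately narrows the search to that one switch.
```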
34:22 okay and so would there be, there's obviously specific
34:27 metrics that you're watching there and if you saw a failure on a rail that
34:34 would indicate what typically that something's wrong at the in at the
34:39 server at the NIC or it depends on it depends on I mean the failure is going
34:44 to be wherever the failure is right I see kind of kind of a dumb statement but
34:49 if and I don't see that necessarily too terribly different from any other network if I've got so let me show you
34:55 what I mean by that real fast please if I can zoom this out I'm just going to
35:00 close that window there okay let's go back over here to um our blueprints so
35:06 if I look at this particular instance the the EVPN instance this is where like
35:12 I said where the H100s and the MI300s, the AMD ASICs, are if we look at this one this
35:18 is our multi-tenant so this would be like your GPU as a service cluster okay Right inside here you'll see that this
35:24 one is not built necessarily in a rail and stripe fashion okay but if I have a
35:29 GPU failure in a rail it's still a GPU failure if I have a an interface failure
35:36 I'm I'm just going to get it reported in a in different language so it'll say um
35:41 stripe one rail one interface et-0/0/0 had a failure okay versus here it's
35:48 going to say it's just a classification essentially exactly okay we're still working with RoCE between the GPUs
35:55 right so we're still looking at the PFCs backwards ECNs forward CNPs
36:01 we're still looking at those same kinds of things okay it's just the terminology is going to change a little bit so just the topology term is really what it
36:07 comes down to yeah got it and one thing that we are actually working towards it's not in Apstra yet but
36:14 we're working towards is having rail optimized and EVPN together
36:21 okay okay so now imagine that one okay uh-huh that's not the system blowing up
36:27 that's like my mind just blowing i I work with this stuff every day and I'm still blown away by the
36:34 things that we're moving forward this the speeds we operate at um I actually loved our hashtag and this this kind of
36:41 goes along with what we do with Apstra I love the tag we used to use engineering simplicity
36:48 uhhuh get it get it this stuff is not simple
36:54 so we we tried to take and and there was a guy that I used to work for and I we were in a meeting with a bunch of people
37:00 talking about APIs and all this stuff and he just stopped the meeting prophet might remember this he stopped the
37:07 meeting and he goes "Hold on nobody cares about this piece under the iceberg
37:13 what's the very shiny tip that our users are going to interact with?" Right this is Abstra up here there's a whole bunch
37:20 going on underneath here this is the important part this is what we interact with that's how we interact with all
37:26 these other right you know think about the cat right we interact with the eyes the fur we're not touching the
37:34 intestines and the heart but we know it's there and we know it's important we got to keep it healthy we've all had the
37:40 experience of a user that we really respect walk up and go "It's running slow we get it." Yeah exactly and so so
37:48 so back to this as we as we get a little deeper down it's running slow right that
37:53 that is a it's such a common theme that we don't know it's slow until 3 days
38:00 later right and someone comes to us and says "Hey my job ran slow three days ago
38:07 right over the weekend." One of the cool things about Apstra is we can do all this stuff in a time
38:13 series database we can keep all this data and I can take and I can run a dragging slider back and forth it's
38:21 actually something that I'll that's I think it's kind of cool i'll show you real fast here um if I go to the the
38:26 active tab here and we look at this heat map
38:36 um so with this traffic heat I can turn this into a time series and I don't this
38:41 is on the this is on the EVPN cluster so I'm not actually sure what's there but I can take and I can drag this back and
38:47 forth and I can show and they can say "Hey Saturday morning at 2 a.m there's a
38:54 noticeable slowdown and we can go and look at the network and it will show us exactly where congestion happened where
39:01 bottlenecks happened right that that kind of a thing so it was probably a DNS failure yeah it's always it's always DNS
39:09 right yeah it's always DNS now uh because it holds a lot of data um and it
39:16 collects it from all those devices um and so on and interfaces where does it
39:22 store the data it's got to be a really big database is it in that Apstra server is it somewhere else in the cloud
39:28 or a combination of how long do you keep that data that's a great question so on
39:33 on the Apstra server we keep 30 days worth of data uh if you want to keep a
39:39 year or two years or more worth of data then you export it out to something like Grafana, Telegraf, or up into the cloud
39:47 services and we can keep the data longer in in the cloud on the server itself in
39:52 order to keep it from becoming like terabytes in size we just keep 30 days
39:58 worth of data because it's a VM right correct it's a VM okay correct
40:03 yep so that's 30 days of completely non-aggregated or compressed data correct granularity yep that's all the
40:10 raw data that we pull from all the agents all the RPC calls everything that's coming into us we keep that for
40:15 30 days yeah okay in the interest of time I'm
40:21 going to jump over to this um autotuning load balancing demo and I'm just going
40:27 to start this and this is playing at 2x speed so this is it's sped up a little
40:32 bit and you're going to see what's what what's happening here is initially the inactivity interval timer was set for 64
40:40 right and so it's set really really small and when it's set really really small we're going to start seeing a lot
40:46 of out of sequence packets and as as we're monitoring the network for these out of sequence packets we're going to
40:53 start backing that number up to get to where we eliminate completely the out of sequence packets
41:00 um the way to think of the inactivity interval in
41:06 um in DLB I I I like to think of it as like the professor timeout interval
41:13 right we've all been there with college or high school when you walk into the class and you're like "How long do I have to wait for the professor?" Right
41:20 how long and and if you wait for 30 seconds and you bail out that's too quick you might you're probably going to
41:26 miss something you're going to miss a professor something important is going to happen if you wait for too long and you sit there for 3 hours well now um
41:34 you know who knows what's changed in the world of the news in the past 3 hours and all you you miss all kinds of errors
41:39 and all kinds of things happen the inactivity interval isn't really any different than that if it's set too
41:45 small I'm going to create my own problems there's going to be out of sequence frames I'm going to
41:51 receive things on the wire incorrectly if it's set too big I'm going to miss the opportunity to detect the actual
41:59 problems on the network right so this is the whole purpose of this is to monitor the network
42:05 continuously and then make incremental changes this is like I'm touching the
42:10 whole cat now right so I'm looking at the whole cat and saying "Okay there's a little piece here that's
42:17 unhealthy let's holistically heal the cat." Um kind of a thing so I'm just
42:23 going to speed this up a little bit you'll see right now where we've got the out of sequence packets
42:30 is at 1675 there's a value up there if I click forward a little bit here in time
42:36 the value is now dropped down to six out of sequence packets it might jump back up as the timer is going back
42:43 and forth and the idea with this autotuning mechanism is to get this down
42:48 to where it is actually down to zero out of sequence
42:53 packets and now we have an ideal inactivity interval for this specific job.
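What the video shows (the inactivity interval starts at 64, out-of-sequence packets spike to around 1675, then fall to 6 and finally 0 as the timer is widened) is in spirit a feedback loop like the sketch below. This is not the shipping power pack; the counter source, step size, and cap are invented for illustration.

```python
START_INTERVAL_US = 64   # the demo's initial, deliberately too-small interval
MAX_INTERVAL_US = 4096   # invented cap so we never back off forever
STEP_FACTOR = 2          # how aggressively to widen the interval each pass


def autotune(oos_readings, apply_interval):
    """Widen the DLB inactivity interval until out-of-sequence packets stop.

    oos_readings yields the out-of-sequence packet delta seen in each polling
    interval; apply_interval pushes the new timer value to the fabric (via
    the Apstra API in the real power pack, a no-op here).
    """
    interval = START_INTERVAL_US
    apply_interval(interval)
    for oos in oos_readings:
        if oos > 0 and interval < MAX_INTERVAL_US:
            # Too twitchy: flowlets are re-pathed mid-burst and arrive out of
            # order, so wait longer before declaring a flowlet idle.
            interval = min(interval * STEP_FACTOR, MAX_INTERVAL_US)
            apply_interval(interval)
        yield interval, oos


# Simulated counters shaped like the demo: 1675 OOS packets, then 6, then 0.
for interval, oos in autotune([1675, 420, 6, 0, 0], apply_interval=lambda us: None):
    print(f"oos={oos:>5}  inactivity_interval={interval} us")
```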
43:00 The goal here is not necessarily to be faster than every algorithm in the
43:06 world or every computer in the world the goal is to be faster than humans right if we're sitting there
43:13 looking at this and I'm waiting for an out of sequence packet I now have to go and touch every device I have to figure
43:18 out where it's coming from the goal is to be faster than humans so if we
43:25 start a job and this job is going to run for three hours or it's going to run for three weeks or three months whatever the
43:30 job's going to run for however big it is if I can resolve some of these things
43:36 the congestion or I can resolve the load balancing within the first couple minutes of that job running I
43:42 dramatically increase the odds of my success of the job running to completion in a time that I want it to run right
43:50 so that's the idea behind these autotuning things and one of the
43:56 cool things that we can do with Apstra is and this is something that we're
44:01 actively working on within my team is building more of these power packs as we're calling them
44:08 that we can move features to Apstra just a little bit quicker right and then
44:14 we look at them we say "Okay this is a good one let's pull this and actually pull this into the product."
44:20 And the power packs are just Python scripts this one's a Python script yep and the congestion notification
44:26 is a Python script as well yep how often are you seeing like rate of change in
44:33 this particular perspective and what I mean by that is how often am I reconfiguring the cluster enough to have
44:41 additional problems created for myself are you saying this is more of an enterprise perspective where I maybe
44:46 have a cluster that I don't mess with very often um what would you say is like the give and take on rate of change and
44:52 how often I'm redesigning a cluster that's a good question um it's probably something we'd have to take more and
44:59 discuss a little bit more offline I don't really have good data behind that necessarily um ideally
45:06 not very often right you know ideally you get it to where you solve
45:12 the problem with a few configuration changes at the beginning maybe it's just with an inactivity interval timer mhm
45:18 ideally you're not touching it a whole lot you know so I've built it day 0 day
45:23 1 day two right and I'm running it yep i'm just kind of doing a a reactive monitoring at that point right because
45:29 I'm not rebuilding my cluster very often so I have to imagine it gets into a stable state pretty quick right correct
45:36 that's the idea now the the place where I could see it changing is maybe I run job type A and it has certain size of
45:43 packets and certain inter um inactivity interval right or the the congestion
45:52 notification pieces are have a certain place in that sliding window job B it
45:58 slides a little bit to the right here and the inactivity interval slides a little bit down here you know so having
46:04 those things running constantly I may have to make a configuration change here depending on the job that's
46:11 going to change right could potentially yeah okay yeah the idea is that every customer is going to be a little bit
46:17 unique just like every cat's going to be a little bit unique you're going to have a little bit different requirements for food this one's going to make him sick
46:23 this one's not so changing those things when needed as early as possible and
46:28 then having it just set stable is the idea yeah