Jay Wilson, Architect, Juniper Networks

Design, Deploy, and Operate AI Clusters like a Pro with Juniper Networks

Cloud Field Day 20: Design, Deploy, and Operate AI Clusters like a Pro

Struggling with where to start with your on-prem AI training cluster? Juniper validated designs (JVDs) are rigorously pre-tested to make sure your deployments are relatively pain-free, and we now offer JVDs to meet the specific needs of AI data centers. See how Apstra does the Day 0/1/2 heavy lifting for you with intent-based automation.

You’ll learn

  • How to manage AI clusters like a hyperscaler with Apstra

  • How Apstra uses explicit congestion notifications (ECNs) and priority flow control (PFC) counters to monitor the health and performance of AI clusters

  • How Apstra’s anomaly detection and rollback features can benefit the AI data center

Who is this for?

Network Professionals, Business Leaders

Host

Jay Wilson
Architect, Juniper Networks

Transcript

0:10 uh I am not from product management I am not from product marketing I am what is

0:16 known as part of the X domain architecture team which means I'm in the field I work directly with customers day

0:24 in day out architecting Solutions uh how I got nominated to come

0:30 up here and talk about this is I've pretty much spent my entire career in the data center uh spent decades in HPC

0:37 so AI is definitely a sister or a cousin to HPC and uh the folks down in our POC

0:45 lab which you all are going to go see or what we call our AI Innovation lab

0:50 uh asked me to work with them in figuring out how we monitor and manage the environment you already heard from

0:58 the earlier presenters about everything leading up to it and I'm supposed to be the demo and I was

1:05 told uh originally I was going to do a live demo then I was told let's record it because as we all know there's

1:12 one way it goes right and a thousand ways it goes wrong I've been doing this a long time so I do have a pre-recorded demo

1:20 but I can also jump to the live setup what I do not want to do there's actually somebody scheduled in there

1:26 right now I do not want to make any changes but I am happy absolutely happy to jump over there but before we get

1:32 there I want to go through a couple things uh this is really important to understand and I'm very pedantic about

1:38 words uh every presenter up here said that Apstra does data centers okay

1:46 Apstra absolutely can do data centers but Apstra is really focused on fabrics

1:53 inside of data centers you're going to have one Apstra instance that can do

1:58 many fabrics and it can actually do many data centers we actually have customers that have a single install of the

2:05 Apstra instance and they're managing multiple data centers so it kind of bugs

2:10 me when our marketing and product people come up here and they say hey you know it does data

2:16 centers okay whatever it really builds fabrics and in the world of AI

2:24 there's typically four of them and we've talked about three of them and Prof actually mentioned all four he talked

2:29 about the front Network or the inference side of it so this was our setup and

2:34 you've seen this slide already but I want to break it down so it makes more sense these are the three Fabrics uh we

2:43 tried to rearrange this picture a different way to make it look more

2:49 the geometry so the gpus are right here in the center this is the GPU to GPU

2:57 traffic down here the purple is the GPU to

3:02 storage and what you really see is that's a set of leaves over there if

3:08 you're thinking spine Leaf architecture that's the core for it that's the only way we can make this picture fit and

3:14 make some sense and then this is our front end management fabric where slurm and everything else lives good level

3:23 set okay by the way I'm a pretty interactive person

3:30 the theme is think of on-prem AI clusters like you

3:37 would if you're a hyperscaler and Nick and Raj talked all about the terraform

3:42 stuff they've done absolutely uh Apstra has been around for 10 years it's actually its 10-year anniversary this

3:49 year uh Juniper acquired them about three years two months ago and one of

3:56 the things that I really like about Apstra is the fact that at the very base at the foundation is

4:02 intent and as Raj mentioned think of it like the declarative nature that you get with Terraform and intent is very very

4:10 critical and the fact that it's the foundation is even more critical a lot of vendors have been talking about

4:15 intent the last couple years when they talk about intent they talk about the fact they've layered it on top

4:21 and they're trying to press it down into what they'd already built this is on the bottom and it permeates every level

4:29 within the stack of the Apstra software so the day zero the day one and the day

4:35 two plus everything is predicated on that intent and there was a question earlier this morning about validation

4:41 and you know how to tell if something isn't right because of that intent being at the base every change you make goes

4:49 through that intent validation process if it doesn't match it kicks it back and

4:54 will not let it go but this is a really really important differentiator when you start thinking about it the other thing

5:00 that it does is it implies that Apstra is the single source of truth so in this workflow whether you're

5:08 using Terraform or something else in this workflow Apstra is the source of truth and I will show you how that comes

5:15 about and what I'm really going to focus on is not only the intent I'm going to focus on the telemetry because this is

5:21 where it gets very specific about what we're doing for our AI cluster I want to

5:27 point out that what I'm going to show you the Apstra I'm going to show you is the exact same Apstra every one of you

5:33 here and everybody out there in the ether can go download from our website and deploy we have not changed it in any

5:42 fashion from the core features all we've really done is leverage the features that exist no

5:52 modifications whoops backwards it doesn't help uh I wanted to give uh Nick

5:57 and Raj another hand up because they already did the day Zero and day one stuff for us right so this is again it's

6:05 it's a valid uh QR code it's not going to sneak up and bite you you think there's something hidden

6:13 there and I can't emphasize this enough at the heart of our JVDs is

6:19 Apstra and even if you're not using Juniper which you know as a Juniper

6:25 person I'd just say we would love for you to use Juniper but even if you're not it's a great reference architecture to

6:31 go see what we have done inside of the environment specific to AI/ML so I just

6:37 want to point that out again and what I'm going to demonstrate is how we've tailored the

6:43 environment again we haven't put anything special in there but we've leveraged features that exist already

6:50 then what we do day-to-day operating it yes some of it is terraform driven but we also have instances where we manually

6:57 do stuff with it and then how we develop against it all right so this is the

7:02 pre-recorded demo uh when you first come in you land and you've seen this page a couple times you land on what's known as

7:08 the blueprint page think house right design

7:14 build deploy so everything starts with blueprint or as I like to call it it's Fabrics these are three Fabrics those

7:21 are the three fabrics the yellow red purple and on the far left we have the

7:27 GPU to GPU so the back-end fabric for the GPU traffic in the middle is the storage to GPU traffic and on the far

7:35 right we have the front-end management fabric as you can see this screen tells you exactly what's happening in

7:42 the environment at any given moment you can see that we have some anomalies on the backend fabric for the gpus but

7:50 notice that there's two types there is a service anomaly and there's a probe anomaly service anomaly means it's

7:57 impacting actual traffic in the environment probe means there's some

8:02 probes that have been turned on and that we're having some conditions you need to go investigate them but it's not

8:08 actually impacting traffic flow all green outstanding so where I'm going to jump

8:14 now is how are we actually tailoring the environment we're tailoring the environment by creating some custom

8:21 Telemetry again all we did was make custom Telemetry the feature exists

8:26 already we went out and we pulled telemetry for ECN so earlier profa

8:31 talked about these which is explicit congestion notifications so what happens is a switch will say I'm seeing

8:39 congestion by the time it gets to the end device the end device will say oh I've

8:45 seen that you've marked a packet send a notification to the far end when it gets to the far end you may

8:52 or may not need to turn on priority flow control in other words pause frames so we're also collecting

8:59 pause frame counters and we're monitoring queue drops so three collectors we made three pieces of custom telemetry
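[Editor's sketch] A collector of this kind can be pictured as a small parser that turns text returned by a switch into three named counters per interface: ECN marks, PFC pause frames, and queue drops. The sketch below is a minimal illustration only. The sample text, the field names, and the functions `parse_counters` and `lossless_violations` are hypothetical; they are not the real switch output or the Apstra collector schema.

```python
import re
from dataclasses import dataclass

# Hypothetical CLI-style text: one line per interface with three counters.
# The real command and its output format are not shown in the talk.
SAMPLE_OUTPUT = """\
et-0/0/1 ecn_marked=120 pfc_pause=4 queue_drops=0
et-0/0/2 ecn_marked=0 pfc_pause=0 queue_drops=0
et-0/0/3 ecn_marked=950 pfc_pause=37 queue_drops=2
"""

LINE_RE = re.compile(
    r"^(?P<ifname>\S+)\s+ecn_marked=(?P<ecn>\d+)"
    r"\s+pfc_pause=(?P<pfc>\d+)\s+queue_drops=(?P<drops>\d+)$"
)


@dataclass(frozen=True)
class InterfaceCounters:
    interface: str
    ecn_marked: int
    pfc_pause: int
    queue_drops: int


def parse_counters(text):
    """Pull the three fields of interest out of each line of text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip headers or lines this sketch does not understand
        rows.append(
            InterfaceCounters(
                interface=match["ifname"],
                ecn_marked=int(match["ecn"]),
                pfc_pause=int(match["pfc"]),
                queue_drops=int(match["drops"]),
            )
        )
    return rows


def lossless_violations(rows):
    """Interfaces that dropped packets, so the fabric was not lossless there."""
    return [row.interface for row in rows if row.queue_drops > 0]


if __name__ == "__main__":
    for name in lossless_violations(parse_counters(SAMPLE_OUTPUT)):
        print("drops seen on", name)
```

Keeping the three counters in one record per interface makes it straightforward to group them later by leaf or by interface for a dashboard.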

9:06 why did we need to make three pieces of custom telemetry we had to make those because the actual fact that

9:13 we're tracking them is relatively new in the OS on the boxes we don't have

9:20 streaming gRPC sensors yet so we literally are taking a uh CLI command

9:27 pulling back the data and we're saying give us these three fields map them to something we can do with them and then

9:34 let me build off of that at some point and that's the first key to what we're

9:40 doing specific to AI/ML are these collectors the second key that we're tailoring the environment with is known

9:47 as configlets Apstra is a fantastic tool for being able to take an entire fabric

9:55 look at it holistically spin it up holistically monitor holistically what

10:00 Apstra doesn't do is it doesn't turn every bell knob and whistle inside of an

10:07 OS think about it it's got us Dell SONiC Cisco and Arista it does all of these

10:15 fabrics it's an uplift from them as in it masks all the details but it's really

10:22 hard to try and do thousands and thousands and thousands of knobs so we use configlets or little pieces of code

10:30 when we don't have a knob inside of Apstra to do something specific to what we're doing for AI/ML we wrote 20 I'm sorry we wrote 41

10:39 of them if you look up there in the far corner on the right 41 of these we're not using all of these at the same

10:46 time we have five teams inside of juniper all using that cluster sitting

10:52 over there all of them needing different tweaks and twists and knobs and uh profl

10:58 talked about flowlets and uh doing per packet spraying

11:03 and stuff we got to test all that so we do those with configlets so those are the two pieces of tailoring from an

11:10 operation standpoint once you drop into a blueprint you get all of the probe

11:16 setup uh and you get all of the gauges from an AI/ML standpoint we created

11:23 a custom dashboard what this custom dashboard tells us is the three things we're interested

11:29 in and it's those three pieces of telemetry that we wrote ECNs are we seeing them PFCs are

11:37 we seeing them and are we seeing drops this allows us to tell if the

11:45 environment is adhering to the lossless nature specifically what I did for this dashboard was this is a grouped

11:53 dashboard this is 12 hours of every interface in the fabric anybody want to

11:58 guess how many interfaces in the fabric we got 12 people in this room a lot somebody give me a number we we we got

12:05 96 GPUs 1:1 oversubscription how many

12:12 interfaces 152 too late I know everybody's gonna say 42 because it's the answer to the

12:17 universe but it's 152 if I want to get specific and I want to go down and see

12:24 what individual interfaces are doing I actually wrote some probes I'm sorry I wrote some widgets these widgets are

12:31 telling me by leaf by interface if it's seeing ECN at any moment in time I

12:38 wrote another one that goes through and looks at the PFCs because if you got ECN turned on

12:45 you more than likely have PFC turned on the belt and suspenders approach to making sure it stays

12:50 lossless and it's really important to try and understand where you're having the congestion in the environment and

12:56 who's being told to back off right so I wanted to be able to one give

13:02 you a high level overview say everything's good or yes sir what's running during

13:08 this demo I mean what's running on the cluster uh there's actually an Ixia in the background when they were generating

13:13 all this for me uh I'm actually going to show you live and last time I looked like an hour ago all my graphs were

13:21 zero so it doesn't make for a really interesting demo so to speak um that's

13:28 the other advantage of having it recorded I actually had them generate bad traffic for me the uh on the PFCs you can see

13:35 that the PFCs are coming in on different interfaces which is what we'd expect we expect the pause frames to be in places

13:41 other than the ECN so you can actually kind of in your brain if you know your cluster you can say oh yeah I can

13:47 understand the traffic flow and how that reacted now with the Apstra flow that

13:53 uh Monsour talked about that actually adds another dimension where we can map things together

13:59 but if you want to get even more specific if you want to go real time my

14:04 personal belief is like I said I've been doing this a long time uh since 81 I've been in this business most people don't

14:10 come to a networking person and say I have a problem right now right most

14:15 people come to a networking person and say I had a problem yesterday or I had a

14:21 problem an hour ago that's really what the graphs are about the probes are real time happening

14:29 now somebody walks into your office and says I got a problem I got that problem right now you can say boom let's go to

14:35 the probe let's see what we got and see how it's happening once we get beyond that part of it other pieces we use

14:42 operationally is we do look for anomalies and there was a question earlier about how do you troubleshoot

14:48 how do you see what's going on anomalies is how you see this and like I said

14:53 earlier right we had a number of anomalies but they were probe related and what this is telling us oops

14:59 see that up yeah I'm sto right there what those

15:07 what those anomalies were telling us was one we're having a power problem and when you go and see the tour you're

15:14 going to find if they let you look at the back you'll see they pulled a cord out of every one of my switches they

15:19 took the redundant power out because we needed power in the other parts of the lab so that's why I would expect

15:26 anomalies I always have 70 anomalies every day I look at it regardless of whether

15:32 everything's perfect I always have 70 anomalies because there's always a cord missing in the back end the other

15:39 anomalies we had was it said it was missing some telemetry and it was having some high

15:46 disk usage I'm missing telemetry because some of my boxes as you can see on the screen are running at 96%

15:54 capacity on their disks they're basically out of space and when I looked last night uh this was recorded last

15:59 week when I looked last night two of them were at 100% disk usage there is no space

16:05 left on those boxes and the reason being is since we have five teams working on this they like to keep

16:11 different OS builds and flip them on the fly on those boxes and they've literally

16:16 chewed up the disk space which is causing some of the telemetry to basically say I haven't got anywhere to do anything with

16:22 Telemetry at this point and so real time absolutely use it

16:30 we use it quite frequently in the lab because we want to know what's going on right now the other thing and Monsour

16:37 talked about this there's a staged and an active so you always have the what's running in production is

16:44 active staged is what am I going to do and from the staged you can actually

16:50 get a real-time heat map so if you're seeing congestion and you're seeing uh PFCs you can actually

16:59 go in here and you can crank up any box click on it and then click on

17:05 somewhere else in the environment and get what's known as a uh neighbors headroom topology I call it the butterfly

17:13 and in real time it'll give you a heat map of what's happening traffic-wise within the

17:19 environment so if you do have the luxury of somebody coming to you and saying right now I've got this problem you can

17:24 go boom this is what we're seeing in the environment and you can roll over any link get all the details you want to get

17:30 this gets back to the cabling question you know how do you know and Nick was pretty precise but not exactly

17:39 precise and we can give you a print out of cabling we can do it through LLDP we

17:45 can detect mismatches right now we don't need a diff it'll tell you if there's something wrong uh when it all goes out

17:51 there and does it the other thing that we we do

17:56 operationally if they're not using the Terraform and I use this personally because I'm not a Terraform

18:02 person I go in and I can manually tell it to apply a configlet if

18:09 you're not a point click person you don't want to use this but if you're a point click person you do want to use this and what I'm showing you right here

18:15 is I'm going in I pulled up a box and I said what's the current setting for

18:20 DLB which is dynamic load balancing and it was disabled so I'm going to take a box and I'm going to tell it you know

18:28 what go and apply a configlet for me and that configlet is I want you to

18:33 change the balancing to do flowlet but I'm proving to you right now that they

18:39 all are currently in a disabled State and that's literally going to the

18:45 Box real time grabbing it and saying you know what I don't have any balancing turned on at this moment in

18:52 time and now I'm sorry to interrupt but I I'm going to interrupt your flow

18:58 because I'm really curious it seems like you're sort of manually going in and

19:03 tweaking a predefined or an Apstra developed configuration right and Apstra was

19:08 managing that so how does that then affect Apstra and what does Apstra say

19:14 and how do you pick that up into everything in your configs and snapshots excellent question it's about to come up

19:20 right here in this demo thank you so uh one of the things you can do uh when you're doing the configlet is you can

19:26 apply leaf level spine level all levels you're going to apply it at a specific device level everything like

19:32 that once you do the apply it will go into the uncommitted state and the uncommitted

19:39 state is where we get into what you were just asking about it will go through and see what I'm going to do is it going

19:46 to violate intent if it's going to violate intent it will not let it go through and if

19:54 it uh adheres to all the rules let's say I tried to apply it to a box where this

20:00 won't work it won't let it go through and as part of that process once it's in the

20:07 uncommitted you have to go and commit it and that's exactly what I did I pushed a

20:12 little button called commit and it's going through that validation process what it did during that

20:18 validation process is three things it did exactly what you just asked which is is it going to cause me a problem if I

20:23 go and do this if it is don't do it it's literally going to reject it at that point

20:29 if it's not go and update the graph DB it's a graph DB back end uh and then push

20:36 it out to the individual pieces of the fabric that it's supposed to whether

20:42 that's all leaves all spines individual boxes whatever it is Apstra is atomic

20:48 and what I mean by that is you go into staged and you may do one two 100 changes

20:55 it puts it all together stages it it compares it and makes sure that all

21:02 those different changes put together first work and then match intent

21:08 so it's it's very rigorous when it goes to that stage of it uh and then it sends

21:15 it out to the boxes when it sends it to the boxes very much like junos it is all

21:21 at once so there's never a moment of a config this little bit of the config

21:27 especially when it gets security this little bit of config does a change this little bit does a change no no this is Atomic Boom it all goes and what I've

21:34 seen with Apstra is uh typically and I've talked to many customers about this

21:39 typically less than 30 seconds every single box in the environment will get the build and we have a couple of

21:45 customers in the multiple two to 400 range yes sir so so couple of questions
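[Editor's sketch] The atomic behaviour described above (stage any number of changes, validate them together against intent, then apply all of them or none) can be illustrated with a toy model. This is a sketch under stated assumptions, not Apstra's implementation. The `Fabric` class, the `lb_mode` setting name, and the rule `same_lb_mode_on_all_leaves` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Fabric:
    """Toy stand-in for a blueprint: active state plus a staging area."""

    active: dict = field(default_factory=dict)  # what is running now
    staged: dict = field(default_factory=dict)  # uncommitted changes

    def stage(self, device, setting, value):
        """Queue a change without touching the active state."""
        self.staged[(device, setting)] = value

    def commit(self, intent_rules):
        """Validate all staged changes together, then apply all or none."""
        candidate = {**self.active, **self.staged}
        if not all(rule(candidate) for rule in intent_rules):
            return False  # rejected: the active state is left untouched
        self.active = candidate  # every change lands in one step
        self.staged = {}
        return True


def same_lb_mode_on_all_leaves(state):
    """Hypothetical intent rule: leaves must agree on load-balancing mode."""
    modes = {
        value
        for (device, setting), value in state.items()
        if device.startswith("leaf") and setting == "lb_mode"
    }
    return len(modes) <= 1


if __name__ == "__main__":
    fabric = Fabric(
        active={("leaf1", "lb_mode"): "disabled", ("leaf2", "lb_mode"): "disabled"}
    )
    fabric.stage("leaf1", "lb_mode", "flowlet")
    print(fabric.commit([same_lb_mode_on_all_leaves]))  # only one leaf staged
    fabric.stage("leaf2", "lb_mode", "flowlet")
    print(fabric.commit([same_lb_mode_on_all_leaves]))  # both leaves staged
```

In this toy, a rejected commit leaves the staged changes in place so they can be corrected and committed again. Whether the real product behaves that way is an assumption.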

21:54 so uh in the background is this all done through SSH or are you using NETCONF

21:59 like how does it apply the configuration there's actually an agent that sets up on every switch and it's

22:07 talking it makes a secure connection to those agents and it's talking back and forth oh okay so it's like a container

22:14 running on each of these switches and then connecting back into yeah it's an agent of some sort not every not every

22:19 platform supports containers right right but like yeah yeah okay great and like how smart is this like staging so you

22:27 mentioned like you know it's like boom but sometimes that's not that like you know simple like if if it tries to go

22:34 and push to a box and it can't get there for some reason right it's pretty smart because I've actually run into this

22:39 personally I have I have my own lab uh since I'm in the field and I've had instances where they've cut a box or two

22:46 off from me my lab sits in Virginia and and I don't sit here I actually live in Phoenix and I'm freezing in this room

22:52 but you guys don't care about that the uh uh I've had it cut a box

22:59 off it has come back and told me I can't get to that box it literally will save those changes so next time it can

23:06 talk to that box then it'll push and make sure that it works on those boxes oh okay so so it's smart enough to say

23:12 like if I'm going to apply this change you're going to lose this particular service it actually well yeah it depends

23:20 on how you apply it but yeah but in my case it was my box wasn't reachable and instead of it just killing

23:28 the change on all boxes it pushed the change and for that one box it waited till it came back online and then it said oh I can see the

23:34 Box again and it notified me I couldn't see the box and it and then it took the change and it pushed it okay so there is

23:41 intelligence in there there is absolutely intelligence in there so how does that atomic push work for um like

23:48 Juniper has the um you can stage the config commit it and it all happens at once within the CLI other vendors you'd

23:55 have to do stuff like shut down an interface make the change re-enable it it puts it all it packages it all

24:01 together in the stream appropriate for each vendor it pushes it to the agent on the box and

24:07 the agent handles it from there okay it's a it's a solid push to the agent and then the agent will determine what

24:13 what steps have to happen yeah and that's really the beauty of it so you know like most people in this

24:18 industry I haven't always worked for Juniper right I've worked for other named vendors I just mentioned and you

24:25 know it's uh it really is nice to have that abstraction layer and not have to worry about all the did I do this did

24:32 I shut this down first did I do a no or do I do a delete or you know what's the right

24:38 nomenclature right it really abstracts all of that so Jay there was some

24:43 mention that uh after the change is actually committed there's some post

24:48 validation activities go on and if the change doesn't do something properly with post validation it can be rolled

24:55 back you yes you you can absolutely roll back and I'm actually going to roll back

25:01 right here um as soon as as soon as this proves to you that it actually worked

25:07 that what we did uh it will roll it back because uh I did this uh last

25:14 week Raj gave me a time slot part of his time and I went out and recorded this

25:20 and after I was done I I needed to make sure I rolled the environment back and we use this very heavily between our

25:25 five teams and you'll see that once it goes to it's actually going to look at three boxes proving to you that it

25:30 actually went to leaves actually went to spines because I told it to go to leaves and spines and you can see that the

25:37 different leaf it's a different leaf and it's now at flowlet mode as opposed to disabled mode and on a day-to-day basis

25:44 do we run this command we don't I really just ran it for you so you can see yes it really did do what it was supposed to

25:49 do right when Apstra says it does it we trust that it's done what it's supposed to do right and now it's

25:57 going to go to spine and do the same thing does this commit process allow you

26:02 to schedule things so you you know your team's working during the day they're staging all their configuration changes

26:09 but you know your policies require they roll at midnight yeah I thought that I thought that uh answer earlier I don't

26:15 remember who answered it was uh interesting I'm a very blatant

26:22 person I'm old I don't care anymore I'm very blatant um Apstra itself does not have a timing

26:30 mechanism the statement when it was asked earlier was put in the context of using tools like Terraform and stuff to

26:36 do the staging right but Apstra itself does not have a notion of schedule for this

26:43 time and oh oh so this was the how do you get it back so that's called Time

26:48 Voyager let me pause it here and as you can see on this screen uh we have some

26:53 golden config states and this is why we find it very valuable matter of fact yesterday I was speaking to a group of

27:00 higher education individuals and as you can imagine their clusters are based on

27:06 researchers inside the university so a lot of different teams have to use it and this was a feature they kind of like

27:12 latched on to the fact that you can save every change within the environment and

27:19 tell it I want to go back to a point in time so looking at this you can see that we have a golden config at 1:1 we had

27:26 one at 1:2 we have one for a completely different spine type because

27:31 we're testing different equipment inside the environment and being able to quickly take everything back

27:37 holistically is very important and I pushed that flowlet

27:44 configlet and what I want to do is pull it back and regardless of how far back I

27:49 go yeah go ahead are the snapshots uh change driven so as as each of these

27:54 change commitments are made there's a you know a pre and a post or something like that or are they time driven

28:00 like every half hour I'm taking a snapshot no it's not it's every time you actually make a

28:05 change and this is really really really important yeah regardless of where it

28:12 comes from and what I mean by that is whether it comes from Terraform or if you're into Python and

28:18 you do it through our Python SDK which I'm not a Python person I test it um or

28:24 you're like me I'm a person who uses sed awk and curl maybe a little jq once in a while right
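[Editor's sketch] The change-driven snapshot idea (one saved revision per commit, whatever tool made the change, with the option to go back to any earlier point) can be sketched as a simple revision list. This is an illustration only and not how Time Voyager is implemented. The `RevisionHistory` class, its methods, and the choice to record a rollback as a new revision are assumptions.

```python
import copy


class RevisionHistory:
    """Toy change-driven history: one snapshot per commit, no timers."""

    def __init__(self, initial_state):
        # Each entry is (label, source, state snapshot).
        self._revisions = [("initial", "system", copy.deepcopy(initial_state))]

    def commit(self, new_state, source, label=""):
        """Record a snapshot for this change and return its revision id."""
        self._revisions.append((label, source, copy.deepcopy(new_state)))
        return len(self._revisions) - 1

    def current(self):
        """Return a copy of the most recent state."""
        return copy.deepcopy(self._revisions[-1][2])

    def rollback(self, revision_id):
        """Restore an earlier snapshot by committing it as the newest revision."""
        state = self._revisions[revision_id][2]
        return self.commit(state, source="rollback", label=f"rollback to {revision_id}")

    def sources(self):
        """Who or what made each revision, oldest first."""
        return [source for _, source, _ in self._revisions]


if __name__ == "__main__":
    history = RevisionHistory({"lb_mode": "disabled"})
    history.commit({"lb_mode": "flowlet"}, source="ui", label="flowlet configlet")
    history.rollback(0)
    print(history.current())
    print(history.sources())
```

Recording the source with every revision is what lets a shared cluster answer who made a change and when before deciding to roll it back.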

28:31 so uh everything goes to the commit and it gets queued up in that

28:39 uncommitted you have to issue an API call if you're not using the Apstra UI that says

28:45 commit it and then it goes through the validation process and if it all gets validated it puts it on through and you

28:52 get your okay back if it doesn't get validated it gives you an error code and says you didn't get validated that's why

28:59 I say Apstra is a single source of truth everything any way you touch it once

29:04 Apstra is involved it is the gatekeeper and when you're in a mixed

29:10 mode environment and you're working with a lot of different teams I think that is

29:17 fantastic like I said I've been in this business a long time I've liked that notion of having this gatekeeper and

29:23 more importantly than that I can go back and see who did it when they did it and put this change in and if I don't like it I

29:30 can tell it get rid of it I want to get rid of it is there um backup recovery and

29:35 other security and protections on the uh database on the single source of Truth

29:41 database that would be the one area where uh what most customers do and

29:46 again I'm I'm out in the field with customers most customers in the field will take a snapshot about every hour of

29:52 the database of the database thank you and it's actually a bunch of databases M

29:58 right so it's it's a graph DB on the back end so you're actually going to see that here in a second um the terraform

30:05 stuff again if you want to find the terraform stuff uh what I did was there's a tab inside of the UI called

30:13 developers down here I clicked on it and we went to the Terraform we also have our Postman library if you're into

30:19 Python all the Python stuff is out in the Postman library with the SDK if you want to go against that and if you're

30:27 not into that and you're somebody like me sed grep uh you just go straight to the APIs and you can test them right

30:33 here real time uh on the platform itself and it came back and says I have three

30:39 Blueprints and we do have three Blueprints and you get all the details so it's really straightforward really

30:45 well documented I've worked with a lot of systems over the years where it's not nearly as nicely documented and if you

30:50 really want to see the graph DB there it is sweet I know lines and circles woohoo
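[Editor's sketch] The lines and circles on screen are nodes and relationships. One way to picture why a graph is useful here is that each node can carry a small piece of configuration, and a device's configuration can be assembled on demand by walking its relationships. The sketch below is a toy illustration. The node ids, the relationship name `hosted_interfaces`, the fragments, and the `render` function are hypothetical and do not reflect the actual Apstra graph schema.

```python
# Nodes carry small configuration fragments (illustrative values only).
NODES = {
    "leaf1": {"type": "system", "fragment": "set system host-name leaf1"},
    "leaf1:et-0/0/1": {
        "type": "interface",
        "fragment": "set interfaces et-0/0/1 mtu 9216",
    },
    "leaf1:et-0/0/2": {
        "type": "interface",
        "fragment": "set interfaces et-0/0/2 mtu 9216",
    },
    "leaf2": {"type": "system", "fragment": "set system host-name leaf2"},
}

# Relationships as (source node, relationship name, target node).
EDGES = [
    ("leaf1", "hosted_interfaces", "leaf1:et-0/0/1"),
    ("leaf1", "hosted_interfaces", "leaf1:et-0/0/2"),
]


def render(system_id):
    """Assemble a device's configuration from its node and its neighbours."""
    lines = [NODES[system_id]["fragment"]]
    for source, _relationship, target in EDGES:
        if source == system_id:
            lines.append(NODES[target]["fragment"])
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("leaf1"))
```

Nothing in this toy stores a finished configuration file; the text is produced each time from whatever the graph currently holds.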

31:01 right but more importantly if you really want to understand abstra so there I'm going to

31:07 pause it right here there was a statement and I forgot who asked the question earlier about something about

31:13 the configuration and could you get the configuration when you're inside of Apstra

31:19 there is a button and I love this question I've had a lot of customers ask this where's the configuration

31:24 stored inside of Apstra yes it's a graph DB that's right it's all

31:30 about the relationship when you go into Apstra and you click and you say I want to see the running configuration you

31:37 click on a button that says render you don't go and say dump me the stored configuration because it's

31:44 rendered real time from everything graph DB knows because it knows exactly what's

31:50 on the box so all of these little Bubbles and circles have bits and pieces of the

31:56 configuration in it and it pulls it all together in real time for you so I think that's fantastic but again I'm old

32:05 school and you can get a lot of detail about every little entity and how

32:10 entities interact and I think of Apstra this

32:16 way it's the heart of our management for our AI cluster

32:22 environment and that's how we use it but more importantly what I want to do is let's go to the

32:29 Apstra so this is real time what's happening in that building right now oh

32:36 it's changed since I was last in here look at all that red look at all that red service anomalies so uh and this is

32:44 the big reason I wanted to show it the uh I'll have to go see about this this wasn't there a while ago the 18

32:50 wasn't there the 70 I was expecting 70 because 70 is always there because of the power stuff the more interesting

32:57 ones to me me right now are the fact that the storage and the management both are showing

33:02 two and the question was about how do you troubleshoot knowing my

33:08 cluster I almost never ever see a red in storage or in a management and the fact

33:14 I see two and two tells me that more than likely something is broken at the

33:20 GPU side because the GPU has a lot of interfaces the G the actual GPU

33:28 side of it that's doing the back-end GPU was fine earlier but the storage and the management wasn't and if I go in here

33:34 and I look at one of these and I click on service anomaly it tells me on Leaf 2 the

33:43 physical port s and The Logical Port assigned to that physical Port is having

33:49 a problem on Leaf 2 and if I look at it and click down into it at the

33:56 physical is telling me that the H100 04 is having a connectivity problem to

34:03 Port 7 on the storage side if I go back up and I go to this

34:12 service anomaly this is telling me that leaf one

34:20 again Port 7 with the logical interface 7.0

34:29 and I look at the physical guess where it's going same

34:35 place so that tells me right away that H100 04 needs to be looked at the funny part

34:42 is an hour ago I sent a message to somebody said you need to go look at this and all of a sudden we got 18 other

34:48 anomalies so I have no idea what they went and looked at but that's how I we

34:53 use it for troubleshooting we look for the service anomalies and Apstra is very different uh alerts and alarms somebody

34:59 used the term alerts and alarms earlier uh when talking Apstra is about

35:05 anomalies and it it's a very different context than alerts and alarms to me

35:13 alerts and alarms traditionally are we send it to a management system that management system somebody will look at

35:19 and go oh yeah you know that's a minor one let's acknowledge it and say you know we know it's there blah blah blah

35:24 it sits back in a queue Apstra will not let you acknowledge an anomaly it actually

35:30 expects you to go fix it it will stay red until you fix

35:36 it so you got one option fix it or it

35:41 stays red it won't let you just blow it off if it's if it's important enough that it made an anomaly it believes it

35:48 needs to be fixed so uh a little bit different philosophy and how to address it just to follow up on that note

35:55 there so for the 70 that you have there all the time you said it's for the duplicate uh the Redundant power supply

36:02 you would have to go back into your design then and say I don't have redundant power supplies anymore to

36:07 clear them yes I could actually go into the probe modify the probe to tell

36:13 it it's only got one power and this is the power that it's using but you know

36:19 yes the the other method is is yeah because they tell me in the lab I'm never going to get the power back I

36:24 would right but it makes for good demo
