Design, Deploy, and Operate AI Clusters like a Pro with Juniper Networks
Cloud Field Day 20: Design, Deploy, and Operate AI Clusters like a Pro
Struggling with where to start with your on-prem AI training cluster? Juniper validated designs (JVDs) are rigorously pre-tested to make sure your deployments are relatively pain-free, and we now offer JVDs to meet the specific needs of AI data centers. See how Apstra does the Day 0/1/2 heavy lifting for you with intent-based automation.
You’ll learn
How to manage AI clusters like a hyperscaler with Apstra
How Apstra uses explicit congestion notifications (ECNs) and priority flow control (PFC) counters to monitor the health and performance of AI clusters
How Apstra’s anomaly detection and rollback features can benefit the AI data center
Transcript
0:10 uh I am not from product management I am not from product marketing I am what is
0:16 known as part of the X domain architecture team which means I'm in the field I work directly with customers day
0:24 in day out architecting Solutions uh how I got nominated to come
0:30 up here and talk about this is I've pretty much spent my entire career in the data center uh spent decades in HPC
0:37 so AI is definitely a sister or a cousin to HPC and uh the folks down in our POC
0:45 lab which you all are going to go see or what we call our AI Innovation lab uh
0:50 uh asked me to work with them in figuring out how we Monitor and manage the environment you already heard from
0:58 the earlier presenters about everything leading up to it and I'm supposed to be the demo and I was
1:05 told uh originally I was going to do a live demo then I was told let's record it because as we all know something can
1:12 one way it goes right A Thousand Ways it goes wrong I've been doing this long time so I do have a pre-recorded demo
1:20 but I can also jump to the live setup what I do not want to do there's actually somebody scheduled in there
1:26 right now I do not want to make any changes but I am happy absolutely happy to jump over there but before we get
1:32 there I want to go through a couple things uh this is really important to understand and I'm very pedantic about
1:38 words uh every every presenter up here said that Apstra does data centers okay
1:46 Apstra absolutely can do data centers but Apstra is really focused on Fabrics
1:53 inside of data centers you're going to have one Apstra instance that can do
1:58 many Fabrics and it can actually do many data centers we actually have customers that have a single install of the
2:05 Apstra instance and they're managing multiple data centers so it kind of bugs
2:10 me when that's our marketing and product people and they come up here and they say Hey you know it does it's data
2:16 center F okay whatever it it really builds fabrics and in the world of AI
2:24 there's typically four of them and we've talked about three of them and Praful actually mentioned all four he talked
2:29 about the front Network or the inference side of it so this was our setup and
2:34 you've seen this slide already but I want to break it down so it makes more sense these are the three Fabrics uh we
2:43 tried to rearrange this picture a different way to to make it look more logical but it just didn't fit well in
2:49 the geometry so the gpus are right here in the center this is the GPU to GPU
2:57 traffic down here the purple is the GPU to
3:02 storage and what you really see is that's a set of leaves over there if
3:08 you're thinking spine Leaf architecture that's the core for it that's the only way we can make this picture fit and
3:14 make some sense and then this is our front end management fabric where slurm and everything else lives good level
3:23 set okay by the way I'm pretty interactive person
3:30 the theme is think of on-prem AI clusters like you
3:37 would if you're a hyperscaler and Nick and Raj talked all about the terraform
3:42 stuff they've done absolutely uh Apstra has been around for 10 years it's actually its 10 year anniversary this
3:49 year uh Juniper acquired them about three years two months ago and one of
3:56 the things that I really like about Apstra is the fact that at the very base where at the foundation is
4:02 intent and as Raj mentioned think of it like the declarative nature that you get with terraform and intent is very very
4:10 critical and the fact that it's the foundation is even more critical a lot of vendors have been talking about
4:15 intent the last couple years what they when they talk about intent they talk about the fact they've layered it on top
4:21 and they're trying to press it down into what they'd already built this is on the bottom and it permeates every level
4:29 within the stack of the Apstra software so the day Zero the day one and the day
4:35 two plus everything is predicated on that intent and there was a question earlier this morning about validation
4:41 and you know how to if something isn't right because of that intent being at the base every change you make goes
4:49 through that intent validation process if it doesn't match it kicks it back and
4:54 will not let it go but this is a really really important differentiator when you start thinking about it the other thing
5:00 that it does is it implies that Apstra is the single source of Truth so in this workflow whether you're
5:08 using terraforms or something else in this workflow Apstra is the source of Truth and I will show you how that comes
5:15 about and what I'm really going to focus on is not only the intent I'm going to focus on the Telemetry because this is
5:21 where it gets very specific about what we're doing for our AI cluster I want to
5:27 point out that what I'm going to show you the Apstra I'm going to show you is the exact same Apstra every one of you
5:33 here and everybody out there in The Ether can go download from our website and deploy we have not changed it in any
5:42 fashion from the core features all we've really done is leverage the features that exist no
5:52 modifications whoops backwards it doesn't help uh I wanted to give uh Nick
5:57 and Raj another hand up because they already did the day Zero and day one stuff for us right so this is again it's
6:05 it's a valid uh QR code it's not going to sneak up and bite you you think there's something hidden
6:13 there and I can't emphasize this enough at the heart of our JVDs is
6:19 Apstra and even if you're not using Juniper which you know as a Juniper
6:25 person I just say we would love for you to use a juniper but even if you're not it's a great reference architecture to
6:31 go see what we have done inside of the environment specific to AIML so I just
6:37 want to point that out again and what I'm going to demonstrate is how we how we've tailored the
6:43 environment again we haven't put anything special in there but we've leveraged features that exist already
6:50 then what we do day-to-day operating it yes some of it is terraform driven but we also have instances where we manually
6:57 do stuff with it and then how we develop against it all right so this is the
7:02 pre-recorded demo uh when you first come in you land and you've seen this page a couple times you land on what's known as
7:08 blueprint page think house right design
7:14 build deploy so everything starts with blueprint or as I like to call it it's Fabrics these are three Fabrics those
7:21 are the three Fabrics the yellow red purple and in the far left we have the
7:27 GPU to GPU so the back end fabric for the GPU traffic in the middle is the storage to GPU traffic and on the far
7:35 right we have the front end management fabric as you can see this screen tells you exactly what's happening in
7:42 the environment at any given moment you can see that we have some anomalies on the backend fabric for the gpus but
7:50 notice that there's two types there is a service anomaly and there's a probe anomaly service anomaly means it's
7:57 impacting actual traffic in the environment probe means there's some
8:02 probes that have been turned on and that we're having some conditions you need to go investigate them but it's not
8:08 actually impacting traffic flow all green outstanding so where I'm going to jump
8:14 now is how are we actually tailoring the environment we're tailoring the environment by creating some custom
8:21 Telemetry again all we did was make custom Telemetry the feature exists
8:26 already we went out and we pulled telemetry for ECN so earlier Praful
8:31 talked about these which is explicit congestion notifications so what happens is a switch will say I'm seeing
8:39 congestion by the time it gets to the end device the end device will say oh I've
8:45 seen that you've marked a packet send a notification to the far end when it gets to the far end you may
8:52 or may not need to turn on priority flow control in other words pause frames which is we're also collecting
8:59 pause frame counters and we're monitoring queue drops so three collectors we made three pieces of custom Telemetry
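A minimal sketch of what one of these collectors has to do, assuming the counters come back from the device as plain text. The output layout, field labels, and interface names below are hypothetical (the talk does not show the actual command or its output); only the three things being tracked (ECN marks, PFC pause frames, queue drops) come from the presentation.

```python
import re

# Hypothetical device output; the real command and field labels are not given in the talk.
SAMPLE_OUTPUT = """\
Interface: et-0/0/1
  ECN marked packets: 1204
  PFC pause frames received: 37
  Queue tail drops: 0
Interface: et-0/0/2
  ECN marked packets: 0
  PFC pause frames received: 0
  Queue tail drops: 5
"""

# One pattern per counter: this is the "map three fields to something usable" step.
FIELD_PATTERNS = {
    "ecn_marked": re.compile(r"ECN marked packets:\s*(\d+)"),
    "pfc_pause": re.compile(r"PFC pause frames received:\s*(\d+)"),
    "queue_drops": re.compile(r"Queue tail drops:\s*(\d+)"),
}


def parse_counters(text):
    """Return {interface: {counter_name: int}} from CLI-style text."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"Interface:\s*(\S+)", line)
        if header:
            # Start a new interface record with all counters at zero.
            current = header.group(1)
            result[current] = {name: 0 for name in FIELD_PATTERNS}
            continue
        if current is None:
            continue
        for name, pattern in FIELD_PATTERNS.items():
            match = pattern.search(line)
            if match:
                result[current][name] = int(match.group(1))
    return result


counters = parse_counters(SAMPLE_OUTPUT)
```

Keyed by interface like this, the three values can feed a dashboard or a threshold check without caring how the text was obtained.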
9:06 why did we need to make three pieces of custom Telemetry we had to make those because the actual fact that
9:13 we're tracking them is relatively new in the OS on the boxes we don't have
9:20 streaming gRPC sensors yet so we literally are taking a uh CLI command
9:27 pulling back the data and we're saying give us these three fields map them to something we can do with them and then
9:34 let me build off of that at some point and that's the first key to what we're
9:40 doing specific to AIML are these collectors the second key that we're tailoring the environment with is known
9:47 as configlets Apstra is a fantastic tool for being able to take an entire fabric
9:55 look at it holistically spin it up holistically monitor holistically what
10:00 Apstra doesn't do is it doesn't turn every bell knob and whistle inside of an
10:07 OS think about it it's got us Dell SONiC Cisco and Arista it does all of these
10:15 Fabrics it's an uplift from them as in it masks all the details but it's really
10:22 hard to try and do thousands and thousands and thousands of knobs so we use configlets or little pieces of code
10:30 when we don't have a knob inside of Apstra to do something specific to what we're doing for AIML we wrote 20 I'm sorry we wrote 41
10:39 of them if you look up there in the far corner on the right 41 of these we're not using all of these at the same
10:46 time we have five teams inside of Juniper all using that cluster sitting
10:52 over there all of them needing different tweaks and twists and knobs and uh Praful
10:58 talked about flowlets and uh doing per packet spraying
11:03 and stuff we got to test all that so we do those with configlets so those are the two pieces of tailoring from an
11:10 operation standpoint once you drop into a blueprint you get all of the probe
11:16 setup uh and you get all of the gauges from an AIML standpoint we created
11:23 a custom dashboard what this custom dashboard tells us is the three things we're interested
11:29 in and it's those three pieces of telemetry that we wrote ECNs are we seeing them PFCs are
11:37 we seeing them and are we seeing drops this allows us to tell if the
11:45 environment is adhering to the lossless nature specifically what I did for this dashboard was this is a grouped
11:53 dashboard this is 12 hours of every interface in the fabric anybody want to
11:58 guess how many interfaces in the fabric we got 12 people in this room a lot somebody give me a number we we we got
12:05 96 GPUs 1:1 oversubscription how many
12:12 interfaces 152 too late I know everybody's gonna say 42 because it's the answer to the
12:17 universe but it's 152 you go if I want to get specific and I want to go down and see
12:24 what individual interfaces we're doing I actually wrote some probes I I'm sorry I wrote some widgets these widgets are
12:31 telling me by leaf by interface if it's seeing ECN at any moment in time I
12:38 wrote another one that goes through and looks at the PFCs because if you got ECN turned on
12:45 you more than likely have PFCs turned on the belt and suspenders approach to making sure it stays
12:50 lossless and it's really important to try and understand where you're having the congestion in the environment and
12:56 who is being who's being told to back off right so I wanted to be able to one give
13:02 you a high level overview say everything's good or yes sir what's running during
13:08 this demo I mean what's running on the cluster uh there's actually an Ixia in the background when they were generating
13:13 all this for me uh I'm actually going to show you live and last time I looked like an hour ago all my graphs were
13:21 zero so it doesn't make for a really interesting demo so to speak um that's
13:28 the other advantage of having it recorded I I actually had them generate bad traffic for me the uh on the PFCs you can see
13:35 that the pfcs are coming in on different interfaces which is what we'd expect we expect the pause frames to be in places
13:41 other than the ecn so you can actually kind of in your brain if you know your cluster you can say oh yeah I can
13:47 understand the traffic flow and how that reacted now with the Apstra Flow that uh
13:53 uh Mansour talked about that actually adds another dimension where we can map things together
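The per-leaf, per-interface widgets described above amount to grouping counter samples and flagging the non-zero ones. The sketch below assumes hypothetical sample data and names; it only illustrates the idea that ECN marks and PFC pause frames typically show up on different interfaces, which is what lets you reason about the traffic flow.

```python
from collections import defaultdict

# Hypothetical samples: (leaf, interface, ecn_marked, pfc_pause_frames)
SAMPLES = [
    ("leaf1", "et-0/0/1", 120, 0),
    ("leaf1", "et-0/0/2", 0, 14),
    ("leaf2", "et-0/0/1", 0, 0),
    ("leaf2", "et-0/0/7", 45, 0),
]


def active_interfaces(samples):
    """Group interfaces by leaf, separately for ECN and PFC activity."""
    ecn = defaultdict(list)
    pfc = defaultdict(list)
    for leaf, ifname, ecn_count, pfc_count in samples:
        if ecn_count > 0:
            ecn[leaf].append(ifname)   # congestion was signalled here
        if pfc_count > 0:
            pfc[leaf].append(ifname)   # this port was told to pause
    return dict(ecn), dict(pfc)


ecn_by_leaf, pfc_by_leaf = active_interfaces(SAMPLES)
```

With this data, `leaf1` shows ECN on one port and pause frames on another, the pattern the speaker says he expects to see.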
13:59 but if you want to get even more specific if you want to go real time my
14:04 personal belief is like I said I've been doing this a long time uh since 81 I've been in this business most people don't
14:10 come to a networking person and say I have a problem right now right most
14:15 people come to a networking person and say I had a problem yesterday or I had a
14:21 problem an hour ago that's really what the graphs are about the probes are real time happening
14:29 now somebody walks into your office and says I got a problem I got that problem right now you can say boom let's go to
14:35 the probe let's see what we got and see how it's happening once we get beyond that part of it other pieces we use
14:42 operationally is we do look for anomalies and there was a question earlier about how do you troubleshoot
14:48 how do you see what's going on anomalies is how you see this and like I said
14:53 earlier right we had a number of anomalies but they were probe related and what this is telling us oops
14:59 see that up yeah I'm sto right there what those
15:07 what those anomalies were telling us was one we're having a power problem and when you go and see the tour you're
15:14 going to find if they let you look at the back you'll see they pulled a cord out of every one of my switches they
15:19 took the Redundant power out because we needed power in the other parts of the lab so that's why I would expect
15:26 anomalies I always have 70 anomalies every day I look at it regardless of
15:32 everything's perfect I always have 70 anomalies because there's always a cord missing in the back end the other
15:39 anomalies we had was it said it was missing some Telemetry and it was having some high
15:46 disk usage I'm missing Telemetry because some of my boxes as you can see on the screen are running at 96%
15:54 capacity on their disks they're basically out of space and when I looked last night uh this was recorded last
15:59 week when I looked last night two of them were 100% disk usage there is no space
16:05 left on on those boxes and the reason being is since we have five teams working on this they like to keep
16:11 different OS builds and flip them on the fly on those boxes and they've literally
16:16 chewed the disk space which is causing some of the Telemetry to basically say I haven't got anywhere to do anything with
16:22 Telemetry at this point and so real time absolutely use it
16:30 we use it quite frequently in the lab because we want to know what's going on right now the other thing and Mansour
16:37 talked about this there's a staged and an active so you always have the what's running in production is
16:44 active staged is what am I going to do and from the staged you can actually
16:50 get a real-time heat map so if you're seeing congestion and you're seeing uh PFCs you can actually
16:59 go in here and you can crank up any box click on it and then click on
17:05 somewhere else in the environment and get what's known as a uh neighbors headroom topology I call it the butterfly
17:13 and in real time it'll give you a heat map of what's happening traffic-wise within the
17:19 environment so if you do have the luxury of somebody coming to you and saying right now I've got this problem you can
17:24 go boom this is what we're seeing in the environment and you can roll over any link get all the details you want to get
17:30 this gets back to the cabling question you know how do you know and Nick was pretty precise but not exactly
17:39 precise and we can give you a print out of cabling we can do it through LLDP we
17:45 can detect mismatches right now we don't need a diff it'll tell you if there's something wrong uh when it all goes out
17:51 there and does it the other thing that we we do
17:56 operationally if they're not using the terraforms and I use this personally because I'm not a terraform
18:02 person I go in and I can manually tell it to apply a configlet if
18:09 you're not a point click person you don't want to use this but if you're a point click person you do want to use this and what I'm showing you right here
18:15 is I'm going in I pulled up a box and I said what's the current setting for
18:20 DLB which is dynamic load balancing and it was disabled so I'm going to take a box and I'm going to tell it you know
18:28 what go go and apply a configlet for me and that configlet is I want you to
18:33 change the balancing to do flowlet but I'm proving to you right now that they
18:39 all are currently in a disabled State and that's literally going to the
18:45 Box real time grabbing it and saying you know what I don't have any balancing turned on at this moment in
18:52 time and now I'm sorry to interrupt but I I'm going to interrupt your flow
18:58 because I'm really curious it seems like you're sort of manually going in and
19:03 tweaking a predefined or an Apstra developed configuration right and Apstra was
19:08 managing that so how does that then affect Apstra and what does Apstra say
19:14 and and how do you pick that up into everything in your configs and snapshots excellent question it's about to come up
19:20 right here in this demo thank you so uh uh one of the things you can do uh when you're doing the configlet is you can
19:26 apply Leaf level spine level all levels you're going to apply it at at a specific device level everything like
19:32 that once you do the apply it will go into the uncommitted state and the uncommitted
19:39 state is where we get into what you were just asking about it will go through and see what I'm going to do is it going
19:46 to violate intent if it's going to violate intent it will not let it go through and if
19:54 it uh adheres to all the rules let's say I tried to apply it to a box where this
20:00 won't work it won't let it go through and as part of that process once it's in the
20:07 uncommitted you have to go and commit it and that's exactly what I did I pushed a
20:12 little button called commit and it's going through that validation process what it did during that
20:18 validation process is three things it did exactly what you just asked which is is it going to cause me a problem if I
20:23 go and do this if it is don't do it it's it's literally going to reject it at that point
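The stage, validate, commit-or-reject flow described here can be pictured with a small model. This is only an illustrative sketch: the class, the rule format, and the device names are hypothetical and say nothing about how Apstra implements intent validation internally.

```python
class IntentViolation(Exception):
    """Raised when a staged change set does not match intent."""


# Hypothetical intent: which devices are in the fabric and which settings each role accepts.
INTENT = {
    "devices": {"leaf1": "leaf", "leaf2": "leaf", "spine1": "spine"},
    "allowed": {"leaf": {"dlb_mode"}, "spine": {"dlb_mode"}},
}


class Blueprint:
    def __init__(self, intent):
        self.intent = intent
        self.active = {}   # what is running: {(device, setting): value}
        self.staged = []   # uncommitted changes

    def stage(self, device, setting, value):
        self.staged.append((device, setting, value))

    def validate(self):
        """Check every staged change against intent and collect violations."""
        errors = []
        for device, setting, _ in self.staged:
            role = self.intent["devices"].get(device)
            if role is None:
                errors.append(f"{device}: not part of this fabric")
            elif setting not in self.intent["allowed"][role]:
                errors.append(f"{device}: {setting} not valid for role {role}")
        return errors

    def commit(self):
        """Apply all staged changes, or none of them if validation fails."""
        errors = self.validate()
        if errors:
            raise IntentViolation("; ".join(errors))
        for device, setting, value in self.staged:
            self.active[(device, setting)] = value
        self.staged = []


bp = Blueprint(INTENT)
bp.stage("leaf1", "dlb_mode", "flowlet")
bp.stage("leaf9", "dlb_mode", "flowlet")  # leaf9 is not in the intent
try:
    bp.commit()
    rejected = False
except IntentViolation:
    rejected = True
```

The point of the model is that one bad change rejects the whole commit, so nothing reaches the active state until the full set validates.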
20:29 if it's not go and update the graph DB it's a graph DB back end uh and then push
20:36 it out to the individual pieces of the fabric that it's supposed to whether
20:42 that's all leaves all spines individual boxes whatever it is Apstra is atomic
20:48 and what I mean by that is you go into staged and you may do one two 100 changes
20:55 it puts it all together stages it compares it and makes sure that all
21:02 those different changes put together first work and then match intent
21:08 so it's it's very rigorous when it goes to that stage of it uh and then it sends
21:15 it out to the boxes when it sends it to the boxes very much like Junos it is all
21:21 at once so there's never a moment of a config this little bit of the config
21:27 especially when it gets to security this little bit of config does a change this little bit does a change no no this is Atomic Boom it all goes and what I've
21:34 seen with Apstra is uh typically and I've talked to many customers about this
21:39 typically less than 30 seconds every single box in the environment will get the build and we have a couple of
21:45 customers in the multiple two to 400 range yes sir so so couple of questions
21:54 so uh in the background this is all done through SSH or are you using NETCONF
21:59 or like how how does it apply the configuration there's actually an agent that sets up on every switch and it's
22:07 talking it makes a secure connection to those agents and it's talking back and forth oh okay so it's like a container
22:14 running on each of these switches and then connecting back into yeah it's an agent of some sort not every not every
22:19 platform supports containers right right but like yeah yeah okay great and like how smart is this like staging so you
22:27 mentioned like you know it's like boom but sometimes that's not that like you know simple like if if it tries to go
22:34 and push to a box and it can't get there for some reason right it's pretty smart because I've actually run into this
22:39 personally I have I have my own lab uh since I'm in the field and I've had instances where they've cut a box or two
22:46 off from me my lab sits in Virginia and and I don't sit here I actually live in Phoenix and I'm freezing in this room
22:52 but you guys don't care about that the uh uh I've had it cut a box
22:59 off it it has come back and told me I can't get to that box it literally will save those changes so next time it can
23:06 talk to that box and then it'll push and make sure that that it works on those boxes oh okay so so it's smart enough to say
23:12 like if I'm going to apply this change you're going to lose this particular service it actually well yeah it depends
23:20 on how you apply it but yeah but in my case it was my box wasn't reachable and instead of it just killing
23:28 all boxes it held the change for that one box it waited till it came back online and then it said oh I can see the
23:34 Box again and it notified me I couldn't see the box and it and then it took the change and it pushed it okay so there is
23:41 intelligence in there there is absolutely intelligence in there so how does that Atomic push work for um like
23:48 Juniper has the the um you can stage the config commit it and it all happens at once within the CLI other vendors you you'd
23:55 have to do stuff like shut down an interface make the change re-enable it it it puts it all it packages it all
24:01 together in the stream appropriate for each vendor it pushes it to the agent on the box and
24:07 the agent handles it from there okay it's a it's a solid push to the agent and then the agent will determine what
24:13 what steps have to happen yeah and that it's that's really the beauty of it so you know like most people in this
24:18 industry I haven't always worked for Juniper right I've worked for other vendors I just mentioned and you
24:25 know it's uh it it it really is nice to have that abstraction layer and not have to worry about all the did I do this did
24:32 I shut this down first did I do a no or do I do a a delete or you know what's what's what do you what's the right
24:38 nomenclature right it really abstracts all of that so Jay there was some
24:43 mention that uh after the change is actually committed there's some post
24:48 validation activities go on and if the change doesn't do something properly with post validation it can be rolled
24:55 back you yes you you can absolutely roll back and I'm actually going to roll back
25:01 right here um as soon as as soon as this proves to you that it actually worked
25:07 that that what we did uh we it will roll it back because uh I did this uh last
25:14 week Raj gave me a time slot part of his time and I went out and recorded this
25:20 and after I was done I I needed to make sure I rolled the environment back and we use this very heavily between our
25:25 five teams and you'll see that once it goes to it's actually going to look at three boxes proving to you that it
25:30 actually went to leaves actually went to spines because I told it to go to leaves and spines and you can see that the
25:37 different Leaf it's a different leaf and it's now at flowlet mode as opposed to disabled mode and on a day to day basis
25:44 do we run this command we don't I really just ran it for you so you can see yes it really did do what it was supposed to
25:49 do it did right when when Apstra says it does it we trust that it's done what it's supposed to do right and now it's
25:57 going to go to spine and do the same thing does this commit process allow you
26:02 to schedule things so you you know your team's working during the day they're staging all their configuration changes
26:09 but you know your policies require they roll at midnight yeah I thought that I thought that uh answer earlier I don't
26:15 remember who answered it was uh interesting I'm I'm a I'm a very blunt
26:22 person I'm old I don't care anymore I'm very blunt um Apstra itself does not have a timing
26:30 mechanism the statement when it was asked earlier was put in the context of using tools like terraform and stuff to
26:36 do the staging right but Apstra itself does not have a notion of schedule for this
26:43 time and oh oh so this was the how do you get it back so that's called Time
26:48 Voyager let me pause it here and as you can see on this screen uh we have some
26:53 golden config States and this is why we find it very valuable matter of fact yesterday I was speaking to a group of
27:00 higher education individuals and as you can imagine their clusters are based on
27:06 researchers inside the university so a lot of different teams have to use it and this was a feature they kind of like
27:12 latched on to the fact that you can save every change within the environment and
27:19 tell it I want to go back to a point in time so looking at this you can see that we have a golden config at 1:1 we had
27:26 one at 1:2 we have one for a completely different spine type because
27:31 we're we're testing different equipment inside the environment and being able to quickly take everything back
27:37 holistically is very important and I pushed that flowlet
27:44 configlet and what I want to do is pull it back and regardless of how far back I
27:49 go yeah go ahead are the snapshots uh change driven so as as each of these
27:54 change commitments are made there's a you know a pre and a post or something like that or are they time driven this
28:00 like every half hour I'm I'm taking a snapshot no it's not taking this it's not it's every time you actually make a
28:05 change and this is really really really important yeah regardless of where it
28:12 comes from and what I mean by that is whether it comes from terraform or if you're into Python and
28:18 you do it through our python SDK which I'm not a python person I test it um or
28:24 you're like me I'm a person who uses sed grep and curl maybe a little jq once in a while right
28:31 so uh everything goes to the commit and it gets queued up in that
28:39 uncommitted you have to issue an API call if you're not using the Apstra UI that says
28:45 commit it and then it goes through the validation process and if it all gets validated it puts it on through and you
28:52 get your okay back if it doesn't get validated gives you an error code and says you don't get validated that's why
28:59 I say Apstra is a single source of Truth everything any way you touch it once
29:04 Apstra is involved it is the gatekeeper and when you're in a mixed
29:10 mode environment and you're working with a lot of different teams I think that is
29:17 fantastic like I said I've been in this business a long time I've liked that notion of having this gatekeeper and
29:23 more importantly than that I can go back and see who did it when they did it and put this change and if I don't like it I
29:30 can tell get rid of it I want to get rid of it is there um backup recovery and
29:35 other security and protections on the uh database on the single source of Truth
29:41 database that would be the one area where uh what most customers do and
29:46 again I'm I'm out in the field with customers most customers in the field will take a snapshot about every hour of
29:52 the database of the database thank you and it's actually a bunch of databases M
29:58 right so it's it's a graph DB on the back end so you're actually going to see that here in a second um the terraform
30:05 stuff again if you want to find the terraform stuff uh what I did was there's a tab inside of the UI called
30:13 developers down here I clicked on it and we went to the terraform we also have our Postman Library if you're into
30:19 python all the python stuff is out in the postman library with the SDK if you want to go against that and if you're
30:27 not in that and you're somebody like me sed grep uh you just go straight to the APIs and you can test them right
30:33 here real time uh on the platform itself and it came back and says I have three
30:39 Blueprints and we do have three Blueprints and you get all the details so it's really straightforward really
30:45 well documented I've worked with a lot of systems over the years where it's not nearly as nicely documented and if you
30:50 really want to see the graph DB there it is sweet I know lines and circles woohoo
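For readers who would rather script the "list my blueprints" call than click through the UI, here is a rough sketch using only the Python standard library. The host name is made up, and the endpoint paths (`/api/aaa/login`, `/api/blueprints`), the `AuthToken` header, and the `items`/`label` response shape are assumptions to verify against the API reference exposed on the platform; the talk itself only shows that the call returns three blueprints. Nothing is sent over the network here; the functions just build the requests and parse a sample response.

```python
import json
import urllib.request

BASE_URL = "https://apstra.example.com"  # hypothetical host


def login_request(username, password):
    """Build (but do not send) a login request; the path is an assumption."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/aaa/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def blueprints_request(token):
    """Build a request that lists blueprints; header name is an assumption."""
    return urllib.request.Request(
        f"{BASE_URL}/api/blueprints",
        headers={"AuthToken": token},
        method="GET",
    )


def blueprint_labels(payload):
    """Extract blueprint labels from a JSON response body (assumed shape)."""
    return [item["label"] for item in json.loads(payload)["items"]]


# Made-up response mirroring the three fabrics in the demo.
SAMPLE_RESPONSE = json.dumps({"items": [
    {"id": "a1", "label": "gpu-backend"},
    {"id": "b2", "label": "storage"},
    {"id": "c3", "label": "frontend-mgmt"},
]})
labels = blueprint_labels(SAMPLE_RESPONSE)
```

To run it against a real instance, the built requests would be passed to `urllib.request.urlopen`, with certificate handling appropriate to the environment.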
31:01 right but more importantly if you really want to understand Apstra so there I'm going to
31:07 pause it right here there was a statement and I forgot who asked the question earlier about something about
31:13 the configuration and could you get the configuration when you're inside of Apstra
31:19 there is a button and I I love this question I've had a lot of customers ask this where's the configuration
31:24 stored inside of Apstra yes it's a graph DB that's right it's all
31:30 about the relationship when you go into Apstra and you click and you say I want to see the the running configuration you
31:37 click on a button that says render you don't go and say dump me the stored configuration because it's
31:44 rendered real time from everything graph DB knows because it knows exactly what's
31:50 on the box so all of these little Bubbles and circles have bits and pieces of the
31:56 configuration in it and it pulls it all together in real time for you so I think that's fantastic but again I'm old
32:05 school and you can get a lot of detail about every little entity and how
32:10 entities interact and I think of Apstra this
32:16 way it it's it's it's the heart of our management for our AI cluster
32:22 environment and that's how we use it but more importantly what I want to do is let's go to the
32:29 Apstra so this is real time what's happening in that building right now oh
32:36 it's changed since I was last in here look at all that red look at all that red service anomalies so uh and this is
32:44 the big reason I wanted to show it the uh I'll have to go see about this this wasn't there a while ago the 18
32:50 wasn't there the 70 I was expecting 70 because 70 is always there because of the power stuff the more interesting
32:57 ones to me me right now are the fact that the storage and the management both are showing
33:02 two and the question was about how do you troubleshoot knowing my
33:08 cluster I almost never ever see a red in storage or in a management and the fact
33:14 I see two and two tells me that more than likely something is broken at the
33:20 GPU side because the GPU has a lot of interfaces the G the actual GPU
33:28 side of it that's doing the back end GPU was fine earlier but the storage and the management wasn't and if I go in here
33:34 and I look at one of these and I click on service anomaly it tells me on Leaf 2 the
33:43 physical port 7 and The Logical Port assigned to that physical Port is having
33:49 a problem on Leaf 2 and if I look at it and click down into it at the
33:56 physical is telling me that the H100 04 is having a connectivity problem to
34:03 Port 7 on the storage side if I go back up and I go to this
34:12 service anomaly this is telling me that leaf one
34:20 again Port 7 with the logical interface 7.0
34:29 and I look at the physical guess where it's going same
34:35 place so that tells me right away that H100 04 needs to be looked at the funny part
34:42 is an hour ago I sent a message to somebody said you need to go look at this and all of a sudden we got 18 other
34:48 anomalies so I have no idea what they went and looked at but that's how we
34:53 use it for troubleshooting we look for the service anomalies and Apstra is very different uh alerts and alarms somebody
34:59 used the term alerts and alarms earlier uh when talking Apstra is about
35:05 anomalies and it it's a very different context than alerts and alarms to me
35:13 alerts and alarms traditionally are we send it to a management system that management system somebody will look at
35:19 and go oh yeah you know that's a minor one let's acknowledge it and say you know we know it's there blah blah blah
35:24 it sits back in a queue Apstra will not let you acknowledge an anomaly it actually
35:30 expects you to go fix it it will stay red until you fix
35:36 it so you got one option fix it or it
35:41 stays red it won't let you just blow it off if it's if it's important enough that it made an anomaly it believes it
35:48 needs to be fixed so uh a little bit different philosophy and how to address it just follow up on that that note
35:55 there so for the 70 that you have there all the time you said it's for the duplicate uh the Redundant power supply
36:02 you would have to go back into your design then and say I don't have redundant power supplies anymore to
36:07 clear them yes I I could actually go into the probe modify the probe then tell
36:13 it it's only got one power and this is the power that it's using but you know
36:19 yes the the other method is is yeah because they tell me in the lab I'm never going to get the power back I
36:24 would right but it makes for good demo
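The "fix it or it stays red" behaviour discussed in this exchange can be illustrated with a tiny comparison of expected versus observed state. This is a sketch under assumptions: the data structures and names are hypothetical, and the power-supply numbers simply mirror the lab example from the talk. There is deliberately no "acknowledge" operation; an anomaly goes away only when the observed state matches the expectation, either because the hardware was fixed or because the expectation itself was changed.

```python
def find_anomalies(expected, observed):
    """Return (device, key, expected, actual) for every mismatch."""
    anomalies = []
    for device, wants in expected.items():
        for key, want in wants.items():
            got = observed.get(device, {}).get(key)
            if got != want:
                anomalies.append((device, key, want, got))
    return anomalies


# Expectation: every switch has redundant power. Reality in the lab: one cord pulled.
expected = {"leaf1": {"power_supplies": 2}, "leaf2": {"power_supplies": 2}}
observed = {"leaf1": {"power_supplies": 1}, "leaf2": {"power_supplies": 1}}

before = find_anomalies(expected, observed)

# The second way to clear them: change the expectation to a single power supply.
for device in expected:
    expected[device]["power_supplies"] = 1

after = find_anomalies(expected, observed)
```

Here `before` holds one anomaly per switch and `after` is empty, which corresponds to modifying the probe instead of plugging the second cord back in.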