Day 1: Managing your AI data center at scale with Juniper Networks

Kyle Baxter shown next to a presentation slide asking about load balancing


This presentation by Kyle Baxter focuses on how Juniper Networks' Apstra solution can manage AI data centers at scale. Apstra simplifies network configuration for AI/ML workloads by providing tools to assign virtual networks across numerous ports, an essential capability in environments with potentially millions of ports. The core of the presentation highlights the ability to provision virtual networks and configure load balancing with ease, using an intent-based approach that simplifies complex network tasks. This reduces the burden of manual configuration and allows users to quickly deploy and manage their AI data centers, regardless of the number of GPUs.
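
Apstra also exposes this intent model programmatically (the session later mentions REST, Terraform, and Ansible), so the same bulk assignment can be scripted instead of clicked. The sketch below is a minimal illustration in Python, assuming a reachable on-premises Apstra VM; the endpoint paths, payload fields, and names used here are assumptions rather than the exact Apstra schema, so verify them against the Apstra API reference before relying on them.

```python
import requests

APSTRA_URL = "https://apstra.example.net"   # hypothetical address of the on-prem Apstra VM
BLUEPRINT_ID = "ai-fabric"                  # hypothetical blueprint ID

session = requests.Session()
session.verify = False  # lab convenience only; use a trusted certificate in production

# Authenticate and keep the returned token for later calls
# (login path and token header are assumptions, not a documented contract).
login = session.post(f"{APSTRA_URL}/api/user/login",
                     json={"username": "admin", "password": "********"})
login.raise_for_status()
session.headers["AuthToken"] = login.json()["token"]

# Create one virtual network and bind it to every rail-facing leaf in a single
# request instead of assigning it port by port (payload shape is illustrative).
payload = {
    "label": "rail1-vn",
    "vn_type": "vlan",
    "bound_to": [
        {"system_id": leaf, "vlan_id": 100}
        for leaf in ["rail1-leaf1", "rail1-leaf2", "rail1-leaf3"]
    ],
}
resp = session.post(f"{APSTRA_URL}/api/blueprints/{BLUEPRINT_ID}/virtual-networks",
                    json=payload)
resp.raise_for_status()
print("virtual network created:", resp.json().get("id"))
```

Whether the bound_to list covers three leaf switches or three thousand, it is still one request, which is the same "same process at any scale" point Baxter returns to at the end of the talk.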

Baxter demonstrates how Apstra's continuous validation lets users catch configuration issues, such as missing VLAN assignments, before they impact operations. The system also supports bulk operations that streamline the assignment of virtual networks and subnets across the entire infrastructure, and it offers selectable load balancing policies whose parameters are explained through built-in help text. Juniper's focus is on simplifying these tasks through an intuitive interface, minimizing the need for command-line configuration and speeding up deployment of AI data centers.
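
To make the two ideas above concrete, the sketch below first polls the staging blueprint for outstanding build errors and warnings (the continuous-validation signal shown in the demo) and only then submits a load-balancing policy with the parameters discussed in the session: DLB versus GLB, flowlet versus per-packet, and the inactivity interval. It reuses the authenticated session from the previous sketch, and the endpoint paths and field names are again assumptions rather than the exact Apstra schema.

```python
# Reuses `session`, `APSTRA_URL`, and `BLUEPRINT_ID` from the previous sketch.

# 1. Continuous validation: refuse to go further while the staging blueprint
#    still reports build errors or warnings (field and path names are assumed).
status = session.get(f"{APSTRA_URL}/api/blueprints/{BLUEPRINT_ID}").json()
errors = status.get("build_errors_count", 0)
warnings = status.get("build_warnings_count", 0)
if errors or warnings:
    raise SystemExit(f"staging blueprint not clean: {errors} errors, {warnings} warnings")

# 2. Intent for a load-balancing policy, mirroring the options shown in the demo.
#    GLB is called out in the talk as needing QFX5240-class hardware, a check
#    Apstra performs before letting the policy deploy.
lb_policy = {
    "label": "my-policy",
    "type": "dlb",                # dynamic load balancing; "glb" only on supported hardware
    "mode": "flowlet",            # or "per_packet"
    "inactivity_interval": 16,    # illustrative value; the UI help text shows the real default
}
resp = session.post(
    f"{APSTRA_URL}/api/blueprints/{BLUEPRINT_ID}/load-balancing-policies",  # hypothetical path
    json=lb_policy,
)
resp.raise_for_status()
```

Assigning the policy to all leaf and spine switches in bulk, rather than switch by switch, follows the same pattern as the virtual-network example above.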

Finally, Baxter explains that Apstra is delivered as a virtual machine, available for download on the Juniper website. It is deployed on premises to manage the network and can be integrated with Mist AI for AI-driven network operations. The upcoming 6.0 release and support for advanced features such as RDMA-based load balancing were also discussed. The key message is that Apstra enables efficient deployment and management of AI infrastructure at any scale, simplifying complex tasks with a user-friendly interface and automated features.

Presented by Kyle Baxter, Head of Apstra Product Management, Juniper Networks. Recorded live in Santa Clara, California, on April 23, 2025, as part of AI Infrastructure Field Day.


You’ll learn

  • How to configure with confidence using Apstra Data Center Director

  • How to load balance and handle congestion, no matter your cluster’s size

Who is this for?

  • Network Professionals

  • Business Leaders

Transcript

0:00 welcome everybody my name is Kyle

0:01 Baxter i'm going to be talking about

0:03 how in in deploying and managing your AI

0:08 data center how you can do that at scale

0:11 with

0:12 Apstra so the the challenges is there's

0:16 potentially lots of ports how do you

0:19 assign those virtual networks that you

0:21 need to be able to to run the training jobs

0:23 across all those ports you know you

0:25 don't want to do that manually one by

0:26 one by one by one because there could be

0:28 thousands or millions of different ports

0:30 when you're talking about some of these

0:31 larger um deployments um and then

0:34 there's ever evolving requirements um

0:37 there's there's new techniques that are

0:38 coming out for things like load

0:40 balancing um we saw in a in a previous

0:42 session about some of the new um RDMA

0:45 based load balancing that's coming out

0:47 um but what also about how do I

0:49 configure load balancing without having

0:51 to go read through hundreds and hundreds

0:53 of pages of articles to become an expert

0:55 and that's what we can do within Apstra

0:58 is simplify that and give you just a few

1:01 click operations to to get those things

1:03 assigned and up and running so starting

1:07 with virtual network provisioning so how

1:09 do I assign virtual networks across my

1:13 entire network so the first thing that

1:16 that we talked about a little bit

1:17 earlier was you know how do we catch

1:19 things before we deploy so that is

1:21 exactly one thing that we can do here

1:23 that I'm going to show you um is the

1:26 first step is we've determined before

1:28 you've deployed this to production in that

1:30 staging version of Apstra that there is

1:33 warnings that there's missing VLAN

1:35 assignments to Rails and so we've caught

1:38 that before you've actually deployed and

1:40 what we can then do is give you an easy

1:43 button to say go ahead and provision a a

1:47 VLAN across those rails and those ports

1:50 and then afterwards we can see yep I'm

1:54 all set we can see the the virtual

1:56 network the connectivity template all

1:58 assigned to it so what this does is it

2:02 simplifies that burden of managing and

2:05 gives you the ability with just a few

2:07 clicks to assign virtual networks make

2:09 sure your connectivity templates are all

2:12 connected to the ports and the rails

2:15 that you want you need

2:18 so let's take a quick look at this i'm

2:21 going to move to this demo real quick

2:24 and we'll see this in action it look

2:27 very similar to those screenshots but

2:29 we're going to go here to that

2:30 uncommitted tab which is showing what's

2:33 the difference so what we played around

2:35 with in staging between what's in

2:36 staging and our production and we can

2:38 see that the warning tab is highlighted

2:41 and so when we click on that warning tab

2:44 we'll immediately see that there's

2:45 interfaces associated with a rail that

2:47 are expected to have a VLAN for untagged

2:50 traffic and it's not there but we have

2:52 resolutions where it says "Hey do you

2:54 want to assign that VLAN?" And yes I do

2:58 i could do it one by one or what I'm

3:00 going to do is I'm going to look at all

3:01 of my rails and say "Let's bulk to it

3:04 let's I don't want to do it one by one

3:05 by one by one by one i can do it but

3:08 that's tedious why do I want to do

3:10 multiple things at once when I can do it

3:11 in one click and so I'll select all of

3:15 my rails and say let's provision VLANs

3:17 across all of them and we'll see that we

3:20 can review it we can look at it they

3:22 were all missing before but we'll add

3:24 them in there and we'll we'll see that

3:26 uh column go from missing to now they're

3:29 all assigned now the other thing Apstra

3:32 does is it still catches that there's

3:35 still some red in there so we're going

3:36 to go over there and the one thing that

3:38 we didn't do is tell it what virtual

3:41 network what subnet to use so I'm going

3:43 to pick an IP pool i'm just going to

3:45 pick one of the default ones i can see

3:46 there and there how much of it is used

3:48 so I know there's some some space in

3:49 there and I'll pick that IP pool and so

3:53 then you can see the page all of a

3:54 sudden start everything turning green

3:56 showing me that it's validated this is

3:58 that continuous validation that we're

4:00 doing in Apstra that's continuously

4:01 looking to see did you do everything

4:02 that we that you need to do and so it

4:05 caught that you know we we didn't have

4:07 the the VLANs set and we didn't also

4:09 have a virtual network um set for a

4:12 subnet of IP addresses but once we fixed

4:14 all that look the warning tab now went

4:17 green and so there in just a few minutes

4:19 in a few clicks I was able to assign a

4:22 virtual network and an IP subnet for

4:24 that virtual network across all of my

4:27 ports and and rails so really cool to be

4:31 able to see how it catches that how it

4:32 does that continuous validation and

4:34 helps make that

4:37 smooth so I'm going to switch back over

4:40 to the

4:41 slides and we'll look at load balancing

4:44 so a question came up earlier about how

4:46 do I do load balancing um and in Apstra

4:50 we have the same kind of simple

4:52 intent-based driven models where we're

4:53 looking at what is the intended outcome

4:56 that you want what do you want it to do

4:58 well how do you want to work not let's

5:00 go switch by switch and configure DLB or

5:02 GLB it's do you want DLB or GLB do you

5:05 want flowlet or per packet you know those

5:06 kinds of things and we'll see how that

5:08 how that works and we can go through

5:10 that so there's a simple walkthrough

5:13 configuration where you can go and you

5:14 can pick do I want DLB do I want it to

5:17 be flow or packet do I want to set some

5:19 of the activity in the in the intervals

5:22 but in there and I'll show in a quick

5:24 demo you can hover all those those

5:26 little question mark helps at the end

5:27 it's a little small but um you can hover

5:29 over all those and you'll be able to see

5:30 what all those parameters mean so you

5:33 don't have to go research and be like um

5:35 what is the inactivity interval i don't

5:37 really remember what it should be it'll

5:39 tell you right there when you hover over

5:40 it that hey this is what it is this is

5:42 the default value we set if you want to

5:44 change it you can if you want to keep

5:45 the default value great and we can also

5:48 do validation on there's specific things

5:50 that maybe only work on certain

5:52 hardwares so like GLB only works on um

5:56 the 5240 um because of the specific

5:58 hardware it needs specific hardware we

6:00 can do that validation so you don't just

6:02 you know say yeah I want GLB and you

6:04 don't have hardware that can do it we

6:05 don't let you deploy it if it's not

6:07 going to work so we can do that

6:08 validation is RLB handled separately or

6:11 is that coming soon coming soon okay so

6:14 it's some of the the latest innovations

6:15 coming out um so we have DLB and GLB in

6:18 the product some of the great things

6:19 about Apstra is we have flexibility to

6:22 do things like that um separately from

6:24 our intent-based models we can do what we

6:26 call configlets so you can do custom

6:28 configurations when there is you know

6:30 like brand new innovations that come out

6:31 on a switch hardware um so if you want

6:33 to keep up with the latest and greatest

6:35 but we're actively working as we speak

6:37 to get things like the RDMA load

6:39 balancing um into the the configuration

6:41 that we're going to see right here got

6:42 it thank you mhm

6:46 so how this looks like is is a few steps

6:50 so starting with we assign a default um

6:54 policy that's that's DLB across because

6:56 that is that is on all the hardware so

6:58 we assign that default policy but if you

7:00 want to build your own it's only two

7:03 steps first you create a load balancing

7:05 policy and you go through the selections

7:07 on what options do you want again we

7:09 have the help text that tells you what

7:11 all they are what all the default values

7:13 and then you then assign it and you can

7:16 do it just like we saw before one by one

7:18 or you can do it in bulk so if you want

7:20 to do them all at once you can do that

7:22 if you want different ones maybe

7:23 different ones on the spine versus the

7:24 leaf you could have that option where

7:26 you can be able to do them across

7:28 different ones so let's see this in

7:32 action real quick um so actually go here

7:37 and then pull up this

7:39 one and we'll walk through it so it's in

7:43 our staged again we'll go to fabric

7:45 settings where we can look at the load

7:47 balancing and as as I mentioned there's

7:50 a default policy that comes already

7:52 assigned that's running DLB um but I

7:55 want to build my own so let's do it so

7:58 the first thing we go to is over here

8:00 the load balancing policies we will

8:02 create our own load balancing policy we

8:06 give it a name call it my policy or

8:08 whatever you want to call it and then we

8:09 can go through the options and here's

8:11 where I was talking about the help text

8:12 that you just hover over it and it tells

8:14 you exactly what those mean what are the

8:17 default values so you can see do I like

8:19 that default value do I want to change

8:20 it what do I want to do i can pick do I

8:23 want GLB and helper load balancing you

8:25 know you can pick all those settings and

8:28 then you can simply go over to the

8:30 assignment and as I talked about you can

8:32 do it one by one so we could pick you

8:34 know just one by one here or if I wanted

8:37 to and this is what I'll do because I

8:38 don't like doing things one at a time i

8:40 want to do it in bulk i want the quick

8:42 um I want to get it done fast um I can

8:44 do that in bulk assign it and now all of

8:47 a sudden here in just a few clicks we've

8:49 created a new load balancing policy i

8:51 didn't have to be the expert in knowing

8:53 what are what are all the settings to

8:55 know do I you know what is an inactivity

8:57 timer it was already there for me I just

9:00 made a couple selections and I'm now off

9:03 and running

9:05 I have a question uh it looks like on

9:08 every leaf and every spine switch you

9:10 can set different load balancing

9:11 policies yes is there a reason why you

9:14 want to do that in the same network you

9:16 have different load balancing policies

9:18 probably

9:19 not wouldn't advise that um you know

9:23 some of them like GLB maybe you want um

9:25 different at the spine versus the leaf

9:27 um there's there's a little difference

9:29 there that you might uh want there but

9:31 in in general yes you wouldn't probably

9:33 want to um so that's where the bulk

9:35 comes in handy is you you just want

9:36 everything to have the same load

9:37 balancing policy um but if you really

9:39 wanted to get crazy you you could but I

9:42 wouldn't advise it okay thanks uh Jack

9:48 Poller Paradigm Technica speaking of

9:48 getting crazy yes is everything exposed

9:51 through this interface or is there still

9:52 stuff that you have to drop down into

9:54 command line and tweak and munch and

9:56 stuff like that for for things like DLB

9:59 and GLB no that's it's all configured

10:01 there we saw there was a big list of

10:02 items you could manually configure so So

10:05 for those no you don't um if you wanted

10:07 to use like the we talked about just a

10:09 second ago like the RDMA load balancing

10:11 we don't have that yet modeled yet

10:13 that's coming soon so if you wanted to

10:14 use that today you would then have to

10:16 drop into what we call configlets to to

10:18 help set that up but the the the the

10:21 point is you're going to capture

10:23 everything that you possibly can in this

10:25 tool yes and stray stay away from Yes

10:30 old style stuff yeah the goal is to stay

10:32 away from the CLI cli okay we can still

10:34 show you as I talked about in an earlier

10:36 part the the rendered config if you

10:38 still love to see it and you want to see

10:39 you know did it match exactly what I

10:40 thought um but but the idea is is yes

10:43 you will drive everything through the UI

10:45 or like we talked about um earlier APIs

10:48 via like REST Terraform Ansible things

10:50 like that you can drive it all through

10:51 there if you wanted to to help automate

10:53 that right okay thank you
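
The exchange above notes that everything driven through the UI can also be driven through Apstra's APIs, and that the rendered device configuration remains available for inspection. Below is a minimal sketch of pulling that rendered config back out for one switch, reusing the authenticated session from the sketches earlier on this page; the endpoint path, field name, and node ID are assumptions, not the documented API.

```python
# Fetch the configuration Apstra rendered for one managed switch so you can
# confirm it matches the intent you expressed in the UI or via the API.
# Reuses `session`, `APSTRA_URL`, and `BLUEPRINT_ID`; the path and node ID
# below are hypothetical.
node_id = "rail1-leaf1"

resp = session.get(
    f"{APSTRA_URL}/api/blueprints/{BLUEPRINT_ID}/nodes/{node_id}/config-rendering"
)
resp.raise_for_status()
print(resp.json().get("config", ""))
```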

10:58 yeah so let me go back to the to go back

11:02 to the slides so to sum up we saw how we

11:05 can manage this network at scale from

11:09 from deploying it and it's the same

11:11 process no matter if you have one GPU 10

11:13 GPUs 100 GPUs thousand GPUs a million

11:16 GPUs no matter what it is it's the same

11:18 process um it wouldn't take me any

11:20 longer if it was you know million GPUs

11:21 because I could do that bulk edit thing

11:22 for everything and and have it all set

11:25 up so we can take that complexity remove

11:28 that that burden of complexity out there

11:30 give you the expert to help you deploy

11:33 your AI data centers faster so any last

11:37 minute questions before we go to the Yes

11:39 oh wait before we go to I wonder if the

11:41 next section's better to ask it or

11:43 you're done after this i'm I'm finishing

11:45 now we're going to be moving on to um

11:47 kind of the day two the day-to-day

11:49 operations and we'll see a lot about the

11:51 the visibility we get and and monitoring

11:53 the networks heat maps all that fun

11:55 stuff okay so my question is how is this

11:58 delivered to customers how do they get

12:00 what

12:01 is how is it consumed so how much of it

12:04 is through SIs how much do they get it from um

12:08 MSPs how do they get it directly from

12:11 you and how much of it is on the

12:13 complete solution I'm just curious how

12:15 it how it gets to them and how they

12:17 consume it once it gets to them yes yes

12:20 great question so thank you for asking

12:22 um so Apstra is delivered as a virtual

12:26 machine so on juniper.net in our

12:29 downloads page we have OVAs KVMs um

12:32 Microsoft Hyper-V um we're looking at

12:34 adding Nutanix versions as well um because

12:37 that's getting popular now um so it is

12:39 deployed as a virtual machine so almost

12:42 all of our our users and customers they

12:45 get that OVA they deploy it inside their

12:47 their network inside their data center

12:49 in the management plane so it has that

12:51 management connectivity to all of the

12:53 managed switches um so that's typically

12:55 that's deployed there's there's some

12:57 cases where um you talked about like SIs

12:59 or MSPs as part of the bigger solution

13:02 will come into your data center yeah and

13:03 they'll they'll help deploy it for you

13:06 um you know same thing like professional

13:07 services could do to come in and help

13:09 you know deploy it for you um but it is

13:11 traditionally um delivered that way as

13:13 an on-prem application um we do have as

13:17 I talked about the very beginning in the

13:19 previous session was the integration

13:21 with with Mist AI um so we do have a

13:24 component that we can take a lot of this

13:26 data and a lot of data we'll see in in

13:28 the upcoming session and be able to

13:30 build AI ops on top of it so um AI for

13:33 networking um where we can use AI and ML

13:36 to help improve network operations um so

13:39 we'll be talking about that probably in

13:41 in other sessions because today was

13:42 focused on the the building of the infra

13:44 but we can absolutely do that to be able

13:46 to bring more AI and insights and be

13:49 able to help you troubleshoot faster so

13:51 more direct sales versus going through

13:52 partners well we still have you know

13:54 partner channels um but even through a

13:56 partner channel they'll still get the

13:58 software and deploy it you know on

13:59 premises okay thanks

14:03 when do we expect 6.0 to be released

14:07 thank you for asking should be in the

14:09 next couple weeks excellent so we are in

14:11 actively in early trials with several

14:13 customers that are actually using this

14:15 and deploying it in their own AI and ML

14:17 trading jobs as we speak got it but

14:20 it'll go GA here in in about next couple

14:22 weeks
