Mansour Karam, GVP of Products for Data Center

Hybrid Is NOW – Why Ease and Speed Matter in Deploying the Right Infrastructure for AI

Data Center, AI & ML

Mansour Karam, Juniper’s GVP of Products for Data Center, talks about how AI is changing the data center networking world, from AIOps to the networks behind multimillion-dollar GPU clusters.


You’ll learn

  • How Juniper’s AIOps tools are changing the way networks operate

  • What organizations like Fujitsu and 650 Group think about the future of networking for AI

  • Why our customers are seeing 85% less deployment time, 90% lower OpEx, and 9x better reliability

Who is this for?

Network Professionals, Business Leaders

Host

Mansour Karam
GVP of Products for Data Center

Transcript

0:01 [Music]

0:04 [Applause]

0:08 Hi, I'm Mansour Karam, GVP of Products for

0:11 Data Center here at Juniper Networks. I

0:14 am fortunate to be able to talk to a lot

0:16 of you about data center needs, and I

0:19 consistently hear the same

0:24 challenges. Everyone wants to move faster

0:27 to make their businesses more

0:29 competitive, and the network is an

0:31 integral part of this. You are all facing

0:35 huge operational complexity in the

0:37 data center. There are fewer resources to

0:39 manage this complexity due to chronic

0:42 skills shortages. You need to run at the

0:46 speed of the business, yet any error

0:49 could lead to a disastrous outage

0:51 costing millions of dollars. And there

0:54 are lingering supply chain issues and

0:57 other challenges in implementing hybrid

0:59 cloud

1:00 strategies. Ultimately, what you need is a

1:03 way to build and operate private, on-prem

1:07 data centers that are as easy to use as

1:11 the public

1:12 cloud. Many companies are trying to

1:15 address these challenges with

1:17 do-it-yourself, also called DIY,

1:20 automation, but this only seems to add

1:23 complexity, increase technical debt, and

1:27 increase dependence on scarce talent.

1:30 It's just not working for

1:33 them. Juniper is in a perfect position to

1:37 help solve these pain points better

1:39 than anyone else in the industry. Whether

1:43 addressing the needs of data centers

1:45 with traditional application

1:46 requirements or building new data

1:49 centers specifically for AI workloads,

1:52 Juniper has you covered. This is thanks

1:56 to three sustainable areas of

1:59 differentiation that drive real end-user

2:03 value. First, we simplify operations and

2:07 save time and money with an operations-first

2:10 approach to design, deployment, and

2:14 troubleshooting. Apstra is the first

2:17 intent-based networking solution for

2:20 data center operations and the only

2:23 solution that works across any vendor's

2:26 products. This lets you automate data

2:29 center fabrics with maximum assurance,

2:32 greater efficiency, and fewer

2:36 resources. Second, we provide open

2:40 flexibility that allows customers to

2:43 design their networks using proven

2:45 technologies such as Ethernet. With Apstra,

2:49 you completely avoid vendor lock-in for

2:52 better flexibility and cost

2:54 savings. And third, we have reliable and

2:59 secure turnkey solutions with end-to-end

3:02 validated designs that include switches,

3:06 routers, automation software, and

3:10 security. This ensures confidence in the

3:12 choice of products, expedites deployment

3:15 time, and delivers on the promise of zero

3:19 trust. Juniper's comprehensive portfolio

3:23 of QFX switches and PTX routers will

3:26 scale up to 800G and use a mix of

3:30 merchant silicon from Broadcom as well

3:32 as custom silicon from Juniper. This

3:35 diversity provides even more flexibility

3:38 to you, our customers. And since taking on

3:42 the role, we've committed to deliver

3:44 hardware platforms faster than our

3:48 competitors, which is why we were the

3:50 first to announce an 800G Ethernet,

3:54 51.2 Tbps, Tomahawk 5-based

3:58 QFX platform.

4:00 All of this results in up to 90% lower

4:03 OpEx, 85% faster deployment times, and 9x

4:09 more

4:10 reliability. And you have peace of mind

4:13 that Juniper is 100% interoperable with

4:16 leading GPU, switch, and data center

4:20 fabric vendors. But you don't have to

4:23 just trust me when I tell you we are

4:25 driving real change and real outcomes. We

4:28 have seen

4:30 143% growth in new data center logos in

4:33 the last 12 months. When a company

4:37 deploys Juniper in the data center, they

4:39 love it, they're willing to talk about it,

4:41 and there's no going

4:43 back. I'm excited to tell you that things

4:46 are getting even better from here. A

4:50 couple of months ago, we made a huge

4:52 announcement to bring even more agility,

4:55 automation, and assurance to the data

4:58 center. You just heard Sudheer talk about

5:01 all the great things we can do with

5:03 AIOps in the campus and branch,

5:06 specifically with Marvis, the industry's

5:09 first and only AI-native virtual network

5:12 assistant. Well, we are bringing the same

5:16 VNA capabilities to the data center.

5:20 Marvis VNA for the data center is the

5:23 first step on our data center AIOps

5:26 journey. Integrating the best intent-based

5:28 networking with the best AIOps

5:31 solution creates an unstoppable

5:34 force. But talk is cheap. Let's show you a

5:39 demo. In this demo, you can see the

5:41 standard Marvis Actions dashboard, which

5:44 shows you the top issues and actions

5:47 associated with wired access, wireless

5:50 access, and

5:51 SD-WAN. Can you see what's new? Right there

5:54 in the middle is a data center leg. For

5:57 the first time ever, you get a complete

6:00 view of your end-to-end network in a

6:03 single AI-native dashboard. If a problem

6:07 emerges, like an application running

6:09 slowly, for instance, you can easily

6:12 pinpoint which domain needs to be

6:15 addressed to fix the problem. If you

6:19 double-click on the data center leg, you

6:21 see all the data center components,

6:23 including switching devices, connectivity,

6:26 and so forth. As you click further, you

6:29 can see detail on every anomaly with

6:33 recommended actions for

6:35 troubleshooting. All of this information

6:37 is coming directly from Apstra, so we're

6:40 leveraging the rich monitoring and

6:42 telemetry that is already built into

6:44 Apstra, bringing it to the cloud, and

6:47 presenting it in the Mist Marvis user

6:51 interface. If you need to deep dive on a

6:55 data center issue, you can click one

6:57 button to launch the Apstra UI and

7:01 continue troubleshooting from there. In

7:03 this demo, we are also showing how a

7:06 simulated fault is detected. Wait a

7:08 minute, that's not the Junos CLI, that's a

7:11 competitor's CLI. That's right, folks, this

7:14 is a multivendor network. Think about

7:17 this: we have a multivendor data center

7:19 network, including competitor switches,

7:22 that is being managed by Apstra, and

7:25 we're bringing full visibility

7:29 of that multivendor network into the

7:31 end-to-end view provided by

7:34 Marvis. This type of end-to-end visibility

7:37 delivers enormous value to our customers,

7:40 and now we have a great tool to deliver

7:43 it even if there are no Juniper switches

7:47 in your data center. Well, at least not

7:51 yet. But that isn't the only AIOps

7:55 capability that we have added to the

7:56 data center. Marvis VNA also has a

8:00 conversational interface that uses

8:02 generative AI for simple and seamless

8:06 knowledge-based queries. Instead of

8:09 getting frustrated trying to sort

8:11 through technical documentation, you can

8:13 just type a plain-language question, such

8:16 as "What is a logical device?", and get

8:20 clear and concise answers and links

8:23 directly to the right document to learn

8:27 more. These new AIOps capabilities are

8:30 very exciting, but this is just the

8:32 beginning for Marvis VNA in the data

8:35 center. Over the next 12 to 18 months, we

8:38 will be rolling out more AI-native

8:42 capabilities to give you even better

8:44 automation, agility, and assurance. Juniper

8:49 is leading in both AI and networking. We

8:52 have followed a proven AIOps blueprint

8:55 for the campus and branch, which we are

8:57 now taking to the data center. Our

9:01 competitors simply cannot touch

9:04 this. But you cannot talk about AI and

9:07 networking without also talking about AI

9:11 clusters. Just as customers need AIOps

9:14 to simplify networking operations, many

9:17 also need simple, seamless, and assured

9:20 networks to support AI workloads in the

9:23 data center. We call this networking for

9:27 AI, and it is a critical element of

9:30 Juniper's AI-native networking

9:34 platform. I had the opportunity to talk

9:37 with Alan Weckel from 650 Group

9:41 about this new market. Let's take a

9:46 look. Hey Alan, thank you for joining me

9:49 today. Yeah, thanks so much for having me.

9:52 Yeah, no, absolutely. So this AI market,

9:55 and especially when it comes to the

9:58 networking infrastructure piece, how

10:01 do we define it? And also, I'm

10:03 interested in its

10:05 size. Yeah, so if we look at AI networking,

10:09 there are kind of two networks there.

10:10 There's a back-end training network, which some

10:12 people call the AI fabric or AI

10:15 networking piece. The Ethernet value

10:18 of that is about a billion dollars today,

10:21 growing to 4 to 5 billion in 2028.

10:25 There's also a front-end network there,

10:27 which is how we communicate with the rest

10:28 of the data center. That's a little

10:30 bit less than a billion dollars today

10:33 and growing to 5 billion. So we add

10:35 that up and we're talking about a $2

10:37 billion market going to 10 billion, and a

10:41 CAGR around 50%. We've never actually

10:44 had a CAGR this high in data center

10:46 switching, so it's remarkable growth,

10:48 driving the overall market to record

10:50 highs every year. Wow. Yeah, indeed,

10:53 that's a very, very high growth rate. And

10:56 you mentioned that Ethernet was

10:57 one of the technologies for the back end.

11:00 I believe you're referring

11:01 to InfiniBand as the other alternative.

11:04 Perhaps you can describe what

11:07 InfiniBand is and how the two

11:09 compare and

11:11 contrast. Yeah, so InfiniBand is a

11:13 different network protocol out there.

11:16 There are pros and cons to both, but in

11:19 general, as we move forward, Ethernet will

11:21 become a larger and larger part of the

11:23 network out there. I'd say InfiniBand

11:25 kind of comes from the HPC supercomputer

11:28 side of things, and Ethernet's obviously

11:30 the fabric of choice throughout the data

11:32 center and throughout most networks.

11:34 Right. As they say, never bet

11:36 against Ethernet. And plus,

11:39 it's a much larger ecosystem. I

11:40 suspect also, as there will be more GPU

11:42 options on the market, you want a

11:44 technology that's kind of

11:46 GPU-agnostic,

11:48 right? Yeah, absolutely. As we get more

11:50 GPUs and we see more types of

11:52 training, inference, and tier-2 training,

11:55 Ethernet does what it always does, which

11:57 is take over the market and become

11:59 the technology of choice out there. Yep,

12:02 yep, very true. And so, shifting gears now,

12:05 just in terms of the specific

12:07 characteristics of AI workloads

12:10 and AI traffic, again

12:13 specifically for networking, I

12:16 suspect there's a lot of fan-in and

12:18 elephant flows, right, given that you have

12:21 all of these iterations involved in AI.

12:24 Maybe tell us a bit what these

12:26 characteristics are for AI workloads.

12:30 Yeah, so you mentioned elephant flows.

12:32 That's a big one. We kind of have GPU

12:34 to GPU, GPU to memory connectivity, GPU to

12:37 storage. We then have different

12:39 nodes transmitting all at once. We've got

12:42 RDMA. So this leads to highly

12:44 latency-sensitive networks, and that

12:47 impacts your job completion time. Mansour,

12:49 I'm assuming you see some of this as

12:51 well when you talk to customers. Yeah, no,

12:53 absolutely. And in fact, it puts

12:55 a big strain on the network, right? We

12:58 need much larger capacity and lower

13:01 latencies, but also some capabilities,

13:04 kind of like on

13:05 the highway, the ability to

13:08 control congestion and load

13:10 balance the traffic so that you don't

13:12 have bottlenecks. And it's a big

13:15 deal for AI clusters. If you don't

13:16 get it right, these GPUs are expensive,

13:20 correct? Yeah, absolutely. When you

13:22 don't get this right on a regular server,

13:24 you're talking a few thousand dollars you

13:26 might be missing out on. When you start

13:28 talking about AI, due to the cost of

13:29 those GPUs, this becomes hundreds of

13:31 thousands of dollars if your network

13:33 isn't tuned right. And we're not even

13:35 talking about job completion time, where

13:36 it could delay your job completion

13:38 time in terms of training a data set

13:42 by what, days, weeks,

13:44 potentially? Yeah, on these new AI

13:47 clusters, you're talking days, weeks, or

13:49 potentially months out there. It's

13:51 something where you actually can't do AI

13:53 if the network isn't tuned correctly. You

13:55 just would never finish training those

13:57 models. So if you were to

14:00 tell us the keys to

14:02 success in this market, if we want to get

14:04 into this market and be

14:06 successful, what

14:08 thoughts come to

14:09 mind? Yeah, it's really about minimizing

14:11 the job completion time. It's about some

14:14 of the things you asked in your question

14:16 about elephant flows, latency, and load

14:18 balancing. It's really about

14:20 creating a high-performance, high-end

14:22 network, something that's more similar to

14:24 HPC or what the cloud is used to than

14:27 a traditional top-of-rack network,

14:30 where you weren't necessarily latency

14:31 sensitive. Yep, indeed. Alan, thank you for

14:34 joining me today. It was great to have

14:36 you. Yeah, thanks so much for having me. I

14:39 enjoyed it.
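Alan's market sizing can be sanity-checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below is a minimal illustration under stated assumptions: "today" is taken to mean 2024 (four compounding years to 2028), and 4.5 billion is used as a midpoint of his "4 to 5 billion" back-end estimate. The function name `cagr` and these inputs are illustrative, not figures taken from 650 Group's reports.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the interview, in USD billions. ASSUMPTIONS: a four-year
# window (2024 -> 2028) and 4.5 as the midpoint of "4 to 5 billion".
YEARS = 4
estimates = {
    "back-end AI fabric (Ethernet)": (1.0, 4.5),
    "total AI networking": (2.0, 10.0),
}

for segment, (start, end) in estimates.items():
    print(f"{segment}: {cagr(start, end, YEARS):.1%} per year")
```

Under these assumptions the total works out to roughly 49.5% a year, in line with the ~50% quoted. If the base year were 2023 (five compounding years), the same growth from $2 billion to $10 billion would be closer to 38% a year, so the quoted rate implies a four-year window.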
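Alan's point that elephant flows make load balancing hard can be illustrated with static, hash-based ECMP. This common scheme picks one uplink per flow by hashing its 5-tuple and never looks at flow size, so a few large RDMA flows can land on the same link while other links sit idle. The sketch below is a generic, hypothetical illustration. The helper names, the CRC32 hash, and the eight 100 Gb/s flows over eight uplinks are all assumptions. The least-loaded placement is a simplified stand-in for size-aware or dynamic load balancing, not Juniper's implementation.

```python
import zlib
from typing import List, Tuple

# Hypothetical flow record: ((src_ip, dst_ip, src_port, dst_port, proto), rate_gbps)
Flow = Tuple[Tuple[str, str, int, int, int], float]


def ecmp_hash_place(flows: List[Flow], num_links: int) -> List[float]:
    """Static ECMP: hash each flow's 5-tuple to one uplink, ignoring its size."""
    load = [0.0] * num_links
    for five_tuple, rate_gbps in flows:
        key = "|".join(str(field) for field in five_tuple).encode()
        link = zlib.crc32(key) % num_links  # a given flow always maps to the same link
        load[link] += rate_gbps
    return load


def least_loaded_place(flows: List[Flow], num_links: int) -> List[float]:
    """Size-aware placement: largest flows first, each onto the least-loaded link."""
    load = [0.0] * num_links
    for _, rate_gbps in sorted(flows, key=lambda flow: flow[1], reverse=True):
        link = min(range(num_links), key=lambda i: load[i])
        load[link] += rate_gbps
    return load


# Eight hypothetical GPU-to-GPU RoCEv2 flows (UDP, destination port 4791),
# 100 Gb/s each, spread over eight uplinks.
flows: List[Flow] = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, 17), 100.0) for i in range(8)
]
hashed = ecmp_hash_place(flows, num_links=8)
balanced = least_loaded_place(flows, num_links=8)

print("static ECMP hash, per-link Gb/s:", hashed)
print("least-loaded,     per-link Gb/s:", balanced)
print("busiest link:", max(hashed), "vs", max(balanced), "Gb/s")
```

The exact imbalance of the hashed placement depends on the CRC32 values of these particular 5-tuples. With only eight flows into eight buckets, though, collisions are likely. The size-aware placement puts exactly one 100 Gb/s flow on each link, which is the best any placement of these equal-sized flows can do.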

14:40 Absolutely. Juniper is exceptionally

14:43 well positioned for this market. AI/ML

14:45 workloads in the data center are very

14:49 different from traditional workloads. The

14:51 flows are large elephant flows, and they are

14:55 harder to load balance. Traffic is mostly

14:58 between GPUs, and nodes tend to

15:01 transmit all at the same time. And the

15:05 process is highly sensitive to packet

15:08 loss and jitter. These new flow

15:10 characteristics present complex

15:13 networking design problems. Suboptimal

15:16 network tuning or configuration leads to

15:18 longer job completion times, which

15:21 could result in additional weeks, if not

15:25 months, of AI training, not to mention

15:28 millions of dollars of underutilized

15:31 GPUs. But rest assured, Juniper is leading

15:35 with a combination of AI/ML features, such

15:38 as load balancing and congestion control,

15:41 and an operations-first approach led by

15:44 Apstra, which fine-tunes the network to

15:48 operate

15:51 optimally. Bottom line, the result is an

15:54 optimal network, which is key to ensuring

15:57 that all those expensive GPUs work

16:01 together efficiently. Spend a little bit

16:05 more time and money designing the

16:07 network and save a lot of headaches on

16:10 the overall AI

16:12 application. But don't take my word for

16:14 it. I recently caught up with Udo Würtz,

16:18 Chief Data Officer at Fujitsu, at the

16:22 World AI Cannes Festival. He explained to

16:25 me how he's building Juniper-based AI

16:28 clusters to train self-driving software.

16:32 Take a

16:34 listen. We started here with an executive

16:38 meeting here in November

16:41 last year, and we had 85 customers and

16:46 partners, and at that time we had 12

16:50 PoCs in terms of generative AI. We have

16:53 our own private GPT solution

16:56 where all the data remains in your

16:59 data center. You don't need a cloud service. It's a

17:01 large language model; you can feed in all

17:03 the information from your company, and

17:06 you have information at your

17:08 fingertips. Now, November till now is,

17:13 let's say, two and a half

17:15 months, with Christmas in between and New

17:19 Year. Now we are at 35, so we

17:24 tripled the

17:25 numbers. And by the end of this month, I would

17:29 assume it's at least 45 to 50. It's

17:34 exploding. First of all, can we agree on

17:37 the point that a workload that is

17:41 about self-driving cars is one of the

17:44 hardest? Yes, I think I agree. Yeah, in

17:47 terms of IO, in terms of the network,

17:52 in terms of GPU, CPU, whatever. OK, I

17:56 can state that I was involved in a

18:00 project where I can't give any names, but

18:03 in terms of how to process these types

18:05 of data, a real existing use case, let's

18:08 call it like this. And we have

18:12 established an AI Test Drive, which is an

18:15 infrastructure. We have one in the UK,

18:17 in London, and one in

18:20 Frankfurt. Everything is based on

18:25 Juniper Networks, is based on SUSE Rancher,

18:30 Intel, of course also NVIDIA elements, as

18:34 well as NetApp storage. So cool. And

18:38 it was just a few weeks ago I was doing

18:41 the math and counting

18:45 everything like this, and we overachieved

18:49 expectations. And this is not marketing; I

18:53 have the slides on my laptop. So we have all

18:55 the utilization of IO, of CPUs, the GPUs,

19:00 memory, even the

19:06 Celsius temperature of the card during

19:10 a specific process. Wow. So I can state

19:14 that no, that is not an issue with the

19:16 network. But in this

19:18 respect, our open solution with specific

19:21 features for optimizing AI over Ethernet

19:25 avoids the challenges of InfiniBand. You

19:28 can leverage all the benefits of a

19:30 proven technology, including better

19:33 feature agility, lower costs, and no

19:36 vendor

19:37 lock-in. We have the best routing and

19:40 switching platforms for AI data centers,

19:43 including the new QFX5240 and PTX10002,

19:49 launched this quarter. And with turnkey

19:53 validated solutions, you can get your AI

19:56 data center up and running faster and

19:59 with more confidence that it will deliver the

20:01 results you need. In fact, we now have an

20:05 AI lab in Sunnyvale where we are proving

20:09 everything out. We are not stuck in

20:12 theory like other vendors. We can show

20:14 you exactly how Ethernet can match or

20:17 beat InfiniBand performance for your

20:20 type of AI data center. This is an open

20:24 invitation for all of you to come check

20:27 out our AI data center lab. I am serious.

20:31 I want to see you all in Sunnyvale,

20:34 hopefully soon. In short, the Juniper

20:37 approach will be critical to helping AI

20:40 infrastructure move beyond the early

20:43 adopter stage of today, where it's mostly

20:45 hyperscalers building infrastructure, to a

20:49 mass market where every enterprise in

20:52 the world can have its own private AI

20:56 infrastructure working to solve its

20:59 particular digital transformation

21:03 issues. Wrapping up, I think it's safe to

21:07 say that the quote-unquote experts who

21:11 said everything will move to the public

21:13 cloud got it wrong. The future is hybrid.

21:19 This means that not only are on-prem

21:24 data centers not going away, AI is

21:27 driving even more demand for them. In

21:31 fact, my guess is that each and every one

21:34 of you right now is trying to figure

21:37 out how to use data centers to run new

21:40 enterprise AI projects. The data center

21:43 is one of the most exciting areas in all

21:46 of tech right now. If you're looking to

21:49 build a secure, modernized data center to

21:52 reliably and simply scale innovation and

21:56 embrace hybrid cloud and the AI

21:59 revolution, you are not

22:01 alone. And from traditional applications

22:04 to AI clusters, Juniper has your back

22:09 every step of the way. Our data center

22:12 solutions are the easiest to manage, most

22:15 flexible to design, and quickest to

22:21 deploy. Again, don't take my word for it:

22:24 real Juniper customers such as Fasthosts,

22:27 Yahoo, and Advan are seeing an 85%

22:31 reduction in deployment time, 90%

22:34 reduction in OpEx, and a 9x improvement

22:38 in

22:39 reliability. Come to our lab, sign up for

22:42 a demo, and try a PoC to see for yourself.

22:46 I promise that you won't be disappointed.

22:49 Thanks for your time.

22:51 [Music]
