Hybrid Is NOW – Why Ease and Speed Matter in Deploying the Right Infrastructure for AI
Mansour Karam, Juniper’s GVP of Products for Data Center, talks about how AI is changing the data center networking world, from AIOps to the networks enabling multimillion-dollar GPU clusters.
You’ll learn
How Juniper’s AIOps tools are changing the way networks operate
What organizations like Fujitsu and 650 Group think about the future of networking for AI
Why our customers are seeing 85% reduced deployment time, 90% reduced OpEx, and 9x improved reliability
Transcript
0:01 [Music]
0:04 [Applause]
0:08 hi I'm Mansour Karam GVP of Products for
0:11 Data Center here at Juniper Networks I
0:14 am fortunate to be able to talk to a lot
0:16 of you about data center needs and I
0:19 consistently hear the same
0:24 challenges everyone wants to move faster
0:27 to make their businesses more
0:29 competitive and the network is an
0:31 integral part of this you are all facing
0:35 huge operational complexities in the
0:37 data center there are fewer resources to
0:39 manage this complexity due to chronic
0:42 skills shortages you need to run at the
0:46 speed of the business yet any error
0:49 could lead to a disastrous outage
0:51 costing millions of dollars and there
0:54 are lingering supply chain issues and
0:57 other challenges in implementing hybrid
0:59 clouds
1:00 strategies ultimately what you need is a
1:03 way to build and operate private on-prem
1:07 data centers that are as easy to use as
1:11 the public
1:12 Cloud many companies are trying to
1:15 address these challenges with
1:17 do-it-yourself also called DIY
1:20 automation but this only seems to add
1:23 complexity increase technical debt and
1:27 increase dependence on scarce Talent
1:30 it's just not working for
1:33 them Juniper is in a perfect position to
1:37 help solve these pain points better
1:39 than anyone else in the industry whether
1:43 addressing the needs of data centers
1:45 with traditional application
1:46 requirements or building new data
1:49 centers specifically for AI workloads
1:52 Juniper has you covered this is thanks
1:56 to three sustainable areas of
1:59 differentiation that drive real end user
2:03 value first we simplify operations and
2:07 save time and money with an operations
2:10 first approach to design deployment and
2:14 troubleshooting Apstra is the first
2:17 intent based networking solution for
2:20 data center operations and the only
2:23 solution that works across any vendor's
2:26 products this lets you automate data
2:29 center fabrics with maximum assurance
2:32 greater efficiency and using fewer
2:36 resources second we provide open
2:40 flexibility that allows customers to
2:43 design their networks using proven
2:45 technologies such as Ethernet with Apstra
2:49 you completely avoid vendor lock-in for
2:52 better flexibility and cost
2:54 savings and third we have reliable and
2:59 secure turnkey solutions with end-to-end
3:02 validated designs that include switches
3:06 routers automation software and
3:10 security this ensures confidence in the
3:12 choice of products expedites deployment
3:15 time and delivers on the promise of zero
3:19 trust Juniper's comprehensive portfolio
3:23 of QFX switches and PTX routers will
3:26 scale up to 800 gig and use a mix of
3:30 merchant silicon from broadcom as well
3:32 as custom silicon from juniper this
3:35 diversity provides even more flexibility
3:38 to you our customers and since taking on
3:42 the role we've committed to deliver
3:44 the hardware platforms faster than our
3:48 competitors which is why we were the
3:50 first to announce the 800 gig Ethernet
3:54 51.2 terabit per second Tomahawk 5-based
3:58 QFX platform
4:00 all of this results in up to 90% lower
4:03 OpEx 85% faster deployment times and 9x
4:09 more
4:10 reliability and you have peace of mind
4:13 that Juniper is 100% interoperable with
4:16 leading GPU switch and data center
4:20 fabric vendors but you don't have to
4:23 just trust me when I tell you we are
4:25 driving real change and real outcomes we
4:28 have seen a
4:30 143% growth in new data center logos in
4:33 the last 12 months when a company
4:37 deploys Juniper in the data center they
4:39 love it they're willing to talk about it
4:41 and there's no going
4:43 back I'm excited to tell you that things
4:46 are getting even better from here a
4:50 couple of months ago we made a huge
4:52 announcement to bring even more agility
4:55 Automation and Assurance to the data
4:58 center you just heard Sudheer talk about
5:01 all the great things we can do with
5:03 AIOps in the campus and branch
5:06 specifically with Marvis the industry's
5:09 first and only AI native virtual Network
5:12 assistant well we are bringing the same
5:16 VNA capabilities to the data center
5:20 Marvis VNA for the data center is the
5:23 first step on our data center AIOps
5:26 journey integrating the best intent
5:28 based networking with the best AIOps
5:31 solution creates an unstoppable
5:34 force but talk is cheap let's show you a
5:39 demo in this demo you can see the
5:41 standard Marvis actions dashboard which
5:44 shows you the top issues and actions
5:47 associated with wired access wireless
5:50 access and
5:51 SD-WAN can you see what's new right there
5:54 in the middle is a data center leg for
5:57 the first time ever you get a complete
6:00 view of your end-to-end network in a
6:03 single AI native dashboard if a problem
6:07 emerges like an application is running
6:09 slowly for instance you can easily
6:12 pinpoint which domain needs to be
6:15 addressed to fix the problem if you
6:19 double click on the data center leg you
6:21 see all the data center components
6:23 including switching devices connectivity
6:26 and so forth as you click further you
6:29 can see detail on every anomaly with
6:33 recommended actions for
6:35 troubleshooting all of this information
6:37 is coming directly from Apstra so we're
6:40 leveraging the rich monitoring and
6:42 telemetry that is already built into
6:44 Apstra bringing it to the cloud and
6:47 presenting it in the Mist Marvis user
6:51 interface if you need to deep dive on a
6:55 data center issue you can click one
6:57 button to launch the Apstra UI and
7:01 continue troubleshooting from there in
7:03 this demo we are also showing how a
7:06 simulated fault is detected wait a
7:08 minute that's not Junos CLI that's a
7:11 competitor CLI that's right folks this
7:14 is a multivendor network think about
7:17 this we have a multivendor data center
7:19 Network including competitor switches
7:22 that is being managed by Apstra and
7:25 we're bringing all that full visibility
7:29 of that multivendor Network into the
7:31 end-to-end view provided by
7:34 Marvis this type of end-to-end visibility
7:37 delivers enormous value to our customers
7:40 and now we have a great tool to deliver
7:43 it even if there are no Juniper switches
7:47 in your data center well at least not
7:51 yet but that isn't the only AIOps
7:55 capability that we have added to the
7:56 data center Marvis VNA also has a
8:00 conversational interface that uses
8:02 generative AI for simple and seamless
8:06 knowledge based queries instead of
8:09 getting frustrated trying to sort
8:11 through technical documentation you can
8:13 just type a plain language question such
8:16 as what is a logical device and get
8:20 clear and concise answers and links
8:23 directly to the right document to learn
8:27 more these new AIOps capabilities are
8:30 very exciting but this is just the
8:32 beginning for Marvis VNA in the data
8:35 center over the next 12 to 18 months we
8:38 will be rolling out more AI native
8:42 capabilities to give you even better
8:44 automation agility and Assurance Juniper
8:49 is leading in both AI and networking we
8:52 have followed a proven AIOps blueprint
8:55 for the campus and branch which we are
8:57 now taking to the data Center our
9:01 competitors simply cannot touch
9:04 this but you cannot talk about Ai and
9:07 networking without also talking about AI
9:11 clusters just as customers need AIOps
9:14 to simplify networking operations many
9:17 also need simple seamless and assured
9:20 networks to support AI workloads in the
9:23 data center we call this networking for
9:27 AI and it is a critical element of
9:30 Juniper's AI native networking
9:34 platform I had the opportunity to talk
9:37 with Alan Weckel from the 650 Group to talk
9:41 about this new market let's take a
9:46 look hey Alan thank you for joining me
9:49 today yeah thanks so much for having me
9:52 yeah no absolutely so this AI market
9:55 and especially when it comes to the
9:58 networking infrastructure piece uh how
10:01 do we Define it and uh also I'm
10:03 interested in its
10:05 size yeah so if we look at AI networking
10:09 there's kind of two networks there
10:10 there's a backend training Network some
10:12 people call this the AI Fabric or AI
10:15 networking piece uh the ethernet value
10:18 of that is about a billion dollars today
10:21 uh growing to 4.5 billion in 2028 uh
10:25 there's also a front-end Network there
10:27 which is how we communicate to the rest
10:28 of the data center that's a little
10:30 bit less than a billion dollars today uh
10:33 and growing to 5 billion uh so we add
10:35 that up and we're talking about a $2
10:37 billion market going to 10 billion and a
10:41 CAGR around 50% uh we've never actually
10:44 had a CAGR this high in data center
10:46 switching so remarkable growth and
10:48 driving the overall market to record
10:50 highs every year wow yeah indeed
10:53 that's a very very high growth rate
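As a quick sanity check on those figures, here is a minimal back-of-the-envelope calculation in Python. The four-year window (roughly 2024 to 2028) is an assumption for illustration; the dollar figures are the ones quoted above.

    # Rough CAGR check for the AI networking market figures quoted above.
    start_market_usd_b = 2.0   # ~$1B back-end + ~$1B front-end Ethernet today
    end_market_usd_b = 10.0    # ~$4.5B back-end + ~$5B front-end projected for 2028
    years = 4                  # assumed horizon, roughly 2024 -> 2028

    cagr = (end_market_usd_b / start_market_usd_b) ** (1 / years) - 1
    print(f"Implied CAGR: {cagr:.1%}")  # prints about 49.5%, i.e. "around 50%"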
10:56 um you mentioned that Ethernet was
10:57 one of the technologies for the back end
11:00 um you know I believe you're referring
11:01 to InfiniBand as the other alternative
11:04 perfect perhaps you can uh describe what
11:07 InfiniBand is and how they kind of
11:09 compare and
11:11 contrast yeah so InfiniBand is a
11:13 different network protocol out there um
11:16 there's pros and cons to both uh but in
11:19 general as we move forward ethernet will
11:21 become a larger and larger part of the
11:23 network uh out there I'd say InfiniBand
11:25 kind of comes from an HPC supercomputer
11:28 side of things and ethernet's obviously
11:30 the fabric of choice throughout the data
11:32 center and throughout most networks
11:34 right you know as they say never bet
11:36 against uh against ethernet and plus you
11:39 know it's a much larger ecosystem I
11:40 suspect also as there will be more GPU
11:42 options on the market you want the
11:44 technology that's kind of uh GPU
11:46 agnostic
11:48 right yeah absolutely as we get more
11:50 GPUs and we see more types of
11:52 training inference tier 2 training
11:55 Ethernet does what it always does which
11:57 is take over the market and becomes
11:59 the technology of choice out there yep
12:02 yep very true and so shifting gears now
12:05 just in terms of like the specific
12:07 characteristics of AI workloads
12:10 and AI traffic you know again
12:13 specifically to networking um you know I
12:16 suspect there is a lot of fan-in uh
12:18 elephant flows right given that you have
12:21 all of these iterations involved in AI
12:24 maybe tell us a bit you know what these
12:26 characteristics are for AI workloads
12:30 yeah so you mentioned elephant flows
12:32 that's a big one we kind of have GPU to
12:34 GPU GPU to memory connectivity GPU to
12:37 storage uh we then have different
12:39 nodes transmitting all at once we've got
12:42 RDMA uh so this leads to highly
12:44 latency sensitive networks and that
12:47 impacts your job completion time Mansour
12:49 I'm assuming you see some of this as
12:51 well when you talk to customers yeah no
12:53 absolutely and in fact you know it puts
12:55 a a big strain on the network right we
12:58 need like much larger capacity uh lower
13:01 latencies but also some capabilities
13:04 like you know you need kind of like on
13:05 the highway uh have an ability to
13:08 control congestion and uh and load
13:10 balance the traffic so that you don't
13:12 have the bottlenecks and it's it's a big
13:15 deal right for AI clusters if you don't
13:16 get it right these GPUs are expensive
13:20 correct yeah absolutely like when you
13:22 don't get this right on a regular server
13:24 you're talking a few thousand dollars you
13:26 might be missing out when you start
13:28 talking about AI due to the cost of
13:29 those GPUs this becomes hundreds of
13:31 thousands of dollars if your network
13:33 isn't tuned right and we're not even
13:35 talking about job completion time where
13:36 it could delay your job completion
13:38 time in terms of training uh a data set
13:42 by what days weeks
13:44 potentially yeah on these new AI
13:47 clusters you're talking days weeks or
13:49 potentially months out there it's
13:51 something like you actually can't do AI
13:53 if the network isn't tuned correctly you
13:55 just would never get done training those models
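To put rough numbers on that, here is a minimal sketch of the idle-GPU math. Every input below (cluster size, hourly GPU cost, the share of time spent waiting on the fabric, and run length) is an assumption chosen purely for illustration, not a measured or vendor-quoted figure.

    # Illustrative only: how network waits turn into wasted GPU spend and longer runs.
    gpus_in_cluster = 512          # assumed cluster size
    gpu_cost_per_hour = 3.0        # assumed fully loaded $/GPU-hour
    network_stall_fraction = 0.15  # assumed share of each step spent waiting on the network
    training_days = 30             # assumed length of the training run

    idle_gpu_hours = gpus_in_cluster * 24 * training_days * network_stall_fraction
    wasted_dollars = idle_gpu_hours * gpu_cost_per_hour
    waiting_days = training_days * network_stall_fraction
    print(f"GPU-hours lost to network waits: {idle_gpu_hours:,.0f}")      # ~55,296
    print(f"Approximate cost of that idle time: ${wasted_dollars:,.0f}")  # ~$165,888
    print(f"Days of the run spent waiting on the network: about {waiting_days:.1f}")

Scale the cluster or the stall fraction up and those waits compound into the extra days or weeks of job completion time discussed here.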
13:57 so if you were to kind of
14:00 tell us the kind of the keys to
14:02 success in this market if we want to get
14:04 into this market and be
14:06 successful what are kind of what
14:08 thoughts come to
14:09 mind yeah it's really about minimizing
14:11 the job completion time it's about some
14:14 of the things you asked in your question
14:16 about elephant flow latency load
14:18 balancing out there it's really about
14:20 creating a high performance high-end
14:22 Network something that's more similar to
14:24 HPC or what the cloud is used to than
14:27 kind of a traditional top-of-rack network
14:30 where you weren't necessarily latency
14:31 sensitive yep indeed Alan thank you for
14:34 joining me today it was great to have
14:36 you yeah thanks so much for having me I
14:39 enjoyed it
14:40 absolutely Juniper is exceptionally
14:43 well positioned for this market AI/ML
14:45 workloads in the data center are much
14:49 different than traditional workloads the
14:51 flows are large elephant flows they are
14:55 harder to load balance traffic is mostly
14:58 between GPUs and nodes tend to
15:01 transmit all at the same time and the
15:05 process is highly sensitive to packet
15:08 loss and jitter
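Here is a minimal sketch of why a handful of elephant flows is hard to spread evenly: with plain per-flow (ECMP-style) hashing, each long-lived flow is pinned to one uplink, and a few flows can easily collide on the same link. The flow and uplink counts below are assumptions chosen purely for illustration.

    # Toy model: hash a few long-lived GPU-to-GPU flows onto equal-cost uplinks.
    import random
    from collections import Counter

    random.seed(7)
    num_uplinks = 8       # assumed number of equal-cost uplinks
    num_elephants = 8     # assumed number of long-lived elephant flows

    # Per-flow hashing pins each flow to a single uplink for its whole lifetime.
    placement = Counter(random.randrange(num_uplinks) for _ in range(num_elephants))
    for link in range(num_uplinks):
        print(f"uplink {link}: {placement[link]} elephant flow(s)")
    # Typical outcome: a couple of uplinks carry 2-3 elephants while others sit idle,
    # and the congested link paces the whole synchronized training step.

This is why AI fabrics lean on finer-grained load balancing and congestion control rather than plain per-flow hashing.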
15:10 these new flow characteristics present complex
15:13 networking design problems suboptimal
15:16 network tuning or configuration leads to
15:18 longer job completion times which
15:21 could result in additional weeks if not
15:25 months for AI training not to mention
15:28 millions of dollars of underutilized
15:31 GPUs but rest assured Juniper is leading
15:35 with a combination of AI/ML features such
15:38 as load balancing and congestion control
15:41 and an operations-first approach led by
15:44 Apstra which fine-tunes the network to
15:48 operate
15:51 optimally bottom line the result is an
15:54 optimal Network which is key to ensuring
15:57 that all those expensive GPUs work
16:01 together efficiently spend a little bit
16:05 more time and money designing the
16:07 network and save a lot of headaches on
16:10 the overall AI
16:12 application but don't take my word for
16:14 it I recently caught up with Udo Würtz
16:18 Chief Data Officer at Fujitsu at the
16:22 World AI Cannes Festival he explained to
16:25 me how he's building Juniper-based AI
16:28 clusters to train self-driving software
16:32 take a
16:34 listen we started here with an executive
16:38 meeting here in November
16:41 last year and we had 85 customers and
16:46 partners and at that time we had 12
16:50 PoCs in terms of generative AI we have our
16:53 own private GPT solution
16:56 where all the data remains in your
16:59 data center you don't need a cloud service it's a
17:01 large language model you can feed in all
17:03 the information from your company and
17:06 you have information at your
17:08 fingertips now November till now is
17:13 let's say 2 and a half
17:15 months so Christmas in between and New
17:19 Year now we are facing 35 PoCs so we
17:24 tripled the
17:25 numbers and end of this month I would
17:29 assume it's at least 45 to 50 it's
17:34 exploding first of all can we agree to
17:37 the topic saying a workload where it is
17:41 about self-driving cars is one of the
17:44 hardest yes I think I agree yeah in
17:47 terms of um IO in terms of network load yes
17:52 in terms of GPU CPU whatever okay I
17:56 can state that I was involved in a
18:00 project where I can't give any names but
18:03 in terms of how to process these types
18:05 of data a real existing use case let's
18:08 call it like this um and we have
18:12 established an AI test drive which is an
18:15 infrastructure uh we have one in the UK
18:17 in London and one in
18:20 Frankfurt um everything is um based on
18:25 Juniper networks is based on SUSE Rancher
18:30 Intel of course also NVIDIA elements as
18:34 well as NetApp storage so cool um and
18:38 it was just a few weeks ago I was doing
18:41 the math and counting
18:45 everything like this and we overachieved
18:49 expectations and this is not marketing I
18:53 have the slides in my laptop so we saw all
18:55 the utilization of IO of CPUs the GPUs
19:00 memory even uh the amount of
19:06 uh Celsius temperature of the car during
19:10 a specific process wow um so I can state
19:14 no that is not an issue with the
19:16 network but in this
19:18 respect our open solution with specific
19:21 features for optimizing AI over Ethernet
19:25 avoids the challenges of InfiniBand you
19:28 can Leverage all the benefits of a
19:30 proven technology including better
19:33 feature agility lower costs and no
19:36 vendor
19:37 lock-in we have the best routing and
19:40 switching platforms for AI data centers
19:43 including the new QFX5240 and PTX10002
19:49 launched this quarter and with turnkey
19:53 validated Solutions you can get your AI
19:56 data center up and running faster and
19:59 with more confidence it will deliver the
20:01 results you need in fact we now have an
20:05 AI lab in Sunnyvale where we are proving
20:09 everything out we are not stuck in
20:12 theory like other vendors we can show
20:14 you exactly how ethernet can match or
20:17 beat InfiniBand performance for your
20:20 type of AI data center this is an open
20:24 invitation for all of you to come check
20:27 out our AI data center lab I am serious
20:31 I want to see you all in Sunnyvale
20:34 hopefully soon in short the Juniper
20:37 approach will be critical to help AI
20:40 infrastructure move beyond the early
20:43 adopter stage today where it's mostly
20:45 hyperscalers building infrastructure to
20:49 mass Market where every Enterprise in
20:52 the world can have their own private AI
20:56 infrastructure working to solve their
20:59 particular digital transformation
21:03 issues wrapping up I think it's safe to
21:07 say that the quote unquote experts who
21:11 said everything will move to the public
21:13 Cloud got it wrong the future is hybrid
21:19 this means that not only are on-prem
21:24 data centers not going away AI is
21:27 driving even more demand for them in
21:31 fact my guess is that each and every one
21:34 of you right now are trying to figure
21:37 out how to use data centers to run new
21:40 Enterprise AI projects the data center
21:43 is one of the most exciting areas in all
21:46 of Tech right now if you're looking to
21:49 build a secure modernized data center to
21:52 reliably and simply scale Innovation and
21:56 embrace hybrid cloud and the AI
21:59 Revolution You are not
22:01 alone and from traditional applications
22:04 to AI clusters Juniper has your back
22:09 every step of the way our data center
22:12 Solutions are the easiest to manage most
22:15 flexible to design and quick to
22:21 deploy again don't take my word for it
22:24 real Juniper customers such as Fasthosts
22:27 Yahoo and Advan are seeing an 85%
22:31 reduction in deployment time 90%
22:34 reduction in OpEx and a 9x improvement
22:38 in
22:39 reliability come to our lab sign up for
22:42 a demo and try a PoC to see for yourself
22:46 I promise that you won't be disappointed
22:49 thanks for your
22:51 [Music]
22:55 time