Network Automation: The Hype vs. the Reality
Watch: Jonah Kowall at NANOG on the state of network automation.
What is the state of network automation today and where is it going from here? Listen as Kentik CTO Jonah Kowall discusses network complexity, automation strategies, and where some of the most cutting-edge customers today are headed with automation.
You’ll learn
How automation tools have evolved as the network has become more complex
A state of the union: survey results on automation and telemetry that highlight customer strategies moving forward
What’s happening with closed loop automation and validation via continuous integration and continuous delivery
Transcript
0:00 hey everyone how you doing today thanks
0:02 for coming and listening after lunch on
0:06 your own and showing up to listen to
0:08 Damien and I talk a little bit about
0:10 the state of network automation my name
0:15 is Jonah Kowall I currently work for
0:17 Kentik as the CTO but I have a pretty
0:21 wide experience as a practitioner
0:23 running infrastructure and operations
0:25 running networks I kind of shifted gears
0:28 and went into research for a few years
0:31 at Gartner and then moved back working
0:34 for a couple of vendors AppDynamics and
0:37 Cisco and currently at Kentik today I'm
0:40 gonna talk a little bit about you know
0:43 the state of automation I'm gonna use a
0:45 little bit of the data that Damien
0:47 talked about but also sort of set some
0:49 guidelines and explain where I think
0:52 things are going and some of the
0:54 things that we're seeing I've spoken to
0:57 a lot of folks trying to transform their
1:01 infrastructures transform their tooling
1:04 and most people really struggle with the
1:07 same problems which is really the people
1:11 and how they can deal with the debt that
1:14 they have as they try to transform and
1:16 build new things I obviously work for a
1:20 monitoring company I'm not going to talk
1:22 much about monitoring this is really
1:24 more about the automation side but we do
1:28 see things coming together so I'll talk
1:30 a little bit about what Damien touched
1:32 on which is that event-driven automation
1:34 and where we see things going especially
1:37 in some of our most advanced customers
1:42 so just going to touch on a few key
1:45 things around what's happening with the
1:47 network and the complexity and then talk
1:50 about the evolution of some of the
1:52 automation tools that we've seen because
1:55 there's been a lot of changes as the
1:57 infrastructures have evolved so
1:59 have the tools and then also talk about
2:03 sort of state of the union some of the
2:05 survey results and sort of some of
2:09 the more advanced things around
2:11 integration into continuous
2:14 integration systems and eventually
2:17 continuous delivery and sort of where
2:19 that's evolving to some of our customers
2:21 at Kentik are definitely on that
2:23 cutting edge but the bulk are trying to
2:26 figure out how they deal with their
2:28 current situation and sort of move on
2:32 some interesting information about
2:35 what's happening it's obviously good for
2:37 the audience here in terms of folks that
2:39 run networks or that run colocation and
2:42 other facilities enterprises are really
2:45 trying to get out of running their own
2:47 data centers they either want to go into
2:50 an environment where someone is running
2:51 it for them or there's additional
2:53 flexibility and the graph over on the
2:57 right hand side really shows both
3:00 historical and future spending on cloud
3:03 services the top graph obviously it's
3:06 hard to see this but the top graph is
3:08 software as a service the other two
3:10 lines that are trending up are
3:12 infrastructure as a service and platform
3:15 as a service so those of you that are
3:17 doing colocation and building these
3:19 services are going to see continued
3:22 investment what this really means is
3:25 that the network becomes even more
3:26 important because as environments are
3:29 distributed and more SaaS services are
3:32 being consumed the network becomes that
3:35 critical point that ties everything
3:36 together which is good news for us as an
3:39 industry because the problems are just
3:41 going to get more complex and the demand
3:44 on our services and what we build is
3:47 going to increase but the complexity is
3:51 also increasing significantly we heard a
3:54 lot in the keynotes this week about
3:57 interesting ways that SD-WAN and
3:59 virtualization of networks are
4:01 changing this is also happening
4:03 significantly in the enterprise we see
4:06 almost every telco that we work with and
4:09 most enterprises investing in some type
4:12 of SD-WAN and they do this in order to
4:15 get that flexibility but it creates a
4:17 lot of challenges for us as networking
4:20 professionals because it's hard to
4:22 diagnose and debug these things that
4:24 abstract a lot of what we're used to
4:26 dealing with
4:28 additionally if you have to integrate
4:30 with these solutions they all have
4:32 different APIs different formats for
4:34 telemetry there's so much variance in
4:37 here that it's really hard to manage
4:39 them it's hard to automate them it's
4:41 hard to troubleshoot them and this is
4:44 just going to continue getting more
4:46 complex as we layer more on top of the
4:48 onion so to speak and so what we've seen
4:53 over time and I'm surprised that Damien
4:57 said he was not expecting people to have
5:01 more automation systems over the last
5:03 three years but I definitely see people
5:06 adopting more automation systems to deal
5:08 with these new layers if you sort of
5:11 rewind to what we were doing 10 to 15
5:14 years ago these NCCM type tools would
5:17 connect and manage traditional network
5:20 environments in terms of being able to
5:22 scale out a software upgrade or deploy
5:25 an ACL across a large distributed
5:27 network they would do some of these big
5:29 bulk changes and allow you to understand
5:32 the networks I managed several of these
5:34 tools across pretty large networks you
5:37 know 30,000 network devices and they did
5:40 a pretty good job but they really lacked
5:43 running things in a code forward
5:46 method so as people have moved towards
5:50 DevOps and looking at infrastructure as
5:53 code these network orchestration type
5:56 systems Ansible obviously being the most
5:59 common one is really where people have
6:01 moved around their tooling but what we
6:04 start seeing is that when you look at
6:06 NSX and ACI and some of these new policy
6:09 based systems that really abstract away
6:13 a lot of the network constructs and lets
6:15 us control application layers and of
6:18 course these same vendors are talking
6:19 about this intent based system that
6:21 they're expecting to build at some point
6:24 in the future that's really fantasy at
6:26 this point but sounds great in marketing
6:29 we'll see where that whole thing goes
6:31 but it's clearly getting a lot of buzz
6:35 but not a lot of reality yet so that is
6:38 where you know things are going around
6:41 automation we'll see how much of this is
6:43 smoke and mirrors and how much of it is
6:45 real but the transition towards that
6:48 code-based infrastructure management is
6:52 definitely happening and Damien
6:56 obviously used much cleaner data from
6:58 the survey I sort of used the raw data
7:00 here and he talked about the survey so
7:03 I'm not going to go into the details
7:05 behind it but Ansible is clearly the
7:07 tool that we see across our customers
7:10 and just from speaking to folks in terms
7:13 of where they're going but the challenge
7:16 is that Ansible isn't in itself an
7:18 out-of-the-box easy to use system you
7:20 have to write a lot of code so this
7:22 means that you really need different
7:24 types of engineers and I'll talk about
7:26 that in a minute the other thing that
7:29 Damien also touched on is how broken
7:31 monitoring is if you had told me 10
7:34 years ago that we would still be doing
7:36 ping and up/down monitoring as our
7:39 primary method of understanding the
7:41 network I would have called you crazy
7:42 because I was doing packet capture and
7:45 flow analytics 10-15 years ago and
7:48 people today still aren't doing these
7:50 things which I find depressing
7:54 disheartening I don't know what you want to
7:56 call it but hopefully this will change
7:58 because monitoring has some
8:02 serious challenges but this move to you
8:07 know infrastructure as code and managing
8:09 the code is definitely real it's
8:12 happening it's already happened on
8:14 most of the infrastructure but the
8:16 network is coming along too the
8:19 challenge here is really that many of
8:21 these tools are really toolkits and some
8:26 of the tools that are out there like
8:28 Nornir for example aren't inherently built
8:31 to just manage the network they're
8:33 really built to manage everything
8:35 so although NAPALM is a network specific
8:38 toolkit and is extremely popular with
8:41 users of Ansible to provide additional
8:43 capabilities on top some of these other
8:46 tools are really very rudimentary
8:49 toolkits
8:50 some of them have a lot of vendor
8:52 restrictions to them
8:54 NAPALM only supports a few of the
8:56 leading manufacturers of network
8:59 devices Netmiko actually supports a
9:03 lot more devices but there are some that
9:05 are in sort of like a limited level of
9:07 support as well and as Damien
9:11 also hinted at there's
9:13 not a standard implementation here so
9:16 every time a new engineer comes in and
9:19 sets one of these systems up it's going
9:21 to be different than every other company
9:23 so it's really difficult to figure out
9:27 best practices which is why it's
9:29 important to participate in communities
9:31 and understand what other people are
9:33 doing so that you can try to find the
9:35 best thing to fit your environment and
9:38 then finally the challenge with the
9:41 infrastructure as code approach is that it
9:43 doesn't support existing automation or
9:45 existing configuration so for many
9:48 organizations that are setting up a new
9:50 network they'll start from scratch with
9:53 something like Ansible which is a great
9:55 situation to be in unfortunately most of
9:58 us have existing infrastructure to
10:00 manage so it's very difficult to make
10:03 that transition just because you have to
10:06 deal with what is in place today as you
10:09 build for the future so definitely a
10:14 challenge the other thing is modeling
10:18 the network and understanding current
10:20 configuration is a big challenge
10:22 so although NETCONF which has been
10:24 extensively covered here over the last
10:26 four or five years and I won't go into
10:28 it is a good attempt at
10:33 standardization it still has its own set
10:36 of challenges which means that for
10:39 someone like Kentik where we're really
10:41 trying to understand the infrastructure
10:43 we can't just rely on NETCONF we
10:46 actually have to do things like SNMP we
10:49 have to use other methods to discover
10:51 and understand what the devices are so
10:55 although there are standards here
10:56 they're not implemented in standard ways
10:58 so it becomes very difficult as you have
11:01 a complex network if you're lucky enough to
11:04 have standardized your entire network on
11:06 a single vendor or
11:08 a single platform of a single vendor
11:10 then that's great
11:12 but we don't see that as something
11:14 that commonly happens just because we
11:17 build over time and you get this you
11:19 know change that's happening another
11:23 topic which I am actually currently
11:26 writing a blog about is really
11:28 streaming telemetry and how varied it
11:31 is across the vendors and the platforms
11:33 so many of you would love to get rid of
11:36 SNMP I don't particularly think SNMP is
11:39 a wonderful protocol it's definitely
11:41 very old but the transition to streaming
11:44 telemetry is a big challenge because
11:47 it's missing a lot of the
11:48 standards-based things that we built
11:50 with SNMP and so there's a lot of
11:53 challenges around what you can use
11:55 streaming telemetry for and what you
11:58 still need SNMP for so this is still
12:01 evolving and I wish we had some
12:03 standards behind streaming telemetry
12:06 like we did with SNMP some RFCs would
12:09 definitely be nice to have
12:11 there so I was actually gonna
12:18 show some demos but I found out on
12:20 Sunday that I couldn't do a live demo on
12:22 stage because there's no place to plug
12:25 laptops in so I recorded a couple short
12:27 videos to show a few things but before I
12:30 jump into that I'm gonna be showing a
12:33 little bit about some of the things that
12:35 Damien mentioned around ChatOps
12:36 because I do believe that that is
12:38 important a lot of these cutting edge
12:41 customers that use Kentik today also
12:44 build their own chatbots and have
12:47 integrated with lots of other systems a
12:49 really common one that we see is NetBox
12:51 very popular I'm sure many of you
12:55 are using it it is a great system but it
12:58 requires some manual work to get it
13:01 populated and keep it up to date it's
13:04 pretty nice because it really can show
13:06 you how your network is set up what
13:08 types of devices you have how things are
13:11 connected and it's a great single
13:14 source of truth but it does require some
13:16 work to get up and running and make it
13:18 useful
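
[Editor's illustration] Once NetBox is populated, other tools can read it programmatically. The following is only a minimal sketch of listing the devices at one site over NetBox's REST API. It assumes the `/api/dcim/devices/` endpoint filtered by a `site` slug and token authentication via an `Authorization: Token ...` header (details may differ by NetBox version). The base URL, token, site slug, and all function names are hypothetical and not from the talk.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_devices_request(base_url: str, token: str, site_slug: str) -> urllib.request.Request:
    """Build a GET request for the devices at one site (assumed endpoint and filter)."""
    query = urllib.parse.urlencode({"site": site_slug})
    url = f"{base_url.rstrip('/')}/api/dcim/devices/?{query}"
    headers = {
        "Authorization": f"Token {token}",  # assumed token auth scheme
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def device_names(payload: dict) -> List[str]:
    """Extract device names from a paginated list response, skipping unnamed devices."""
    return [d["name"] for d in payload.get("results", []) if d.get("name")]


def fetch_device_names(base_url: str, token: str, site_slug: str) -> List[str]:
    """Fetch the first page of devices for a site and return their names."""
    request = build_devices_request(base_url, token, site_slug)
    with urllib.request.urlopen(request, timeout=10) as response:
        return device_names(json.load(response))
```

A call such as `fetch_device_names("https://netbox.example.com", "<token>", "los-angeles")` would return the first page of device names for that site; pagination and error handling are left out to keep the sketch short.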
13:21 so I'm going to show you a little bit
13:23 about ChatOps and where we think
13:26 things are going the idea is really most
13:31 of us use Slack or Microsoft Teams
13:33 or maybe some of the open source
13:35 projects out there to chat among
13:38 ourselves but these tools essentially
13:41 allow you to interact with a bot in
13:45 a chat room and collaborate more
13:48 easily especially when you're
13:50 troubleshooting problems you can bring
13:52 this data in directly to a conversation
13:54 it really creates a much more
13:56 collaborative environment so we're
14:00 working with Network to Code to open
14:02 source a chatbot that has a bunch of
14:05 integrations out of the box I'm going to
14:07 show you a little bit of this you can
14:09 see more details if you grab these
14:11 slides and look at that URL there's a
14:13 more extensive demo but the idea is how
14:16 can we bring together some of these
14:18 common tools that folks are using as
14:21 they try to change their operations as
14:23 well so a quick little demo of a
14:30 chatbot interacting with NetBox
14:32 so I'm going to show you here in the
14:34 video that this is a copy of NetBox and
14:39 I can basically click on my sites here I
14:42 can pull up a site like Los Angeles I
14:45 can look at the devices that are on the
14:47 site some configuration information now
14:50 I'm switching over to Slack and I can
14:52 actually make a command right here
14:54 to NetBox the same type of way tell it
14:58 that I'm interested in looking at a
14:59 particular site pick the same Los
15:02 Angeles site click Submit and instantly
15:05 I get the same type of data but it's a
15:07 much faster way to interact than using
15:10 that silly mouse thing you know so it
15:13 just kind of gives you an easier way to
15:15 collaborate and communicate and work
15:18 together as a team on troubleshooting
15:20 problems or understanding what's
15:22 happening in your environment so that's
15:24 kind of one example in this next example
15:28 I'm going to do a similar thing and show
15:30 you in Kentik you know something that
15:33 we can pull up here
15:34 sort of show me my top
15:37 NetFlow sources in terms of the devices
15:40 that we're seeing sending flow data in I
15:44 can easily pick a you know saved view
15:46 within Kentik right from the chatbot
15:49 it then pulls up the graph that I want
15:52 to look at I can see what devices are
15:54 sending the most I can do the same exact
15:57 thing in Kentik search for a saved
15:59 view the same type of data essentially
16:02 comes up the same exact graph and view
16:05 and I can see you know what's sending
16:08 the data and start slicing and dicing
16:10 from there but it just kind of makes it
16:12 easier for you to see that data
16:15 instantly in a collaborative team view
16:18 versus everyone working in their
16:20 own tools and their own browsers and
16:22 such so that's kind of the idea behind
16:25 what we're trying to build with the chat
16:27 bot and really why we see this as being
16:31 a future path that a lot of teams are
16:34 moving towards for running automation
16:36 and collaborating and
16:38 troubleshooting together and Damien also
16:43 talked a little bit about some of the
16:44 tools that folks are using most people
16:48 will start with a CI type system so
16:51 naturally the sort of the first steps
16:54 that you do is take your configurations
16:56 and store them and get some type of
16:58 source control system and that most
17:01 people will build start building a
17:03 pipeline let's say I want to automate
17:05 some of the verifications some of the
17:07 checking that happens when a new
17:09 configuration is committed people will
17:12 start stringing together these different
17:13 tools typically in Python and they'll
17:17 run that with a system the most common
17:20 being Jenkins which is an open source CI
17:22 system but GitLab has a really nice CI
17:26 system as well and Git is also
17:29 integrated with it GitLab is a SaaS
17:32 service but it's also open source
17:34 so you can download it and install the
17:36 whole system on-prem if you want to as
17:38 well there was also a talk yesterday
17:42 about Batfish and that's definitely
17:43 something that we commonly see for
17:45 validating some of your policy
17:47 as this gets checked in the idea is to
17:50 try to eliminate some of the errors
17:52 automate some of the checks and
17:55 verification and other things that you
17:57 would want to do and then of course as
18:00 you get more advanced some folks will do
18:03 instant deployment of that it runs
18:05 through all their verifications they see
18:07 that it checks out and then they'll
18:09 actually do the deployment directly with
18:11 Ansible it's pretty hard to
18:16 retrofit an existing network to do this
18:18 type of thing but as you evolve you can
18:21 start automating more of this pipeline
18:23 and be more confident in these
18:25 incremental changes and so that's the
18:28 goal is instead of doing a big Thursday
18:31 night change window push of everything
18:34 that you're doing you're incrementally
18:36 releasing these things and being able to
18:38 better manage and roll back when you have
18:40 problems and so it is important as you
18:44 build these types of pipelines and the
18:47 verification that you do that
18:49 closed-loop type system so incorporating
18:52 monitoring or running a synthetic test
18:54 let's say you close a firewall or you
19:00 close a port on an ACL or something
19:02 like that you would run a test and make
19:04 sure that that was in fact closed and
19:06 that the connection didn't go through so
19:09 people will often incorporate that type
19:11 of testing it's almost like unit testing
19:13 with the code that they commit and it is
19:17 really common as you go into continuous
19:19 deployment that your deploys fail that
19:21 is totally normal and there's nothing
19:25 wrong with it don't think of it as a
19:26 failed release or a failed build it's
19:30 really just part of doing these
19:32 incremental changes that things do fail
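
[Editor's illustration] The post-change check described above (close a port, then confirm that the connection no longer goes through, and revert if the check fails) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tooling shown in the talk: the `apply` and `rollback` callables stand in for whatever actually pushes configuration (for example an Ansible run), and all function names are hypothetical.

```python
import socket
from typing import Callable


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Synthetic test: return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out all count as "not reachable".
        return False


def deploy_with_check(apply: Callable[[], None],
                      rollback: Callable[[], None],
                      check: Callable[[], bool]) -> bool:
    """Apply a change, run the verification, and roll back if it fails.

    Returns True if the change was verified and kept, False if it was reverted.
    """
    apply()
    if check():
        return True
    rollback()  # a failed deploy is expected sometimes; recover quickly
    return False
```

For the ACL example, `check` could be `lambda: not port_is_reachable("192.0.2.10", 23)` (a documentation address used purely as a placeholder), so the deploy counts as verified only when the port really is closed.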
19:34 the screenshot on the bottom is actually
19:36 a pretty advanced CD system called
19:39 harness and managing that failure is
19:43 definitely part of any CD strategy is
19:46 understanding we've failed how do we
19:49 revert things back quickly and make sure
19:51 things are back in the state they were
19:53 before the change so it is an important
19:56 part of understanding continuous
19:58 deployment is that failure is
20:01 expected and it is normal it's how you
20:05 recover so a couple of things that are
20:09 happening more broadly in the industry
20:12 there's a new term and a new market
20:15 that's been forming over the last few
20:17 years called AIOps something that we've
20:19 been part of at Kentik for sure but
20:23 the idea here and this is a Gartner
20:25 terminology in terms of what they're
20:28 seeing happening in the industry we see
20:31 a lot of folks building this with
20:33 open-source or using you know commercial
20:36 technologies but the idea is how do I
20:38 bring all of this data the configuration
20:41 the logs the metrics the traffic and
20:45 store this in a central system and the
20:48 idea is of course first monitoring and
20:50 understanding what's happening but being
20:52 able to drive service management
20:54 ticketing paging and then obviously the
20:58 ultimate goal is to tie in to that
21:00 automation and make this a closed-loop
21:02 system so there's a lot of things that
21:05 you can accomplish with a system like
21:06 this but it's a new type of technology
21:09 and the market is still really evolving
21:11 so it's very immature and fragmented and
21:15 not well understood but the goal is
21:17 really how do we do this closed-loop
21:19 type of thing that
21:22 we've been talking about here
21:25 so you can kind of think of this as like
21:27 an evolution of phases where you start
21:30 off by monitoring and ultimately you
21:33 want to get all the way to that goal of
21:35 automation and there's a lot of steps
21:36 and things that you can use these types
21:38 of systems to improve within your
21:41 organization that are in between those
21:43 two sort of methods so this is kind
21:47 of where these platforms are evolving to
21:49 and I'm sure we'll see lots of great
21:51 open source also facilitating some of
21:54 this but it's an exciting area because
21:57 it's really how we start taking this
21:59 data and tying it to automation and
22:02 making that easier for everyone which I
22:05 think is a very valid goal and really
22:09 it's supposed to help us identify
22:11 these problems and what's happening
22:14 sooner prioritize what's the most
22:17 important and then get more out of our
22:19 people by better automation and closing
22:22 the loop with these things so that's
22:26 kind of a you know an overview of what
22:28 we're seeing but it's not fully reality
22:32 it's still formulating and coming
22:34 together and I think we'll see that
22:35 get much better so with that I'm open to
22:41 take a few questions and thanks everyone
22:45 for listening