Jonah Kowall, CTO, Kentik

Network Automation: The Hype vs. the Reality

Screenshot showing host Jonah Kowall from Kentik as he speaks to the audience. A presentation slide is to his right side, which says, “Network Modeling and State. Netconf protocol implemented on most devices, but older style XML implementation. Still need other discovery to get device status and details: * Config download (legacy), SNMP, or proprietary APIs, * Limited usefulness of Streaming Telemetry due to lack of standards and most networks being diverse vendors/versions.”

Watch: Jonah Kowall at NANOG on the state of network automation.

What is the state of network automation today and where is it going from here? Listen as Kentik CTO Jonah Kowall discusses network complexity, automation strategies, and where some of the most cutting-edge customers today are headed with automation. 


You’ll learn

  • How automation tools have evolved as the network has become more complex 

  • A state of the union: survey results on automation and telemetry that highlight customer strategies moving forward 

  • What’s happening with closed loop automation and validation via continuous integration and continuous delivery 

Who is this for?

Network Professionals, Business Leaders

Host

Jonah Kowall
CTO, Kentik

Transcript

0:00 hey everyone how you doing today thanks

0:02 for coming and listening after lunch on

0:06 your own and and showing up to listen to

0:08 Damien and I talk a little bit about

0:10 the state of network automation my name

0:15 is Jonah Kowall I currently work for Kentik

0:17 as the CTO but I have a pretty

0:21 wide experience as a practitioner

0:23 running infrastructure and operations

0:25 running networks I kind of shifted gears

0:28 and went into research for a few years

0:31 at Gartner and then moved back working

0:34 for a couple of vendors AppDynamics and

0:37 Cisco and currently Kentik today I'm

0:40 gonna talk a little bit about you know

0:43 the state of automation I'm gonna use a

0:45 little bit of the data that Damien

0:47 talked about but also sort of set some

0:49 guidelines and explain where I think

0:52 things are going and and some of the

0:54 things that we're seeing I've spoken to

0:57 a lot of folks trying to transform their

1:01 infrastructures transform their tooling

1:04 and most people really struggle with the

1:07 same problems which is really the people

1:11 and how they can deal with the debt that

1:14 they have as they try to transform and

1:16 build new things I obviously work for a

1:20 monitoring company I'm not going to talk

1:22 much about monitoring this is really

1:24 more about the automation side but we do

1:28 see things coming together so I'll talk

1:30 a little bit about what Damien touched

1:32 on which is that event-driven automation

1:34 and where we see things going especially

1:37 in some of our most advanced customers

1:42 so just going to touch on a few key

1:45 things around what's happening with the

1:47 network and the complexity and then talk

1:50 about the evolution of some of the

1:52 automation tools that we've seen because

1:55 there's been a lot of changes as the

1:57 infrastructures have have evolved so

1:59 have the tools and then also talk about

2:03 sort of State of the Union some of the

2:05 survey results and and sort of some of

2:09 the more advanced things around

2:11 integration and continuous

2:14 integration systems and eventually

2:17 continuous delivery and sort of where

2:19 that's evolving to some of our customers

2:21 at Kentik are definitely on that

2:23 cutting edge but the bulk are trying to

2:26 figure out how they deal with their

2:28 current situation and sort of move on

2:32 some interesting information about

2:35 what's happening it's obviously good for

2:37 the audience here in terms of folks that

2:39 run networks or that run colocation and

2:42 other facilities enterprises are really

2:45 trying to get out of running their own

2:47 data centers they either want to go into

2:50 an environment where someone is running

2:51 it for them or there's additional

2:53 flexibility and the graph over on the

2:57 right hand side really shows both

3:00 historical and future spending on cloud

3:03 services the top graph obviously it's

3:06 hard to see this but the top graph is

3:08 software as a service the other two

3:10 lines that are trending up are

3:12 infrastructure as a service and platform

3:15 as a service so those of you that are

3:17 doing colocation and building these

3:19 services are going to see continued

3:22 investment what this really means is

3:25 that the network becomes even more

3:26 important because as environments are

3:29 distributed and more SaaS services are

3:32 being consumed the network becomes that

3:35 critical point that ties everything

3:36 together which is good news for us as an

3:39 industry because the problems are just

3:41 going to get more complex and the demand

3:44 on our services and what we build is

3:47 going to increase but the complexity is

3:51 also increasing significantly we heard a

3:54 lot in the keynotes this week about

3:57 interesting ways that SD-WAN and

3:59 virtualization of networks is

4:01 changing this is also happening

4:03 significantly on the enterprise we see

4:06 almost every telco that we work with and

4:09 most enterprises investing in some type

4:12 of SD-WAN and they do this in order to

4:15 get that flexibility but it creates a

4:17 lot of challenges for us as networking

4:20 professionals because it's hard to

4:22 diagnose and debug these things that

4:24 abstract a lot of what we're used to

4:26 dealing with

4:28 additionally if you have to integrate

4:30 with these solutions they all have

4:32 different APIs different formats for

4:34 telemetry there's so much variance in

4:37 here that it's really hard to manage

4:39 them it's hard to automate them it's

4:41 hard to troubleshoot them and this is

4:44 just going to continue getting more

4:46 complex as we layer more on top of the

4:48 onion so to speak and so what we've seen

4:53 over time and I'm surprised that Damien

4:57 said he was not expecting people to have

5:01 more automation systems over the last

5:03 three years but I definitely see people

5:06 adopting more automation systems to deal

5:08 with these new layers if you sort of

5:11 rewind to what we were doing 10 to 15

5:14 years ago these NCCM type tools would

5:17 connect and manage traditional Network

5:20 environments in terms of being able to

5:22 scale out a software upgrade or deploy

5:25 an ACL across a large distributed

5:27 network they would do some of these big

5:29 bulk changes and allow you to understand

5:32 the networks I managed several of these

5:34 tools across pretty large networks you

5:37 know 30,000 Network Devices and they did

5:40 a pretty good job but they really lacked

5:43 running things in a code-forward

5:46 method so as people have moved towards

5:50 DevOps and looking at infrastructure as

5:53 code these network orchestration type

5:56 systems Ansible obviously being the most

5:59 common one is really where people have

6:01 moved around their tooling but what we

6:04 start seeing is that when you look at

6:06 NSX and ACI and some of these new policy

6:09 based systems that really abstract away

6:13 a lot of the network constructs and lets

6:15 us control application layers and of

6:18 course these same vendors are talking

6:19 about this intent based system that

6:21 they're expecting to build at some point

6:24 in the future that's really fantasy at

6:26 this point but sounds great in marketing

6:29 we'll see where that whole thing goes

6:31 but it's clearly getting a lot of buzz

6:35 but not a lot of reality yet so that is

6:38 where you know things are going around

6:41 Automation we'll see how much of this is

6:43 smoke and mirrors and how much of it is

6:45 real but the transition towards that

6:48 code-based infrastructure management is

6:52 definitely happening and Damien

6:56 obviously used much cleaner data from

6:58 the survey I sort of used the raw data

7:00 here and he talked about the survey so

7:03 I'm not going to go into the details

7:05 behind it but Ansible is clearly the

7:07 tool that we see across our customers

7:10 and just from speaking to folks in terms

7:13 of where they're going but the challenge

7:16 is that Ansible isn't in itself an

7:18 out-of-the-box easy-to-use system you

7:20 have to write a lot of code so this

7:22 means that you really need different

7:24 types of engineers and I'll talk about

7:26 that in a minute the other thing that

7:29 Damien also touched on is how broken

7:31 monitoring is if you had told me 10

7:34 years ago that we would still be doing

7:36 ping and up/down monitoring as our

7:39 primary method of understanding the

7:41 network I would have called you crazy

7:42 because I was doing packet capture and

7:45 flow analytics 10-15 years ago and

7:48 people today still aren't doing these

7:50 things which I find depressing

7:54 disheartening I don't know what you want to

7:56 call it but hopefully this will change

7:58 because monitoring has some

8:02 serious challenges but this move to you

8:07 know infrastructure as code and managing

8:09 the code is definitely real it's

8:12 happening it's already happened on

8:14 most of the infrastructure but the

8:16 network is coming along too the

8:19 challenge here is really that many of

8:21 these tools are really toolkits and some

8:26 of the tools that are out there like

8:28 Nornir for example aren't inherently built

8:31 to just manage the network they're

8:33 really built to manage everything

8:35 so although NAPALM is a network specific

8:38 toolkit and is extremely popular with

8:41 users of Ansible to provide additional

8:43 capabilities on top some of these other

8:46 tools are really very rudimentary

8:49 toolkits

8:50 some of them have a lot of vendor

8:52 restrictions to them

8:54 NAPALM only supports a few of the

8:56 leading manufacturers of networked

8:59 devices Netmiko actually supports a

9:03 lot more devices but there are some that

9:05 are in sort of like a limited level of

9:07 support as well and as Damien

9:11 also hinted at there's

9:13 not a standard implementation here so

9:16 every time a new engineer comes in and

9:19 sets one of these systems up it's going

9:21 to be different than every other company

9:23 so it's really difficult to figure out

9:27 best practices which is why it's

9:29 important to participate in communities

9:31 and understand what other people are

9:33 doing so that you can try to find the

9:35 best thing to fit your environment and

9:38 then finally the challenge with the

9:41 infrastructure as code approach is that it

9:43 doesn't support existing automation or

9:45 existing configuration so for many

9:48 organizations that are setting up a new

9:50 network they'll start from scratch with

9:53 something like Ansible which is a great

9:55 situation to be in unfortunately most of

9:58 us have existing infrastructure to

10:00 manage so it's very difficult to make

10:03 that transition just because you have to

10:06 deal with what is in place today as you

10:09 build for the future so definitely a

10:14 challenge the other thing is modeling

10:18 the network and understanding current

10:20 configuration is a big challenge

10:22 so although NETCONF which has been

10:24 extensively covered here over the last

10:26 four or five years and I won't go into

10:28 it it is a good attempt at

10:33 standardization it still has its own set

10:36 of challenges which means that for

10:39 someone like Kentik where we're really

10:41 trying to understand the infrastructure

10:43 we can't just rely on NETCONF we

10:46 actually have to do things like SNMP we

10:49 have to use other methods to discover

10:51 and understand what the devices are so

10:55 although there are standards here

10:56 they're not implemented in standard ways

10:58 so it becomes very difficult as you have

11:01 a complex network if you're lucky enough to

11:04 have standardized your entire network on

11:06 a single vendor or

11:08 a single platform of a single vendor

11:10 then that's great

11:12 but we don't see that as as something

11:14 that commonly happens just because we

11:17 build over time and you get this you

11:19 know change that's happening another

11:23 topic which I am actually currently

11:26 writing a blog about is really

11:28 streaming telemetry and how variant it

11:31 is across the vendors and the platforms

11:33 so many of you would love to get rid of

11:36 SNMP I don't particularly think SNMP is

11:39 a wonderful protocol it's definitely

11:41 very old but the transition to streaming

11:44 telemetry is a big challenge because

11:47 it's missing a lot of the

11:48 standards-based things that we built

11:50 with SNMP and so there's a lot of

11:53 challenges around what you can use

11:55 streaming telemetry for and what you

11:58 still need SNMP for so this is still

12:01 evolving and I wish we had some

12:03 standards behind streaming telemetry

12:06 like we did with SNMP some RFCs would

12:09 definitely be nice to have

12:11 there so I'm gonna I was actually gonna

12:18 show some demos but I found out on

12:20 Sunday that I couldn't do a live demo on

12:22 stage because there's no place to plug

12:25 laptops in so I recorded a couple short

12:27 videos to show a few things but before I

12:30 jump into that I'm gonna be showing a

12:33 little bit about some of the things that

12:35 Damien mentioned around ChatOps

12:36 because I do believe that that is

12:38 important a lot of these cutting edge

12:41 customers that use Kentik today also

12:44 build their own chatbots and have

12:47 integrated with lots of other systems a

12:49 really common one that we see is in that

12:51 box very popular I'm sure many of you

12:55 are using it it is a great system but it

12:58 requires some manual work to get it

13:01 populated and keep it up to date it's

13:04 pretty nice because it really can show

13:06 you how your network is set up what

13:08 types of devices you have how things are

13:11 connected and it's a great single source

13:14 of truth but it does require some

13:16 work to get up and running and make it

13:18 useful
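
The idea of a chatbot sitting in front of a source of truth can be sketched roughly as below. This is only an illustration under stated assumptions: the `INVENTORY` dictionary, its site and device names, and the `site <name>` command syntax are all hypothetical stand-ins, not the NetBox data model or any real bot's API. A real bot would query the inventory system instead of an in-memory dict.

```python
# Hypothetical in-memory stand-in for a source-of-truth inventory.
# A real chatbot would query the inventory system's API instead.
INVENTORY = {
    "los-angeles": ["la-edge-01", "la-core-01"],
    "new-york": ["ny-edge-01"],
}


def handle_command(text: str) -> str:
    """Turn a chat message such as 'site los-angeles' into a reply string."""
    parts = text.strip().lower().split()
    # Only one made-up command is supported in this sketch: site <site-name>
    if len(parts) != 2 or parts[0] != "site":
        return "usage: site <site-name>"
    site = parts[1]
    devices = INVENTORY.get(site)
    if devices is None:
        return f"unknown site: {site}"
    return f"{site}: {len(devices)} devices ({', '.join(devices)})"
```

Because the handler only maps text to text, the same function could sit behind any chat platform's message hook, which is roughly why a bot can answer the whole channel at once.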

13:21 so I'm going to show you a little bit

13:23 about ChatOps and where we think

13:26 things are going the idea is really most

13:31 of us use Slack or Microsoft Teams

13:33 or maybe some of the open source

13:35 projects out there to chat among

13:38 ourselves but these tools essentially

13:41 allow you to interact with a bot in

13:45 a chat room and collaborate more

13:48 easily especially when you're

13:50 troubleshooting problems you can bring

13:52 this data in directly to a conversation

13:54 it really creates a much more

13:56 collaborative environment so we're

14:00 working with Network to Code to open

14:02 source a chatbot that has a bunch of

14:05 integrations out of the box I'm going to

14:07 show you a little bit of this you can

14:09 see more details if you grab these

14:11 slides and look at that URL there's a

14:13 more extensive demo but the idea is how

14:16 can we bring together some of these

14:18 common tools that folks are using as

14:21 they try to change their operations as

14:23 well so a quick little demo of a chatbot

14:30 interacting with NetBox

14:32 so I'm going to show you here in the

14:34 video that this is a copy of NetBox and

14:39 I can basically click on my sites here I

14:42 can pull up a site like Los Angeles I

14:45 can look at the devices that are on the

14:47 site some configuration information now

14:50 I'm switching over to Slack and I can

14:52 actually make a command right here

14:54 to NetBox the same type of way tell it

14:58 that I'm interested in looking at a

14:59 particular site pick the same Los

15:02 Angeles site click Submit and instantly

15:05 I get the same type of data but it's a

15:07 much faster way to interact than using

15:10 that silly mouse thing you know so it

15:13 just kind of gives you an easier way to

15:15 collaborate and communicate and work

15:18 together as a team on troubleshooting

15:20 problems or understanding what's

15:22 happening in your environment so that's

15:24 kind of one example in this next example

15:28 I'm going to do a similar thing and show

15:30 you in Kentik you know something that

15:33 we can pull up here

15:34 sort of show me my top

15:37 NetFlow sources in terms of the devices

15:40 that we're seeing sending flow data in I

15:44 can easily pick a you know saved view

15:46 within Kentik right from the chatbot

15:49 it then pulls up the graph that I want

15:52 to look at I can see what devices are

15:54 sending the most I can do the same exact

15:57 thing in Kentik search for a saved

15:59 view the same type of data essentially

16:02 comes up the same exact graph and view

16:05 and I can see you know what's sending

16:08 the data and start slicing and dicing

16:10 from there but it just kind of makes it

16:12 easier for you to see that data

16:15 instantly in a collaborative team view

16:18 versus everyone working in their

16:20 own tools and their own browsers and

16:22 such so that's kind of the idea behind

16:25 what we're trying to build with the chatbot

16:27 and really why we see this as being

16:31 a future path that a lot of teams are

16:34 moving towards for running automation

16:36 and and collaborating and

16:38 troubleshooting together and Damien also

16:43 talked a little bit about some of the

16:44 tools that folks are using most people

16:48 will start with a CI type system so

16:51 naturally the sort of the first steps

16:54 that you do is take your configurations

16:56 and store them in some type of

16:58 source control system and then most

17:01 people will start building a

17:03 pipeline let's say I want to automate

17:05 some of the verifications some of the

17:07 checking that happens when a new

17:09 configuration is committed people will

17:12 start stringing together these different

17:13 tools typically in Python and they'll

17:17 run that with a system the most common

17:20 being Jenkins which is an open source CI

17:22 system but GitLab has a really nice CI

17:26 system as well and Git is also

17:29 integrated with it GitLab is a SaaS

17:32 service but they also it's open source

17:34 so you can download it and install the

17:36 whole system on-prem if you want to as

17:38 well there was also a talk yesterday

17:42 about Batfish and that's definitely

17:43 something that we commonly see for

17:45 validating some of your policy

17:47 as this gets checked in the idea is to

17:50 try to eliminate some of the errors

17:52 automate some of the checks and

17:55 verification and other things that you

17:57 would want to do and then of course as

18:00 you get more advanced some folks will do

18:03 instant deployment of that it runs

18:05 through all their verifications they see

18:07 that it checks out and then they'll

18:09 actually do the deployment directly with

18:11 Ansible it's pretty hard to

18:16 retrofit an existing network to do this

18:18 type of thing but as you evolve you can

18:21 start automating more of this pipeline

18:23 and be more confident in these

18:25 incremental changes and so that's the

18:28 goal is instead of doing a big Thursday

18:31 night change window push of everything

18:34 that you're doing is incrementally

18:36 releasing these things and being able to

18:38 better manage and roll back when you have

18:40 problems and so it is important as you

18:44 build these types of pipelines and the

18:47 verification that you do that

18:49 closed-loop type system so incorporating

18:52 monitoring or running a synthetic test

18:54 let's say you close a firewall or you

19:00 close a port on an ACL or something

19:02 like that you would run a test and make

19:04 sure that that was in fact closed and

19:06 that the connection didn't go through so
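
A minimal sketch of that kind of post-change synthetic test, assuming plain TCP reachability is a good enough signal. The function names `port_is_open` and `verify_port_closed` are made up for illustration and are not taken from any tool mentioned in the talk.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def verify_port_closed(host: str, port: int) -> None:
    """Fail loudly if a port that the change should have closed still answers."""
    if port_is_open(host, port):
        raise AssertionError(f"{host}:{port} still accepts connections")
```

A check like this would run as a pipeline step right after the change is pushed, in the same spirit as a unit test that runs after a code commit. Note that it cannot tell a silently dropped connection from a host that is simply down, so it is usually paired with a positive test against a port that should remain open.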

19:09 people will often incorporate that type

19:11 of testing it's almost like unit testing

19:13 with the code that they commit and it is

19:17 really common as you go into continuous

19:19 deployment that your deploys fail that

19:21 is totally normal and there's nothing

19:25 wrong with it don't think of it as a

19:26 failed release or a failed build it's

19:30 really just part of doing these

19:32 incremental changes that things do fail

19:34 the screenshot on the bottom is actually

19:36 a pretty advanced CD system called

19:39 harness and managing that failure is

19:43 definitely part of any CD strategy is

19:46 understanding we've failed how do we

19:49 revert things back quickly and make sure

19:51 things are back in the state they were

19:53 before the change so it is an important

19:56 part of understanding continuous

19:58 deployment is that that failure is

20:01 expected and it is normal it's how you

20:05 recover so a couple of things that are

20:09 happening more broadly in the industry

20:12 there's a new term and a new market

20:15 that's been forming over the last few

20:17 years called AIOps something that we've

20:19 been part of at Kentik for sure but

20:23 the idea here and this is a Gartner

20:25 terminology in terms of what they're

20:28 seeing happening in the industry we see

20:31 a lot of folks building this with

20:33 open-source or using you know commercial

20:36 technologies but the idea is how do I

20:38 bring all of this data the configuration

20:41 the logs the metrics the traffic and

20:45 store this in a central system and the

20:48 idea is of course first monitoring and

20:50 understanding what's happening but being

20:52 able to drive service management

20:54 ticketing paging and then obviously the

20:58 ultimate goal is to tie in to that

21:00 automation and make this a closed-loop

21:02 system so there's a lot of things that

21:05 you can accomplish with a system like

21:06 this but it's a new type of technology

21:09 and the market is still really evolving

21:11 so it's very immature and fragmented and

21:15 not well understood but the goal is

21:17 really how do we do this closed-loop

21:19 type of thing that we've been

21:22 talking about here

21:25 so you can kind of think of this as like

21:27 an evolution of phases where you start

21:30 off by monitoring and ultimately you

21:33 want to get all the way to that goal of

21:35 automation and there's a lot of steps

21:36 and things that you can use these types

21:38 of systems to improve within your

21:41 organization that are in between those

21:43 two sort of methods so this is kind

21:47 of where these platforms are evolving to

21:49 and I'm sure we'll see lots of great

21:51 open source also facilitating some of

21:54 this but it's an exciting area because

21:57 it's really how we start taking this

21:59 data and tying it to automation and

22:02 making that easier for everyone which I

22:05 think is a very valid goal and really

22:09 it's supposed to help us identify

22:11 these problems and what's happening

22:14 sooner prioritize what's the most

22:17 important and then get more out of our

22:19 people by better automation and closing

22:22 the loop with these things so that's

22:26 kind of a you know an overview of what

22:28 we're seeing but it's not fully reality

22:32 it's still formulating and coming

22:34 together and I think we'll see that

22:35 get much better so with that I'm open to

22:41 take a few questions and thanks everyone

22:45 for listening
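
The closed-loop pattern the talk keeps returning to (apply an incremental change, verify it, and revert quickly if verification fails) can be sketched as follows. This is an assumption-laden outline rather than any specific product's workflow: `deploy_with_rollback` and its three callables are hypothetical names, and in practice each callable would wrap a real deployment, monitoring, or validation step.

```python
from typing import Callable, Iterable


def deploy_with_rollback(
    apply_change: Callable[[], None],
    checks: Iterable[Callable[[], bool]],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change, run every check, and roll back on the first failure.

    Returns True if the change was kept, False if it was reverted.
    """
    apply_change()
    for check in checks:
        if not check():
            # A failed check is treated as an expected outcome, not a crash:
            # restore the previous state and report the result to the caller.
            rollback()
            return False
    return True
```

Returning a boolean instead of raising reflects the point made in the talk that a failed deploy is normal, and that what matters is how quickly the previous state is restored.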
