- Network Programmability
- NetDevOps
- Hands-on with NetDevOps
Do you often ask yourself why we keep configuring our network devices in the same way we have been doing it for the last 30 years? Isn't it strange that we still have to log into each individual box and use command-line instructions to perform any changes? Do you wonder if there might be a more optimal way of configuring your infrastructure, instead of CLI? Does this way of working make you feel like any simple change in your network is complex to implement?
You are not alone.
There are definitely alternative and innovative ways of programming your network infrastructure. Yes, when you configure your network devices to adopt a certain behaviour, or implement a new available feature, you are programming them. So one of the first things we should be looking for is more optimal ways of programming our infrastructure.
Furthermore, as the network exists to provide connectivity for applications, we should take a look at how these are evolving. Agile, microservices-based, cloud-native development, DevOps automation with CI/CD pipelines, and automated unit testing enable really dynamic application development for quick time-to-market requirements. Let's not forget that software is one of the most important assets to differentiate modern enterprises from their competition. Being able to quickly implement new features, deploy new locations, or fix issues is absolutely key to their success.
In recent years, servers have been virtualized with Virtual Machines that can be automatically deployed in minutes. These days the trend is moving towards container-based microservices, which are deployed insanely fast. These are short-lived entities that may be deployed dynamically across hybrid cloud environments, interacting with each other to provide the desired service with virtually unlimited scalability, and adapting to any possible issues in the underlying infrastructure via declarative statements.
In comparison, network infrastructure is much more static. In order to accommodate requirements from application developers it needs to be faster, more flexible and cost-optimized. Today network configuration is often a completely manual process that makes any desired change across the network complex and slow. The more elements these changes include (eg. firewalls, load-balancers...) the more difficult it gets to make them quick, reliable and adaptable. This situation often leads to bare-minimum network configurations that allow for faster deployment (eg. no security ACLs, no QoS config, or trunking every VLAN on an interface) but usually lead to much bigger concerns.
Infrastructure is full of products designed to be used by... humans. It may not always seem that way, but human operators are the target users for CLI and web interfaces. This means that when you need to get something done via these interfaces, you (or some other human) have to do the work.
You won't have to think back too far to remember the last time you needed to complete some bulk-task on a computer. The task probably involved a lot of clicking, typing, copying-and-pasting, or other mind-numbing repetitions. These human interfaces (and the paradigm of having humans do the work) are to blame for the bulk-work that we sometimes have to do to complete a task.
Our brain has a great capacity, but clearly human input/output interfaces with a computer (typing and reading) are not very fast. Our thoughts neck down to this tiny straw, which output-wise is like poking things with your meat sticks, or using words (speaking or tapping things with fingers). For example, machine typing usually happens at a 20th of the speed you are thinking. And I am talking ten-finger typing, let's not even go into two-thumb typing... So while Elon Musk finishes his BMI (Brain Machine Interface), aka Wizard Hat, we will have to explore alternative options that optimize how we configure our networks.
Computers are great at bulk-work, but if you want your computer to talk to your infrastructure and do something, you will need a machine-to-machine interface or API (Application Programming Interface): an interface designed for software pieces to interact with each other.
By 2020, only 40% of network operations teams will use the command line interface (CLI) as their primary interface, which is a decrease from 75% in 2Q18. (Gartner, 2018 Strategic Roadmap for Networking)
Network Programmability uses a set of software tools to deploy, manage and troubleshoot network devices and controllers via APIs, gathering data and driving configurations to enhance and secure application delivery. This software can run on-box or off-box, and work on-demand or event-driven.
We can ask an API to:
- Take some action
- Provide us with some piece of information
- Store some piece of information
We use these machine-to-machine APIs to make simple requests to our infrastructure, which in aggregate, enable us to complete powerful tasks.
For example, you might use APIs to make simple requests like...
- Get the status for interface X
- Get the last-change time for interface X
- Shutdown interface X
- Set the description of interface X to "Interface disabled per Policy"
... and that way complete a powerful task like: "Disable all ports that have been inactive for 30 days."
Sure, you could do this manually, but wouldn't it be better to codify the process (write it once) and then let your computer run this task whenever you need it done?
Besides this, the information included in API responses should be structured data that machines can read programmatically (and ideally humans too). Classic CLI responses are human-readable text that a machine can only interpret after laborious parsing.
If you need information from your infrastructure, ask for it. Using a machine-to-machine API means your request will either complete, with your data returned in a programmatic data structure, or you will receive notification to the contrary. All of this happens in a way that enables you to automate the interaction. APIs make it easy to send requests to your infrastructure, but what makes it easy to codify the processes?
Coding is the process of writing down instructions, in a language a computer can understand, to complete a specific task.
Let's consider a simple codified process that we are asking a computer to follow:
- For each switch in my network...
  - For each interface in the switch...
    - If the interface is down, and hasn't changed states in more than thirty days, then:
      - Shutdown the interface
      - Update the interface description to mention why it's been shut down
for switch in my_network:
    for interface in switch:
        if interface.is_down() and interface.last_change() > thirty_days:
            interface.shutdown()
            interface.set_description("Interface disabled per Policy")
This is essentially the process that you, as a human, would go through to complete the same task. By taking the time to codify it (write it down in a machine interpretable language), you can now ask the computer to do the task whenever you need it done. You, the human, are providing the intelligence (what needs to be done and how it should be done), while letting the computer do the boring and repetitious work (which is what it does best).
While the code sample above is a snippet of a larger script, and is calling other functions (like `interface.last_change()` and `interface.shutdown()`), implementing the utility functions is straightforward and the code shown is actual valid Python code that would complete the task. The core logic is that simple.
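To make the idea concrete, here is a minimal runnable sketch of the same process. The `Interface` and `Switch` classes are hypothetical stand-ins invented for this illustration (they only change local state; a real implementation would send API requests to the devices), and the sample data is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Interface:
    """Hypothetical stand-in for a device interface reached through an API."""
    name: str
    oper_up: bool
    changed_at: datetime          # time of the last state change
    admin_enabled: bool = True
    description: str = ""

    def is_down(self) -> bool:
        return not self.oper_up

    def idle_time(self, now: datetime) -> timedelta:
        return now - self.changed_at

    def shutdown(self) -> None:
        # A real implementation would send an API request to the device here.
        self.admin_enabled = False

    def set_description(self, text: str) -> None:
        self.description = text


@dataclass
class Switch:
    name: str
    interfaces: list = field(default_factory=list)


def disable_inactive_ports(network, now, max_idle=timedelta(days=30)):
    """Shut down every interface that has been down for longer than max_idle."""
    disabled = []
    for switch in network:
        for interface in switch.interfaces:
            if interface.is_down() and interface.idle_time(now) > max_idle:
                interface.shutdown()
                interface.set_description("Interface disabled per Policy")
                disabled.append((switch.name, interface.name))
    return disabled


# Made-up sample data: one switch with three ports.
now = datetime(2018, 6, 1, tzinfo=timezone.utc)
network = [
    Switch("sw1", [
        Interface("Gi1/0/1", oper_up=True, changed_at=now - timedelta(days=90)),
        Interface("Gi1/0/2", oper_up=False, changed_at=now - timedelta(days=45)),
        Interface("Gi1/0/3", oper_up=False, changed_at=now - timedelta(days=2)),
    ]),
]
disabled = disable_inactive_ports(network, now)
print(disabled)
```

Only the port that is both down and idle for more than thirty days is touched; the port that is up and the one that went down two days ago are left alone.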
APIs and programming languages aren't new, so, why the recent hype?
Well... they have matured!
Modern programming languages like JavaScript, Python, Go, Swift, and others are less cumbersome and more flexible than their predecessors. It used to be that you had to write 10,000 lines of C++ code to do anything useful, but with these modern languages (and packages and libraries available from their developer communities) you can do powerful things in less than 300 lines of code, which is probably shorter than, or on par with, most Cisco IOS configurations that you have worked with.
These languages, when combined with other modern developer tools (eg. Git repositories, Package management systems, Virtual environments, Integrated Development Environments) equip you with powerful development tools that enable you to automate your tasks and processes and begin creating your own set of powerful tools and workflows.
While these tools are great, and are now bringing rich value to the systems engineering discipline, we are also benefiting from another maturing area of the software development industry.
In the past, when you set out to create some script or program, you often had to start from scratch, working with low-level standard libraries included with your programming language and toolset of choice. This created a high barrier to entry (and massive global repetition) as software developers had to write the same heavy-lifting modules to get common tasks done. Take for example making an HTTPS web request, where they had to write code to:
- Open a TCP connection on port 443
- Handle TLS negotiation and exchange certificates
- Validate the certificates
- Manage the TCP connection (and any connection pooling)
- Format HTTP requests
- Interpret HTTP responses
That is a lot of work when all the developer wanted to do was to get or send some data to / from some remote server. This is the reason why engineers left this work to software developers.
Now, thanks to the Open Source community, social code-sharing and collaboration sites like GitHub, and public package repositories, the developer communities around these new modern programming languages are building and sharing Open Source software libraries that help to encourage reuse and reduce duplicate work. Leveraging these community-created libraries can save you tremendous amounts of time and effort, and they enable you to focus your time and effort on what you want your code to do: your codified process.
You can make an HTTPS request without much personal investment, because of the work done by these online communities.
$ pip install requests
Collecting requests
Using cached
<-- output omitted for brevity -->
$ python
>>> import requests
>>> requests.get("https://api.github.com")
<Response [200]>
What you are seeing here is the following:
- We installed a community library from a public package repository (`pip install requests`)
- We entered a Python interactive shell (`python`)
- We imported the library into our Python code (`import requests`)
- We made an HTTPS request to https://api.github.com and it was successful (`<Response [200]>`)
Starting with installing the `requests` package on our machine, in four typed lines in a terminal we were able to download and install the package and use it to make an HTTPS request (without having to think about the steps involved with making the HTTPS request).
Now that languages and tools have evolved to be useful for infrastructure engineers, APIs have become easier to work with.
Gone are the days where it took an expert programmer to work with a product's API. Previous API standards like SOAP proved themselves to be not so simple, and easier to use API models like RESTful APIs have taken their place.
Now, thanks to RESTful APIs and standardized data formats like JSON, you can make requests of your infrastructure with the same ease these modern programming languages provide.
Let's do a quick review of the different foundational coding building blocks that network engineers will need to understand and use when entering the programmability world.
Data models are conceptual representations of data, that define what specific information needs to be included and the format to represent it. A data model can be accessed by multiple source applications, via different communication protocols.
YANG (Yet Another Next Generation) is a data modelling language defined originally in RFC 6020 and updated later in RFC 7950 (YANG 1.1). It has its own compact syntax (with an equivalent XML representation called YIN) to describe the data model for network devices, and it is composed of modules and sub-modules that represent individual YANG files. YANG modules are self-documenting hierarchical tree structures for organizing data.
+--rw interfaces
|  +--rw interface* [name]
|     +--rw name                        string
|     +--rw description?                string
|     +--rw type                        identityref
|     +--rw enabled?                    boolean
|     +--rw link-up-down-trap-enable?   enumeration
+--ro interfaces-state
   +--ro interface* [name]
      +--ro name               string
      +--ro type               identityref
      +--ro admin-status       enumeration
      +--ro oper-status        enumeration
      +--ro last-change?       yang:date-and-time
      +--ro if-index           int32
      +--ro phys-address?      yang:phys-address
      +--ro higher-layer-if*   interface-state-ref
      +--ro lower-layer-if*    interface-state-ref
      +--ro speed?             yang:gauge64
      +--ro statistics
         +--ro discontinuity-time    yang:date-and-time
         +--ro in-octets?            yang:counter64
         +--ro in-unicast-pkts?      yang:counter64
         +--ro in-broadcast-pkts?    yang:counter64
         +--ro in-multicast-pkts?    yang:counter64
         +--ro in-discards?          yang:counter32
         +--ro in-errors?            yang:counter32
         +--ro in-unknown-protos?    yang:counter32
As you can see in the previous example, YANG modules are used to model configuration and state data. Configuration data can be modified (rw), while State data can only be read (ro).
YANG is based on standards from IETF, OpenConfig and others. It is supported by most networking vendors in their own devices, and allows them to augment or deviate models, in order to include vendor / platform specific information.
YANG data models are publicly available here. As you browse through the hundreds of them, you might soon realize that finding the model you are looking for may be quite time-consuming. To make your life easier please take a look at Cisco YANG Explorer, an open-source YANG browser and RPC builder application to experiment with YANG data models.
Once you decide to use YANG data models in your code, you will need to use libraries for your preferred programming language. If your choice is Python, as it is for many network engineers, you should definitely check out pyang. This Python library can be used to validate YANG modules for correctness, to transform YANG modules into other formats, and even to generate code from the modules.
Finally you might also be interested in taking a look at the capabilities offered by the YANG Catalog, a registry that allows users to find models relevant to their use cases from the large and growing number of YANG modules being published. You may read-access it via NETCONF or REST, to validate YANG modules, search the catalog, view module's details, browse modules and much more.
Now that we know how to model data and store it locally, we need to start considering how to communicate it machine-to-machine. It is critical that our system knows how to send requests to network devices, and what format to expect when receiving responses.
The classic approach with CLI provides us with unstructured, human-readable text:
GigabitEthernet1 is up, line protocol is up
Description: TO_vSWITCH0
Internet address is 172.16.11.11/24
MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full Duplex, 1Gbps, media type is RJ45
This type of text output is great for human-machine interaction, because our brain easily understands the information reading through it. However this is not a good format for machine-to-machine communication, because the system receiving this text would need to be programmed to parse through it, in order to extract the values for the different included fields. Yes, we could program the system to do it, using regular expressions. But there would be important drawbacks: not only implementing how to extract the relevant keys and values, but also how to do it for different platforms and vendors. Please consider that each OS will provide a slightly / largely different text output to show the same kind of info. So we would need to parse things differently for each case... definitely not the best approach.
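As a rough illustration of that parsing effort, the sketch below extracts a few fields from the output above with regular expressions. The patterns and the `parse_show_interface` helper are assumptions written for this one output format; another platform or OS version would likely need different expressions, which is exactly the drawback described.

```python
import re

CLI_OUTPUT = """\
GigabitEthernet1 is up, line protocol is up
  Description: TO_vSWITCH0
  Internet address is 172.16.11.11/24
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
"""

# One hand-written pattern per field we want to extract.
PATTERNS = {
    "portName": re.compile(r"^(\S+) is (?:administratively )?(?:up|down)", re.M),
    "description": re.compile(r"^\s*Description: (.+)$", re.M),
    "ipv4Address": re.compile(r"Internet address is (\d+\.\d+\.\d+\.\d+)/\d+"),
    "prefixLength": re.compile(r"Internet address is \S+/(\d+)"),
}


def parse_show_interface(text):
    """Return a dict of extracted fields (None where a pattern finds nothing)."""
    parsed = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        parsed[key] = match.group(1) if match else None
    return parsed


interface = parse_show_interface(CLI_OUTPUT)
print(interface)
```

Every value comes back as a string, and a small change in the wording of the output silently turns a field into `None`.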
Considering that we have defined a common data model, let's also agree on a common format to exchange that data. Instead of the previous text we would like to receive something like the following:
{
  "description": "TO_vSWITCH0",
  "ipv4Address": "172.16.11.11",
  "ipv4Mask": "255.255.255.0",
  "portName": "GigabitEthernet1"
}
This is an example of data in structured format, and it is critical for our systems to easily process information exchanged between machines.
There are two common formats for data interchange being used these days: JSON and XML.
JSON (JavaScript Object Notation) is more modern and commonly used by new APIs. With its simple key:value approach, it is very lightweight, easy for systems to generate and parse, but also easy for humans to read.
{
  "className": "GRETunnelInterface",
  "status": "up",
  "interfaceType": "Virtual",
  "pid": "C9300-48U",
  "serialNo": "FCW2123L0N3",
  "portName": "Tunnel201"
}
No, you don't need to know any JavaScript to work with JSON. They just happen to share the syntax, but no need at all to be a JavaScript developer when using JSON as the data transfer format between systems.
Python users can easily work with JSON, using its own standard library:
import json
This library allows you to easily work with JSON as native Python objects. Very often you will load JSON data into Python dictionaries, whose key:value pairs let you retrieve the field you need with a direct lookup by key.
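A short sketch of that workflow, using the sample interface data from this section as input:

```python
import json

raw = """
{
  "className": "GRETunnelInterface",
  "status": "up",
  "interfaceType": "Virtual",
  "pid": "C9300-48U",
  "serialNo": "FCW2123L0N3",
  "portName": "Tunnel201"
}
"""

interface = json.loads(raw)            # JSON text -> Python dictionary
print(interface["portName"])           # direct lookup by key
print(interface.get("mtu", "n/a"))     # .get() avoids a KeyError for a missing key

interface["status"] = "down"           # modify it like any other dictionary
encoded = json.dumps(interface, indent=2)   # Python dictionary -> JSON text
print(encoded)
```

`json.loads` and `json.dumps` work on strings; their counterparts `json.load` and `json.dump` do the same with file objects.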
Later we will discuss communication protocols, but for your reference please make a note that both REST APIs and RESTCONF support JSON and XML.
XML (eXtensible Markup Language) is a bit older, but still used by a lot of APIs. It is used for data transfer, but sometimes also to store info. It is language-independent and designed to be self-descriptive, although, compared to JSON, tagging makes it a little bit more difficult to read for humans.
<interface>
  <name>GigabitEthernet1</name>
  <description>TO_vSWITCH0</description>
  <type xmlns:ianaift="urn:ietf:params:xml:ns:yang:iana-if-type">ianaift:ethernetCsmacd</type>
  <enabled>true</enabled>
  <ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
    <address>
      <ip>172.16.11.11</ip>
      <netmask>255.255.255.0</netmask>
    </address>
  </ipv4>
</interface>
XML is not the same as HTML: XML carries data, while HTML represents it.
Python users also benefit from multiple available resources to work with XML, like ElementTree objects, Document Object Model (DOM), Minimal DOM Implementation (minidom), and xmltodict.
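As a minimal sketch of the ElementTree option, the snippet below reads a trimmed copy of the interface document from this section. The `ip` prefix in the `NS` mapping is an arbitrary local alias chosen for this example; what matters is the namespace URI it points to.

```python
import xml.etree.ElementTree as ET

XML_DOC = """
<interface>
  <name>GigabitEthernet1</name>
  <description>TO_vSWITCH0</description>
  <enabled>true</enabled>
  <ipv4 xmlns="urn:ietf:params:xml:ns:yang:ietf-ip">
    <address>
      <ip>172.16.11.11</ip>
      <netmask>255.255.255.0</netmask>
    </address>
  </ipv4>
</interface>
"""

# Elements under <ipv4> live in the ietf-ip namespace, so lookups need it.
NS = {"ip": "urn:ietf:params:xml:ns:yang:ietf-ip"}

root = ET.fromstring(XML_DOC)
name = root.findtext("name")
enabled = root.findtext("enabled") == "true"   # XML text is always a string
address = root.findtext("ip:ipv4/ip:address/ip:ip", namespaces=NS)
netmask = root.findtext("ip:ipv4/ip:address/ip:netmask", namespaces=NS)
print(name, enabled, address, netmask)
```

Forgetting the namespace is a common stumbling block: `root.findtext("ipv4/address/ip")` returns `None` here, because the unqualified names do not match the namespaced elements.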
You may learn more about XML in this tutorial.
By now you should have a clearer view of the relationship between YANG and JSON/XML. YANG is the language used to define the data model, which describes network device configuration and status. JSON and XML are data exchange formats to represent the information described by the data model, so it can easily be understood by both machines and humans.
JSON displays information in a clearer way and will be used more frequently by modern systems. However XML is still required for multiple systems that support it exclusively.
Now that we understand data models and data transfer formats, we need to consider what protocol to use in order to exchange that information. NETCONF and RESTCONF are different protocols that you will need to use depending on the availability provided by your platform.
NETCONF (Network Configuration Protocol, RFC 6241) is a network management protocol developed and standardized by the Internet Engineering Task Force (IETF). It supports a rich set of functionality to manage configuration and operational data, and can manage the running, candidate and startup configurations of network devices. The NETCONF protocol defines a simple mechanism through which a network device can be managed, configuration data can be retrieved, and new configuration data can be uploaded and manipulated. The NETCONF protocol uses Remote Procedure Calls (RPCs) for its paradigm, such as `get-config`, `edit-config`, or `get`. A client encodes an RPC in XML and sends it to a server using a secure, connection-oriented session (such as Secure Shell Protocol [SSH]). The client (application) initiates the connection towards the server (network device) over SSH, by default on TCP port 830. The server responds with a reply encoded in XML, and there is a capability exchange during session initiation, also using XML encoding.
Let's take a look at an example of how we could use Python to connect to a device via NETCONF.
from ncclient import manager
import xml.dom.minidom

# RW_HOST, PORT, USER and PASS are placeholders for your device address,
# NETCONF port and credentials; define them before running.
# hostkey_verify=False skips SSH host key checking (lab use only).
with manager.connect(host=RW_HOST, port=PORT, username=USER, password=PASS,
                     hostkey_verify=False, device_params={'name': 'default'},
                     allow_agent=False, look_for_keys=False) as m:
    # XML filter to issue with the get operation
    # IOS-XE 16.6.2+ YANG model called "ietf-interfaces"
    interface_filter = '''
        <filter xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
          <interfaces-state xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
            <interface>
              <name>GigabitEthernet1</name>
            </interface>
          </interfaces-state>
        </filter>
    '''
    result = m.get(interface_filter)
    # Parse the reply into a DOM document and pretty-print it
    xml_doc = xml.dom.minidom.parseString(result.xml)
    print(xml_doc.toprettyxml(indent="  "))
We start by importing the NETCONF and XML libraries we will be using (`ncclient` is a Python library that facilitates client-side scripting and application development around the NETCONF protocol). Then we connect to the device IP (`RW_HOST`), using the specified port for SSH (`PORT`) and the required credentials (`USER`/`PASS`). Once connected we define specifically what we want to receive (`interface_filter`) and make the request (`m.get`). `get` is the method used to request operational data, but you could also ask for configuration data using `get-config`, or modify that configuration using `edit-config`. The final step is to parse the result into a DOM document, using the minidom library, so we can work with it.
And voilà, you get an XML response showing operational data for the requested interface.
<rpc-reply message-id="urn:uuid:50bf9d6e-7e5c-4182-ae6b-972a055ceef7" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0">
<data>
<interfaces-state xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
<interface>
<name>GigabitEthernet1</name>
<admin-status>up</admin-status>
<oper-status>up</oper-status>
<phys-address>00:0c:29:6c:81:06</phys-address>
<speed>1024000000</speed>
<statistics>
<in-octets>5432293472</in-octets>
<in-unicast-pkts>28518075</in-unicast-pkts>
……………
<out-octets>2901845514</out-octets>
<out-unicast-pkts>18850398</out-unicast-pkts>
</statistics>
</interface>
</interfaces-state>
</data></rpc-reply>
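Once a reply like this has been parsed with minidom, individual values can be pulled out of the DOM document. The sketch below works on a trimmed copy of the reply above, so it runs without a device; `first_text` is a small helper written for this example, not part of minidom.

```python
import xml.dom.minidom

# Trimmed copy of the reply above, embedded so the example runs offline.
REPLY = """<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <data>
    <interfaces-state xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces">
      <interface>
        <name>GigabitEthernet1</name>
        <admin-status>up</admin-status>
        <oper-status>up</oper-status>
        <speed>1024000000</speed>
      </interface>
    </interfaces-state>
  </data>
</rpc-reply>"""


def first_text(node, tag):
    """Return the text of the first <tag> element under node, or None."""
    elements = node.getElementsByTagName(tag)
    if not elements or elements[0].firstChild is None:
        return None
    return elements[0].firstChild.data


doc = xml.dom.minidom.parseString(REPLY)
summaries = []
for interface in doc.getElementsByTagName("interface"):
    summaries.append({
        tag: first_text(interface, tag)
        for tag in ("name", "admin-status", "oper-status", "speed")
    })
print(summaries)
```

Note that every value, including the speed, comes back as text and has to be converted explicitly (for example with `int()`) before doing arithmetic on it.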
RESTCONF (RFC 8040) is based on the idea of adding a REST API to NETCONF. It can manage configuration and operational data defined in YANG models, and the URLs, HTTP verbs, and request bodies are automatically generated from those associated YANG models. RESTCONF uses HTTP(S) as transport, and supports both XML and JSON as data transfer formats, while NETCONF only supports XML. Also, RESTCONF supports only a subset of NETCONF, so not all operations are supported.
Remember that since REST principles are being used, RESTCONF is based on stateless connections. As such, every application using RESTCONF writes directly to the running configuration, with no support for candidate configuration.
Being based on REST, RESTCONF supports the following methods:
- GET, to read/retrieve info
- POST, to create a new record
- PATCH, to update only some values of an existing record
- PUT, to update all values of an existing record
- DELETE, to erase an existing record
Let's take a look at how to use it.
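import requests

# RO_HOST and ROUTER_AUTH are placeholders: the device address and, for example,
# a (username, password) tuple. verify=False skips TLS certificate checks (lab use only).
# The top-level node in a RESTCONF path is qualified with its YANG module name.
url = f'https://{RO_HOST}/restconf/data/ietf-interfaces:interfaces-state/interface=GigabitEthernet1'
header = {'Content-type': 'application/yang-data+json',
          'accept': 'application/yang-data+json'}
response = requests.get(url, headers=header, verify=False, auth=ROUTER_AUTH)
interface_info = response.json()
oper_data = interface_info['ietf-interfaces:interface']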
In this case we are sending an HTTP(S) request to our network device REST API. The URL structure will include the network device IP address (`RO_HOST`) and the resource we are asking about (`interface=GigabitEthernet1`). Then we will have to define the HTTP headers to send, specifying in this case the content type we are sending (YANG encoded in JSON) and the content we expect to receive in the response (YANG encoded in JSON). Finally we parse the JSON into a Python dictionary and extract the relevant info from the structured data.
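The URL structure can be sketched with a small helper. `restconf_data_url` is a hypothetical function written for this illustration, the `/restconf` root is an assumption (it is a common default, but a device may advertise a different one), and `192.0.2.1` is a documentation address. The helper percent-encodes the list key, which matters for interface names that contain a slash.

```python
from urllib.parse import quote


def restconf_data_url(host, module, container, list_name=None, key=None):
    """Build https://<host>/restconf/data/<module>:<container>[/<list>=<key>]."""
    url = f"https://{host}/restconf/data/{module}:{container}"
    if list_name is not None and key is not None:
        # Percent-encode the key so that "/" in names such as
        # GigabitEthernet1/0/13 is not read as a path separator.
        url += f"/{list_name}={quote(key, safe='')}"
    return url


simple = restconf_data_url("192.0.2.1", "ietf-interfaces", "interfaces-state",
                           "interface", "GigabitEthernet1")
slashed = restconf_data_url("192.0.2.1", "ietf-interfaces", "interfaces-state",
                            "interface", "GigabitEthernet1/0/13")
print(simple)
print(slashed)
```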
{
"ietf-interfaces:interface": {
"name": "GigabitEthernet1",
"admin-status": "up",
"oper-status": "up",
"last-change": "2018-01-17T21:49:17.000387+00:00",
"phys-address": "00:0c:29:6c:81:06",
"speed": 1024000000,
"statistics": {
"in-octets": 5425386232,
"in-unicast-pkts": 28489134,
……………
"out-octets": 2899535736,
"out-unicast-pkts": 18844784
}
}
}
So the overall picture looks like this now:
Network device information is modelled in YANG to make it consistent, independent of the underlying infrastructure. Then that information can be represented with JSON or XML, and accessed by means of NETCONF or RESTCONF from a remote client.
By now you might be wondering: what is REST? It stands for Representational State Transfer, and it was born from the need to create a scalable Internet, where software systems could interact with each other in a uniform and efficient way.
It is a simple-to-use communications architecture style (not a standard) for networked applications, based on the client-server model. It expects all information required for the transaction to be provided at the time of the request. Client could be an application or a REST client, like Postman for development and testing. Server could be a system, network device, or network management application.
REST is stateless, so the server will close the connection after the specified exchange is completed, and no state will be maintained on the server side. This way it makes transactions very efficient.
Just as you use an HTTP GET method when browsing the internet, and the server provides you with a website in HTML format that your browser decodes to make it human readable, REST APIs answer GET requests from other systems with structured data (in JSON or XML) specifically addressed to them.
Think about SDN and NFV, where different types of controllers need to communicate and exchange information with multiple devices. Applications sitting on top of those controllers can actually query anything that the controller knows about the network below it. This can be operational data, configuration data, stats about a single device with a 10GE interface, etc. Applications then take this information, process it and then program the controller by sending a POST instead of a GET request.
RESTful APIs are REST-based APIs that use request-response communications over the HTTP protocol for the following operations (CRUD):
- Post: Create a new resource
- Get: Retrieve/Read a resource
- Put: Update an existing resource
- Delete: Delete a resource
It includes five components that may be required in each Request:
- URL: application server and the API resource
- Auth: there are a few different authentication methods, not standardized, required to identify who is making the request (HTTP Basic, Custom, OAuth, none)
- Headers: define content-type and accept-type, communicating to the server the format of data we will send and expect to receive (JSON or XML)
- Request Body (optional): may be missing if no data is required to be sent with the request
- Method: What is the task we ask the server to perform (ie. use POST to create a new record, or PUT to update an existing one)
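The five components listed above can be seen side by side in the following sketch, which builds a request object with Python's standard library without sending it. The URL, credentials and body are placeholder values invented for the illustration.

```python
import base64
import json
from urllib.request import Request

# All values below are placeholders for illustration only.
url = "https://controller.example.com/api/v1/host"              # URL
credentials = base64.b64encode(b"user:password").decode()        # Auth (HTTP Basic)
headers = {                                                      # Headers
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Basic {credentials}",
}
body = json.dumps({"hostIp": "10.93.140.35"}).encode()           # Request body
request = Request(url, data=body, headers=headers, method="POST")  # Method

# Nothing has been sent yet; urllib.request.urlopen(request) would do that.
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
print(request.data)
```

Libraries such as `requests` take the same five inputs, just with a more convenient interface.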
Let's take a look at the format in this example:
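import requests

# DNAC_IP (base URL of the controller), client_ip and dnac_jwt_token
# (an auth token obtained beforehand) are placeholders defined elsewhere.
url = DNAC_IP + '/api/v1/host?hostIp=' + client_ip
header = {'content-type': 'application/json', 'Cookie': dnac_jwt_token}
response = requests.get(url, headers=header, verify=False)
client_json = response.json()
client_info = client_json['response'][0]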
First we need to define the URL with the address of the server (ie. `DNAC_IP`) and the route to the required resource (ie. `/api/v1/host?hostIp=` combined with the IP of an end system). Then we specify the required headers, defining the format we are sending (JSON) and the required auth cookie. With that info we open the connection, make the request and store the response to parse it.
Since these are HTTP requests we are sending, the server will answer with an HTTP status code, headers and a response body.
Some possible HTTP status codes:
- 2xx Success: 200 OK, 201 Created
- 4xx Client Error: 400 Bad Request (the request, often its payload, is malformed), 401 Unauthorized (something is wrong with the authentication), 404 Not Found (most likely the URL is wrong)
- 5xx Server Error: 500 Internal Server Error
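A script will usually branch on the class of the status code before touching the response body. The sketch below shows one way to do it with the standard library; `describe_status` is a helper name chosen for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Map an HTTP status code to its class and standard reason phrase."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:          # code not defined in the standard library enum
        phrase = "Unknown"
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return category, phrase


for code in (200, 201, 400, 401, 404, 500):
    print(code, *describe_status(code))
```

A client error generally means the request itself must be fixed before trying again, while a server error may be worth retrying later.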
Headers will define the content-type (JSON or XML), cache control, date and encoding.
The response body will be the payload, including the requested data in JSON or XML, depending on the headers provided during the request.
Response 200 / success
Cache-Control →no-cache
Content-Type →application/json;charset=UTF-8
…
{
"hostIp" : "10.93.140.35" ,
"hostMac" : "00:0c:29:6d:df:40" ,
"hostType" : "wired" ,
"connectedNetworkDeviceId" : "601c9ead-576c-402d-bcb1-224235b1e020" ,
"connectedNetworkDeviceIpAddress" : "10.93.140.50" ,
"connectedInterfaceId" : "eb613db0-0994-44ec-9146-1b65346f3d07" ,
"connectedInterfaceName" : "GigabitEthernet1/0/13" ,
"connectedNetworkDeviceName" : "NYC-9300" ,
"vlanId" : "123" ,
"lastUpdated" : "1528324633014" ,
"accessVLANId" : "123" ,
"id" : "841f9433-0d2c-4735-afe8-beb7547b7883"
}
Documentation is always essential, but in this case even more, because REST APIs are an architectural style, not a standard. So docs will define specifically what you need to send to your network device, and what you should expect in return.
The quality of the API documentation is the most important factor in API adoption, because it determines how difficult it is to work with your APIs. You might have the most powerful APIs, but if they are not documented correctly nobody will be able to leverage them.
APIs are very often documented in the platform itself, offering you the option to test them directly there without needing to write any code, or even know a programming language.
It is also common for them to offer you the option to automatically generate sample code in different programming languages, so you can directly use it in your developments.
When talking about programmability and APIs you need to pick your favorite programming language to let your system know what you want it to do, and how it needs to communicate with your network devices' APIs. The goal will be to automate and script actions using the APIs provided by network devices, controllers, and applications. There are a myriad of different options when choosing your programming language (Python, Ruby, Go, JavaScript, C#, etc) and each developer will have their own preferences.
One very good option for network engineers to get started with programming is Python. It is one of the most popular programming languages across the globe for several reasons:
- Lots of available resources
- Extensive libraries
- Most SDKs developed in Python
- Powerful and fast
- Ubiquitous
- Easy to learn and friendly
- Open
- Wide support on different devices and platforms
- Rich and active support communities
- Most wanted language in 2017 & 2018
APIs and programming languages have evolved and matured to the point of being useful and applicable to the domains of infrastructure engineers.
The net effect is that you can get powerful things done with relatively small amounts of code. By doing so, you can automate the repetitious and/or labor-intensive parts of your job, freeing you up to focus your time and effort on tasks deserving of your intellect.
Network programmability provides consistent and dynamic infrastructure configuration by automating deployments and simplifying network management, bringing the following main benefits:
- Automation
- Time and cost optimization
- Reduce errors
- Integration
- Innovation
DevOps principles are not exclusive to software development, and some of them can definitely be applied to infrastructure configuration. NetDevOps brings the culture, technical methods, strategies and best practices of DevOps to network management.
Sometimes it is referred to by different names, like DevNetOps, NetOps, or SuperNetOps. But in general it is related to the more generic term Network Reliability Engineer (also coming from the DevOps counterpart Site Reliability Engineering).
Networks exist to provide connectivity for end-systems and applications, so obviously they have a critical role in any type of service. Everything needs connectivity, so the network is certainly a fundamental asset in any modern enterprise these days. Its functionality has become so critical that most businesses nowadays would not be able to survive without connectivity.
However there is a very common perception that the network is actually fragile.
Key network engineers that have been working long enough on a certain network become gurus. They are the ones that know the why and how of multiple specific configurations: why that had to be done last year on those core routers, how many neighbors should be seen by a certain edge router, or what that propagated BGP community means. Every box has a unique configuration to accommodate whatever was required at a specific point in time: troubleshooting or debugging a certain issue, that small fix in the routing protocol weight to determine the right interface to use, or those interfaces that are down and nobody knows if they should actually be up or not. Sequential and manual provisioning leads to a situation where each network device becomes a snowflake, because its configuration has changed organically, according to whatever was required at each moment since it was installed.
Without these key engineers there is a fear that network changes will go wrong. So operations teams tend to minimize the number and frequency of changes in their networks. Nobody wants to affect that precious business traffic and be pointed at by the CTO as the person responsible for that big failure. So changes rarely happen. And when they happen they are BIG, because there is a backlog of things to do. The bigger the change, the more possibilities that something will fail. Besides this, teams are not well practiced because changes do not happen often. Fixing an issue while operating a network live, or performing a rollback quickly, requires practice. So now any problem that happens during the maintenance window will lead to the perception that the network configuration change was a failure.
Furthermore, applying network-wide policies becomes a task proportionally tedious to how big the network is. For example, consider a possible Infosec recommendation to change SNMP strings every 3 months. Doing it manually in a big network might require a number of engineers performing those changes simultaneously across the network, maybe during a maintenance window by night to make sure systems can be synchronized next morning. This manual process involves quite some manual interaction, which is definitely prone to errors.
These considerations are very similar to the ones faced in classic software development. With their monolithic architectures and bi-annual software updates, developers suffered from similar challenges. And then they started doing things differently, with approaches like Agile, DevOps, CICD pipelines and automated unit testing.
Applying the same principles to network configuration is what we call NetDevOps, and it will provide benefits similar to the ones software developers obtained when implementing these practices in their own environment. But it will require big cultural changes, like:
- Embrace failure and learn from it for the future
- Understand that change is good
- Collaborate actively between network developers and operations teams
- Empower teams to take ownership and responsibility
- Provide feedback systems that are actually useful to iterate and improve processes
- Automate end to end, across the whole lifecycle of changes
What if network engineers started working with network configurations the same way software developers work with their code?
What if we could create automated pipelines for those network configurations, that worked like CICD does for software development?
What if the network could be continuously monitored for health and improvement?
Now that would be a game changer. Not only in the way we manage our networks, but also in how we scale up, how we automate repetitive tasks, how different teams collaborate, and how we improve the reliability of our networks.
Let's explore it.
With the advent of Cloud computing we have now the capabilities to provision and manage ephemeral data centre resources (compute and connectivity) via machine-readable definition files. These files can be treated as common code, utilizing the same version control systems and best practices we use for software development, with goals like providing automation, improving efficiency and reducing errors. This is called Infrastructure as Code, or IaC.
We could follow the same approach with network device configurations, and this is what we call Network as Code. It is based on the idea of storing all network configurations in a Version Control System (VCS) that manages and tracks changes in the network. This system storing all configurations for the whole network would be considered the Single Source of Truth for all-things network configuration.
In this new mode of operation, network configuration changes are proposed in code branches, like software code developers do. These branches are safe places where network developers will be able to work safely on their proposed configurations, without affecting the master branch, where master configurations reside. Once these configurations are ready, developers will request their branch to be merged with the master configurations, and will go through an approval process to verify there are no issues when incorporating these changes.
Continuing with the emulation of DevOps automation capabilities, this will lead to using CICD (Continuous Integration and Delivery) Build Servers to automatically deploy and test the proposed configurations in testing, staging and production environments. Configurations that successfully pass the complete test set will be deployed into the production environment. In case of failure during that final deployment, the system itself will automatically roll back the proposed changes, leaving the production network in the state it was in just before the change.
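The deploy, test and roll back behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the build server in this demo is implemented: configurations are plain dictionaries, and the function names and the passing criterion are hypothetical.

```python
import copy
from typing import Callable, Dict

Config = Dict[str, dict]


def deploy_with_rollback(
    running: Config,
    proposed: Config,
    tests_pass: Callable[[Config], bool],
) -> Config:
    """Return the configuration that should be live after the change attempt."""
    snapshot = copy.deepcopy(running)    # state just before the change
    candidate = copy.deepcopy(proposed)  # what the pipeline would deploy
    if tests_pass(candidate):
        return candidate                 # change accepted
    return snapshot                      # tests failed: restore previous state


def router_id_is_set(config: Config) -> bool:
    # Hypothetical passing criterion: every device must define a router-id.
    return all("router-id" in device for device in config.values())


running = {"core1": {"router-id": "192.168.1.1"}}
good = {"core1": {"router-id": "192.168.1.101"}}
bad = {"core1": {}}

print(deploy_with_rollback(running, good, router_id_is_set))
print(deploy_with_rollback(running, bad, router_id_is_set))
```

The first call accepts the change, while the second one returns the snapshot taken before the change, which is the behaviour we expect from an automatic rollback.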
And considering that modern network devices support modern interfaces and APIs, let's leverage those to deploy our configurations across the network in an optimal way, instead of using the classic, slow and error-prone command-line interface.
Following this strategy, we are now ready to start building a completely automated environment to deploy and test configuration changes across the network.
Now that you know about some of the most important building blocks for programmability, it is time to see them working together and how they are used to build business-relevant solutions that help managing our networks. And what better way to learn about them than getting our hands dirty by going through some demos?
The following set of demos requires a sandbox: an environment where you have all the required platforms and elements that you will need for those demos. In our case we need a big server to run VIRL simulations for all network devices we will discuss later, and another server to run our VCS, NSO, etc.
You may find the required sandbox for our demo using this link, and book it for up to one week exclusively for you.
Note: when doing the reservation please choose 'None' for simulation, as we will be launching the required topologies as part of the setup process.
Spinning up the whole system will take roughly 15 mins, so please look at this strangely satisfying pendulum while we get everything ready for you.
Once the setup is ready you will receive an email with all required information to VPN into your sandbox. If you do not have a VPN client you may download AnyConnect here. Connect to your VPN and you are now ready to start working on your demos!
NetDevOps will deliver consistent version-controlled infrastructure configurations, deployed with parallel and automated provisioning.
And what better way of understanding the real benefits of NetDevOps than building your own setup and seeing how it works? The goal will be to create a complete environment that demonstrates the following benefits across the whole network:
- Track the status of network configurations at any point in time
- Track who proposed and approved each specific configuration change
- Show the differences between the configurations at any point in time and a previous state
- Enable rollback to any previous moment
- Provide syntax-checking capabilities for network changes in your own local workstation
- Automate the deployment of any proposed change across different environments (eg. testing, staging, production)
- Model simulated virtual environments to test proposed changes before going to production
- Define and run the required tests set and passing criteria, both in testing and production, before accepting a change as successful
- Automatically rollback any proposed configuration that does not pass the tests set
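As an illustration of the visibility on configuration differences listed above, here is a minimal sketch using Python's standard difflib module. In the demo this visibility comes from git and GitLab; the two configuration snippets and the config_diff helper below are made up for the example.

```python
import difflib

# Two versions of a (simplified, hypothetical) device configuration file.
previous = """ospf:
- id: 1
  router-id: 192.168.1.1
"""
proposed = """ospf:
- id: 1
  router-id: 192.168.1.101
"""


def config_diff(old: str, new: str, name: str) -> str:
    """Return a unified diff between two versions of a configuration."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{name} (previous)",
        tofile=f"{name} (proposed)",
    )
    return "".join(lines)


diff_text = config_diff(previous, proposed, "core1.yaml")
print(diff_text)
```

Lines starting with a minus sign were removed and lines starting with a plus sign were added, which is the same convention git diff uses.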
These are the building blocks we will use to provide such a comprehensive demonstration:
- GitLab: Version Control Server (VCS) with integration capabilities to provide automated pipelines
- Cisco Network Services Orchestrator: formerly Tail-f, it provides end-to-end automation to design and deliver services much faster
- pyATS: automation tool to perform stateful validation of network devices operational status with reusable test cases
- VIRL: network modelling and simulation environment
- Ansible: simple automation
Open a terminal window (ie. putty on Windows or terminal on OSX) and ssh to your devbox with the following credentials: developer/C1sco12345
$ ssh developer@10.10.20.50
Once in, clone the repository that includes all required files to build the setup into your devbox.
[developer@devbox ~]$git clone https://github.com/DevNetSandbox/sbx_multi_ios.git
With that, your sandbox devbox includes now all required info to start building the environment.
[developer@devbox ~]$cd sbx_multi_ios/gitlab
[developer@devbox gitlab]$./setup.sh
setup.sh will start and configure your Version Control Server, a GitLab instance inside a Docker container running in your devbox.
The process will take about 5 minutes.
Once your terminal shows the process is finished, you may check with docker ps that your GitLab containers are running, and that they are offering their service on port 80.
[developer@devbox gitlab]$docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5cd18a397811 gitlab/gitlab-ce "/assets/wrapper" 2 days ago Up 2 days (healthy) 0.0.0.0:80->80/tcp, 0.0.0.0:4567->4567/tcp, 0.0.0.0:32769->22/tcp, 0.0.0.0:32768->443/tcp gitlab_gitlab_1
182c5937b931 gitlab/gitlab-runner "/usr/bin/dumb-init …" 2 days ago Up 2 days
Please point your browser to http://10.10.20.50, the IP address of your devbox (default port 80), and check that you can access the HTTP interface for your new GitLab service.
Now that GitLab is ready, go back to your terminal and let's run the script to set up the complete CICD environment.
[developer@devbox gitlab]$cd ../cicd-3tier
[developer@devbox cicd-3tier]$./setup.sh
In this case setup.sh will perform the following actions:
- Launch the required VIRL simulations for two different environments: test and production
- Start NSO
- Import test and production network configurations from VIRL to NSO
- Synchronize devices configuration from NSO into VIRL simulations
- Create a new repo in GitLab and initialize it locally in your devbox
- Create locally in devbox the prod and test git branches and push them to GitLab
- List the status of VIRL nodes in production and test
This complete process will take about 10 minutes.
Congrats, everything is now installed and ready!
Now you have two complete simulated environments running in your VIRL server: one for testing, and one replicating what would be a production physical network. Real world scenarios might be diverse: some customers may have a physical network in production, but only a simulated one for testing. Others might also have a real network for testing. Maybe even an additional one for staging before going to production. No matter how, the same principles apply to what we will be demonstrating. In our case the sandbox includes a couple of virtual environments, like the one depicted below, and implemented with VIRL for convenience.
As you can see each environment includes a standard 3-tier architecture, with 2x IOS-XE routers in the Core, 2x NX-OS switches in Distribution, and another 2x NX-OS switches in the Access layer.
You may find VIRL definitions for these two environments at the following locations in your devbox:
/home/developer/sbx_multi_ios/cicd-3tier/virl/test/topology.virl
/home/developer/sbx_multi_ios/cicd-3tier/virl/prod/topology.virl
Please make sure all your simulated routers are readily available (REACHABLE status) in both prod and test. If they are not, your demonstration will fail in different stages.
[developer@devbox test]$pwd
/home/developer/sbx_multi_ios/cicd-3tier/virl/test
[developer@devbox test]$virl nodes
Here is a list of all the running nodes
╒══════════════╤═════════════╤═════════╤═════════════╤════════════╤══════════════════════╤════════════════════╕
│ Node │ Type │ State │ Reachable │ Protocol │ Management Address │ External Address │
╞══════════════╪═════════════╪═════════╪═════════════╪════════════╪══════════════════════╪════════════════════╡
│ test-dist1 │ NX-OSv 9000 │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.213 │ N/A │
├──────────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ test-access1 │ NX-OSv 9000 │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.215 │ N/A │
├──────────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ test-dist2 │ NX-OSv 9000 │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.214 │ N/A │
├──────────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ test-core2 │ CSR1000v │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.212 │ N/A │
├──────────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ test-core1 │ CSR1000v │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.211 │ N/A │
╘══════════════╧═════════════╧═════════╧═════════════╧════════════╧══════════════════════╧════════════════════╛
[developer@devbox test]$cd ../prod
[developer@devbox prod]$virl nodes
Here is a list of all the running nodes
╒═════════╤═════════════╤═════════╤═════════════╤════════════╤══════════════════════╤════════════════════╕
│ Node │ Type │ State │ Reachable │ Protocol │ Management Address │ External Address │
╞═════════╪═════════════╪═════════╪═════════════╪════════════╪══════════════════════╪════════════════════╡
│ core2 │ CSR1000v │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.222 │ N/A │
├─────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ core1 │ CSR1000v │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.221 │ N/A │
├─────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ access1 │ NX-OSv 9000 │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.225 │ N/A │
├─────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ dist2 │ NX-OSv 9000 │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.224 │ N/A │
├─────────┼─────────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ dist1 │ NX-OSv 9000 │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.223 │ N/A │
╘═════════╧═════════════╧═════════╧═════════════╧════════════╧══════════════════════╧════════════════════╛
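Checking reachability by eye works for a handful of nodes, but it is also easy to script. The following sketch parses a table in the format printed by virl nodes and returns the nodes that are not REACHABLE. The sample text is made up (dist1 is shown as UNREACHABLE on purpose) and the function names are hypothetical.

```python
from typing import Dict, List

# Made-up sample in the same layout as the tables above (borders omitted).
SAMPLE = """\
Here is a list of all the running nodes
│ Node    │ Type        │ State   │ Reachable   │ Protocol   │
│ core2   │ CSR1000v    │ ACTIVE  │ REACHABLE   │ telnet     │
│ dist1   │ NX-OSv 9000 │ ACTIVE  │ UNREACHABLE │ telnet     │
"""


def parse_nodes(table: str) -> List[Dict[str, str]]:
    """Turn the data rows of the table into one dictionary per node."""
    rows = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("│"):
            continue  # skip border lines and free text
        rows.append([cell.strip() for cell in line.strip("│").split("│")])
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, cells)) for cells in data]


def unreachable(nodes: List[Dict[str, str]]) -> List[str]:
    """Return the names of the nodes that are not in REACHABLE status."""
    return [node["Node"] for node in nodes if node["Reachable"] != "REACHABLE"]


print(unreachable(parse_nodes(SAMPLE)))
```

In the devbox you could feed this function with the real command output, for example captured with subprocess.run(["virl", "nodes"], capture_output=True, text=True).stdout, assuming the command is run from the test or prod directory.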
If any of the nodes stay in UNREACHABLE status please try the following:
- Go into the environment directory (prod or test) and restart the node.
[developer@devbox cicd-3tier]$cd virl/test
[developer@devbox test]$virl stop test-dist2
[developer@devbox test]$virl start test-dist2
- Connect into that specific node (with virl ssh or virl console) and reboot it (password is cisco).
[developer@devbox test]$virl ssh core1
Attemping ssh connectionto core1 at 172.16.30.221
Warning: Permanently added '172.16.30.221' (RSA) to the list of known hosts.
[email protected]'s password:
core1#reload
- If it still refuses to cooperate, stop the whole environment...
[developer@devbox test]$cd /home/developer/sbx_multi_ios/cicd-3tier
[developer@devbox cicd-3tier]$./cleanup.sh
... and then restart it.
[developer@devbox cicd-3tier]$./setup.sh
Now that both of your VIRL environments are ready, let's set up your local environment.
To experience and demonstrate the full NetDevOps configuration pipeline, you may want to set up a local development environment where you can test proposed configuration changes before committing and pushing them to GitLab for the full test builds to occur. This is a completely optional step you might want to skip if you are not interested in testing locally.
To complete this step you will need to have a few pre-requisites set up on your local workstation.
1. Common software: install Java JDK, python and sed (brew install gnu-sed in OSX)
2. Network Service Orchestrator: in order to test the configuration pipeline locally, you'll need to have a local install of NSO on your workstation. Furthermore, you will need to have the same versions of NSO and NEDs (network element drivers) installed as the DevBox within the Sandbox. Using different versions may work, but for best experience matching the versions exactly is recommended.
- Network Service Orchestrator 4.5.3
- Cisco IOS NED 5.8
- Cisco IOS XR NED 6.2.10
- Cisco NX-OS NED 4.5.10
Once downloaded, you would install NSO in OSX like this:
$ sh nso-4.5.3.darwin.x86_64.signed.bin
$ sh nso-4.5.3.darwin.x86_64.installer.bin ~/ncs-4.5.3 --local-install
You may download the required NEDs from your sandbox devbox via SCP to your own workstation.
$ scp developer@10.10.20.50:/usr/src/nso/ncs-4.5.3-cisco-ios-5.8.signed.bin .
$ scp developer@10.10.20.50:/usr/src/nso/ncs-4.5-cisco-nx-4.5.10.signed.bin .
$ scp developer@10.10.20.50:/usr/src/nso/ncs-4.5-cisco-iosxr-6.2.10.signed.bin .
Install those NEDs, by running the following two commands for each downloaded binary...
$ sh <bin_file>
$ tar -xzvf <gz_file>
... and then move each uncompressed folder into ~/ncs-4.5.3/packages/neds, replacing the existing ones.
Check all required NEDs are installed.
$ ls $NCS_DIR/packages/neds/
Once you have installed these versions, you'll need to source the ncsrc file for this version before beginning the local development process.
$ source ~/ncs-4.5.3/ncsrc
Don't forget to include this command in your startup shell (eg .zshrc)
Now you can test your local NSO installation.
First, set up the required structure and environment in your preferred directory.
$ ncs-setup --dest ~/ncs-run
Then start the NCS daemon.
$ cd ~/ncs-run
$ ncs
Check if NCS started correctly.
$ ncs --status
Start the CLI to connect to NCS...
$ ncs_cli -u admin
... or connect via SSH (default password is admin).
$ ssh -l admin -p 2024 localhost
Point your browser to http://localhost:8080/ (credentials are admin/admin).
If everything works correctly you may now stop the NCS daemon.
$ ncs --stop
Congrats, your NSO local installation is complete!
3. Python + Ansible
The network-as-code mechanism in this demonstration leverages both Ansible and NSO, with Ansible orchestrating the execution and configuration used by NSO to deploy to the network. In order to test locally, you'll need to have a Python environment (virtual environment is recommended) that meets these requirements.
- Python 3.6.5 or higher
- Ansible 2.6.3 or higher
Once you install them, and with your virtual environment active, install the requirements.
$ python3 -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
All pre-requisites are now complete!
Let's now dig into setting up the local environment in your workstation.
- Clone a copy of the repository from GitLab to your local workstation. Use this command to ensure the demo credentials are embedded in the git configuration. Please note this first repo clone might take some time, so you will need to be patient.
$ git clone http://developer:C1sco12345@10.10.20.50/developer/cicd-3tier
$ cd cicd-3tier
- To simplify the setup and management of the local environment, a Makefile is included in the repository. Simply run make dev to do the following (to see the exact commands being executed for each of these steps, just take a look at the content of your Makefile):
a. Use NCS netsim to start a local simulation of the network including the core, distribution, and access devices
b. Set up a local NCS project directory within the repo, start NCS and import in the netsim simulation
c. Synchronize netsim and NCS
d. Deploy the current network-as-code configuration to NCS and the network devices, using Ansible
$ make dev
Let's examine what is happening here, by going through the content of the Makefile.
$ cat Makefile
You will see the first line defines the different steps that are part of the dev directive.
dev: netsim nso sync-from dev-deploy
These steps are defined later in the same Makefile. You may also run them independently if you want to execute only that specific step (eg. make netsim).
a. Start netsim
netsim:
	-ncs-netsim --dir netsim create-device cisco-ios core1
	-ncs-netsim --dir netsim add-device cisco-ios core2
	-ncs-netsim --dir netsim add-device cisco-nx dist1
	-ncs-netsim --dir netsim add-device cisco-nx dist2
	-ncs-netsim --dir netsim add-device cisco-nx access1
	-ncs-netsim start
These ncs-netsim commands create netsim devices in the netsim directory, using the specified NEDs (ie. cisco-ios or cisco-nx) and a certain name (ie. coreX, distX, accessX). Then the last step starts these devices locally in your workstation. Netsim devices are a quick and easy way to emulate the management plane and test configuration changes locally, with no risk involved in the test or production networks.
You may check your netsim devices started correctly and their ports configuration, with:
$ ncs-netsim is-alive
$ ncs-netsim list
You can also connect to your netsim devices CLI, and check with show run that nothing is configured yet. For example, to connect to core1:
$ ncs-netsim cli-c core1
b. Start NSO
nso:
	-ncs-setup --dest . --package cisco-ios --package cisco-nx
	-ncs
This nso directive prepares the current directory (--dest .) for a local NCS project, with the NEDs it will use (ie. cisco-ios and cisco-nx), and then it starts NCS.
It is important to note that NCS will automatically detect and add existing local netsim devices.
You may login into NSO CLI and check the discovered devices (your netsim devices in this case) with:
$ ncs_cli -C -u admin
admin connected from 127.0.0.1 using console on JGOMEZ2-M-D2KW
admin@ncs# show devices brief
NAME     ADDRESS    DESCRIPTION  NED ID
------------------------------------------
access1  127.0.0.1  -            cisco-nx
core1    127.0.0.1  -            cisco-ios
core2    127.0.0.1  -            cisco-ios
dist1    127.0.0.1  -            cisco-nx
dist2    127.0.0.1  -            cisco-nx
admin@ncs#
You may also see the devices configuration stored in NSO (not configured yet). For example, for core1:
admin@ncs# show running-config devices device core1
c. Synchronize netsim and NCS
sync-from:
	-curl -X POST -u admin:admin http://localhost:8080/api/running/devices/_operations/sync-from
This step will synchronize initial configurations from netsim devices into NCS. Check the configuration of your devices in NCS again, and you will see they include interfaces definitions now (eg. Loopback, Eth, FE).
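For reference, the same REST call can be made from Python instead of curl. The sketch below uses only the standard library and reuses the URL and the admin/admin credentials from the Makefile; the function names are hypothetical, and sync_from() only works against a running local NCS, so the example just builds and prints the request.

```python
import base64
import urllib.request

# URL taken from the sync-from step of the Makefile.
NSO_URL = "http://localhost:8080/api/running/devices/_operations/sync-from"


def build_sync_from_request(
    username: str = "admin", password: str = "admin"
) -> urllib.request.Request:
    """Build the POST request that curl sends with -X POST -u user:password."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        NSO_URL,
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def sync_from(timeout: float = 30.0) -> int:
    """Send the request to a running NCS and return the HTTP status code."""
    with urllib.request.urlopen(build_sync_from_request(), timeout=timeout) as resp:
        return resp.status


request = build_sync_from_request()
print(request.get_method(), request.full_url)
```

Keeping the request construction separate from the network call makes it possible to inspect what would be sent without needing NCS to be up.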
d. Apply configurations
dev-deploy:
	-ansible-playbook --syntax-check -i inventory/dev.yaml site.yaml
	-ansible-playbook -i inventory/dev.yaml site.yaml
This last directive uses Ansible to first check the syntax (linting), and then executes the site.yaml playbook on the list of devices defined in the dev.yaml inventory file.
The inventory file (dev.yaml) lists the devices that will be configured by the playbook, with their hostnames, credentials (if necessary) and management IP addresses:
- NSO
- One access switch
- Two core routers
- Two distribution switches
If you review the playbook itself (site.yaml) you will find it executes the following steps:
- Synchronize old configurations from NSO to devices
- Push new configurations to NSO
- Synchronize new configurations from NSO to devices
But specifically for the second step you might be wondering where are those new configurations?
Take a look at this extract from site.yaml, describing that second step:
- name: Push new configurations to NSO
  hosts: all
  connection: local
  gather_facts: no
  tasks:
    - name: Device configuration
      nso_config:
        url: "{{ nso.url }}"
        username: "{{ nso.username }}"
        password: "{{ nso.password }}"
        data:
          tailf-ncs:devices:
            device:
            - name: "{{ nso_device_name }}"
              tailf-ncs:config: "{{ config }}"
That task description uses the nso_config module, and provides the required NCS URL, username and password, as parameters defined in the inventory file mentioned before.
The data section is the one that describes what configuration to apply, and there you may find you need to provide the device name (nso_device_name) and config. Device names come again from the inventory file. BUT configurations are stored in the host_vars directory, where Ansible looks for variables as required. That directory stores individual YAML files, one per device, with the required configuration to apply to NCS devices.
These configuration files in the host_vars directory will be important for us throughout the demo, as they store the configuration we want to apply, and therefore we will use them to apply changes to our network.
After dev-deploy is completed you will see configurations correctly applied (and synchronized) to your netsim devices and NCS ones. You may check it worked fine with the same commands described in previous steps. For example, for core1:
$ ncs-netsim cli-c core1
admin connected from 127.0.0.1 using console on JGOMEZ2-M-D2KW
core1# show running-config
And...
$ ncs_cli -C -u admin
admin connected from 127.0.0.1 using console on JGOMEZ2-M-D2KW
admin@ncs# show running-config devices device core1
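To make the role of the data section in the nso_config task more concrete, this sketch builds the same nested structure in Python for one device. It only illustrates the shape of the data: Ansible does the variable substitution for you, and the build_nso_payload helper and the simplified core1_config content are hypothetical.

```python
import json
from typing import Any, Dict


def build_nso_payload(nso_device_name: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror the data section of the task: one device entry with its config."""
    return {
        "tailf-ncs:devices": {
            "device": [
                {"name": nso_device_name, "tailf-ncs:config": config}
            ]
        }
    }


# Hypothetical, heavily simplified stand-in for the config variable that
# Ansible would load from the host_vars file of core1.
core1_config = {"ospf": [{"id": 1, "router-id": "192.168.1.1"}]}

payload = build_nso_payload("core1", core1_config)
print(json.dumps(payload, indent=2))
```

The device name plays the role of nso_device_name from the inventory, and the dictionary plays the role of the per-device config stored in host_vars.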
(Note: after you complete the rest of this demo, when you don't need the local environment anymore, you can easily delete everything by running make clean. It will shut down netsim devices and NSO, and delete any related remnants.)
Our demonstration will include the following architecture and elements, to show how a completely automated CICD pipeline could be applied to a network configuration environment across a complete network, including test and production environments.
The flow will be as follows: our network operator will interact with GitLab to perform any configuration changes. Ansible and NSO will deploy those changes into a virtual test environment (with VIRL), and run automated tests (with pyATS) to verify the expected results after the change. If everything goes well, then our VCS will run the same process in the production environment to implement those changes in the real network.
Integrating that environment with the local setup we built in the previous section, results in a comprehensive architecture where the local environment uses the same tools (NSO & Ansible) as the remote one. Locally it will only do the syntax checking, and once configurations are pushed to the remote GitLab, the same set of tools will also deploy and test the proposed changes, first into a test environment and then into production.
Your GitLab Version Control Server (VCS) is ready. Please find the new infrastructure-as-code repository by pointing your browser to http://10.10.20.50/developer/cicd-3tier, and login with developer/C1sco12345. Leave that window open, as we will use it to run the demo.
The repository (or repo) stores all required files and configurations to work with during the demo. Some key elements are the following ones:
- .gitlab-ci.yml is the pipeline definition, including all different steps to follow in the automation process
- virl is a folder used by VIRL to define the emulated architectures (test and prod)
- tests is a folder used by pyATS for automated testing
- group_vars, host_vars and inventory are folders used by Ansible to automate configurations deployment
If you did not follow the optional local setup process, please clone a copy of the repository from GitLab to your local workstation (if you already did it in the previous section, please skip this step). Use this command to ensure the demo credentials are embedded in the git configuration.
$ git clone http://developer:C1sco12345@10.10.20.50/developer/cicd-3tier
$ cd cicd-3tier
You will need to edit some of the files in this local repo, so please choose your favorite editor / IDE (integrated development environment). One possible option is Visual Studio Code, but you could also just defer to using something simpler like vi or any other text editor.
First of all, please take a look at the .gitlab-ci.yml pipeline file definition.
$ cat .gitlab-ci.yml
You will see our pipeline includes the following steps:
- Use Ansible to validate configurations that need to be applied to NSO and network devices are syntactically correct (linting), for the three environments: dev (local), test and production.
- Deploy those configurations to the test environment.
- Run automated testing in the test environment to make sure the resulting network state is the expected one.
- Deploy those configurations to the production environment. In this case you will see it specifies when: manual, meaning we would like to explicitly initiate the deployment process to production. allow_failure: false means that a failure in this job is not tolerated: in case of failure when deploying in production the system should automatically roll back to the previous state.
- Run automated testing in the production environment to make sure the resulting network state is the expected one.
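The order of these steps, and the effect of a manual step, can be modelled with a short Python sketch. It is not how GitLab runs pipelines internally: it only shows that stages run in sequence, that execution stops at the first failure, and that a manual stage waits for explicit approval. The three test stage names appear in the pipeline of this demo, while the two production stage names are assumed for the example.

```python
from typing import Callable, FrozenSet, List, NamedTuple


class Stage(NamedTuple):
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds
    manual: bool = False     # True when an operator must start it


def run_pipeline(
    stages: List[Stage], approved: FrozenSet[str] = frozenset()
) -> List[str]:
    """Run stages in order and return the names of the ones that executed."""
    executed = []
    for stage in stages:
        if stage.manual and stage.name not in approved:
            break  # wait here until an operator starts the stage
        executed.append(stage.name)
        if not stage.run():
            break  # a failed stage stops the pipeline
    return executed


stages = [
    Stage("validate", lambda: True),
    Stage("deploy_to_test", lambda: True),
    Stage("verify_deploy_to_test", lambda: True),
    Stage("deploy_to_prod", lambda: True, manual=True),  # assumed name
    Stage("verify_deploy_to_prod", lambda: True),        # assumed name
]

print(run_pipeline(stages))
print(run_pipeline(stages, approved=frozenset({"deploy_to_prod"})))
```

Without approval only the three test stages run; once the production deployment is approved, the remaining stages execute as well.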
Important note: for our demonstration we will use two simulated environments: test and production. It is more convenient for us to use a simulated environment for production, but in a real-world scenario the production environment would be built by real equipment from the production network.
Let's take a look at our network configurations.
$ cd host_vars
$ ls
access1.yaml core1.yaml core2.yaml dist1.yaml dist2.yaml
As you can see there is one YAML file per device in our network. Those files will be the ones you need to modify to perform changes in your network.
In a real-world scenario each network developer would have cloned this repository in their local machine, and work in their own local copy, via a specific branch. For our demo we will be one of those network developers, and propose changes from our local git repo.
For example, let's say we would like to change the OSPF router-id of our core1 router, from .1 to .101. We would have to edit core1.yaml, look for the relevant configuration line...
ospf:
- id: 1
  network:
  - area: 0
    ip: 172.16.0.0
    mask: 0.0.0.3
  - area: 0
    ip: 172.16.0.4
    mask: 0.0.0.3
  - area: 0
    ip: 172.16.0.16
    mask: 0.0.0.3
  - area: 0
    ip: 192.168.1.1
    mask: 0.0.0.0
  router-id: 192.168.1.1
... and change that last line to the desired value.
router-id: 192.168.1.101
Save the file.
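Before committing, an edit like this one can also be sanity-checked locally. The following is only a minimal sketch, not part of the demo repository: it assumes the ospf section of core1.yaml has already been loaded into Python (for instance with PyYAML's yaml.safe_load) with the nesting shown in the code comment, and check_ospf is a hypothetical helper name. It simply confirms that the router-id and every network statement hold well-formed IPv4 values.

```python
import ipaddress

# Assumed in-memory form of the "ospf" section of core1.yaml after the edit,
# abridged to two of the four network statements.
ospf = [
    {
        "id": 1,
        "network": [
            {"area": 0, "ip": "172.16.0.0", "mask": "0.0.0.3"},
            {"area": 0, "ip": "192.168.1.1", "mask": "0.0.0.0"},
        ],
        "router-id": "192.168.1.101",
    }
]


def check_ospf(processes):
    """Return a list of problems found; an empty list means the data looks sane."""
    problems = []
    for proc in processes:
        # Collect every value that must be a dotted-quad IPv4 string.
        fields = [("router-id", proc.get("router-id"))]
        for net in proc.get("network", []):
            fields.append(("ip", net.get("ip")))
            fields.append(("mask", net.get("mask")))
        for name, value in fields:
            try:
                ipaddress.IPv4Address(str(value))
            except ValueError:
                problems.append(f"ospf {proc.get('id')}: invalid {name} {value!r}")
    return problems


print(check_ospf(ospf) or "OSPF section looks valid")
```

A typo such as 192.168.1.1010 would be reported here, before the change ever reaches the pipeline.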
Right now you have only modified a local text file in your workstation. And git knows about it.
$ git status
On branch master
Your branch is up to date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: core1.yaml
no changes added to commit (use "git add" and/or "git commit -a")
If we are happy with this change, we need to add the modified file to our next git commit.
$ git add core1.yaml
$ git commit -am "Update OSPF router-id from .1 to .101"
[master 0b24c9b] Update OSPF router-id from .1 to .101
1 file changed, 1 insertion(+), 1 deletion(-)
Now is the time to send our configuration change to the remote repo in the VCS GitLab server.
$ git push
warning: redirecting to http://10.10.20.50/developer/cicd-3tier.git/
Counting objects: 4, done.
Delta compression using up to 12 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 389 bytes | 389.00 KiB/s, done.
Total 4 (delta 3), reused 0 (delta 0)
To http://10.10.20.50/developer/cicd-3tier
00f2b65..1b70da9 test -> test
Go back to the browser window that pointed to your GitLab repo at http://10.10.20.50/developer/cicd-3tier, and you will see the update there.
Pushing your proposed change in the new core1.yaml file will automatically start the pipeline defined in the .gitlab-ci.yml file. It is applied on the test network, and you may check its execution in real-time by clicking on the CI / CD section in the left bar. In this example we would be running pipeline #3.
As you can see the pipeline includes 3 different stages: validate, deploy_to_test and verify_deploy_to_test.
These are coming from the pipeline definition in your .gitlab-ci.yml file.
Clicking on each one of these stages will show you the specific steps followed in there:
- validate
This stage starts a container running Ansible and does the syntax checking of proposed configurations, including changes, in all 3 environments: dev (local), test and prod.
- deploy_to_test
The second stage starts a container running Ansible to sync existing configs from NSO to devices, then apply configuration changes to NSO, and finally sync configs again from NSO to devices.
- verify_deploy_to_test
The final stage starts a container running a pyATS image and runs automated tests based on the content of tests/validation_tasks.robot and tests/test_testbed.yml. The defined set of tests includes not only reachability, but also the number of expected OSPF neighbors and interfaces in each network device after the applied changes.
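To give an idea of what this kind of verification boils down to, here is a minimal sketch in plain Python. It is not the content of the Robot or pyATS files used by the demo: the function name compare_state, the counter names and all the numbers are hypothetical, and in the real pipeline the observed values would be learned from the devices.

```python
def compare_state(expected, observed):
    """Return a list of mismatches between expected and observed counters."""
    failures = []
    for device, checks in expected.items():
        for name, want in checks.items():
            got = observed.get(device, {}).get(name)
            if got != want:
                failures.append(f"{device}: {name} expected {want}, got {got}")
    return failures


# Hypothetical numbers, for illustration only.
expected = {"core1": {"ospf_neighbors": 3}, "dist1": {"ospf_neighbors": 2}}
observed = {"core1": {"ospf_neighbors": 3}, "dist1": {"ospf_neighbors": 1}}

for line in compare_state(expected, observed):
    print("FAIL", line)
```

An empty list of failures would let the stage pass; any mismatch would make it fail and stop the pipeline.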
The whole process will take about 5 minutes, until you can see the 3 stages completed successfully in the test environment.
At that point your proposed network configuration change has been completely validated in a VIRL-simulated test environment, and you are now good to propose it to be applied in the real production network.
You can do that by requesting to merge the content of the git test branch into the git production branch.
If you go back to the pipeline section you will see there is a new pipeline there, #4. It includes the same steps as the previous one, but the main difference is that this pipeline is being applied to the production network.
As you can see the production pipeline has the same 3 stages as the one we applied previously in the test environment. However when running it, the pipeline appears as blocked before running the deploy_to_prod stage. The reason is we configured when: manual in the .gitlab-ci.yml pipeline definition file, so that we have to manually confirm we want to actually initiate the deployment in the production network. This is a configuration decision, and may be useful if you would like to perform the actual change during a maintenance window.
In order to move forward with the pipeline we need to confirm it manually by pressing the play button.
It will automatically start deploying our configuration changes into the production network. If everything goes well, it will successfully complete this stage and move to the next one, to test the results after changes are implemented.
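The blocking behaviour described above can be pictured with a toy model in Python. This is in no way how GitLab is implemented, just a sketch of the logic: run_pipeline is a hypothetical function, every stage action is a stub that always succeeds, and verify_deploy_to_prod is an assumed name for the last stage.

```python
def run_pipeline(stages, approved=False):
    """Run stages in order and return (status, names of the stages that ran).

    Each stage is a (name, action, manual) tuple; action() returns True on success.
    A manual stage is only entered once it has been approved.
    """
    ran = []
    for name, action, manual in stages:
        if manual and not approved:
            return "blocked", ran
        ran.append(name)
        if not action():
            return "failed", ran
    return "passed", ran


def ok():
    """Stub action standing in for the real job of a stage."""
    return True


stages = [
    ("validate", ok, False),
    ("deploy_to_prod", ok, True),          # gated, like "when: manual"
    ("verify_deploy_to_prod", ok, False),  # assumed stage name
]

print(run_pipeline(stages))                 # stops before the manual stage
print(run_pipeline(stages, approved=True))  # runs all three stages
```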
After 5 minutes, by the end of this process you should see the complete pipeline has been successfully executed, and your proposed changes have been tested and finally applied to the production network.
Please click here to see a recorded demo of this CICD pipeline working on a 3-tier network environment.
CONGRATULATIONS! You have completed your first NetDevOps demo on how to fully automate and test network configuration changes all the way to production!
In this NetDevOps demo you have seen a modern approach to version-controlled automated network configuration and testing. The scenario describes how multiple network operators would be able to propose configuration changes, in the same way developers do it for code: using git branches. A standard version control server provides multiple benefits, like automated pipelines, version control and tracking, rollback capabilities, etc. During the demo you have also experienced the benefits of being able to locally verify syntax for proposed changes before submitting them, and seen how a simulated environment helps verify that proposed changes are correct before applying them to the production network. Finally, the set of automated tests helps make sure proposed changes have not had unexpected results on critical business-relevant functionality. This way you have experienced end-to-end automation and testing in a scalable and error-free approach.
Managing connections from extranet environments usually involves a great amount of workload, especially around VPN configurations at central hub points. One way of implementing this type of environment is pre-configuring VPN endpoints at remote locations, and then completing the required configuration from the central head-end point as connectivity is required. This configuration will explicitly define the authorized end-points and type of traffic that can traverse the connection.
Once connectivity to a certain remote location is not required anymore, you will have to remove the associated relevant configuration from the central head-end, disabling that specific VPN and hence discontinuing connectivity.
As you might guess, scaling this type of environment would really benefit from automation. The more remote locations from different 3rd-party entities (ie. partners, vendors), the longer the process to configure VPNs, ACLs with type of traffic and authorized end-points, etc. Implementing these long VPN configurations via CLI is of course an error-prone process due to the required human interaction, so automation will also take care of this challenge and provide the required consistency along the network.
This demonstration will focus on how to automate the lifecycle of extranet VPN connections, from setting them up to checking everything is correct, providing related metrics, and tearing them down once they are not required anymore. It also includes a simple graphical user interface (GUI) that uses APIs to demonstrate how easy it could be to manage those VPN connections for users without the required permissions to connect via CLI to network devices, or even the knowledge to configure them.
Our demo setup will include 1 central hub location with a headend router that will concentrate VPN connections from 4 remote partner locations.
We will also have some switches acting as hosts exchanging traffic, and another router simulating internet, providing connectivity between the headend and partner locations.
All devices will be simulated using VIRL as per the diagram below.
These are the components we will use to build the demo:
- Cisco Network Services Orchestrator: formerly Tail-f, it provides end-to-end automation to design and deliver services much faster
- VIRL: network modelling and simulation environment
- Ansible: simple automation
The provided GUI portal to manage HEMP uses the following technologies:
- Python, Flask, and JavaScript for the primary web interface
- Telegraf, InfluxDB, and Grafana for visualizing operational metrics collected via SNMP
For ease of deployment and portability, all of the above components are run as a docker compose stack which can be executed directly on your sandbox devbox.
Once you are connected via VPN to your reserved sandbox, please open a terminal window (ie. putty on Windows or terminal on OSX) and ssh to your devbox with the following credentials: developer/C1sco12345
$ ssh [email protected]
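Once in, and before starting the setup phase, please edit the /opt/nso/etc/ncs/ncs.conf file (the last step of this demo restores it from /opt/nso/etc/ncs/ncs.conf.bak, so create that backup copy first if it does not exist yet), delete the following line, and save the file: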
<dir>/opt/nso/packages/neds/</dir>
Now you are ready to start the setup, so clone the repository that includes all required files to build the demo environment into your devbox.
[developer@devbox ~]$ git clone https://github.com/DevNetSandbox/sbx_multi_ios.git
With that, your sandbox devbox now includes all required info to start building the environment.
Go into the hemp directory and run the setup.sh script to set the complete environment up.
[developer@devbox ~]$ cd sbx_multi_ios/hemp
[developer@devbox hemp]$ ./setup.sh
setup.sh will perform the following steps in the sandbox devbox:
- Install required software tools and dependencies in a python virtual environment
- Launch VIRL simulations for the whole network, including 4 remote partner locations and 1 central hub headend
- Setup and start NSO
- Add all VIRL network devices into NSO
- Synchronize all existing configurations from network devices to NSO
- Display the status for VIRL network devices
- Start a HEMP management GUI, implemented with containers
- Use Ansible to pre-configure the headend and activate 2 out of the 4 remote locations VPNs
The process will take approximately 15 minutes, so check this out in the meantime.
Your demo architecture is now set up, and includes the following main components:
- 1 central headend router where partner extranet VPN connections from remote devices are terminated
- 4 remote partner routers (partner1, partner2, partner3, partner4) that represent the unmanaged side of extranet/partner VPN connections
Simulated devices connected to both the remote partner routers and the headend one are configured with IP SLA probes, to send interesting traffic through the VPNs and keep them active.
Every remote partner router (1 to 4) is completely configured to establish their respective VPNs. Having connectivity for each one of them will depend exclusively on having the proper configuration applied on the headend router side.
On the headend router we have already provided the required configuration to set up VPN connections towards partner1 and partner2 remote devices. The partners directory includes YAML files with all required parameters to configure the headend router and complete the VPN connections just for partner1 and partner2 remote locations (not 3 and 4).
This configuration has been provided using Ansible and associated NSO modules during step 8 of the setup phase. That step is the one that runs an Ansible playbook, described in the site.yaml file. If you go through its content, you will see that first it synchronizes the configuration from NSO to the remote devices for consistency (in case there might have been any changes configured directly on the devices they will be overwritten by this step). Then the playbook will load partner1 and partner2 YAML files into variables, and push those to NSO as new headend router configuration to activate those specific VPNs. Finally the playbook will instruct NSO to sync that new configuration from NSO to the headend router.
However, partner3 and partner4 VPNs are pre-configured only on the partner/remote side, and will need you to provide additional configuration on the headend to complete those VPNs setup.
Instead of configuring it manually, or via YAML files and Ansible, for this demo you will be able to define the required configuration in the headend via a GUI management portal. It will allow you to provide the required parameters, and the GUI will translate them into the required information to send towards NSO north-bound APIs.
This API-based automation solution will enable you to easily apply or remove the required configuration in the headend router, without having to connect to the device via CLI and type myriads of commands.
At this point you might be wondering why NSO is part of the architecture, or if you could use Ansible to directly configure your network devices. One of the multiple benefits that NSO provides is that, although in this demo we are only using IOS XE devices, it would be easy to support a mixed environment with other types of devices / CLIs (ie. IOS XR, ASA firewall, other vendors...) without doing any modifications in the management GUI. Please remember the GUI uses NSO north-bound APIs, so it does not depend on the type of underlying infrastructure devices. NSO plays a key role by performing that translation from API requests to the information and format those devices require and support.
You may access the HEMP GUI portal by pointing your browser to http://10.10.20.50:5001
Please click on Configure VPN connections and there you will see the ones already configured on the headend router: partner1 and partner2.
You may now click on one of them, for example partner1, and the GUI will display its configuration and metrics. The system will also allow you to perform some actions on that specific VPN:
- Check Sync: this will compare the configuration in NSO vs the one in the headend router
- Reactivate Re-Deploy: ask NSO to sync configuration again from NSO to the headend router
- Undeploy: remove configuration from the headend router, while conveniently keeping it in NSO in case you need to redeploy it later
Now let's go back to Configure VPN connections and click on Add VPN to start the "VPN Setup Wizard". This will allow you to provide the required information to establish the VPN connection from the headend router to partner3.
You may find below the required configuration that will be applied in the headend router, so that partner3 VPN connection is established:
partner3:
  - partner_name: partner3
    device:
      - headend
    sequence: 103
    peer_ip: 172.16.252.3
    isakmp_algo: 3des
    isakmp_group: 2
    pre_shared_key: cisco
    transform_encryption: esp-3des
    transform_auth: esp-md5-hmac
    acl_number: "101"
    acl_rule: "permit ip 192.168.0.0 0.0.0.255 192.168.3.0 0.0.0.255"
This is the sequence of steps you will need to follow in the GUI:
Once you are done with partner3 please repeat the process for the partner4 VPN connection, using the parameters below:
partner4:
  - partner_name: partner4
    device:
      - headend
    sequence: 104
    peer_ip: 172.16.252.4
    isakmp_algo: 3des
    isakmp_group: 2
    pre_shared_key: cisco
    transform_encryption: esp-3des
    transform_auth: esp-md5-hmac
    acl_number: "104"
    acl_rule: "permit ip 192.168.0.0 0.0.0.255 192.168.4.0 0.0.0.255"
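Since these parameters are typed by hand into the wizard, it can help to double-check them first. The sketch below is an illustration only and is not part of the HEMP portal: check_partner and wildcard_to_network are hypothetical helpers, and the code assumes the rule always has the "permit ip source wildcard destination wildcard" shape used above. It validates the peer address and turns the IOS-style wildcard masks of the ACL rule into regular prefixes, so you can see at a glance which networks will be allowed through the VPN.

```python
import ipaddress


def wildcard_to_network(address, wildcard):
    """Convert an IOS-style address/wildcard pair into an IPv4Network."""
    # A wildcard mask is the bitwise inverse of the netmask.
    inverted = int(ipaddress.IPv4Address(wildcard)) ^ 0xFFFFFFFF
    netmask = ipaddress.IPv4Address(inverted)
    return ipaddress.ip_network(f"{address}/{netmask}")


def check_partner(params):
    """Validate one partner entry; return (peer, source network, destination network)."""
    peer = ipaddress.IPv4Address(params["peer_ip"])
    action, protocol, src, src_wc, dst, dst_wc = params["acl_rule"].split()
    if (action, protocol) != ("permit", "ip"):
        raise ValueError(f"unexpected ACL rule: {params['acl_rule']}")
    return peer, wildcard_to_network(src, src_wc), wildcard_to_network(dst, dst_wc)


# Values taken from the partner3 parameters listed in this demo.
partner3 = {
    "partner_name": "partner3",
    "peer_ip": "172.16.252.3",
    "acl_rule": "permit ip 192.168.0.0 0.0.0.255 192.168.3.0 0.0.0.255",
}

peer, source, destination = check_partner(partner3)
print(peer, source, destination)
```

Any malformed address or non-contiguous wildcard raises a ValueError, which is cheaper to catch here than after the configuration has been pushed to the headend.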
By the end of the process you should have something like this in the Configure VPN connections section:
You may now click on Monitor VPN Connections and the GUI will load a Grafana dashboard. Please login there with admin/admin, and change the password. If it does not work correctly (error message Dashboard not found) you may still access the Grafana dashboard by pointing your browser directly to http://10.10.20.50:3000
Selecting the Tunnel Detail dashboard will show you information about each specific tunnel, just by choosing the peer IP address:
The Extranet Monitoring dashboard will show you all information about how the headend router is doing:
As the final step please restore the NCS configuration file we modified at the beginning of this demo, so that you can use the sandbox reservation later for other demos.
[developer@devbox hemp]$ cp /opt/nso/etc/ncs/ncs.conf.bak /opt/nso/etc/ncs/ncs.conf
CONGRATULATIONS! You have now completed your second NetDevOps demo on how to leverage APIs to automate Extranet VPNs management!
This automation demo shows how you can leverage APIs to easily provision and monitor Extranet VPNs from a simple custom GUI. With this kind of approach network operators would not need to:
- Understand network architecture details
- Remotely connect to devices
- Be experts on each underlying device CLI
- Configure those devices via a myriad of CLI commands
Note: it is important to remark that this automation demo is based on NSO and its capability to extend existing functionalities via service models. The primary service model for NSO can be found in the ./nso/packages/vpn directory. Service models/packages are the primary way that NSO functionality is extended. A service model is comprised of a YANG file, a set of templates, and optionally some python or java logic.
pyATS is an Automation Test System written in Python. It provides the core infrastructure to define topologies, connect to network devices and run the required tests.
Genie builds on top of pyATS and it is fully integrated to provide model automation tests. It focuses on test cases for features (ie. BGP, HSRP, OSPF), and abstracts how this information is obtained from underlying devices.
Together, pyATS and Genie enable you to create network test cases that provide stateful validation of devices' operational status. You can use them to validate how a new feature or product will affect your network, compare the status of your network before/after a change, or customize your network monitoring based on your own business drivers.
The solution provides visibility on network devices health, by focusing not only on the configurational state, but also on the operational status.
It is agnostic and extensible, so any type of system could potentially be included by developing the right set of libraries.
It can be integrated into CICD pipelines (implemented via integration servers like GitLab or Jenkins), other frameworks (like Robot, for almost-natural language stateful tests definition), or even interact with ChatBots (ChatOps).
It also integrates beautifully with VIRL topologies, and we will show you how to do it so you can focus only on what you want to test in your network.
The network topology you will use for testing is called the testbed, and it includes your devices and links. It is defined in a YAML file, and since pyATS is implemented in Python, everything is an object... including the testbed.
Your network devices are also objects in pyATS, so you can perform operations on them using methods, like the following:
- connect()
- ping(destination)
- execute('show version')
- configure('no ip domain lookup')
The output from these commands will be parsed into structured data, so your systems can easily extract business-relevant data from them.
OK, let's see it working.
The first thing you need to decide is how you want to run pyATS: natively in your own system, or in a Docker container.
For the first option you should use a Python 3.X virtual environment, so you don't clog your system, and then install the required tools (see this doc).
However it is easier to run it in a Docker container, as the available image includes all required software, libraries, dependencies and a ton of examples you can use to get started. So we will use containers for our demos.
The sandbox you have reserved includes a big VIRL server we will use to run some simulated devices for our demos.
It also includes a devbox with all required utilities pre-configured. At this point you could decide to use the devbox included in your sandbox to execute the demos, or rather configure your own system so you can continue using it later. If you decide to use the sandbox devbox you can connect to it by running: ssh [email protected], and use password C1sco12345.
In order to easily manage the VIRL server we will use a very handy utility called virlutils. You will only need to install virlutils if you decide to use your own local workstation for the demos (no need to do it if you will be using the sandbox devbox).
$ pip install virlutils
Once done, please create a VIRL init file (again, no need to do this step if you will be using the sandbox devbox)...
$ vi ~/.virlrc
... and define the required VIRL credentials:
VIRL_USERNAME=guest
VIRL_PASSWORD=guest
VIRL_HOST=10.10.20.160
Then start a new terminal window in your workstation, so that it reads the new VIRL init file configuration.
Now you should be able to search for some example pre-defined simulated topologies that could be useful for testing (you can find some more here).
$ virl search
You may even filter those examples: ie. look for the ones including IOS in their name.
$ virl search ios
Displaying 1 Results For ios
╒════════════════════════╤═════════╤══════════════════════╕
│ Name │ Stars │ Description │
╞════════════════════════╪═════════╪══════════════════════╡
│ virlfiles/2-ios-router │ 0 │ hello world virlfile │
╘════════════════════════╧═════════╧══════════════════════╛
That is a simple template for a 2 IOS-routers simulation (kind of like a hello-world for virlutils).
Make sure you are connected to your sandbox VPN and then download the VIRL topology specified below, so that you can start it in your server.
$ mkdir tests
$ cd tests
$ virl pull virlfiles/genie_learning_lab
Pulling from virlfiles/genie_learning_lab
Saved topology as topology.virl
$ virl up
Creating default environment from topology.virl
Localizing {{ gateway }} with: 172.16.30.254
Now you have your VIRL simulation running in the sandbox server!
$ virl ls
Running Simulations
╒══════════════════════════╤══════════╤════════════════════════════╤═══════════╕
│ Simulation │ Status │ Launched │ Expires │
╞══════════════════════════╪══════════╪════════════════════════════╪═══════════╡
│ netdevops_default_oAmstu │ ACTIVE │ 2019-04-03T10:54:44.416113 │ │
╘══════════════════════════╧══════════╧════════════════════════════╧═══════════╛
You can also see the status of its included nodes.
$ virl nodes
Here is a list of all the running nodes
╒════════════╤══════════╤═════════╤═════════════╤════════════╤══════════════════════╤════════════════════╕
│ Node │ Type │ State │ Reachable │ Protocol │ Management Address │ External Address │
╞════════════╪══════════╪═════════╪═════════════╪════════════╪══════════════════════╪════════════════════╡
│ csr1000v-1 │ CSR1000v │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.129 │ N/A │
├────────────┼──────────┼─────────┼─────────────┼────────────┼──────────────────────┼────────────────────┤
│ nx-osv-1 │ NX-OSv │ ACTIVE │ REACHABLE │ telnet │ 172.16.30.130 │ N/A │
╘════════════╧══════════╧═════════╧═════════════╧════════════╧══════════════════════╧════════════════════╛
Once a node shows up as ACTIVE and REACHABLE you can connect to it (use password cisco) with:
$ virl ssh nx-osv-1
Please note that during the connection process you will need to confirm you want to add its IP address to the list of known hosts.
One of the fantastic features that virlutils includes is that it can generate inventories to be used by other systems, using the command: virl generate [ pyats | nso | ansible ]
For our demos we will use the pyats one, so try it once all nodes in your simulation are REACHABLE.
$ virl generate pyats -o default_testbed.yaml
Writing default_testbed.yaml
With just a single command you have now a YAML file that defines your VIRL environment as a testbed to be used by pyATS straight away!
That pyATS testbed definition file will need some variables to define the enable password and login user/password. The most convenient way to use them later is to have them stored in a file, so please go ahead and download it.
$ curl -L https://raw.githubusercontent.com/juliogomez/netdevops/master/pyats/env.list -o env.list
As the final preparation step before starting, please make sure to obtain the latest pyATS Docker image.
$ docker pull ciscotestautomation/pyats:latest
We are now READY to start our tests!
Don't do it now, but please note that by the end of our set of demos, when you are finally done with your simulation, you can easily tear it down with:
$ virl down
Removing ./.virl/default
Shutting Down Simulation netdevops_default_oAmstu.....
SUCCESS
The most basic demo will show you how to use pyATS to execute a single command on a certain network device. In this case you will see on your screen how this script executes a show version on a CSR1000v.
Download the required script to your system:
$ curl -L https://raw.githubusercontent.com/juliogomez/netdevops/master/pyats/1-pyats-intro.py -o 1-pyats-intro.py
Please review its content and you will see it executes the following steps:
- Load the required pyATS library
- Load the pyATS testbed definition from file
- Select a specific device from the testbed
- Connect to that device via SSH and configure the connection to be automation-friendly (disable logging, change terminal width/length, no timeout)
- Execute a command in that device
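To make those steps more concrete, here is a minimal sketch of what such a script could look like. It is not the actual content of 1-pyats-intro.py: run_command is a hypothetical helper, and its load parameter only exists so that the function can be tried out with a stand-in loader on a machine without pyATS. When it is left out, the testbed is loaded with loader.load from pyats.topology.

```python
def run_command(testbed_file, device_name, command, load=None):
    """Load a testbed, connect to one of its devices and return a command's output."""
    if load is None:
        # Imported lazily, so pyATS is only needed when talking to real devices.
        from pyats.topology import loader
        load = loader.load

    testbed = load(testbed_file)                # testbed definition from the YAML file
    device = testbed.devices[device_name]       # select a specific device
    device.connect()                            # open the session described in the testbed
    return device.execute(command)              # raw text output of the command


# Usage inside the pyATS container (device name taken from the VIRL simulation,
# testbed path assumed to be relative to the current directory):
#   print(run_command("default_testbed.yaml", "csr1000v-1", "show version"))
```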
Run the demo with an interactive container (-it) that will be automatically deleted after execution (--rm), and pass it a mapped volume from your workstation to the container (-v $PWD:/pyats/demos/). When the container starts it will automatically execute the specified python script.
$ docker run -it --rm \
-v $PWD:/pyats/demos/ \
--env-file env.list \
ciscotestautomation/pyats:latest \
python3 /pyats/demos/1-pyats-intro.py
In this case you will use not only pyATS, but also Genie, to compile interface counters from multiple devices across the network and then check if there are any CRC errors in them.
The script will use the same function to compile CRC errors information from 2 devices with different CLI (ie. CSR1000v and Nexus switch), with the available Genie parsers providing independence from the underlying device type. Genie uses models to determine the specific commands and format that need to be used for each feature in each platform/OS, and how to map the outcome to the specific fields in the resulting structured data. Genie determines the platform/OS from the testbed file.
Download the required script to your system:
$ curl -L https://raw.githubusercontent.com/juliogomez/netdevops/master/pyats/2-genie-intro.py -o 2-genie-intro.py
Please review its content and you will see the following steps to execute:
- Load the required pyATS and Genie libraries
- Define a reusable function that obtains all interface counters from a single device
  - If not connected to the device, connect to it via SSH
  - Learn info about those device interfaces to parse and return it as structured data
- Load the pyATS and Genie testbeds definition from file
- Select a specific device from the testbed
- Call the function defined previously to obtain all interface counters from that device
- Select another device, with a different CLI
- Call the function defined previously to obtain all interface counters from that device
- Merge all interface details from these 2 different devices (with different CLIs), into a single source (python dictionary)
- Loop through the compiled data in that single source and show CRC errors for every interface
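The last two steps (merging and looping) can be sketched with plain dictionaries. This is not the code of 2-genie-intro.py: crc_errors is a hypothetical helper, the interface names and counter values are made up, and the counters / in_crc_errors keys are an assumption about the structure returned by learning the interface feature, so please compare them with the real output in your own lab.

```python
def crc_errors(interfaces_by_device):
    """Flatten {device: {interface: details}} into {(device, interface): CRC count}."""
    report = {}
    for device, interfaces in interfaces_by_device.items():
        for name, details in interfaces.items():
            counters = details.get("counters", {})
            # Interfaces without counters (e.g. some virtual ones) are skipped.
            if "in_crc_errors" in counters:
                report[(device, name)] = counters["in_crc_errors"]
    return report


# Made-up sample data. With real devices each inner dictionary would come from
# something like device.learn('interface').info (assumed, as explained above).
merged = {
    "csr1000v-1": {"GigabitEthernet1": {"counters": {"in_crc_errors": 0}}},
    "nx-osv-1": {
        "Ethernet2/1": {"counters": {"in_crc_errors": 4}},
        "loopback0": {},
    },
}

for (device, interface), errors in crc_errors(merged).items():
    print(f"{device} {interface}: {errors} CRC errors")
```

The point is that the same loop works for both platforms, because the platform-specific parsing has already happened before the data reaches it.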
Run the demo with an interactive container (-it) that will be automatically deleted after execution (--rm), and pass it a mapped volume from your workstation to the container (-v $PWD:/pyats/demos/). When the container starts it will automatically execute the specified python script.
$ docker run -it --rm \
-v $PWD:/pyats/demos/ \
--env-file env.list \
ciscotestautomation/pyats:latest \
python3 /pyats/demos/2-genie-intro.py
Now that you have seen a couple of simple examples of what can be done with pyATS and Genie, you might want to start developing your own tests. But instead of iterating through the process of "writing a complete script, trying to run it, failing and rewriting", we would rather have a more interactive way of developing tests. Something that allows us to check the results of each step during the test, and debug it by exploring the results at any point of the flow.
As you may have noticed pyATS feels really pythonic, so wouldn't it be great to have something similar to the interactive Python shell? Something that would give us the option to execute individual steps interactively while developing our tests? Well, we got you covered!
Genie has a function called shell, which can be invoked from the Bash command line. When invoking shell, Genie will load the correct testbed file and initialize the required libraries for the Python interactive shell.
For our demos we will start a pyATS container and ask it to start an interactive shell (bash) so we can install ipyATS in it.
$ docker run -it --rm \
-v $PWD:/pyats/demos/ \
--env-file env.list \
ciscotestautomation/pyats:latest bash
You can easily install ipyATS in it with:
root@2ad68679070c:/pyats# pip install ipyats
And run it with your VIRL testbed:
root@2ad68679070c:/pyats# ipyats --testbed demos/default_testbed.yaml
Invoke Genie Shell with the following command:
root@bfaa28c3faf3:/pyats# genie shell --testbed-file demos/default_testbed.yaml
The great thing about being able to define the specific testbed to use for this test is that you can reuse everything you create in different environments (eg. production, testing, datacenter 1, datacenter 2).
You can see the devices included in your own testbed:
>>> testbed.devices
TopologyDict({'csr1000v-1': <Device csr1000v-1 at 0x7f8fa13e9438>, 'nx-osv-1': <Device nx-osv-1 at 0x7f8fa141f0f0>})
Create aliases for your devices:
>>> nx = testbed.devices['nx-osv-1']
>>> csr = testbed.devices['csr1000v-1']
You can now connect to your device (please make sure you have telnet installed in your system):
>>> csr.connect()
Ask if there are any links going from csr to nx:
>>> csr.find_links(nx)
{<Link object 'csr1000v-1-to-nx-osv-1' at 0x7f8fa0a3db38>,
<Link object 'csr1000v-1-to-nx-osv-1#1' at 0x7f8fa0a3d9b0>,
<Link object 'flat' at 0x7f8fa0a3dba8>}
Or execute a command in it:
>>> csr.execute('show version')
Probably by now you are thinking...
... and you are right!
Let's start by exploring what can be done with genie.ops libraries. Genie Ops libraries are at the heart of parsing features on devices and returning structured data. The models are based on OpenConfig and IETF YANG models. For the full list of models please go to Genie Models.
How about easily obtaining from a device the complete table of routes in a structured format?
>>> routes = csr.learn('routing')
This request will execute a number of commands in the device, compile all the received routing info and parse it into a structured format. Check the resulting dictionary:
>>> routes.info
It is structured data that you can now easily query and process in your scripting!
For example, let's say you have a tool that needs to verify that a specific route (eg. 172.16.30.0/24) exists in the management VRF of your Nexus switch.
>>> routes.info['vrf']['management']['address_family']['ipv4']['routes']['172.16.30.0/24']
{'route': '172.16.30.0/24',
'active': True,
'source_protocol': 'direct',
'metric': 0,
'route_preference': 0,
'next_hop': {'next_hop_list': {1: {'index': 1,
'next_hop': '172.16.1.104',
'updated': '2d00h',
'outgoing_interface': 'mgmt0'}}}}
Wow, that was easy! Think about the kind of processing and parsing you would have had to do in the past to go through the text output of all those commands. Now pyATS is compiling the information from all those commands and giving you a consolidated, structured view that you can easily work with.
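In a script you would normally wrap that lookup so that a missing route does not end in a KeyError. The following sketch shows one way of doing it: find_route is a hypothetical helper, and the sample dictionary is just an abridged copy of the output shown above, so that the code can be tried without a device.

```python
def find_route(info, vrf, prefix, family="ipv4"):
    """Return the entry for prefix in the given VRF, or None when it is absent."""
    try:
        return info["vrf"][vrf]["address_family"][family]["routes"][prefix]
    except KeyError:
        return None


# Abridged copy of the structured data shown above.
info = {
    "vrf": {
        "management": {
            "address_family": {
                "ipv4": {
                    "routes": {
                        "172.16.30.0/24": {
                            "route": "172.16.30.0/24",
                            "active": True,
                            "source_protocol": "direct",
                        }
                    }
                }
            }
        }
    }
}

route = find_route(info, "management", "172.16.30.0/24")
print("route present and active" if route and route["active"] else "route missing")
```

In the interactive shell you would pass routes.info instead of the sample dictionary.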
Now let's try a different task, and learn about all-things BGP in the csr device:
>>> bgp = csr.learn('bgp')
Again, this task will run multiple BGP-related commands, iterating through all detected BGP neighbors, and provide you with a consolidated view that includes all relevant information in a structured format, so you can easily extract and process the specific data you require.
>>> bgp.info
Now let's explore what can be done with genie.conf libraries.
For example, in order to work with BGP configurations we need to import the required library:
>>> from genie.libs.conf.bgp import Bgp
And then we could use it to learn the BGP configuration in our Nexus switch:
>>> bgps_nx = Bgp.learn_config(nx)
Because other routing protocols (unlike BGP) may run several instances, the result is returned as a list, and we need to refer to its first entry, numbered 0:
>>> bgp_nx = bgps_nx[0]
We can also apply configurations, like this or a different one, to our device:
>>> bgp_nx.build_config()
Or remove all BGP configuration:
>>> bgp_nx.build_unconfig()
You can check it's all gone with the same command we used in the genie.ops section:
>>> bgp = nx.learn('bgp')
>>> bgp.info
{}
And easily apply all BGP configuration back again:
>>> bgp_nx.build_config()
When you are done exploring ipyATS, you can exit with:
>>> exit()
root@bfaa28c3faf3:/pyats# exit
ipyATS makes it really easy for you to develop and debug your tests step-by-step, in the classic pythonic way!
We will now explore another example that will help you check all BGP neighbors in your network are in the desired established state.
The test case structure includes the following sections:
- Common setup: connect to all devices included in your testbed.
- Test cases: learn about all BGP sessions in each device, check their status and build a table to represent that info. If any neighbor is not in an established state, the test will fail and signal this condition in an error message.
In order to run it, you will first need to install git on your pyATS container, clone a repo with additional examples, install a tool to create nice text tables (tabulate), go into the directory and execute the job:
$ docker run -it --rm \
-v $PWD:/pyats/demos/ \
ciscotestautomation/pyats:latest-alpine ash
(pyats) /pyats # apk add --no-cache git
(pyats) /pyats # git clone https://github.com/kecorbin/pyats-network-checks.git
(pyats) /pyats # pip3 install tabulate
(pyats) /pyats # cd pyats-network-checks/bgp_adjacencies
(pyats) /pyats/pyats-network-checks/bgp_adjacencies # pyats run job BGP_check_job.py --testbed-file /pyats/demos/default_testbed.yaml
As a result you will find the following table in your logs, displaying all BGP neighbors in all your devices, and their current status:
2019-04-05T18:10:41: %SCRIPT-INFO: | Device | Peer | State | Pass/Fail |
2019-04-05T18:10:41: %SCRIPT-INFO: |------------+----------+-------------+-------------|
2019-04-05T18:10:41: %SCRIPT-INFO: | csr1000v-1 | 10.2.2.2 | established | Passed |
2019-04-05T18:10:41: %SCRIPT-INFO: | nx-osv-1 | 10.1.1.1 | established | Passed |
It was never this easy to make sure BGP neighbors across your network are properly established!
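The core of such a check can be sketched in a few lines of plain Python. This is not the code from the repository above, only an illustration. It assumes that each device's bgp.info follows an instance / vrf / neighbor / session_state nesting, and the sample data is made up (one neighbor is deliberately set to idle to show a failure):

```python
def check_bgp_neighbors(learnt):
    """Return (rows, all_ok) for a mapping of device name -> bgp.info-like dict."""
    rows = []
    for device, info in learnt.items():
        for instance in info.get("instance", {}).values():
            for vrf in instance.get("vrf", {}).values():
                for peer, data in vrf.get("neighbor", {}).items():
                    state = data.get("session_state", "unknown").lower()
                    result = "Passed" if state == "established" else "Failed"
                    rows.append((device, peer, state, result))
    # Note: with no neighbors at all, all() is True and the check passes
    all_ok = all(row[3] == "Passed" for row in rows)
    return rows, all_ok


# Hypothetical sample data; the nesting is an assumption about bgp.info
learnt = {
    "csr1000v-1": {"instance": {"default": {"vrf": {"default": {"neighbor": {
        "10.2.2.2": {"session_state": "established"}}}}}}},
    "nx-osv-1": {"instance": {"default": {"vrf": {"default": {"neighbor": {
        "10.1.1.1": {"session_state": "idle"}}}}}}},
}

rows, all_ok = check_bgp_neighbors(learnt)
for row in rows:
    print("| {:<10} | {:<8} | {:<11} | {:<9} |".format(*row))
print("All neighbors established:", all_ok)
```

Inside a real test case you would call self.failed(...) when all_ok is False, instead of printing the result.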
Now let's say you are responsible for a network and could use some help staying informed about possible issues in it. Wouldn't it be great to have a tool that helps you profile the network end-to-end and store that info as snapshots?
Let's focus, for example, on profiling everything related to BGP, OSPF, interfaces and the platforms in your network, and saving it to snapshot files. Ideally you would take a first snapshot of your network when everything is working perfectly.
Genie can help you do it with a simple command, specifying what features you want to learn (ospf interface bgp platform), from what specific testbed (--testbed-file default_testbed.yaml), and the directory where you want to store the resulting files (--output good):
$ docker run -it --rm \
-v $PWD:/pyats/demos/ \
--env-file env.list \
ciscotestautomation/pyats:latest-alpine ash
(pyats) /pyats# cd demos
(pyats) /pyats/demos # genie learn ospf interface bgp platform --testbed-file default_testbed.yaml --output good
Inside the created good directory, console files will show you what commands were run to obtain all required info, while ops files will store the resulting information in structured format.
Now let's simulate that something terrible has happened in your network... by shutting down one of the loopback interfaces in your CSR1000v router. Well, it's not that terrible, but you get the idea as an example of what could have happened.
First you need to identify the IP address of that CSR1000v, so you can connect to it:
(pyats) /pyats/demos # cat default_testbed.yaml | grep -A 1 GigabitEthernet1:
GigabitEthernet1:
ipv4: 172.16.30.129/24
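The ipv4 value in the testbed file includes the prefix length, while ssh only needs the host address. If you ever script this step, Python's standard ipaddress module can split the two. A minimal sketch using the address shown above:

```python
import ipaddress

# Value of the ipv4 key under GigabitEthernet1 in the testbed file
iface = ipaddress.ip_interface("172.16.30.129/24")

host = str(iface.ip)          # address to pass to ssh
network = str(iface.network)  # subnet the interface belongs to

print(host)
print(network)
```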
Now you can SSH to it, with password cisco:
(pyats) /pyats/demos # ssh [email protected]
Once inside the system, please shut down interface Loopback 1 to simulate that terrible catastrophe in your network:
csr1000v-1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
csr1000v-1(config)#int lo 1
csr1000v-1(config-if)#shut
csr1000v-1(config-if)#exit
csr1000v-1(config)#exit
csr1000v-1#exit
Connection to 172.16.30.129 closed by remote host.
Connection to 172.16.30.129 closed.
In the real world, you would soon be receiving calls from users: "Something is wrong... terribly wrong", "I lost ALL connectivity", "My database stopped working!". So instead of troubleshooting by brute force, how about asking Genie to determine the current status of the network after the outage? And even better, what exactly changed since you took the snapshot of the network in its good state?
Let's do this by running the same command as before, but this time asking the system to store the resulting files in a different directory (--output bad).
(pyats) /pyats/demos # genie learn ospf interface bgp platform --testbed-file default_testbed.yaml --output bad
And now find out what changed between the good situation and the bad one with yet another simple command.
(pyats) /pyats/demos # genie diff good bad
1it [00:00, 5.96it/s]
+==============================================================================+
| Genie Diff Summary between directories good/ and bad/ |
+==============================================================================+
| File: ospf_iosxe_csr1000v-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: platform_nxos_nx-osv-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: interface_iosxe_csr1000v-1_ops.txt |
| - Diff can be found at ./diff_interface_iosxe_csr1000v-1_ops.txt |
|------------------------------------------------------------------------------|
| File: bgp_nxos_nx-osv-1_ops.txt |
| - Diff can be found at ./diff_bgp_nxos_nx-osv-1_ops.txt |
|------------------------------------------------------------------------------|
| File: ospf_nxos_nx-osv-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: bgp_iosxe_csr1000v-1_ops.txt |
| - Diff can be found at ./diff_bgp_iosxe_csr1000v-1_ops.txt |
|------------------------------------------------------------------------------|
| File: platform_iosxe_csr1000v-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
| File: interface_nxos_nx-osv-1_ops.txt |
| - Identical |
|------------------------------------------------------------------------------|
As you can see the system generates some files that signal exactly what has changed from the good situation to the bad one. In this specific case, one of the files immediately shows that interface Loopback 1 in the CSR1000v has been disabled!
(pyats) /pyats/demos # cat ./diff_interface_iosxe_csr1000v-1_ops.txt
--- learnt/interface_iosxe_csr1000v-1_ops.txt
+++ bad/interface_iosxe_csr1000v-1_ops.txt
info:
Loopback1:
...
+ enabled: False
- enabled: True
+ oper_status: down
- oper_status: up
Talk about an easy way to determine why your network is not working properly as before, and to find out what happened exactly!
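To get a feel for what such a comparison involves, here is a minimal sketch of a recursive dictionary diff in plain Python. It is not Genie's implementation, only an illustration of the idea, and the two sample snapshots (including the mtu value) are hypothetical:

```python
def diff_dicts(old, new, indent=0):
    """Return diff lines for two nested dicts: '+' marks new values, '-' old ones."""
    lines = []
    pad = " " * indent
    for key in sorted(set(old) | set(new), key=str):
        in_old, in_new = key in old, key in new
        if in_old and in_new and old[key] == new[key]:
            continue  # unchanged, nothing to report
        if in_old and in_new and isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are dicts: print the key as context and recurse
            lines.append("  {}{}:".format(pad, key))
            lines.extend(diff_dicts(old[key], new[key], indent + 1))
            continue
        if in_new:
            lines.append("+ {}{}: {}".format(pad, key, new[key]))
        if in_old:
            lines.append("- {}{}: {}".format(pad, key, old[key]))
    return lines


# Hypothetical snapshots of the same interface before and after the change
good = {"info": {"Loopback1": {"enabled": True, "oper_status": "up", "mtu": 1514}}}
bad = {"info": {"Loopback1": {"enabled": False, "oper_status": "down", "mtu": 1514}}}

print("\n".join(diff_dicts(good, bad)))
```

Only the keys that differ are reported, together with the parent keys needed to locate them, which is why the unchanged mtu value does not show up.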
But we could do better... there's always room for improvement, right? You have probably noticed that the output from genie commands is better and more meaningful than the output from the original pyats commands. But it was still a lot for just a couple of devices. Imagine running that same test across a complete network with hundreds or thousands of systems... that would be a lot of logging info! As an operator I probably don't need that much output, and I could use a more intuitive summary that gives me the key info on what I am doing.
Besides this, network operators are probably interested in defining their tests in a way that is as close to natural language as possible. Robot Framework is an open-source automation framework for testing that can help you with these challenges. Let's take a look at an example of what can be done with it.
We will run the same scenario as before, and see what are some of the benefits we get with Robot. So again, we will take a first snapshot of our network when it is working fine.
Before we start, please go to your CSR and get interface Loopback 1 back up again, so that the network is tidy and clean, as it was in the beginning.
(pyats) /pyats/demos # ssh [email protected]
csr1000v-1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
csr1000v-1(config)#int lo 1
csr1000v-1(config-if)#no shut
csr1000v-1(config-if)#exit
csr1000v-1(config)#exit
csr1000v-1#exit
Connection to 172.16.30.129 closed by remote host.
Connection to 172.16.30.129 closed.
Everything is now back to the normal initial situation.
Now, instead of running the Genie profiling command directly from the CLI, with Robot we will use the initial_snapshot.robot test definition file you will find in the demos directory. This file specifies the libraries to import, where the testbed file resides, and the test cases definition. Please review this file and you will see the different steps in these test cases are defined with very simple language.
First it will connect to the testbed devices:
Connect
    # Initializes the pyATS/Genie Testbed
    use genie testbed "${testbed}"
    # Connect to both devices
    connect to device "nx-osv-1"
    connect to device "csr1000v-1"
And then the system will profile them, specifying where to store the resulting network profile snapshot files:
Profile the devices
    Profile the system for "bgp;config;interface;platform;ospf;arp;routing;vrf;vlan" on devices "nx-osv-1;csr1000v-1" as "./good/good_snapshot"
Very simple and natural language that makes it easy to understand what the test case is supposed to do.
Once more we will create a container, change to the required directory and run robot with a single command that simply specifies the directory where we want to store the resulting log, output and report (-d good):
$ docker run -it --rm \
-v $PWD:/pyats/demos/ \
--env-file env.list \
ciscotestautomation/pyats:latest-alpine ash
[Entrypoint] Starting pyATS Docker Image ...
[Entrypoint] Workspace Directory: /pyats
[Entrypoint] Activating workspace
(pyats) /pyats # cd demos
(pyats) /pyats/demos # robot -d good initial_snapshot.robot
==============================================================================
Initial Snapshot
==============================================================================
[ WARN ] Could not load the Datafile correctly
Connect | PASS |
------------------------------------------------------------------------------
Profile the devices | PASS |
------------------------------------------------------------------------------
Initial Snapshot | PASS |
2 critical tests, 2 passed, 0 failed
2 tests total, 2 passed, 0 failed
==============================================================================
Output: /pyats/demos/good/output.xml
Log: /pyats/demos/good/log.html
Report: /pyats/demos/good/report.html
(pyats) /pyats/demos #
As you can see, the output an operator gets when executing the test case is now much more concise. It clearly states, in one line per step, whether the test passed, and where you can find the resulting report, output and log files. These make it easy to see from a browser how the tests went, drill down into each specific test and examine the logs to find out exactly what happened. In this case we have decided to store these files in the same directory where we keep the profiling snapshots.
The good directory now stores everything about your network profile when things work fine. Let's mess it up again, by connecting to the system and shutting down interface Loopback 1.
(pyats) /pyats/demos # ssh [email protected]
csr1000v-1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
csr1000v-1(config)#int lo 1
csr1000v-1(config-if)#shut
csr1000v-1(config-if)#exit
csr1000v-1(config)#exit
csr1000v-1#exit
Connection to 172.16.30.129 closed by remote host.
Connection to 172.16.30.129 closed.
After this terrible event it is time to profile the network again, but this time we will use the compare_snapshot.robot file to run another test case, slightly different from the initial one. In this case it will include one extra step: once it is connected to the devices and has profiled them as before, it will automatically compare the new snapshots with the old good ones.
Compare snapshots
    Compare profile "./good/good_snapshot" with "./fail/failed_snapshot" on devices "nx-osv-1;csr1000v-1"
Again, very simple and natural language that makes it easy to understand what the test case is supposed to do.
(pyats) /pyats/demos # robot -d fail compare_snapshot.robot
As you will see from the output, the first 2 steps work fine: it connects to the devices and profiles them. However, step number 3 fails, indicating that something has changed from the previous good situation. Further down, the log clearly states that the CSR interface has been shut down and is no longer operational, compared to the initial good state. Wow, that was easy to debug!
Comparison between ./good/good_snapshot and ./fail/failed_snapshot is different for feature 'config' for device:
'csr1000v-1'
interface Loopback1
+ shutdown
**********
Comparison between ./good/good_snapshot and ./fail/failed_snapshot is different for feature 'interface' for device:
'csr1000v-1'
info:
Loopback1:
+ enabled: False
- enabled: True
+ oper_status: down
- oper_status: up
In summary, using Robot we have been able to define the desired test case in very intuitive, natural language. The resulting output is also very clear when debugging possible network issues, and it even offers HTML reporting that you can easily consume and share. Really awesome tool!
If you want to learn more about how Genie network profiling can help you manage and debug issues in your network, please check this fantastic lab and also this one. Both offer you the option to run them on mocked devices, so you don't actually need a reserved sandbox environment... how cool is that?
Now that you know how to run some basic tests with pyATS and Genie, it is time to explore how we could give it a proper structure to build more complex tests. That's what Test Cases are all about: a framework that allows you to build repeatable and more sophisticated testing processes.
Let's take a look at this example:
Task-1: basic_example_script
|-- commonSetup
| |-- sample_subsection_1
| `-- sample_subsection_2
|-- tc_one
| |-- prepare_testcase
| |-- simple_test_1
| |-- simple_test_2
| `-- clean_testcase
`-- commonCleanup
`-- clean_everything
The sections are easy to understand:
- You can define a number of tasks to run in your test case (in the example above we have just 1 task)
- Then you will have some common setup to do, structured in subsections
- After that, you would go into the real Test Case (tc), with 3 phases: preparation, execution and cleaning
- Finally, as a good citizen, you would need to clean after yourself, everything you set up during the common setup phase
Let's see it working in your own setup. In this case we will use the -alpine image because it has vi already included in it, and you will need it to edit some files during this demo. We will ask our pyATS container to provide a shell (ash for -alpine image) so we can work with it interactively.
$ docker run -it --rm ciscotestautomation/pyats:latest-alpine ash
Once inside the container shell you have access to its directory structure and tools. Inside the pyats directory you will find multiple examples and templates to use with pyATS. To get started let's focus on a basic one.
(pyats) /pyats # cd examples/basic
There you will find the basic_example_script.py Python script file that defines a very simple Test Case. It includes quite a lot of Python code for all the sections mentioned before, but it does not actually do much (in fact it only logs), so it is a good starting point as a template to develop your own test cases.
(pyats) /pyats/examples/basic# cat basic_example_script.py
This script will be executed from a job, defined in this file:
(pyats) /pyats/examples/basic# cat job/basic_example_job.py
You would run the job with:
(pyats) /pyats/examples/basic# pyats run job job/basic_example_job.py
You can see in the report shown at the end of the execution process that all tests in our task PASSED.
Let's insert a simple verification test in our test case. Please edit the Python script with vi basic_example_script.py, scroll down to the TESTCASES SECTION and look for the First test section. There you need to insert the required code as per the following:
    # First test section
    @aetest.test
    def simple_test_1(self):
        """ Sample test section. Only print """
        log.info("First test section ")
        self.a = 1
        self.b = 2
        if self.a != self.b:
            self.failed("{} is not {}".format(self.a, self.b))
As you can see we are defining 2 simple variables with fixed values of 1 and 2, and then inserting a conditional statement that fails if they are different. So, obviously the test will now fail because 1 and 2 are different.
Save the file and try it.
(pyats) /pyats/examples/basic# pyats run job job/basic_example_job.py
Check the execution logs and you will see what a failed test looks like when executing a test case:
...
2019-04-04T08:32:09: %AETEST-INFO: Starting section simple_test_1
2019-04-04T08:32:09: %SCRIPT-INFO: First test section
2019-04-04T08:32:09: %AETEST-ERROR: Failed reason: 1 is not 2
2019-04-04T08:32:09: %AETEST-INFO: The result of section simple_test_1 is => FAILED
...
2019-04-04T08:32:09: %EASYPY-INFO: +------------------------------------------------------------------------------+
2019-04-04T08:32:09: %EASYPY-INFO: | Task Result Summary |
2019-04-04T08:32:09: %EASYPY-INFO: +------------------------------------------------------------------------------+
2019-04-04T08:32:09: %EASYPY-INFO: Task-1: basic_example_script.commonSetup PASSED
2019-04-04T08:32:09: %EASYPY-INFO: Task-1: basic_example_script.tc_one FAILED
2019-04-04T08:32:09: %EASYPY-INFO: Task-1: basic_example_script.commonCleanup PASSED
2019-04-04T08:32:09: %EASYPY-INFO: +------------------------------------------------------------------------------+
2019-04-04T08:32:09: %EASYPY-INFO: | Task Result Details |
2019-04-04T08:32:09: %EASYPY-INFO: +------------------------------------------------------------------------------+
2019-04-04T08:32:09: %EASYPY-INFO: Task-1: basic_example_script
2019-04-04T08:32:09: %EASYPY-INFO: |-- commonSetup PASSED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- sample_subsection_1 PASSED
2019-04-04T08:32:09: %EASYPY-INFO: | `-- sample_subsection_2 PASSED
2019-04-04T08:32:09: %EASYPY-INFO: |-- tc_one FAILED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- prepare_testcase PASSED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- simple_test_1 FAILED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- simple_test_2 PASSED
2019-04-04T08:32:09: %EASYPY-INFO: | `-- clean_testcase PASSED
2019-04-04T08:32:09: %EASYPY-INFO: `-- commonCleanup PASSED
2019-04-04T08:32:09: %EASYPY-INFO: `-- clean_everything PASSED
As you can see you don't need to be a Python expert to use the test cases framework. You have templates readily available for you, where you can insert the specific tests you would like to run and execute them straight away.
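When a job has many sections, you may want to pull just the failures out of the log, for example to flag them in a pipeline. Here is a minimal sketch that scans result-tree lines like the ones shown above with a regular expression. The helper name failed_sections is hypothetical, the sample is a shortened copy of the log above, and only PASSED and FAILED results are recognized:

```python
import re

# Matches result-tree lines such as "... | |-- simple_test_1 FAILED"
SECTION_RE = re.compile(r"(?:\|--|`--)\s+(\S+)\s+(PASSED|FAILED)\s*$")


def failed_sections(log_text):
    """Return the names of all sections reported as FAILED in the result tree."""
    names = []
    for line in log_text.splitlines():
        match = SECTION_RE.search(line)
        if match and match.group(2) == "FAILED":
            names.append(match.group(1))
    return names


# Shortened copy of the Task Result Details shown above
sample_log = """\
2019-04-04T08:32:09: %EASYPY-INFO: Task-1: basic_example_script
2019-04-04T08:32:09: %EASYPY-INFO: |-- commonSetup PASSED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- sample_subsection_1 PASSED
2019-04-04T08:32:09: %EASYPY-INFO: |-- tc_one FAILED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- simple_test_1 FAILED
2019-04-04T08:32:09: %EASYPY-INFO: | |-- simple_test_2 PASSED
2019-04-04T08:32:09: %EASYPY-INFO: `-- commonCleanup PASSED
"""

print(failed_sections(sample_log))
```

Both the test case and the failing section inside it are listed, because the result tree marks a parent as FAILED when one of its sections fails.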
Once you are done you can exit the container.
(pyats) /pyats/examples/basic# exit
- Julio Gomez - Initial work - Blog
This project is licensed under the MIT License - see the LICENSE.md file for details
Many thanks to the following programmability and NetDevOps gurus for their contributions and source materials that helped build this document:
- Kevin Corbin
- Hank Preston
- Chris Lunsford
- Jason Gooley
- Gabi Zapodeanu
- Jean-Benoit Aubin