IT teams put conversations to work with ChatOps

Putting tools, alerts and processes into the chat interface gives both developers and the ops team a new model for working with infrastructure.

Chat is an old tool that’s newly popular. From Slack and HipChat to Salesforce Chatter and Microsoft’s new Teams tool (and a myriad of others), these collaboration tools supplement rather than replace enterprise social networks like Yammer or Jive. Microsoft’s Office division director Richard Ellis likens it to the difference between Facebook and WhatsApp: “The chat-based workspace fills a gap where people can talk rapidly, share content and work as a team.”

ChatOps takes it one step further, adding bots that are configured with custom scripts and plugins so that you can go from talking about work in chat to actually doing it.

“Chat is the closest analog to the way people most naturally interact,” says Steve Goldsmith, the general manager of HipChat at Atlassian. “Ops is about getting done whatever my team is trying to get done, and the trend is moving from using chat to keep each other updated on a deliverable, to completing the objective in a timely way. ChatOps takes what people are already comfortable doing as humans and layers in process and technology, to allow teams to own a process or an issue end-to-end without constantly switching tools or leaving that process.”

Not all of the pieces are necessarily available currently, but Goldsmith predicts that the future of ChatOps is “teams taking action together — opening chat, going to the right room for that project where the right people are assembled for taking action together, and when the job is done we'll all know at the same time.”

The terms that come up repeatedly when you talk about ChatOps are less about technology and more about culture. “ChatOps is a new approach to manage teams and infrastructure using conversational chat interfaces to handle issues and improve collaboration,” RedMonk analyst James Governor tells CIO.

“It began with people working together to solve problems using IRC rather than traditional service management desks. IRC has over time been supplanted by Slack and sometimes HipChat. But the core idea of a conversational, chat-based metaphor is an effective one, and comes out of the movement that great operational tools are there to augment, rather than replace people. Chat platforms are a natural place to build agents to automate everyday tasks. Bots provide a way to extend and build on the core platform; a request response agent you can trigger conversationally.”

Some organizations have adopted ChatOps broadly. That includes not just companies like Slack and Atlassian that build chat platforms but also GitHub (which is often credited with coining the term ChatOps). The Hubot chatbot started as a simple collection of scripts and has become the primary way that GitHub controls its entire infrastructure.

“Any time I have a script to run or a system to interact with, most likely the best way is to go through a Hubot script,” says GitHub engineer Alain Hélaïli. “That way I don't need to log into a system — I don’t even need to know where it [the system] is. I'm just in my Slack environment and it works for me.”
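The pattern Hélaïli describes, a chat message that triggers a script, can be sketched in a few lines. The following is a minimal illustration in Python, not Hubot's actual API: the `ChatBot` class, the bot name `opsbot`, the `where is` command and the service inventory are all hypothetical names chosen for the example.

```python
import re
from typing import Callable, List, Optional, Tuple

Handler = Callable[[re.Match], str]


class ChatBot:
    """Toy request/response bot: regex patterns mapped to handler functions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._routes: List[Tuple[re.Pattern, Handler]] = []

    def respond(self, pattern: str) -> Callable[[Handler], Handler]:
        """Register a handler for messages addressed to the bot."""
        def register(handler: Handler) -> Handler:
            self._routes.append((re.compile(pattern, re.IGNORECASE), handler))
            return handler
        return register

    def handle(self, message: str) -> Optional[str]:
        """Return a reply, or None if the message is not addressed to the bot."""
        prefix = self.name + " "
        if not message.lower().startswith(prefix.lower()):
            return None
        text = message[len(prefix):].strip()
        for pattern, handler in self._routes:
            match = pattern.fullmatch(text)
            if match:
                return handler(match)
        return f"Sorry, I don't know how to '{text}'."


bot = ChatBot("opsbot")

# Hypothetical service inventory; a real bot would query a live system.
SERVICE_HOSTS = {
    "billing": "host-a.internal.example",
    "search": "host-b.internal.example",
}


@bot.respond(r"where is (\w+)")
def where_is(match: re.Match) -> str:
    """Look up which host a service runs on, so the user doesn't have to know."""
    service = match.group(1).lower()
    host = SERVICE_HOSTS.get(service)
    if host is None:
        return f"I don't know a service called {service}"
    return f"{service} runs on {host}"
```

Typing `opsbot where is billing` in the channel would produce the reply `billing runs on host-a.internal.example`, while ordinary conversation that doesn't start with the bot's name is ignored. A production bot would add a chat platform adapter and authentication on top of this dispatch loop.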

It’s not just software developers at GitHub who use Hubot; the sales team uses it to get information about customers rather than going directly to Salesforce. The company has a lot of remote workers so they developed a strong culture of using tools for team collaboration and working out loud, along with the continuous integration and delivery tools they rely on to be able to deploy to the service around 80 times a day. The GitHub platform itself is as much about exchanging ideas and making discussions and decisions visible as it is about storing source code.

Recipe for ChatOps success

Before you start trying to introduce ChatOps, you need to have a culture for using chat effectively in teams. Atlassian uses HipChat extensively, from ‘social rooms’ that exist to create a sense of community, to short-lived, tactical discussion rooms created whenever a team needs to handle a problem with one of their cloud services.

“The precondition for ChatOps is chat,” Goldsmith says: not just talking about work but “time-based action when we have a deadline or we need to complete this task quickly.” He suggests that means you need to have standardized on a single chat platform. “In general, one of the beauties of chat is that it works for your whole team, from the front desk to the CEO. Everyone in the organization gets benefits and your organization benefits when everyone is on that platform.”

Pete Cheslock, head of operations at Threat Stack, agrees. “One of the most fundamental things is to make sure the company has standardized on a single chat system. I worked at a company that had four chat systems in use, so there was no one way except email to get in touch with everyone.”

Cheslock also offers a similar definition. “ChatOps is a way to use a tool that’s already in your company for normal communications and build on top of that additional tooling that can help manage your systems and manage repeatable tasks.”

He uses a bot in the Threat Stack ops channel to get alerts and handle the problem directly in Slack, using Threat Stack’s integration with VictorOps real-time incident management. “If there’s an Amazon outage the events go into the chat systems for incident response while we start investigating. We use a chatbot to integrate with Atlassian’s StatusPage, and we might send a command to PagerDuty to adjust who’s on call and who gets alerted.”

The advantage isn’t just the convenience. It’s also that what he’s doing is visible. “I might get an alert that says one of my providers has gone down and I want to update our status page to let customers know we have an issue. I could click through the boxes on the vendor website — or I can use this automated tool to send a few commands that other people in this chat can see. Now I’m not just solving the problem; I’m training everyone else in the room. With ChatOps, you’re showing exactly how to troubleshoot issues or debug problems to new people in your organization.”

In most organizations today, the same commands are likely hidden in an admin’s terminal history. Moving away from the “lone hero” admin being the only person who knows how to fix problems is important for DevOps, and ChatOps will help dispel the mystique. And seeing what other people are doing gives everyone better situational awareness.

When it’s time for a less experienced employee to handle the task, doing it in public can also be helpful (rather like pair programming). “All the systems response and management is out in the open versus one lone admin in the background so you can have it be much more collaborative.”

That might be a developer noting that they’ve already pushed an update that might solve the problem to save the admin from making a configuration change, or reviewing commands before they’re run.

The chat can be as important as the ops, as a way of getting multiple people involved in decision-making and breaking down siloes.

ChatOps may also be a way of giving nontechnical teams like sales, finance and marketing some insight into what’s going on in the IT organization, though the day-to-day technical details of ChatOps are likely to be so much noise to them.

Not all ChatOps needs to be in a public group chat, notes Amir Shevat, director of Developer Relations at Slack. Ephemeral messages are shown only to the user who asks for them, or you can DM a bot for detailed information; in both cases you can choose to share that into the group channel if it’s useful, or have the bot summarize a longer process. “When you're designing ChatOps, the key is to understand what is public, what should be transparent, what is noisy or not too noisy and build processes around that.”

Key ChatOps tools

Beyond the chat platform you adopt — be it Slack, HipChat, Campfire, Teams or any other — what you need are integrations to the systems you want to operate through ChatOps.

Those integrations might be scripts you use frequently, and Cheslock notes that ChatOps also gives you more visibility into common tasks. “You can see that ‘four times this week, we had to run the following commands to reboot this server’ or ‘we had to build this load balancer ten times this week, and we should automate that’.”
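Because every command passes through the chat log, spotting those repeat offenders is a matter of counting. Here is a minimal sketch in Python, assuming the commands sent to the bot can be exported as plain strings; the function name `automation_candidates`, the threshold and the sample log are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def automation_candidates(
    commands: Iterable[str], min_runs: int = 3
) -> List[Tuple[str, int]]:
    """Return commands run at least `min_runs` times, most frequent first."""
    # Normalize case and whitespace so "Reboot  web-01" counts as "reboot web-01".
    counts = Counter(" ".join(c.lower().split()) for c in commands)
    return [(cmd, n) for cmd, n in counts.most_common() if n >= min_runs]


# Hypothetical week of commands sent to the bot in an ops channel.
week_log = [
    "reboot web-01", "reboot web-01", "build load-balancer",
    "Reboot  web-01", "build load-balancer", "reboot web-01",
    "status db-02",
]

for command, runs in automation_candidates(week_log):
    print(f"'{command}' was run {runs} times this week; consider automating it")
```

With the sample log, only `reboot web-01` (run four times) crosses the default threshold; lowering `min_runs` to 2 would also surface the load balancer build.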

Travis CI, the continuous integration tool used at GitHub and Facebook, uses Threat Stack to monitor commands that are being manually run. “The alerts bubble up in chat,” Cheslock explains. “I saw that you edited this file and I can ask you to put it into source control or write a script. They took a security tool and leveraged it for visibility of what’s happening in their environment and that way they’re discovering things they could automate.”

An advanced version of that could use a machine learning service like IBM Watson to move towards self-healing systems, Shevat suggests. “You could have an AI sit in the conversation and when an incident happens, it says ‘we’ve seen this before and this script solved the problem with 97 percent accuracy, we recommend running this script’.”

Often, ChatOps integrations will be plugins to third-party tools and services, whether that’s build tools like Jenkins, monitoring tools like Nagios, Splunk or New Relic, or work allocation tools like PagerDuty. “ChatOps and cloud-based end-to-end monitoring go hand-in-hand,” says Neil MacGowan, director of digital intelligence at New Relic. “Providing real-time visibility into the impact of actions taken as part of incident resolution enables teams to work on incident resolution in an agile manner, reducing mean-time-to-resolution and improving customer experience.”

“PagerDuty is seeing a lot of adoption as a core tool in ChatOps toolchains,” alongside GitHub integration, RedMonk’s James Governor notes. “xMatters is a traditional issue management platform, which now offers ChatOps support. Cog by Operable is notable because of its layered, social, security model — thus for example if you want to commit a change, that might require that two named members of the team sign off on it first. Because of this identity model, you have the opportunity to build in compliance from the ground up, which is obviously very important to CIOs.”
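The sign-off rule Governor describes can be illustrated with a small data structure. This is a sketch of the general two-person approval idea in Python, not Cog's implementation or API; the `ChangeRequest` class, its fields and the rule that requesters cannot approve their own change are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass
class ChangeRequest:
    """A change that runs only after enough named approvers sign off."""
    description: str
    requester: str
    approvers: FrozenSet[str]      # people allowed to sign off
    required: int = 2              # distinct sign-offs needed before running
    signed_off: Set[str] = field(default_factory=set)

    def approve(self, user: str) -> bool:
        """Record a sign-off; return True if it was accepted."""
        if user not in self.approvers or user == self.requester:
            return False  # not a named approver, or approving own change
        self.signed_off.add(user)  # a set, so repeat approvals count once
        return True

    @property
    def ready(self) -> bool:
        """True once the required number of distinct approvers have signed."""
        return len(self.signed_off) >= self.required
```

A bot would create a `ChangeRequest` when someone asks to commit a change, call `approve()` as teammates respond in the channel, and execute the change only once `ready` is true. Because each sign-off is tied to a named user, the same record doubles as an audit trail.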

Security and getting started with ChatOps

Cog is aimed at enterprises and regulated environments where access control and auditing that traces actions back to specific users are required. “That’s something open source chat systems are lacking,” Cheslock says. Regulated industries aren’t the only ones who need to consider security and authentication models. (This is another way that the Office 365 environment of Teams will appeal to businesses, once it adds support for chatbots in group chat as well as one-to-one messages.)

Moving to ChatOps means that a system that you might have had isolated by a VPN and accessible only to a single employee will now be something an entire team can use without going through all those layers of protection. “One of the biggest concerns of any company should be how to secure the commands they’re allowing to be executed in chat. I've seen companies with chatbots provisioning systems that are changing routes on network systems. The way to think about it is that I’m moving my tooling from my secure environment with all my security protocols to — in a lot of cases — third-party or hosted chat systems.”

Start by considering the different threat surfaces you might be exposing. Even details from a customer support ticket can be relatively privileged information (Shevat notes that it’s against Slack’s terms and conditions to post credit card information into a Slack channel).

“Ensure you have two-factor authentication, so a lost password can't let an outsider into your systems,” Cheslock recommends. “You might want different ways to do authentication for sensitive commands; maybe anyone can see a command but you use access control groups so only some people can run them. Or for some commands, you send a push message to the user’s mobile device that they have to approve before the command can run.”

Netflix uses that model; if someone runs a command through ChatOps that needs elevated privileges, the security team can monitor that and send a message to the user’s phone before the action is confirmed. GitHub also uses two-factor authentication to confirm that the person typing a sensitive command intended to run it, and it gives its Hubot chatbot different privileges in different Slack channels, so salespeople can’t deploy code from their channel, for example.
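Those layers (per-channel privileges, access control groups and a second-factor confirmation for sensitive commands) combine naturally into a single authorization check. The following Python sketch shows one way that might look; it is not how Netflix or GitHub implement it, and the channel names, group names, users and commands are all hypothetical.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Decision(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"   # run only after a second-factor approval
    DENY = "deny"


# All names below are hypothetical examples.
CHANNEL_COMMANDS: Dict[str, Set[str]] = {
    "#ops": {"status", "deploy", "restart"},
    "#sales": {"status", "customer-info"},
}
GROUP_MEMBERS: Dict[str, Set[str]] = {"deployers": {"alex", "sam"}}
# Sensitive commands mapped to the group allowed to run them.
SENSITIVE: Dict[str, str] = {"deploy": "deployers", "restart": "deployers"}


def authorize(user: str, channel: str, command: str) -> Decision:
    """Decide whether `user` may run `command` from `channel`."""
    if command not in CHANNEL_COMMANDS.get(channel, set()):
        return Decision.DENY      # command is not enabled in this channel
    group: Optional[str] = SENSITIVE.get(command)
    if group is None:
        return Decision.ALLOW     # routine command, no extra checks
    if user not in GROUP_MEMBERS.get(group, set()):
        return Decision.DENY      # user lacks the required group
    return Decision.CONFIRM       # permitted, pending push approval
```

Here a deployer typing `deploy` in `#ops` gets `Decision.CONFIRM`, meaning the bot would send a push notification and wait for approval before acting, while the same command from `#sales` is denied outright regardless of who sends it.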

Slack takes it a step further internally. “If you tell a bot to shut down all the servers, the bot might say ‘you don’t have permission to do that; would you like me to ask your manager for permission’,” Shevat explains.

Cheslock suggests starting by using ChatOps to integrate third-party cloud services; “it likely won't cause hard security questions because you’re not giving a chatbot access to your private secure environment; you’re using it as a way to orchestrate these public services.”

Start with something that will be valuable to a team, suggests Shevat. “Maybe you can’t yet manage source code on ChatOps but you can manage errors on servers; take part of that lifecycle and turn it into ChatOps. Or start with source control, then move into management tools like Trello, and then PagerDuty. Start with something valuable and doable, and if it hasn’t become the standard process after a week or two, revert it. But if you see value after a week or two, choose another and try that.”

“ChatOps won’t be for everyone,” warns Governor, “but teams using modern software toolchains, with agile CI and CD [continuous integration and continuous delivery], are likely to get the most out of it. It doesn’t make sense to drive ChatOps as a top-down mandate, but rather for organizations trusting their ops and development teams already.”

This story, "IT teams put conversations to work with ChatOps" was originally published by CIO.
