AIOps Archives - SD Times
https://sdtimes.com/tag/aiops/

Working toward AIOps maturity? It’s never too early (or late) for platform engineering
https://sdtimes.com/softwaredev/working-toward-aiops-maturity-its-never-too-early-or-late-for-platform-engineering/
Tue, 09 Jul 2024

Until about two years ago, many enterprises were experimenting with isolated proofs of concept or managing limited AI projects, with results that often had little impact on the company’s overall financial or operational performance. Few companies were making big bets on AI, and even fewer executive leaders lost their jobs when AI initiatives didn’t pan out.

Then came the GPUs and LLMs.

All of a sudden, enterprises in all industries found themselves in an all-out effort to position AI – both traditional and generative – at the core of as many business processes as possible, with as many employee- and customer-facing AI applications in as many geographies as they can manage concurrently. They’re all trying to get to market ahead of their competitors. Still, most are finding that the informal operational approaches they had been taking to their modest AI initiatives are ill-equipped to support distributed AI at scale.

They need a different approach.

Platform Engineering Must Move Beyond the Application Development Realm

Meanwhile, in DevOps, platform engineering is reaching critical mass. Gartner predicts that 80% of large software engineering organizations will establish platform engineering teams by 2026 – up from 45% in 2022. As organizations scale, platform engineering becomes essential to creating a more efficient, consistent, and scalable process for software development and deployment. It also helps improve overall productivity and creates a better employee experience.

The rise of platform engineering for application development, coinciding with the rise of AI at scale, presents a massive opportunity. A helpful paradigm has already been established: Developers appreciate platform engineering for the simplicity these solutions bring to their jobs, abstracting away the peripheral complexities of provisioning infrastructure, tools, and frameworks they need to assemble their ideal dev environments; operations teams love the automation and efficiencies platform engineering introduces on the ops side of the DevOps equation; and the executive suite is sold on the return the broader organization is seeing on its platform engineering investment.

Potential for similar outcomes exists within the organization’s AI operations (AIOps). Enterprises with mature AIOps can have hundreds of AI models in development and production at any time. In fact, according to a new study of 1,000 IT leaders and practitioners conducted by S&P Global and commissioned by Vultr, each enterprise employing these survey respondents has, on average, 158 AI models in development or production concurrently, and the vast majority of these organizations expect that number to grow very soon.

When bringing AIOps to a global scale, enterprises need an operating model that provides the agility and resiliency to support that volume of models. Without a tailored approach to AIOps, they risk a perfect storm of inefficiency, delays, and, ultimately, the loss of revenue, first-to-market advantages, and even crucial talent due to the impact on the machine learning (ML) engineer experience.

Fortunately, platform engineering can do for AIOps what it already does for traditional DevOps.

The time is now for platform engineering purpose-built for AIOps

Even though platform engineering for DevOps is an established paradigm, a platform engineering solution for AIOps must be purpose-built; enterprises can’t take a platform engineering solution designed for DevOps workflows and retrofit it for AI operations. The requirements of AIOps at scale are vastly different, so the platform engineering solution must be built from the ground up to address those particular needs.

Platform engineering for AIOps must support mature AIOps workflows, which vary slightly from company to company. In general, though, distributed enterprises should deploy a hub-and-spoke operating model comprising the following steps (a simplified sketch of the flow follows the list):

  • Initial AI model development and training on proprietary company data by a centralized data science team working in an established AI Center of Excellence

  • Containerization of proprietary models and storage in private model registries to make all models accessible across the enterprise

  • Distribution of models to regional data center locations where local data science teams fine-tune models on local data

  • Deployment and monitoring of models to deliver inference in edge environments
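
To make the hub-and-spoke flow concrete, here is a minimal, illustrative Python sketch of those four steps. The ModelRegistry class, its method names, the model name, and the region names are hypothetical placeholders for this article, not any specific vendor's API.

```python
# Hypothetical sketch of the hub-and-spoke AIOps workflow described above.
from dataclasses import dataclass

@dataclass
class ModelArtifact:
    name: str
    version: str
    image_uri: str  # containerized model image (step 2)

class ModelRegistry:
    """Private registry that makes hub-trained models visible to every region."""
    def __init__(self):
        self._store = {}

    def push(self, artifact: ModelArtifact) -> None:
        self._store[(artifact.name, artifact.version)] = artifact

    def pull(self, name: str, version: str) -> ModelArtifact:
        return self._store[(name, version)]

def hub_train(name: str, version: str) -> ModelArtifact:
    # Step 1: central data science team trains on proprietary data (stubbed here).
    return ModelArtifact(name, version, image_uri=f"registry.internal/{name}:{version}")

def spoke_fine_tune_and_deploy(registry: ModelRegistry, name: str, version: str, region: str) -> str:
    # Steps 3-4: regional team pulls the containerized model, fine-tunes it on
    # local data, then deploys it for inference at the edge (both stubbed).
    artifact = registry.pull(name, version)
    return f"{artifact.image_uri} fine-tuned and serving in {region}"

if __name__ == "__main__":
    registry = ModelRegistry()
    registry.push(hub_train("churn-predictor", "1.0"))       # hub: steps 1-2
    for region in ["eu-west", "ap-south", "us-east"]:         # spokes: steps 3-4
        print(spoke_fine_tune_and_deploy(registry, "churn-predictor", "1.0", region))
```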

In addition to enabling the self-serve provisioning of the infrastructure and tooling preferred by each ML engineer in the AI Center of Excellence and the regional data center locations, platform engineering solutions built for distributed AIOps automate and simplify the workflows of this hub-and-spoke operating model.

Mature AI involves more than just operational and business efficiencies. It must also include responsible end-to-end AI practices. The ethics of AI underpin public trust. As with any new technological innovation, improper management of privacy controls, data, or biases can harm adoption (user and business growth) and generate increased governmental scrutiny.

The EU AI Act, passed in March 2024, is the most notable legislation to date governing the commercial use of AI, and it is likely only the start of new regulations addressing short- and long-term risks. Staying ahead of regulatory requirements is essential not only to remain in compliance; organizations that fall out of compliance may see their business dealings impacted around the globe. As part of the right platform engineering strategy, responsible AI can identify and mitigate risks through (see the sketch after this list):

  • Automating workflow checks to look for bias and ethical AI practices

  • Creating a responsible AI “red” team to test and validate models

  • Deploying observability tooling and infrastructure to provide real-time monitoring
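
As a rough illustration of the first bullet, the sketch below shows what an automated bias check might look like as a gate in a deployment workflow. The demographic parity metric, the 0.10 threshold, and the toy data are illustrative assumptions, not a prescribed standard.

```python
# Illustrative workflow gate: fail a deployment if a simple fairness metric drifts.
def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between the groups present."""
    rates = {}
    for g in set(groups):
        selected = [p for p, grp in zip(predictions, groups) if grp == g]
        rates[g] = sum(selected) / len(selected)
    values = sorted(rates.values())
    return values[-1] - values[0]

def bias_gate(predictions, groups, threshold=0.10):
    """Raise (and so fail the workflow) if the parity gap exceeds the threshold."""
    gap = demographic_parity_difference(predictions, groups)
    if gap > threshold:
        raise RuntimeError(f"Bias check failed: parity gap {gap:.2f} exceeds {threshold:.2f}")
    return gap

if __name__ == "__main__":
    preds  = [1, 0, 1, 0, 1, 0, 0, 1]                 # model decisions (1 = approve)
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]  # protected attribute, toy data
    print(f"Parity gap within tolerance: {bias_gate(preds, groups):.2f}")
```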

Platform engineering also future-proofs enterprise AI operations

As AI growth and the resulting demands on enterprise resources compound, IT leaders must align their global IT architecture with an operating model designed to accommodate distributed AI at scale. Doing so is the only way to prepare data science and AIOps teams for success.

Purpose-built platform engineering solutions enable IT teams to meet business needs and operational requirements while providing companies with a strategic advantage. These solutions also help organizations scale their operations and governance, ensuring compliance and alignment with responsible AI practices.

There is no better approach to scaling AI operations. It’s never too early (or late) to build platform engineering solutions to pave your company’s path to AI maturity.


APM: Cutting through the noise
https://sdtimes.com/apm/apm-cutting-through-the-noise/
Thu, 01 Jul 2021

It seems like the industry is leaving application performance management (APM) behind and moving towards a new observability world. But don’t be fooled. While vendors are rebranding themselves as observability tools, APM is still an important piece of the puzzle. 

“Observability is becoming a bigger focus today, but APM just by design will continue to have a critical role to play in that. Think about observability holistically, but also understand that your applications, your user-face applications and your back-end applications are driving revenue,” said Mohan Kompella, vice president of product marketing at the IT ops event correlation and automation platform provider BigPanda.

Because of the complexity of modern applications that rely on outside services through APIs and comprise microservices running in cloud-native environments, simply monitoring applications in the traditional way doesn’t cover all the possible problems users of those applications might experience.


“What’s important,” explained Amy Feldman, head of AIOps product marketing at Broadcom, “is to be able to take a look at data from various different aspects, to be able to look at it from the traditional bytecode instrumentation, which is going to give you that deep-level transactionability back into even legacy systems like mainframe, TIBCO, or even an MQ message bus that a lot of enterprises still rely on.”

Further, as more applications are running in the cloud, Feldman said she’s seeing developers “starting to change the landscape” of what monitoring looks like, and they want to be able to have more control over what the output looks like. “So they’re relying more on logs and relying more on configuring it through APIs,” she said. “We want to be able to move from this [mindset of] ‘I’m just telling you what to collect from an industry and vendor perspective,’ to having the business be more in charge about what to collect. ‘This is the output, I want you to measure it, look at all the data and be able to assimilate that into that entire topological view.'”

APM, observability or AIOps? 

Kompella explained there’s a lot of confusion in the market today because as vendors add more and more monitoring capabilities into their solutions, APM is being blended into observability suites. Vendors are now offering “all-in-one” solutions that provide everything from APM to infrastructure, logging, browser and mobile capabilities. This is making it even harder for businesses to find a solution that works best for them because although vendors claim to provide everything you need to get a deep level of visibility, each tool addresses specific concerns.

“Every vendor has certain areas within observability they do exceedingly well and you have to be really clear about the problem you’re trying to solve before making a vendor selection. You don’t want to end up with a suite that claims to do everything, but only gives you mediocre results in the one area you really care about,” Kompella said. 

When looking to invest in a new observability tool, businesses and development teams need to ask themselves what the specific areas or technologies that they are interested in monitoring are and where they are located. Are they on-premises or are they in the cloud? “That is a good starting point because it helps you understand if you need an application monitoring tool that’s built for microservices monitoring and therefore in the cloud, or if you still have a large number of on-premise Java-based applications,” Kompella explained. 

Much of monitoring applications in the cloud is reliant upon the providers giving you the data you need. Feldman said cloud providers could give you information through an API, or deliver it through their monitoring tool. The APM solution has to be able to assimilate that information too. 

While Feldman said the cloud providers haven’t always provided all the data needed for monitoring, she believes they’re getting better at it. “There’s definitely an opportunity for improvement. And in a lot of areas, you do see APM vendors also provide their own way to instrument the cloud… being able to install an agent inside of the cloud service, to be able to give you additional metrics,” she said. “But we’re seeing, I think, a little bit more transparency than we had before in the past. And that’s because they have to be able to provide that level of service. And being able to have that trend, a little bit of transparency, helps to increase communications between the service and the provider.”

BigPanda’s Kompella said the overarching driver of monitoring is to not just “stick your finger in the wind” and decide to measure whichever way the wind blows. You really have to understand your systems to figure out what metrics are going to matter to you. One way to do that is by analyzing what is generating revenue. Kompella went on to explain that you have to look at where you’ve had outages or incidents in the last couple of months, how they’ve impacted your revenue and rating, and then that will lead you to the right type of APM or observability tools that can help you solve those problems. 

Additionally, businesses need to look at their services from the evolution of their technology stack. For instance, a majority of their applications may be on-premises today, but the company might have a vision to migrate everything to the cloud over the next three years. “You want to make sure that whatever investments you make in APM tools are able to provide you the deep visibility your team needs. You don’t want to end up with a legacy tool that solves your existing problems, but then starts to break down over the next few years,” said Kompella. “Technology leaders should judiciously analyze both what’s in the bag today versus what’s going to happen in the next few years, and then make a choice.”

Getting the big picture

Broadcom’s Feldman explained that a monitoring solution should give you perspective and context around what is happening, so having the traditional inside-out view of APM coupled with an outside-in perspective can aid in resolving issues when they arise. Such things as synthetic monitoring of network traffic, and real user monitoring of how applications are used can provide invaluable insight to an application’s performance. She also noted if the application is running in the cloud, you could use Open Tracing techniques to get things like service mesh information to understand what the user experience is for a particular cloud service.
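
For teams that want to experiment with the tracing side of this, the snippet below is a minimal sketch using OpenTelemetry (the successor to the OpenTracing project Feldman references). The span names and the attribute are illustrative; it simply nests a downstream-service span inside a user-facing span and prints both to the console.

```python
# Minimal OpenTelemetry sketch: one parent span for a user-facing action and one
# child span for a dependency call, exported to the console for inspection.
# Requires the opentelemetry-api and opentelemetry-sdk packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout") as span:          # outside-in: the user action
    span.set_attribute("user.region", "us-east")                 # illustrative attribute
    with tracer.start_as_current_span("payment-service-call"):   # inside-out: the dependency
        pass  # a real application would make the downstream call here
```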

Kompella added that log management and network performance monitoring (NPM) can help extend your monitoring capabilities. While APM tools are good at providing a deep dive of forensics or metrics, log traces help you go even deeper into what’s going on with your applications and services and help improve performance, he said.

Network performance monitoring is also extremely important because most large enterprises are working in very hybrid environments where some parts of their technology stacks live on-premises and in the private or public cloud. Additionally, applications tend to have a multi-cloud strategy and are distributed across multiple cloud providers. 

“Your technology stack is extremely fragmented and distributed across all these on-prem and cloud environments, which also means that understanding the performance of your network becomes super critical,” said Kompella. “You might have the most resilient applications or the best APM tools, but if you’re not closely understanding network traffic trends or understanding the potential security issues impacting your network, that will end up impacting your customer experience or revenue generating services.” 

What is to come? 

The reason monitoring strategies are becoming so important is because the pressure for digital transformation is just that much greater today. A recent report from management consulting company McKinsey & Company found the COVID-19 crisis has accelerated digital transformation efforts by seven years.

“During the pandemic, consumers have moved dramatically toward online channels, and companies and industries have responded in turn. The survey results confirm the rapid shift toward interacting with customers through digital channels. They also show that rates of adoption are years ahead of where they were when previous surveys were conducted,” the report stated. 

This means that the pressure to move or migrate to the cloud quickly is that much greater, according to BigPanda’s Kompella, and as a result APM solutions have to be built for the cloud. 

“Enterprises can no longer afford to look for APM tools or observability tools that just don’t work in a cloud-native environment,” he said. 

Kompella also sees more intelligent APM capabilities coming out to meet today’s needs to move to the cloud or digitally transform. He went on to explain that APM capabilities are becoming very commoditized, so the differences between vendors are getting smaller and smaller. “Getting deep visibility into your applications has been largely solved by now. Companies need something to make sense of this tsunami of APM and observability data,” he said. 

The focus is now shifting to bringing artificial intelligence and machine learning into these tools to make sense of all the data. “The better the AI or the machine learning is at generating these insights, the better it is at helping users understand how they’re generating these insights,” said Kompella.

“Every large company has similar problems, but when you start to dive in deeper, you realize that every company’s IT stack is set up a little bit differently. You absolutely need to be able to factor in that understanding of your unique topology and your unique IT stack into these machine learning models,” said Kompella.

The trouble with alerts

Alarms are a critical way to inform organizations of performance breakdowns. But alarm overload, and the number of false positives these systems kick off, has been a big pain point for those responsible for monitoring their application systems. 

Amy Feldman, head of AIOps product marketing at Broadcom, said this problem has existed since the beginning of monitoring. “This is a problem we’ve been trying to solve for at least 20 years, 20 plus years … we’ve always had a sea of alarms,” she said. “There have always been tickets where you’re not sure where the root cause is coming from. There’s been lengthy war rooms, where customers and IT shops spend hours trying to figure out where the problem is coming from.”

Feldman believes the industry is at a point now where sophisticated solutions using new algorithmic approaches to datasets have given organizations the capability to understand dependencies across an infrastructure network. Then, using causal pattern analysis, you understand the cause and effect of certain patterns that go on to be able to determine where your root cause is coming from. 

“I think we’re at a really exciting point now, in our industry, where those challenges that we’ve always seen for the last 20 years, are something that we truly can accomplish today,” she said. “We can reduce the noise inside of the Event Stream to be able to show what really has the biggest impact on your business and your end users. We’re able to correlate the data to be able to recognize and understand patterns. ‘I’ve seen this before, therefore, this problem is a recurring problem, this is how you fix the problem.'”

AI and ML are key, Feldman said. “I think APM was probably one of the first industries to kind of adopt that. But now we’re seeing that evolution take off across multiple data sets, whether that’s cloud observability data sets, networking data sets, APM data sets, or even mainframe and queuing-type information. All of that now is getting normalized and used in your experience too. So all the information coming together is giving us a great opportunity.”

AIOps enables 5-star multi-cloud apps
https://sdtimes.com/aiops/aiops-enables-5-star-multi-cloud-apps/
Thu, 01 Jul 2021

Every company is going digital today and user experience is everything. However, deployment of dynamic, hybrid cloud infrastructure and the explosion of connected devices creates a lot of challenges in monitoring performance of digital services. Therefore, organizations are still struggling to build end-to-end pipelines that help ensure their applications and the business remain available, reliable and resilient.

“Our customers are somewhere in the journey between on prem and cloud so they have a lot of distributed, multi-cloud applications. For example, if you have a retail application, the systems of engagement could be running in the cloud while the system of record, where the actual data is stored, could be running on prem deep within the data center,” said Sudip Datta, general manager and head of AIOps at Broadcom. “When you’re dealing with such complex distributed applications, managing and monitoring those applications becomes problematic. So, the more automation you have, the better.”

What Is AIOps?

AIOps operationalizes AI in IT. In the era of digital transformation, it is an important link in the overall BizOps chain because it connects business outcomes to the software delivery chain (governed by DevOps).

“Companies have to stay on top of their digital services to make sure that 100% of their customers are satisfied 100% of the time. At the same time, they have to deal with this complexity of cloud and on prem, and with a continuously evolving infrastructure and network. Especially when you consider ephemeral assets like containers, it’s not possible to keep pace with a rule-based approach,” said Datta. “That’s why we have AIOps.”

Essentially, AIOps helps ensure that companies can automatically find and fix application issues before customers notice them or at least shorten meantime to resolution (MTTR) if a noticeable problem occurs.

Observability is Important

Achieving five-nines of service level requires observability, which is the ability to observe outputs, and gain insights from them. This capability is extremely crucial for developers who are working with cloud-native, containerized architectures. Simply monitoring the environment to keep the lights on isn’t enough because the intelligence is limited: it only says whether a component, network or server is up or down.

“It’s not about collecting data, it’s about connecting data to glean insights out of it,” said Datta. “When you are dealing with a lot of components in a distributed, multi-cloud world, you need to connect topology data, metric data, unstructured data logs and traces to glean insights about what is really happening. With AIOps and the observability it provides, you can ideally predict problems before they happen, and in case they do, determine the root cause of the problem and automate the remediation.”

Why SREs are Critical

Site Reliability Engineers (SREs) are administrators with full-stack competencies who keep digital services running at peak performance. Today, most digitally progressive enterprises employ SREs for their mission-critical services.

“If you’re a bank or a retailer offering a bunch of consumer-facing services, who is responsible for the upkeep of the services?” said Datta. “You need a very specialized skillset with deep understanding of the architecture, because slow is the new ‘down.’ They have to be full-stack engineers. And they have to be equipped with the right tool that can track Service Level Objectives (SLOs) and the underlying Service Level Indicators (SLIs).”
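
To make the SLO/SLI language concrete, here is a minimal sketch of an availability SLI checked against an SLO. The 99.9% target and the request counts are illustrative assumptions, not figures from the article.

```python
# Illustrative SLO tracking: compute an availability SLI and the error budget left.
def availability_sli(successful_requests: int, total_requests: int) -> float:
    return successful_requests / total_requests

def error_budget_remaining(sli: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget left; negative means the SLO is already blown."""
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure

if __name__ == "__main__":
    sli = availability_sli(successful_requests=999_400, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}, error budget remaining: {error_budget_remaining(sli):.1%}")
```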

AIOps Speeds Issue Resolution

AIOps helps reduce the noise associated with issue resolution. As applications become more distributed and complex, the number of tools used to manage applications, networks and infrastructure grows so it may not be clear whether a service outage was caused by the network or the application. Datta said an average enterprise’s tech stack generates 5,000 to 10,000 alarms per day or more. AIOps uses natural language processing (NLP) and clustering technologies to reduce alarm noise by as much as 90%, giving developers and IT more time to deliver actual value.
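
As a rough illustration of how clustering collapses that alarm volume, the toy sketch below fingerprints alert messages and buckets them into time windows. Production AIOps platforms use NLP and learned models rather than a hand-written regular expression; the alarm texts and the five-minute window here are made up for illustration.

```python
# Toy alarm clustering: near-duplicate alerts in a short window collapse into one incident.
import re
from collections import defaultdict

def fingerprint(message: str) -> str:
    # Strip numbers and host suffixes so "CPU 97% on web-03" and "CPU 99% on web-07" match.
    return re.sub(r"\d+|\bweb-\S+", "*", message.lower()).strip()

def cluster_alarms(alarms, window_seconds=300):
    """alarms: list of (timestamp_seconds, message). Returns clustered incidents."""
    clusters = defaultdict(list)
    for ts, msg in sorted(alarms):
        clusters[(fingerprint(msg), ts // window_seconds)].append((ts, msg))
    return list(clusters.values())

if __name__ == "__main__":
    raw = [
        (10, "CPU 97% on web-03"),
        (45, "CPU 99% on web-07"),
        (60, "Disk latency high on db-01"),
    ]
    incidents = cluster_alarms(raw)
    print(f"{len(raw)} alarms reduced to {len(incidents)} incidents")
```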

“Customers joke about having a meantime to innocence – the time it takes to prove that it’s not my problem, and those responsibility debates are costing them four to five hours,” said Datta. “With AI and ML, we can determine the root cause or the probable root cause of a problem and fix it faster. The whole thing is about accelerating remediation and predicting problems before they happen.”

Developers and IT should understand which technology assets make up a business service and which business services have the highest priority so they can focus their efforts accordingly. In addition, it’s important to know the relationship of individual technology assets, such as what application connects to which database and what database connects to which network.
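
One simple way to picture those relationships is as a dependency graph that can be walked during triage to see what a failing asset actually touches. The sketch below is illustrative only; the service names are invented, and real platforms discover this topology automatically.

```python
# Illustrative topology map: which assets a service depends on, directly or transitively.
from collections import deque

TOPOLOGY = {
    "checkout-app": ["orders-db", "payments-api"],
    "orders-db": ["storage-network"],
    "payments-api": ["storage-network"],
    "storage-network": [],
}

def downstream_of(asset):
    """Return every asset the given asset depends on, directly or transitively."""
    seen, queue = set(), deque([asset])
    while queue:
        for dep in TOPOLOGY.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

if __name__ == "__main__":
    print("checkout-app depends on:", sorted(downstream_of("checkout-app")))
```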

“It’s all about the data, and the ability to deal with the volume, velocity, variety and veracity,” said Datta. “What’s also critical for AIOps is making sure your solution is open so it can connect with the peer disciplines such as DevOps and BizOps. AIOps isn’t a nice to have, it’s a must have, especially in the modern digital era.”

Learn more at Broadcom’s Sept. 28 AIOps event.

Content provided by Broadcom.

Industry Watch: BizOps — Bridging the age-old divide
https://sdtimes.com/softwaredev/industry-watch-bizops-bridging-the-age-old-divide/
Fri, 10 Jul 2020

Since my introduction into the software development industry in 1999, there has been one theme underlying all our coverage of tools, processes and methodologies: Getting business and IT to work closer together.

At first, this divide was chalked up to the fact that the two sides did not speak the same language. The business side didn’t understand what was involved in producing an application, and the developers created applications based on unclear or imprecise requirements, which led to a lot of finger-pointing, misunderstanding and more bad communication.


Today, as businesses go digital and releasing software no longer just supports the business but actually drives it, the need for the two sides to come together has never been greater. The need to engage and retain customers through software is a high-stakes effort.

There are now common collaboration tools that both sides can use for project tasks, progress tracking and more. A recent conversation I had with Serge Lucio, the general manager of the enterprise software division at Broadcom, led to a discussion of what the company is calling ‘digital BizOps.’

This concept links the planning, application delivery and IT operations at organizations by the use of tools that ingest data across the entire spectrum to provide a 360-degree view of what’s being produced and if it aligns with business goals. It uses automation to make decisions along the way that drive value.

Broadcom, Lucio said, is looking to deliver insights for the different parts of the organizations. For application delivery, Broadcom wants to give teams the ability to “release with confidence. That is, I have a release that’s ready to be deployed to production. Is it really ready? You have tests, you may have security violations. These are the numbers of data points that release engineers and operations teams are looking at to decide if it’s ready to go into production or not.”

At the planning level, where corporate higher-ups are deciding what the business needs, the questions are, is the release strategically aligned with business goals? Are we going to deliver on-time, and on-budget? They need data, for planning and investment management purposes, to ultimately see if what is in the works matters most to the business.

Then, Lucio explained, from the perspective of IT operations management, they need to triage problems to see which are most impactful on the business, and which are the highest priority to the business to resolve.

Some would say this looks a lot like AIOps. Others see some elements of value stream in this approach. Tom Davenport, a distinguished professor of information technology and management at Babson College who has spoken with Broadcom about BizOps, says the concept seems more aspirational than technological or methodological at this point, but has the goal of automating business decision-making. “I’ve done some work in that area in the past, and I do think that’s one way to get greater alignment between IT and business people,” Davenport told me. “You just automate the decision and take it out of the day-to-day hands of the business person, because it’s automated. But that’s a way you’re at least ensuring there’s a close tie between the information and the analytics and the actual decision, because it’s baked in.”

Davenport can see organizations automating decisions in areas such as human capital management. “You have more and more data now out of these HCM systems and you’re starting to see some recommendations about, ‘You should hire this person because they’re likely to be a high performer,’ based on some machine learning analysis of people we’ve hired in the past who’ve done really well. Or, they’re likely to leave the organization, so if you want to keep them, you might want to approach them with an intervention, an offer or enticement to keep them on board.”

Some of that already is being done in marketing. “A lot of that, the offers that get made to customers, particularly the online ones — are almost all automated now,” Davenport explained.

Davenport pointed to an example of a casino hotel operator who had automated pricing on rooms, but that often was overridden by front-desk staff. So, they did a test. He explained: “Let’s see, do we make more money when we allow the front desk to override, and is there any implication for customer satisfaction? Or do we make more money with fully automated decisions? And it turns out, the automated decisions were better.”

There is some risk involved in letting machines make business decisions. But Davenport said if those decisions are driven by data, it’s not as risky as it might seem. “Use data to make the decision,” he said, “and use data to test whether it has a better outcome or not.”

Digital experience monitoring: An outside-in view
https://sdtimes.com/monitor/digital-experience-monitoring-an-outside-in-view/
Mon, 04 May 2020

Gartner research describes three things that are required for a solution to be categorized as application performance monitoring: application discovery, diagnostics and tracing; data analysis; and digital experience monitoring.

Digital experience monitoring, or DEM as it is sometimes called, is different from the other types of monitoring because it takes an outside-in view of the application, looking at what the customer or end user is seeing to see if the application is delivering what the customer wants and expects.


In a world where businesses are transforming into digital operations, knowing what the customer is seeing is most critical, because a poor experience will impact a company’s reputation and bottom line.

“Experience management is the ultimate goal of monitoring,” said Mehdi Daoudi, founder and CEO of DEM platform provider Catchpoint. “That’s the only way to know that you’re doing something that matters. The CEO is not going to care how many of your CPUs are in use 90 percent of the time.”  

Two critical factors in digital experience monitoring are where you monitor from, and what you look for. You have to monitor where the user is. “It’s even more important today with this virus crisis than ever before, because the amount of people, where they are, has just exploded,” Daoudi said. With more people working remotely than ever, it’s important to know what the experience is like for the user base. So issues like reachability are important. Can the network handle the increased loads? Can the ISP deliver transactions to the customer?  “If you monitor from your data center, you’re blind to the internet, you’re blind to geographic latency,” Daoudi said.

The other critical factor is what to monitor, and Daoudi said you have to monitor the end result. “Imagine if you had a restaurant today, and you’re having a critic come to your restaurant. He’s not going to go into the kitchen to see how you prepare the food, he’s going to order the soup, or salad, a main dish and a dessert, and he’s going to taste it. Did it taste good? And he’s going to look at the presentation. The plate is clean and it looks good. And they charge $50 for that dish. Is it worth it? It’s about the end result.”

Catchpoint takes a two-pronged approach to gather data on the user experience. One, Daoudi said, is the synthetic approach. Catchpoint has robots in 850 locations that simulate the user running in the last mile, and pays people at home to host a Catchpoint device, which provides the digital ‘mystery’ shopper experience. Catchpoint also has its robots run in data centers, “because it gives you that clean lab environment view,” Daoudi explained. “You can see, OK, I’m in Los Angeles, connected to Comcast, how does it look to try to launch a Zoom conference? How does it look to make a DNS query to Akamai? How does it look to load Salesforce.com?”
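
A stripped-down version of such a synthetic probe can be written with nothing but the Python standard library, as sketched below. The target URL is a placeholder, and a real synthetic monitoring network would run checks like this from many geographic vantage points and feed the timings into a backend; this is not Catchpoint's implementation.

```python
# Illustrative synthetic check: time DNS resolution and a full page fetch.
import socket
import time
import urllib.request

def synthetic_check(url, host):
    t0 = time.perf_counter()
    socket.getaddrinfo(host, 443)              # DNS resolution timing
    t1 = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = resp.status
        resp.read()                            # download the full response body
    t2 = time.perf_counter()
    return {"dns_ms": round((t1 - t0) * 1000, 1),
            "fetch_ms": round((t2 - t1) * 1000, 1),
            "status": status}

if __name__ == "__main__":
    # www.example.com is a placeholder target for illustration.
    print(synthetic_check("https://www.example.com/", "www.example.com"))
```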

The second part is having customers put beacons on their web pages and web applications to provide a much larger data set, allowing Catchpoint to see “literally every single human transaction,” Daoudi said. Returning to the restaurant example, the ‘mystery shoppers’ are like the food critic, tasting the food. The beacons provide the look into the kitchen, to see how things are running, and how the food is being prepared.

In beta is a solution for employee experiences, with robots installed in the offices of Catchpoint customers and a browser application endpoint that customers install. Daoudi explained: “And so now, we’re seeing what Dave is doing, what John is doing, and we can see, who’s having a bad day with Salesforce? Is it just Dave because he’s on Long Island? Are other things happening on Long Island?  Is the entire East Coast having problems with Zoom? We’re trying to get that visibility from as many perspectives as we can.” 

The point of all this, he said, is to be customer-centric. It means IT working as a partner with the business. “Monitoring for the sake of monitoring is a stupid thing; it’s a waste of money,” Daoudi said. “If you go and spend a gazillion dollars on an APM tool or something, and you are not able to answer, ‘why is my business impacted in Germany,’ then it’s a waste of time and money. It’s important for the IT leaders to make this great transformation.” 

APM, AIOps and Observability
https://sdtimes.com/monitor/apm-aiops-and-observability/
Mon, 04 May 2020

Monitoring your applications comes in many forms. There’s traditional application performance management, which begat AIOps, which begat observability.

But are there really any differences? If so, where are they? Some believe these are marketing terms used to differentiate tools. Others point to it as more of an evolution of monitoring. All that said, the performance of your application can be critical to your organization’s bottom line — whether that’s represented financially, or by increased membership, or the number of users on your site.

The key reason for the changes in monitoring is that software and IT architectures are much more distributed. Monolithic applications and architectures are being rewritten as microservices and distributed in the cloud. This, now, requires automation and many companies are also adding machine learning to help in the decision-making process. These two properties define AIOps, though traditional APM vendors are adding automations to their solution sets.


Stephen Elliot, program director of I&O at research firm IDC, said thinking back 10 years ago, application performance was very much about the application itself, specific to the data tied to that app, and it was a silo. “I think now that one of the big differences is not only do you have to have that data, but it’s a much broader set of data — logs, metrics, traces — that is collected in near real-time or real-time with streaming architectures,” Elliot said.

 “Customers now have broadened out what they expect in terms of observability versus the traditional APM view,” he continued. “And they’re increasingly expecting more integration points to collect different pieces of data, and some level of analytics that can drive root cause, pattern matching, behavioral analysis, predictive capabilities, to then sort of filter up, ‘Here’s where the problem might be, and maybe, here’s what you should do to fix it.’ They need a lot broader set of data that they trust, and they need to see it in their own context, whether it’s a DevOps engineer, a site reliability engineer, cloud ops, platform engineers.” 

AIOps not only looks at the application itself, it takes into account the infrastructure — how the cloud is performing, how the network is performing. The intelligence part comes in where you can train the system to reconfigure itself to accommodate changing loads, to provision storage as needed for data, and the like.

But, before declaring the holy grail of monitoring has been found, Gartner research director Charley Rich cautioned, “Just be aware … APM is a very mature market. In terms of our hype cycle, it’s way past the bump in hype and moving into maturity. AIOps, on the other hand, is just climbing up the mountain of hype. Very, very different. What that means in plain English is that what’s said about AIOps today is just not quite true. You have to look at it from the perspective of maturity.”

What is not quite true about AIOps?

“Oh, that it just automatically solves problems,” said Rich, who is the lead author on Gartner’s  APM Magic Quadrant, as well as the lead author on the analysis firm’s AIOps market guide. “A number of vendors talk about self-healing. There are zero self-healing solutions on the market. None of them do it. You know, you and I go out and have a cocktail while the computer’s doing all the work. It sounds good; it’s certainly aspirational and it’s what everyone wants, but today, the solutions that run things to fix are all deterministic. Somewhere there’s a script with a bunch of if-then rules and there’s a hard-coded script that says, ‘if this happens, do that.’ Well, we’ve had that capability for 30 years. They’re just dressing it up and taking it to town.”

But Rich emphasized he didn’t want to be dismissive of the efforts around AIOps. “It’s very exciting, and I think we’re going to get there. It’s just, we’re early, and right now, today, AIOps has been used very effectively for event correlation — better than traditional methods, and it’s been very good for outlier and anomaly detection. We’re starting to see in ITSM tools more use of natural language processing and chatbots and virtual support assistants. That’s an area that doesn’t get talked about a lot. Putting natural language processing in front of workflows is a way of democratizing them and making complex things much more easily accessible to less-skilled IT workers, which improves productivity.”

Indeed, many organizations today are gaining greater event detection, correlation and remediation through the use of AI and machine learning in their monitoring. But to achieve that, organizations have to rethink the tools they use and the way they monitor their systems.

Is AIOps a better way to do things? Machine learning makes monitoring tools more agile, and through self-learning algorithms they can self-adjust, but that doesn’t necessarily make them AIOps solutions, Rich said. 

“Everybody’s doing this,” he pointed out. “We in the last market guide segmented the market of solutions into domain-centric and domain-agnostic AIOps solutions. So domain-centric might be an APM solution that’s got a lot of machine learning in it but it’s all focused on the domain, like APM, not on some other thing. Domain-agnostic is more general-purpose, bringing in data from other tools. Usually a domain-agnostic tool doesn’t collect, like a monitoring tool does. It relies on collectors from monitoring tools. And then, at least in concept, it can look across different data streams, different tools, and come up with a cross-domain analysis. That’s the difference there.”

Changing cultures

One of the things pundits tell us is required to implement many new technologies is a change in culture, as if that were as simple as changing a pair of socks. Often, when they talk about culture change, they are really talking about learning new skills and reorganizing teams, not really changing the way they work.

In the case of monitoring, Joe Butson, co-founder of consulting company Big Deal Digital, sees automation in AIOps enabling a shift from the finger-pointing often associated with incidents to a healthier acceptance that problems are going to happen.

“One of the things about the culture change that’s underway is one where you move away from blaming people when things go down to, we are going to have problems, let’s not look for root cause analysis as to why something went down, but what are the inputs? The safety culture is very different. We tended to root cause it down to ‘you didn’t do this,’ and someone gets reprimanded and fired, but that didn’t prove to be as helpful, and we’re moving to a generative culture, where we know there will be problems and we look to the future.”

AIOps is driving this organizational culture change by adding automation to their systems, which allows companies to respond in a proactive way rather than reacting to incidents as they occur, Butson explained.

“It’s critical to success because you can anticipate a problem and fix it. In a perfect world, you don’t even have to intervene, you just have to monitor the intervention and new resources are being added as needed,” he said. “You can have the monitoring automated, so you can auto-scale and auto-descale. You’d study traffic based on having all this data, and you are able to automate it. Once in a while, something will break and you’ll get a call in the middle of the night. But for the most part, by automating it, being able to take something down, or roll back if you’re trying to bring out some new features to the market and it doesn’t work, being able to roll back to the last best configuration, is all automated.”

Further, Butson said, machine learning empowers organizations to remove the human component “to a very large case.” Humans will still be reviewing the assumptions made behind the automations, but, he noted, “Every month, every year, machine learning is taking out more of the guesswork because of the data.”

Similarly, application monitoring runs on a certain set of assumptions as to how an application should behave, how the network should perform, and other metrics the organization deems as critical. So you have those assumptions, but then, Butson asked, how do you deal with the anomalies? “You prepare for the anomalies,” he said, “and that’s a different kind of culture for all of us.”

The human element

Gartner’s Rich said what people want is for the algorithms to adapt to what you’re doing and analyze the current situation. This, he said, is a legitimate want, but no one really has it yet. “It’s very hard because you don’t get all the signals that say, ‘this is the problem.’ You have to infer that from a lot of data, and then look at the past and look at topology, and come up with, ‘this is the best solution we can recommend based on what you’ve used,’ then run the solution, rate itself and improve it the next time. That cycle of continuous improvement is just not there.”

Further, he said, as you think about it, would you want a machine to do that? Risk is the key determining factor in how much automation organizations will enable. “If it’s a password change, someone says they want to update their password, sure, the machine can do that. If the solution is ‘boot the server,’ like we do with Windows, or start up a new container, there’s no risk. But if the solution is ‘reconfigure the commerce server,’ and the downside might be we can’t book any orders today, would you want the machine doing that? No.”

IDC’s Elliot said it’s a matter of trust. Teams have to trust that the algorithm is going to do what it says it’s going to do correctly. “You can see the aspirations, and some tools emerging that are driving automated decision-making. For example, resizing a reserve instance on AWS, or shutting a reserve instance down, or potentially maybe moving storage automatically. There are different tasks that can be done automated that can be trusted and can be executed via policy. We are seeing that, and expect more of it down the road as customers get comfortable with replacing some of these manual tasks with automated, event-driven decision-making.” 

Moving to AIOps won’t be quick; often it is a multi-year process, and every company will move at their own pace and scale. But automation is here and will only get better. “Even the public cloud providers really see automation as a way to differentiate their own platforms. And that’s pretty critical when customers hear, ‘You can put certain types of workloads on our platform, and because you’re using our platform, we are embedding automated capabilities onto those workloads.’”
