SRE Archives - SD Times

Are developers and DevOps converging?

Eric Futoran — Fri, 14 Jun 2024 14:56:49 +0000

Are your developers on PagerDuty? That’s the core question, and for most teams the answer is emphatically “yes.” This is a huge change from a few years ago when, unless you did not have DevOps or SRE teams, the answer was a resounding “no.”

So, what’s changed?

A long-term trend is happening across large and small companies, and that is the convergence of developers, those who code apps, and DevOps, those who maintain the systems on which apps run and developers code. There are three core reasons for this shift – (1) transformation to the cloud, (2) a shift to a single store of observability data, and (3) a focus of technical work efforts on business KPIs.

The impending impact on DevOps in terms of role, workflow, and alignment to the business will be profound. Before diving into the three reasons, why should business leaders care?

The role of DevOps and team dynamics – The lines are blurring between traditionally separate teams as developers, DevOps, and SREs increasingly collide. The best organizations will adjust team roles and skills, and they will change workflows to more cohesive approaches. One key way is via communicating around commingled data sets as opposed to distinct and separate vendors built and isolated around roles. While every technical role will be impacted, the largest change will be felt by DevOps as companies redefine its role and the mentalities that are required by its team members going forward.

Cost efficiency – As organizations adjust to the new paradigm, their team makeup must adjust accordingly. Different skills will be needed, different vendors will be used, and costs will consolidate.

Culture and expectations adaptation – Who will be on call with PagerDuty? How will the roles of DevOps and SREs change when developers can directly monitor, alert, and resolve their own questions? What will the expectation of triage be when teams are working closer together and focused on business outcomes rather than uptime? DevOps will not just be setting up vendors, maintaining developer tools, and monitoring cloud costs.

Transformation to the cloud

This is a well-trodden topic, so the short story is… Vendors would love to eliminate roles on your teams entirely, especially DevOps and SREs. Transformation to the cloud means everything is virtual. While the cloud is arguably more immense in complexity, teams no longer deal with physical equipment that literally requires someone onsite or in an office. With virtual environments, cloud and cloud-related vendors manage your infrastructure, vendor setup, developer tooling, and cost measures… all of which have the goals of less setup and zero ongoing maintenance.

The role of DevOps won’t be eliminated… at least not any time soon, but it must flex and align. As cloud vendors make it so easy for developers to run and maintain their applications, DevOps in its current incarnation is not needed. Vendors and developers themselves can support the infrastructure and applications respectively.

Instead, DevOps will need to justify their work efforts according to business KPIs such as revenue and churn. A small subset of the current DevOps team will have KPIs around developer efficiency, becoming the internal gatekeeper to enforce standardization across your developers and the entire software lifecycle, including how apps are built, tested, deployed, and monitored. Developers can then be accountable for the effectiveness and efficiency of their apps (and underlying infrastructure) from end-to-end. This means developers – not DevOps – are on PagerDuty, monitor issues across the full stack, and respond to incidents.

Single store of observability data

Vendors and tools are converging on a single set of data types. Looking at the actions of different engineering teams, efforts can easily be bucketed into analytics (e.g., product, experience, engineering), monitoring (e.g., user, application, infrastructure), and security. What’s interesting is that these buckets currently use different vendors built for specific roles, but the underlying datasets are quickly becoming the same. This was not true just a few years ago.

Observability means collecting *all* the unstructured data that’s created within applications (whether server-side or client-side) and the surrounding infrastructure. While the structure of this data varies by discipline, it is always transformed into four forms – metrics, logs, traces, and, more recently, events.

Current vendors generally think of these four types separately, with one used for logs, another for traces, a third for metrics, and yet another for analytics. However, when you combine these four types, you create the underpinnings of a common data store. The use cases of these common data types become immense because analytics, monitoring, and security all use the same underlying data types and thus should leverage the same store. The question is then less about how to collect and store the data (which is often the source of vendor lock-in), and more about how to use the combined data to create analysis that best informs and protects the business.
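To make the idea of a common data store concrete, here is a minimal sketch in Python. It assumes a single record shape for all four types and an in-memory store; the names `Signal`, `SignalStore`, and the sample services are hypothetical and do not reflect any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Signal:
    """Hypothetical common envelope for all four observability types."""
    kind: str          # "metric", "log", "trace", or "event"
    source: str        # emitting app or service (illustrative names below)
    timestamp: float   # seconds since epoch
    body: dict[str, Any] = field(default_factory=dict)


class SignalStore:
    """In-memory stand-in for a shared observability store (assumption)."""

    KINDS = {"metric", "log", "trace", "event"}

    def __init__(self) -> None:
        self._signals: list[Signal] = []

    def add(self, signal: Signal) -> None:
        if signal.kind not in self.KINDS:
            raise ValueError(f"unknown signal kind: {signal.kind}")
        self._signals.append(signal)

    def query(self, source: str, start: float, end: float) -> list[Signal]:
        """Return every signal from one source in [start, end], oldest first."""
        hits = [
            s for s in self._signals
            if s.source == source and start <= s.timestamp <= end
        ]
        return sorted(hits, key=lambda s: s.timestamp)


store = SignalStore()
store.add(Signal("log", "checkout", 12.0, {"message": "payment timeout"}))
store.add(Signal("metric", "checkout", 10.0, {"name": "latency_ms", "value": 950}))
store.add(Signal("trace", "search", 11.0, {"span": "query"}))

# One query returns metrics and logs for the same service on one timeline.
timeline = store.query("checkout", 0.0, 20.0)
```

Because every record shares one shape, analytics, monitoring, and security questions can all be phrased as queries against the same store rather than against separate vendor silos.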

The convergence between developers and DevOps teams – and in this case eventually product as well – is that the same data is needed for all their use cases. With the same data, teams can increasingly speak the same language. Workflows that were painful before now become possible. (There’s no more finger-pointing between DevOps and developers.) The work efforts become more aligned around what drives the business and less about what each separate vendor tells you is most important. The roles then become blurred instead of having previously clean dividing lines.

Focus of work efforts on business KPIs

Teams are increasingly driven by business goals and the top line. For DevOps, the focus is shifting from the current low bar of uptime and SLAs to those KPIs that correlate to revenue, churn, and user experience. And with business alignment, developers and DevOps are being asked to report differently and to justify their work efforts and prioritization.

For example, one large Fortune 500 retailer has monthly meetings across their engineering groups (no product managers included). They review the KPIs on which business leaders are focused, especially top-line revenue loss. The developers (not DevOps) select specific metrics and errors as leading indicators of revenue loss and break them down by type (e.g., crashes, error logs, ANRs), user impact (e.g., abandonment rate), and area of the app affected (e.g., startup, purchase flow).

Notice there’s no mention of DevOps metrics. The group does not review the historically used metrics around uptime and SLAs because those are assumed… and are not actionable to prioritize work and better grow the business.

The goal is to prioritize developer and DevOps efforts to push business goals. This means engineering teams must now justify work, which requires total team investment into this new approach. In many ways, this is easier than the previous methodology of separately driving technical KPIs.

DevOps must flex and align

DevOps is not disappearing altogether, but it must evolve alongside the changing technology and business landscapes of today’s business KPI-driven world. Those in DevOps adapted to the rapid adoption of the cloud, and must adapt again to the fact that technological advancements and consolidation of data sources will impact them.

As cloud infrastructures become more modular and easier to maintain, vendors will further force a shift in the roles and responsibilities of DevOps. And as observability, analytics, and security data consolidates, a set of vendors will emerge – looking at Databricks, Confluent, and Snowflake – to manage this complexity. Thus, the data will become more accessible and easier to leverage, allowing developers and business leaders to connect the data to the true value – aligning work efforts to business impact.

DevOps must follow suit, aligning their efforts to goals that have the greatest impact on the business.

The post Are developers and DevOps converging? appeared first on SD Times.

Gremlin launches new tool for discovering common hidden risks in software

Jenna Barron — Wed, 30 Aug 2023 14:57:16 +0000

Detected Risks can help SREs find and fix common hidden risks that could impact reliability. It flags things that could potentially be failure points, and offers recommendations on how to resolve them.

According to Gremlin, the hope with this new solution is that it empowers companies to transition from “reactive problem-solving to proactive risk mitigation.”

Gremlin analyzed data from tens of thousands of systems that use Gremlin in order to determine the most common risks. For example, 26% of deployments had zero redundancies and 80% didn’t have two redundancies configured.

These are some of the possible issues that Detected Risks would flag. It also tests for common misconfiguration problems, including missing Kubernetes liveness probes and misconfiguration of autoscaling.
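As an illustration of what a missing-liveness-probe check might look like, the sketch below scans a Kubernetes Deployment manifest that has already been parsed into a Python dictionary. This is not Gremlin's implementation; the function name and the sample manifest are hypothetical, and only the standard `spec.template.spec.containers[].livenessProbe` field path is taken from Kubernetes itself.

```python
def containers_missing_liveness_probe(deployment: dict) -> list[str]:
    """Return the names of containers that declare no livenessProbe."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    return [
        container.get("name", "<unnamed>")
        for container in pod_spec.get("containers", [])
        if "livenessProbe" not in container
    ]


# Illustrative manifest (hypothetical service and image names).
manifest = {
    "kind": "Deployment",
    "metadata": {"name": "shop"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "shop-web:1.0",
                        "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
                    },
                    {"name": "worker", "image": "shop-worker:1.0"},
                ]
            }
        }
    },
}

missing = containers_missing_liveness_probe(manifest)
```

Here only the `worker` container would be flagged, since `web` already declares a probe.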

“Reliability continues to grow in importance,” said Kolton Andrus, CTO and founder of Gremlin. “Our digital infrastructure is as important as our physical infrastructure. Government, healthcare, transportation, communication and finance all rely on this digital foundation, and it has risks. Fortunately, many of these risks are simple to mitigate—if they are known. That is why we are excited to announce our new Detected Risks. We have worked hard to quickly expose serious issues within our customers’ systems, risks that they can then mitigate to qualitatively improve the posture of their systems.”

Detected Risks is now generally available to all Gremlin users.

The challenges with platform engineering don’t have to do with engineering

Jakub Lewkowicz — Thu, 01 Jun 2023 15:43:24 +0000

Platform engineering has become increasingly important for businesses as platforms have become more complex, spanning DevOps tools, APIs, and other components necessary for effective software development. It’s a delicate balancing act as developers have been calling for more simplified navigation throughout an organization’s platform.

According to a whitepaper by Humanitec, just five years ago, platform engineering was not a thing people talked about.

In the last decade, the concept of DevOps was all people thought about, ever since Werner Vogels remarked “you build it, you run it” at an AWS launch in 2006.

This shift to DevOps caused a notable shift left in roles, where developers are now responsible for more aspects of an application’s life cycle and delivery workflow, all while the industry moved to more complex microservice architectures and technologies like Kubernetes, GitOps, and Infrastructure as Code (IaC), the report added.

“Platform engineering emerged in response to the increasing complexity of modern software architectures. Today, non-expert end users are often asked to operate an assembly of complicated arcane services,” said Paul Delory, vice president analyst at Gartner. “To help end users, and reduce friction for the valuable work they do, forward-thinking companies have begun to build operating platforms that sit between the end user and the backing services on which they rely.”

The more successful engineering organizations invested in building Internal Developer Platforms (IDP) resulting in better performance on all DORA metrics. The IDP is the sum of all tech and tools that a platform engineering team binds together to pave a golden path or paths, according to Humanitec.

Gartner forecasts that by 2026, the majority of software engineering firms (80%) will have created platform teams to supply mutually accessible services, components, and tools for application delivery. This will ultimately address the key challenge of collaboration between software developers and operators.

Lack of structural discipline is causing many platform problems today

The default assumptions and behaviors around what came before platforms were basically very bureaucratic or consultative, but not in a good way, according to Charles Betz, vice president and research director at Forrester.

“These groups have always been saying, well, if you need a computer, a virtual machine, or some compute resources or a database, you really need to have a server engineer and database administrator there to make sure you don’t get yourself into trouble. We’re going to be intimately involved in your technical design,” Betz explained. “We have overloaded people and it’s going to take long and unpredictable amounts of time for us to give you the designs and the analysis you need. Then oh, by the way, no, you can’t have access to any of the resources you need until we approve. And this is what leads developers to really dislike classic, traditional IT.”

However, Betz said that he also sympathizes with the IT side of things because IT professionals have typically been overloaded. They are often subject to unreasonable demands, and there have been histories of people developing systems with horrendous architectures, and then insisting that the infrastructure group take production ownership.

“The reality is that the long periods of delay that the classic infrastructure group imposed simply is unacceptable in the modern Agile world,” Betz said.

He added that people love to point to companies such as Netflix to try to mimic their way of being Agile, but those are the companies that are highly disciplined from a platform engineering perspective.

“Netflix did not have autonomous product teams at a low granular level choosing willy nilly what products to use. That’s one of the myths that we often have to deconstruct when we come to a client. People say, well, all the product teams insist that they’re autonomous, and they can do whatever they want. And they point to Agile and various things. I’m like, you know, that’s actually not how it happened,” Betz said.

The companies that succeeded imposed architectural discipline although they might have not called it that. But the bottom line is that they didn’t tolerate a lot of unmanaged variability and sprawl and that’s how they managed to be successful, according to Betz.

Humanitec’s DevOps Benchmarking Study 2023 found that giving DevOps tasks to developers as a way of implementing self-service is often executed poorly in many organizations. An IDP can enable developers to have greater self-service capabilities with the ability to spin up environments, deploy, roll back, and make changes to the architecture without relying on Ops.

It is important, however, to keep the relationship between Ops and developers close while maintaining a separation of concerns. Through this, both teams can work together while having distinct roles.

The Humanitec report indicated that successful teams manage their app configurations across the entire organization in a standardized way. They also handle app configurations and infrastructure dependencies in the same manner, knowing how to distinguish between environment-specific and environment-agnostic configurations. These teams are more efficient at creating new environments and are able to provide more self-service when it comes to deployments, provisioning infrastructure, and assigning the infrastructure.
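The split between environment-agnostic and environment-specific configuration can be sketched as a shared base layered with per-environment overrides. The example below is an assumption for illustration only; the keys, values, and the `render_config` helper are hypothetical and are not taken from the Humanitec report.

```python
# Environment-agnostic settings: identical wherever the app runs.
BASE_CONFIG = {
    "image": "shop-api",
    "port": 8080,
    "log_format": "json",
}

# Environment-specific settings: only what differs per environment.
ENV_OVERRIDES = {
    "development": {"replicas": 1, "database_url": "postgres://dev-db/shop"},
    "production": {"replicas": 3, "database_url": "postgres://prod-db/shop"},
}


def render_config(environment: str) -> dict:
    """Merge the shared base with one environment's overrides."""
    if environment not in ENV_OVERRIDES:
        raise KeyError(f"no overrides defined for environment: {environment}")
    # Later keys win, so overrides take precedence over the base.
    return {**BASE_CONFIG, **ENV_OVERRIDES[environment]}


dev = render_config("development")
prod = render_config("production")
```

With this layout, creating a new environment means adding one small overrides entry rather than copying and editing the whole configuration.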

The path forward for platform engineering isn’t about changing the engineering

As a whole, platform engineering is now undergoing a lot of experimentation and there isn’t an agreed-upon best practice out there yet. However, the biggest challenges facing platform engineering don’t always deal with the “engineering” part at all, according to Forrester’s Betz.

“When I went to the DevOps Enterprise Summit last fall, I talked to as many people as I could who were having success in platform engineering, and the one thing they all had in common was they were figuring out some way to bring product management principles to platform engineering,” Betz said.

Platforms are products, according to the authors of Team Topologies, which suggests that the principles of organization, self-correction, monitoring environment, and setting standards are applied to the team.

The challenge is that platform teams are now being created from the existing, outdated IT and operations organizations. These organizations are best suited for engineering purposes but are not necessarily well-equipped for the other tasks of a platform team.

The problems are not in feasibility, which still continues to require high levels of engineering excellence, but rather in adding value, viability, and usability, according to Betz’s article, “Platform Product Management Versus Platform Engineering.”

This platform model encourages seeking automation wherever possible and managing queues anywhere else, making the internal offerings have advantages in terms of access to capital, reduced transactional friction, and maintaining high-quality service. Organizations should also implement explicit service design thinking in which employee net promoter score (eNPS) is tracked and customer journeys are understood, according to the article.

“To create a platform that is useful, you need to understand what is it the actual people you’re creating this platform for want and make sure that you’re regularly talking to them, iterating with them like you would have with any product backlog, product development, and product in the world of application development,” said Daniel Betts, senior director research analyst at Gartner. “You’re treating this in a similar way, you’re having to create a platform as an agile product.”

Most of the teams that the products are targeted at are software developers, application developers, infrastructure engineers, and anyone else who’s creating code or other assets for the business.

“These people want to be able to create software or applications to deploy to a platform. They don’t want to have to think about tools, technology, governance, change management, all of those things. They want them all to be sort of made available to them and they want to focus on writing code,” Betts explained.

Often, platform teams are composed of a software engineer or two, as they add expertise to the benefits of writing machine-controlled code, code reviews, automated processes, and reusable components. They also aid in teaching scripting and coding best practices. Additionally, SREs can be part of the platform teams, as it is key for a successful product, Gartner’s Betts added.

On top of that, organizations are having to develop platform engineering talent internally because there is a skills gap and they can’t find someone in the market that will know exactly how their platform works, according to Forrester’s Betz.

“You have to do things like take somebody who’s a good technical engineer, and you’ve got to pivot them a little bit into somewhat less technical concerns like developer experience, product management, and products need to be valuable, viable, usable, and feasible,” Betz concluded. “We need to see more platform product managers.”

Software engineering teams must collaborate with site reliability engineers

Daniel Betts — Tue, 07 Mar 2023 18:45:17 +0000

Software engineering leaders need to foster collaboration with site reliability engineers (SRE) in order to scale unplanned work and improve customer experience. Software engineering teams tend to focus on releasing new product features quickly, which causes them to not always prioritize the reliability of new features.

Gartner predicts that by 2027, 75% of enterprises will use SRE practices organization-wide to optimize product design, cost and operations to meet customer expectations, up from 10% in 2022. Today, more than ever, customers are expecting applications to be reliable, fast and available on demand. When organizations present products that do not meet these expectations, customers are quick to seek other alternatives.

To improve product reliability, IT organizations are starting to adopt SRE principles and practices when designing and operating systems. However, SRE is rarely embedded into every product’s development life cycle. While software engineering leaders are engaging site reliability engineers, they are only performing occasional reliability exercises.

Foster Collaboration With Site Reliability Engineers

Now is the time for software engineering leaders to be building lasting partnerships with site reliability engineers as a part of their continuous quality strategy by adopting SRE practices and tools. Software engineering leaders will only be able to deliver the business value of their products to customers if they are treating reliability as a differentiating feature.

Software engineering teams should be addressing reliability issues early on in their product’s life cycle and collaborating with site reliability engineers throughout the entirety of a product’s design and delivery activities. Doing so is more time-efficient and economical than needing to resolve a product’s issue after it has been released.

Collaboration with site reliability engineers can be fostered by defining service level indicators (SLIs) and service level objectives (SLOs) that capture customer expectations for both product reliability and product performance. SLIs and SLOs will allow teams to clearly evaluate how well a product is meeting customer needs.
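As a rough sketch of how an SLI and an SLO relate, the example below treats availability as the fraction of successful requests and compares it with a target. The names, the sample numbers, and the choice to treat zero traffic as fully available are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI, expressed as a fraction between 0 and 1."""
    name: str
    target: float


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption: no traffic means nothing failed.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: SLO) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo.target


# Hypothetical numbers for a 99.9% availability objective.
checkout_slo = SLO(name="checkout availability", target=0.999)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
ok = meets_slo(sli, checkout_slo)
```

In this example the measured availability is 99.95%, which is above the 99.9% target, so the objective is met.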

Enforce an SLO Action Plan

Failure is an inevitable aspect of service delivery, so it is important that software engineering leaders have a plan of action to effectively manage risk. Design an action plan for each SLO with site reliability engineers. This plan should provide guidance on what needs to be done if an SLO has been breached, is trending toward a breach, or is about to be breached.
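One way to picture such a plan is as a lookup from an SLO's current state to an agreed response. The sketch below classifies a series of SLI measurements with a simple linear projection; the thresholds, the `horizon` parameter, and the actions in `ACTION_PLAN` are placeholders chosen for illustration, not recommendations from Gartner.

```python
def slo_status(history: list[float], target: float, horizon: int = 3) -> str:
    """Classify an SLO from recent SLI values (oldest first).

    Assumption: the most recent change per period continues unchanged.
    """
    if not history:
        raise ValueError("history must contain at least one measurement")
    latest = history[-1]
    if latest < target:
        return "breached"
    if len(history) >= 2:
        step = history[-1] - history[-2]  # change over the last period
        if step < 0:
            if latest + step < target:
                return "imminent"   # next period is projected to breach
            if latest + step * horizon < target:
                return "trending"   # breach projected within the horizon
    return "healthy"


# Placeholder responses; a real plan is agreed with the SRE team.
ACTION_PLAN = {
    "breached": "pause feature releases and prioritize reliability fixes",
    "imminent": "alert the on-call engineer and prepare a rollback",
    "trending": "review recent changes at the next team sync",
    "healthy": "no action needed",
}

status = slo_status([0.9999, 0.9996], target=0.999)
action = ACTION_PLAN[status]
```

In the sample call the SLI is still above target but falling fast enough to cross it within three periods, so the plan returns the "trending" response.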

Optimize Development and Design with SRE Practices

To further a culture of reliability within their teams, software engineering leaders need to incorporate SRE practices and tools that drive lasting improvement. There are several activities software engineers should be performing with site reliability engineers in order to optimize development and design for meeting SLOs and SLIs: blameless postmortems, chaos engineering, toil management, and monitoring and observability.

Blameless postmortems can be used to identify what is causing triggering events such as failure or SLO breach. This practice allows organizations to learn and avoid repeating the same mistakes, and prevent future ones. Chaos engineering uses experimental failure testing to uncover vulnerabilities. This provides information about system behavior during failures and enhances software engineering teams’ ability to improve product design. Toil management eliminates low-value work and repeatable tasks. Lowering toil allows teams to focus more on meeting SLOs. Monitoring and observability identifies the best methods needed to measure SLIs and SLOs.
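To illustrate the chaos engineering practice in miniature, the sketch below injects random failures into a stand-in dependency and checks that the calling code degrades gracefully instead of crashing. Every name here is hypothetical; real experiments run against actual systems with dedicated tooling.

```python
import random
from typing import Callable, Optional


class InjectedFault(Exception):
    """Raised by the fault injector to simulate a failing dependency."""


def with_faults(func: Callable, failure_rate: float, rng: random.Random) -> Callable:
    """Wrap func so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item: str) -> float:
    """Stand-in for a call to a remote pricing service."""
    return 9.99


def price_or_fallback(fetch: Callable, item: str) -> Optional[float]:
    """Code under test: should survive a failing dependency."""
    try:
        return fetch(item)
    except InjectedFault:
        return None  # degrade gracefully instead of crashing


# Experiment: fail roughly 30% of calls and confirm nothing escapes.
rng = random.Random(42)  # seeded so the run is repeatable
flaky_fetch = with_faults(fetch_price, failure_rate=0.3, rng=rng)
results = [price_or_fallback(flaky_fetch, "book") for _ in range(1000)]
fallbacks = sum(1 for r in results if r is None)
```

If `price_or_fallback` had no exception handling, the experiment would surface that weakness immediately, which is the kind of information about failure behavior the practice is meant to provide.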

These practices will allow software engineering teams and site reliability teams to work collaboratively and improve their ability to solve reliability issues. Software engineering teams need to work closely with site reliability engineers to help define SLOs, share accountability for meeting SLOs and adopt SRE practices and tools.

Platform engineering vs. SRE

Jakub Lewkowicz — Fri, 06 Jan 2023 15:42:30 +0000

Although the roles of the SRE and the platform engineer share some similarities and are at times conflated, they’re still distinct.

Platform engineers are responsible for designing, developing and maintaining the underlying platform that the application runs on including the infrastructure, operating systems, databases and other components that enable the application to function. SREs, on the other hand, focus on the reliability, scalability and performance of the application itself.

“The self-serviceability aspect comes under the realm of a platform engineering team that is trying to provide self-service capabilities for product teams to consume,” Gartner’s Betts said. “SRE is going to be involved in looking at some of the tools that are used to help with that, but their focus is very much on removal of repeatable manual tasks that could potentially go wrong.”

However, SREs can be placed within platform engineering teams to help with some of the tasks.

“As the SRE teams mature, they get into the platform side of the business where they’re actually calling out gaps in the self-service capabilities so the development teams and the product teams can fix it and benefit from it,” Red Hat’s Raghavan said.

While large organizations keep the two roles separate, more resource-constrained ones might have the same person performing both, according to Brent Ellis, senior analyst at Forrester.

Gear up your SRE

Here are some of the tools to help gear the SRE up for battle as provided by Forrester’s report “Role Profile: Site Reliability Engineer”:

Automation: SREs will need to use scripting, code, or orchestration tools to manage a system or environment. This can include tools like Ansible, CircleCI, GitLab, Jenkins, and Google Cloud Build.

App modernization: This can be used to migrate legacy applications to newer ones through revising the code base or rewriting the code using Docker, Git, Google Cloud Run, Kubernetes, and more.

Chaos engineering: SREs can use this method to find faults in a system by injecting specific faults in a testing or production environment using Chaos Machine, Chaos Mesh, Chaos Monkey, Chaos Toolkit, and more.

Networking: This is all about analyzing the communication process among various computing devices or computer systems using Nagios, Netdata, SolarWinds, Terraform, and more.

Observability: SREs need to manage observability to monitor and generate insights about a platform, site, or environment under management using DataDog, Dynatrace, Google Error Reporting, New Relic, and a host of others.

Security: SREs also take part in safeguarding an environment through strategies, policies, processes, and technology at every part of the life cycle using tools like Chef InSpec, Google Cloud Audit Logs, Sysdig, and Virus Total.

The perfect SRE doesn’t exist, but the right one might already be in your organization

Jakub Lewkowicz — Fri, 06 Jan 2023 15:17:31 +0000

There’s been an explosion of interest in SRE over the last 18 months and a lot of this has been from companies that are looking at scaling their DevOps or DevSecOps initiatives to look at the reliability concerns of their customers.

Vendors are recognizing this, and a lot of global systems integrators (GSIs) and managed service providers (MSPs) are offering some form of SRE-as-a-service, according to Brent Ellis, senior analyst at Forrester.

Since emerging at Google in 2003 as a way to build reliable and high-quality services while reducing costs, the role has evolved, according to Narayanan Raghavan, senior director of site reliability engineering at Red Hat.

“I think the core SRE function, in many ways, becomes a foundation and then you build on top of it. So as the teams that focus on SRE capabilities start to mature, you get into ‘how do I get into robust CI/CD practices?’” Raghavan said. “How do I build capabilities for my development teams to onboard quickly and easily because it then makes my life easier as an SRE, it makes the developers’ lives easier because they don’t have to worry about things like observability, logging, metrics, alerting. They don’t need to think about disaster recovery, incident management, or incident rehearsals.”

For SRE to work in an organization, other teams also need to be receptive to the input that SREs offer, and the level of this receptiveness differs based on the maturity of the organization. This level of engagement can be divided into three different buckets, according to Raghavan.

One is that toil for SREs should become tech debt for development almost immediately so as to avoid a separate prioritization process.

The second is that when developers actually start to architect a component that’s completely new, they need to pull in the SREs and engage with SREs up front, according to Raghavan. This is so the SREs can participate and think about how to scale that particular component. In mature organizations, this becomes an important bucket in which developers start to engage out of their own volition instead of being told that they have to do something.

Then, the third bucket is that as the SRE practice matures and is creating the building blocks that matter to all teams (observability, logging, metrics, and alerting) it’s also engaging development teams up front.

“That becomes important because it’s the development teams that are then adopting those self-service capabilities that SREs are putting out,” Raghavan said.

SREs can also lead things like blameless post-mortems in which they’ll look to get to the bottom of what caused the problem. They won’t blame any person, but will look at the processes or the technology that enabled that to take place, according to Daniel Betts, senior director analyst at Gartner.

“If you want to get full value from your SRE, try not to use them as a developer resource,” Betts said. “They should be more of like a reliability focused engineer who’s looking at the overall picture of what’s going on across the product or service that you have.”

SREs often come in at the beginning of the product life cycle and work to help the product team or the platform engineering teams build a product that is very reliable and robust, that meets the customers’ needs, he added. From there, they can perform tasks across the whole development life cycle.

“They can be involved throughout the life cycle to the point where the actual product is highly automated and incredibly reliable. It’s now running that product quite maturely and it has very effective automation, monitoring, and observability in place,” Betts said. “The SRE may actually just be keeping an eye on or looking after that product from a standpoint of the dashboards or monitoring tools or observability tools to see if it’s doing what we expect it to do. It doesn’t need that much attention anymore. They can now focus on other solutions to help with the automation and improvement of those.”

Unleash the SRE from within

With potential hiring freezes and budget cuts looming, organizations often try to look for to-be SREs already within their company.

“The perfect SRE is a myth. That perfect SRE would get bored a month, two months down the road, they’d say ‘been there, done that, give me something else, give me something new, I want to learn something different.’ So I am generally looking for people with potential,” Red Hat’s Raghavan said. “And when I say potential, these are people that are, in some cases, traditional software engineers.”

These software engineers would already have a systems mindset with which they can think about systems at scale and approach problems that way. A good pool of potential SREs can also exist with systems engineers that can understand software engineering principles.

“So I am from a hiring practice perspective looking for people that fall in that bucket specifically, because then I know that I can invest in them. And as I invest in them, and as they learn the space, they invest back into the company and back in the team,” Raghavan said. “So I am not looking for a perfect fit. I’m in fact, looking for people who are, in many ways eager to learn, can understand technology and understand how to pick up different spaces quickly.”

It’s also important to assign new SREs to a production process early on and to have a mentor guide them.

Gartner’s Betts sees that some organizations that want to start an SRE practice just wind up rebranding an existing I.T. operations team, or a person in that role, which is the wrong approach.

“An SRE is giving value not just by focusing on things like incident problems, operational improvements, monitoring, and being able to have better insights,” Betts said. “It’s also looking at how we can take some of that software engineering or engineering mindsets to the world of infrastructure operations and look at how we can have reusable modules, efficient infrastructure delivery, efficient response to incidents, and being able to scale capacity.”

In their day-to-day work, SREs are often embedded in a product team, such as a development team, where they act as reliability consultants: they inform the team of the organization’s reliability expectations, help identify toil, and work to automate it as part of the team’s backlog, according to Betts.

“In the early maturity stages, having a completely decentralized model makes a lot of sense, because you’re a lot more nimble and agile. But as the product matures, having a more central function to think about reliability at scale becomes important,” Red Hat’s Raghavan continued.

SRE…the social butterfly?

One skill set that often goes overlooked for this role is soft skills, which should instead be called ‘critical skills’, according to Gartner’s Betts.

SREs need to be great communicators because part of the job is to communicate effectively about the data they see, such as service level objectives (SLOs), budgets, and other measures. They also need to show that they can empathize with customers and talk about the specific things that are impacting customers’ experience. SREs are often the ones interacting with customers, partners, development teams, product managers, and more.

“So if you’re talking to maybe a product owner or a strategy person, you take it to a higher level, you’re talking to someone that’s in the team, as an engineer or a developer, you need to get maybe down into the depths and talk a little bit more detail with them,” Betts said.

Red Hat’s Raghavan added that these soft skills are even more important for an SRE than the technical skills. This is because technical skills are trainable, but it’s often much harder to find people with both soft skills and technical skills.

“That mindset and the ability to articulate that is absolutely vital for a reliability engineering function, because then we start to look at if something really matters to the customer, you should probably be looking at the specific causes that matter and therefore the symptoms that show up to the customer and what it is that we need to get alerted on,” Raghavan said.


The post The perfect SRE doesn’t exist, but the right one might already be in your organization appeared first on SD Times.

SD Times news digest: Gremlin Automatic Service Discovery, WhiteHat Attack Surface Management, and Jamf’s same-day Apple OS support

Jakub Lewkowicz — Tue, 27 Apr 2021 15:51:36 +0000

Gremlin has added Automatic Service Discovery to its chaos engineering platform in an effort to help companies improve resilience and reduce downtime by identifying the various services running across distributed systems.

“The rise in popularity of microservices necessitate services functioning as first-class citizens. The infrastructure layer is becoming more abstract and engineers are increasingly thinking about their systems as a collection of services,” said Matthew Fornaciari, the CTO and co-founder of Gremlin. “We want to replicate that mental model in Gremlin and reduce the cognitive load necessary to create controlled chaos.”

Gremlin also built a new way to track reliability progress by enabling SREs and DevOps teams to click into a particular service and view the full history of events that were run over time.

More information is available here.

WhiteHat Attack Surface Management announced

WhiteHat Security released Attack Surface Management powered by Bit Discovery to offer enterprises a more streamlined way to discover, manage and secure their comprehensive attack surface.

Bit Discovery automatically generates a comprehensive inventory of exposed assets, including websites, VPNs, DNS servers, IoT devices and phishing sites. Security teams can use the dashboard to bring specific assets under WhiteHat’s application security service, according to the company.

“Attack Surface Management Powered by Bit Discovery not only bolsters WhiteHat’s platform with innovative tools that provide a tremendous amount of value for our clients, it also advances our vision to build security into each step of the entire software development lifecycle,” said Craig Hinkley, the chief executive officer at WhiteHat Security.

Jamf announces same-day support for Apple OS releases

Jamf announced that it is prepared with same-day feature support and compatibility for Apple’s latest operating system releases including iOS 14.5, iPadOS 14.5, macOS 11.3 and tvOS 14.5 when they become available.

Jamf said that this functionality is especially useful to education customers that are looking to access education apps in the Mac App Store and make them available to students.

The company’s other products Jamf Now, Jamf Connect and Jamf Protect are also offering same-day support for the latest releases from Apple with compatibility for new operating systems.

Microsoft announces plans to end support for .NET Framework 4.5.2, 4.6 and 4.6.1

Microsoft announced it will be ending support for .NET Framework 4.5.2, 4.6 and 4.6.1 in one year, after which it will no longer provide updates, including security fixes, or technical support for these versions.

There will be no change to the support timelines for any other .NET Framework version including .NET Framework 3.5 SP1, which will continue to be supported as documented on the .NET Framework Lifecycle FAQ.

Microsoft found that updating .NET Framework 4.6.2 and newer versions to support newer digital certificates for the installers would satisfy the vast majority (98%) of users without them needing to make a change.

Additional details are available here.

The post SD Times news digest: Gremlin Automatic Service Discovery, WhiteHat Attack Surface Management, and Jamf’s same-day Apple OS support appeared first on SD Times.

SD Times news digest: Qt acquires froglogic, the Embedded Software Testing & Compliance Summit, and Catchpoint’s virtual SRE community event

Jakub Lewkowicz — Wed, 14 Apr 2021 15:20:06 +0000

Qt announced that it will acquire froglogic GmbH, a major provider of quality assurance tools, to bring froglogic’s test automation tools into the Qt product portfolio.

“As The Qt Company continues its growth, the acquisition of froglogic is an important milestone in broadening Qt’s best-in-class software development tools and building in automated testing and code coverage analysis directly into our suite of products. Understanding that speed of delivery for new products is crucial to our customers, our goal is to improve developer productivity and make the product development process as streamlined as possible,” said Juha Varelius, president and CEO of Qt Group Plc.

Froglogic GmbH offers tooling to support GUI test automation, code coverage analysis and test result management, enabling customers to assess and steer their quality assurance efforts across an application’s life cycle.

Embedded Software Testing & Compliance Summit announced

Parasoft announced that it is hosting a live virtual event on May 6th in which industry leaders will share their embedded software quality stories of overcoming safety-critical compliance and security challenges with automated software testing solutions.

“Companies across all industries need to have confidence in their software quality and deliver safe and secure software to their users,” said Arthur Hicken, evangelist and event moderator at Parasoft. “Many embedded software companies are turning to automated and integrated testing that includes static code analysis, unit testing, regression testing, code coverage, and requirements traceability to ensure compliance with functional safety, security, and coding standards. In this summit you’ll hear how organizations are solving real safety and security software issues.”

The talks will cover how a medical device technology company successfully adopted a unit testing solution, how an avionics developer and manufacturer achieved code compliance and streamlined productivity, and much more.

Additional details on the event are available here.

Catchpoint announces virtual SRE community event on June 10th

Catchpoint announced that it will launch SRE from Anywhere, a virtual, interactive event focused on helping SREs connect with peers to share best practices, industry trends and organizational dynamics.

The event will feature panel discussions, practitioner sessions and lightning talks to foster an open forum for inclusion and learning.

Other talks include results from the 2021 SRE survey sponsored by Catchpoint, VMware Tanzu, and the DevOps Institute about true observability, DevOps principles and the latest use cases and trends such as Platform Ops.

Accolade for Smart Products

Sopheon announced Accolade for Smart Products, a new management solution that brings together traditionally siloed software and physical product development.

The solution aims to foster cross-functional collaboration and synchronization that results in trusted, timely data for faster, better and more dynamic decision making.

“As the digital and physical worlds collide, many companies struggle to find the best ways to manage innovation across different disciplines. Accolade for Smart Products enables companies – from traditional manufacturers to new technology stars – to accelerate product delivery, while also implementing the best practices needed for product reliability without dragging down innovation,” said Paul Heller, the chief technology officer of Sopheon.

The post SD Times news digest: Qt acquires froglogic, the Embedded Software Testing & Compliance Summit, and Catchpoint’s virtual SRE community event appeared first on SD Times.

Report finds chaos engineering can significantly decrease MTTR and increase availability

Jakub Lewkowicz — Wed, 27 Jan 2021 14:25:56 +0000

A new report revealed those who have successfully implemented chaos engineering have 99.9% or higher availability and greatly improved their mean time to resolution (MTTR).

Gremlin’s inaugural 2021 State of Chaos Engineering report found that 23% of teams who frequently run chaos engineering projects had an MTTR of under 1 hour, and 60% under 12 hours.
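To put those figures in concrete terms, the short Python sketch below converts an availability target into allowed downtime and computes MTTR from a list of incident durations. The function names and the incident durations are invented for illustration and do not come from the report; MTTR is assumed here to be a simple arithmetic mean over resolved incidents.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Downtime permitted within `period` while still meeting the availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period * (1 - availability_pct / 100)


def mean_time_to_resolution(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of incident durations (an assumed definition of MTTR)."""
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident durations, for illustration only.
incidents = [
    timedelta(minutes=30),
    timedelta(minutes=45),
    timedelta(hours=2, minutes=15),
]

yearly_budget = allowed_downtime(99.9, timedelta(days=365))
mttr = mean_time_to_resolution(incidents)
print(f"Downtime allowed per year at 99.9%: {yearly_budget}")
print(f"MTTR for the sample incidents: {mttr}")
```

Under these assumptions, 99.9% availability leaves roughly 8.76 hours of downtime per year, and the three sample incidents average 70 minutes, which would fall outside the under-1-hour group.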

Gartner has made a similar prediction: by 2023, 80% of organizations that use chaos engineering practices as part of SRE initiatives will reduce their MTTR by 90%.

According to Gremlin’s report, the highest availability groups commonly utilized autoscaling, load balancers, backups, select rollouts of deployments, and monitoring with health checks.

The most common way to monitor standard uptime was synthetic monitoring; however, many organizations reported that they use multiple methods and metrics.

In the report, Gremlin also found that chaos engineering has seen much greater adoption recently, and that the practice has matured tremendously since its inception 12 years ago.

“The diversity of teams using Chaos Engineering is also growing. What began as an engineering practice was quickly adopted by SRE teams, and now many platform, infrastructure, operations, and application development teams are adopting the practice to improve the reliability of their applications,” the report stated.

While it’s still an emerging practice, the majority of respondents (60%) said that they ran at least one chaos engineering attack and more than 60% of respondents have run chaos against Kubernetes.

The most commonly run experiments reflected the top failures that companies experience, with network attacks such as latency injection at the top.
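A latency injection experiment can be illustrated with a toy, in-process Python sketch. This is not how Gremlin implements its attacks, which the report excerpt does not describe; `inject_latency`, its parameters, and `fetch_profile` are hypothetical names chosen for the example, and a real experiment would normally add the delay at the network layer rather than inside application code.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, rng=None, sleep=time.sleep):
    """Decorator that delays a call by `delay_s` seconds with the given probability.

    `rng` and `sleep` are injectable so the behavior can be tested deterministically.
    """
    if delay_s < 0:
        raise ValueError("delay_s must be non-negative")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    rng = rng if rng is not None else random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Delay only a fraction of calls, to limit the blast radius.
            if rng.random() < probability:
                sleep(delay_s)
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical service call: about half of the calls are slowed by 200 ms.
@inject_latency(delay_s=0.2, probability=0.5)
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}
```

Keeping the probability below 1.0 mirrors the idea of a controlled experiment: only some requests are affected, so the team can observe how timeouts, retries, and alerts behave without degrading every call.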

However, among companies that have not adopted chaos engineering, most (80%) cited a lack of awareness, experience, and time. Fewer than 10% of people said that it was because of fear of something going wrong.

“It’s true that in practicing Chaos Engineering we are injecting failure into systems, but using modern methods that follow scientific principles, and methodically isolating experiments to a single service, we can be intentional about the practice and not disrupt customer experiences,” the report stated. “We believe the next stage of Chaos Engineering involves opening up this important testing process to a broader audience and to making it easier to safely experiment in more environments.”

The post Report finds chaos engineering can significantly decrease MTTR and increase availability appeared first on SD Times.

GitLab collaborates with Google Cloud on app deployment

Matt Santamaria — Thu, 05 Apr 2018 17:15:26 +0000

GitLab announced a new collaboration with Google Cloud to offer native integration into Google Kubernetes Engine (GKE). This new integration aligns with GitLab’s vision of Auto DevOps.

Auto DevOps is GitLab’s way of automating DevOps and delivering ideas to production faster. It consists of a collection of build, test and deployment features. The new integration aims to simplify the complexity of setting up and deploying to a Kubernetes cluster. It will also automatically configure CI/CD pipelines to build, test, and deploy applications, GitLab explained.

“We’re excited to collaborate with GitLab to make GKE even simpler to set up through integration with GitLab’s automated DevOps capabilities,” said William Denniss, Kubernetes product manager at Google. “We are constantly looking to further GKE’s mission of enabling customers to easily deploy, manage and scale containerized applications on Kubernetes, and this collaboration with GitLab unlocks accelerated DevOps for containerized applications at scale.”

According to GitLab, it can be time-consuming and resource-intensive for developers to set up and maintain Kubernetes. With GKE, developers can automatically spin up a cluster to deploy applications with just a few clicks, Google explained. The clusters are managed by Google SREs and run on Google Cloud Platform’s (GCP) infrastructure.

“Before the GKE integration, GitLab users needed an in-depth understanding of Kubernetes to manage their own clusters,” said Sid Sijbrandij, CEO of GitLab. “With this collaboration, we’ve made it simple for our users to set up a managed deployment environment on GCP and leverage GitLab’s robust Auto DevOps capabilities.”

The company is also in the process of migrating GitLab.com to Google Cloud Platform. According to GitLab, the main reason for the migration is that Google has the most mature Kubernetes platform. The move will also provide access to security functionality, including default encryption of data at rest, as well as an expanding list of localities served globally and tight integration with the existing CDN for faster caching.

The post GitLab collaborates with Google Cloud on app deployment appeared first on SD Times.