data scientists Archives - SD Times
https://sdtimes.com/tag/data-scientists/

Data scientists and developers need a better working relationship for AI
https://sdtimes.com/data/data-scientists-and-developers-need-a-better-working-relationship-for-ai/ (Tue, 06 Aug 2024)

Good teamwork is key to any successful AI project, but combining data scientists and software engineers into an effective force is no easy task.

According to Gartner, 30 percent of AI projects will be abandoned by the end of 2025 due to factors such as poor data quality, escalating costs and a lack of business value. Data scientists are pessimistic, too, expecting just 22 percent of their projects to make it through to deployment.

Much of the debate on turning these poor figures around by delivering better AI has focused on technology, but little attention has been paid to improving the relationship between the scientists and engineers responsible for producing AI in the first place.

This is surprising because although both are crucial to AI, their working practices don’t exactly align — in fact they can be downright incompatible. Failing to resolve these differences can scupper project delivery, jeopardize data security and threaten to break machine learning models in production.

Data scientists and software engineers need a better working relationship – but what does that look like and how do we achieve it?

DevOps forgot the data science people

As cloud has burgeoned, much of the industry’s attention has been devoted to bringing together developers and operations to make software delivery and lifecycle management more predictable and improve build quality. 

Data scientists, during this time, have flown under the radar. Drafted into enterprise IT to work on AI projects, they are joining an environment that’s not quite ready for them.

What do I mean? Data scientists have a broad remit, taking a research-driven approach to solving business- and domain-level challenges through data manipulation and analysis. They operate outside the software delivery lifecycle using special tools and test platforms to build models using a subset of languages employed by developers.

Software engineering, while a creative and problem-solving discipline, takes a different approach. Engineers are delivery-focused and tackle jobs in priority order with results delivered in sprints to hit specific goals. Tool chains built on shared workflows are integrated and automated for team-based collaboration and communication.

These differences have bred friction in four notable areas:

  1. Process. Data scientists’ longer cycles don’t fit neatly into the process- and priority-driven flow of Agile. Accomplish five tasks in two days or deliver a new release every few hours? Such targets run counter to the nature of data science, and failure to accommodate this will soon see the data science and software engineering wheels on an AI project running out of sync.
  2. Deployment. Automated delivery is a key tenet of Agile that’s eliminated the problems of manual delivery in large and complex cloud-based environments and helps ensure uptime. But a deployment target of, say, 15-30 minutes cannot work for today’s large and data-heavy LLMs. Deployment of one to two hours is more like it — but this is an unacceptable length of time for a service to go offline. Push that and you will break the model.
  3. Lifecycle. Data scientists using their own tools and build processes breed machine learning model code that lives outside the shared repo where it would be inspected and understood by the engineering team. It can fly under the radar of Quality Assurance. This is a fast-track to black-box AI, where engineers cannot explain the code to identify and fix problems, nor undertake meaningful updates and lifecycle management downstream.
  4. Data Security. There’s a strong chance data scientists in any team will train their models on data that’s commercially sensitive or that identifies individuals, such as customers or patients. If that’s not treated before it hits the DevOps pipeline or production environment, there’s a real chance that information will leak.
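
To make that last point concrete, here is a minimal, hypothetical sketch of pseudonymizing sensitive columns before a training set ever reaches a shared pipeline. The column names and salting scheme are illustrative assumptions only; a real project would lean on an established anonymization or tokenization service rather than a hand-rolled script.

```python
import hashlib

import pandas as pd

# Hypothetical column names; real schemas will differ.
SENSITIVE_COLUMNS = ["customer_name", "email", "national_id"]
SALT = "replace-with-a-secret-from-your-vault"  # never hard-code a salt in practice


def pseudonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Replace directly identifying columns with salted hashes so the
    training set entering the shared pipeline no longer exposes raw PII."""
    treated = df.copy()
    for col in SENSITIVE_COLUMNS:
        if col in treated.columns:
            treated[col] = treated[col].astype(str).map(
                lambda value: hashlib.sha256((SALT + value).encode()).hexdigest()
            )
    return treated


if __name__ == "__main__":
    raw = pd.DataFrame({
        "customer_name": ["Ada Lovelace"],
        "email": ["ada@example.com"],
        "purchase_total": [42.0],
    })
    print(pseudonymize(raw))
```
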
No right or wrong answer

We need to find a collaborative path — and we can achieve that by fostering a good working environment that bridges the two disciplines to deliver products. That means data scientists internalizing the pace of software engineering and the latter adopting flexible ways to accommodate the scientists. 

Here are my top three recommendations for putting this into practice:

  1. Establish shared goals. This will help the teams to sync. For example, is the project goal to deliver a finished product such as a chatbot? Or is the goal a feature update, where all users receive the update at the same time? With shared goals in place it’s possible to set and align project and team priorities. For data scientists that will mean finding ways to accelerate aspects of their work to hit engineering sprints, for example by adopting best practices in coding. This is a soft way for data scientists to adopt a more product-oriented mindset to delivery but it also means software engineers can begin to factor research backlogs into the delivery timelines.
  2. Create a shared workflow to deliver transparent code and robust AI. Join the different pieces of the AI project team puzzle: make sure the data scientists working on the model are connected to both the back-end production system and the front end, while software engineers focus on making sure everything works. That means working through shared tools according to established best practices, following procedures such as common source control, versioning and QA (see the sketch after this list).
  3. Appoint a project leader who can step in when needed on product engineering and delivery management. This person should have experience in building a product and understand the basics of the product life cycle so they can identify problems and offer answers for the team. They should have the skills and experience to make tactical decisions such as squaring the circle of software sprints. Ultimately they should be a project polyglot — capable of understanding both scientists and engineers, acting as translator and leading both.
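
To make the QA point concrete, here is a minimal, hypothetical sketch of the kind of automated check model code can be held to once it lives in the shared repository. The model, data and accuracy floor are placeholders chosen only to keep the example self-contained and runnable; real gates would exercise the team’s actual training code.

```python
# test_model_gate.py -- a hypothetical CI quality gate for model code (pytest style).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

MIN_ACCURACY = 0.80  # agreed with the team and tracked in version control


def train_model(seed: int = 0):
    """Train a small stand-in model on synthetic data and return it with its score."""
    X, y = make_classification(n_samples=1000, n_features=10, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


def test_model_meets_accuracy_floor():
    _, score = train_model()
    assert score >= MIN_ACCURACY


def test_training_is_reproducible():
    # Same seed, same data, same score: a basic guard against hidden state.
    _, first = train_model(seed=42)
    _, second = train_model(seed=42)
    assert first == second
```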

Data scientists and software developers operate differently but they share a common interest in project success — exploiting that is the trick. If data scientists can align with Agile-driven delivery in software engineering and software engineers can accommodate the pace of their data-diving colleagues it will be a win for all concerned. A refined system of collaboration between teams will improve the quality of code, mean faster releases and — ultimately — deliver AI systems that make it through deployment and start delivering on the needs of business.


The MissingLink between deep learning and data scientists
https://sdtimes.com/ai/the-missinglink-between-deep-learning-and-data-scientists/ (Wed, 26 Sep 2018)

Machine learning is rapidly advancing and being deployed across a number of different industries and solutions to intelligently manage data, gain insights and provide value. However, effectively training and delivering machine learning models is still very time-consuming and involves a lot of grunt work, according to a newly launched company.

MissingLink.ai has launched this week to streamline and automate the entire deep learning life cycle for data scientists and engineers.

“Work on MissingLink began in 2016, when my colleagues Shay Erlichmen [CTO], Rahav Lussato [lead developer], and I set out to solve a problem we experienced as software engineers.  While working on deep learning projects at our previous company, we realized we were spending too much time managing the sheer volume of data we were collecting and analyzing, and too little time learning from it,” Yosi Taguri, CEO of MissingLink, wrote in a post. “We also realized we weren’t alone. As engineers, we knew there must be a more efficient solution, so we decided to build it. Around that time, we were joined by Joe Salomon [VP of product], and MissingLink was born.”

The team decided to focus on machine learning and deep learning because of the potential to “impact our lives in profound ways.” Machine learning has already been used for detecting diseases, in autonomous vehicles and in public safety situations, according to the company.

The company will work on providing tools that streamline data, code, experiments and resources as well as automate repetitive, time-consuming tasks, Taguri explained.

The solution will feature an experiment manager that enables engineers to start experiments with as little as three lines of code. It will feature version-aware data management to reduce load time and make data exploration faster and easier. In addition, it will support popular AI frameworks including Keras, TensorFlow, PyTorch and Caffe. The product will also offer real-time experiment monitoring and tracking with visual dashboards. Other features include experiment history, comparison, code tracking, auto-documentation, external metrics and charts, data streaming, and data versioning.

“Training data is your most valuable asset, so why manage it with a file system? By managing data in a version-aware data store, MissingLink eliminates the need to copy files and only syncs changes to the data. The result is reduced load time and easy data exploration,” the company wrote on its website.

CMU, Meltwater and Oxford look to advance data science research
https://sdtimes.com/ai/cmu-meltwater-and-oxford-look-to-advance-data-science-research/ (Wed, 15 Aug 2018)

Media intelligence company Meltwater is opening up its artificial intelligence platform to give developers and data scientists real-time insights into business data. Fairhair.ai features a knowledge graph for connecting and organizing internal and external data, access to pre-trained and configurable AI models, and the ability to make data-driven decisions.

“We quickly realized our AI platform had far more potential, beyond Meltwater’s core products, to democratize competitive insights created from real-time online data. Fairhair.ai goes beyond Business Intelligence by filling the gap between historic trends and real-time market indicators, and predict what’s ahead with forward-looking insights – a practice we call Outside Insight,” said Jorn Lyseggen, Meltwater’s founder and CEO.

Lyseggen went on to explain that while online data is one of the most valuable sources for business decisions, obtaining and understanding the right data in order to draw valuable conclusions is also one of the biggest challenges businesses face. “Historically, online data has been hard to track and analyze in a systematic and rigorous way because of its sheer scale, plethora of data types, and its multitude of languages. Fairhair.ai addresses many of these challenges and helps companies analyze the noisy and messy web to better understand their competitive landscape,” he said.

As part of the announcement, Carnegie Mellon University announced it would be collaborating with Meltwater to leverage the Fairhair.ai platform and advance AI education and research. The university believes the platform can help students and faculty create, connect and organize web-scale information.

“Sharing access to real-world data helps students, researchers and data scientists solve real-world problems more rapidly,” said Eric Nyberg, director of CMU’s Master of Computational Data Science program and a professor in the Language Technologies Institute at CMU. “In addition to realistic real-time data sources, the platform also includes AI modeling and integrated cloud computing to greatly simplify the process of building and optimizing new web-scale analytics.”

The University of Oxford will also be partnering with Meltwater to help researchers access and work with the platform. Oxford plans to use the data and AI platform to support four major projects: Value Added Data Systems, FakeNewsRank, realistic data models and query compilation for large-scale probabilistic databases, and DeepReason.ai.

“These research projects are collectively focused on pushing the boundaries of knowledge graphs, from populating and scaling to processing and reasoning. Meltwater is already leveraging these techniques, together with recent advancements in machine learning, to further strengthen its Fairhair.ai platform,” said Georg Gottlob, professor of informatics, fellow at St. John’s College and member of Meltwater’s Scientific Advisory Board. “Together, we have a shared interest to explore what we believe is the future of AI – the combination of machine learning and logical reasoning – and significantly advance the greater data science ecosystem.”

Other features of Fairhair.ai include access to open and licensed global data, custom sources, custom models, rule-based filtering, predictive insights, reasoning and precision filtering.

“Existing AI platforms often require a lot of expertise and experimentation to be effective. They also don’t help with the expensive and difficult preparation of data. Fairhair.ai provides a fast-track to reap the benefits of AI and advanced insights by opening up access to our data and models organized in a knowledge graph,” said Aditya Jami, CTO and head of AI at Meltwater.

GitLab to create tool for data teams
https://sdtimes.com/data/gitlab-to-create-tool-for-data-teams/ (Thu, 02 Aug 2018)

GitLab has revealed it is working on a new tool for the data science lifecycle. Meltano is an open-source solution designed to fill the gaps between data and understanding business operations.

“Meltano was created to help fill the gaps by expanding the common data store to support Customer Success, Customer Support, Product teams, and Sales and Marketing,” the team wrote in a post. “Meltano aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle. While this might sound familiar if you’re already a fan of GitLab, Meltano is a separate product. Rather than wrapping Meltano into GitLab, Meltano will be the complete package for data people, whereas GitLab is the complete package for software developers.”

The goal of Meltano is to make analytics accessible to everyone, not just data professionals. According to GitLab, while the company has experience collecting data and presenting it in a readable format so business users can make predictions from it, the process currently takes too many steps and tools.

“The idea of bringing best practices from software development to data analytics is a huge draw for the Data team at GitLab. Ideally, all of our work could be done in open source tools, and could be version controlled, and we’d be able to track the state of the analytics pipeline from raw data to visualization,” GitLab wrote.

The tool currently only supports Postgres, with plans to add Snowflake in the near future. The company knows it will need to support a wide variety of database types, and it’s looking for contributions from the open-source community to help meet that goal.

“As an open source tool, we think Meltano will make a big difference for teams without much money to invest in data analytics. It’s a new field for many organizations, and we want to do everything we can to make it easier for teams and business to access their data and make better decisions,” GitLab wrote.

Google doubles down on machine learning at Google Cloud Next 2018
https://sdtimes.com/ai/google-doubles-down-on-machine-learning-at-google-cloud-next-2018/ (Thu, 26 Jul 2018)

Google announced new solutions for machine learning at its Google Cloud Next 2018 conference happening in San Francisco this week. The company revealed BigQuery ML, AIY Edge TPU Dev Board, and AIY Edge TPU Accelerator.

BigQuery ML is a new capability inside the Google BigQuery solution, which provides interactive analysis of large datasets in order to share insights and make better decisions. The newly added feature is designed to enable data scientists and analysts to build and deploy machine learning models.

“[M]any of the businesses that are using BigQuery aren’t using machine learning to help better understand the data they are generating. This is because data analysts, proficient in SQL, may not have the traditional data science background needed to apply machine learning techniques,” Google’s research scientists Umar Syed and Sergei Vassilvitskii, wrote in a post. “BigQuery ML is a set of simple SQL language extensions which enables users to utilize popular ML capabilities, performing predictive analytics like forecasting sales and creating customer segmentations right at the source, where they already store their data.”
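
For illustration only, here is a rough sketch of what those SQL extensions can look like when submitted through the BigQuery client library for Python. The dataset, table and column names are hypothetical placeholders, not examples from Google’s announcement, and the options available depend on the BigQuery ML release in use.

```python
# Hypothetical example: dataset, table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are already configured

# Train a simple regression model right where the data lives.
train_sql = """
CREATE OR REPLACE MODEL `my_dataset.sales_forecast`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['weekly_sales']) AS
SELECT store_id, promo_flag, avg_price, weekly_sales
FROM `my_dataset.sales_history`
"""
client.query(train_sql).result()  # blocks until the training job finishes

# Score new rows with the trained model.
predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.sales_forecast`,
                (SELECT store_id, promo_flag, avg_price
                 FROM `my_dataset.upcoming_weeks`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))
```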

The newly announced AIY Edge TPU Dev Board and AIY Edge TPU Accelerator are devices designed to help engineers build on-device machine learning. The devices are powered by the company’s Edge TPU and mark a step towards expanding AIY, or do-it-yourself artificial intelligence, into a platform.

The AIY Edge TPU Dev Board enables developers to prototype embedded systems that demand machine learning inferencing. AIY Edge TPU Accelerator is a neural network coprocessor for existing systems. “On-device ML is still in its early days, and we’re excited to see how these two products can be applied to solve real world problems — such as increasing manufacturing equipment reliability, detecting quality control issues in products, tracking retail foot-traffic, building adaptive automotive sensing systems, and more applications that haven’t been imagined yet,” Billy Rutledge, director of AIY projects at Google, wrote in a post.

SD Times news digest: CSHTML5 1.1, Dialogflow updates and Yubico’s mobile SDK for iOS developers
https://sdtimes.com/webdev/sd-times-news-digest-cshtml5-1-1-dialogflow-updates-and-yubicos-mobile-sdk-for-ios-developers/ (Wed, 23 May 2018)

The first stable release of C#/XAML for HTML5 1.1 is now available. According to the team, the release signals CSHTML5 as a final product that will get regular updates. CSHTML5 enables developers to create HTML5 apps with C#, XAML and Visual Studio.

The release features bug fixes as well as new improvements including support for custom fonts, ItemContainerStyle, AncestorType, RadialGradientBrush, and the ability to customize the Loading screen.

The team also announced it is looking to integrate Blazor/WebAssembly and Bridge.NET technologies into CSHTML5.

Release notes are available here.

Google announces new Dialogflow features

Google has announced new updates to its conversational interface development suite for building AI experiences. Dialogflow now features beta versions and environments, improved conversation history tool for debugging, and training with negative examples. Google explained the updates are designed to help enterprises create, maintain and improve their AI conversational experiences.

Versions and environments enable users to build, test, deploy and iterate their interfaces. The improved history tool now provides conversations between users and agents, and flags places where the agent was unable to carry out a user’s intent. Lastly, the training with negative examples aims to improve precision.  

Cloudera updates focus on data scientists and data engineers

Cloudera wants to make data scientists and data engineers more productive with new ways to operationalize data insights and grow, connect, and protect businesses. The company announced a slew of new features designed to help data teams collaborate and deliver models to production faster.

Updates include Cloudera Data Science Workbench 1.4, Cloudera Altus availability on Microsoft Azure, and Cloudera Enterprise 6.0 with improved performance.

“We believe data can make what is impossible today, possible tomorrow. With enhanced capabilities in machine learning, analytics, and cloud, the new software products and cloud services we are announcing will enable our customers to more rapidly gain competitive advantages in the data economy,” said Tom Reilly, CEO at Cloudera. “These enhancements demonstrate Cloudera’s commitment to market-leading innovations that empower enterprises to securely transform complex data into clear and actionable insights to propel their digital transformation.”

Yubico brings mobile SDK to iOS

The company known for creating passwordless experiences is now providing a new mobile software development kit for iOS app developers. The SDK will enable developers to integrate the company’s NEO near-field communication two-factor authentication into iOS apps.

“It’s absolutely critical to have a hardware-based root of trust, like the YubiKey, to establish an approved relationship between a mobile phone and the apps we use,” said Stina Ehrensvard, CEO and founder of Yubico. “Mobile authentication methods, like SMS or push apps, cannot be considered as trusted second factors to authenticate in a mobile app setting. They can be spoofed by porting a number to a different mobile device or can be very unreliable at the mercy of the phone networks.”

Industry leaders launch data manifesto
https://sdtimes.com/data/industry-leaders-launch-data-manifesto/ (Tue, 06 Mar 2018)

The data community is getting new guidelines to approach data effectively and ethically.  Data.world and industry leaders today are announcing the Manifesto for Data Practices.

According to Brett Hurt, CEO of data.world, data is being used today to fuel business efforts and improve customer satisfaction; however, there are no formal guidelines for approaching data and data collaboration. The Manifesto for Data Practices aims to help organizations and users maximize their data’s internal value and impact while taking things like privacy and security into account.

“Today, every choice that a company makes about data has the potential to help or harm consumers, communities, and even entire countries,” said Hurt. “The Manifesto for Data Practices is critically important because it defines a new model for improving data practices themselves. It can be used by any organization to foster ethical, productive data teamwork, and we feel privileged to have collaborated with so many industry luminaries to co-author and release this manifesto.”

The Manifesto for Data Practices is built on four values and 12 principles. The values are inclusion, experimentation, accountability and impact.

According to the manifesto, the principles aim to help data teams:

  1. Use data to improve life for our users, customers, organizations, and communities.
  2. Create reproducible and extensible work.
  3. Build teams with diverse ideas, backgrounds and strengths.
  4. Prioritize the continuous collection and availability of discussions and metadata.
  5. Clearly identify the questions and objectives that drive each project and use them to guide both planning and refinement.
  6. Be open to changing our methods and conclusions in response to new knowledge.
  7. Recognize and mitigate bias in ourselves and in the data we use.
  8. Present our work in ways that empower others to make better-informed decisions.
  9. Consider carefully the ethical implications of choices we make when using data, and the impacts of our work on individuals and society.
  10. Respect and invite fair criticism while promoting the identification and open discussion of errors, risks, and unintended consequences of our work.
  11. Protect the privacy and security of individuals represented in our data.
  12. Help others to understand the most useful and appropriate applications of data to solve real-world problems.

“We believe these values and principles, taken together, describe the most effective, ethical, and modern approach to data teamwork,” according to the manifesto.

The manifesto is currently signed by more than 1,200 data leaders from government, business and academia.

According to Hurt, while the manifesto doesn’t explicitly touch on the upcoming General Data Protection Regulation in the EU, it is compatible with it.

In addition, data.world is working to develop exercises that can help organizations successfully visualize the manifesto practices and principles.

DefinedCrowd releases version 1.0 of its smart AI data platform
https://sdtimes.com/ai/definedcrowd-releases-version-1-0-smart-ai-data-platform/ (Thu, 18 Jan 2018)

DefinedCrowd, provider of an intelligent data platform for AI, is starting the new year off with the 1.0 version of its Software-as-a-Service (SaaS) platform. The platform allows data scientists to collect, enrich, and structure data for artificial intelligence (AI), and is now publicly available.

“Artificial Intelligence needs data to be properly trained. A lot of data. Currently, it is estimated that 80 [percent] of the data generated globally is unstructured. This environment makes access to high-quality structured data expensive and difficult to obtain quickly. As a result, data scientists have been scrubbing their own data, which is a tedious and time-consuming process,” founder and CEO Daniela Braga wrote in a blog post.

 The new platform can be accessed through the company’s online platform as well as its public API that was released in November. The API enables data scientists to integrate DefinedCrowd into their own machine learning infrastructures. According to the company, the new SaaS platform specializes in speech technologies, natural language processing, computer vision, image recognition and other machine learning capabilities.

“This is a huge milestone for DefinedCrowd,” said Braga. “It is the culmination of two years working on this solution, and we are really proud to release to the world our SaaS platform, with features designed to make data scientists’ lives easier.”

The solution allows users to select their data source, customize their projects, analyze in real time, and use high-quality training data to train and model AI systems.

“With our solution, it has never been easier to access high-quality data to train your AI systems, faster than ever. Whether you want to train a smart personal assistant, a savvy chatbot, or an accurate self-driving car, we have a solution for you,” Braga wrote.

In addition, the company recently released their own skilled community Neevo, designed to give users a place to contribute to the development and improvement of AI systems. “We live in a time where technology is evolving really fast and its presence in our daily lives is rapidly growing. The interest in AI is growing and many people want to be part of it. This is where Neevo comes into the picture,” explained DefinedCrowd’s chief business development officer Aya Zook. “With Neevo, everyone has a chance to be part of the next wave of innovation by contributing to a better AI.”

LinkedIn: Machine learning jobs are on the rise
https://sdtimes.com/big-data-engineers/linkedin-machine-learning-jobs-rise/ (Wed, 27 Dec 2017)

Machine learning engineers, data scientists, and Big Data engineers are among the top emerging jobs in technology, according to a recently released report from LinkedIn.

As technology changes and expands, employment trends change with it. As a result, the skills needed to succeed in the workforce are constantly changing. LinkedIn’s report draws on its data from the last five years on which jobs and skills are becoming the most popular.

“It may come as no surprise that technology-centric roles stole the show among emerging jobs in the United States, but the prevalence of machine learning and data science roles and skills indicate a shift in the types of technology we can expect to be using in the near future, as well as what professionals should be preparing themselves for,” the LinkedIn economic graph team wrote in a post.

The results also indicate that in general, specialized roles are on the decline as companies seek to hire those with a more comprehensive skill set. Having a comprehensive skill set was also a strong trend among machine learning engineers and data scientists.

These growing tech jobs are mainly located in urban areas, including San Francisco, New York, and Los Angeles. There has also been an increase in freelance work, most heavily concentrated in Oregon, New York, and California.

 

Building a data science team for the enterprise
https://sdtimes.com/data-science/data-science-team-enterprise/ (Fri, 01 Sep 2017)

Data scientists are no magicians, but they are in high demand.

Researchers and analysts in this space recognize the diversity and explosion of Big Data, but the only way enterprises are going to be able to prepare for the future of Big Data is with a data science team capable of working with dirty data, complex problems, and open-source languages, experts in the field say.

According to Forrester research from 2015, 66 percent of global data and analytics decision-makers reported that their firms either expanded or are planning to implement Big Data technologies within the next 12 months. Enterprises today are becoming more serious about Big Data and analytics, and they’re looking to attract data science talent so they can achieve all of their objectives for their data programs.

“There are certain data science rock stars [who are] completely up to speed on deep learning and typically have a doctorate degree,” said Thomas Dinsmore, a Big Data science expert who works at Cloudera. “The big tech companies basically bid up the salaries of those folks, so hiring is challenging and difficult, but not impossible.”

Just take a look at salary data. Glassdoor reports that the national annual salary for a data scientist is $113,436, with big tech companies paying their data scientists anywhere from $108,000 to almost $135,000 annually. And on LinkedIn and other job board sites, recruiters are constantly searching for people that fit the data science role.

Finding a “data rock star”
Part of the reason it’s so difficult to find a data scientist is that the role is still not completely clear in many organizations, said Dinsmore. Companies are not always sure what qualities, characteristics, or background they should be looking for in a candidate. In addition to the data science “rock stars,” there are entry-level data scientists, or those who are young and typically have a great understanding of popular open-source languages, coding, and hacking. And because they are “steeped in this data science culture, they can add value very quickly when they come on board to a large enterprise,” said Dinsmore.

The number one characteristic a solid data scientist candidate should have is the passion to develop insight from data, which Dinsmore says some data scientists have, and some don’t.

“It’s not necessarily a matter of training in a particular language, because this field is changing so fast, the language or framework or library that is most popular today may not be the most popular in two years,” said Dinsmore. “The thing that sets capable data scientists apart is these people typically have gone out and grappled with data, and drawn some sort of insight from it.”

Since data scientists will have the skills needed to work with analytics and data insights, they become critical components of the actual ‘insights teams’ in place in some organizations, according to Forrester analyst Brian Hopkins. It’s the insights teams that build applications that connect data, insights, and action in closed loops through software, he said.

“If utilized in this way, data scientists become critical to achieving an insight advantage, which spells profitable growth in the digital economy,” said Hopkins.

Data scientists should also stay connected to business outcome changes, according to Hopkins. Organizations feel that they need to place their data science team in a room and just feed them data in order for them to do their job, but he disagrees.

“Data scientists need to be embedded in the Agile teams that deliver models and insights directly to impact business changes that are big and differentiating,” said Hopkins. Working this way allows data scientists to stay motivated, and eventually organizations will attract more intelligent data scientists. It also means organizations need to find or provide a data center of excellence, said Hopkins.

“You need a very good working relationship between data science and data engineering, so your engineers are working to free data scientists from having to futz with data access, data quality, and data timeliness,” he added.

Trend towards open data science
Besides the influx of unstructured data, researchers and analysts have noticed a major trend towards open data science. As of today, an overwhelming majority of data scientists prefer open-source data tools, said Dinsmore. In contrast, 10 years ago, if you spoke to working analysts, you would find that commercial tools were the most preferred.

These enterprises were adamant about using standardized tools or one commercial tool, and now they are broadening their horizons and are comfortable working with open-source data science tools, he said. No one is replacing their commercial tools; instead, they are using open-source tools side by side with them in the organization.

Plus, the cadence of development in open-source languages like R and Python is so much faster. Take TensorFlow, for instance. It was released in 2015 and within a few months, it was the most widely developed machine learning package in the ecosystem, said Dinsmore. And it’s still a trending repository on GitHub.

The main issue for data science teams today is not which tool to use, it’s the inherent conflict between the needs of the data scientists and the needs of the IT organization, said Dinsmore.

Data scientists want flexibility, they want the ability to add and work with new packages, and they want to experiment. They want to work with the tools they are comfortable with, said Hopkins, so if they want Python notebooks, so be it. If they need a version of Spark, the company needs to provide it.

On the other hand, the IT organization wants stability, a limited number of discrete types of software, and above all, security, said Dinsmore.

This conflict can be managed, but how well depends on the degree to which an organization can take advantage of open-source tools; locking those tools out can create a “data jail” model, according to Dinsmore. This means data scientists end up running a query and extracting data to their laptops so they can continue to do the work their way. This leads to what Dinsmore calls “laptops in the wild.”

Essentially, there is a whole slew of data scientists doing Big Data science on their laptops with data extracted from those platforms. All of that data is uncovered, insecure, unmanaged, and unsynchronized, said Dinsmore.

A solution is to use containers to give data scientists both a secure environment for the data and contained, isolated instances in which they can work with the packages they choose, he said.
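
For illustration, here is a minimal, hypothetical sketch of that container-based approach using the Docker SDK for Python. The image, mount path, port and container name are placeholders rather than anything Dinsmore recommends; the point is simply that governed data can be mounted read-only into an isolated, disposable workspace instead of being copied to a laptop.

```python
# Hypothetical sketch: image names, paths and ports are placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    "jupyter/datascience-notebook:latest",  # prebuilt, team-approved image
    detach=True,
    name="ds-workbench-alice",
    ports={"8888/tcp": 8888},
    # Governed data is mounted read-only, so nothing needs to be copied off-platform.
    volumes={"/mnt/governed-data": {"bind": "/home/jovyan/data", "mode": "ro"}},
    environment={"JUPYTER_ENABLE_LAB": "yes"},
)
print(f"Started isolated workspace: {container.short_id}")
```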

“It’s very important for organizations to find a way to standardize their data science tooling without sacrificing the ability to experiment and innovate,” said Dinsmore. “That implies choosing a basic platform that will enable experimentation and collaboration, but does not entail just sort of laying down a dictate.”
