Almost all of the classmates and friends in my professional network now work in data science, from a small bank in Cleveland, Ohio, to the FANG companies (Facebook, Amazon, Netflix, Google) in Silicon Valley. From conversations with them, it occurred to me that different data science teams operate at different maturity levels.

Capability Maturity Model (CMM) for Data Science

Evolved from the Capability Maturity Model for Software, the Capability Maturity Model Integration (CMMI) “is a proven set of global best practices that drives business performance through building and benchmarking key capabilities.” CMMI best practices focus on “what needs to be done to improve performance and align operations to business goals.” Applied to the data science practice area, CMMI involves the activities below:

With data as the input, business needs/goals as the expected results, and performance as the actual results, we need to focus on the two areas below to drive performance:

  1. Build up the team capabilities to cover business needs and goals.
  2. Gain maturity in the process.

Capability

Capability is the set of mechanisms necessary to meet business needs and realize business goals. CMMI’s characterization of capability as “an organization’s performance and process improvement achievements in individual practice areas” is flatly wrong. Capability is what the dictionary says it is: “a talent or ability that has potential for development or use.” Capability should thus be assessed not by “levels” but by the degree to which the team’s capacities “cover” the business needs and goals.
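To make the idea of “coverage” concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that business needs and team capabilities can both be written down as sets of labels. The `Team` class, the `coverage` and `gaps` functions, and the example labels are all hypothetical and not part of CMMI.

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    """A data science team described by the capabilities it currently has."""
    name: str
    capabilities: set[str] = field(default_factory=set)


def coverage(team: Team, business_needs: set[str]) -> float:
    """Return the fraction of business needs covered by the team's capabilities."""
    if not business_needs:
        return 1.0  # nothing is needed, so nothing is left uncovered
    covered = business_needs & team.capabilities
    return len(covered) / len(business_needs)


def gaps(team: Team, business_needs: set[str]) -> set[str]:
    """Return the business needs that no team capability covers yet."""
    return business_needs - team.capabilities


# Hypothetical example data, for illustration only.
needs = {"forecasting", "dashboards", "image classification", "pipeline automation"}
team = Team("analytics", {"forecasting", "dashboards", "a/b testing"})

print(f"coverage: {coverage(team, needs):.0%}")
print(f"gaps: {sorted(gaps(team, needs))}")
```

A real assessment would likely weight needs by business priority rather than count them equally, but even this crude ratio shows where to build capability next.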

The diagram below depicts a 3-D space spanned by three sub-practice areas of data science: tools, skill set, and delivery.

For all the popularity Tableau enjoys in the data science community, few people examine what it truly is. It is designed as a visual analysis system that lets data analysts ask and answer questions about data visually. As much as Tableau claims that it can also do data visualization, D3.js is the ultimate tool for custom and complex visualizations of data (Want to visualize COVID-19 genetic info with Tableau? Good luck!). The consumers of Tableau’s visualizations are data analysts, who formulate new hypotheses based on what they see. The consumers of D3.js’s visualizations are end users, who make business decisions based on what they see. As with every spectrum, there is a middle-of-the-range player: Plot.ly, which is built on D3.js and offers higher-level charts with less flexibility than D3.

When it comes to the skill set of a data science team, a common mistake is to focus 100% on math. I recommend having at least one team member with a computer science background, for the reasons below:

  • Many advanced tools in data science are “code first”: D3.js requires knowledge of web technologies such as the DOM and CSS. Effective use of Apache Spark often involves programming in Scala, a functional and object-oriented programming language with advanced features such as pattern matching and type inference…
  • Many technologies in data science are computational in nature: differentiable and probabilistic programming, which underpin the recent success of deep artificial neural networks with image and natural language data, are nothing but numerical optimization and simulation.
  • The number one value data science provides to business is not analytics; it is automation. Tagging an image of a silicon wafer as “defective” is not that amazing: a human can effortlessly make the same call based on an apparent crack in the wafer. But sifting through thousands if not millions of wafer images and stopping the production line when the defect rate exceeds a specified threshold, now that automatic process is amazing! Automating decision making with data science requires mostly computer science skills.
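The wafer example in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `defect_rate_exceeded` is a hypothetical helper, the `is_defective` callable stands in for whatever trained image classifier the team has, and the `min_samples` guard is my own addition so that one early defect does not halt the line.

```python
from typing import Callable, Iterable


def defect_rate_exceeded(
    images: Iterable[object],
    is_defective: Callable[[object], bool],
    threshold: float,
    min_samples: int = 100,
) -> bool:
    """Return True as soon as the running defect rate exceeds `threshold`.

    The rate is only checked once `min_samples` images have been seen, so a
    single early defect does not trigger a stop.
    """
    seen = 0
    defects = 0
    for image in images:
        seen += 1
        if is_defective(image):
            defects += 1
        if seen >= min_samples and defects / seen > threshold:
            return True
    return False


# Stand-in data: every 20th wafer is cracked (a 5% defect rate).
wafers = [{"id": i, "cracked": i % 20 == 0} for i in range(1, 1001)]


def classifier(wafer: dict) -> bool:
    """Placeholder for a real image classifier."""
    return wafer["cracked"]


if defect_rate_exceeded(wafers, classifier, threshold=0.02):
    print("Stop the production line")
```

Note how little of this is math: the classifier is one function call, and everything around it (streaming the images, keeping counts, deciding when to act) is ordinary software engineering.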

Process and Maturity

According to the American Heritage College Dictionary, a process is “a series of actions, changes, or functions bringing about a result.” The IEEE defines a process as “a sequence of steps defined for a given purpose.”

Process maturity is “the extent to which a specific process is explicitly defined, managed, measured, controlled and effective.”

As a data science team gains in process maturity, it “institutionalizes its process via policies, standards, and organization structures. Institutionalization entails building an infrastructure and a corporate culture that supports the methods, practices, and procedures of the business so that they endure after those who originally defined them have gone.”

In Summary

Building up capabilities to cover business needs and goals, and gaining maturity in the processes that drive performance, are the two areas a new data science team can focus on to ramp up value creation for the business.

References:

  1. CMMI Model At-A-Glance offers a preview of the available capabilities and practice areas.
  2. I’m having trouble reconciling CMMI with other process formalisms such as DMAIC of Lean Six Sigma and IDEF0. The work is ongoing, but of one thing I’m pretty sure: they are all good!
  3. Can’t agree more with the “Analytics is code” principle of the DataOps Manifesto!
  4. Discussions from the field about the data scientist job