← Back to Blog
Friday, September 9, 2022

Modern Data Stack – The Missing Piece of Every MDS

The Modern Data Stack is incomplete without an effective Data Activation Platform that facilitates the journey from raw data to business impact and ROI.


As a data practitioner (data analyst, data scientist, data science manager, data engineer….and so on), I’ll bet my t-shirt that you have heard the term Modern Data Stack (“MDS”) in the last 2 years. And, if you have not, then I’ll send you a shiny t-shirt (link at the bottom!) while I introduce the MDS and then discuss what it’s missing.

What is the Modern Data Stack?

You know a story is iffy at best when it changes from vendor to vendor, platform to platform, and from one use case to another. Truth be told, there is no singular "MDS". Every company that claims to have a "modern" tool defines the MDS differently. I'll spare you the work of reviewing these differences and highlight the key factors that define what the MDS actually is:

The MDS refers to a collection of complementary tools & platforms that are easy to integrate, are largely scriptable, and, collectively, solve challenges primarily related to data ingestion, storage and application:

  • Storing the data in a cost efficient manner;
  • Publishing it for use by humans, batch-processes and other platforms;
  • Transforming it to increase usability and efficiency;
  • Facilitating business value creation from it via intelligence tools (BI, ML, AI), and 
  • Enabling clear observability & organizational security.

Together, the various MDS tools facilitate:

  • Data Discovery – connect to, find and explore the data
  • Data Sharing / Collaboration – share data collections with the team/organization
  • Secure Governance – discover and share in a secure and transparent manner
  • Workflows – combine (often repetitive) tasks into workflows 
  • Personalized Exploration – easily discover and work with the data/analyses/scripts that matter to them
  • Data Products – using all the features of the tool(s) to generate insights and deploy those insights into action in the form of data products (dashboards, apps, APIs)

Here is an exhaustive diagram of the toolsets in the MDS and their use cases. There is a good reason why it’s called a “stack”!

[Diagram: Modern Data Stack (Detailed System Diagram)]
Adapted from Emerging Architectures for Modern Data Infrastructure — by partners @ a16z

 

If that diagram doesn’t strain your eyes, then good on you! Such an infrastructure is usually managed by a large team of engineers (often more than 20) at every company that wants to build a modern data infrastructure.

In essence, the MDS is a set of cloud-based, data-centric technologies that empower users to explore and use data. The slightly simpler version below is a bit more approachable and understandable.

[Chart: Modern Data Stack (Toolset)]
An overview chart of the Modern Data Stack by @ValentinUmbach

What does the Modern Data Stack do well?

The tools in the MDS are leaps and bounds ahead of traditional tools for ingestion, streaming, storage, transformation and analytics. Tools like Fivetran, Segment and Rudderstack make ingestion almost plug-and-play; the time to get started and create value from the investment shrinks to days, sometimes hours. For transformation, dbt combines the best practices of the software industry with the convenience of writing data pipelines in SQL. dbt has democratized data pipelines and is arguably one of the most loved tools in the data engineering space.

So, what’s missing?

Imagine you lead the product recommendation engine in the Starbucks app, and your goal is to engage the customer with an enticing upsell offer for their next drink. You use multiple tools in your workflow to:

  • build a pipeline from streaming transactions data,
  • transform it with SQL/NoSQL,
  • run an ML recommendation inference with a previously trained model using a Python script,
  • return a product recommendation,
  • write the results to a production table, and
  • send the promotion notification to the customer.

Transforming Raw Data into Business Value – A typical end-to-end analytics project that traverses from raw data to an action that generates business value

Eight years ago when I was working on Consumer Insights at Starbucks, something like this was impossible to achieve unless you wanted to write and maintain all of it in Java! But now, you can achieve this with a combination of SQL and Python, both of which are really common skills among data practitioners.

The MDS has enabled data practitioners to approach problems like this with a high degree of confidence. However, the MDS is still missing a piece that is critical to deriving value (right side of the diagram) from all the data captured and all the investment in the tools (left side of the diagram). In the above example, you would have to patch together many analytics tools to successfully deliver this project:

  • an ingestion tool (to tap into streaming data),
  • a webhook (likely on a cloud platform like AWS),
  • an analytics tool that can run Python to call the ML inference code and return the response,
  • an API or a script to write the recommendations to a production table, and
  • an API platform plus an auto-scaling mechanism to make sure the infrastructure doesn’t get bogged down with requests.
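To make the flow concrete, here is a minimal, self-contained sketch of the recommendation logic itself. Everything in it is hypothetical: the events come from a hard-coded list instead of a streaming source, and a deliberately naive popularity heuristic stands in for the trained ML model. In the real project, each step would run in one of the tools listed above.

```python
# Hypothetical sketch of the upsell-recommendation flow described above.
# Stand-ins: in-memory records instead of a streaming pipeline, and a
# naive heuristic instead of a trained recommendation model.

def transform(events):
    """SQL-style transform: keep completed orders, group products by customer."""
    purchases = {}
    for e in events:
        if e["status"] == "completed":
            purchases.setdefault(e["customer_id"], []).append(e["product"])
    return purchases

def recommend(history, catalog):
    """Stand-in for ML inference: suggest the catalog item the customer
    has bought least often (a naive upsell heuristic)."""
    counts = {p: history.count(p) for p in catalog}
    return min(counts, key=counts.get)

# "Streaming" transaction events (hypothetical data).
events = [
    {"customer_id": 1, "product": "latte", "status": "completed"},
    {"customer_id": 1, "product": "latte", "status": "completed"},
    {"customer_id": 1, "product": "mocha", "status": "completed"},
    {"customer_id": 2, "product": "espresso", "status": "cancelled"},
]
catalog = ["latte", "mocha", "cold brew"]

purchases = transform(events)
offers = {cust: recommend(hist, catalog) for cust, hist in purchases.items()}

# In production, the offers would be written to a table and pushed out as
# notifications; here we just print them.
print(offers)  # prints {1: 'cold brew'}
```

The point of the sketch is not the toy heuristic but the shape of the project: every step is plain SQL-style filtering or Python, yet each one typically lives in a different tool.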

That infrastructure work gets in the way of actually creating value from the data. 

What’s missing in the MDS is a “data activation” tool. Data Activation is defined as the ability (including ease and speed) to discover, explore, derive meaningful insights, and put those insights into action.

A typical journey, involving many tools, stakeholders and teams, that a data analyst or data scientist needs to traverse in order to go from data to impact.

Most organizations patch together 5+ tools to traverse the data activation journey, creating an inefficient ecosystem that forces data practitioners (data scientists, data analysts and data engineers) to continuously hop between tools to fully execute a data project like this one. Managing these systems, integrations, permissions, auto-scaling and training is extremely expensive. For a small or medium business, it is nearly impossible to hire and retain the talent needed to build and maintain a reliable multi-tool MDS.

This highlights the need to unify the capabilities of analytics tools (those on the right side of the stack) into a single platform that minimizes tool hopping while still allowing users to leverage other tools as needed.

The Ideal Data Activation Platform

With all the work that has gone into making sure that data is easily ingested, easily stored and transformed, the focus now is on the later stages of that diagram — i.e. do the analytics tools in the MDS make it easy for BI engineers, data scientists and data leaders to generate value from data? In other words, is there a tool that enables you to:

  • Connect to and explore organizational data with SQL, Python, R, or a combination of them,
  • Build interactive tables & data visualizations,
  • Train and deploy Machine Learning models in any library,
  • Collaborate across teams, stakeholders and customers over reports, insights, data points or code, like you would in a Google doc,
  • Democratize insights and best practices across your organization,
  • Prototype/deploy/publish data products (dashboards, APIs, web applications),
  • Orchestrate and schedule entire projects or parts of them,
  • Present and publish a data story supported by charts, code, descriptions and comments,
  • Measure and improve organizational engagement with data resources, code quality standards,
  • Not worry about the infrastructure, compute resources, workloads, configurations and security

Enter Jupyter Notebook (and Noteable)

With over 10 million Jupyter Notebooks on GitHub (as of Dec 2020) and over 500 open source contributors to the Jupyter project, notebooks have emerged as the powerhouse enabling data practitioners to efficiently work with data and reduce the Time-to-Insight.

Collaborative notebook platforms like Noteable that build on the Jupyter Notebook protocols have fulfilled a long wishlist of features, growing notebooks into a versatile data activation tool. Notebooks, by design, are language agnostic and give the end user the ability to combine code, descriptions and visualizations in a single readable document. This enables a data practitioner to cover a large part of the data activation journey in a single platform, saving time, reducing errors, and eliminating inefficiencies.
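That "single readable document" quality is visible in the file format itself: an .ipynb file is just JSON. As a rough illustration, here is a hand-built sketch of the nbformat v4 structure, with illustrative cell contents of my own invention, showing how a markdown cell, a code cell, and its captured output live side by side:

```python
import json

# A minimal notebook in the Jupyter nbformat v4 JSON structure.
# The kernelspec (not the format) determines the language, which is
# why notebooks themselves are language agnostic.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {   # Prose lives in markdown cells, right next to the code.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Daily revenue\n", "Explanation lives next to the code."],
        },
        {   # Code cells carry both the source and the captured output.
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["revenue = sum([120, 80, 200])\n", "revenue"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": ["400"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

serialized = json.dumps(notebook, indent=2)  # what sits on disk as .ipynb
print(len(notebook["cells"]), "cells")  # prints: 2 cells
```

Because the whole artifact is one JSON document, code, narrative and results travel together, which is what lets a notebook cover so much of the data activation journey in one place.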

Noteable — the missing piece of your Modern Data Stack

 

Noteable — Powerful features for technical teams. Simplicity for everyone else.

 

At Noteable, we are building a collaborative data workspace with a vision to enable everyone with data. That vision has guided us to build the most collaborative and interactive notebook experience, one that eliminates the friction points that currently exist in a data team’s workflow (think ideating, working on, sharing, and managing any data project). And we have done that while keeping in mind the various segments of our audience — the hobbyist, the Kaggler, the learner, the teacher, the reviewer, and the enterprise data team.

Here is the complete list of Noteable’s features:

  • Leverage notebooks to run queries, visualize, share insights
  • Analyze data with SQL, Python, and R or a combination of those
  • Interactively explore, clean and transform data
  • Build no-code interactive visualizations and dashboards
  • Collaborate with team, stakeholders and customers around interactive visualizations, analysis, code, data points and documents
  • Easily connect with a number of data sources
  • Easily launch, upgrade and manage compute resources right from the notebook
  • Train and Deploy Machine Learning models with any library
  • Enterprise-grade security, permissions and secrets store
  • Build ETL workflows using notebooks as data pipelines or batch processes
  • Use it as part of your DevOps toolkits
  • Run headless as an inference engine
  • Prototype and publish data products, dashboards and web applications
  • Managed infrastructure that scales with your needs and workloads

By focusing on Data Collaboration, Noteable has, essentially, brought together capabilities that currently live in a disparate set of tools into an enterprise-grade collaborative notebook experience. When cross-functional teams interact with the same tool to make decisions and glean insights, it creates shared context and encourages precise exploration, improves productivity, and eliminates communication gaps.

Noteable, by design, supports the best practices of analytics, brings the ease of use of BI tools to notebooks and tops it off with enterprise grade security, versioning and scheduling capabilities to meet the needs of any data team.

Together, these features make Noteable a powerful data activation platform, empowering data teams to:

  • quickly get value out of their organization’s data,
  • explore & visualize effectively,
  • collaborate, discuss and present efficiently,
  • turn their work into data products,
  • securely manage permissions and secrets,
  • and, in turn, deliver impact with actionability.

Noteable is free to try for as long as you’d like. Sign up and try it yourself –>

(…and then, be on the lookout for an email regarding that shiny Noteable t-shirt that I promised!)