Executive Summary
One in three ML projects fails due to the lack of a solid data foundation. Projects suffer from low-quality data, under-utilized compute resources, and the significant labor overhead required to build and maintain large datasets. Traditional data lakes break down data silos for analytical workloads, enable data-driven decision-making, improve operational efficiency, and reduce organizational costs. However, most of these benefits are unavailable for deep learning workloads such as natural language processing (NLP), audio processing, and computer vision in verticals like agriculture, healthcare, multimedia, robotics/automotive, and safety & security. Hence, time and again, organizations opt to build in-house systems.
Deep Lake maintains the benefits of a vanilla data lake and enables you to iterate on your deep learning models 2x faster without teams spending time building complex data infrastructure.
Deep Lake stores complex data, such as images, videos, annotations, embeddings, and tabular data, in the form of tensors and rapidly streams the data over the network to the Tensor Query Language, an in-browser visualization engine, and deep learning frameworks without sacrificing GPU utilization. As deep learning rapidly takes over traditional computational pipelines, storing datasets in a Deep Lake will become the new norm.
Behind the Scenes at Activeloop
In 2016, before starting the company, I began my Ph.D. research at the Connectomics lab at the Princeton Neuroscience Institute. Within just a few years, I witnessed the transition from gigabyte to terabyte and then to petabyte-scale datasets in the pursuit of super-human accuracy in reconstructing neural connections inside a brain. Our problem was to figure out how to cut costs 4-5x by rethinking how the data is stored and streamed from storage to compute, which models to use, and how to compile and run them at scale. While the industry moved more slowly, we observed how similar patterns repeat themselves on a much larger scale.
We started Activeloop (formerly known as Snark AI) as part of the Y Combinator Summer 2018 batch to help organizations deploy deep learning solutions more efficiently. We helped build a large language model for patents for a legal tech startup and streamable data pipelines for a petabyte-scale machine learning use case in AgriTech. Through trial and error and conversations with hundreds of companies, we learned that all the awesome databases, data warehouses, and data lakes (now joined by lakehouses) are great at analytical workloads but far less so for deep learning applications. The demand for storing unstructured data such as audio, video, and images has exploded over the years (more than 90% of data is now generated in unstructured form). We knew that building the database for AI, the solution for storing this data, was the right challenge for us.
In 2020, we open-sourced the dataset format called “Hub”, which enabled storing images, videos, and audio as chunked arrays on object storage and connecting them to deep learning frameworks such as PyTorch or TensorFlow. We collaborated with teams from Google AI, Waymo, Oxford University, Yale University, and other deep learning groups to figure out the nuts and bolts of a solid data infrastructure for deep learning applications.
In 2021, the open-source project trended #1 in Python and #2 across all GitHub repositories, and was even named one of the top 10 Python ML packages. As of writing this post, the project has 4.8K stars, 75+ contributors, and 1K+ community members. It is in production at research institutions, startups, and public companies alike.
We also released the managed version of Activeloop that lets you visualize datasets, version-control and query them, and stream them to deep learning frameworks. Apart from providing access to 125+ machine learning datasets, it enables sharing private datasets and collaborating on building and maintaining datasets across organizations. Of course, I couldn’t be more proud of our small and under-resourced team for achieving these results in such a short time, but the industry has been innovating at a staggering speed.
Large Foundational Models Taking the World by Storm
In just a few years, deep learning has achieved super-human accuracy in applications across industries: detecting cancer from X-ray images, anatomically reconstructing human neural cells, playing highly complex games such as Dota or Go, driving cars, folding proteins, holding human-like conversations, generating code, and even creating realistic images that took the internet by storm (it took about 40 words to craft the perfect prompt, but AI generated the stunning title image of this post). Three factors enable this speed: (1) novel architectures such as Transformers, (2) massive compute capabilities using GPUs or TPUs, and (3) the large volume of datasets such as ImageNet, CommonCrawl, and LAION-400M.
At Activeloop, we firmly believe that connecting deep learning models to the value chain over the next five years will produce a foundational shift in the global economy. While innovators have primarily focused on models and computing hardware, maintaining or streamlining the complex data infrastructure has been an afterthought. In the build-versus-buy dilemma, organizations (for the lack of a “buy” option) repeatedly build hard-to-manage in-house solutions. All this led us to decide on the next chapter for the company - Deep Lake.
Introduction to Deep Lake, the Data Lake for Deep Learning
What is Deep Lake?
Deep Lake is a vanilla data lake for deep learning, but with one key difference. Deep Lake stores complex data, such as images, audio, videos, annotations, embeddings, and tabular data, in the form of tensors and rapidly streams the data over the network to the Tensor Query Language, an in-browser visualization engine, or deep learning frameworks without sacrificing GPU utilization.
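To make this concrete, here is a minimal sketch of storing complex data as tensors, assuming the open-source deeplake 3.x Python package; the local path, tensor names, and random sample data are purely illustrative.

```python
import numpy as np
import deeplake  # assumes the open-source deeplake 3.x package

# Create an empty dataset on local disk; an S3 or Activeloop cloud path works the same way.
ds = deeplake.empty("./animals-ds", overwrite=True)

# Declare tensor "columns" for the complex data types Deep Lake stores.
ds.create_tensor("images", htype="image", sample_compression="jpeg")
ds.create_tensor("labels", htype="class_label")
ds.create_tensor("embeddings", dtype="float32")

# Append samples row by row; under the hood each tensor is stored as chunked arrays.
with ds:
    for _ in range(10):
        ds.append({
            "images": np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8),
            "labels": int(np.random.randint(0, 2)),
            "embeddings": np.random.rand(512).astype(np.float32),
        })
```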
Deep Lake provides key features that make it the optimal data storage platform for deep learning applications, including:
- A scalable and efficient data storage system that can handle large amounts of complex data in a columnar fashion
- Querying and visualization engine to fully support multimodal data types
- Native integration with deep learning frameworks and efficient streaming of data to models and back
- Seamless connection with MLOps tools
Machine Learning Loop with Deep Lake
These are the five fundamental pillars of Deep Lake (a short code sketch after the list illustrates two of them):
- Version Control: Git for data
- Visualize: In-browser visualization engine
- Query: Rapid queries with Tensor Query Language
- Materialize: Format native to deep learning
- Stream: Streaming Data Loaders
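As a rough illustration of the version-control and query pillars, here is a hedged sketch assuming the open-source deeplake 3.x Python API; the dataset path, branch name, and query string are illustrative, and streaming is sketched in the next section.

```python
import deeplake  # assumes the open-source deeplake 3.x package

ds = deeplake.load("./animals-ds")  # e.g., the local dataset from the earlier sketch

# Version Control: Git-style commits and branches on the data itself.
ds.commit("initial labeled snapshot")
ds.checkout("relabeling-experiment", create=True)
# ... fix labels, append new samples, etc. ...
ds.commit("relabeled a subset of samples")
ds.checkout("main")  # switch back; the dataset now reflects the main branch again

# Query: Tensor Query Language runs in the C++ engine but is exposed through Python.
# Commented out here because it requires the query engine to be available.
# positives = ds.query("SELECT * WHERE labels == 1")
```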
We discuss these features in depth in the Deep Lake White Paper and shed light on how everything works in the Academic Paper.
Deep Lake and the Data Loader Landscape
Data loaders are one of the more significant bottlenecks of machine learning pipelines (Mohan et al., 2020), and we’ve built Deep Lake specifically to resolve the data-to-compute handoff bottleneck.
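As a rough sketch of what that handoff looks like in practice, the following streams a publicly hosted dataset straight into a PyTorch training loop (a hedged example assuming the open-source deeplake 3.x API; the dataset path and loop body are illustrative).

```python
import deeplake  # assumes the open-source deeplake 3.x package

# Stream a publicly hosted dataset over the network (illustrative path).
ds = deeplake.load("hub://activeloop/cifar10-train")

# Chunks are fetched and decompressed by background workers while the model
# consumes earlier batches, which keeps the GPU fed.
loader = ds.pytorch(batch_size=64, num_workers=4, shuffle=True)

for batch in loader:
    images, labels = batch["images"], batch["labels"]
    # forward/backward pass would go here
    break
```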
We are thankful to Ofeidis, Kiedanski, and Tassiulas from the Yale Institute for Network Science, who spent considerable time producing an independent, extensive survey and benchmark of open-source data loaders. The research concluded that the third major iteration of our product, Deep Lake, is not only 2x faster than the previous version but also outperforms other data loaders in various scenarios.
*Comparing the performance of Activeloop Hub, Deep Lake, and Webdataset when loading data from different locations: Local, AWS, and MinIO (Ofeidis et al., 2022)*
*Speed as a function of the number of workers for RANDOM on a single GPU (Ofeidis et al., 2022)*
The reasoning behind some of Deep Lake’s architectural decisions
Naturally, it took a lot of thinking and iteration cycles to arrive at the way Deep Lake is architected - here are a few of the considerations behind it.
Where does Deep Lake fit in MLOps?
As numerous MLOps tools enter the market, it becomes hard for buyers to make sense of the landscape. We collaborated with the AI Infrastructure Alliance to craft a new MLOps blueprint that provides a clear overview across tools. The blueprint goes bottom-up from infrastructure to human interface and left-to-right from ingestion to development. In the blueprint, Deep Lake takes on the role of a solid data foundation.
Why did we rename Hub to Deep Lake?
Originally, Hub was a chunked array format that naturally evolved with version control, a streaming engine, and query capabilities. Our broad open-source community - users from companies, startups, and academia - was instrumental in iterating on the product. Increasingly, we found the name too generic a descriptor (or, as one of our team members put it, “everyone has a Hub nowadays”). Often, it would cause confusion with dataset hubs. Internally, we were already calling it a “deep lake” (or naming it after the deepest lakes in the world). We were delighted to see people like A. Pinhasi thinking in the same direction. Overnight, we started calling the tool we’re building “deeplake” instead of “hub”, and it felt just right (although our marketing department wasn’t too thrilled on account of freshly-ordered swag with the Hub branding).
pip3 install deeplake
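A quick way to smoke-test the freshly renamed package (a hedged example; the public dataset path below is illustrative):

```python
import deeplake

# Load one of the publicly hosted Activeloop datasets in read-only mode.
ds = deeplake.load("hub://activeloop/mnist-train")

print(ds.tensors.keys())           # the tensor "columns" in the dataset
print(ds.images[0].numpy().shape)  # lazily fetch a single sample as a numpy array
```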
Is there a Deep Lakehouse, and where does it come into play?
The format, including versioning and lineage, is fully open-source. The query, streaming, and visualization engines are built in C++ and are closed-source for the time being. Nonetheless, they are accessible via a Python interface for all users. Being committed to open-source principles, we plan to open-source the high-performance engines as they commoditize.
Does Deep Lake connect to the Modern Data Stack and MLOps tools?
The Deep Lake Airbyte destination allows ingesting datasets from a vast number of data sources. On the MLOps side, we have been collaborating with Weights & Biases, Heartex Label Studio, Sama, CleanLab, AimStack, and Anyscale Ray to provide seamless integrations, which we are going to release in subsequent blog posts.
What’s next for Deep Lake?
As Deep Lake evolves, we will continuously optimize performance. A custom data sampler and sub-tile queries for constructing complex datasets are planned for the 3.1.0 release, while performant TensorFlow support and ACID transactions are scheduled for the 3.2.0 release (watch our GitHub repo to stay tuned).
We believe that the next step for AI research is to capture text, audio, images, and videos with large multi-modal foundational models. Just think about how long it took to get to DALL-E, and how much shorter the path was from that milestone to Stable Diffusion or Make-A-Video by Meta AI. Having a solid data infrastructure is going to be a necessary condition for delivering those models into consumers’ hands. As deep learning rapidly takes over traditional computational pipelines, storing datasets in a Deep Lake is becoming the new norm.
You can dive right into Deep Lake (yes, we will be making endless water puns) by trying out the Getting Started with Deep Lake Colab, and check out our new C++ dataloader and query engine (Alpha) in this Colab. Join our Slack community or book an introductory call with us if you want to start onboarding immediately.
Citations
- The Future of Deep Learning with Deep Lake. Activeloop
- Hambardzumyan, Sasun, et al. "Deep Lake: a Lakehouse for Deep Learning." arXiv preprint arXiv:2209.10785 (2022).
- Pinhasi, A. "Deep Lake — an architectural blueprint for managing Deep Learning data at scale — part I."
- Ofeidis, Kiedanski, and Tassiulas. "An overview of the data-loader landscape: comparative analysis." arXiv preprint arXiv:2209.13705 (2022).
- Mohan, Jayashree, et al. "Analyzing and mitigating data stalls in DNN training." arXiv preprint arXiv:2007.06775 (2020).