Months of hard work culminated in this release: we’ve rebuilt existing features and introduced new ones, such as data lineage, to make working with your datasets more powerful and collaborative.
Query your Hub datasets in Platform using our highly-performant query engine, powered by C++
Today’s release of Hub 2.7.1 improves querying and data lineage (you can check out how these features work together in this playbook).
Querying just got a major revamp. This allows you to query large datasets in seconds using SQL-like queries. Example queries can be:
(select * where contains(labels, 'person') limit 1000) union (select * where contains(labels, 'frisbee') limit 1000)

select * where contains(labels, 'car') and contains(labels, 'truck')

select * where shape[0] > 10

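To make the semantics of the first two queries concrete, here is a minimal plain-Python sketch that mimics them on a tiny in-memory list. It does not use Hub or its query engine; the sample data and the helpers `where_contains` and `union` are hypothetical, and the choice to drop duplicates in `union` is an assumption of this sketch rather than documented engine behavior.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for a dataset. This is NOT Hub's storage
# format; each sample is just a dict with an id and a list of label names.
samples: List[Dict] = [
    {"id": 0, "labels": ["person", "car"]},
    {"id": 1, "labels": ["frisbee"]},
    {"id": 2, "labels": ["car", "truck"]},
    {"id": 3, "labels": ["person", "frisbee"]},
    {"id": 4, "labels": ["dog"]},
]


def where_contains(rows: List[Dict], label: str, limit: Optional[int] = None) -> List[Dict]:
    """Mimic `select * where contains(labels, label) limit N`."""
    hits = [row for row in rows if label in row["labels"]]
    return hits if limit is None else hits[:limit]


def union(*results: List[Dict]) -> List[Dict]:
    """Concatenate results, keeping the first occurrence of each sample id.

    Deduplication is an assumption made for this sketch.
    """
    seen, merged = set(), []
    for rows in results:
        for row in rows:
            if row["id"] not in seen:
                seen.add(row["id"])
                merged.append(row)
    return merged


# (select ... 'person' limit 1000) union (select ... 'frisbee' limit 1000)
person_or_frisbee = union(
    where_contains(samples, "person", limit=1000),
    where_contains(samples, "frisbee", limit=1000),
)

# select * where contains(labels, 'car') and contains(labels, 'truck')
car_and_truck = [row for row in where_contains(samples, "car") if "truck" in row["labels"]]
```

The `limit` is applied to each sub-query before the union, which mirrors how the parentheses group the first example query.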
Upon querying, you may now save query results or subsets of your data and optimize them for training.
ds_view = ds[1:2:100].save_view()  # Saving views
ds_view = ds.load_view(id, optimize=True, num_workers=4)  # Loading views

If optimize = True, the method copies and re-chunks the subset of the data so that it’s materialized for streaming. And, finally,
ds_view = ds.delete_view(id)  # Deleting views

Full details on how/where dataset views are saved are available here
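To illustrate why optimizing a view helps with streaming, here is a toy sketch in plain Python. It is not Hub's implementation: `take_view`, `materialize`, and the `chunk_size` parameter are hypothetical names used only to show the idea that a view is a set of indices into the original data, while an optimized view is a fresh copy packed into its own contiguous chunks.

```python
from typing import List, Sequence


def take_view(data: Sequence, start: int, stop: int, step: int) -> List[int]:
    """A view only records which indices it covers; no samples are copied."""
    return list(range(len(data)))[start:stop:step]


def materialize(data: Sequence, indices: Sequence[int], chunk_size: int) -> List[list]:
    """Copy the viewed samples and pack them into new, dense chunks."""
    picked = [data[i] for i in indices]
    return [picked[i:i + chunk_size] for i in range(0, len(picked), chunk_size)]


# Hypothetical dataset of 20 samples.
data = [f"sample_{i}" for i in range(20)]

# The view touches every fifth sample, so its samples are scattered
# across the original storage.
view = take_view(data, 0, 20, 5)

# After materializing, the same samples sit next to each other and can be
# read chunk by chunk without fetching unrelated data.
chunks = materialize(data, view, chunk_size=2)
```

In this sketch, reading the unoptimized view would touch positions spread across the whole dataset, while the materialized copy needs only two small chunks.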
The enhanced querying feature is available on all Activeloop datasets for free, and on Growth & Enterprise plans for private datasets.
No more need to copy all your data to Hub format
Based on user feedback, we’ve decided to ship a feature that allows your Hub datasets to store references to your data using hub.link. This is now possible when the original data is saved as discrete files (images, videos, audio, etc.).
When you want to materialize your data for training, save your data as a dataset view and optimize (rechunk) it using the ds_view = ds.load_view(id, optimize = True) API above.
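The idea of storing references instead of copies can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not call Hub: the `LinkedColumn` class, the temporary file, and its contents are all hypothetical stand-ins for a tensor that holds links to files living elsewhere.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkedColumn:
    """Toy column that stores file paths only; bytes are read on access."""

    paths: List[str] = field(default_factory=list)

    def append(self, path: str) -> None:
        # Only the reference is stored, so appending is cheap and the
        # original file stays where it is.
        self.paths.append(path)

    def read(self, index: int) -> bytes:
        # The data is fetched from its original location when requested.
        with open(self.paths[index], "rb") as f:
            return f.read()


# Create a stand-in "image" file to link to (placeholder bytes, not a real JPEG).
tmp_dir = tempfile.mkdtemp()
image_path = os.path.join(tmp_dir, "cat.jpg")
with open(image_path, "wb") as f:
    f.write(b"fake image bytes")

images = LinkedColumn()
images.append(image_path)
payload = images.read(0)
```

The trade-off shown here is the one described above: links avoid an up-front copy, while a materialized, re-chunked copy is the better fit once you stream the data for training.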
More collaboration for less
Up to 10 collaborators per organization
The best datasets are built by teams of people, and Activeloop exists to make such collaboration more seamless. That’s why, with this update, we’ve decided to increase the capabilities of our Community plan, opening it up to 10 collaborators (instead of 3 previously). We believe this will help smaller ML teams across companies and in academia to be more productive together.
Growth plan is more accessible
Simultaneously, we’ve decided to lower the price of the Growth plan to just $495 per month. This change will be reflected in our self-serve option starting this week.
Additional Features
- Delete samples from your data using ds.pop(index) or ds.tensor.pop(index)
- Instead of returning all data as numpy arrays, access more information about your data using ds.tensor[index].data(), which returns a dictionary containing all the important information about your sample.
Note that for htype = class_labels, the key “numeric” has been changed to “value” in .data(), in order to make it more consistent with other htypes.
- Performance optimizations for small samples and bug fixes.
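If you have code that reads the old “numeric” key from class-label samples, a small compatibility helper can smooth the transition. The sketch below works on plain dictionaries and does not call Hub; the helper name `class_label_value` is hypothetical, and the example dictionaries (including the "text" key) are assumed shapes for illustration rather than exact .data() output.

```python
from typing import Any, Dict


def class_label_value(sample_data: Dict[str, Any]) -> Any:
    """Return the label indices from a .data()-style dict.

    Prefers the new "value" key and falls back to the older "numeric" key,
    so the same code works before and after the rename.
    """
    if "value" in sample_data:
        return sample_data["value"]
    return sample_data["numeric"]


# Assumed example payloads; the real dictionaries may contain other keys.
new_style = {"value": [3], "text": ["car"]}
old_style = {"numeric": [3], "text": ["car"]}

new_result = class_label_value(new_style)
old_result = class_label_value(old_style)
```

Once all of your environments run the new release, the fallback branch can be removed and the "value" key read directly.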