Hub 2.3.3 is out! Version control upgrade, helper functions, GSOC 2022, and exciting community contributions. Here’s what’s new.
New Hub features
Now you can delete uncommitted changes using ds.reset(). Also, with Hub 2.3.3 you can merge branches and commits using ds.merge(). Copying datasets from one location to another is now possible using hub.copy() and hub.deepcopy() (includes version control history). Metadata from file headers appended using hub.read(fn) is now automatically stored in ds.tensor_name.sample_info.
Community shoutouts
Abid Ali Awan has written a great guide on Hub and the Activeloop Platform for KDnuggets!
Alex Wang has uploaded and documented the KMINST dataset on our Machine Learning Datasets Catalogue.
Manas Gupta has documented the Google Objectron dataset on our Machine Learning Datasets Catalogue.
Paul created an example for using Hub, Tensorboard & Docker to train a model in PyTorch.
Jinyi Chen is currently finalizing the Chinese version of the readme! Let us know if you’d like to translate it into other languages.
Bikram Maharjan is working on support for additional image formats in hub.auto Also thanks to Suhaas Neel for the multiple PRs he’s working on!
GSOC 2022
GSOC proposals opened yesterday, make sure you contribute/finalize your PRs by April 19 and apply!
1import requests
2import tqdm
3from typing import List
4
5#financial reports of Amazon, but can be replaced by any URLs of pdfs
6urls = ['https://s2.q4cdn.com/299287126/files/doc_financials/Q1_2018_-_8-K_Press_Release_FILED.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/Q2_2018_Earnings_Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q318-Amazon-Earnings-Press-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/AMAZON.COM-ANNOUNCES-FOURTH-QUARTER-SALES-UP-20-TO-$72.4-BILLION.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/Q119_Amazon_Earnings_Press_Release_FINAL.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q2-2019-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q3-2019-Amazon-Financial-Results.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q4-2019-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/Q1/AMZN-Q1-2020-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q2/Q2-2020-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q4/Amazon-Q4-2020-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q1/Amazon-Q1-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q2/AMZN-Q2-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q3/Q3-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q4/business_and_financial_update.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q1/Q1-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q2/Q2-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Q3-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q4/Q4-2022-Amazon-Earnings-Release.pdf' ]
7
React 2023
---------
1import requests
2import tqdm
3from typing import List
4
5#financial reports of Amazon, but can be replaced by any URLs of pdfs
6urls = ['https://s2.q4cdn.com/299287126/files/doc_financials/Q1_2018_-_8-K_Press_Release_FILED.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/Q2_2018_Earnings_Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q318-Amazon-Earnings-Press-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/AMAZON.COM-ANNOUNCES-FOURTH-QUARTER-SALES-UP-20-TO-$72.4-BILLION.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/Q119_Amazon_Earnings_Press_Release_FINAL.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q2-2019-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q3-2019-Amazon-Financial-Results.pdf', 'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q4-2019-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/Q1/AMZN-Q1-2020-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q2/Q2-2020-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q4/Amazon-Q4-2020-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q1/Amazon-Q1-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q2/AMZN-Q2-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q3/Q3-2021-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q4/business_and_financial_update.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q1/Q1-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q2/Q2-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Q3-2022-Amazon-Earnings-Release.pdf', 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q4/Q4-2022-Amazon-Earnings-Release.pdf' ]
7
8def load_reports(urls: List[str]) -> List[str]:
9 """ Load pages from a list of urls"""
10 pages = []
11
12 for url in tqdm.tqdm(urls):
13 r = requests.get(url)
14 path = url.split('/')[-1]
15 with open(path, 'wb') as f:
16 f.write(r.content)
17 loader = PagedPDFSplitter(path)
18 local_pages = loader.load_and_split()
19 pages.extend(local_pages)
20 return pages
21
22pages = load_reports(urls)
23