Introduction

If you work in deep learning, you most probably know Jeremy Howard, his fastai library [1], and the projects related to it. One of these related projects is fastdownload [2], which I found only recently (even though it has been out for almost a year) and tried to use. It was not without problems, but I found it useful, so I decided to share my findings and enhancements.

If you have datasets or other archives that you want to make available to your users, fastdownload can help ensure that they always get the latest version and that downloads are verified as correct.

fastdownload can handle multiple URLs pointing at the same archive and ensures that users always get the latest version of the archive. Getting a dataset is as easy as calling the FastDownload.get method with the URL of the archive. The archive will be downloaded and extracted to the specified location, and the path to the extracted content will be returned.

For instance, fastai uses fastdownload to provide access to datasets for deep learning. fastai users can download and extract them with a single command, using the return value to access the files. The files are automatically placed in appropriate subdirectories of a .fastai folder in the user's home directory. If a dataset is updated, users are informed the next time they use it, and the latest version is automatically downloaded and extracted for them.

Basic example

As described above, the most common use of fastdownload is to download a dataset from the internet. Datasets are usually organized as a set of files split into directories according to the class of the data in each file. All of these class directories are then placed in a single dataset directory, which is compressed. So usually we have a URL to the compressed dataset and we want to do the following:

  • check if we already have the dataset, if it is up to date and not corrupted,
  • download the dataset, if needed,
  • extract the dataset, if needed,
  • have a path to the extracted dataset.
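Done by hand with only the standard library, those steps look roughly like the sketch below. The directory layout ({base}/archive and {base}/data) mirrors fastdownload's defaults, but the function itself is my simplification, not fastdownload's actual code; in particular, the real tool also verifies file size and hash.

```python
import shutil
import urllib.request
from pathlib import Path


def get_dataset(url, base='~/.fastdownload'):
    """Download an archive if needed, extract it if needed, and return
    the path to the extracted dataset. Simplified sketch only."""
    base = Path(base).expanduser()
    name = Path(url).name                    # e.g. mnist_tiny.tgz
    archive = base/'archive'/name
    data = base/'data'/name.split('.')[0]    # e.g. mnist_tiny
    archive.parent.mkdir(parents=True, exist_ok=True)
    data.parent.mkdir(parents=True, exist_ok=True)
    if not archive.exists():                 # download only if needed
        with urllib.request.urlopen(url) as r, open(archive, 'wb') as f:
            shutil.copyfileobj(r, f)
    if not data.exists():                    # extract only if needed
        shutil.unpack_archive(str(archive), str(data.parent))
    return data
```

Even without the integrity checks, that is a fair amount of bookkeeping for a supposedly simple task.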

With fastdownload we can do all of this in two lines (after installing the package and importing it).

!pip uninstall -y -q fastdownload
!pip install -q fastdownload
import fastdownload
from fastdownload import FastDownload

d = FastDownload(module=fastdownload)
path = d.get('https://s3.amazonaws.com/fast-ai-sample/mnist_tiny.tgz')
path
Path('/home/jovyan/.fastdownload/data/mnist_tiny')

The download was successful, and we can see that path points to the mnist_tiny/ directory, which lives inside the .fastdownload/data/ directory in my home directory.

That is it: you now know how to download datasets in Python using fastai's fastdownload. But if you are interested in more advanced usage, keep reading.

Parametrization

In the Basic example we pretty much used fastdownload's default parameters, but we can also parametrize it. When creating a FastDownload object we can specify four parameters: base, archive, data and module. The first three control where files are downloaded and extracted. base is the path to the parent directory where everything will live. Archives are then saved to {base}/{archive} and extracted to {base}/{data}. When no values are specified, the defaults are as follows:

  • base = ~/.fastdownload
  • archive = archive
  • data = data

This exactly matches the path we got in the Basic example.
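The layout rule itself fits in a couple of lines; the helper below is just an illustration of how the three path parameters combine, not fastdownload's own code:

```python
from pathlib import Path


def fastdownload_layout(base='~/.fastdownload', archive='archive', data='data'):
    """Return (archive_dir, data_dir) following the {base}/{archive} and
    {base}/{data} convention described above. Illustration only."""
    base = Path(base).expanduser()
    return base/archive, base/data
```

For example, fastdownload_layout('/srv/project') yields /srv/project/archive and /srv/project/data.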

The last parameter, module, is useful when downloading datasets connected with some package. Part of downloading is a check that the dataset is not corrupted. To do that, we need access to the true file size and hash. fastdownload uses a file called download_checks.py for this, and it is expected to be located in the same directory as the module we specified with the module parameter. The author of a package and its datasets should provide this file.
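To my understanding, download_checks.py is simply a Python dict literal mapping each URL to its expected (size, hash) pair. The entry below is a made-up illustration of that shape; the size and hash values are invented, not real:

```python
# download_checks.py -- hypothetical example; the size and hash values
# below are invented purely to show the structure of the file.
{
    'https://s3.amazonaws.com/fast-ai-sample/mnist_tiny.tgz':
        (342207, 'a9bc9318a9f2d5f4a31e9f48b1b4e9da'),
}
```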

Looking back at the Basic example, we specified module=fastdownload, which was pretty useless since the fastdownload module contains no download_checks.py file. There is a small bug in the current distribution of the fastdownload package that causes it to fail when module is not specified. The fix is already on fastdownload's GitHub but has not been released yet, so for now we need to pass some module parameter even though it is not really used.

Enhancement 1: Working with compressed files

Without a doubt, fastdownload is a good tool for downloading datasets. However, I stumbled upon a problem when I tried to work with a compressed file (as opposed to a compressed directory). The problem is not apparent at first, because the initial download and unpacking of a compressed file works fine.

path = d.get('https://silkdb.bioinfotoolkits.net/__resource/Bombyx_mori/download/cds.fa.tar.gz')
path
Path('/home/jovyan/.fastdownload/data/cds.fa')

The problem shows up when the file is re-fetched using the force=True parameter of d.get() (which forces a new download of the file even if it was downloaded before).

try:
    path = d.get('https://silkdb.bioinfotoolkits.net/__resource/Bombyx_mori/download/cds.fa.tar.gz', force=True)
except Exception as e:
    print(e)
100.03% [5578752/5577257 00:03<00:00]
[Errno 20] Not a directory: Path('/home/jovyan/.fastdownload/data/cds.fa')

The file was downloaded but unpacking failed. More precisely, removing the old uncompressed file before unpacking the new one failed. fastdownload expects only directories, since that is the most common format for datasets. I understand that downloading standalone compressed files is not a primary use case for fastdownload, but I would like to use it that way. Luckily, I was able to add support for compressed files. The enhancement lives in a fork of fastdownload on my GitHub [3] for now, but I will try to get it merged upstream. It can easily be installed as a Python package using pip.
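Conceptually the fix is small: before unpacking again, the previous extraction has to be removed whether it is a directory or a plain file. A sketch of the idea (not the exact code from the fork):

```python
import shutil
from pathlib import Path


def remove_extracted(path):
    """Remove a previous extraction, whether it is a directory or a file."""
    path = Path(path)
    if path.is_dir():
        shutil.rmtree(path)   # the case fastdownload already handled
    elif path.exists():
        path.unlink()         # the missing case: a plain extracted file
```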

!pip uninstall -y -q fastdownload
!pip install -q git+https://github.com/katarinagresova/fastdownload

Now we need to import the newly installed fastdownload module and create a FastDownload object again.

Note: we don't have to specify the module parameter here since we are installing a version that already contains the fix. However, if you download a dataset for which a download_checks.py file exists, you should specify the module where it is located.
from fastdownload import FastDownload

d = FastDownload()
path = d.get('https://silkdb.bioinfotoolkits.net/__resource/Bombyx_mori/download/cds.fa.tar.gz', force=True)
100.03% [5578752/5577257 00:03<00:00]
path
Path('/home/jovyan/.fastdownload/data/cds.fa')

We can also verify that our path really points to the extracted file and not to a directory.

print(path.is_file())
True

Enhancement 2: Adding support for new compression formats

While trying out fastdownload I found another use case that was not supported: downloading a compressed file with a plain .gz extension.

try:
    path = d.get('http://ftp.ensembl.org/pub/release-106/mysql/regulation_mart_106/dmelanogaster_external_feature__external_feature__main.txt.gz', force=True)
except Exception as e:
    print(e)
100.00% [5152768/5152564 00:01<00:00]
Unknown archive format '/home/jovyan/.fastdownload/archive/dmelanogaster_external_feature__external_feature__main.txt.gz'

I dug a little deeper into fastdownload and found that it uses the shutil module [4] to decompress files. shutil supports only a handful of unpacking formats out of the box, and plain .gz is not one of them.

import shutil

shutil.get_unpack_formats()
[('bztar', ['.tar.bz2', '.tbz2'], "bzip2'ed tar-file"),
 ('gztar', ['.tar.gz', '.tgz'], "gzip'ed tar-file"),
 ('tar', ['.tar'], 'uncompressed tar file'),
 ('xztar', ['.tar.xz', '.txz'], "xz'ed tar-file"),
 ('zip', ['.zip'], 'ZIP file')]
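shutil.unpack_archive dispatches purely on the file-name extension; when no registered extension matches, it raises a ReadError. A quick self-contained illustration with a throwaway temp file (no download involved):

```python
import shutil
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
fake = tmp/'data.txt.gz'
fake.write_bytes(b'')  # only the extension matters for format dispatch

try:
    shutil.unpack_archive(str(fake), str(tmp))
except shutil.ReadError as e:
    print(e)  # prints: Unknown archive format '...'
```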

However, we can add support for a new compression format by writing a decompression function and registering it with shutil; shutil supports this natively via register_unpack_format.

print(shutil.register_unpack_format.__doc__)
Registers an unpack format.

    `name` is the name of the format. `extensions` is a list of extensions
    corresponding to the format.

    `function` is the callable that will be
    used to unpack archives. The callable will receive archives to unpack.
    If it's unable to handle an archive, it needs to raise a ReadError
    exception.

    If provided, `extra_args` is a sequence of
    (name, value) tuples that will be
    passed as arguments to the callable.
    description can be provided to describe the format, and will be returned
    by the get_unpack_formats() function.

Thanks to the smart people at Stack Overflow, I was able to figure out how to do it. I adapted the code from a Stack Overflow answer [5], replacing the problematic file-name extraction with pathlib.Path [6]. Any compression format can be registered this way.

from pathlib import Path
import gzip
import shutil


def gunzip_something(gzipped_file_name, work_dir):
    """gunzip the given gzipped fil

    Args:
        gzipped_file_name (str): path to the gzipped file
        work_dir (str): path to the directory where the file will be unzipped
    """

    filename = Path(gzipped_file_name).stem

    with gzip.open(gzipped_file_name, 'rb') as f_in:
        with open(Path(work_dir, filename), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)


shutil.register_unpack_format(
    name='gz',
    extensions=['.gz'],
    function=gunzip_something,
    description='Gzipped file'
)

And now we can also download files with a .gz extension.

path = d.get('http://ftp.ensembl.org/pub/release-106/mysql/regulation_mart_106/dmelanogaster_external_feature__external_feature__main.txt.gz')
path
Path('/home/jovyan/.fastdownload/data/dmelanogaster_external_feature__external_feature__main.txt')

Conclusion

fastdownload is quite a useful tool for managing datasets. If you would like to use it for any type of compressed file as well, use my extended version. You can also make it handle any compression format by registering a decompression function with shutil, as described above.

Fast downloading!