Handling Large Files in Projects

Quarto

Author: Jeff Jacobs

Published: December 16, 2024

The Problem

Final projects in DSAN span a wide set of domains and application areas, and therefore a wide range of dataset structures, so it’s common to have files larger than the 50 MB recommended maximum (or even the 100 MB absolute maximum) imposed by GitHub!

In these cases, at least in the research domains I’ve worked in, the most common approach is to:

  • Store your code and small data files on GitHub, but
  • Store your large data files in a cloud storage service like Google Drive
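To decide which files belong in which bucket, it can help to scan the project folder before committing. The following is a minimal sketch, not an official GitHub tool: the function names `classify_size` and `find_large_files` are my own, and I am assuming the thresholds are measured in MiB (1024 × 1024 bytes), which is how GitHub’s documentation states them.

```python
from pathlib import Path

# Assumption: GitHub's thresholds are 50 MiB (warning) and 100 MiB (push blocked).
MIB = 1024 * 1024
WARN_BYTES = 50 * MIB
BLOCK_BYTES = 100 * MIB


def classify_size(num_bytes):
    """Label a file size relative to the two GitHub thresholds."""
    if num_bytes > BLOCK_BYTES:
        return "blocked"
    if num_bytes > WARN_BYTES:
        return "warning"
    return "ok"


def find_large_files(project_dir="."):
    """Return sorted (path, status) pairs for files over the warning threshold."""
    results = []
    for path in Path(project_dir).rglob("*"):
        # Skip directories and Git's own internal files.
        if ".git" in path.parts or not path.is_file():
            continue
        status = classify_size(path.stat().st_size)
        if status != "ok":
            results.append((str(path), status))
    return sorted(results)


# Example usage (run from the root of your project folder):
# for path, status in find_large_files("."):
#     print(status, path)
```

Anything this reports as "warning" or "blocked" is a candidate for Google Drive rather than GitHub.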

Integrating the larger data files into your code, then, becomes the hard part. Since Georgetown students have a large Google Drive allocation built into their @georgetown.edu accounts, here I will show how to store large data files in Google Drive and then load them into your code, in both R and Python.

Uploading Files and Setting Permissions

The first step is to upload your large data file(s) to a folder in Google Drive. Then right-click on the file(s) you’d like to incorporate into your code, click “Share”, and set the permissions to “Anyone with the link”.

The example I’m using here is a US county-level dataset on outcomes from the Opportunity Atlas. It contains 10,827 attributes for 3,219 counties, so it takes up about 184 MB, putting it well over GitHub’s 100 MB limit.

So, once I set the permissions as described, I am given the following sharing link:

https://drive.google.com/file/d/1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g/view?usp=drive_link

Bypassing the Large-File Warning with gdown

Unfortunately for us, however, when a file is over 100 MB its direct-download link leads to a “warning screen”, telling users that the file is “too big to scan for viruses” and asking whether they would like to proceed. You can see this warning screen if you click the following direct-download link for the file we’re looking at:

https://drive.google.com/uc?export=download&id=1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g
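Notice that the direct-download link reuses the file ID that appears between `/file/d/` and `/view` in the sharing link. Here is a minimal sketch of that conversion; the helper names `drive_file_id` and `direct_download_url` are hypothetical ones I made up for illustration, and the regular expression assumes the sharing link has the `/file/d/<id>/` shape shown above.

```python
import re


def drive_file_id(share_url):
    """Extract the file ID from a Drive sharing link of the form .../file/d/<id>/..."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {share_url}")
    return match.group(1)


def direct_download_url(share_url):
    """Build the uc?export=download link from a sharing link."""
    return f"https://drive.google.com/uc?export=download&id={drive_file_id(share_url)}"


share_url = "https://drive.google.com/file/d/1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g/view?usp=drive_link"
print(direct_download_url(share_url))
# prints: https://drive.google.com/uc?export=download&id=1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g
```

The file ID on its own is the piece you will need later, so it is worth saving it in a variable.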

So, the final missing piece, which lets us bypass this warning screen and actually download the file directly into Python or R, comes from the gdown library! I’ve made the following Colab notebooks to walk you through how to use this library in either language to load this large .csv file directly. (In Python we can use it directly, since it’s written in Python and installable using pip. In R, we can still use it by programmatically executing the gdown terminal command using R’s system() function. See the notebooks for more!)
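For a quick sense of the Python side without opening the notebooks, here is a minimal sketch (not necessarily the exact code in the notebooks). It assumes gdown has been installed with `pip install gdown`. The function name `fetch_from_drive` and the local filename in the usage comment are hypothetical. I have also added a simple cache check, which is my own addition, so that the 184 MB file is only downloaded once.

```python
from pathlib import Path

# File ID taken from the sharing link shown earlier.
FILE_ID = "1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g"


def fetch_from_drive(file_id, dest):
    """Download a Drive file to dest with gdown, unless it is already there."""
    dest = Path(dest)
    # Cache check: skip the download if a non-empty copy already exists.
    if dest.exists() and dest.stat().st_size > 0:
        return dest
    import gdown  # third-party package: pip install gdown

    dest.parent.mkdir(parents=True, exist_ok=True)
    # gdown handles the large-file warning screen for us.
    result = gdown.download(id=file_id, output=str(dest), quiet=False)
    # Depending on the gdown version, a failed download either raises
    # an exception or returns None, so check for None as well.
    if result is None:
        raise RuntimeError(f"gdown could not download file ID {file_id}")
    return dest


# Example usage (the filename is hypothetical; requires pandas and a network connection):
# import pandas as pd
# csv_path = fetch_from_drive(FILE_ID, "data/county_outcomes.csv")
# df = pd.read_csv(csv_path)
```

If you go this route, remember to add the downloaded file’s folder to your .gitignore, so the large file doesn’t end up being committed to GitHub after all.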