Handling Large Files in Projects
The Problem
Since final projects in DSAN span a wide set of domains/application areas, and therefore a wide range of dataset structures, it’s common to have files larger than the 50 MB recommended maximum (and 100 MB absolute maximum) imposed by GitHub!
In these cases, at least in the research domains I’ve worked in, the most common approach is to:
- Store your code and small data files on GitHub, but
- Store your large data files in a cloud storage service like Google Drive
Integrating the larger data files into your code, then, becomes the hard part. Since Georgetown students have a large Google Drive allocation built into their @georgetown.edu accounts, here I will show how to store large data files in Google Drive and then load them into your code, in both R and Python.
Uploading Files and Setting Permissions
The first step is to upload your large data file(s) to a folder in Google Drive. Then right-click the file(s) you’d like to incorporate into your code, click “Share”, and set the permissions to “Anyone with the link”.
The example I’m using here is a US county-level dataset on outcomes from the Opportunity Atlas. It contains 10,827 attributes for 3,219 counties, so it takes up about 184 MB, putting it over GitHub’s 100 MB limit.
So, once I set the permissions as described, I am given the following sharing link:
https://drive.google.com/file/d/1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g/view?usp=drive_link
Converting to a Direct-Download Link
If you click that link, however, you’ll see that it is not a direct link to the county_outcomes.csv file itself, but a link to a “wrapper” page showing info about the file, with a button you can manually click to download the file. As-is, therefore, this link won’t help us write code that loads the dataset the same way we’d load a local data file.
A not-very-well-known (but crucially important for our purposes!) aspect of Google Drive is that you can convert this “viewing” link into a direct-download link: take the file ID from the sharing link (the long string between /d/ and /view) and plug it into a URL of the form https://drive.google.com/uc?export=download&id=&lt;FILE_ID&gt;. If our files were less than 100 MB, we could use this direct-download link as-is, plugging it into pd.read_csv() (in Python) or read_csv() (in R’s tidyverse).
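For example, here’s a minimal Python sketch of that conversion, using the file ID from the sharing link above. (Note that, since this particular file is about 184 MB, the final line would actually run into the warning screen described in the next section; for a file under 100 MB it works as-is.)

```python
import pandas as pd

# Sharing link copied from Google Drive (the "viewing" link shown above)
sharing_link = "https://drive.google.com/file/d/1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g/view?usp=drive_link"

# The file ID is the long string between "/d/" and "/view"
file_id = sharing_link.split("/d/")[1].split("/")[0]

# Plug the file ID into the direct-download URL format
direct_url = f"https://drive.google.com/uc?export=download&id={file_id}"

# For files under 100 MB, pandas can read this URL just like a local .csv path
county_df = pd.read_csv(direct_url)
```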
Bypassing the Large-File Warning with gdown
Unfortunately for us, however, when a file is over 100 MB its direct-download link leads to a “warning screen” telling users that the file is “too big to scan for viruses” and asking whether they’d like to proceed anyway. You can see this warning screen if you click the following direct-download link for the file we’re looking at:
https://drive.google.com/uc?export=download&id=1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g
So, the final missing piece to bypass this warning screen and actually direct-download the file into Python or R comes from the gdown library! I’ve made the following Colab notebooks to walk you through how to use this library in either language to directly incorporate this large .csv file! (In Python we can use it directly, since it’s written in Python and installable using pip. In R, we can still use it by programmatically executing the gdown terminal command using R’s system() function. See the notebooks for more!)
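As a quick preview of the Python side, here’s a minimal sketch of the gdown workflow (the local filename county_outcomes.csv is just an illustrative choice; see the notebooks for the full walkthrough in both languages):

```python
# Minimal sketch, assuming gdown has been installed (e.g., via pip install gdown)
import gdown
import pandas as pd

file_id = "1q8LHQ8Etdd1aYYChLZ94ujJ4GAOd6s2g"
download_url = f"https://drive.google.com/uc?export=download&id={file_id}"
output_path = "county_outcomes.csv"  # local filename to save to (illustrative choice)

# gdown clicks through Drive's "too big to scan for viruses" confirmation for us
gdown.download(download_url, output_path, quiet=False)

# Once downloaded, the file loads like any other local .csv
county_df = pd.read_csv(output_path)
```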