Load Data from Kaggle
Overview
Beyond its competitions, Kaggle hosts a large collection of curated and community-submitted datasets that span many domains, sizes, and file types. This tutorial covers the basics of loading data from Kaggle directly into Saturn Cloud.
Before starting, create an RStudio Server resource. See our quickstart if you don’t know how to do this yet.
Process
Create Kaggle Credentials
The first step for accessing data from Kaggle is to create an API token.
Sign in to Kaggle, click your username and picture in the top right, and then click the Account tab.
Scroll down to the API section and click Create New API Token.
This downloads a file named “kaggle.json”, which contains your username and API key. Store it somewhere safe, because anyone who has this file can act as your account through the API.
Open “kaggle.json” in a text editor to view your Kaggle username and key.
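If you would rather read the two values from the terminal than from a text editor, the sketch below shows one way to do it. It assumes the file is a flat JSON object with a "username" field and a "key" field; the example file, its placeholder values, and the json_field helper are all hypothetical and exist only for illustration.

```shell
# Hypothetical stand-in for the downloaded file; both values are placeholders.
example_json="$(mktemp)"
cat > "$example_json" <<'EOF'
{"username":"your-kaggle-username","key":"your-api-key"}
EOF

# Print one string field from a flat JSON object.
# This is a simple sed-based sketch, not a full JSON parser.
json_field() {
  sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$1"
}

echo "Value for KAGGLE_USERNAME: $(json_field "$example_json" username)"
echo "Value for KAGGLE_KEY:      $(json_field "$example_json" key)"
```

To read your real token, pass the path of your own “kaggle.json” to json_field instead of the example file.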
Add Kaggle Credentials to Saturn Cloud
Sign in to your Saturn Cloud account and select Credentials from the menu on the left.
This is where you will add your Kaggle API key information. Credentials are stored securely and are not available to the public or to other users without your consent.
Click the New button at the top right corner of this page to open the credential creation form.
You will add two credentials: your Kaggle username and your API key. Complete the form once for each.
Credential | Type | Name | Variable Name |
---|---|---|---|
Kaggle Username | Environment Variable | kaggle-username | KAGGLE_USERNAME |
Kaggle API Key | Environment Variable | kaggle-api-key | KAGGLE_KEY |
Copy the values from your “kaggle.json” file into the Value section of the credential creation form. The credential names are recommendations; feel free to change them as needed for your workflow. You must, however, use the provided Variable Names for Kaggle to connect correctly.
With this complete, your Kaggle credentials are accessible to your Saturn Cloud resources. Credentials are loaded when a resource starts, so you will need to restart any running RStudio Server, Jupyter Server, or Dask Cluster resources for the credentials to populate to them.
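After restarting, you may want to confirm from the resource's terminal that both variables arrived. The following is a minimal sketch that reports whether each variable is set without printing its value; the check_kaggle_env function name is an illustrative choice, not part of Saturn Cloud or Kaggle.

```shell
# Report whether each expected variable is set, without echoing secrets.
check_kaggle_env() {
  missing=0
  for name in KAGGLE_USERNAME KAGGLE_KEY; do
    # Indirect lookup through eval keeps this compatible with plain sh.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name is set"
    else
      echo "$name is missing"
      missing=1
    fi
  done
  return "$missing"
}

check_kaggle_env || echo "Add the credentials, then restart the resource."
```

The function returns a non-zero status when either variable is missing, so it can also gate a longer setup script.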
Setting Up Your Resource
The kaggle package is not installed by default in Saturn Cloud images, so you will need to install it on your resource. This is already done in this example recipe, but if you are using a custom resource you will need to run pip install kaggle. Check out our page on installing packages to see the various methods for achieving this!
Download a Dataset
Now that you have set up the credentials for Kaggle and installed kaggle, downloading Kaggle data is really straightforward!
In Kaggle, find the dataset you want to download.
On the dataset page, click on the three dots to the right and select Copy API Command.
Now, in Saturn Cloud, open the terminal, then paste the API command. For example:
kaggle datasets download -d deepcontractor/swarm-behaviour-classification
That’s it! The dataset downloads to your current working directory, typically as a .zip archive, and is ready to use once you extract it.
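If you prefer the files in a dedicated folder and already extracted, the download command accepts a target path (-p) and an --unzip flag. The sketch below wraps the command from this section; the folder name is a hypothetical choice, and the guard only exists so the snippet fails gently when the CLI is not installed.

```shell
dataset="deepcontractor/swarm-behaviour-classification"
target_dir="data/swarm-behaviour"   # hypothetical folder name

mkdir -p "$target_dir"

if command -v kaggle >/dev/null 2>&1; then
  # -p sets the download folder; --unzip extracts the archive after download.
  kaggle datasets download -d "$dataset" -p "$target_dir" --unzip
  ls -lh "$target_dir"
else
  echo "kaggle CLI not found; install it with: pip install kaggle"
fi
```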
Download a Competition Dataset
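Downloading a competition dataset is just as straightforward, but the process is slightly different.
In Kaggle, find the competition whose dataset you want to download. If you have not already done so, join the competition and accept its rules, because Kaggle rejects API downloads for competitions whose rules you have not accepted.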
Click on Data in the top menu and then copy the command displayed.
Now, in Saturn Cloud, open the terminal, then paste the API command. For example:
kaggle competitions download -c titanic
That’s it! The competition data downloads to your current working directory, typically as a .zip archive, and is ready to use once you extract it.
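Competition downloads can be sent to a dedicated folder with -p and then extracted in a second step. The following is a sketch under a few assumptions: the folder name is hypothetical, the unzip utility is assumed to be available on the resource, and the loop extracts whatever .zip files arrive instead of relying on a specific archive name.

```shell
competition="titanic"
target_dir="data/titanic"   # hypothetical folder name

mkdir -p "$target_dir"

if command -v kaggle >/dev/null 2>&1; then
  # -p sets the folder that receives the competition archive.
  kaggle competitions download -c "$competition" -p "$target_dir"
else
  echo "kaggle CLI not found; install it with: pip install kaggle"
fi

# Extract every archive that arrived in the target folder.
for archive in "$target_dir"/*.zip; do
  [ -e "$archive" ] || continue   # the glob matched nothing
  unzip -o -q "$archive" -d "$target_dir"
done
```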