Wrangling an Image Dataset (under the sea)
Follow my process to scrape and wrangle a dataset of Spongebob title card images for eventual GAN processing.
This post covers:
- Scraping images from the web using Beautiful Soup
- Converting image data to a format digestible by a Generative Adversarial Network (GAN) algorithm
Preface: I’m your biggest fan
In 1999 I was bombarded with Spongebob promos while watching CatDog, memorized the date and time of the premiere, and made sure I was seated in front of the TV right when it aired. As history can attest, it did not disappoint.
A scientist from birth, I was won over by the extent to which real-world biological details were used to inform Spongebob’s characters and plot lines. Only recently did I learn that the late creator, Stephen Hillenburg, was actually a trained marine biologist.
I was equally enamored with Spongebob’s visual style. The early seasons’ aesthetic elicits the same feeling for me as visiting San Francisco’s Fisherman’s Wharf at the turn of the millennium. It’s something akin to coloring the children’s menu at an old, dusty seafood restaurant with an assortment of crayons pulled at random from the Crayola 60 pack while wearing a Hawaiian print skort from Limited Too.
One of the most delightful aspects of Spongebob is the title cards. Many of these are seared in my mind to this day, paired with the sounds of a ukulele or slide guitar. Each card has a distinct personality – many in the same general style but each unique enough to remember for years.
I recently came across a Spongebob fan website with a comprehensive archive of title cards. I wondered if it might be possible to scrape the title card images into a dataset and wrangle them into a neural-network-friendly format. I’ve been interested in training a GAN to generate images for a while; is it possible to generate novel Spongebob title cards with this data?
In this first post of the Spongebob title card adventure, I will walk through scraping and wrangling image data into a format compatible with GAN algorithms.
Scraping
I used the Beautiful Soup library to scrape the title card image links and episode names from the Spongebob fandom wiki’s list of title cards (the URL appears in the code below). Detailed information on using Beautiful Soup can be found in this blog post on scraping web data.
import requests
from bs4 import BeautifulSoup
import bleach
import pandas as pd

url = "https://spongebob.fandom.com/wiki/List_of_title_cards#.23"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Grab every image tag and every caption inside the main content div
results = soup.find(id="content")
images = results.find_all('img')

sources = []
titles = []

for image in images:
    source = image.attrs['src']
    sources.append(source)

titles_raw = results.find_all(class_='lightbox-caption')
for title in titles_raw:
    # Strip the HTML tags from each caption, leaving just the episode name
    title = bleach.clean(str(title), tags=[], strip=True).strip()
    titles.append(title)

# Drop placeholder "data:" URIs and remove the thumbnail scaling path segment
sources = [x for x in sources if not x.startswith('data')]
sources = [x.replace('/scale-to-width-down/185', '') for x in sources]

spongedata = pd.DataFrame(list(zip(sources, titles)), columns=['Image_Source', 'Title'])
spongedata
Downloading Images
Now that we have the image links, we can use the requests module to open and save the images. There are a number of ways to do this – this Stack Overflow thread was particularly helpful in finding the right method.
Essentially, requests downloads the image from the URL into memory as a raw byte stream. Once downloaded, the shutil module copies that stream into a JPG file on disk.
I imported a random title card below:
import matplotlib.pyplot as plt
import matplotlib.image as mpim
import shutil

# Stream the image bytes and copy them straight to a file on disk
response = requests.get(spongedata.Image_Source[390], stream=True)
with open("sponge1.jpg", 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response

# Read the saved file back in and display it
img = mpim.imread('sponge1.jpg')
imgplot = plt.imshow(img)
Note that the image is approximately 1000×1400 pixels.
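To double-check the size yourself, you can inspect the shape of the array that imread returns (numpy reports rows, then columns, then channels):

print(img.shape)  # roughly 1000 x 1400 pixels, plus 3 RGB channels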
Here is another random title card:
response = requests.get(spongedata.Image_Source[211], stream=True)
with open("sponge1.jpg", 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response

img = mpim.imread('sponge1.jpg')
imgplot = plt.imshow(img)
This title card is approximately 1000×1750 pixels – not quite the same size as the first one we looked at. To avoid issues down the road, I decided to resize every title card to the same dimensions:
image.resize((300, 200))
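Applied to the card we just downloaded, that looks something like the sketch below. Image here is Pillow’s Image class (imported properly in the next section), and note that PIL’s resize takes (width, height):

from PIL import Image

image = Image.open('sponge1.jpg')
image = image.resize((300, 200))   # 300 pixels wide, 200 pixels tall
print(image.size)                  # (300, 200)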
CIFAR10 Image Structure
Now let’s take a look at the kind of image data that others have used successfully with GANs. I did some research and found an incredibly helpful tutorial on constructing a GAN using the CIFAR10 image dataset; the article is located here.
The CIFAR10 dataset consists of 60,000 32×32 color images stored as arrays, split into a 50,000-image training set and a 10,000-image test set.
The train and test sets are stored as 4D arrays – for example, the shape of the training set is (50000, 32, 32, 3). The deepest dimension holds the 3 RGB values assigned to each pixel, the next two dimensions are the 32 rows and 32 columns of pixels in each image, and the outermost dimension indexes the 50,000 images – a stack of 50,000 32×32 grids of RGB values.
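If you want to look at those shapes yourself, one convenient way to pull down CIFAR10 is the Keras dataset loader – just a sketch assuming TensorFlow is installed; any other way of loading the data works too:

from tensorflow.keras.datasets import cifar10

# load_data returns (train images, train labels), (test images, test labels)
(train_X, train_y), (test_X, test_y) = cifar10.load_data()
print(train_X.shape)   # (50000, 32, 32, 3)
print(test_X.shape)    # (10000, 32, 32, 3)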
An image can be converted to a 3D array very simply by calling the numpy.array() function after opening the image file with PIL, the Python imaging library (Pillow):
from PIL import Image
import numpy as np

image = Image.open('sponge1.jpg')
image = np.array(image)
image
image.shape
Now that we know how to get an image into an array format similar to those in the CIFAR10 dataset, we can write code to loop through our scraped data, convert every title card to an array, and stack them into one 4D array like the CIFAR10 data:
title_cards = []

for source in spongedata.Image_Source:
    # Download each image to a temporary file, just as before
    response = requests.get(source, stream=True)
    with open("sponge1.jpg", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response

    # Resize to a uniform 300x200 and convert to a pixel array
    img = Image.open('sponge1.jpg')
    img = img.resize((300, 200))
    title_cards.append(np.array(img))

# Stack the list of 3D arrays into a single 4D array
title_cards = np.array(title_cards)
title_cards.shape
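Because re-scraping a few hundred images takes a while, it’s worth saving the finished array to disk so later GAN experiments can load it directly. This is an optional extra step on top of the workflow above, and the filename is just a placeholder:

# Persist the 4D array; reload later with np.load('spongebob_title_cards.npy')
np.save('spongebob_title_cards.npy', title_cards)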
Yay! We’ve managed to get a dataset of Spongebob title cards in something analogous to the CIFAR10 format that we can now try and feed into a GAN. Stay tuned for more Spongebob title card neural network posts!