Speculative fiction is an umbrella genre covering science fiction, fantasy, horror, dystopian fiction, weird fiction, and more.
Last year I began devouring speculative fiction short stories with the eventual goal of writing and publishing my own stories in the genre. Since August 2019, I've read and analyzed over 650,000 words of speculative fiction spread across more than 130 short stories, many of them award-winning stories recognized by the Hugo, Nebula, British Fantasy, Bram Stoker, and World Fantasy Awards.
I wanted to build a database of speculative fiction stories - including story texts - for my own personal use. This corpus would eventually help answer questions about the genre.
Archived pulp magazines are great sources for older speculative fiction stories. However, I am interested in the modern state of speculative fiction.
The modern body of award-winning speculative fiction can largely be found within a relatively small set of physical and online speculative fiction magazines and anthologies.
Many of these magazines are freely available on the internet, but copying and pasting their contents manually is slow and error prone.
Web scraping - programmatically gathering content from websites - is one way to build a large corpus from various sources on the internet.
To start building this corpus I decided to scrape one of my favorite speculative fiction magazines: Lightspeed Magazine.
Beautiful Soup 4 is a Python package for pulling data from HTML and XML files. It uses a parser - a program that takes a source document (typically text) and builds a data structure (typically a tree) from the source. These data structures are typically called parse trees or abstract syntax trees.
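To make the parse tree idea concrete, here's a minimal sketch (the HTML fragment is made up for illustration) that parses a string with Python's built-in html.parser and pulls a piece of data back out of the tree:

from bs4 import BeautifulSoup

# A made-up HTML fragment, just to illustrate the parse tree
html = '<html><body><h2 class="posttitle"><a href="/story-1">A Story</a></h2></body></html>'

soup = BeautifulSoup(html, "html.parser")  # parse the document into a tree
link = soup.body.find("a")                 # navigate the tree to the first a element
print(link.get_text())                     # A Story
print(link["href"])                        # /story-1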
Beautiful Soup provides a set of methods for using a parser of your choice to extract information from web pages. A typical workflow for using Beautiful Soup is:
- request the HTML for a web page
- use Beautiful Soup and your parser of choice to parse the requested HTML
- use Beautiful Soup to find content in the parse tree object

Make a new directory and track it with git:
mkdir science-fiction-magazines-nlp && cd science-fiction-magazines-nlp
mkdir old-pickles
git init
touch app.py
touch requirements.in
touch .gitignore
Inside .gitignore, ignore our virtual environment files and pickle files (we don't want to upload copyrighted material to the internet):
venv
*.p
old-pickles
Set up a virtual environment for managing dependencies: python3 -m venv venv (-m for module name, meaning Python will search for and run a module named venv).
I like to use pip-tools (github link) for managing dependencies. It lets you define top-level or explicit dependencies - ones you actually want to use in your project - separately from implicit dependencies - ones installed only because other dependencies need them. It also automatically pins dependencies to specific versions, which makes it easier for other developers to rebuild your project in the future.
Install pip-tools with pip3 install pip-tools.
Put our top-level dependencies in requirements.in (.in is a generic file extension meaning input):
pylint
requests
beautifulsoup4
I like pylint because it will gently remind me when my code is not following common programming patterns and it nudges my code toward a standard style. Pylint is technically a developer requirement, and the normal pattern is to save those in dev-requirements.in, but I didn't do that here.
Running pip-compile will place requirements into a requirements.txt with every dependency pinned to a specific version, along with comments showing which dependencies required the others. My requirements.txt:
#
# This file is autogenerated by pip-compile
# To update, run:
#
# pip-compile
#
astroid==2.3.3 # via pylint
beautifulsoup4==4.8.2
certifi==2019.11.28 # via requests
chardet==3.0.4 # via requests
idna==2.8 # via requests
isort==4.3.21 # via pylint
lazy-object-proxy==1.4.3 # via astroid
mccabe==0.6.1 # via pylint
pylint==2.4.4
requests==2.22.0
six==1.14.0 # via astroid
soupsieve==1.9.5 # via beautifulsoup4
typed-ast==1.4.1 # via astroid
urllib3==1.25.7 # via requests
wrapt==1.11.2 # via astroid
Activate the virtual environment: . venv/bin/activate (also, maybe look at venv/bin/activate - it's only 70 lines of code and just sets a few environment variables).
All Python code for this project will go in app.py. Let's import dependencies there:
"""Functions for requesting and processing html from Science Fiction magazines"""
import os
import re
import time
import pickle
import traceback
import requests
from bs4 import BeautifulSoup
We can use the requests library to GET pages and parse them with Beautiful Soup using the built-in HTML parser html.parser. Beautiful Soup will use the parser to generate a soup object. This object has the parse tree created by the parser and some helper methods for operating on that parse tree.
In app.py, create a function for requesting a page and returning the soup object representing that page:
def request_soup_page(url: str):
    """Request an html page and return a BeautifulSoup object for that page"""
    print(f"Requesting page {url}")
    request = requests.get(url)
    status = request.status_code
    print(f"Received page {url} with status {status}")
    if status == 404:
        return {}
    soup = BeautifulSoup(request.content, 'html.parser')
    return soup
Here requests.get(url) performs an HTTP GET request for the HTML at the given url. If the page was found, we use Beautiful Soup to parse the page into an object (soup) and return that object.
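For example (a quick sketch, not part of the final script), requesting the first page of the issues list - the same URL pattern used below - and printing its title might look like this:

soup = request_soup_page("http://www.lightspeedmagazine.com/category/issues/page/1")
if soup:  # an empty dict means the page was not found
    print(soup.title.get_text())  # print the text of the page's title element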
The magazine lists all issues as a paginated list - a list separated over several pages. We will need to GET all pages of this list and then use Beautiful Soup to find the link to each issue on each page.
We can construct URLs for the paginated list pages: http://www.lightspeedmagazine.com/category/issues/page/NUMBER.
The base_url is http://www.lightspeedmagazine.com/category/issues/page/. We will just increment NUMBER until a page isn't found (when request_soup_page returns an empty object instead of a soup object). We will collect the soup objects representing each page of the paginated list in a list (pages) and return this list:
def request_all_paginated_list_pages(base_url: str) -> list:
    """Return html for each page of a paginated list as list of BeautifulSoup objects"""
    print(f"Requesting paginated list pages from {base_url}")
    status = 200
    list_page = 0
    pages = []
    while status != 404:
        list_page += 1
        list_page_url = f"{base_url}/{list_page}"
        print(f"Request list page {list_page}")
        soup = request_soup_page(list_page_url)
        if soup == {}:
            status = 404
        print(f"Received list page {list_page} with status {status}")
        if status == 404:
            break
        pages.append(soup)
        time.sleep(5)
    return pages
Note that time.sleep(5) waits for five seconds between requests. It's important to be mindful of server load when scraping - we don't want to DDoS our favorite magazines.
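If you want to be even more considerate, you could also identify your scraper with a descriptive User-Agent header so the site owner knows who is making the requests. This isn't in the code for this post - just a hedged sketch of a hypothetical variant of request_soup_page, with a placeholder contact address:

# Hypothetical variant of request_soup_page that sends an identifying User-Agent
HEADERS = {"User-Agent": "speculative-fiction-corpus-bot (contact: you@example.com)"}

def request_soup_page_polite(url: str):
    """Request an html page with a descriptive User-Agent and return a BeautifulSoup object"""
    request = requests.get(url, headers=HEADERS)
    if request.status_code == 404:
        return {}
    return BeautifulSoup(request.content, 'html.parser')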
Now we have a list of soup objects representing the paginated list of issues, with links to every issue.
Beautiful Soup uses a set of querying methods that return a soup object. This means you can chain queries together to find HTML elements deep in the parse tree.
In find_all_issue_links, we will iterate through each of the soup objects we collected using request_all_paginated_list_pages(). We'll use several methods on the soup object that all return another soup object:
- body to find the HTML body
- find to find a specific HTML element
- find_all to find all elements matching some condition

These methods let us find links (a elements in the HTML) for every issue of the magazine. Then we can access the href property of every link to get a URL string. We return a list of these URLs.
Finding out where to look for the URLs was a manual process of inspecting the pages with Chrome DevTools.
def find_all_issue_links(issues_pages: list):
    """Find all issue url strings in a list of issue html pages (BeautifulSoup objects)"""
    print(f"Finding all issue links")
    urls = []
    for index, page in enumerate(issues_pages):
        print(f"Finding issue links in page {index + 1}")
        post_wrapper_divs = page.body \
            .find("div", {"id": "wrapper"}) \
            .find("div", {"id": "main"}) \
            .find("div", {"id": "content"}) \
            .find("div", {"class": "content_box"}) \
            .find_all("div", {"class": "post_wrapper"})
        for div in post_wrapper_divs:
            url = div \
                .find("div", {"class": "post_content"}) \
                .find("h2", {"class": "posttitle"}) \
                .find("a")["href"]
            urls.append(url)
        print(f"Total issue links found so far: {len(urls)}")
    return urls
I use a wrapping function (request_and_find_and_save_issue_links()) for collecting the issue links to avoid polluting main(). It uses the functions from above and saves the list of URLs to a pickle - a way for Python to persist objects to disk - called issue_links.p.
def request_and_find_and_save_issue_links(pickle_url: str):
    """Request, find, and save list of issue link url strings as a pickle"""
    print(f"Getting all issue list pages")
    base_url = "http://www.lightspeedmagazine.com/category/issues/page"
    pages = request_all_paginated_list_pages(base_url=base_url)
    issue_links = find_all_issue_links(issues_pages=pages)
    print(f"Dumping {len(issue_links)} links to pickle issue_links")
    with open(pickle_url, 'wb') as file_pointer:
        pickle.dump(issue_links, file_pointer)
Since we’re saving pickle files, let’s make a function for loading them:
def load_from_pickle(url: str):
    """Return an object loaded from a local pickle file"""
    with open(url, 'rb') as file_pointer:
        return pickle.load(file_pointer)
Each issue has its own page with links to every story in that issue. From inspecting the HTML, I can see that the fictional stories are always marked as Science Fiction or Fantasy. We need to collect every story link marked as Science Fiction or Fantasy from every issue.
Each issue is divided into posts, where each post is in a separate div with class post_wrapper. Each post has an h2 with the link to the story and an h3 with category information. If the category is Science Fiction or Fantasy, we save the story's link to a list.
A function for using Beautiful Soup to find all of the story links on an issue's page:
def find_all_story_links_from_issue(issue: object):
    """Return a list of all story url strings in an issue (BeautifulSoup object)"""
    print(f"Getting all story links from issue")
    post_wrapper_divs = issue.body \
        .find("div", {"id": "wrapper"}) \
        .find("div", {"id": "main"}) \
        .find("div", {"id": "content"}) \
        .find("div", {"class": "content_box"}) \
        .find_all("div", {"class": "post_wrapper"})
    print(f"Found {len(post_wrapper_divs)} post_wrapper divs")
    links = []
    for post in post_wrapper_divs:
        post_content = post.find("div", {"class": "post_content"})
        category_header = post_content.find("h3")
        title_header_link = post_content.find("h2", {"class": "posttitle"}).find("a")
        if not category_header or not title_header_link:
            continue
        link = title_header_link["href"]
        if (category_header.contents[0] == "Science Fiction" or \
                category_header.contents[0] == "Fantasy"):
            print(f"Added link {link}")
            links.append(link)
    print(f"Found {len(links)} story links from issue")
    return links
We use another wrapper function to avoid polluting main(). This function takes a list of issue links and the path to a pickle file of story links (if this file exists).
For each issue we:
- load any previously saved story links from the pickle
- request the issue's page and find the story links on it
- add those links to the list and save the list back to the pickle
- wait five seconds before the next request

The pickle file helps preserve state between processing each issue, so that if an error is encountered we don't have to request every issue again.
def request_and_find_and_save_story_links_from_issues(issue_links: list, pickle_url: str):
    """Request, find, and save a list of all story url strings in issues as a pickle"""
    print(f"Getting all story links from issues")
    story_links = []
    for issue_link in issue_links:
        if os.path.exists(pickle_url):
            story_links = load_from_pickle(pickle_url)
            print(f"Loaded pickle {pickle_url} with {len(story_links)} story links")
        issue_page = request_soup_page(url=issue_link)
        links = find_all_story_links_from_issue(issue=issue_page)
        story_links = story_links + links
        print(f"Found {len(story_links)} story links so far")
        print(f"Dumping {len(story_links)} links to pickle {pickle_url}")
        with open(pickle_url, 'wb') as file_pointer:
            pickle.dump(story_links, file_pointer)
        time.sleep(5)
Now we have links for every story in the magazine.
Each story has its own page with the following information that we can store in a dictionary:
- author
- title
- issue and issue URL
- word count
- type (e.g. Science Fiction)
- the story content

The story content is spread over many HTML elements that need to be processed differently. We will build a string for the story, starting with an empty string:
- p elements are converted to text and added to the string
- ol and ul elements are converted to text and added to the string
- elements containing images or links are ignored
- div elements with class divider are added to the string as ". . . .\n\n"
Again we can inspect each story's page, find the information we care about, and use Beautiful Soup to extract it and return a dictionary representing that story:
def find_story_from_story_page(story_page: object):
    """Return a dictionary representing a story from a story html page"""
    print(f"Finding story from story page")
    story = {}
    content_box = story_page.body \
        .find("div", {"id": "wrapper"}) \
        .find("div", {"id": "main"}) \
        .find("div", {"id": "content"}) \
        .find("div", {"class": "content_box"})
    story_type = content_box.find("h3").contents[0]
    author = content_box \
        .find("div", {"class": "about_author"}) \
        .find("h2") \
        .find("span") \
        .contents[0]
    post_div = content_box \
        .find("div", {"id": re.compile("^post-")})
    title = post_div \
        .find("h1", {"class": "posttitle"}) \
        .contents[0]
    issue_paragraph = post_div \
        .find("p", {"class": re.compile("postmetadata date")})
    issue = issue_paragraph.find("a").contents[0]
    issue_url = issue_paragraph.find("a")["href"]
    word_count = re.search(r'\b\d+\b', issue_paragraph.contents[2]).group(0)
    story_div = post_div \
        .find("div", {"class": "entry"})
    story_div_elements = list(story_div.children)
    content = ""
    for element in story_div_elements:
        # ignore paragraphs containing images or links
        if element.find("img") or element.find("a"):
            continue
        if element.name in ["p"]:
            content += f"{element.get_text()}\n\n"
        if element.name in ["ol", "ul"]:
            text = element.get_text('\n')
            content += f"{text}"
        # "class" is a multi-valued attribute (a list), so check for membership
        if element.name == "div" and "divider" in element.get("class", []):
            content += ". . . .\n\n"
    story.update(
        { "author": str(author)
        , "title" : str(title)
        , "issue": str(issue)
        , "issue_url": str(issue_url)
        , "word_count": str(word_count)
        , "type": str(story_type)
        , "content": content
        })
    return story
Now we have a list of story URLs and a function for extracting the stories from a story's HTML.
We'll save each story dictionary in another dictionary called stories. Between processing each story, we'll save the current stories to a pickle that we provide as pickle_url.
Assuming stories don't change, we'll only retrieve information for stories that haven't been processed before.
For every story link:
- use request_soup_page() to get a soup object for the story's HTML
- use find_story_from_story_page() to get a dictionary representing the story
- if an error occurs, save the link to failed_story_urls.p for processing later and continue to the next story link
- add the story to the stories dictionary
- save the stories dictionary to a pickle and continue to the next story link

def request_and_find_and_save_stories_from_story_links(story_links: list, pickle_url: str):
    """
    Request, find, and save all stories and story metadata as a pickle

    Builds a dictionary of stories and saves it as a pickle.
    stories dictionary structure:
        keys: "story title-Author Name"
        example key: "The Traditional-Maria Dahvana Headley"
        values: a story dictionary containing story content and metadata
            { "author": "author name"
            , "title": "story title"
            , "story_url": "magazine.com/story1234"
            , "issue": "story issue name"
            , "issue_url": "magazine.com/issue1234"
            , "word_count": "1234"
            , "type": "Fiction"
            , "content": "body of story"
            }
    """
    print(f"Getting all stories from story links")
    stories = {}
    failed_story_urls = []
    for story_link in story_links:
        try:
            if os.path.exists(pickle_url):
                stories = load_from_pickle(pickle_url)
                print(f"Loaded pickle {pickle_url} with {len(stories)} stories")
            story_urls = [story["story_url"] for story in list(stories.values())]
            if story_link in story_urls:
                print(f"Found story in pickle, skipping story {story_link}")
                continue
            story_page = request_soup_page(url=story_link)
            story = find_story_from_story_page(story_page=story_page)
            story["story_url"] = story_link
            if not all(keys in story for keys in \
                    ("author", "title", "issue", "issue_url", "word_count", "content")):
                print(f"Problem getting story data from story url {story_link}")
                raise Exception("Malformed story data")
            stories[f'{story["title"]}-{story["author"]}'] = story
            print(f"Found {len(stories)} stories so far")
            print(f"Dumping {len(stories)} stories to pickle {pickle_url}")
            with open(pickle_url, 'wb') as file_pointer:
                pickle.dump(stories, file_pointer)
            time.sleep(5)
        except Exception as err:
            print(f"Failed to retrieve and process story {story_link}, adding to failed_story_urls.p")
            print(traceback.format_exc())
            failed_story_urls.append(story_link)
            with open('failed_story_urls.p', 'wb') as file_pointer:
                pickle.dump(failed_story_urls, file_pointer)
Now all stories from the magazine are saved in a local pickle file.
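As a quick sanity check - not part of app.py, just a sketch that uses load_from_pickle() and the stories dictionary structure described in the docstring above - you can load stories.p and look at what was collected:

stories = load_from_pickle("stories.p")
print(f"Collected {len(stories)} stories")

# word_count is stored as a string, so convert before summing
total_words = sum(int(story["word_count"]) for story in stories.values())
print(f"Total words collected: {total_words}")

# peek at one story's metadata
first_story = next(iter(stories.values()))
print(f'{first_story["title"]} by {first_story["author"]} ({first_story["issue"]})')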
If app.py is run using python3 app.py, we want the main() function to be executed. Running a .py file directly with python3 sets that module's __name__ attribute to __main__. So check whether __name__ is __main__:
if __name__ == '__main__':
    main()
The main() function will get and save issue links as a pickle, get and save story links as a pickle, and get and save stories as a pickle:
def main():
    """
    Run functions to retrieve and process html from Lightspeed magazine

    Only need to run each once to save data to pickle files.
    Three pickle files are saved (with the suggested names):
        issue_links.p is a list of url strings for all magazine issues
        story_links.p is a list of url strings for all stories in the magazine
        stories.p is an object containing story content and metadata for all stories
    """
    request_and_find_and_save_issue_links(pickle_url="issue_links.p")
    issue_links = load_from_pickle("issue_links.p")
    request_and_find_and_save_story_links_from_issues( \
        issue_links=issue_links, pickle_url="story_links.p")
    story_links = load_from_pickle(url="story_links.p")
    request_and_find_and_save_stories_from_story_links( \
        story_links=story_links, pickle_url="stories.p")


if __name__ == '__main__':
    main()
That's it! Run python3 app.py and after some time we'll have 869 of the magazine's stories available in the pickle file stories.p.
A few stories have a different HTML structure and throw errors when processed - we’ll have to manually account for those in our code later.
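Those failures are written to failed_story_urls.p by the except block above, so a small sketch like this (again using load_from_pickle()) can list which story URLs need manual attention:

failed = load_from_pickle("failed_story_urls.p")
print(f"{len(failed)} stories failed to process:")
for url in failed:
    print(url)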
To update the data in the future, remove issue_links.p and story_links.p (or move them to the old-pickles directory) and run the script again. You don't need to remove stories.p since the Python script only saves new stories.
This process can be extended to other speculative fiction magazines that have content available online for free. In the future we’ll use these stories to answer some interesting questions about speculative fiction.