On the first weekday of each month, at 11 am, the whoishiring user posts a thread titled ‘Ask HN: Who is hiring?’ on Hacker News. Companies from across the tech industry respond, posting the jobs they’re looking to fill.
I like to read these threads to see what companies are hiring for in my area, and also to scope out remote-friendly companies that I might be interested in working for in the future. Reading the thread in a browser, though, introduces a few problems:
There is a lot of stuff in the threads I don’t care about.
The browser’s search functionality isn’t able to narrow results enough — searching “remote” in this month’s Who is hiring? thread yields 153 matches; many remote job listings mention the word more than once, and some of these matches are in child comments where people are asking things like “is this job remote friendly?”. I am only interested in top-level comments — these are the job posts — and I only want one match per job.
There is no way for me to track state in the browser. I want to know which job listings I’ve already read and track which jobs I’m interested in, which I’ve applied to, etc.
I want to be able to pull the data for only the jobs I care about down to my computer and store it in a format that is both human- and machine-readable and easy to edit. Hacker News offers an API which should make grabbing the data easy; I want to save it as an org-mode file because org files are plain text (and therefore easy to script) but offer powerful editing and task-management capabilities that will let me track state by marking jobs I’m interested in as org tasks.
In the rest of this post, I’m going to walk through my process of writing the script. I’m using Ruby because I’m still learning it, and I’d like an opportunity to play with its Net::HTTP and JSON libraries. I’m developing the script with Org Babel, which lets you run blocks of code directly from an org document, somewhat like a Jupyter notebook. This blog post is itself the org document I’m using to write the script; when I’m done, I’ll export it to Jekyll-compatible markdown for my blog. I hope this captures some of the thought process behind writing the script — I’m going to develop it iteratively, circling back and refactoring things as I go.
Scraping the thread
The first thing we need to do is fetch the thread from the API. I create a class for the scraper here and initialize it with an instance variable, @threads, that I’ll eventually use to store the threads I’m downloading. For now, the method returns the response body rather than storing it in @threads, so I can test that it’s working. I also check that I receive a 200 response from the API; if not, the method returns the response code instead of the response body.
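A minimal sketch of this first iteration might look like the following. The class name HNJobScraper and the method get_jobs_thread are from the post; the API base URL is Hacker News’s public Firebase endpoint, and passing the thread’s item ID as an argument is my assumption.

```ruby
require 'net/http'
require 'uri'

class HNJobScraper
  API_BASE = 'https://hacker-news.firebaseio.com/v0'.freeze

  def initialize
    @threads = {} # will eventually hold the downloaded threads
  end

  # Fetch a thread by its item ID. Return the response body on a 200,
  # otherwise return the response code.
  def get_jobs_thread(id)
    uri = URI("#{API_BASE}/item/#{id}.json")
    response = Net::HTTP.get_response(uri)
    response.code == '200' ? response.body : response.code
  end
end
```

Note that Net::HTTP reports the status code as a string, hence the comparison against '200'.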
Parsing the JSON
Running that returns a very large string of unparsed JSON data that starts off like this: {"by":"whoishiring","descendants":743,"id":22465476,"kids":[22513275,22466243.... A string isn’t a useful way to structure this data, so let’s parse it into something more usable with the JSON module from Ruby’s standard library, which converts the JSON into a hash. I also store the parsed response body in the @threads instance variable now, and have get_jobs_thread always return the response code. I make @threads a hash of threads in case I want to use the script to scrape more than one thread at a time.
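The revised method might look something like this — a sketch in which the parsed thread is stored in @threads keyed by item ID, and the method’s return value is always the response code:

```ruby
require 'net/http'
require 'uri'
require 'json'

class HNJobScraper
  API_BASE = 'https://hacker-news.firebaseio.com/v0'.freeze

  attr_reader :threads

  def initialize
    @threads = {} # item ID => parsed thread hash
  end

  # Parse the JSON body into a hash and store it in @threads;
  # always return the response code.
  def get_jobs_thread(id)
    uri = URI("#{API_BASE}/item/#{id}.json")
    response = Net::HTTP.get_response(uri)
    @threads[id] = JSON.parse(response.body) if response.code == '200'
    response.code
  end
end
```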
This returns:
This matches up with the fields in the API’s documentation — that’s good! The fields I care most about are title, which contains the title of the thread, and kids, which contains the item ids of the thread’s children — these are the top-level comments on the thread, i.e. the job postings.
Scraping the comments (the jobs)
The next step is to download each job listing. I add a @jobs array to the class to store them, and I pull the code that makes requests to the API out of get_jobs_thread and into its own method, get_by_id, since the new get_jobs method will also need it. Then I loop through each item in the thread['kids'] array, download each one by its item ID using the Hacker News API, and push each response onto the @jobs array.
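The refactor described above might be sketched like so; get_by_id holds the shared request logic, and get_jobs walks the thread’s kids (the top-level comments, i.e. the job posts):

```ruby
require 'net/http'
require 'uri'
require 'json'

class HNJobScraper
  API_BASE = 'https://hacker-news.firebaseio.com/v0'.freeze

  attr_reader :threads, :jobs

  def initialize
    @threads = {}
    @jobs = []
  end

  # Shared request logic: fetch any item by ID and parse it.
  # Returns nil on a non-200 response.
  def get_by_id(id)
    uri = URI("#{API_BASE}/item/#{id}.json")
    response = Net::HTTP.get_response(uri)
    JSON.parse(response.body) if response.code == '200'
  end

  def get_jobs_thread(id)
    @threads[id] = get_by_id(id)
  end

  # Download every top-level comment (job post) in the thread.
  def get_jobs(thread)
    thread['kids'].each { |kid| @jobs << get_by_id(kid) }
  end
end
```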
This outputs:
Filtering the jobs
Now that we have all the jobs, I want to pull out only the ones I’m interested in. I add the following method to the HNJobScraper class. Given a regular expression, it loops through each job and pushes the ones with matching text fields to a matches array. Then it returns the matches array.
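A sketch of that filter method follows. The name filter_jobs is my assumption, as is the jobs accessor, which exists here mainly so the class is easy to exercise without hitting the network:

```ruby
class HNJobScraper
  attr_accessor :jobs

  def initialize
    @jobs = []
  end

  # Return every job whose text field matches the given regular expression.
  def filter_jobs(regex)
    matches = []
    @jobs.each do |job|
      # to_s guards against deleted comments, which have no text field
      matches << job if job['text'].to_s.match?(regex)
    end
    matches
  end
end
```

Passing a case-insensitive regex like /remote/i keeps the match from depending on how the poster capitalized the word.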
Writing it to an org file
Now we have an array of all the jobs matching our search criteria, but we need a way to write that data to a file in a structured way. This class opens a file and writes the data to an org-mode file.
I prepend a * to the first line of every job post — this turns that line into a top-level org heading. Since most jobs follow a pattern like Company Name | Job Role | Location on the first line, this produces a list of collapsible org headings, each a brief description of a job, that I can expand to view the whole listing.
The rest of the substitutions are hacked together from reading some of the jobs’ text fields and trying to make them more human-readable.
Making it a reusable tool
To make this a useful, reusable tool, the final thing I’m going to do is add some command line options and glue everything together.
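The command-line glue might be sketched with OptionParser from Ruby’s standard library. The flag names match the example invocation below; wrapping the parsing in a method is my own choice, made so the logic is easy to test:

```ruby
#!/usr/bin/env ruby
require 'optparse'

# Parse the scraper's command-line flags into an options hash.
def parse_options(argv)
  options = {}
  OptionParser.new do |opts|
    opts.banner = 'Usage: hn_scraper.rb -i THREAD_ID -f REGEX -o FILE'
    opts.on('-i', '--id ID', 'item ID of the hiring thread') { |v| options[:id] = v }
    opts.on('-f', '--filter REGEX', 'regex to filter jobs by') do |v|
      options[:filter] = Regexp.new(v, Regexp::IGNORECASE)
    end
    opts.on('-o', '--output FILE', 'org file to write') { |v| options[:output] = v }
  end.parse!(argv)
  options
end
```

From there, gluing everything together is a matter of calling get_jobs_thread with the ID, filtering with the regex, and handing the matches to the writer.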
Running ./hn_scraper.rb -i 22465476 -f '(boston|remote)' -o jobs.org returns an org file that looks like this in Emacs:
It’s more readable, I can quickly delete anything I don’t want, and I can turn any heading into an org-mode task to track its state. I’d consider this a success.