I’ve been using a web scraper—named scrape-site, for the sake of this blog post—that takes around 5 minutes to recursively scrape a website. During one of my scrape sessions, I’ll continually look for more sites to scrape. Because it would be annoying to wait, I’d like to be able to immediately queue any site I find; however, scrape-site is just a plain old command-line application. It wasn’t designed to support queueing.
If scrape-site were a UI-driven commercial product, I’d be furiously writing emails of displeasure to its developers: what an oversight to forget a queueing feature! Luckily, though, scrape-site being only a single-purpose console application is its biggest strength: it means that we can implement the feature ourselves.
If I had a list of sites (sites-to-scrape) in advance, then I could use xargs to do the following:
$ cat sites-to-scrape
http://siteA.com
http://siteB.com
$ xargs -I {} scrape-site {} < sites-to-scrape
This works fine; however, it requires that I have a complete list of sites-to-scrape in advance. I don’t. My workflow has a dynamically changing queue that I want to add sites to. With that in mind, one change would be to omit the sites-to-scrape input, which will cause xargs to read its input from the console:
$ xargs -I {} scrape-site {}
http://siteA.com/
http://siteB.com
This is better: I can just paste a site into the console and press enter to queue it. However, I’m now restricted to writing everything into the console rather than being able to submit list files. In effect, I’ve gained the ability to add sites dynamically (good) but can now only write, or copy and paste, items into a console window (bad).
What we need is a way of having the xargs -I {} scrape-site {} application listen on something that can dynamically receive messages from any source at any time. One way to do this is to set up a server that listens for queue items on a socket. Applications can then just write messages to that socket.
That would require a fair bit of coding if it was done bespokely. Luckily, however, we live in a world with netcat. I wrote about the fun and games that can be had with netcat previously, and I’ve been falling in love with it ever since. It’s a fantastic union of network protocols (TCP/UDP) and standard input/output, which is exactly what we need.
With netcat, almost any command-line application can be set up as a FIFO server:
$ netcat -lk -p 1234 | xargs -I {} scrape-site {}
This command causes netcat to listen (-l) on port 1234 (-p). Whenever it receives a message on that port, it will write it to its standard output. In this case, its standard output has been piped into an xargs instance that, in turn, calls scrape-site with the message. netcat can also be told to keep listening after receiving a message (-k).
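One caveat: the flags above match the traditional netcat. If your system ships the OpenBSD variant (usually installed as nc), the port is given as a positional argument rather than via -p, so the equivalent listener would look something like this:
$ nc -lk 1234 | xargs -I {} scrape-site {}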
With the server set up, we then configure a client command that sends a message to the server. This can also be done using netcat:
$ echo "http://siteA.com" | netcat localhost 1234
This echoes the site into netcat’s standard input. The message is then sent to localhost (assuming you’re running the server on the same computer).
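A nice side effect is that the client isn’t limited to one URL at a time. Since netcat just forwards whatever arrives on its standard input, the original file-based workflow still works over the same socket (assuming, as before, that sites-to-scrape contains one URL per line):
$ netcat localhost 1234 < sites-to-scrape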
I found this approach very useful during my scraping sessions because I could just continually queue up sites at will without having to worry about what was currently running. Because it’s so simple, the entire thing can be parametrized into a pair of short bash scripts quite easily:
scrape-srv.sh
#!/bin/bash
# Usage: scrape-srv
netcat -lk -p 1234 | xargs -I {} scrape-site {}
scrape.sh
#!/bin/bash
# Usage: scrape site_url
echo "$1" | netcat localhost 1234
Another benefit of this is that I can now run a remote queueing server, which I doubt scrape-site was ever designed for; it’s a one-line change to the client script, sketched below. The magic of the Unix philosophy. I imagine this pattern will come in handy for any long-running or state-heavy application that needs to continually listen for messages.
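For the remote version, the client just needs to point at the server’s hostname instead of localhost. Here scrape-host is a placeholder for whatever machine is running scrape-srv.sh; bear in mind that anything able to reach that port can queue scrape jobs, so keep it on a trusted network:
#!/bin/bash
# Usage: scrape site_url
# scrape-host is a placeholder for the machine running scrape-srv.sh on port 1234
echo "$1" | netcat scrape-host 1234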