I’ve been using a web scraper—named scrape-site, for the sake of this blog post—that takes around 5 minutes to recursively scrape a website. During one of my scrape sessions, I’ll continually look for more sites to scrape. Because it would be annoying to wait, I’d like to be able to immediately queue any site I find; however, scrape-site is just a plain old command-line application. It wasn’t designed to support queueing.
If scrape-site were a UI-driven commercial product, I’d be furiously writing emails of displeasure to its developers: what an oversight to forget a queueing feature! Luckily, though, scrape-site being only a single-purpose console application is its biggest strength: it means that we can implement the feature ourselves.
If I had a list of sites (sites-to-scrape) in advance, then I could use xargs to do the following:
$ cat sites-to-scrape
http://siteA.com
http://siteB.com
$ xargs -I {} scrape-site {} < sites-to-scrape
This works fine; however, it requires that I have a complete list of sites-to-scrape in advance. I don’t. My workflow has a dynamically changing queue that I want to add sites to. With that in mind, one change would be to omit the sites-to-scrape input, which will cause xargs to read its input from the console:
$ xargs -I {} scrape-site {}
http://siteA.com/
http://siteB.com
This is better: I can just paste a site into the console and press enter to queue it. However, I’m now restricted to writing everything into the console rather than being able to submit list files. In effect, I’ve gained the ability to add sites dynamically (good) but can now only write, or copy and paste, items into a console window (bad).
What we need is a way of having the xargs -I {} scrape-site {} application listen on something that can dynamically receive messages from any source at any time. One way to do this is to set up a server that listens for queue items on a socket. Applications can then just write messages to that socket.
That would require a fair bit of coding if it was done bespokely. Luckily, however, we live in a world with netcat. I wrote about the fun and games that can be had with netcat previously, and I’ve been falling in love with it ever since. It’s a fantastic union of network protocols (TCP/UDP) and standard input/output, which is exactly what we need.
With netcat, almost any command-line application can be set up as a FIFO server:
$ netcat -lk -p 1234 | xargs -I {} scrape-site {}
This command causes netcat to listen (-l) on port 1234 (-p). Whenever it receives a message on that port, it will write it to its standard output. In this case, its standard output has been piped into an xargs instance that, in turn, calls scrape-site with the message. netcat can also be told to keep listening after receiving a message (-k).
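One caveat: the flags above match the traditional netcat. If your system ships the OpenBSD variant (usually installed as nc), the port is given as a positional argument rather than via -p, so the equivalent listener would look something like this:
$ nc -lk 1234 | xargs -I {} scrape-site {}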
With the server set up, we then configure a client command that sends a message to the server. This can also be done using netcat:
$ echo "http://siteA.com" | netcat localhost 1234
This echoes the site into netcat’s standard input. The message is then sent to localhost (assuming you’re running the server on the same computer).
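A nice side effect is that the client isn’t limited to one URL at a time. Since netcat just forwards whatever arrives on its standard input, the original file-based workflow still works over the same socket (assuming, as before, that sites-to-scrape contains one URL per line):
$ netcat localhost 1234 < sites-to-scrape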
I found this approach very useful during my scraping sessions because I could just continually queue up sites at will without having to worry about what was currently running. Because it’s so simple, the entire thing can be parametrized into a pair of short bash scripts quite easily:
scrape-srv.sh
#!/bin/bash
# Usage: scrape-srv
netcat -lk -p 1234 | xargs -I {} scrape-site {}
scrape.sh
#!/bin/bash
# Usage: scrape site_url
echo "$1" | netcat localhost 1234
Another benefit of this is that I can now run a remote queueing server, which I doubt scrape-site was ever designed for; it’s a one-line change to the client script, sketched below. The magic of the Unix philosophy. I imagine this pattern will come in handy for any long-running or state-heavy application that needs to continually listen for messages.
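For the remote version, the client just needs to point at the server’s hostname instead of localhost. Here scrape-host is a placeholder for whatever machine is running scrape-srv.sh; bear in mind that anything able to reach that port can queue scrape jobs, so keep it on a trusted network:
#!/bin/bash
# Usage: scrape site_url
# scrape-host is a placeholder for the machine running scrape-srv.sh on port 1234
echo "$1" | netcat scrape-host 1234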