How to save a perfectly-scraped webpage into DEVONthink using IFTTT, Diffbot, Hazel, & several command line tools
DEVONthink is a key piece of software for me on my Mac. In particular, I use it to store copies of webpages that I run across that I want students to read or that I want to refer back to for teaching, or for writing, or for my own use. Now, it’s very easy to get webpages into DEVONthink by using the browser extensions that come with the software. You click on the extension, & you get a small window:
See the Format menu? When you click it, you get several choices:
This is great, as is the checkbox for Instapaper, which runs the webpage through that awesome service & gives you results with just the featured content & none of the crap. But even with Instapaper, these results are not perfect, at least for me.
Here’s my problem: I want a webpage so that I can see images & hyperlinks & other stuff that only comes with the Web. I like PDFs, but not when I can just have good ol’ HTML to deal with. But if I choose the HTML Page or Web Archive options, then I get a bunch of junk I don’t want, like ads & extraneous content. If I check the box next to Instapaper, I get less junk, but I lose a lot of control over what gets selected & what doesn’t get selected, & the original URL of the webpage, along with a lot of other important metadata, gets stripped away by Instapaper. In other words, I want this:
See? Neat & clean, with the title of the Web article at the top as an H1, & then the author, date of publication, & URL below, all H2s in the HTML, & finally the content & nothing else.
Yes, I know this is picky, but it’s what I really want. So I set out to create it over several months, & I finally got it all figured out & set up & working this summer. After testing it for months to verify that it works well, I am now ready to unveil this process to you, the readers of Chainsaw on a Tire Swing.
Before I dig in to the details, let me give you the 20,000-foot summary of the process. It might seem complicated, & I guess it kinda is, but it’s not that bad if you go through it step by step, & it does work beautifully. I’m going to mention several services in this introduction that you might not have heard of. Don’t worry; I’ll explain everything below.
- Send an email to the IFTTT (If This Then That) service which contains the URL of the webpage at the top of the message.
- IFTTT saves the email as a file in a specific folder in your Dropbox.
- Hazel on your Mac notices the new file in the folder & runs a shell script.
- The shell script grabs the URL out of the file & sends a request to the Diffbot service, which saves the result to the
/tmp
directory as a webpage. - The shell script converts that resulting webpage to a
.webarchive
file & saves it to DEVONthink’s Inbox folder, where it is automatically imported into DEVONthink.
Got that? OK, let’s set it all up!
Table of Contents
Diffbot
I love Diffbot. I really do. It’s the best service of its type I’ve seen, the price is right (free for the 1st 10,000 requests each month!), & the support I’ve received when I’ve had questions or issues has been top-notch. So what’s it do?
Simple. It’s a scraper: you send a request to Diffbot using its API, you get back the data from a webpage, shorn of all the junk. It’s like Safari’s Reader feature, but available programmatically. Here’s an example.
First, a blog post at The Atlantic’s website, as it appears in a browser:
Next, the same post after it’s been passed through Diffbot & brought into DEVONthink:
A bit cleaner, eh?1
So, here’s what you need to do: go to Diffbot’s website, create an account, find out your Diffbot Developer Token (you’ll need it for the shell script), & then come back here.
Dropbox
You don’t have to create these folders exactly where I specify, but if you change their locations, you’re going to need to edit the shell script that’s coming up.
Create a folder at root of your Dropbox named Incoming
. Inside the Incoming
folder, create another folder named DEVONthink
. Your folder structure should therefore look like this: ~/Dropbox/Incoming/DEVONthink
IFTTT
If you don’t already have an account with If This Then That (IFTTT), go get yourself one! It’s a free service that lets you tie together online services so that when one event happens at one service, then something happens as a response. For example, every time you post a picture to Facebook, a copy is placed in a Dropbox folder, or every time a particular RSS feed is published, it’s scanned by IFTTT, & if certain words are in the title, that post is emailed to you. It’s such a great service that I’d pay for it if I had to.
To use it with my process here, create an account at IFTTT if you don’t already have one, log in to IFTTT, & activate the Dropbox & Email channels.
Now go to My Recipes & click Create A Recipe. Here’s what you’re going to fill in:
- Description:
App emails IFTTT a URL, which gets saved as a text file
- Trigger: Send IFTTT an email from your email address with a tag of
dt
(for DEVONthink, get it?). - Action: Create a text file in Dropbox
- Dropbox folder path:
Incoming/DEVONthink
- File name:
Subject
- Content:
Body
- Dropbox folder path:
Save it, & you’re good to go.
So here’s what happens: you find a webpage that you want to capture in DEVONthink. You email the link to yourself, with the URL as the first line of the body of the email (you can have other stuff in the email, like your signature, but it will be ignored by the upcoming shell script). As for the subject, it really doesn’t matter—it can be words, it can be a URL as well, it can be nothing—as long as you have #dt
in it (I always put it at the end because that’s easy).
When the email arrives at IFTTT, it is saved as a text file in the specified Dropbox folder. The subject of your email becomes the name of the file, & the body of your email becomes the contents of the file.
We now have a place in Dropbox for incoming text files containing URLs that we want to use, & a method for getting those text files into Dropbox: emailing IFTTT. But what do we do with those text files once they’re in there? Time for some shell scripting!
Needed command line software
The shell script I’m going to provide has several requirements:
gecho
(the GNU version ofecho
)gsed
(the GNU version ofsed
)dos2unix
(converts text files between Windows & UNIX/Mac OS X formats)jsonpp
(prettifies JSON files)terminal-notifier
(send Mac OS X notifications)webarchiver
(create Safari .webarchive files)
All of those but one are available through Homebrew, so if you haven’t already installed that, you’ll need to do so.
Once you have Homebrew up & running, run this command (it’s not obvious, but coreutils
takes care of gecho
—& a whole lot more besides):
$ brew install coreutils gnu-sed dos2unix jsonpp terminal-notifier
Update from 2016-06-03: Homebrew now includes
webarchiver
, so just add that to the line above. You do not need to download the code & compile it using Xcode, unless you really want to.
If you use MacPorts (who uses that anymore?), you can download webarchiver
pretty easily, according to the developer:
$ sudo port install webarchiver
I don’t use MacPorts, so I have no idea how effectively this is. Instead, you’re going to have download the code & compile it using Xcode.
I went to the GitHub page for the webarchiver project, got a copy of the code (don’t download the release, as that’s 0.3, which is ancient & won’t compile on newer Macs; instead, get the latest code, which is version 0.5), & double-clicked on webarchiver.xcodeproj
to open the project in Xcode. Once in Xcode, I went to Product > Build, which successfully compiled the code, leaving the binary in /Users/scott/Library/Developer/Xcode/DerivedData/webarchiver-dreeepqxmdlkgieggztknlbwsula/Build/Products/Debug/webarchiver
. Obviously, your path under DerivedData
will be different2. I then moved the webarchiver
binary to /usr/bin
.
Once you’ve moved webarchiver
to its new home, test it:
If you see that output, you’re good to go.
The shell script
Place the shell script you see below in your ~/bin
directory. I named it conv_to_webarchive.sh
(you can use your own, but if you change the name, you’ll need to also change the instructions for Hazel that are coming up). I’ve commented the heck out of it, so I hope that helps explain what each step is doing.
Note the following about the script:
-
Make sure the variables are correct for your setup.
-
In particular, you’ll need to enter your Diffbot Developer Token for
diffbot_token
. And no, that’s not mine. I randomly generated a lookalike. -
The paths that start with
/Users/
are all for my Mac. You’ll need to change them for yours. -
You might notice that I encode URLs in the middle; in other words, I turn
http://www.chainsawonatireswing.com/
intohttp%3A%2F%2Fwww.chainsawonatireswing.com%2F
. This is what Diffbot wants, so it is what Diffbot gets. -
It’s pretty easy to test and make sure you’re getting the right results from Diffbot. Just use the line from the script:
curl "http://www.diffbot.com/api/article?token=$diffbot_token&url=$encoded_url&html&timeout=20000" | /usr/local/bin/jsonpp
, but put in your Diffbot Developer Token instead of$diffbot_token
& the encoded URL you want to test instead of$encoded_url
. By piping the output tojsonpp
, you get readable results. -
Yes, I use
sed
(actuallygsed
) a lot. I refer to a file namedconv_to_webarchive.sed
a few times. That file is detailed in the next section. -
You don’t need the lines with
gecho
, but I found them very useful while I was developing & testing the script, & they don’t do any harm, so I left them. If they bother you, take ’em out. -
Notice the lines that say
mv $i $fail_safe_dir
. All are fail-safes in case files can’t be renamed or parsed. This became critical when I did not have them in place, & one night a file got stuck trying Diffbot repeatedly, so that I racked up 15,000 queries or so in just a few hours. Fortunately, I shamefacedly explained what happened to Diffbot support, & they very kindly forgave me. And then I immediately put in place those fail-safes, as I should have from the beginning. So if you see files on your desktop, look in them, as they indicate problems that you need to fix by hand.
The sed
file
In my shell script I refer to conv_to_webarchive.sed
a number of times. If you don’t know what sed
is, it’s basically a way to edit files programmatically from the command line. It’s also very cool & does a million things, most of which I know nothing about (although I’d love to learn!).
Here’s the contents of conv_to_webarchive.sed
:
s/\\u0092/’/g
s/\\u0093/“/g
s/\\u0094/”/g
s/\\u0097/—/g
s/\\u2013/–/g
s/\\u2014/—/g
s/\\u2018/‘/g
s/\\u2019/’/g
s/\\u201C/“/g
s/\\u201c/“/g
s/\\u201D/”/g
s/\\u201d/”/g
s/\\u2020/†/g
s/\\u2021/‡/g
s/\\u2022/•/g
s/\\u2026/…/g
s/\\u2033/"/g
I have built this file up over time, as I have found errors in the results generated by Diffbot & the other programs. Basically, Diffbot stuck the encoding for a character in the results, & I want the actual character itself. So, for instance, instead of an ellipses, I saw \u2026
in the file; my sed
file turns \u2026
back into …
so that it’s readable. As I discover more, I’ll add to the file.
Hazel
So we have a shell script that processes files, but how do we tell the shell script to run? Enter Hazel. Basically, Hazel watches the folders you tell it to watch, & when something changes in those folders, Hazel processes the files according to the rules you specify.
In this case, we’re going to tell Hazel to watch the Incoming/DEVONthink
folder, & when a file is placed inside, the shell script detailed above should run, processing the file. Lather, rinse, repeat.
Open Hazel. On the Folders tab, press the + under Folders to add a new folder. Select ~/Dropbox/Incoming/DEVONthink
.
Under Rules, press the + to add a new rule. A sheet will open, named for the folder; in this case, DEVONthink.
For a Name, I chose Convert URL to DEVONthink webarchive
.
Now you need to make selections so that the following instructions are mirrored in Hazel.
If all
of the following conditions are met:
Extension
is
txt
Do the following:
Run shell script
- Choose Other… & select
~/bin/conv_to_webarchive.sh
Press OK to close the sheet, & then close Hazel.
Test
To test your work, email a URL to trigger@ifttt.com
with #dt
in the subject. A few seconds later, you should see a new entry appear in your DEVONthink Inbox, stripped of extraneous formatting & info thanks to Diffbot.
In subsequent posts, I’m going to tell how to automate emailing that URL using Keyboard Maestro on the Mac & Mr. Reader & other apps on your iOS devices. I could do it here, but this post is long enough already! And even without that info, I still find this process I’ve detailed here to be incredibly useful, so much so that I use it at least 10 times a day to save webpages into DEVONthink. I hope you find it useful too!
-
Now, Diffbot normally does a great job, but not always. In those cases, you need to go to the Custom API Toolkit & tell Diffbot what to do, based on CSS selectors. I’ve been collecting materials for a long post about this that I’ll put up on this site later. ↩
-
Note that this is maybe the second time in my life I’ve messed around with Xcode, so if there’s a better way to do what I did, please let me know. ↩