Archiving your past tweets as Markdown with a good text editor & a few simple bash commands

Over the past week or so, several very smart folks have been blogging about using the way-cool multi-service If This, Then That (IFTTT) to automatically archive all of your tweets to a file in your Dropbox. I’ve learned lots of useful info from these articles:

Justin Blanton: IFTTT recipe to log tweets to a Dropbox file
Ian Beck: Archiving tweets using IFTTT and Dropbox
Brett Terpstra: Web Excursions: Twitter Hacking Edition

The original IFTTT recipe archived tweets to a plain text file (good!) but without any Markdown goodness (not so good!). Later, others figured out how to output Markdown (great!), so that’s what I’ve been doing. In my case, I save all tweets to a file named Twitter Archive.txt in my Dropbox, using this template:

<br><br>
[]()<br><br>
---<br>

Now, this is great for all future tweets, but what about previous tweets? If you read Brett’s article above, he links to several solutions to that problem: scripts that download your previous tweets¹ so that you can add them to your Twitter archive file in Dropbox before you enable the IFTTT recipe (or after, if you don’t mind cutting & pasting). However, I had several problems with those solutions:

They were all based on Python or other programming languages, & I don’t know those. I can mess around with (some of) them, but not with any high (or even low) level of competence.
I like bash & shell scripting & command line tools so I’d prefer to use those.
The outputs of those other scripts may not match the template I like, & I’d prefer not having to muck about in scripting languages I don’t speak to fix that.

So instead, here’s what I did to grab my archive of tweets & get them ready for any new additions provided by IFTTT. This worked great on Mac OS X, & most of it should work as well on any UNIX out there. Windows? I have no idea.

Grab your archive of old tweets

The easiest way I know of to grab your old tweets is to use a free site called, cleverly enough, All My Tweets, at http://www.allmytweets.net.

Go there in Google Chrome², enter your Twitter handle—you do NOT need to enter your password, so this is safe—press Get Tweets, and a few moments later you should see a list of your tweets.

Save the webpage to your Desktop (just the HTML is fine—you don’t need any of the supporting CSS or JavaScript) & open it in BBEdit³.

First of all, let’s get this wad of HTML turned into something a bit easier to read & parse. In BBEdit, go to Markup > Utilities > Format… & select Pretty Print.

Now look through your HTML & delete everything above & below the list of tweets (so that the first thing on your page is <li> & the very last thing is </li>), so that you now have a long series of list items that looks like this:

<li>RT @wassilyk: Donald Fagen has posted a lovely tribute to Levon Helm on his website - <a href="http://t.co/x0rJV0wM">http://t.co/x0rJV0wM</a> #RIPLevon @levonhelmramble <span class="created_at">Apr 29, 2012</span> <a href="https://twitter.com/#!/scottgranneman/status/196429134795251712"><img src="./All My Tweets - View all your tweets on one page._files/extlink.png"></a></li>
<li>Me: How do you kill a zombie, Finny? Finny: You shoot him! Me: That’s right! But *where* do you shoot him? Finny: Uh ... outside! <span class="created_at">Apr 29, 2012</span> <a href="https://twitter.com/#!/scottgranneman/status/196419681203138560"><img src="./All My Tweets - View all your tweets on one page._files/extlink.png"></a></li>
<li>The final shot in Monsters Inc is as powerful &amp; evocative as that in Chaplin’s City Lights. I’m sure Pixar meant it as an homage. Beautiful. <span class="created_at">Apr 29, 2012</span> <a href="https://twitter.com/#!/scottgranneman/status/196404206662451200"><img src="./All My Tweets - View all your tweets on one page._files/extlink.png"></a></li>

That’s nice, I guess, but it’s not in a format we can use with IFTTT. Next step!

Markdown-ify the HTML

Let’s transform this from dense HTML into clean & pleasant Markdown that matches the IFTTT recipe⁴. To do so, use BBEdit (or the equivalent) to perform the following Find & Replaces (Grep & Wrap Around were both checked).

Add the separators between each tweet:

Find: </li>\r<li>
Replace: \r\r---\r\r

You’ll have to manually delete the <li> at the beginning of the file & the </li> at the end. No biggie.

Make the dates of the tweets into hyperlinks. First the very end, the close parenthesis:

Find: "><img src="./All My Tweets - View all your tweets on one page._files/extlink.png"></a>
Replace: )

Now the beginning, the open bracket:

Find:  <span class="created_at">
Replace: \r\r[

And then the middle, the close bracket followed by an open parenthesis:

Find: </span> <a href="
Replace: ](

Finally, replace links in tweets with Markdown-compatible formatting, so that <a href="http://t.co/zfpoTT5V">http://t.co/zfpoTT5V</a> becomes <http://t.co/zfpoTT5V>. Note that in the second Find, the character after the ^ is a bullet (•), not a period or asterisk; on the Mac, you produce it by typing Option+8. That regex inside the quotation marks, by the way, means “find any characters between quotation marks that are NOT a •”. I used the • because that’s a character that you should never find in a real URL.

Find: </a>
Replace: >

Find: <a href="[^•]*?">
Replace: <

Save the file as tweets.txt into a directory like ~/tweets⁵.
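If you would rather stay on the command line, the find-and-replace passes above can be approximated in bash. What follows is only a sketch, not what was actually used here: li_to_markdown and the input name tweets.html are made-up names, & it assumes each tweet sits on a single <li>…</li> line exactly like the samples shown earlier.

```shell
# Sketch: convert saved <li>...</li> lines (one tweet per line) to the Markdown layout.
# li_to_markdown is a hypothetical helper name, not part of any tool mentioned here.
li_to_markdown() {
  local line text date url first=1
  # Capture the tweet text, the date, and the status URL from one list item.
  local re='^[[:space:]]*<li>(.*) <span class="created_at">([^<]*)</span> <a href="([^"]*)"><img [^>]*></a></li>[[:space:]]*$'
  while IFS= read -r line; do
    [[ $line =~ $re ]] || continue   # skip anything that is not a tweet line
    text=${BASH_REMATCH[1]}
    date=${BASH_REMATCH[2]}
    url=${BASH_REMATCH[3]}
    # Rewrite <a href="URL">label</a> inside the tweet as <label>.
    text=$(printf '%s\n' "$text" | sed -E 's#<a href="[^"]*">([^<]*)</a>#<\1>#g')
    # Put the separator between tweets, not before the first one.
    if (( first )); then first=0; else printf -- '---\n\n'; fi
    printf '%s\n\n[%s](%s)\n\n' "$text" "$date" "$url"
  done
}

# Usage (tweets.html is an assumed name for the trimmed HTML file):
#   li_to_markdown < tweets.html > tweets.txt
```

Lines that don't match the expected shape are silently skipped, so compare the number of tweets in & out before trusting the result.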

Re-order the tweets

The problem now is that tweets.txt is ordered in reverse chronological order, with the most recent tweets at the top, & the oldest at the bottom. The IFTTT recipe, however, orders tweets in the opposite order, with the newest at the bottom, in the same way you would use >> to append to a file.

So we need to reorder the contents of the file. Again, I’m sure you can do this easily with Python or Perl or the like, but I don’t know those languages. And anyway, it’s entirely possible to do it without any programming languages at all beyond good ol’ bash itself.

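First, let’s use the split command to divide tweets.txt into a whole bunch of separate little files, one for each tweet. We tell split to start a new file at each line matching --- by using -p ---. (Note that -p comes from the BSD split that ships with Mac OS X; GNU split, the one on most Linux systems, doesn’t offer it.)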

If you stop there, though, you’ll run into a problem unless you have a small number of tweets: split by default names the files it creates x followed by two letters, starting with aa and going all the way up to zz, giving you everything from xaa to xzz. If you have thousands of tweets, this isn’t enough. Fortunately, split will complain, so you’ll know you need to tell it to use more letters. To do that, use -a 4, which informs split that it should start with xaaaa & go all the way up to xzzzz⁶.

split -a 4 -p --- tweets.txt
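To see how much room each suffix length gives you, the arithmetic is just 26 multiplied by itself once per letter. Here is a small sketch of that calculation; suffix_capacity is a made-up helper name.

```shell
# How many file names split can generate for a given suffix length
# (26 possible letters in each position).
suffix_capacity() {
  local len=$1 total=1
  while [ "$len" -gt 0 ]; do
    total=$(( total * 26 ))
    len=$(( len - 1 ))
  done
  echo "$total"
}

for len in 2 3 4; do
  echo "-a $len allows $(suffix_capacity "$len") files"
done
# -a 2 allows 676 files
# -a 3 allows 17576 files
# -a 4 allows 456976 files
```

So the default two letters run out well before 3200 tweets, while three or four letters leave plenty of headroom.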

The end result should be a bunch of files, one for each tweet. You can see the list of files with ls, & then take a look at the first one:

cat xaaaa

& the last one (obviously the name could be very different for you):

cat xaeht

…to verify that your tweets have been separated into different files.
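If your split has no -p option (GNU split, for one, lacks it), a rough stand-in is a few lines of awk wrapped in a shell function. This is a sketch under assumptions: split_on_separator is a made-up name, the pieces are numbered x00000, x00001 & so on rather than lettered, & a separator is taken to be a line containing exactly ---.

```shell
# Sketch: split a file into one piece per tweet without relying on split -p.
# split_on_separator is a hypothetical helper; pieces land in the current
# directory as x00000, x00001, ... so they still sort in order.
split_on_separator() {
  awk '
    BEGIN   { n = 0; out = sprintf("x%05d", n) }
    /^---$/ { close(out); n++; out = sprintf("x%05d", n) }  # separator starts a new piece
            { print > out }                                  # every line goes to the current piece
  ' "$1"
}

# Usage, from inside the directory holding the file:
#   split_on_separator tweets.txt
```

Because the names still start with x & sort correctly, a merge that loops over x* in reverse order should work on them unchanged.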

Now we can merge the separated tweets into a new file that puts things in chronological order⁷:

for tweet in $(ls -1 x* | sort -r) ; do cat $tweet >> tweets-reversed.txt ; done
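For comparison, here is a variant of that loop that lets the shell expand x* itself instead of parsing ls output. It is only a sketch: merge_reversed is a made-up name, & it assumes the piece names contain no newlines. Unlike the one-liner, it empties the output file first, so running it twice won't double your tweets.

```shell
# Sketch: concatenate the split pieces in reverse name order without parsing ls.
# merge_reversed is a hypothetical helper; $1 is the output file, which should
# not itself start with "x" or it would be picked up by the glob on a rerun.
merge_reversed() {
  local out=$1
  : > "$out"                                   # start from an empty output file
  printf '%s\n' x* | sort -r | while IFS= read -r piece; do
    cat "$piece" >> "$out"                     # quoted, so odd names stay intact
  done
}

# Usage, from inside the directory holding the pieces:
#   merge_reversed tweets-reversed.txt
```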

Check to make sure things are good:

head tweets-reversed.txt
tail tweets-reversed.txt

You’ll note that when you tail the file, the last tweet is missing the --- above it⁸, so that needs to be added manually. Do that, move the file to the location & file name you plan to specify in the IFTTT recipe & you’re now ready to enable things at IFTTT.
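If you'd rather not hand-edit, one option is to put the separator on top of the first split piece before merging, since that piece ends up last in the reversed file. This is a sketch, not part of the original workflow: prepend_separator is a made-up name, & it assumes the first piece is xaaaa & that you run it before the merge loop.

```shell
# Sketch: add the "---" separator (plus a blank line) to the top of one piece.
# prepend_separator is a hypothetical helper; $1 is the piece to modify.
prepend_separator() {
  local piece=$1 tmp
  tmp=$(mktemp "${TMPDIR:-/tmp}/piece.XXXXXX") || return 1
  # Write separator + original contents to a temp file, then swap it in.
  { printf -- '---\n\n'; cat "$piece"; } > "$tmp" && mv "$tmp" "$piece"
}

# Usage, before merging the pieces:
#   prepend_separator xaaaa
```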

¹ Twitter limits access to your last 3200 tweets, so get cracking!
² I tried it in Safari, & the page wasn’t saved correctly. I tried it in Chrome, & it worked perfectly. I didn’t try it in any other browser. Go ahead & see what you get.
³ I prefer BBEdit for this sort of thing, but you can use any text editor that does a decent job with regular expressions.
⁴ Yes, there are several scripts & programs out there that will convert HTML to Markdown (I’m a big fan of Pandoc, myself), but they wouldn’t leave things in a state that matches what the IFTTT recipe will generate. And besides, it’s always fun to learn more about BBEdit.
⁵ Yeah, I know that the path is now ~/tweets/tweets.txt. No biggie: they’re going to be blown away soon anyway.
⁶ Since Twitter only allows you to download the last 3200 tweets, this should be more than enough. Heck, -a 3 might be enough for all I know; math ain’t my strong suit.
⁷ Normally you should never use $(ls FILENAMES) as it can be problematical if the file name has spaces in it, leading to word splitting. For more on this issue, & the right way to do things, see Greg Wooledge’s Bash Pitfalls. But in this case, we know that all the file names are space-less, so I went ahead & did it.
⁸ This is thanks to the way split works & the way the file was constructed initially.