Turning a list of domains into a Markdown-formatted ordered list with regular expressions
I’m teaching a course called From Blogs to Wikis at Washington University in St. Louis. One of the things I like to show my students is a list of the 500 most popular websites in the world, so they can see how many of them (a) are related to social software, & (b) aren’t American. I grabbed the latest data, as I do every year, & massaged it with regular expressions so that it could go on my course website in Markdown format. It took me 3 passes to do it, which I’ve outlined here for others interested in regex … & the 500th most popular site on the Web!
I went to http://www.alexa.com/topsites/global & downloaded the CSV file containing today’s top million websites, as measured by Alexa. I then opened the CSV in Excel[1], selected both columns of the first 500 rows, & pressed Command+C to copy the data. I closed Excel, opened Sublime Text, & pressed Command+V to paste that data.
I now had 500 lines that looked like this:
Remember, my ultimate goal is to have this list in Markdown, and Markdown needs a period after a number to turn it automagically into an ordered list. To get that, I used this RegEx:
The Find means the following:
- `^` says to start at the beginning of the line.
- `[0-9]` says to look for a number between 0 & 9.
- `+` says to look for 1 or more of the previous item, which in this case are numbers between 0 & 9. At this point, I’ve matched all the numbers in the list, from 1 to 500 (without the `+`, my regex would just match numbers of one digit).
- `(` & `)` around everything so far groups the matched string together, so it can be backreferenced & re-used.
In the Replace, `\1` is a backreference to the strings matched inside the group denoted by the `(` & `)` of the Find. It’s `1` because it matches the first `(…)`. You’ll see what to do with more than one group in just a moment.
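For anyone who would rather script this than use an editor, here is a minimal sketch of that first pass in Python's `re` module. The sample rows are placeholders I made up (not the real Alexa data), and the spacing between the columns is an assumption. The Find and Replace strings are reconstructed from the description above.

```python
import re

# Hypothetical sample rows: placeholder domains and assumed spacing,
# not the actual Alexa list.
raw = "1    example.com\n2    example.org\n10   example.net"

# ^ anchors at the start of each line (because of re.MULTILINE),
# [0-9]+ matches the whole rank, and ( ) captures it as group 1.
# The replacement re-inserts group 1 and adds a period after it.
numbered = re.sub(r"(^[0-9]+)", r"\1.", raw, flags=re.MULTILINE)
print(numbered)

# Without the +, only the first digit is matched, so the period
# lands in the middle of a multi-digit rank.
one_digit = re.sub(r"(^[0-9])", r"\1.", "10   example.net")
print(one_digit)
```

The second call is only there to show why the `+` matters once the ranks go past 9.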
After my find & replace, the 500 lines now looked like this:
It’s nice how the names all line up, but I didn’t want those extra spaces after the numbers, so I used this regex (to make it obvious, I wrote out `SPACE` where you’re supposed to enter a single space):
By now, it should be obvious that `[SPACE]+` matches one or more spaces, which I am then replacing with a single space. Easy.
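As a rough Python equivalent of this second pass, assuming the columns really are separated by plain space characters (tabs would need a different pattern), the `SPACE` placeholder becomes a literal space in the pattern. The input rows are again made-up placeholders.

```python
import re

# Hypothetical output of the first pass (placeholder domains).
numbered = "1.    example.com\n2.    example.org\n10.   example.net"

# " +" is a literal space followed by +, i.e. one or more spaces;
# each run is replaced by exactly one space.
tidy = re.sub(r" +", " ", numbered)
print(tidy)
```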
The 500 lines now looked like this:
Markdown, however, needs to surround a typed URL with a `<` & a `>` to turn it automagically into a hyperlinked URL. To get that, I used this RegEx:
This is a bit more interesting, with the Find meaning the following:
- The first group, `(^[0-9]+\. )`, breaks down as follows:
  - `^` says to start at the beginning of the line.
  - `[0-9]+` looks for one or more digits, thereby matching 1 through 500.
  - `\.` looks for a period. You have to preface the dot with a backslash, so your regex knows you’re actually looking for a period, as a dot by itself matches any character.
  - The space after `\.` matches the single space that follows the period.
- The second group, `(.*)` (and notice that there is a second group here!), means this:
  - `.` is intended to be a single dot, which matches any character. Domain names can contain letters (small & capital), numbers, dots, & hyphens, so the dot will work with all of those.
  - `*` says to look for zero or more instances of the previous item, so placing it right after the `.` will match any length of any characters (except for `\n` & `\r`, which indicate line feed and carriage return, respectively), thereby matching any of the domain names listed. Yes, it’s kind of silly to use a `*` here, as we know that we don’t have zero characters in any of the domain names, so it would have made more sense to use a `+`. I can only plead long habit.
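To make the difference between `*` and `+` concrete, here is a small Python sketch. One caveat: engines differ slightly on which line-ending characters the dot skips. In Python's `re`, the dot by default excludes only `\n`. The strings are placeholders.

```python
import re

# .* is happy with zero characters; .+ insists on at least one.
star_matches_empty = re.fullmatch(r".*", "") is not None
plus_matches_empty = re.fullmatch(r".+", "") is not None

# The dot stops at a line feed, so .* only consumes the first line.
first_line = re.match(r".*", "example.com\nexample.org").group(0)

print(star_matches_empty, plus_matches_empty, first_line)
```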
The Replace does the following:
- `\1` is a backreference that matches every string found by `(^[0-9]+\. )`, which here means everything from `1. ` (note the space after the `.`) to `500. ` (again, note the space), and then re-inserts that string back in the Replace.
- Then an opening angle bracket `<` to start the process of making a typed URL into an active hyperlink in Markdown.
- `\2` is a backreference to the strings matched by the second group `(.*)`, here meaning all the domain names, which are then re-inserted back in the Replace.
- Finally, the closing angle bracket `>` to end the Markdown hyperlink.
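The third pass can be sketched the same way in Python. The Replace string `\1<\2>` is my reconstruction from the breakdown above, and the input rows are hypothetical placeholders.

```python
import re

# Hypothetical output of the second pass (placeholder domains).
tidy = "1. example.com\n2. example.org\n10. example.net"

# Group 1 captures the rank, the period, and the space;
# group 2 captures the rest of the line (the domain name).
# The replacement puts group 2 between angle brackets.
linked = re.sub(r"(^[0-9]+\. )(.*)", r"\1<\2>", tidy, flags=re.MULTILINE)
print(linked)
```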
At the end of this multi-step process, my 500 lines looked like this:
Done!
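If you do this every year, as I do, the three passes could be chained in a short script. This is only a sketch under the same assumptions as before (space-separated columns, reconstructed patterns). The name `to_markdown_list` and the sample rows are hypothetical.

```python
import re

# The three Find/Replace pairs, in the order they were applied.
PASSES = [
    (r"(^[0-9]+)", r"\1."),            # add a period after the rank
    (r" +", " "),                      # collapse runs of spaces
    (r"(^[0-9]+\. )(.*)", r"\1<\2>"),  # wrap the domain in < >
]

def to_markdown_list(text: str) -> str:
    """Apply each pass in turn, treating ^ as start-of-line."""
    for find, replace in PASSES:
        text = re.sub(find, replace, text, flags=re.MULTILINE)
    return text

# Hypothetical two-row sample, not the real data.
sample = "1    example.com\n2    example.org"
result = to_markdown_list(sample)
print(result)
```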
[1] I hardly ever use Excel. This is one of the very rare times. For small jobs, I really like Numbers, but Numbers would shriek in horror & faint at the thought of one million rows of data.