Turning a list of domains into a Markdown-formatted ordered list with regular expressions
I’m teaching a course called From Blogs to Wikis at Washington University in St. Louis. One of the things I like to show my students is a list of the 500 most popular websites in the world, so they can see how many of those sites (a) are related to social software, & (b) aren’t American. I grabbed the latest data, as I do every year, & massaged it with regular expressions so that it could go on my course website in Markdown format. It took me 3 passes to do it, which I’ve outlined here for others interested in regex … & the 500th most popular site on the Web!
I went to http://www.alexa.com/topsites/global & downloaded the CSV file containing today’s top million websites, as measured by Alexa. I then opened the CSV in Excel[1], selected both columns of the first 500 rows, & pressed Command+C to copy the data. I closed Excel, opened Sublime Text, & pressed Command+V to paste that data.
I now had 500 lines that looked like this:
1 google.com
2 facebook.com
3 youtube.com
4 yahoo.com
5 baidu.com
6 wikipedia.org
7 live.com
8 twitter.com
9 qq.com
10 amazon.com
…
496 webmd.com
497 rr.com
498 zillow.com
499 google.co.kr
500 mop.com
Remember, my ultimate goal is to have this list in Markdown, and Markdown needs a period after a number to turn it automagically into an ordered list. To get that, I used this RegEx:
Find: (^[0-9]+)
Replace: \1.
The Find means the following:
- ^ says to start at the beginning of the line.
- [0-9] says to look for a digit between 0 & 9.
- + says to look for 1 or more of the previous item, which in this case is a digit between 0 & 9. At this point, I’ve matched all the numbers in the list, from 1 to 500 (without the +, my regex would match only the first digit of each number).
- ( & ) around everything so far group the matched string together, so it can be backreferenced & re-used.
In the Replace, \1 is a backreference to the strings matched inside the group denoted by the ( & ) of the Find. It’s 1 because it matches the first (…). You’ll see what to do with more than one group in just a moment.
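If you’d rather script this pass than run it in an editor, here is a minimal sketch in Python. It is only an illustration: the sample lines & the amount of whitespace between the columns are assumptions, & re.MULTILINE is needed so that ^ matches at the start of every line, the way it does in Sublime Text’s find panel.

```python
import re

# Three sample lines; the exact spacing between columns is an assumption.
lines = "1   google.com\n10  amazon.com\n500 mop.com"

# re.MULTILINE makes ^ match at the start of each line, not just the string.
numbered = re.sub(r"(^[0-9]+)", r"\1.", lines, flags=re.MULTILINE)
print(numbered)

# Without the +, only the first digit of each number is matched,
# so the period lands in the middle of multi-digit numbers.
broken = re.sub(r"(^[0-9])", r"\1.", lines, flags=re.MULTILINE)
print(broken)
```

The second call shows why the + matters: 10 turns into 1.0 & 500 into 5.00.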
After my find & replace, the 500 lines now looked like this:
1. google.com
2. facebook.com
3. youtube.com
4. yahoo.com
5. baidu.com
6. wikipedia.org
7. live.com
8. twitter.com
9. qq.com
10. amazon.com
…
496. webmd.com
497. rr.com
498. zillow.com
499. google.co.kr
500. mop.com
It’s nice how the names all line up, but I didn’t want those extra spaces after the numbers, so I used this regex (to make it obvious, I wrote out SPACE where you’re supposed to enter a single space):
Find: [SPACE]+
Replace: SPACE
By now, it should be obvious that [SPACE]+ matches one or more spaces, which I am then replacing with a single space. Easy.
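The same pass can be sketched in Python, with a literal space inside the brackets where SPACE is written above. The padded sample text is an assumption about what the lines looked like after the first pass.

```python
import re

# Sample text with extra padding after the numbers (assumed spacing).
text = "1.   google.com\n10.  amazon.com\n500. mop.com"

# "[ ]+" is one or more literal spaces; each run becomes a single space.
collapsed = re.sub(r"[ ]+", " ", text)
print(collapsed)
```

Because the character class holds only a space, the line feeds between entries are left alone.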
The 500 lines now looked like this:
1. google.com
2. facebook.com
3. youtube.com
4. yahoo.com
5. baidu.com
6. wikipedia.org
7. live.com
8. twitter.com
9. qq.com
10. amazon.com
…
496. webmd.com
497. rr.com
498. zillow.com
499. google.co.kr
500. mop.com
Markdown, however, needs to surround a typed URL with a < & a > to turn it automagically into a hyperlinked URL. To get that, I used this RegEx:
Find: (^[0-9]+\. )(.*)
Replace: \1<\2>
This is a bit more interesting, with the Find meaning the following:
- The first group, (^[0-9]+\. ), breaks down as follows:
  - ^ says to start at the beginning of the line.
  - [0-9]+ looks for one or more digits, thereby matching 1 through 500.
  - \. looks for a period. You have to preface the dot with a backslash, so your regex knows you’re actually looking for a period, as a dot by itself matches any character.
  - The space after \. matches the single space that follows the period.
- The second group, (.*) (and notice that there is a second group here!), means this:
  - . is intended to be a single dot, which matches any character. Domain names can contain letters (small & capital), numbers, dots, & hyphens, so the dot will work with all of those.
  - * says to look for zero or more instances of the previous item, so placing it right after the . will match any length of any characters (except for \n & \r, which indicate line feed and carriage return, respectively), thereby matching any of the domain names listed. Yes, it’s kind of silly to use a * here, as we know that we don’t have zero characters in any of the domain names, so it would have made more sense to use a +. I can only plead long habit.
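To see the difference between * & + in the second group, here is a small Python sketch. The sample lines are made up for illustration, & one detail differs from the editor: by default Python’s dot stops only at the line feed (\n), not at the carriage return, so the samples use \n line endings.

```python
import re

text = "1. google.com\n2. facebook.com"

# The dot does not cross the line feed, so each match stays on its own line.
star = re.findall(r"^[0-9]+\. (.*)", text, flags=re.MULTILINE)
plus = re.findall(r"^[0-9]+\. (.+)", text, flags=re.MULTILINE)
print(star, plus)

# A hypothetical line with nothing after the number shows where they differ:
# * accepts zero characters, + requires at least one.
empty_star = re.findall(r"^[0-9]+\. (.*)", "3. ", flags=re.MULTILINE)
empty_plus = re.findall(r"^[0-9]+\. (.+)", "3. ", flags=re.MULTILINE)
print(empty_star, empty_plus)
```

On real domain names the two patterns capture the same strings; only the empty line separates them.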
The Replace does the following:
- \1 is a backreference that matches every string found by (^[0-9]+\. ), which here means everything from 1. (note the space after the .) to 500. (again, note the space), and then re-inserts that string back in the Replace.
- Then an opening angle bracket < to start the process of making a typed URL into an active hyperlink in Markdown.
- \2 is a backreference to the strings matched by the second group (.*), here meaning all the domain names, which are then re-inserted back in the Replace.
- Finally, the closing angle bracket > to end the Markdown hyperlink.
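As a rough Python equivalent of this Find & Replace, the sketch below uses the same two groups & the same \1<\2> replacement. The three sample lines are taken from the list; re.MULTILINE is again an assumption needed to make ^ work line by line.

```python
import re

text = "1. google.com\n499. google.co.kr\n500. mop.com"

# Group 1 is the number, period & space; group 2 is the rest of the line.
linked = re.sub(r"(^[0-9]+\. )(.*)", r"\1<\2>", text, flags=re.MULTILINE)
print(linked)
```

Each backreference drops its captured text back into place, with the angle brackets added around the second one.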
At the end of this multi-step process, my 500 lines looked like this:
1. <google.com>
2. <facebook.com>
3. <youtube.com>
4. <yahoo.com>
5. <baidu.com>
6. <wikipedia.org>
7. <live.com>
8. <twitter.com>
9. <qq.com>
10. <amazon.com>
…
496. <webmd.com>
497. <rr.com>
498. <zillow.com>
499. <google.co.kr>
500. <mop.com>
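Put together, the three passes can be sketched as one small Python function. The name to_markdown_list is hypothetical, & the padded sample input is an assumption about how the pasted data was spaced.

```python
import re


def to_markdown_list(raw: str) -> str:
    """Apply the three find & replace passes in order (hypothetical helper)."""
    # Pass 1: add a period after the number at the start of each line.
    out = re.sub(r"(^[0-9]+)", r"\1.", raw, flags=re.MULTILINE)
    # Pass 2: collapse each run of spaces into a single space.
    out = re.sub(r"[ ]+", " ", out)
    # Pass 3: wrap the domain name in angle brackets.
    out = re.sub(r"(^[0-9]+\. )(.*)", r"\1<\2>", out, flags=re.MULTILINE)
    return out


sample = "1   google.com\n10  amazon.com\n500 mop.com"
result = to_markdown_list(sample)
print(result)
```

The order matters: the third pass expects exactly one space after the period, which the second pass guarantees.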
Done!
[1] I hardly ever use Excel. This is one of the very rare times. For small jobs, I really like Numbers, but Numbers would shriek in horror & faint at the thought of one million rows of data.