Turning a list of domains into a Markdown-formatted ordered list with regular expressions
I’m teaching a course called From Blogs to Wikis at Washington University in St. Louis. One of the things I like to show my students is a list of the 500 most popular websites in the world, so they can see how many of those sites (a) are related to social software, & (b) aren’t American. I grabbed the latest data, as I do every year, & massaged it with regular expressions so that it could go on my course website in Markdown format. It took me 3 passes to do it, which I’ve outlined here for others interested in regex … & the 500th most popular site on the Web!
I went to http://www.alexa.com/topsites/global & downloaded the CSV file containing today’s top million websites, as measured by Alexa. I then opened the CSV in Excel[1], selected both columns of the first 500 rows, & pressed Command+C to copy the data. I closed Excel, opened Sublime Text, & pressed Command+V to paste that data.
I now had 500 lines that looked like this:
1 google.com
2 facebook.com
3 youtube.com
4 yahoo.com
5 baidu.com
6 wikipedia.org
7 live.com
8 twitter.com
9 qq.com
10 amazon.com
…
496 webmd.com
497 rr.com
498 zillow.com
499 google.co.kr
500 mop.com
Remember, my ultimate goal is to have this list in Markdown, and Markdown needs a period after a number to turn it automagically into an ordered list. To get that, I used this RegEx:
Find: (^[0-9]+)
Replace: \1.
The Find means the following:
- ^ says to start at the beginning of the line.
- [0-9] says to look for a digit between 0 & 9.
- + says to look for 1 or more of the previous item, which in this case is a digit between 0 & 9. At this point, I’ve matched all the numbers in the list, from 1 to 500 (without the +, my regex would match only the first digit of each number).
- ( & ) around everything so far group the matched string together, so it can be backreferenced & re-used.
In the Replace, \1 is a backreference to the strings matched inside the group denoted by the ( & ) of the Find. It’s 1 because it matches the first (…). You’ll see what to do with more than one group in just a moment.
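For anyone who would rather script this pass than run it in an editor, here is a minimal sketch in Python. The sample rows & the variable names are made up for illustration, & it assumes Python’s re module, where the MULTILINE flag is needed for ^ to match at the start of every line rather than only at the start of the whole string.

```python
import re

# A few sample rows in the "rank domain" shape shown above (illustrative data only).
sample = "1 google.com\n2 facebook.com\n10 amazon.com\n500 mop.com"

# re.MULTILINE lets ^ anchor at the start of each line, as it does in the editor.
# \1 in the replacement re-inserts whatever the first (and only) group matched.
numbered = re.sub(r"(^[0-9]+)", r"\1.", sample, flags=re.MULTILINE)

print(numbered)
```

Because the pattern is anchored with ^, digits that appear later in a line are left alone.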
After my find & replace, the 500 lines now looked like this:
1. google.com
2. facebook.com
3. youtube.com
4. yahoo.com
5. baidu.com
6. wikipedia.org
7. live.com
8. twitter.com
9. qq.com
10. amazon.com
…
496. webmd.com
497. rr.com
498. zillow.com
499. google.co.kr
500. mop.com
It’s nice how the names all line up, but I didn’t want those extra spaces after the numbers, so I used this regex (to make it obvious, I wrote out SPACE where you’re supposed to enter a single space):
Find: [SPACE]+
Replace: SPACE
By now, it should be obvious that [SPACE]+ matches one or more spaces, which I am then replacing with a single space. Easy.
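The same pass can be sketched in Python. The padded sample below is invented to stand in for the aligned list, & the names are hypothetical. Note that a pattern of literal spaces only matches spaces; if the padding in your own data turns out to be tabs, a class such as [ \t]+ would be needed instead.

```python
import re

# Illustrative input: ranks followed by a varying number of spaces.
padded = "1.   google.com\n10.  amazon.com\n500. mop.com"

# " +" matches one or more literal spaces; each run is replaced by a single space.
collapsed = re.sub(r" +", " ", padded)

print(collapsed)
```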
The 500 lines now looked like this:
1. google.com
2. facebook.com
3. youtube.com
4. yahoo.com
5. baidu.com
6. wikipedia.org
7. live.com
8. twitter.com
9. qq.com
10. amazon.com
…
496. webmd.com
497. rr.com
498. zillow.com
499. google.co.kr
500. mop.com
Markdown, however, needs to surround a typed URL with a < & a > to turn it automagically into a hyperlinked URL. To get that, I used this RegEx:
Find: (^[0-9]+\. )(.*)
Replace: \1<\2>
This is a bit more interesting, with the Find meaning the following:
- The first group, (^[0-9]+\. ), breaks down as follows:
  - ^ says to start at the beginning of the line.
  - [0-9]+ looks for one or more digits, thereby matching 1 through 500.
  - \. looks for a period. You have to preface the dot with a backslash, so your regex knows you’re actually looking for a period, as a dot by itself matches any character.
  - The space after \. matches the single space that separates the period from the domain name.
- The second group, (.*), means this (and notice that there is a second group here!):
  - . is intended to be a single dot, which matches any character. Domain names can contain letters (small & capital), numbers, dots, & hyphens, so the dot will work with all of those.
  - * says to look for zero or more instances of the previous item, so placing it right after the . will match any length of any characters (except for \n & \r, which indicate line feed & carriage return, respectively), thereby matching any of the domain names listed. Yes, it’s kind of silly to use a * here, as we know that we don’t have zero characters in any of the domain names, so it would have made more sense to use a +. I can only plead long habit.
The Replace does the following:
- \1 is a backreference that matches every string found by (^[0-9]+\. ), which here means everything from 1. (note the space after the .) to 500. (again, note the space), and then re-inserts that string back in the Replace.
- Then an opening angle bracket < to start the process of making a typed URL into an active hyperlink in Markdown.
- \2 is a backreference to the strings matched by the second group (.*), here meaning all the domain names, which are then re-inserted back in the Replace.
- Finally, the closing angle bracket > to end the Markdown hyperlink.
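To see the two groups & their backreferences at work outside an editor, here is a small Python sketch of this third pass. The sample rows & variable names are illustrative only, & the MULTILINE flag is again assumed so that ^ matches at the start of each line.

```python
import re

# Illustrative input: the list as it stands after the first two passes.
numbered = "1. google.com\n499. google.co.kr\n500. mop.com"

# Group 1 captures the rank, the period, and the space; group 2 captures the rest of the line.
# The replacement puts group 1 back unchanged and wraps group 2 in angle brackets.
linked = re.sub(r"(^[0-9]+\. )(.*)", r"\1<\2>", numbered, flags=re.MULTILINE)

print(linked)
```

The dot does not cross a line feed by default in Python’s re module, so each match stops at the end of its own line.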
At the end of this multi-step process, my 500 lines looked like this:
1. <google.com>
2. <facebook.com>
3. <youtube.com>
4. <yahoo.com>
5. <baidu.com>
6. <wikipedia.org>
7. <live.com>
8. <twitter.com>
9. <qq.com>
10. <amazon.com>
…
496. <webmd.com>
497. <rr.com>
498. <zillow.com>
499. <google.co.kr>
500. <mop.com>
Done!
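The three passes could also be folded into one. This is not what I did above, just a sketch of an alternative, written in Python with a hypothetical helper named to_markdown_list. It assumes each input line is a rank, then one or more spaces or tabs, then a domain, & it uses + rather than * for the domain, since a domain is never empty.

```python
import re


def to_markdown_list(rows: str) -> str:
    """Turn 'rank<spaces or tabs>domain' lines into '<rank>. <domain>' Markdown items."""
    # Group 1 is the rank, group 2 is the domain; the whitespace between them is
    # matched but not captured, so it is replaced by the literal ". " below.
    return re.sub(r"^([0-9]+)[ \t]+(.+)$", r"\1. <\2>", rows, flags=re.MULTILINE)


# Illustrative rows with a tab, a single space, and several spaces as separators.
rows = "1\tgoogle.com\n2 facebook.com\n500   mop.com"
print(to_markdown_list(rows))
```

Doing it in separate passes is easier to follow & to debug; the single pattern is shorter but asks more of the reader.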
[1] I hardly ever use Excel. This is one of the very rare times. For small jobs, I really like Numbers, but Numbers would shriek in horror & faint at the thought of one million rows of data.