I’m teaching a course called From Blogs to Wikis at Washington University in St. Louis. One of the things I like to show my students is a list of the 500 most popular websites in the world, so they can see how many of them (a) are related to social software, & (b) aren’t American. I grabbed the latest data, as I do every year, & massaged it with regular expressions so that it could go on my course website in Markdown format. It took me 3 passes to do it, which I’ve outlined here for others interested in regex … & the 500th most popular site on the Web!

I went to http://www.alexa.com/topsites/global & downloaded the CSV file containing today’s top million websites, as measured by Alexa. I then opened the CSV in Excel[1], selected both columns of the first 500 rows, & pressed Command+C to copy the data. I closed Excel, opened Sublime Text, & pressed Command+V to paste that data.

I now had 500 lines that looked like this:

1   google.com
2   facebook.com
3   youtube.com
4   yahoo.com
5   baidu.com
6   wikipedia.org
7   live.com
8   twitter.com
9   qq.com
10  amazon.com
…
496 webmd.com
497 rr.com
498 zillow.com
499 google.co.kr
500 mop.com

Remember, my ultimate goal is to have this list in Markdown, and Markdown needs a period after a number to turn it automagically into an ordered list. To get that, I used this RegEx:

Find: (^[0-9]+)
Replace: \1.

The Find means the following:

  • ^ says to start at the beginning of the line.
  • [0-9] says to look for a single digit between 0 & 9.
  • + says to look for 1 or more of the previous item, which in this case is a digit between 0 & 9. At this point, I’ve matched all the numbers in the list, from 1 to 500 (without the +, my regex would match only the first digit of each number).
  • ( & ) around everything so far groups the matched string together, so it can be backreferenced & re-used.

In the Replace, \1 is a backreference to the strings matched inside the group denoted by the ( & ) of the Find. It’s 1 because it matches the first (…). You’ll see what to do with more than one group in just a moment.
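
If you would rather script this pass than run it in an editor, here is a minimal sketch in Python, assuming its standard re module and a three-line sample in place of the full list. The variable names are made up for illustration. Note that Python needs the re.MULTILINE flag for ^ to match at the start of every line.

```python
import re

# A small stand-in for the 500 pasted lines.
lines = "1   google.com\n10  amazon.com\n500 mop.com"

# re.MULTILINE makes ^ match at the start of each line,
# not only at the start of the whole string.
numbered = re.sub(r"(^[0-9]+)", r"\1.", lines, flags=re.MULTILINE)
print(numbered)

# Without the +, the group captures only the first digit of each
# number, so the period lands in the wrong place for 10 and 500.
first_digit_only = re.sub(r"(^[0-9])", r"\1.", lines, flags=re.MULTILINE)
print(first_digit_only)
```

The second substitution is there only to show why the + matters.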

After my find & replace, the 500 lines now looked like this:

1.   google.com
2.   facebook.com
3.   youtube.com
4.   yahoo.com
5.   baidu.com
6.   wikipedia.org
7.   live.com
8.   twitter.com
9.   qq.com
10.  amazon.com
…
496. webmd.com
497. rr.com
498. zillow.com
499. google.co.kr
500. mop.com

It’s nice how the names all line up, but I didn’t want those extra spaces after the numbers, so I used this regex (to make it obvious, I wrote out SPACE where you’re supposed to enter a single space):

Find: [SPACE]+
Replace: SPACE

By now, it should be obvious that [SPACE]+ matches one or more spaces, which I am then replacing with a single space. Easy.
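
The same pass can be sketched in Python, again assuming the re module and a short sample rather than all 500 lines. In code there is no need to write out SPACE, since a literal space inside the brackets works as is.

```python
import re

# Sample output from the first pass, with the leftover alignment spaces.
numbered = "1.   google.com\n10.  amazon.com\n500. mop.com"

# [ ]+ matches a run of one or more spaces; each run becomes one space.
collapsed = re.sub(r"[ ]+", " ", numbered)
print(collapsed)
```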

The 500 lines now looked like this:

1. google.com
2. facebook.com
3. youtube.com
4. yahoo.com
5. baidu.com
6. wikipedia.org
7. live.com
8. twitter.com
9. qq.com
10. amazon.com
…
496. webmd.com
497. rr.com
498. zillow.com
499. google.co.kr
500. mop.com

Markdown, however, needs to surround a typed URL with a < & a > to turn it automagically into a hyperlinked URL. To get that, I used this RegEx:

Find: (^[0-9]+\. )(.*)
Replace: \1<\2>

This is a bit more interesting, with the Find meaning the following:

  • The first group, (^[0-9]+\. ), breaks down as follows:

    • ^ says to start at the beginning of the line.
    • [0-9]+ looks for one or more numbers, thereby matching 1 through 500.
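    • \. looks for a period. You have to preface the dot with a backslash, so your regex knows you’re actually looking for a period, as a dot by itself matches any character.
    • The single space after \. matches the space that separates the period from the domain name, so that space ends up inside the first group.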
  • The second group, (.*)—and notice that there is a second group here!—means this:

    • . is intended to be a single dot, which matches any character. Domain names can contain letters (small & capital), numbers, dots, & hyphens, so the dot will work with all of those.
    • * says to look for zero or more instances of the previous item, so placing it right after the . will match any length of any characters (except for \n & \r, which indicate line feed and carriage return, respectively), thereby matching any of the domain names listed. Yes, it’s kind of silly to use a * here, as we know that we don’t have zero characters in any of the domain names, so it would have made more sense to use a +. I can only plead long habit.

The Replace does the following:

  • \1 is a backreference that matches every string found by (^[0-9]+\. ), which here means everything from 1. (note the space after the .) to 500. (again, note the space), and then re-inserts that string back in the Replace.
  • Then an opening angle bracket < to start the process of making a typed URL into an active hyperlink in Markdown.
  • \2 is a backreference to the strings matched by the second group (.*), here meaning all the domain names, which are then re-inserted back in the Replace.
  • Finally, the closing angle bracket > to end the Markdown hyperlink.
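
To see the two groups & their backreferences at work outside an editor, here is a small Python sketch, assuming the re module and a three-line sample. Python uses the same \1 & \2 syntax in the replacement string.

```python
import re

# Sample output from the second pass.
collapsed = "1. google.com\n10. amazon.com\n500. mop.com"

# Group 1 captures the number, the period, and the space;
# group 2 captures the rest of the line (the domain name).
linked = re.sub(r"(^[0-9]+\. )(.*)", r"\1<\2>", collapsed, flags=re.MULTILINE)
print(linked)
```

By default the dot in Python does not match a line feed, so (.*) stops at the end of each line.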

At the end of this multi-step process, my 500 lines looked like this:

1. <google.com>
2. <facebook.com>
3. <youtube.com>
4. <yahoo.com>
5. <baidu.com>
6. <wikipedia.org>
7. <live.com>
8. <twitter.com>
9. <qq.com>
10. <amazon.com>
…
496. <webmd.com>
497. <rr.com>
498. <zillow.com>
499. <google.co.kr>
500. <mop.com>

Done!
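
For anyone who wants to repeat the whole job in one go, the three passes can be chained in a few lines of Python. This is only a sketch: the function name to_markdown_list is made up for illustration, & the sample holds four lines instead of 500.

```python
import re


def to_markdown_list(text: str) -> str:
    """Apply the three find-and-replace passes in order."""
    # Pass 1: add a period after the number at the start of each line.
    text = re.sub(r"(^[0-9]+)", r"\1.", text, flags=re.MULTILINE)
    # Pass 2: collapse each run of spaces into a single space.
    text = re.sub(r"[ ]+", " ", text)
    # Pass 3: wrap the domain name in angle brackets.
    text = re.sub(r"(^[0-9]+\. )(.*)", r"\1<\2>", text, flags=re.MULTILINE)
    return text


sample = "1   google.com\n2   facebook.com\n499 google.co.kr\n500 mop.com"
result = to_markdown_list(sample)
print(result)
```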

  1. I hardly ever use Excel. This is one of the very rare times. For small jobs, I really like Numbers, but Numbers would shriek in horror & faint at the thought of one million rows of data.