info|mjb

Rolling Your Own URL Shortener

Given that existing tools already shorten URLs, scan for malware, track statistics, et cetera, why would one want to roll their own? Because it is fun, and pretty darned easy. There are essentially four things to be done in order to make a URL shortener:

Step One

Acquire a short domain name, preferably one that demonstrates your clever branding skills, like goo.gl. Then point the domain at your favourite web server, such as nginx or Apache.

Step Two

Set up a database to hold your URLs. The database paradigm is not important, so long as you have a means of generating a unique integer index for each row/document. The minimum data you will need to store is a unique integer, the full long URL, and the relative short URL.
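As a minimal sketch of such a table, assuming SQLite (the post does not name a database), the three fields could look like this. The table name `urls` and the column names are made up for illustration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical `urls` table if it does not already exist."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS urls (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique integer index
            long_url  TEXT NOT NULL UNIQUE,               -- the full long URL
            short_url TEXT UNIQUE                         -- the relative short URL
        )
        """
    )
    conn.commit()


# An in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The `short_url` column is left nullable here because the short URL is derived from the row's integer index, which is only known once the row has been inserted.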

Step Three

Write a front controller to handle redirects. This is a very simple script, so I wrote mine by hand (no frameworks/libraries) for optimal performance. The script needs to detect the requested URI, look it up in the table (it is the short URL field/attribute), retrieve the corresponding long URL, and forward the user to the long URL. If the URL is not found, have the controller forward the user to the management console, or the "404" of your choice.
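The redirect logic might be sketched as a bare WSGI application. This is not the script described above, just one way to express the same steps with Python's standard library. The `urls` table, its column names, the `/manage` fallback path, and the choice of a 302 status are all assumptions.

```python
import sqlite3

FALLBACK_URL = "/manage"  # hypothetical location of the management console


def make_app(conn: sqlite3.Connection):
    """Build a WSGI app that redirects short URLs found via `conn`."""

    def app(environ, start_response):
        # The requested path, minus its leading slash, is the short URL.
        short = environ.get("PATH_INFO", "").lstrip("/")
        row = conn.execute(
            "SELECT long_url FROM urls WHERE short_url = ?", (short,)
        ).fetchone()
        # Unknown short URLs are forwarded to the fallback location.
        target = row[0] if row else FALLBACK_URL
        start_response("302 Found", [("Location", target)])
        return [b""]

    return app
```

For local experiments, an app like this can be served with `wsgiref.simple_server.make_server("", 8000, make_app(conn))` followed by `serve_forever()`.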

Step Four

Write management software for your table/collection. At a minimum, this software should take a long URL, determine whether it is already in the table/collection, and insert it if it is not. To determine the short URL, simply take the unique integer index from the row/document and convert it to base62. Done!
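A rough sketch of that "look up, else insert and convert" step follows, again assuming SQLite and a hypothetical `urls` table with `id`, `long_url`, and `short_url` columns. The function names `base62` and `shorten` are illustrative, and the digit order is the one listed in Appendix II.

```python
import sqlite3

# Digit order from Appendix II: 0-9, then a-z, then A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62(n: int) -> str:
    """Convert a non-negative integer to its base62 representation."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def shorten(conn: sqlite3.Connection, long_url: str) -> str:
    """Return the short URL for `long_url`, inserting a row if needed."""
    row = conn.execute(
        "SELECT short_url FROM urls WHERE long_url = ?", (long_url,)
    ).fetchone()
    if row:
        return row[0]  # already in the table
    cur = conn.execute("INSERT INTO urls (long_url) VALUES (?)", (long_url,))
    short = base62(cur.lastrowid)  # derive the short URL from the new index
    conn.execute(
        "UPDATE urls SET short_url = ? WHERE id = ?", (short, cur.lastrowid)
    )
    conn.commit()
    return short
```

Calling `shorten` twice with the same long URL returns the same short URL, since the second call finds the existing row.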

Appendix I: Things To Do

  • Verify the person adding URLs is a human and/or
  • Add an authentication mechanism to the management software
  • Implement a caching layer on the front controller
  • Make the management software user friendly
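For the caching item, one possible sketch is to wrap whatever function looks up a short URL in an in-process LRU cache from the standard library. The `lookup` callable and the helper name `make_cached_lookup` are assumptions. A real deployment might prefer an external cache.

```python
from functools import lru_cache


def make_cached_lookup(lookup, maxsize=1024):
    """Wrap a lookup(short_url) -> long URL (or None) function in an LRU cache."""
    return lru_cache(maxsize=maxsize)(lookup)
```

Note that this caches misses too, so a short URL that was requested before it existed keeps resolving to nothing until the wrapper's `cache_clear()` is called.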

Appendix II: Why Base62?

If the digits 0,1,2,3,4,5,6,7,8,9 are base10, the digits 0,1 are base2, and the digits and letters 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f are base16, then base62 is a number system made up of the digits and letters:

0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,A,B,C,
D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z

In this system, numbers would have the following values, with decimal on the left and base62 on the right. The ... is an indicator for the reader to fill in the next several numbers in the series.

  • 1==1
  • 2==2
  • 3==3
  • ...
  • 9==9
  • 10==a
  • 11==b
  • ...
  • 60==Y
  • 61==Z
  • 62==10
  • 63==11
  • ...
  • 100==1C
  • 101==1D
  • ...
  • 1000==g8
  • 1001==g9
  • ...
  • 1000000==4c92

What this lets us do is take the unique integer index from each row/document, and make a unique alphanumeric combination that represents our short URL.

I recently made a short URL at Google that was “5GSFR”. If Google is using base62 with the digit order above, that number would be 84,101,627. That is a lot of short URLs.
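The arithmetic can be checked with a small decoder. This sketch assumes the digit order listed earlier in this appendix (0-9, then a-z, then A-Z). Whether Google actually uses that order, or base62 at all, is not known here. The function name `base62_decode` is illustrative.

```python
# Digit order from this appendix: 0-9, then a-z, then A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_decode(text: str) -> int:
    """Convert a base62 string back to an integer."""
    value = 0
    for ch in text:
        # str.index raises ValueError for characters outside the alphabet.
        value = value * 62 + ALPHABET.index(ch)
    return value


print(base62_decode("5GSFR"))  # 84101627
```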

Why not use another base? After all, there are other non-alphanumeric characters that are URL safe. From what I understand, they are not necessarily safe in URLs sent over SMS, and base62 is “good enough”.

Addendum

This post was originally shared to Google+. The author has deliberately opted to not use the Google+ embed feature.