Web server with URL fingerprinting out of the box : Havoc's Blog



by havoc

Years ago, when we built Mugshot, we stumbled on “URL fingerprinting”; Google describes it here. We used a “build stamp” (a continuously incrementing build number) instead of an MD5 hash, but it’s the same idea. My guess is that many web sites end up doing this. Owen implemented the feature by writing a custom Apache module.

(The idea, if you aren’t familiar with it, is to give your static files an effectively infinite cache expiration time, but change their URL whenever the resource changes. If you have a bunch of JavaScript or CSS or whatever, people won’t have to re-download it on every visit to the site.)

It seems like the major web servers should do this out of the box. For a directory of static files, the web server could:

Generate a fingerprint for each file
Generate URLs containing the fingerprints
Communicate the fingerprint-containing URL back to the app server for use in templates (for example by writing out a simple text file with the mapping from original to fingerprinted URLs)
Set the infinite-expiration headers properly on the fingerprinted URLs
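
As a rough sketch of the first three steps, plus a header value for the fourth, here is how a small Python script might do it. The function names, the `/static/` prefix, the JSON manifest format, and the choice of SHA-256 (rather than the MD5 or build stamp mentioned above) are all assumptions for illustration, not something any particular web server provides.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict

# One year is the conventional stand-in for "infinite" expiration.
FAR_FUTURE_HEADERS = {"Cache-Control": "public, max-age=31536000"}


def fingerprint(path: Path, length: int = 12) -> str:
    """Return a short hex digest of the file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:length]


def build_manifest(static_root: Path, url_prefix: str = "/static/") -> Dict[str, str]:
    """Map each original URL to a URL with the fingerprint in the file name."""
    manifest = {}
    for path in sorted(static_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(static_root)
        # css/site.css -> css/site.<fingerprint>.css
        stamped = rel.with_name(f"{rel.stem}.{fingerprint(path)}{rel.suffix}")
        manifest[url_prefix + rel.as_posix()] = url_prefix + stamped.as_posix()
    return manifest


def write_manifest(manifest: Dict[str, str], out_file: Path) -> None:
    """Write the mapping where the app server can read it."""
    out_file.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

The server would still need to map the fingerprinted URL back to the real file and attach `FAR_FUTURE_HEADERS` to those responses only; that part depends on the server and is left out here.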

It is not a huge deal to script this yourself, I guess, but do any web servers do this out of the box? Or maybe it’s a Varnish feature?

Ideally it’s dynamic, so if you change your static files the fingerprinted URLs automatically update.
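
On the app-server side, “dynamic” could be as simple as re-reading the mapping file whenever it changes. The sketch below assumes a JSON file mapping original URLs to fingerprinted ones; the `StaticUrls` class and its method names are hypothetical, and a real framework would wrap something like this in its own template tag.

```python
import json
from pathlib import Path


class StaticUrls:
    """Resolve unfingerprinted URLs through a manifest file.

    The manifest is re-read whenever its modification time changes,
    so templates pick up new fingerprints without a restart.
    """

    def __init__(self, manifest_path):
        self._path = Path(manifest_path)
        self._mtime = None
        self._mapping = {}

    def _reload_if_changed(self):
        mtime = self._path.stat().st_mtime_ns
        if mtime != self._mtime:
            self._mapping = json.loads(self._path.read_text())
            self._mtime = mtime

    def url(self, original):
        """Return the fingerprinted URL, or the original if it is unknown."""
        self._reload_if_changed()
        return self._mapping.get(original, original)
```

A template helper would then call `static.url("/static/css/site.css")` wherever it would otherwise hard-code the path.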



Published: December 20, 2010

Filed Under: Uncategorized

Tags: tech

12 Responses to “Web server with URL fingerprinting out of the box”

Simon says:

December 20, 2010 at 4:16 pm

Actually performing the fingerprinting is trivial – the challenge is that the URLs generated by the application need to include those fingerprints. And that can be a real pain in the neck…

- havoc says:
  
  December 20, 2010 at 4:28 pm
  
  Yeah, I mean all templates have to be updated. But in most frameworks there’s some way to do this generically so you have a tag where you give the unfingerprinted filename and it gets converted, or whatever. Then you have to use this tag for images/scripts/css.
  
  Annoying to retrofit to an existing codebase, but pretty easy to do in the first place if your web server and app server setup supported it out of the box… which is why they all ought to 😉
  
Alex Graveley says:

December 20, 2010 at 9:32 pm

Tornado does it by default.

- havoc says:
  
  December 20, 2010 at 11:46 pm
  
  An interesting note in their docs about one way to do it with nginx:
  http://www.tornadoweb.org/documentation#static-files-and-aggressive-file-caching
  
  location /static/ {
      root /var/friendfeed/static;
      if ($query_string) {
          expires max;
      }
  }
  
  This assumes their scheme where if a static resource has a query string, that query string means it’s fingerprinted, I guess.
  
Hugh says:

December 20, 2010 at 10:39 pm

The “extend cache” feature of mod_pagespeed seems to do this automatically. See http://code.google.com/speed/page-speed/docs/filter-cache-extend.html

The obvious disadvantage is that it has to parse the HTML as it is served, but on the scale of things, that probably not a major cost.

- Hugh says:
  
  December 20, 2010 at 10:41 pm
  
  s/that probably/that is probably/
  
- havoc says:
  
  December 20, 2010 at 11:47 pm
  
  kinda hacky, but cool.
  
James Henstridge says:

December 21, 2010 at 4:52 am

For static pages in Launchpad, we included the revision number of the main Launchpad branch in the URL of static pages. The revision number was in the path component of the URL rather than the query string to make it easy for static files to refer to each other using relative URLs without needing to know about the fingerprint.

We then used Apache mod_rewrite rules to map URLs with any revision number to the location where we rolled the code out to, and set appropriate expiration headers.

The aim wasn’t to provide multiple versions of the static files, but to let us easily form new URLs when putting out a new release.
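
A rough Python sketch of that scheme (not the actual Launchpad code or its mod_rewrite rules) might look like the following. The `/static/r<revision>/` layout and both function names are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical URL layout: /static/r<revision>/<path>
_REV_PREFIX = re.compile(r"^/static/r\d+/")


def versioned_url(path: str, revision: int) -> str:
    """Build a URL with the revision number in the path component."""
    return f"/static/r{revision}/{path.lstrip('/')}"


def strip_revision(url_path: str) -> str:
    """Map a URL carrying any revision number to the unversioned location."""
    return _REV_PREFIX.sub("/static/", url_path, count=1)


# A stylesheet can refer to an image with a plain relative URL and the
# revision carries over without the stylesheet knowing about it.
css_url = versioned_url("css/site.css", 1234)
logo_url = urljoin(css_url, "../img/logo.png")  # "/static/r1234/img/logo.png"
```

Because the stamp lives in the path, `strip_revision` accepts any revision number, which matches the goal of forming new URLs per release rather than serving several versions side by side.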

- havoc says:
  
  December 21, 2010 at 12:48 pm
  
  This is more like the Mugshot build stamp. The downside is that on every deploy, people have to re-download every file, even the unchanged ones, right? But there are some upsides too (no per-file stamp/fingerprint, can be a directory name).
  
Ole Laursen says:

December 21, 2010 at 7:15 am

I contemplated the write-a-tag-to-output-it approach; I think it would work fine for JS and CSS. But we had the same problem with images. So instead I wrote a middleware for Django that does everything automatically by regexp search & replace in the HTML. It appends the modification time of the files to the URLs as a query parameter. The code is here:

http://people.iola.dk/olau/python/modtimeurls.py

This way retrofitting it to a project is a one-line change. The only problem I’ve found so far is that some IE6-supporting PNG fixers think that PNGs should end with .png, not .png?_=blahblah.
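
For readers who want the general shape of the idea without following the link, here is a minimal standalone sketch. It is not the linked middleware: the function name, the `/static/` prefix, and the restriction to double-quoted `src`/`href` attributes are assumptions, and only the `?_=` parameter name is taken from the example above.

```python
import os
import re


def add_mtimes(html: str, static_root: str, url_prefix: str = "/static/") -> str:
    """Append each static file's modification time as a ?_= query parameter."""
    # Match src="..." or href="..." values under url_prefix that do not
    # already carry a query string or fragment.
    pattern = re.compile(
        r'\b(src|href)="(' + re.escape(url_prefix) + r'[^"?#]+)"'
    )

    def replace(match):
        attr, url = match.group(1), match.group(2)
        file_path = os.path.join(static_root, url[len(url_prefix):])
        try:
            mtime = int(os.path.getmtime(file_path))
        except OSError:
            # Unknown file: leave the URL untouched.
            return match.group(0)
        return f'{attr}="{url}?_={mtime}"'

    return pattern.sub(replace, html)
```

A real middleware would run something like this over each HTML response body and would also want to guard against `..` segments in the URL before touching the file system.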

- havoc says:
  
  December 21, 2010 at 12:47 pm
  
  Using the mod time is a great idea; a content hash is kind of overkill, no? It’s not like you’ll ever get a duplicate mod time in practice. Nice idea.
  
havoc says:

December 21, 2010 at 12:49 pm

Definitely getting the idea that 1) lots of people have rolled a solution to this problem, and 2) the solutions are all kinda hacky.

Which makes me think it really should be built in to some web servers and frameworks!
