Web server with URL fingerprinting out of the box
by havoc
Years ago when we built Mugshot we stumbled on “URL fingerprinting”; Google describes it here. We used a “build stamp” (a continuously incrementing build number) instead of an MD5 hash, but it’s the same idea. My guess is that many web sites end up doing this. Owen implemented the feature by writing a custom Apache module.
(The idea, if you aren’t familiar with it, is to give your static files an effectively infinite cache expiration time, but change their URL whenever the resource changes. If you have a bunch of JavaScript or CSS or whatever, people won’t have to re-download it on every visit to the site.)
It seems like the major web servers should do this out of the box. For a directory of static files, the web server could:
- Generate a fingerprint for each file
- Generate URLs containing the fingerprints
- Communicate the fingerprint-containing URL back to the app server for use in templates (for example by writing out a simple text file with the mapping from original to fingerprinted URLs)
- Set the infinite-expiration headers properly on the fingerprinted URLs
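To make the first three steps concrete, here is a minimal sketch in Python of what such a feature might do internally. It is not taken from any existing web server. The function names, the `/static/` prefix, the 12-character SHA-256 digest, and the JSON output format are all assumptions chosen for illustration (the post only suggests "a simple text file").

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path: Path, length: int = 12) -> str:
    """Return a short hex digest of the file's contents (hash choice is arbitrary)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:length]


def build_manifest(static_root: Path, url_prefix: str = "/static/") -> dict:
    """Map each original URL to a URL with the fingerprint in the file name."""
    manifest = {}
    for path in sorted(static_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(static_root)
        # css/site.css -> css/site.<fingerprint>.css
        stamped = rel.with_name(f"{rel.stem}.{fingerprint(path)}{rel.suffix}")
        manifest[url_prefix + rel.as_posix()] = url_prefix + stamped.as_posix()
    return manifest


def write_manifest(manifest: dict, out_file: Path) -> None:
    """Write the mapping as JSON so an app server can load it."""
    out_file.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

The web server would still have to map each fingerprinted name back to the real file and set the far-future expiration headers on it; that part is left out here.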
It is not a huge deal to script this yourself I guess, but do any web servers do this out of the box? Or maybe it’s a Varnish feature?
Ideally it’s dynamic, so if you change your static files the fingerprinted URLs automatically update.
Actually performing the fingerprinting is trivial – the challenge is that the URLs generated by the application need to include those fingerprints. And that can be a real pain in the neck…
Yeah, I mean all templates have to be updated. But in most frameworks there’s some way to do this generically so you have a tag where you give the unfingerprinted filename and it gets converted, or whatever. Then you have to use this tag for images/scripts/css.
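As a rough illustration of such a tag, the helper below looks up the fingerprinted URL for an unfingerprinted one. It assumes the mapping lives in a JSON file of original-to-fingerprinted URLs; the `StaticUrls` class, its method names, and the file format are hypothetical and not part of any particular framework.

```python
import json
from pathlib import Path


class StaticUrls:
    """Resolve plain static URLs to fingerprinted ones using a JSON mapping."""

    def __init__(self, manifest_file: Path):
        self._manifest_file = manifest_file
        self._mtime = None
        self._mapping = {}

    def _reload_if_changed(self) -> None:
        # Re-read the mapping when the file's modification time changes,
        # so updated fingerprints are picked up without a restart.
        mtime = self._manifest_file.stat().st_mtime_ns
        if mtime != self._mtime:
            self._mapping = json.loads(self._manifest_file.read_text())
            self._mtime = mtime

    def url(self, plain_url: str) -> str:
        """Return the fingerprinted URL, or the plain URL if none is known."""
        self._reload_if_changed()
        return self._mapping.get(plain_url, plain_url)
```

A framework would expose `url` as a template function or tag, so a template would write something like `static_urls.url("/static/css/site.css")` instead of the bare path.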
Annoying to retrofit to an existing codebase, but pretty easy to do in the first place if your web server and app server setup supported it out of the box… which is why they all ought to 😉
Tornado does it by default.
An interesting note in their docs about one way to do it with nginx:
http://www.tornadoweb.org/documentation#static-files-and-aggressive-file-caching
location /static/ {
    root /var/friendfeed/static;
    if ($query_string) {
        expires max;
    }
}
This assumes their scheme where if a static resource has a query string, that query string means it’s fingerprinted, I guess.
The “extend cache” feature of mod_pagespeed seems to do this automatically. See http://code.google.com/speed/page-speed/docs/filter-cache-extend.html
The obvious disadvantage is that it has to parse the HTML as it is served, but on the scale of things, that is probably not a major cost.
kinda hacky, but cool.
For static pages in Launchpad, we included the revision number of the main Launchpad branch in the URL of static pages. The revision number was in the path component of the URL rather than the query string to make it easy for static files to refer to each other using relative URLs without needing to know about the fingerprint.
We then used Apache mod_rewrite rules to map URLs with any revision number to the location where we rolled the code out to, and set appropriate expiration headers.
The aim wasn’t to provide multiple versions of the static files, but to let us easily form new URLs when putting out a new release.
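The revision-in-the-path scheme can be sketched in Python roughly as follows. This is not Launchpad's actual configuration (which used Apache mod_rewrite); the `/static/<revision>/` layout, the function names, and the one-year `Cache-Control` value are assumptions for illustration.

```python
import re

# Hypothetical URL layout: /static/<revision>/<path to file>
_REVISIONED = re.compile(r"^/static/\d+/(?P<rest>.+)$")


def static_url(path: str, revision: int) -> str:
    """Build a URL with the revision number as a path component."""
    return f"/static/{revision}/{path.lstrip('/')}"


def resolve(url_path: str):
    """Map a URL with any revision number to (relative file path, headers).

    Returns None if the URL does not follow the revisioned layout.
    """
    match = _REVISIONED.match(url_path)
    if match is None:
        return None
    rest = match.group("rest")
    if ".." in rest.split("/"):
        return None  # refuse paths that try to escape the static root
    headers = {"Cache-Control": "public, max-age=31536000"}
    return rest, headers
```

Because the stamp is a directory rather than a query string, a relative reference such as `../img/logo.png` inside `/static/1234/css/site.css` resolves to `/static/1234/img/logo.png`, so the stylesheet never needs to know the revision.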
This is more like the Mugshot build stamp. The downside is that on every deploy, people have to re-download every file, even the unchanged ones, right? But there are some upsides too (no per-file stamp/fingerprint, and the stamp can be a directory name).
I contemplated the write-a-tag-to-output-it approach; I think it would work fine for JS and CSS. But we had the same problem with images. So instead I wrote a middleware for Django that does everything automatically by regexp search & replace in the HTML. It appends the modification time of the files to the URLs as a query parameter. The code is here:
http://people.iola.dk/olau/python/modtimeurls.py
This way retrofitting it to a project is a one-line change. The only problem I’ve found so far is that some IE6-supporting PNG fixers think that PNGs should end with .png, not .png?_=blahblah.
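For readers who want the gist without following the link, here is a minimal sketch of the regexp search-and-replace idea. It is not the code from modtimeurls.py and is not wired into Django; the function name, the `/static/` prefix, and the restriction to `src` and `href` attributes are assumptions. Only the `?_=` parameter name follows the comment above.

```python
import os
import re


def add_mod_times(html: str, static_root: str, prefix: str = "/static/") -> str:
    """Append each static file's modification time to its URL as ?_=<mtime>."""
    # Match src="..." or href="..." values under the prefix that do not
    # already carry a query string or fragment.
    pattern = re.compile(
        r'(\b(?:src|href)=")(' + re.escape(prefix) + r'[^"?#]+)(")'
    )

    def stamp(match):
        url = match.group(2)
        file_path = os.path.join(static_root, url[len(prefix):])
        try:
            mtime = int(os.path.getmtime(file_path))
        except OSError:
            return match.group(0)  # leave URLs of missing files untouched
        return f"{match.group(1)}{url}?_={mtime}{match.group(3)}"

    return pattern.sub(stamp, html)
```

A middleware would call something like this on each outgoing HTML response body, which is what makes retrofitting cheap: the templates themselves stay unchanged.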
Using the mod time is a great idea; a content hash is kind of overkill, no? It’s not like you’ll ever get a duplicate mod time in practice. Nice idea.
Definitely getting the idea that 1) lots of people have rolled a solution to this problem 2) the solutions are all kinda hacky.
Which makes me think it really should be built in to some web servers and frameworks!