Caching in CouchDB

Akshat Jiwan Sharma, Mon Oct 06 2014

I remember reading a quote a while ago by someone somewhere that said something like

There are only two hard things in Computer Science: compilers and cache invalidation.

I might be off by one on this one, as I read this a long time ago, but I am pretty sure that one of the hard things was cache invalidation.

"How is this quote relevant to our discussion?"

Err... I thought I might put something up there since we will be talking about caching in this article. To be more specific, we will be talking about caching in CouchDB.

Let's start with something simple. A document.

A document in CouchDB is simply a JSON object. For example:

{
  "name": "Rayman",
  "roll_number": 23
}

When this object is saved, CouchDB adds two special fields to it and the document becomes:

{
  "_id": "12e4r75869098567",
  "_rev": "1-e234588769829",
  "name": "Rayman",
  "roll_number": 23
}

These two special fields are the _id and the _rev fields. The _id field is a unique identifier for a document. As you might expect, it has to be unique for every document stored within a database. The _rev field is the one we are more interested in at the moment. It denotes a specific revision of a CouchDB document: every time you update a document, its _rev field is updated as well. Older revisions can still be fetched by querying with the old _rev, at least until the database is compacted. In that sense CouchDB documents are versioned.

"Cool. But what does this have to do with caching?"

When you query a CouchDB document using the HTTP API, it sends the _rev as an etag. This makes sense because the versioned content of a document is uniquely identified by its _rev field. When a browser sends an If-None-Match header to CouchDB, it is matched against the current _rev of the document. If the two match, the document has not changed and CouchDB can safely send a 304 Not Modified status code to the browser.

So document-level caching in CouchDB is handled by etags, which are just the _rev field of the document. Pretty simple so far.
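The revision check described above can be sketched in a few lines of Python. This is only a simulation of the decision, not CouchDB's code: the function names doc_etag and conditional_get are made up for illustration, and the only assumption about CouchDB itself is that the etag is the _rev wrapped in double quotes.

```python
def doc_etag(doc):
    """ETag for a document: its _rev wrapped in double quotes."""
    return '"%s"' % doc["_rev"]


def conditional_get(doc, if_none_match=None):
    """Simulate a GET on a document with an optional If-None-Match header.

    Returns (status, headers, body). The body is None on a 304.
    """
    etag = doc_etag(doc)
    if if_none_match == etag:
        # The client's cached copy is still the current revision.
        return 304, {"ETag": etag}, None
    return 200, {"ETag": etag}, doc


doc = {
    "_id": "12e4r75869098567",
    "_rev": "1-e234588769829",
    "name": "Rayman",
    "roll_number": 23,
}

# First request: no cached copy yet, so the full document comes back.
status, headers, body = conditional_get(doc)

# Second request: the browser replays the etag it received.
status_again, _, _ = conditional_get(doc, if_none_match=headers["ETag"])
print(status, status_again)
```

Once the document is updated and its _rev moves on, the replayed etag no longer matches and the same request gets a 200 with the new revision.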

Views

Things get a bit complicated with views. There are two important points to keep in mind here :-

  1. On disk, the view indexes are stored in a file named after the md5 hash of the content of all the views in a design document. This means that if you change the content of any single view within a design document, the index will have to be rebuilt.

  2. The views are built and updated when you read from them.
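To see why editing one view touches the whole index, here is a rough Python sketch of point 1. It is an illustration under assumptions: CouchDB hashes its own internal representation of the views, not a JSON string, and index_signature and the sample design document are hypothetical.

```python
import hashlib
import json


def index_signature(design_doc):
    """Hash every view definition in the design document into one signature.

    Illustration only: the real signature is computed from CouchDB's
    internal representation, not from this JSON string.
    """
    views = design_doc.get("views", {})
    canonical = json.dumps(views, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Hypothetical design document with two views.
ddoc = {
    "_id": "_design/students",
    "views": {
        "by_name": {"map": "function (doc) { emit(doc.name, null); }"},
        "by_roll": {"map": "function (doc) { emit(doc.roll_number, null); }"},
    },
}

before = index_signature(ddoc)

# Change only one of the two views.
ddoc["views"]["by_roll"]["map"] = "function (doc) { emit(doc.roll_number, doc.name); }"
after = index_signature(ddoc)

# The shared signature changes, so the index file name changes with it.
print(before != after)
```

Because both views share one signature, by_name gets rebuilt along with by_roll even though its own definition never changed.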

Now that we have a solid understanding of views we can understand how etags are generated. Etags in views are generated by taking the md5 hash of a unique signature. What makes up a signature? Well, lots of things. Take a look at the following structure:

-record(mrargs, {
    view_type,
    reduce,

    preflight_fun,

    start_key,
    start_key_docid,
    end_key,
    end_key_docid,
    keys,

    direction = fwd,
    limit = 16#10000000,
    skip = 0,
    group_level = 0,
    stale = false,
    inclusive_end = true,
    include_docs = false,
    doc_options = [],
    update_seq=false,
    conflicts,
    callback,
    list,
    extra = []
}).

This is an Erlang record that makes up the gist of the view signature, which is then used to generate an etag. As a whole the structure looks a bit daunting. But if we take a look at the fields one at a time, most of them are recognizable. For instance:-

start_key,
end_key,
start_key_docid,
end_key_docid,
reduce,
include_docs,
keys,
limit,
skip,
group_level

are simply the query parameters of the CouchDB view API. So it's easy to deduce that the query parameters to the view are one of the factors that make up the signature. This is easily verifiable: query a view with different parameters and inspect the etags. They will all be different.

The other important thing that makes up the signature is the update seq and the purge seq of the database for which the view is being queried. Both can be checked by issuing a GET request to your database.

Conclusion

The etag of a view depends upon the query parameters, the update seq of the database and the purge seq of the database. If none of these three change, the etag won't change and CouchDB will respond with a 304 Not Modified status.
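The three ingredients can be put together in a small Python sketch. Treat it as a model of the idea rather than the real thing: view_etag is a hypothetical helper, the JSON encoding is my own choice, and the actual signature is built from the Erlang record shown earlier.

```python
import hashlib
import json


def view_etag(query_params, update_seq, purge_seq):
    """Model of a view etag: md5 over the query parameters plus both sequences.

    Hypothetical encoding. It only shows which inputs matter.
    """
    signature = json.dumps([sorted(query_params.items()), update_seq, purge_seq])
    return hashlib.md5(signature.encode("utf-8")).hexdigest()


first = view_etag({"limit": 10, "include_docs": True}, update_seq=42, purge_seq=0)

# Same parameters, same database state: the etag is stable, so a 304 is possible.
same = view_etag({"include_docs": True, "limit": 10}, update_seq=42, purge_seq=0)

# A different query parameter gives a different etag.
other_query = view_etag({"limit": 20, "include_docs": True}, update_seq=42, purge_seq=0)

# Any write to the database bumps update_seq and therefore the etag.
after_write = view_etag({"limit": 10, "include_docs": True}, update_seq=43, purge_seq=0)

print(first == same, first == other_query, first == after_write)
```

Note that the update seq belongs to the whole database, so in this model a write to a document that the view never emits still changes the etag.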

Lists

List functions are interesting because they can process the result of a view and push it out in a format that our client can consume. They take the result of a view as input, apply the transformation we want and return the result. So one would assume that the etag of a list depends upon the etag of the view. One would be right, to an extent.

The etag of a list function does indeed depend upon the signature of the underlying view. But there's more that it depends upon. Let me quote a small section from the CouchDB source code:-

 ETagFun = fun(BaseSig, Acc0) ->

   UserCtx = Req#httpd.user_ctx,
   Name = UserCtx#user_ctx.name,
   Roles = UserCtx#user_ctx.roles,
   Accept = couch_httpd:header_value(Req, "Accept"),
   Parts = {couch_httpd:doc_etag(DDoc), Accept, {Name, Roles}},
   ETag = couch_httpd:make_etag({BaseSig, Parts}),
.......Rest of the function continued

Besides the base signature, the etag function depends upon:

  1. The name of the user.
  2. The roles of the user.
  3. The Accept header (this means that, everything else being the same, a request that accepts JSON and one that accepts XML will get different etags).
  4. The etag of the design doc. This can be seen in the Parts tuple (the first field).

Conclusion

A resource from a list will be cached if the signature of the underlying view, the user name, the user roles, the Accept header and the etag of the design document from which the list is being queried stay the same. This has two important implications.

a) It does not matter if the content returned by the list function is different. If the parameters that go into calculating the etag do not change, the etag won't change. In other words, your resource would be served from cache even if the content returned is different.

b) Since list functions can be called from a different design document than the view, the etag of the list also depends upon the etag of the design document.
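Implication a) is easy to miss, so here is a Python sketch that mirrors the shape of the Erlang excerpt. The names md5_of and list_etag are hypothetical, and the encoding is JSON instead of Erlang terms. What matters is that the response body is nowhere among the inputs.

```python
import hashlib
import json


def md5_of(term):
    """Hypothetical stand-in for hashing a term: md5 of its JSON encoding."""
    return hashlib.md5(json.dumps(term, sort_keys=True).encode("utf-8")).hexdigest()


def list_etag(base_sig, ddoc_etag, accept, name, roles):
    """Model of the list etag: the view signature plus the Parts tuple."""
    parts = [ddoc_etag, accept, [name, roles]]
    return md5_of([base_sig, parts])


# Made-up inputs for illustration.
etag_json = list_etag("view-sig-1", "3-ddocrev", "application/json", "rayman", ["student"])
etag_xml = list_etag("view-sig-1", "3-ddocrev", "application/xml", "rayman", ["student"])
etag_admin = list_etag("view-sig-1", "3-ddocrev", "application/json", "admin", ["_admin"])

# Same inputs give the same etag, whatever the list function decides to output.
etag_repeat = list_etag("view-sig-1", "3-ddocrev", "application/json", "rayman", ["student"])

print(etag_json == etag_repeat, etag_json == etag_xml, etag_json == etag_admin)
```

If the list function builds its output from something outside these inputs, say the current time, the etag stays put while the content moves.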

Show

If you have understood how the etags for the views and list are calculated then the show function is not too difficult to understand. I will quote the source:-

couch_httpd:make_etag({couch_httpd:doc_etag(DDoc), DocPart, Accept,
        {UserCtx#user_ctx.name, UserCtx#user_ctx.roles}, More}).

So the etag of the show function depends upon :-

  1. The etag of the design document.
  2. The etag of the underlying document against which the show function is run (DocPart).
  3. The Accept header.
  4. The user name and roles.
  5. More. This one does not matter. I think it is kept for a future feature where the etag for show might depend upon more parameters :)

So once more it does not really matter if the content returned by the show function is different from the content that originally generated the etag. All it cares about is whether these parameters changed or not.

How to serve dynamic content from show and list functions then?

As we have seen, list and show functions do not calculate the etag from the content of the response, which makes them unsuitable for serving dynamic content to the browser. In these cases I usually generate my own etag: an md5 hash of the original show/list etag plus an etag calculated from the content. This way, if neither the content nor the underlying etag changes, CouchDB will send a 304 response, but if either of them changes, a 200 status code along with the new etag will be sent.

For example:-

I have a show function that works on a given document called the account. This account document has a structure like so:


"base": "<html>....</html>",
"rent_contract_template": "<ul><li></li></ul>",
"rent_summary_template": "<h3><span></span> <span></span></h3>"

Don't mind the finer details. This account document simply holds the template strings. The base is the outer HTML, whereas rent_contract_template and rent_summary_template are the HTML fragments that will be appended after transforming them against a data set. Pretty standard stuff. My show function works against this account document.

But to this show function I also post the results from two list functions, which serve as the data set against which the templates will be computed. So how do I maintain an etag for this dynamic scenario? As we already know, the etag in show/list is not calculated from the content but from a set of predefined parameters. So I make my own etags. The ingredients:-

  1. The etag from the first list.
  2. The etag from the second list
  3. The etag from the show
  4. Chemical X!

are concatenated and their md5 hash is calculated. I send the resultant string to the browser as the final etag. Now if the etag from any one of my inputs (list 1, list 2 or the show) changes, the content returned by the template, and hence the final etag, will be different, and I can serve dynamic, cacheable content right from my show function.
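The recipe boils down to very little code. Below is a Python sketch of it, leaving out the secret ingredient. The names combined_etag and status_for are hypothetical, and so are the sample etag values. In a real setup the same few lines would live wherever the final response is assembled.

```python
import hashlib


def combined_etag(list1_etag, list2_etag, show_etag):
    """Concatenate the three upstream etags and take the md5 of the result."""
    joined = list1_etag + list2_etag + show_etag
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def status_for(request_etag, list1_etag, list2_etag, show_etag):
    """Return (status, etag): 304 if the browser's etag is still current, else 200."""
    current = combined_etag(list1_etag, list2_etag, show_etag)
    if request_etag == current:
        return 304, current
    return 200, current


# Made-up upstream etags for illustration.
status, etag = status_for(None, "list1-v1", "list2-v1", "show-v1")

# The browser replays the etag and nothing upstream has changed.
status_cached, _ = status_for(etag, "list1-v1", "list2-v1", "show-v1")

# The second list now reports a new etag, so the replayed one is stale.
status_stale, new_etag = status_for(etag, "list1-v1", "list2-v2", "show-v1")

print(status, status_cached, status_stale)
```

Plain concatenation is fine when every input has a fixed length, as md5 hex digests do. With variable-length inputs, a separator between the parts would keep two different combinations from joining into the same string.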

That's it for this post, folks. Want more? The official CouchDB blog is pretty cool. Also, there is this silly post that illustrates how you can build CABS with CouchDB.

Mambo Italiano!

