Chicken Soup for the Caching Soul

September 12th, 2011 Steve Ayers, Consultant

When last we saw our hero, he was riding off into the sunset with Hibernate, his newfound love. The blogosphere was all abuzz. Would this new relationship blossom into something so well-known, tabloids would invent a new name for it? (HibeAyersnate? JBAyers?). Or would the union crash and burn, a steaming wreck amidst so many Hollywood romances? (I’m looking at you, J-Lo).

Well, neither really, faithful reader. Our hero is neither smitten nor inflamed with that finicky ORM. There are still times when he hears of Hibernate arrogantly committing data merely because it felt like it, while other times, he simply joins two tables through a simple line in XML. It’s a slippery slope. Who said true love was easy?

But, friends, today, we gather to praise Hibernate, not to bury it. For today, we will discuss in detail one of the characteristics which truly makes Hibernate a worthwhile comrade. That something is Caching.

Hibernate operates on three levels of caching: First-Level (or Session), Second-Level (or Object), and Query Caching. Each has its own jurisdiction and each has its own advantages. They even work together to provide a powerful and flexible approach to application performance. So, today, we’ll go through each to paint a clearer picture of an area in which Hibernate shines.

But, the Intergoogles is rife with information about caching: how to configure it, how it works, why it’s awesome. I won’t belabor the point. Instead, I will focus on what you need to know about the three levels of caching. I’ll illustrate the Gotchas, the AHA!’s, and the Sha Na Na’s.

I will assume that by now you know what caching is and why it is important. If not, here’s a sentence you can cut out and keep in your wallet or purse the next time your grandmother asks you to explain the benefits of caching (right after she figures out her answering machine):

‘Caching is the process of storing frequently-accessed information in a separate area of memory that is easily and quickly accessible so that subsequent reads of the same information will take less time’

There, Gramma, that is caching. Caching. CACHING! No, I didn’t sneeze, gramma. No, I’m not sick. [sigh] Yes, I’ll have the chicken soup.

Anyway, now you know what caching is. So, why should you care?

FIRST LEVEL CACHE

First-Level caching in Hibernate is otherwise known as the session cache. What this means is that Hibernate maintains a separate cache of objects as long as the same session is open.

As I mentioned in my last verbose and rambling blog post, this means that:

session.load(MyObject.class, 123);
// Other logic here
session.load(MyObject.class, 123);

will return the same object both times, but hit the database at most once. Hibernate’s first-level cache realizes that you’ve already asked for this information in this session, so it can merely hand back what you retrieved previously. Since this is the same session, surely nothing has changed. So, why make yet another round trip to the database only to return information you already have?
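To make that concrete, here is a minimal sketch of a session-scoped cache in plain Java. It is a toy model and not Hibernate’s actual implementation: the ToySession class, its key format, and the hit counter are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a session-scoped (first-level) cache; not Hibernate's real implementation. */
class ToySession {
    // Lives only as long as this session object does.
    private final Map<String, Object> cache = new HashMap<>();
    private int databaseHits = 0;

    /** Returns the instance this session already loaded, or "queries" exactly once. */
    Object load(Class<?> type, long id) {
        String key = type.getName() + "#" + id;
        return cache.computeIfAbsent(key, k -> {
            databaseHits++;       // stands in for a SELECT
            return new Object();  // stands in for the loaded entity
        });
    }

    /** Empties the cache, so the next load goes back to the "database". */
    void clear() {
        cache.clear();
    }

    int databaseHits() {
        return databaseHits;
    }

    public static void main(String[] args) {
        ToySession session = new ToySession();
        Object first = session.load(String.class, 123);
        Object second = session.load(String.class, 123);
        System.out.println(first == second);         // true
        System.out.println(session.databaseHits());  // 1
    }
}
```

Two loads of the same ID return the very same instance and cost one simulated query. A second ToySession would start with an empty map, which is the whole point of a cache scoped to the session.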

But, this is oversimplifying the matter, a practice that is commonplace on the internet when looking for guidance in technology. ‘Just drop the JAR in your classpath and it all just works!’. Sure it does, bud. You forgot to mention I need nine other JARs, three XML files, and the moon must be a waxing gibbous.

To truly understand the first level cache, it is important to understand the duration and lifecycle of a Hibernate session. Grasping this is critical because it leads straight to a question that is integral to all caching approaches: cache invalidation. You need to invalidate data in the cache when the real data in the database changes. It doesn’t take a rocket scientist to see that even though we are operating within our own session, there could be a multitude of other sessions alive, all of which could be updating the very data sitting in your precious session.

So, how long is this session alive anyway? The longer it stays open, the greater the odds of stale data in the cache. One would hope the session doesn’t live for very long. Good news, it doesn’t.

There is a pattern that Hibernate suggests to make efficient use of the Session Cache called Open Session In View. In a nutshell, this means keeping the session open just long enough to render the view of the JSP. The session is instantiated upon the HTTP request and then destroyed when the view is fully rendered. Not bad, right? So, now that we know how the session cache works and how long the session lives, let’s look at some things to keep in mind:

  • The Session Cache has no built-in invalidation when data changes

    So, even though it is short-lived, it is still possible that another session has modified the data between the time you first requested it and the time you requested it again. To make matters worse, Open Session In View actually keeps the session open LONGER than usual.

  • It could be YOU updating the data behind the scenes

    Wait, what? I’m READING table ABC, but I’m UPDATING XYZ.
    True, smart guy, but remember one of the banes of every developer’s existence: legacy databases. Some of these databases contain business logic nestled deep within the friendly confines of those normalized monsters. So while you think you’re innocently updating table XYZ, there are triggers or procedures going on behind the scenes that are changing table ABC right out from under your nose. I’ve seen it happen.

    [Figure: session-level-cache1]

    So, as you can see, a simple update to the BAZBING table caused the evil, repugnant database business logic to fire off some legacy code which updated the FOOBAR table. Now, your cache says you have five FooBars, when in reality, you just inadvertently removed one.
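Here is a rough sketch of that scenario in plain Java. Everything in it is hypothetical: the StaleSessionDemo class, the row counts, and the "trigger" are stand-ins I made up to mirror the BAZBING/FOOBAR example, not real Hibernate or database code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a session cache going stale; table names follow the BAZBING/FOOBAR example. */
class StaleSessionDemo {
    private final Map<String, Integer> database = new HashMap<>();      // table -> row count
    private final Map<String, Integer> sessionCache = new HashMap<>();  // what this session has read

    StaleSessionDemo() {
        database.put("FOOBAR", 5);
        database.put("BAZBING", 1);
    }

    /** Reads through the session cache, so a repeated read never reaches the database. */
    int countRows(String table) {
        return sessionCache.computeIfAbsent(table, database::get);
    }

    /** Our "innocent" update; the hypothetical legacy trigger removes a FOOBAR row. */
    void updateBazBing() {
        database.merge("FOOBAR", -1, Integer::sum);
    }

    /** Drops one cached entry so the next read goes back to the database. */
    void evict(String table) {
        sessionCache.remove(table);
    }

    public static void main(String[] args) {
        StaleSessionDemo demo = new StaleSessionDemo();
        System.out.println(demo.countRows("FOOBAR")); // 5
        demo.updateBazBing();
        System.out.println(demo.countRows("FOOBAR")); // 5 (stale; the database now says 4)
        demo.evict("FOOBAR");
        System.out.println(demo.countRows("FOOBAR")); // 4
    }
}
```

In a real application, the rough equivalents of evict here are Hibernate’s session.refresh(object), session.evict(object), and session.clear(), which force the next read of that data to go back to the database.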

  • Isolation Level and Performance

    Part of administering the Hibernate session involves tying it to a transaction manager. However, one thing to keep in mind is that the behavior of the session cache depends on the isolation level used by the underlying connection. The more conservative the isolation level, the greater the impact on performance, since locking becomes more likely and more costly.

    For example, using an isolation level of SERIALIZABLE (which is the default on some databases) prevents the standard read anomalies (dirty reads, non-repeatable reads, and phantom reads), typically through heavy locking or by aborting conflicting transactions. As a result, performance will be greatly impacted. Using a less strict isolation level such as READ UNCOMMITTED will improve performance, but will in turn open up your database transactions to see uncommitted changes. It is a difficult balance sometimes between performance and protection.
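As a cheat sheet, here is a small sketch that maps the four standard JDBC isolation constants to the read anomalies the SQL standard still permits at each level. The IsolationCheatSheet class and its Anomaly enum are my own invention for illustration; the constants come from java.sql.Connection. Real databases can be stricter than the standard requires, so treat this as the textbook view rather than a promise about your particular database.

```java
import java.sql.Connection;
import java.util.EnumSet;
import java.util.Set;

/** Textbook view of which read anomalies each isolation level still allows. */
class IsolationCheatSheet {
    enum Anomaly { DIRTY_READ, NON_REPEATABLE_READ, PHANTOM_READ }

    /** Anomalies the SQL standard permits at the given JDBC isolation level. */
    static Set<Anomaly> permitted(int jdbcLevel) {
        switch (jdbcLevel) {
            case Connection.TRANSACTION_READ_UNCOMMITTED:
                return EnumSet.allOf(Anomaly.class);
            case Connection.TRANSACTION_READ_COMMITTED:
                return EnumSet.of(Anomaly.NON_REPEATABLE_READ, Anomaly.PHANTOM_READ);
            case Connection.TRANSACTION_REPEATABLE_READ:
                return EnumSet.of(Anomaly.PHANTOM_READ);
            case Connection.TRANSACTION_SERIALIZABLE:
                return EnumSet.noneOf(Anomaly.class);
            default:
                throw new IllegalArgumentException("Unknown isolation level: " + jdbcLevel);
        }
    }

    public static void main(String[] args) {
        System.out.println(permitted(Connection.TRANSACTION_READ_COMMITTED));
        System.out.println(permitted(Connection.TRANSACTION_SERIALIZABLE));
    }
}
```

If I remember correctly, Hibernate’s hibernate.connection.isolation setting accepts these same integer constants, but check the documentation for your version before relying on that.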

SECOND LEVEL CACHE

    The second level cache has a broader scope and is one that can greatly improve performance in applications. It differs from the session cache in a few ways:

    1. It can live across sessions
    2. It can be distributed
    3. It is not managed internally by Hibernate. Instead, you have the ability to define a cache provider

    These points are all very important for one main reason: because the data can now be distributed and can live across sessions, you have to be very careful about what is cached and how long that data stays cached. Further, you now have to integrate another cache provider into Hibernate’s inner sanctum to help you do so.

    So now we have data that is living across sessions. Other servers, other users can all benefit from this cache rather than the selfish, session cache. The problem is that by doing this, your sheltered world just got more complicated.

    The second level cache is also known as the object cache, but it doesn’t actually store the instances of the objects. Instead it stores the values for the members of the object with an identifier as the key in the cache. So, for example, let’s suppose we have the following table:

    TEACHER

    ID   NAME          SUBJECT
    123  Mr. Garrison  English
    456  Mr. Mackey    Math

    The second level cache will store cached data as:

    {123, [Mr. Garrison, English]}
    {456, [Mr. Mackey, Math]}

    So, asking for teacher 456 will simply use that ID to pull from the cache. Now suppose we also have this table:

    STUDENT

    ID    NAME   TEACHER (FK)  GRADE
    2009  Stan   123           5
    2010  Kyle   456           5
    2011  Eric   456           5
    2012  Kenny  123           5

    Now, things get a bit more involved. There is a one-to-many relationship now of teacher to student. Your persistence-mapping file will most likely have three mappings to show this relationship:

    1. The Teacher class
    2. The one-to-many relationship in the Teacher mapping to a set of Students, using Teacher ID as the key
    3. The Student class

    Now, the reason I bring this up is related to my first item of interest for the second level cache:

    • Caching Hibernate Associations

      In the above scenario, you will have the ability to cache the Teacher mapping as well as the Teacher relationship to Student. So, in that instance, two caches will be created to store the Teacher info as well as the IDs of the Students to which it’s related:

      Teacher Cache
      {123, [Mr. Garrison, English]}
      {456, [Mr. Mackey, Math]}

      Teacher.student Cache
      {123, [2009, 2012]}
      {456, [2010, 2011]}

      Ah, but here’s the rub. Assuming the relationship between teacher and student uses a typical fetch of ‘select’, you would expect the following code to run only two queries, one for the teacher and one for the students, and only on the first invocation of ‘load’.

      session.load(Teacher.class, 456);
      session.clear();
      session.load(Teacher.class, 456);

      Instead, it runs three queries: two for the students and teachers in the first invocation and then one for the students in the second. Adding another call to session.load will run yet another query and so on.

      This is because you also have to cache the Student mapping. What the cache logic does is as follows:

      1. Check the Teacher cache. If ID exists, get the info.
      2. Check the Teacher.student cache. If Teacher ID exists, get the Student IDs related to it
      3. Retrieve the Students based on IDs

      Since there is no Student cache, a retrieval is done for Student info every time. So, to cache associations, make sure you cache the parent class, the relationship, and the child class (even if the child class is never explicitly retrieved on its own).
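The three lookup steps above can be sketched in plain Java like so. This is a toy model under my own assumptions, not Hibernate’s code: the ToyAssociationCache class is hypothetical, the data for teacher 456 is hard-coded, and each simulated trip to the database simply bumps a counter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the three lookup steps for teacher 456; not Hibernate's real code. */
class ToyAssociationCache {
    private final Map<Long, String[]> teacherCache = new HashMap<>();
    private final Map<Long, List<Long>> teacherStudentCache = new HashMap<>();
    private final Map<Long, String> studentCache = new HashMap<>();
    private final boolean studentsCacheable;
    private int queries = 0;

    ToyAssociationCache(boolean studentsCacheable) {
        this.studentsCacheable = studentsCacheable;
    }

    /** Simulates loading teacher 456 in a fresh session and reading its students. */
    void loadTeacher456() {
        long teacherId = 456L;
        // 1. Check the Teacher cache.
        if (!teacherCache.containsKey(teacherId)) {
            queries++; // SELECT from TEACHER
            teacherCache.put(teacherId, new String[] {"Mr. Mackey", "Math"});
        }
        // 2. Check the Teacher.student cache for the related student IDs.
        List<Long> studentIds = teacherStudentCache.get(teacherId);
        if (studentIds == null) {
            queries++; // SELECT from STUDENT by teacher; the rows come back with it
            teacherStudentCache.put(teacherId, Arrays.asList(2010L, 2011L));
            if (studentsCacheable) {
                studentCache.put(2010L, "Kyle");
                studentCache.put(2011L, "Eric");
            }
            return;
        }
        // 3. Retrieve the Students by ID; with no Student cache this is a query every time.
        if (!studentCache.keySet().containsAll(studentIds)) {
            queries++;
        }
    }

    int queries() {
        return queries;
    }
}
```

With new ToyAssociationCache(false), two calls to loadTeacher456() cost three simulated queries and every further call costs one more. With new ToyAssociationCache(true), the count stays at two no matter how many times you call it.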

    • Not every Hibernate method invokes the cache

      One would think that any retrieval for an object would invoke the second-level cache. One would think. That is not the case, however. It turns out that it’s specific to the method you are using to retrieve your data. For example:

      session.load(Teacher.class, 456);
      session.get(Teacher.class, 456);

      will both cause Hibernate to check the second level cache first for the existence of those IDs. However, doing something like the following will not:

      Criteria crit = session.createCriteria(Teacher.class);
      crit.add(Restrictions.eq("name", "Mr. Garrison"));
      crit.list();

      Because you are using a criteria object and querying for a list, Hibernate will not first check the second level cache for the existence of your objects. OK, that sort of makes sense. But what about these?

      crit.add(Restrictions.eq("id", 456L));
      crit.list();

      OR

      crit.add(Restrictions.eq("id", 456L));
      crit.uniqueResult();

      I am querying by the ID in both examples. Plus, in the second one, I’m even asking for a unique result. Will Hibernate at least check the cache on the second snippet? Nope. In both, though, the cache will be populated, which is important because this means more memory being utilized for no real purpose.
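Here is a rough way to picture the difference, again as a plain-Java toy rather than anything Hibernate actually runs. The ToyEntityCache class and its two lookup methods are hypothetical: getById stands in for get and load, while findIdByName stands in for a query-style retrieval.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: a cache keyed by ID can only be probed when you hold an ID. */
class ToyEntityCache {
    private final Map<Long, String> database = new HashMap<>();     // TEACHER: ID -> NAME
    private final Map<Long, String> secondLevel = new HashMap<>();  // keyed by ID only
    private int queries = 0;

    ToyEntityCache() {
        database.put(123L, "Mr. Garrison");
        database.put(456L, "Mr. Mackey");
    }

    /** Direct lookup by ID: probes the cache first and only then the database. */
    String getById(long id) {
        String cached = secondLevel.get(id);
        if (cached != null) {
            return cached;
        }
        queries++;
        String name = database.get(id);
        if (name != null) {
            secondLevel.put(id, name);
        }
        return name;
    }

    /** Query-style lookup: always goes to the database, yet still populates the cache. */
    Long findIdByName(String name) {
        queries++;
        for (Map.Entry<Long, String> row : database.entrySet()) {
            if (row.getValue().equals(name)) {
                secondLevel.put(row.getKey(), row.getValue());
                return row.getKey();
            }
        }
        return null;
    }

    int queries() {
        return queries;
    }
}
```

Calling getById(456L) twice costs one simulated query. Calling findIdByName("Mr. Garrison") twice costs two, even though the matching row is sitting in the cache after the first call.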

      The takeaway here is that not only do certain methods not invoke the cache, but certain approaches to retrieval also do not. More specifically, retrievals by a parameter other than the ID of the record will not check the second level cache. This means that retrieving a codes table by the code name repeatedly will not get any better if you decide to cache it. Since you are retrieving by name and not the designated ID, the second-level cache does not come into play. Remember, in our teacher-student example, the keys in the cache were the IDs. Likewise with the associations.
      So, how can we always guarantee a retrieval by ID so that we can make use of the second-level cache? Is there a way to dummy that up? Glad you asked. And don’t call me a dummy.

QUERY CACHE

      The query cache is the third style of caching at your disposal. Think of this one as Robin the Boy Wonder of caching. By itself, it’s sort of useless. However, with the second-level cache, it becomes a worthy companion and trusted sidekick.

      What the query cache stores is basically a where clause and its bound parameters as a key, which is paired to the IDs that said where clause returns. For example:

      {[‘Where Teacher.Name = ?’, ‘Mr. Garrison’], 123}
      {[‘Where Teacher.Name = ?’, ‘Mr. Mackey’], 456}

      As you can see, it is sort of useless by itself. The query cache provides you means to an end. It is a powerful complement to the second-level cache when IDs are not used for retrieval.
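To show how the two caches lean on each other, here is a minimal plain-Java sketch. It is a toy under my own assumptions: the ToyQueryCache class, the findTeachers method, and the in-memory "database" are hypothetical, and the key is simply the where clause plus its bound parameter, as in the entries above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a query cache that stores only IDs and leans on an entity cache. */
class ToyQueryCache {
    private final Map<Long, String> database = new HashMap<>();               // TEACHER: ID -> NAME
    private final Map<Long, String> secondLevel = new HashMap<>();            // entity cache by ID
    private final Map<List<Object>, List<Long>> queryCache = new HashMap<>(); // [where, param] -> IDs
    private int queries = 0;

    ToyQueryCache() {
        database.put(123L, "Mr. Garrison");
        database.put(456L, "Mr. Mackey");
    }

    List<String> findTeachers(String name) {
        List<Object> key = Arrays.asList("where Teacher.name = ?", name);
        List<Long> ids = queryCache.get(key);
        if (ids == null) {
            queries++; // run the real query; the rows fill both caches
            ids = new ArrayList<>();
            for (Map.Entry<Long, String> row : database.entrySet()) {
                if (row.getValue().equals(name)) {
                    ids.add(row.getKey());
                    secondLevel.put(row.getKey(), row.getValue());
                }
            }
            queryCache.put(key, ids);
        }
        List<String> result = new ArrayList<>();
        for (Long id : ids) {
            String teacher = secondLevel.get(id);
            if (teacher == null) {
                queries++; // the query cache knew the ID, but the entity itself is gone
                teacher = database.get(id);
                secondLevel.put(id, teacher);
            }
            result.add(teacher);
        }
        return result;
    }

    /** Simulates the entity cache being emptied while the query cache survives. */
    void evictEntities() {
        secondLevel.clear();
    }

    int queries() {
        return queries;
    }
}
```

Two calls to findTeachers("Mr. Garrison") cost one simulated query. Call evictEntities() in between and the query cache still hands back ID 123, but the teacher has to be fetched again, which is why the query cache is of little use without the second-level cache behind it. In Hibernate itself, as far as I know, the query cache is switched on with the hibernate.cache.use_query_cache property and requested per query with setCacheable(true).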
      The query cache also requires a good deal of analysis before you start using it willy-nilly. There are many points to consider when making use of it. Alex Miller wrote a great blog post on whether query caching is a good idea. So, check it out if you’re interested. If not, I’ll sum up some points of his here as well as one I’ve come across in my experiences:

      • The query cache can get big

        Take a gander at the above example I wrote of the query cache. Look at how verbose it looks already. Now realize that this is my dopey scenario with ONE column in our where clause and one parameter. Obviously, queries and where clauses can get to be enormous. Query caching these behemoths will result in all that information just sitting around in memory. So, while you’re making every effort to improve performance, you’re actually jamming a lot more into memory.

      • On frequently updated tables, the query cache is worthless

        The query cache is managed by the caching provider through the use of a separate cache called the UpdateTimestampsCache. The UpdateTimestampsCache maintains the last updated timestamp of particular tables. When a table is updated or inserted into, the last updated value is modified for that table in the timestamp cache. Any entries in the query cache that correspond to that table are then invalidated. Put simply, if the table you are caching in the query cache is updated or inserted into FOR ANY REASON, your cache entry will become invalidated. This means that updates or inserts completely unrelated to you or your data can still invalidate your entries.

        So, caching tables that are updating frequently in the system in the query cache is generally not a good idea. The entries will become invalidated quite often, resulting in cache misses constantly.
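The invalidation rule can be sketched as follows. This is only a toy model of the idea, not the real UpdateTimestampsCache: the ToyUpdateTimestamps class and its methods are hypothetical, and I use a simple counter as a logical clock so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of table-level timestamp invalidation for cached queries. */
class ToyUpdateTimestamps {
    private long clock = 0;                                           // logical clock
    private final Map<String, Long> lastUpdated = new HashMap<>();    // table -> last write
    private final Map<String, Long> cachedAt = new HashMap<>();       // query key -> time cached
    private final Map<String, String> queryTable = new HashMap<>();   // query key -> table it reads

    /** Records that a query result over the given table was cached "now". */
    void cacheQuery(String queryKey, String table) {
        cachedAt.put(queryKey, ++clock);
        queryTable.put(queryKey, table);
    }

    /** Records any insert or update on a table, whichever row it touched. */
    void tableWritten(String table) {
        lastUpdated.put(table, ++clock);
    }

    /** An entry is usable only if its table has not been written since it was cached. */
    boolean isFresh(String queryKey) {
        Long at = cachedAt.get(queryKey);
        if (at == null) {
            return false;
        }
        Long written = lastUpdated.get(queryTable.get(queryKey));
        return written == null || written < at;
    }
}
```

Cache a query over TEACHER, write to STUDENT, and isFresh still returns true. Write to TEACHER, even for a row your query never returned, and it returns false.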

      • Query caching is ineffective on frequently changing parameters in Hibernate queries

        As you see in my example, the key to the query cache is composed of the where clause itself as well as the values of the bound parameters. The astute programmer will realize that if either of these changes, we have a new entry in the cache. In our example, the information is static, but suppose you have this:

        {[‘Where Teacher.ModificationDate < ?’, (some long representing a date)], 123}

        Now suppose the query is invoked by passing in the current system time in milliseconds. Since obviously the current time changes every (wait for it) millisecond, each repeated invocation of this query will result in a new parameter being passed in for Teacher.ModificationDate. This means that there will be a cache miss each time this query is run, which will subsequently result in ANOTHER cache entry being created. So, taking the first point about verbosity and this point into consideration, imagine an enormous where clause in which a constantly-changing parameter is passed in, such as the current time. We will have unintended cache misses, resulting in unintended extraneous cache entries, which further results in an insane amount of data in memory.
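Here is a small sketch of that runaway growth in plain Java. The GrowingQueryCache class is hypothetical and only models the cache key. The truncateToMinute helper is one possible mitigation that I am adding as an assumption of my own; it only makes sense if your query can tolerate that coarser granularity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: every distinct bound parameter becomes its own cache entry. */
class GrowingQueryCache {
    private final Map<List<Object>, List<Long>> cache = new HashMap<>();
    private int misses = 0;

    /** Looks up the IDs for the query, keyed by the where clause plus the bound timestamp. */
    List<Long> modifiedBefore(long timestampMillis) {
        List<Object> key = Arrays.asList("where Teacher.modificationDate < ?", timestampMillis);
        return cache.computeIfAbsent(key, k -> {
            misses++;                    // stands in for running the real query
            return Arrays.asList(123L);  // made-up result
        });
    }

    /** Rounds a timestamp down to the start of its minute (assumed mitigation). */
    static long truncateToMinute(long millis) {
        return millis - (millis % 60_000L);
    }

    int size() {
        return cache.size();
    }

    int misses() {
        return misses;
    }
}
```

Five calls with five consecutive millisecond values leave five entries and five misses behind. Pass the same values through truncateToMinute first and, as long as they fall within the same minute, you get one entry and one miss.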

So, that is caching as far as my on-again, off-again love Hibernate is concerned. The number one thing to take away from this post is to always consider the inherent implications. On the surface, the three levels of caching that Hibernate provides are very powerful and are a useful way to improve performance. But, deep down in the seedy underbelly lurk caveats at every turn. Caching is not an easy process to manage or to even understand. Just ask your grandmother. And remember to eat the chicken soup.


Entry Filed under: Agile and Development


© 2010-2014 Summa All Rights Reserved