How I tamed a complex DB structure with an index service

So, I had to build an API layer on top of an existing legacy (but well-known) codebase.

Sounds easy, huh? Yep, that’s what I thought, too.

Unfortunately, things were not that easy: as I started going into details, I found a few traps.

The traps list

  • JOIN hell

The underlying entity map is quite extended. There are many entities, all somehow related, and to apply some filters you need to set up a long chain of JOINs.

  • Soft-delete master

The system uses a soft-delete table, with all informations about deleted records for all entities. So this table needs to be joined several times in each query.

  • Relations labyrinth

Some entities have many-to-many relations.

These relations are stored in a separate table where couples of entity id / entity type are linked to other couples.

On top of this, you cannot even rely on the order: the relation may be expressed in both directions, so you have to duplicate all your queries.

  • Very big data

The icing on the cake is that there are LOTS and LOTS of record. So of course you are not allowed to make poor design decisions, as this will greatly impact performances.

So: without any creativity, I would have build a few methods with an endless chain of JOINs, which makes for an unreadable and umaintainable query, and very probably a bad performance guarantee.

You can see the scenery was quite gloomy.

The nice, but not nice at all, solution

A solution which we experimented in a recent project is to follow CQRS principles and build a read model.

Which means that I could have run a background process to de-normalize all data into huge read tables, where a filter does not require any join.

But that has its own drawbacks, too: updating the read model is not instantaneous and takes some resources.

You have to catch all possible places in the code where data are updated, and trigger an event to update the read model.

Or you can do this at the database level, but then you have some logic outside your application, which may be tricky in the long run. Some updates logic are quite complex, too.

And you have to prevent multiple parallel executions.

I still believe CQRS is a great concept but it may take a while to build a usable implementation.

The working solution

As stated above, a big part of the complexity of the queries was because I need to apply filters which were distant many JOINs from the requested resource table.

So, as a possible evolution of the read tables idea, I thought about building an index read table only.

In other words, I would de-normalize all relations and create one or more tables with only related IDs from different tables. Then I would do 2 queries:

  • the first on this index table to get all IDs of an entity related to another entity ID
  • the second on the requested resource table, using this list of IDs as a simple WHERE IN (...) clause

Now I was getting somewhere.

But before I started building this system, it evolved even more in a series of thoughts:

  • I already experienced that many small queries may be faster of a single big query.
  • So I could use many small index tables instead of a single big one.
  • And since I use many small indexes, they change less often, and I can cache each result for a while.
  • And since I cache the result, why build tables at all? I can simply get the result and cache it.

Enter the IndexService.

The index service

The index service contains a series of methods with a common structure. This is an example :

public function getFooIdsByBar($barValue)
  $cachedValue = $this->cache->getValue('getFooIdsByBar' . $barValue);
  if ($cachedValue) {
    return $cachedValue;

  $query = $this->connection
    ->select(" AS id")
    ->from("foo", "f")
    ->innerJoin("f", "softdeleted", "s", " = AND s.type = 'foo'")
    ->andWhere("s.deleted = 0")
    ->innerJoin("f", "relations", "r", " = AND r.type = 'foo'"    
    ->andWhere("r.relatedId = '$barValue'")
    ->andWhere("r.relatedType = 'bar')

  $ids = $this->getIds($query);
    'getFooIdsByBar' . $barValue, $ids

  return $ids;

Basically, each method starts by checking if the cache already has the requested value, and if so just returns it.

If not, build the query, get the value, and cache it for later use.

There are many methods in the service; some of them may even use other methods from the same service to get filters.

The advantages

  • Readability

As a developer, the biggest win of this solution is that it each query in the application is really easy to understand. No more than a couple of JOINs in each query; most of them retrieve a single field which is the ID. Comparing to what I had to build at the start of the project, before realizing it was insane, this was a piece of cake.

  • Cache-friendly

The really nice thing, though, is that you win also from a performance point of view. Each method is cached separately. That means you could easily set different expiration times for each method, according to how often that data is changed and/or how important it is to have a really up-to-date result.

  • Database-friendly

Since each query is so small, it will be run very quickly without loading the database and – as I mentioned already – many small queries may be run in a shorter time than a single big one. But that is not all: you cache all results, so the database is even less used. Isn’t that great?

  • Test-friendly

Many small and simple methods are definitely easier to test. The most difficult part was building the fixtures data 🙂

The disadvantages

Honestly, I did not find any so far.

Of course, if you have some wrong data in the response and if you want to debug why, you may have to follow a few methods calls. But on the other hand, all of these methods are simpler to debug and to test, so that evens out in the end.

Currently I don’t know if this paradigm could be successfully applied to other systems. This was a very special case, with an underlying database structure so complex that it was mandatory to come up with a creative solution.

But I felt so good while developing and extending the index system, that somehow I guess it won’t be its last appearance…

Leave a Reply

Your email address will not be published. Required fields are marked *