Thursday, November 17, 2011

Does not compose (and other musings)

I first planned to write a post about how objects do not compose on a property level. But I ended up addressing a couple of other problems with ORM as well. It is really hard (for me) to stay focused on one problem at a time when writing of Object Relational Mapping, because they do such a good job of introducing complexity.

Let's take a trivial example of something you want to do all the time in even the most trivial of web apps. You want to compose data from from different tables.

Consider this tiny database of (some of) the beers in my fridge.

Table: BEER
IDNAMEFK_BREWERY
1Double Dog6
2Gonzo6
3Texas Ranger5
4Punk IPA1
5Bigfoot Ale2
6Torpedo Extra IPA2

Table: BREWERY
IDNAMECOUNTRY
1Brew DogScotland
2Sierra NevadaUSA
3Thomas CreekUSA
4Sam AdamsUSA
5MikkellerDenmark
6Flying DogUSA

In a real database these tables would of course have a lot more columns.

Now, imagine you want to display a HTML page with a list of all the beers along with the name of the brewery displayed as a link. The link would refer to another page displaying all the data for that brewer. You cannot get more standard than this.

With ORM


How would this look in a standard ORM with one class per table and one property per column (sprinkled with some magic if the column was a foreign key)?

beers = orm.getAllOfType(Beer)
for (beer in beers) {
  brewery = beer.getBrewery()
  printRow(beer.getName(), link(brewery.getId(), brewery.getName()))
}

If you get a hard-on from the absence of SQL it looks like we hava a winner! But if you have half a brain and wonder what takes place under the covers things don't look so good. For each beer.getBrewery() a database query is beeing made. This is bad on so many levels.

The extra states


The most obvious reason for why this is bad is that you have no way of knowing about the extra database requests just looking at the code. You can probably even open the definition of the Beer class and still not get a clue. You have to know that your beer object is not just a simple collection of data, nor is it anything like if you had instantiated it yourself. The object was created by the almighty ORM and is now a carrier of the database connection! WTF?!?

So the joyful result is that the objects have state you did not ask for. Typical ORM names for the states are attached and detached. But that is just scratching the surface, if the object has a (potential) reference to another object you have to know if that object has been loaded or not. That is part of the extra state as well, but typically not talked about.

The n+1 problem


This brings us to the next problem; the problem of multiple database requests. This is of course a well know problem. It is so well known it has a name: the n+1 problem, named after the number of queries made to the database. Makers of ORMs love to create problems, give them names, then create bad workarounds for the problems they just created. A sane person would maybe stop a think that you may have something fundamentally flawed in the model.

Ok, so that are the "solutions" ORMs provide for the n+1 problem? One way is to provide some mechanism to give hint to the ORM that you always should join (or fetch in ORM speak) the brewery as soon as you load a beer. Never mind that in 95% of your other queries of you don't care about that data at all. Hence, you should really think twice before using this technique.

Another "solution" provided may be a query language that sort of looks like SQL but have none of the properties a relational query language should have. Again, WTF?!? Why create a shitty language just so you can work with objects which sort of looks like your tables. What is the obsession?

Loading too much data


Another problem is that the ORM will load all the columns from the BREWERY table. The ORM has no way of knowing that you are only going to use 2 of them. This will get ugly very quick, especially if you're joining many tables.

Without ORM


How would a solution look in SQL and some low level data access layer?

resultSet = queryDatabase(
  """SELECT BEER.NAME, BREWERY.NAME, BREWERY.ID
  FROM   BEER, BREWERY
  WHERE  BEER.FK_BREWERY = BREWERY.ID""")

for (row in resultSet)
  printRow(row["BEER.NAME"], link(row["BREWERY.ID"], row["BREWERY.NAME"]))

Done! One query to the database guaranteed. Further you have a clear separation of concerns, querying the database is clearly separated from rendering the table. And the row object is simple naked data, the way god intended.

But here you may object (no pun intended)! The resultSet can be of a kind which has a open database connection, wasn't that something just complained about? This technique is used if you expect large result sets and you don't want to load all data in memory. If that is the case this is reflected in resultSet's type. You don't have to know how the resultSet was created, you just have to look at its type. Further, you will never execute a whole new query, you will effectively stream data from the database. (Streaming data from the database with an ORM could be a topic for a whole other post btw.)


The root cause


So, that is the root problem with the ORM in this case? Objects do not compose on a property level, they are indivisible containers of data. Relations (tables) on the other hand are divisible, you can cut them any way you like and still have a relation. That power is lost as soon as you go ORM.

No comments:

Post a Comment