leah blogs

July 2005

13jul2005 · Querying databases using Ruby

I have, like many others, problems using databases. For the biggest part, these problems relate to the usage of SQL, which is IMO an extremely ugly language that hides the beauty of the relational model underlying.

Even more interesting, there is no real alternative for SQL to access relational databases. Therefore, I have been thinking about formulating database queries directly in Ruby, and effectively executing them there too. This sounds likely like a big performance loss, but it has the great advantage that you actually know what the database it doing (that is, exactly what you tell it) without reading big and complex ANALYZE output. Also, note that—in the end—efficient queries always run faster than the unoptimized ones. Therefore, I think a pure-Ruby database query language scales good enough for now, optimization is left for later.

Now, what is the basic idea of database querying? In most cases, a database query can be written in Unix pipe style (Joins will come later). For example, take this SQL:

SELECT name, address FROM customers WHERE city = "New York" LIMIT 5

We could rewrite this like the following in Unix (In fact, 4gl(?) works a lot like that):

< customers | grep "..." | cut -f ... | head -n 5

Or, like this in Ruby:

customers.find_all { |c| c.city == "New York" }.
          first(5).
          map { |c| [c.name, c.address] }

Unfortunately, the Ruby query is not lazy, as the SQL and Unix queries are—they only look until they found five results. We could rewrite the query with Lazy Streams, then it would look like that:

Stream.database(Customers).filter { |c| c.city == "New York" }
                          .take(5).map { |c| [c.name, c.address] }

That looks much better, actually. Unfortunately, this method possibly needs lots of method calls and lambdas, so let’s rewrite it imperatively (ugh):

result = []
count = 0
Customers.each { |c|
  next   if c.city != "New York"
  break  if count >= 5

  result << [c.name, c.address]
  count += 1
}

In fact, I think I can rewrite almost each SQL query in a loop like that. Note, how we could optimize the query, if we, say had a secondary key on city (this approach forces you to cleverly make use of indexes, as linear tablescans take much longer than your usual database, which is probably written in C).

Customers.each(:city, "New York") { ... }

Now, queries on a single table are fairly easy; how can we join tables? There are three kinds of joins, 1:1, 1:n, and m:n. 1:1 joins are trivial, we just grab the data we need:

result << [c.amount, Customers[c.custid].name]

An interesting idea left open would be to directly store a reference to the customer, so you could write [c.amount, c.customer.name]. You can do this if referential transparency is guaranteed.

For 1:n joins, we need an inner loop (assume useful indexes), outer joins can be done similarily:

SELECT customer.name, payment.amount
  FROM customers, payments
 WHERE customer.id = payment.customer

will get

Customers.each { |c|
  Payments.each(:customer, c.id) { |p|
    result << [c.name, p.amount]
  }
}

Finally, m:n joins can be done with a table and two 1:n joins, just like in SQL.

A further feature, that SQL provides are aggregates. SQL can easily compute the count, sum, average, and maximum and minimum of a set of rows. In Ruby, we can do that too, of course:

sum = 0
Payments.each { |p| sum += p.amount }

Should we need an aggregate inside a query, we either need several table scans (ugh) or some kind reference to future values.

As you can see, almost all SQL queries can be written in standard, iterative Ruby code that should not be slower (in theory, at least) if the queries are written well. However, that kind of Ruby code is rather tiresome to write. Again, have a look at our first piece of SQL:

SELECT name, address FROM customers WHERE city = "New York" LIMIT 5

Wouldn’t it be nice if we could write this in a more functional style?

Scan.new(Customers,
    Where.new(
        Limitation.new(5,
            Collect.new { |c| [c.name, c.address] }) { |c|
      c.city == "New York" })).scan!

Or, with a bit of syntactic tricks:

(query = Scan.new(Customers)) >>
  Where.new { |c| c.city == "New York" } >>
  Limit.new(5) >>
  Collect.new { |c| [c.name, c.address] }
p query.scan.results

That looks very convenient, no? How does it work? Essentially, we are building a chain again, like in the example with the Unix pipes. You can see the source for the full details, but look how Limit is implemented to get an idea of how it works:

class Limit < Statement
  def initialize(limit)
    super()
    @limit = limit
    @count = 0
  end

  def call(row)
    if @count < @limit
      @count += 1
      pass_on row
    end
  end
end

This way, we can combine all operators needed in a powerful way without any implicit loops (one could even imagine an peephole-optimizer to work on the chain…) in our source, but still full access to Ruby’s methods. Unfortunately, this way of querying needs, in worst case, (Number of rows) * (Number of statements in the chain) calls, which are comparatively expensive. It would be interesting to rewrite the example in raw C that is connectable with Ruby…

NP: Coldplay—Square One

Copyright © 2004–2022