Like many others, I have problems using databases. For the biggest part, these problems relate to SQL, which is IMO an extremely ugly language that hides the beauty of the underlying relational model.
Even more interesting, there is no real alternative to SQL for accessing
relational databases. Therefore, I have been thinking about
formulating database queries directly in Ruby, and effectively
executing them there too. This sounds like a big performance
loss, but it has the great advantage that you actually know what the
database is doing (namely, exactly what you tell it) without reading
big and complex ANALYZE
output. Also, note that, in the
end, efficient queries always run faster than unoptimized ones.
Therefore, I think a pure-Ruby database query language scales well
enough for now; optimization is left for later.
Now, what is the basic idea of database querying? In most cases, a database query can be written in Unix pipe style (joins will come later). For example, take this SQL:
SELECT name, address FROM customers WHERE city = "New York" LIMIT 5
We could rewrite this like the following in Unix (In fact, 4gl(?) works a lot like that):
< customers | grep "..." | cut -f ... | head -n 5
Or, like this in Ruby:
customers.find_all { |c| c.city == "New York" }.
first(5).
map { |c| [c.name, c.address] }
Unfortunately, the Ruby query is not lazy like the SQL and Unix queries, which only look until they have found five results. We could rewrite the query with Lazy Streams; then it would look like this:
Stream.database(Customers).filter { |c| c.city == "New York" }
.take(5).map { |c| [c.name, c.address] }
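In current Ruby, the built-in Enumerator::Lazy gives the same short-circuiting behavior without a separate stream library. A small self-contained sketch (the sample data is invented for illustration):

```ruby
# Stand-in "table": an array of structs, purely for illustration.
Customer = Struct.new(:name, :address, :city)
customers = [
  Customer.new("Alice", "5th Ave",  "New York"),
  Customer.new("Bob",   "Main St",  "Boston"),
  Customer.new("Carol", "Broadway", "New York")
]

# .lazy makes the chain stop scanning as soon as five rows are found.
result = customers.lazy.
  select { |c| c.city == "New York" }.
  map    { |c| [c.name, c.address] }.
  first(5)
```

Here first(5) forces the lazy chain and returns a plain array, so nothing past the fifth match is ever examined.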
That looks much better, actually. Unfortunately, this approach possibly needs lots of method calls and lambdas, so let’s rewrite it imperatively (ugh):
result = []
count = 0
Customers.each { |c|
  break if count >= 5          # stop scanning once we have five rows
  next  if c.city != "New York"
  result << [c.name, c.address]
  count += 1
}
In fact, I think I can rewrite almost any SQL query as a loop like
that. Note how we could optimize the query if we, say, had a
secondary index on city
(this approach forces you to make clever use
of indexes, as linear table scans take much longer in Ruby than in your usual
database, which is probably written in C).
Customers.each(:city, "New York") { ... }
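A minimal sketch of how such an index-aware each could be implemented; the Table class and its hash index are my own invention here, not the original code:

```ruby
# A hypothetical Table with a secondary hash index (a sketch,
# not the original implementation).
class Table
  def initialize(rows)
    @rows = rows
    @indexes = {}
  end

  # Build a hash index on a column: value => rows with that value.
  def index_on(column)
    @indexes[column] = @rows.group_by { |row| row[column] }
  end

  # each(:city, "New York") touches only the matching rows;
  # without an index it falls back to a full table scan.
  def each(column = nil, value = nil, &block)
    if column && @indexes[column]
      (@indexes[column][value] || []).each(&block)
    else
      @rows.each(&block)
    end
  end
end

customers = Table.new([{ name: "Alice", city: "New York" },
                       { name: "Bob",   city: "Boston"  }])
customers.index_on(:city)
names = []
customers.each(:city, "New York") { |c| names << c[:name] }
```

The index turns the equality lookup into a single hash access plus a walk over only the matching rows.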
Now, queries on a single table are fairly easy; how can we join tables? There are three kinds of joins: 1:1, 1:n, and m:n. 1:1 joins are trivial; we just grab the data we need:
result << [c.amount, Customers[c.custid].name]
An interesting idea left open would be to directly store a reference
to the customer, so you could write [c.amount, c.customer.name].
You can do this if referential integrity is guaranteed.
For 1:n joins, we need an inner loop (assume useful indexes); outer joins can be done similarly:
SELECT customers.name, payments.amount
FROM customers, payments
WHERE customers.id = payments.customer
becomes
Customers.each { |c|
Payments.each(:customer, c.id) { |p|
result << [c.name, p.amount]
}
}
Finally, m:n joins can be done with a table and two 1:n joins, just like in SQL.
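Sketched in the same loop style, an m:n join between, say, students and courses through an enrollments table (all three tables here are invented for the example):

```ruby
# Invented example tables: students and courses joined
# through an enrollments table.
students    = [{ id: 1, name: "Alice" }, { id: 2, name: "Bob" }]
courses     = { 10 => "Databases", 20 => "Compilers" }
enrollments = [{ student: 1, course: 10 },
               { student: 1, course: 20 },
               { student: 2, course: 10 }]

# Two chained 1:n lookups through the join table.
result = []
students.each do |s|
  enrollments.each do |e|
    next unless e[:student] == s[:id]
    result << [s[:name], courses[e[:course]]]
  end
end
```

With an index on the enrollments' student column, the inner loop would touch only the matching join rows.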
A further feature that SQL provides is aggregates: SQL can easily compute the count, sum, average, maximum, and minimum of a set of rows. In Ruby, we can do that too, of course:
sum = 0
Payments.each { |p| sum += p.amount }
Should we need an aggregate inside a query, we either need several table scans (ugh) or some kind of reference to future values.
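At least several aggregates can be folded into a single table scan, which softens the cost. A small sketch over hypothetical payment data:

```ruby
# Invented payment rows, purely for illustration.
payments = [{ amount: 10 }, { amount: 20 }, { amount: 30 }]

# One scan computes several aggregates at once.
count = 0
sum   = 0
payments.each do |p|
  count += 1
  sum   += p[:amount]
end
average = sum.to_f / count
```

This is one pass over the table no matter how many running aggregates we maintain.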
As you can see, almost all SQL queries can be written as standard, iterative Ruby code that should not be slower (in theory, at least) if the queries are written well. However, that kind of Ruby code is rather tiresome to write. Again, have a look at our first piece of SQL:
SELECT name, address FROM customers WHERE city = "New York" LIMIT 5
Wouldn’t it be nice if we could write this in a more functional style?
Scan.new(Customers,
Where.new(
Limitation.new(5,
Collect.new { |c| [c.name, c.address] }) { |c|
c.city == "New York" })).scan!
Or, with a bit of syntactic tricks:
(query = Scan.new(Customers)) >>
Where.new { |c| c.city == "New York" } >>
Limit.new(5) >>
Collect.new { |c| [c.name, c.address] }
p query.scan.results
That looks very convenient, no? How does it work? Essentially, we
are building a chain again, like in the example with the Unix pipes.
You can see the source for the full details,
but look at how Limit
is implemented to get an idea of how it works:
class Limit < Statement
  def initialize(limit)
    super()
    @limit = limit
    @count = 0
  end

  # Let only the first @limit rows pass through; later rows are dropped.
  def call(row)
    if @count < @limit
      @count += 1
      pass_on row
    end
  end
end
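Since the full source isn’t reproduced here, this is a minimal Statement base class of my own devising, consistent with the Limit class above and the >> chain syntax; the details of the real implementation may differ:

```ruby
class Statement
  attr_accessor :next_statement

  def initialize
    @next_statement = nil
    @results = []
  end

  # a >> b appends b at the end of a's chain and returns a,
  # so chains read left to right.
  def >>(other)
    node = self
    node = node.next_statement while node.next_statement
    node.next_statement = other
    self
  end

  # Forward a row to the next statement, or collect it at the chain's end.
  def pass_on(row)
    next_statement ? next_statement.call(row) : @results << row
  end

  def call(row)
    pass_on row
  end

  # Results accumulate in the last statement of the chain.
  def results
    next_statement ? next_statement.results : @results
  end
end

class Scan < Statement
  def initialize(table)
    super()
    @table = table
  end

  # Push every row of the table into the chain.
  def scan
    @table.each { |row| call(row) }
    self
  end
end

class Where < Statement
  def initialize(&pred)
    super()
    @pred = pred
  end

  def call(row)
    pass_on row if @pred.call(row)
  end
end

class Collect < Statement
  def initialize(&block)
    super()
    @block = block
  end

  def call(row)
    pass_on @block.call(row)
  end
end

# Invented example data to exercise the chain.
customers = [{ name: "Alice", city: "New York", address: "5th Ave" },
             { name: "Bob",   city: "Boston",   address: "Main St" }]
(query = Scan.new(customers)) >>
  Where.new { |c| c[:city] == "New York" } >>
  Collect.new { |c| [c[:name], c[:address]] }
out = query.scan.results
```

The Limit class above plugs straight into such a chain, since it only relies on call and pass_on.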
This way, we can combine all the operators we need in a powerful way, without any implicit loops in our source (one could even imagine a peephole optimizer working on the chain…), while still keeping full access to Ruby’s methods. Unfortunately, this way of querying needs, in the worst case, (number of rows) * (number of statements in the chain) calls, which are comparatively expensive. It would be interesting to rewrite the example in raw C that can be hooked up to Ruby…
NP: Coldplay—Square One