Cequel is a Ruby ORM for Cassandra using CQL3.
Cequel::Record is an ActiveRecord-like domain model layer that exposes the robust data modeling capabilities of CQL3, including parent-child relationships via compound primary keys and collection columns. The lower-level Cequel::Metal layer provides a CQL query builder interface inspired by the excellent Sequel library.
Add it to your Gemfile:
gem 'cequel'
If you use Rails 5, add this:
gem 'activemodel-serializers-xml'
Cequel does not require Rails, but if you are using Rails, you will need version 3.2+. Cequel::Record will read from the configuration file config/cequel.yml if it is present. You can generate a default configuration file with:
rails g cequel:configuration
Once you've got things configured (or decided to accept the defaults), run this to create your keyspace (database):
rake cequel:keyspace:create
Unlike in ActiveRecord, models declare their properties inline. We'll start with a simple Blog model:
class Blog
  include Cequel::Record
  key :subdomain, :text
  column :name, :text
  column :description, :text
end
Unlike a relational database, Cassandra does not have auto-incrementing primary keys, so you must explicitly set the primary key when you create a new model. For blogs, we use a natural key, which is the subdomain. Another option is to use a UUID.
While Cassandra is not a relational database, compound keys do naturally map to parent-child relationships. Cequel supports this explicitly with the has_many and belongs_to relations. Let's create a model for posts that acts as the child of the blog model:
class Post
  include Cequel::Record
  belongs_to :blog
  key :id, :timeuuid, auto: true
  column :title, :text
  column :body, :text
end
The auto option for the key declaration means Cequel will initialize new records with a UUID already generated. This option is only valid for :uuid and :timeuuid key columns.
The belongs_to association accepts a :foreign_key option which allows you to specify the attribute used as the partition key.
Note that the belongs_to declaration must come before the key declaration. This is because belongs_to defines the partition key; the id column is the clustering column.
Practically speaking, this means that posts are accessed using both the blog_subdomain (automatically defined by the belongs_to association) and the id. The most natural way to represent this type of lookup is using a has_many association. Let's add one to Blog:
class Blog
  include Cequel::Record
  key :subdomain, :text
  column :name, :text
  column :description, :text
  has_many :posts
end
Now we might do something like this:
class PostsController < ActionController::Base
  def show
    Blog.find(current_subdomain).posts.find(params[:id])
  end
end
Parent-child relationships between namespaced models can be defined using the class_name option of belongs_to, as follows:
module Blogger
  class Blog
    include Cequel::Record
    key :subdomain, :text
    column :name, :text
    column :description, :text
    has_many :posts
  end
end
module Blogger
  class Post
    include Cequel::Record
    belongs_to :blog, class_name: 'Blogger::Blog'
    key :id, :timeuuid, auto: true
    column :title, :text
    column :body, :text
  end
end
If your final primary key column is a timeuuid with the :auto option set, the created_at method will return the time that the UUID key was generated.
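To see why a creation time can be recovered from such a key: a timeuuid is a version 1 UUID, whose first three fields hold a 60-bit count of 100-nanosecond intervals since 1582-10-15. The following is a plain-Ruby sketch of that layout, not Cequel's implementation; the helper names timeuuid_for and time_from_timeuuid, and the fixed clock sequence and node values, are assumptions made for illustration.

```ruby
# 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

# Hypothetical helper: builds a version 1 UUID string for a given time.
# The clock sequence and node are dummy values, not real host data.
def timeuuid_for(time, clock_seq: 0, node: 0x0123456789AB)
  ticks = time.to_i * 10_000_000 + time.nsec / 100 + UUID_EPOCH_OFFSET
  time_low = ticks & 0xFFFFFFFF
  time_mid = (ticks >> 32) & 0xFFFF
  time_hi  = ((ticks >> 48) & 0x0FFF) | 0x1000 # version 1 marker
  clock    = (clock_seq & 0x3FFF) | 0x8000     # RFC 4122 variant bits
  format('%08x-%04x-%04x-%04x-%012x', time_low, time_mid, time_hi, clock, node)
end

# Hypothetical helper: recovers the embedded timestamp from a version 1 UUID.
def time_from_timeuuid(uuid)
  time_low, time_mid, time_hi = uuid.split('-').first(3).map { |part| part.to_i(16) }
  ticks = ((time_hi & 0x0FFF) << 48) | (time_mid << 32) | time_low
  unix_ticks = ticks - UUID_EPOCH_OFFSET
  # The second argument to Time.at is in microseconds; one tick is 1/10 microsecond.
  Time.at(unix_ticks / 10_000_000, Rational(unix_ticks % 10_000_000, 10)).utc
end

generated_at = Time.at(1_400_000_000, 123_456)
key = timeuuid_for(generated_at)
puts time_from_timeuuid(key) == generated_at
```

The round trip is exact only down to 100 nanoseconds, which is the resolution of the UUID timestamp field.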
To add timestamp columns, simply use the timestamps class macro:
class Blog
  include Cequel::Record
  key :subdomain, :text
  column :name, :text
  timestamps
end
This will automatically define created_at and updated_at columns, and populate them appropriately on save.
If the creation time can be extracted from the primary key as outlined above, this method will be preferred and no created_at column will be defined.
If a column should behave like an ActiveRecord::Enum, you can use the column type :enum. It is stored using the :int data type and exposes some helper methods on the model:
class Blog
  include Cequel::Record
  key :subdomain, :text
  column :name, :text
  column :status, :enum, values: { open: 1, closed: 2 }
end
blog = Blog.new(status: :open)
blog.open? # true
blog.closed? # false
blog.status # :open
Blog.status # { open: 1, closed: 2 }
Cequel will automatically synchronize the schema stored in Cassandra to match the schema you have defined in your models. If you're using Rails, you can synchronize your schemas for everything in app/models by invoking:
rake cequel:migrate
Record sets are lazy-loaded collections of records that correspond to a particular CQL query. They behave similarly to ActiveRecord scopes:
Post.select(:id, :title).reverse.limit(10)
To scope a record set to a primary key value, use the [] operator. This will define a scoped value for the first unscoped primary key in the record set:
Post['bigdata'] # scopes posts with blog_subdomain="bigdata"
You can pass multiple arguments to the [] operator, which will generate an IN query:
Post['bigdata', 'nosql'] # scopes posts with blog_subdomain IN ("bigdata", "nosql")
To select ranges of data, use before, after, from, upto, and in. Like the [] operator, these methods operate on the first unscoped primary key:
Post['bigdata'].after(last_id) # scopes posts with blog_subdomain="bigdata" and id > last_id
You can also use where to scope to primary key columns, but a primary key column can only be scoped if all the columns that come before it are also scoped:
Post.where(blog_subdomain: 'bigdata') # this is fine
Post.where(blog_subdomain: 'bigdata', permalink: 'cassandra') # also fine
Post.where(blog_subdomain: 'bigdata').where(permalink: 'cassandra') # also fine
Post.where(permalink: 'cassandra') # bad: can't use permalink without blog_subdomain
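The rule illustrated by these four lines is a prefix rule: the scoped key columns must form a leading run of the key column list. As a rough plain-Ruby sketch (the constant POST_KEY_COLUMNS and the helper valid_key_scope? are hypothetical and are not part of Cequel), the check could look like this:

```ruby
# Hypothetical key order matching the where examples above.
POST_KEY_COLUMNS = %i[blog_subdomain permalink].freeze

# Returns true when the scoped key columns form a prefix of key_columns,
# i.e. no key column is scoped while an earlier one is left unscoped.
def valid_key_scope?(key_columns, scoped_columns)
  scoped_keys = key_columns.select { |column| scoped_columns.include?(column) }
  scoped_keys == key_columns.first(scoped_keys.size)
end

puts valid_key_scope?(POST_KEY_COLUMNS, [:blog_subdomain])
puts valid_key_scope?(POST_KEY_COLUMNS, [:permalink])
```

The first call passes because blog_subdomain is the leading key column; the second fails because permalink is scoped without blog_subdomain.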
Note that record sets always load records in batches; Cassandra does not support result sets of unbounded size. This process is transparent to you but you'll see multiple queries in your logs if you're iterating over a huge result set.
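To make the "multiple queries" behavior concrete, here is a toy, in-memory sketch of batched iteration. It does not talk to Cassandra and does not show how Cequel pages internally; the PagedTable class, the each_in_batches helper, and the key-based paging strategy are all assumptions chosen for illustration.

```ruby
# Toy stand-in for a table: a sorted list of row ids. Not a Cassandra client.
class PagedTable
  attr_reader :query_count

  def initialize(ids)
    @ids = ids.sort
    @query_count = 0
  end

  # Returns up to `limit` ids greater than `after_id` (nil means "from the start").
  def fetch_page(after_id, limit)
    @query_count += 1
    @ids.select { |id| after_id.nil? || id > after_id }.first(limit)
  end
end

# Yields every row to the block, issuing one query per batch.
def each_in_batches(table, batch_size)
  last_id = nil
  loop do
    page = table.fetch_page(last_id, batch_size)
    page.each { |id| yield id }
    break if page.size < batch_size # a short page means nothing is left

    last_id = page.last
  end
end

table = PagedTable.new((1..7).to_a)
each_in_batches(table, 3) { |id| puts id }
puts "queries issued: #{table.query_count}"
```

The caller sees a single iteration, while the table sees one query per batch.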
CQL has special handling for the timeuuid type, which allows you to return rows whose UUID keys correspond to a range of timestamps.
Cequel automatically constructs timeuuid range queries if you pass a Time value for a range over a timeuuid column. So, if you want to get the posts from the last day, you can run:
Blog['myblog'].posts.from(1.day.ago)
When you update an existing record, Cequel will only write statements to the database that correspond to explicit modifications you've made to the record in memory. So, in this situation:
@post = Blog.find(current_subdomain).posts.find(params[:id])
@post.update_attributes!(title: "Announcing Cequel 1.0")
Cequel will only update the title column. Note that this is not full dirty tracking; simply setting the title on the record will signal to Cequel that you want to write that attribute to the database, regardless of its previous value.
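The distinction between "assigned" and "changed" can be illustrated with a small plain-Ruby model. This is only a sketch of the idea, not Cequel's code; the TrackedRecord class and its method names are hypothetical.

```ruby
# Toy record that remembers which attributes were assigned, not which changed.
class TrackedRecord
  def initialize(table, key, attributes)
    @table = table
    @key = key
    @attributes = attributes.dup
    @assigned = {}
  end

  # Every assignment is recorded, even if the value equals the current one.
  def write_attribute(name, value)
    @attributes[name] = value
    @assigned[name] = value
  end

  # Returns [cql, bind_values] covering only the assigned columns, or nil.
  def pending_update
    return nil if @assigned.empty?

    assignments = @assigned.keys.map { |name| "#{name} = ?" }.join(', ')
    conditions = @key.keys.map { |name| "#{name} = ?" }.join(' AND ')
    ["UPDATE #{@table} SET #{assignments} WHERE #{conditions}",
     @assigned.values + @key.values]
  end
end

post = TrackedRecord.new(:posts, { blog_subdomain: 'bigdata', id: 42 },
                         { title: 'Old title', body: 'Text' })
post.write_attribute(:title, 'Old title') # same value, still written
cql, binds = post.pending_update
puts cql
puts binds.inspect
```

Only title appears in the statement, and it appears even though the assigned value matches the old one.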
In the above example, we call the familiar find method to load a blog and then one of its posts, but we didn't actually do anything with the data in the Blog model; it was simply a convenient object-oriented way to get a handle to the blog's posts. Cequel supports unloaded models via the [] operator; this will return an unloaded blog instance, which knows the value of its primary key, but does not read the row from the database. So, we can refactor the example to be a bit more efficient:
class PostsController < ActionController::Base
  def show
    @post = Blog[current_subdomain].posts.find(params[:id])
  end
end
If you attempt to access a data attribute on an unloaded instance, it will lazy-load the row from the database and become a normal loaded instance.
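The lazy-loading behavior can be pictured with a toy class backed by a Hash instead of Cassandra. This is a sketch under stated assumptions: LazyBlog, its reads counter, and the Hash store are hypothetical and only mimic the idea of an unloaded instance.

```ruby
# Toy unloaded model: knows its key up front, reads its row on first access.
class LazyBlog
  attr_reader :subdomain, :reads

  def initialize(subdomain, store)
    @subdomain = subdomain
    @store = store # stand-in for the database: { key => row hash }
    @row = nil
    @reads = 0
  end

  def loaded?
    !@row.nil?
  end

  # Accessing a data attribute triggers the (single) read.
  def name
    load_row[:name]
  end

  private

  def load_row
    @row ||= begin
      @reads += 1
      @store.fetch(@subdomain)
    end
  end
end

store = { 'bigdata' => { name: 'Big Data' } }
blog = LazyBlog.new('bigdata', store)
puts blog.loaded? # the key is known, but no read has happened
puts blog.name    # first attribute access reads the row
puts blog.reads
```

Reading the key never touches the store; only a data attribute does, and only once.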
You can generate a collection of unloaded instances by passing multiple arguments to []:
class BlogsController < ActionController::Base
  def recommended
    @blogs = Blog['cassandra', 'nosql']
  end
end
The above will not generate a CQL query, but when you access a property on any of the unloaded Blog instances, Cequel will load data for all of them with a single query. Note that CQL does not allow selecting collection columns when loading multiple records by primary key; only scalar columns will be loaded.
There is another use for unloaded instances: you may set attributes on an unloaded instance and call save without ever actually reading the row from Cassandra. Because Cassandra is optimized for writing data, this "write without reading" pattern gives you maximum efficiency, particularly if you are updating a large number of records.
Cassandra supports three types of collection columns: lists, sets, and maps. Collection columns can be manipulated using atomic collection mutation; e.g., you can add an element to a set without knowing the existing elements. Cequel supports this by exposing collection objects that keep track of their modifications, and which then persist those modifications to Cassandra on save.
Let's add a category set to our post model:
class Post
  include Cequel::Record
  belongs_to :blog
  key :id, :uuid
  column :title, :text
  column :body, :text
  set :categories, :text
end
If we were to then update a post like so:
@post = Blog[current_subdomain].posts[params[:id]]
@post.categories << 'Kittens'
@post.save!
Cequel would send the CQL equivalent of "Add the category 'Kittens' to the post at the given (blog_subdomain, id)", without ever reading the saved value of the categories set.
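As a rough illustration of a collection that records its own modifications, here is a plain-Ruby sketch. TrackedSet and its to_cql method are hypothetical and do not reflect Cequel's internals; the generated statement uses the CQL set-append form (column = column + {...}).

```ruby
require 'set'

# Toy set wrapper that records additions instead of holding the full contents.
class TrackedSet
  def initialize(column)
    @column = column
    @added = Set.new
  end

  # Records the element; no existing values are needed or read.
  def <<(element)
    @added << element
    self
  end

  # Builds an append-only UPDATE for the recorded additions, or nil if none.
  def to_cql(table, key_columns)
    return nil if @added.empty?

    literal = @added.map { |e| "'#{e.gsub("'", "''")}'" }.join(', ')
    where = key_columns.map { |column| "#{column} = ?" }.join(' AND ')
    "UPDATE #{table} SET #{@column} = #{@column} + {#{literal}} WHERE #{where}"
  end
end

categories = TrackedSet.new(:categories)
categories << 'Kittens'
puts categories.to_cql(:posts, %i[blog_subdomain id])
```

The statement expresses the addition relative to whatever the set currently holds, which is why no read is required first.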
Cassandra supports secondary indexes, although with notable restrictions:
- Only scalar data columns can be indexed; key columns and collection columns cannot.
- A secondary index consists of exactly one column.
- Though you can have more than one secondary index on a table, you can only use one in any given query.
Cequel supports the :index option to add secondary indexes to column definitions:
class Post
  include Cequel::Record
  belongs_to :blog
  key :id, :uuid
  column :title, :text
  column :body, :text
  column :author_id, :uuid, index: true
  set :categories, :text
end
Defining a column with a secondary index adds several "magic methods" for using the index:
Post.with_author_id(id) # returns a record set scoped to that author_id
Post.find_by_author_id(id) # returns the first post with that author_id
Post.find_all_by_author_id(id) # returns an array of all posts with that author_id
You can also call the where method directly on record sets:
Post.where(author_id: id)
Cassandra supports tunable consistency, allowing you to choose the right balance between query speed and consistent reads and writes. Cequel supports consistency tuning for reads and writes:
Post.new(id: 1, title: 'First post!').save!(consistency: :all)
Post.consistency(:one).find_each { |post| puts post.title }
Both read and write consistency default to QUORUM.
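For a sense of what these levels mean numerically: a quorum in Cassandra is a majority of replicas, floor(replication_factor / 2) + 1, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. The helper names below are hypothetical and only work through that arithmetic.

```ruby
# Majority of replicas for a given replication factor.
def quorum(replication_factor)
  replication_factor / 2 + 1
end

# True when every read set must intersect every write set.
def overlapping?(read_replicas, write_replicas, replication_factor)
  read_replicas + write_replicas > replication_factor
end

rf = 3
puts quorum(rf)                             # replicas needed for QUORUM
puts overlapping?(quorum(rf), quorum(rf), rf) # QUORUM reads with QUORUM writes
puts overlapping?(1, 1, rf)                   # ONE reads with ONE writes
```

With a replication factor of 3, QUORUM on both sides overlaps, while ONE on both sides does not.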
Cassandra supports frame compression, which can give you a performance boost if your requests or responses are large. To enable it, specify the client_compression option in config/cequel.yml:
development:
  host: '127.0.0.1'
  port: 9042
  keyspace: Blog
  client_compression: :lz4
Cequel supports ActiveModel functionality, such as callbacks, validations, dirty attribute tracking, naming, and serialization. If you're using Rails 3, mass-assignment protection works as usual, and in Rails 4, strong parameters are treated correctly. So we can add some extra ActiveModel goodness to our post model:
class Post
  include Cequel::Record
  belongs_to :blog
  key :id, :uuid
  column :title, :text
  column :body, :text
  validates :body, presence: true
  after_save :notify_followers
end
Note that validations or callbacks that need to read data attributes will cause unloaded models to load their row during the course of the save operation, so if you are following a write-without-reading pattern, you will need to be careful.
Dirty attribute tracking is only enabled on loaded models.
Cequel 0.x targeted CQL2, which has a substantially different data representation from CQL3. Accordingly, upgrading from Cequel 0.x to Cequel 1.0 requires some changes to your data models.
Upgrading from a Cequel::Model class is fairly straightforward; simply add the compact_storage directive to your class definition:
# Model definition in Cequel 0.x
class Post
  include Cequel::Model
  key :id, :uuid
  column :title, :text
  column :body, :text
end
# Model definition in Cequel 1.0
class Post
  include Cequel::Record
  key :id, :uuid
  column :title, :text
  column :body, :text
  compact_storage
end
Note that the semantics of belongs_to and has_many are completely different between Cequel 0.x and Cequel 1.0; if you have data columns that reference keys in other tables, you will need to hand-roll those associations for now.
CQL3 does not have a direct "wide row" representation like CQL2, so the Dictionary class does not have a direct analog in Cequel 1.0. Instead, each (row key, map key, value) tuple in a Dictionary corresponds to a single row in CQL3. Upgrading a Dictionary to Cequel 1.0 involves defining two primary key columns and a single data column, again using the compact_storage directive:
# Dictionary definition in Cequel 0.x
class BlogPosts < Cequel::Model::Dictionary
  key :blog_id, :uuid
  maps :uuid => :text

  private

  def serialize_value(column, value)
    value.to_json
  end

  def deserialize_value(column, value)
    JSON.parse(value)
  end
end
# Equivalent model in Cequel 1.0
class BlogPost
  include Cequel::Record
  key :blog_id, :uuid
  key :id, :uuid
  column :data, :text
  compact_storage

  def data
    JSON.parse(read_attribute(:data))
  end

  def data=(new_data)
    write_attribute(:data, new_data.to_json)
  end
end
Cequel::Model::Dictionary did not infer a pluralized table name, as Cequel::Model did and Cequel::Record does. If your legacy Dictionary table has a singular table name, add self.table_name = :blog_post to the model definition.
Note that you will want to run ::synchronize_schema on your models when upgrading; this will not change the underlying data structure, but will add some CQL3-specific metadata to the table definition which will allow you to query it.
CQL is designed to be immediately familiar to those of us who are used to working with SQL, which is all of us. Cequel advances this spirit by providing an ActiveRecord-like mapping for CQL. However, Cassandra is very much not a relational database, so some behaviors can come as a surprise. Here's an overview.
Perhaps the most surprising fact about CQL is that INSERT and UPDATE are essentially the same thing: both simply persist the given column data at the given key(s). So, you may think you are creating a new record, but in fact you're overwriting data at an existing record:
# I'm just creating a blog here.
blog1 = Blog.create!(
  subdomain: 'big-data',
  name: 'Big Data',
  description: 'A blog about all things big data')
# And another new blog.
blog2 = Blog.create!(
  subdomain: 'big-data',
  name: 'The Big Data Blog')
Living in a relational world, we'd expect the second statement to throw an error because the row with key 'big-data' already exists. But not Cassandra: the above code will just overwrite the name in that row. Note that the description will not be touched by the second statement; upserts only work on the columns that are given.
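The upsert behavior can be modeled in a few lines of plain Ruby. This is a toy in-memory sketch, not a Cassandra client; the UpsertTable class is hypothetical and exists only to show that a write merges the given columns into whatever is already stored at the key.

```ruby
# Toy table where every write is an upsert keyed by primary key.
class UpsertTable
  def initialize
    @rows = {}
  end

  # Creates the row if absent; otherwise overwrites only the given columns.
  def write(key, columns)
    (@rows[key] ||= {}).merge!(columns)
  end

  def read(key)
    @rows[key]
  end
end

blogs = UpsertTable.new
blogs.write('big-data', { name: 'Big Data', description: 'A blog about all things big data' })
blogs.write('big-data', { name: 'The Big Data Blog' }) # no error, no new row
puts blogs.read('big-data').inspect
```

After the second write, the name is replaced and the description from the first write is still present.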
Counting is not the same as in a relational database: it can have a much longer runtime and can put unexpected load on your cluster. As a result, Cequel does not support this feature. It is still possible to execute raw CQL to get the counts, should you require this functionality:
MyModel.connection.execute('select count(*) from table_name;').first['count']
Rails:
- 5.0
- 4.2
- 4.1
- 4.0
Ruby:
- 2.3, 2.2, 2.1, 2.0
Cassandra:
- 2.1.x
- 2.2.x
- 3.0.x
Breaking changes:
- Dropped support for JRuby, due to bugs in JRuby that are difficult to work around. PRs to restore JRuby compatibility are welcome.
If you find a bug, feel free to open an issue on GitHub. Pull requests are most welcome.
For questions or feedback, hit up our mailing list at [email protected] or find outoftime in the #cassandra IRC channel on Freenode.
See CONTRIBUTING.md
Cequel was written by an awesome lot. Thanks to you all.
Special thanks to Brewster, which supported the 0.x releases of Cequel.
If you're new to Cassandra, check out Learning Apache Cassandra, a hands-on guide to Cassandra application development by example, written by the creator of Cequel.
Cequel is distributed under the MIT license. See the attached LICENSE for all the sordid details.