Partition key vs Clustering key

The First Table

CREATE KEYSPACE "my_status" WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 };
USE "my_status";
CREATE TABLE "users"("username" text PRIMARY KEY, "email" text, "encrypted_password" blob);

As in a relational database, the columns available to a table are defined in advance. New columns cannot be added on the fly when inserting data, although it's possible to update an existing table's schema later.
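For instance, a column can be added to an existing table with ALTER TABLE. Here is a minimal sketch; the "location" column is hypothetical and not used in the rest of the example:

-- add a new column; existing rows simply have no value for it yet
ALTER TABLE "users" ADD "location" text;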

Every table defines one or more columns to act as the primary key; each row is uniquely identified by the value(s) in its primary key column(s), and those columns cannot be left blank in any row.
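As a quick illustration (a sketch only; the email value is made up and this row is not part of the data inserted below), an INSERT that omits the primary key column is rejected:

-- rejected with an InvalidRequest error: the partition key column "username" must be supplied
INSERT INTO "users" ("email") VALUES ('eve@gmail.com');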

INSERT INTO "users"("username", "email", "encrypted_password") VALUES('alice', 'alice@gmail.com', 0x8914977ed729792e403da53024c6069a9158b8c4);
INSERT INTO "users"("username", "email", "encrypted_password") VALUES('bob', 'alice@gmail.com', 0x8914977ed729792e403da53024c6069a9158b8c4);
INSERT INTO "users"("username", "email", "encrypted_password") VALUES('carol','carol@gmail.com',0xed3d8299b191b59b7008759a104c10af3db6e63a);
INSERT INTO "users"("username", "email", "encrypted_password") VALUES ('dave','dave@gmail.com',0x6d1d90d92bbab0012270536f286d243729690a5b);

SELECT token("username"), username, email, encrypted_password FROM "users";

 system.token(username) | username | email           | encrypted_password
------------------------+----------+-----------------+--------------------------------------------
   -5396685590450884643 |      bob | alice@gmail.com | 0x8914977ed729792e403da53024c6069a9158b8c4
   -4493667438046306776 |     dave |  dave@gmail.com | 0x6d1d90d92bbab0012270536f286d243729690a5b
   -3169904368870211108 |    carol | carol@gmail.com | 0xed3d8299b191b59b7008759a104c10af3db6e63a
    5699955792253506986 |    alice | alice@gmail.com | 0x8914977ed729792e403da53024c6069a9158b8c4

(4 rows)

You'll notice that bob is returned before alice. Clearly, rows are not returned in lexical order of their primary key.

SELECT * FROM "users" WHERE "username" > 'dave' LIMIT 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

The reason that our attempted query doesn't work is that, as we noticed before, the primary keys in our users table are not stored in lexical order; Cassandra can only return rows in the order in which they are stored.

The actual ordering is determined by the token of the primary key—the way the token is calculated is opaque to us, but Cassandra lets us use the token() function to retrieve the token for a given value:

SELECT * FROM "users" WHERE token("username") > token('dave') LIMIT 2;

 username | email           | encrypted_password
----------+-----------------+--------------------------------------------
    carol | carol@gmail.com | 0xed3d8299b191b59b7008759a104c10af3db6e63a
    alice | alice@gmail.com | 0x8914977ed729792e403da53024c6069a9158b8c4

(2 rows)

Using this technique, we can paginate over arbitrarily large tables using multiple queries.
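As a minimal sketch of that pagination loop (page size 2, using the data above), a client repeats the token-based query, feeding the last username of each page back into token():

-- first page: no token restriction
SELECT * FROM "users" LIMIT 2;
-- next page: only rows whose token is greater than that of the last row seen ('dave' above)
SELECT * FROM "users" WHERE token("username") > token('dave') LIMIT 2;
-- repeat until a page comes back with fewer rows than the limit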

Creating a table with a compound primary key

CREATE TABLE user_status_updates("username" text, "id" timeuuid, "body" text, PRIMARY KEY ("username", "id"));

This means that each row is identified uniquely by the combination of its username and id columns. It also means that every row must have a value in both of these columns.

INSERT INTO user_status_updates("username", "id", "body") VALUES ('alice', 76e7a4d0-e796-11e3-90ce-5f98e903bf02, 'Learning Cassandra!');
INSERT INTO user_status_updates("username", "id", "body") VALUES('bob', 97719c50-e797-11e3-90ce-5f98e903bf02, 'Eating a tasty sandwich.');

Cassandra is aware of the structure of a timeuuid and is able to both convert timestamps into UUIDs and extract the creation timestamp from a UUID. As we'll soon see, Cassandra can also sort our rows by their creation time using the timestamps encoded in the UUIDs.
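The reverse direction, turning a timestamp into a timeuuid, is handy for range queries. As a sketch (the cutoff date is illustrative), the MAXTIMEUUID function produces the largest possible timeuuid for a given timestamp, so it can serve as a lower bound:

-- status updates for alice created after the (illustrative) cutoff
SELECT * FROM "user_status_updates"
WHERE "username" = 'alice' AND "id" > MAXTIMEUUID('2014-05-01 00:00:00+0000');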

SELECT "username", "id", "body", DATEOF("id") FROM "user_status_updates";

 username | id                                   | body                     | system.dateof(id)
----------+--------------------------------------+--------------------------+---------------------------------
      bob | 97719c50-e797-11e3-90ce-5f98e903bf02 | Eating a tasty sandwich. | 2014-05-30 01:13:21.813000+0000
    alice | 76e7a4d0-e796-11e3-90ce-5f98e903bf02 |      Learning Cassandra! | 2014-05-30 01:05:17.725000+0000

(2 rows)

The DATEOF function instructs Cassandra to return a result column containing the timestamp at which the given column's UUID value was created.

SELECT * FROM "user_status_updates" WHERE "username" = 'alice' AND "id" = 76e7a4d0-e796-11e3-90ce-5f98e903bf02;

 username | id                                   | body
----------+--------------------------------------+---------------------
    alice | 76e7a4d0-e796-11e3-90ce-5f98e903bf02 | Learning Cassandra!

(1 rows)
INSERT INTO "user_status_updates" ("username", "id", "body") VALUES ('alice', NOW(), 'Alice Update 1');
INSERT INTO "user_status_updates" ("username", "id", "body") VALUES ('bob', NOW(), 'Bob Update 1');
INSERT INTO "user_status_updates" ("username", "id", "body") VALUES ('alice', NOW(), 'Alice Update 2');
INSERT INTO "user_status_updates" ("username", "id", "body") VALUES ('bob', NOW(), 'Bob Update 2');
INSERT INTO "user_status_updates" ("username", "id", "body") VALUES ('alice', NOW(), 'Alice Update 3');
INSERT INTO "user_status_updates" ("username", "id", "body") VALUES ('bob', NOW(), 'Bob Update 3');

The UNIXTIMESTAMPOF function returns the timestamp encoded in the UUID, represented as the number of milliseconds since midnight UTC on January 1, 1970.

SELECT "username", "id", "body", UNIXTIMESTAMPOF("id") FROM "user_status_updates";

 username | id                                   | body                     | system.unixtimestampof(id)
----------+--------------------------------------+--------------------------+----------------------------
      bob | 97719c50-e797-11e3-90ce-5f98e903bf02 | Eating a tasty sandwich. |              1401412401813
      bob | bfe68db0-8010-11e8-bbfe-55b594efc7b0 |             Bob Update 1 |              1530767006219
      bob | bfec0bf0-8010-11e8-bbfe-55b594efc7b0 |             Bob Update 2 |              1530767006255
      bob | bff13c10-8010-11e8-bbfe-55b594efc7b0 |             Bob Update 3 |              1530767006289
    alice | 76e7a4d0-e796-11e3-90ce-5f98e903bf02 |      Learning Cassandra! |              1401411917725
    alice | bfe41cb0-8010-11e8-bbfe-55b594efc7b0 |           Alice Update 1 |              1530767006203
    alice | bfe973e0-8010-11e8-bbfe-55b594efc7b0 |           Alice Update 2 |              1530767006238
    alice | bfee7cf0-8010-11e8-bbfe-55b594efc7b0 |           Alice Update 3 |              1530767006271

(8 rows)
  • First, we can see that the results are grouped by username: all the status updates of bob appear first, followed by all the status updates of alice.
  • Second, within each user's status updates, the updates are returned in ascending order of the timestamp of the row's id column. Looking at the rightmost column in the results, we see that the timestamps monotonically increase for each user.

Anatomy of a compound primary key

While the actual ordering of username is opaque, the ordering of id is meaningfully related to the information encoded in the id column.

In the lexicon of Cassandra, username is a partition key. A table's partition key groups rows together into logically related bundles. In the case of our MyStatus application, each user's timeline is a self-contained data structure, so partitioning the table by user is a sound strategy.

We call the id column a clustering column. The job of a clustering column is to determine the ordering of rows within a partition. This is why we observed that within each user's status updates, the rows were returned in a strictly ascending order by timestamp of the id. This is a very useful property since our application will want to display status updates ordered by the creation time.
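As a sketch of that access pattern: once the partition key is fixed with an equality relation, the clustering order can be reversed at query time to show the newest updates first.

-- newest-first timeline for a single user
SELECT "body", DATEOF("id") FROM "user_status_updates"
WHERE "username" = 'alice'
ORDER BY "id" DESC LIMIT 3;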

All Cassandra tables must have a partition key; clustering columns are optional. In the users table, each row is its own tiny partition; no row is grouped with any other.

Beyond two columns

A table can define one or more partition key columns and zero or more clustering columns.

Multiple clustering columns

CREATE TABLE user_status_updates_by_datetime("username" text, "status_date" date, "status_time" time, "body" text,
  PRIMARY KEY ("username", "status_date", "status_time"));

Rows for a particular username are clustered by both columns, first by status_date and then by status_time.

INSERT INTO "user_status_updates_by_datetime" ("username", "status_date", "status_time", "body")
VALUES ('alice', '2016-11-18', '08:30:55.123', 'Alice Update 1');

INSERT INTO "user_status_updates_by_datetime" ("username", "status_date", "status_time", "body")
VALUES ('alice', '2016-11-18', '14:40:25.123456789', 'Alice Update 2');

INSERT INTO "user_status_updates_by_datetime" ("username", "status_date", "status_time", "body")
VALUES ('alice', '2016-11-19', '08:25:25', 'Alice Update 3');

INSERT INTO "user_status_updates_by_datetime" ("username", "status_date", "status_time", "body")
VALUES ('alice', '2016-11-21', '08:35:55.123456', 'Alice Update 4');

INSERT INTO "user_status_updates_by_datetime" ("username", "status_date", "status_time", "body")
VALUES ('alice', '2016-11-21', '14:30:15.123', 'Alice Update 5');

INSERT INTO "user_status_updates_by_datetime" ("username", "status_date", "status_time", "body")
VALUES ('alice', '2016-11-23', '14:50:45.123456', 'Alice Update 6');

SELECT * FROM "user_status_updates_by_datetime";

 username | status_date | status_time        | body
----------+-------------+--------------------+----------------
    alice |  2016-11-18 | 08:30:55.123000000 | Alice Update 1
    alice |  2016-11-18 | 14:40:25.123456789 | Alice Update 2
    alice |  2016-11-19 | 08:25:25.000000000 | Alice Update 3
    alice |  2016-11-21 | 08:35:55.123456000 | Alice Update 4
    alice |  2016-11-21 | 14:30:15.123000000 | Alice Update 5
    alice |  2016-11-23 | 14:50:45.123456000 | Alice Update 6

(6 rows)

You will notice that all the status updates for the user alice are ordered first by status_date and then, within each status_date value, by status_time.

SELECT * FROM "user_status_updates_by_datetime" WHERE "username" = 'alice' AND "status_date" < '2016-11-20';

 username | status_date | status_time        | body
----------+-------------+--------------------+----------------
    alice |  2016-11-18 | 08:30:55.123000000 | Alice Update 1
    alice |  2016-11-18 | 14:40:25.123456789 | Alice Update 2
    alice |  2016-11-19 | 08:25:25.000000000 | Alice Update 3

(3 rows)

SELECT * FROM "user_status_updates_by_datetime" WHERE "username" = 'alice' AND "status_date" > '2016-11-20' AND "status_time" > '12:00:00';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "status_time" cannot be restricted (preceding column "status_date" is restricted by a non-EQ relation)"

This simply means that if you are using an inequality relation on a clustering column, all the preceding clustering columns need to have an equality relation.

SELECT * FROM "user_status_updates_by_datetime" WHERE "username" = 'alice' AND "status_date" = '2016-11-21' AND "status_time" > '12:00:00';

 username | status_date | status_time        | body
----------+-------------+--------------------+----------------
    alice |  2016-11-21 | 14:30:15.123000000 | Alice Update 5

(1 rows)

Composite partition keys

There are cases where a single partition can grow well beyond the disk capacity of a Cassandra instance, or can accumulate more than 100,000 cells, which degrades the performance of some of Cassandra's internal operations as well as of read queries against that partition. To avoid this situation, Cassandra provides the concept of composite partition keys.

A composite partition key can split up a very large internal partition into multiple internal partitions.

CREATE TABLE "status_update_replies" (
  "status_update_username" text,
  "status_update_id" timeuuid,
  "id" timeuuid,
  "author_username" text,
  "body" text,
  PRIMARY KEY (
    ("status_update_username", "status_update_id"),
    "id"
  )
);

Without the extra parentheses, Cassandra assumes, by default, that only the first column in the primary key is a partition key, and the remaining columns are clustering columns.
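To see the contrast, here is a sketch of the same columns declared without the inner parentheses; this is a hypothetical variant, not a table we use again. Here status_update_username alone would be the partition key, and status_update_id and id would become clustering columns:

CREATE TABLE "status_update_replies_by_user" (
  "status_update_username" text,
  "status_update_id" timeuuid,
  "id" timeuuid,
  "author_username" text,
  "body" text,
  -- only the first column is the partition key here
  PRIMARY KEY ("status_update_username", "status_update_id", "id")
);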

INSERT INTO status_update_replies("status_update_username",  "status_update_id", "id", "author_username", "body")
VALUES('alice', 76e7a4d0-e796-11e3-90ce-5f98e903bf02, NOW(), 'bob', 'Reply1');

INSERT INTO status_update_replies("status_update_username",  "status_update_id", "id", "author_username", "body")
VALUES('alice', 97719c50-e797-11e3-90ce-5f98e903bf02, NOW(), 'bob', 'Reply2');

SELECT * FROM "status_update_replies" WHERE "status_update_username" = 'alice';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

SELECT * FROM "status_update_replies" WHERE "status_update_username" = 'alice' AND  "status_update_id" = 97719c50-e797-11e3-90ce-5f98e903bf02;

 status_update_username | status_update_id                     | id                                   | author_username | body
------------------------+--------------------------------------+--------------------------------------+-----------------+--------
                  alice | 97719c50-e797-11e3-90ce-5f98e903bf02 | b98df670-8014-11e8-bbfe-55b594efc7b0 |             bob | Reply2

(1 rows)

You will notice that the first query fails because it does not specify an equality relation for both parts of the partition key in the WHERE clause. Fetching all the replies associated with status updates made by the username alice could span many partitions, so Cassandra refuses to run the query (without ALLOW FILTERING) because its performance would be unpredictable.

Structure of composite partition key tables

Partitioning is based on a hash of the combined values of all the columns in the partition key, so every unique combination of partition key values belongs to its own partition. This is why we must specify values for all the partition key columns when querying a table with a composite partition key.
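As a quick sketch, the token() function can also be applied to a composite partition key; it takes all the partition key columns, in order, and returns the hash that each combination maps to:

-- the token is computed over the full composite partition key
SELECT token("status_update_username", "status_update_id"), "body" FROM "status_update_replies";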

Composite partition key with multiple clustering columns

CREATE TABLE "status_update_replies" (
  "status_update_username" text,
  "status_update_id" timeuuid,
   "status_date" date,
   "status_time" time,
  "id" timeuuid,
  "author_username" text,
  "body" text,
  PRIMARY KEY (
    ("status_update_username", "status_update_id"),
    "status_date", "status_time", "id"
  )
);
  • In the preceding table, the partition key is a combination of status_update_username and status_update_id.
  • The replies are clustered by status_date, then status_time, and then id, so a query can fix the partition key with equality relations and range over status_date, as the sketch below shows.
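A minimal sketch of such a query (all values are illustrative):

-- equality on the full composite partition key, then a range on the first clustering column
SELECT * FROM "status_update_replies"
WHERE "status_update_username" = 'alice'
  AND "status_update_id" = 76e7a4d0-e796-11e3-90ce-5f98e903bf02
  AND "status_date" >= '2016-11-18';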
