Cassandra

Background

設計一套DB來符合FB日益成長(2008)的使用者數目與data volume，在把Inbox功能migrate到Cassandra之前，資料原本是存在MySQL。

MySQL是傳統的RDBMS，scalability能力不好，而且大部分業務只需要對一個"key"做處理。

Data Model

以row key為單位做sharding，在replica上每個key的操作都是atomic。每個row包含structured columns，column組成column family，column family有兩種:Simple跟Super(column family in column family)。

Application可以指定column要根據time或是name來sort，這樣可以保證data locality。

資料的存放跟Dynamo一樣，都是使用consistent hashing，由第一個在ring上遇到的node當coordinator。Write的時候只要有quorum的replica成功就算成功，read的時候如果不需要strong consistency，就由最近的replica出結果，否則會回傳quorum的data給application。

Partitioning

同上，使用consistent hashing，由第一個遇到的node當coordinator。實務上，會有data hotspot的問題，需要處理某些node特別忙碌的情況，Dynamo選擇把一個physical node弄成多個virtual node分布在ring上面，而Cassandra選擇根據loading把負擔輕的node移到負擔重的node附近分攤流量。

Replication

每個data item可以設定要replicate幾份(假設N份)，在write的時候，由coordinator來負責寫入其他N-1份。Cassandra提供幾種分散方式：Rack Unaware、Rack aware和Datacenter Aware。Application Developer根據自身在performance和data persistency的需求來決定data分散的方式。

Node之間的資訊(node metadata)則存在ZooKeeper，node彼此溝通是根據gossip的演算法在溝通。

Membership

Failure Detection

使用accural failure detector，是一套根據收到的message來計算其他node是否down的演算法，底層需要一個統計分佈，此篇paper說用的是exponential distribution，但不確定之後有沒有改掉。

Bootstraping

Node剛加入cluster的時候，會得到一個token，此token代表在ring的位置，然後會去跟ZooKeeper註冊和得到其他node的資訊，然後用gossip的方式把新結點加入的消息傳出去。

Local Persistence

Paper沒有把data structure說得很清楚，但官方文件關於storage engine有提到SSTable。

根據paper的描述與網路查到的文件，看起來就是LSM-tree和SSTable，也就是log file只提供append，當memtable設定的threashold達到就archive輸出成sstable，然後從memtable開始，如果找不到就找SSTable。

然後有一隻背景程式會不斷的compact這些SSTable files，清出空間並且把被刪除的row移除。

為了加快搜尋，也搭配Bloom Filter來先判斷一個key是否存在於資料中。

Cassandra

Background​

Data Model​

Partitioning​

Replication​

Membership​

Failure Detection​

Bootstraping​

Local Persistence​