Cisco Meeting Server - Database Clustering


Two Node Database Clustering (Don't Do It!)

Recently a few people architecting Cisco Videoconferencing (VC) solutions have asked the question, "how many servers do I need to cluster the CMS database? " or similar. Given the size of the deployments are relatively small the answer is always three and hopefully the following will help solidify why that is. It's also worth noting that there is currently a limit of five database nodes in a cluster.

As per Cisco documentation regarding the database.

"The Call Bridge reads from and writes to the database storing the space information, for example, the members of spaces, chat messages occurring between members of a space, recent activity within a space."

The database is also where spaces and users you create get inserted. Now back to the three answer. The CMS database cluster as per the deployment guide

"creates what is essentially an 'online' backup of the running database which is maintained as the system runs. It also provides the ability to move to using the backup in an automated fashion in the event of a failure being detected."

Excellent, continuing with the description

"Database clustering does not do any kind of load balancing, or any caching. Nor does it perform any sharding of data for more efficient local access within any kind of geographically distributed arrangement. All queries are directed at the current master, where ever it is. The replicas are not available as read-only instances."

I have bolded the most important part of this description. Essentially if there is no current master, the replica's are all stuck in read-only mode, and no inserts to the database will be possible. I have configured a two-node cluster, to demonstrate the behaviour and why you should not use a two-node configuration. Firstly the cluster is operational.

Simulating a fault, by disconnecting the network connection from the current master both servers now detect the communication failure between the cluster nodes. As there are only two nodes neither can confirm they can see more than 50% of the cluster, so both of them go read-only.

Trying to now make a change that would cause a database insert on the still reachable CMS node results in an error as can be seen below.

From the Syslog output, you can clearly see the SQL insert failure also.

Apr 2 00:38:42 local0.warning CMS1 postgres[15926]: [2-1] 2017-04-02 00:38:42 UTC 192.168.44.122(37122) ERROR: cannot execute INSERT in a read-only transaction
Apr 2 00:38:42 local0.warning CMS1 postgres[15926]: [2-2] 2017-04-02 00:38:42 UTC 192.168.44.122(37122) CONTEXT: SQL statement "INSERT INTO cospaces (cospace_id, name, tenant_id, call_profile_id, conferencing_profile_id, cdr_tag, call_branding_profile_id, client_call_branding_profile_id, non_member_access, owner_id, stream_url) VALUES (inp_guid, inp_name, tenant_id_to_use, inp_call_profile_id, inp_conferencing_profile_id, cdr_tag_to_use, inp_call_branding_profile_id, inp_client_call_branding_profile_id, inp_non_member_access, found_owner_id, inp_stream_url)"
Apr 2 00:38:42 local0.warning CMS1 postgres[15926]: [2-3] #011PL/pgSQL function create_cospace(text,uuid,text,uuid,uuid,text,text,boolean,text,text,text,uuid,text,text,text,uuid,uuid,uuid,text,integer,uuid,uuid,text,boolean,text,uuid,text) line 87 at SQL statement
Apr 2 00:38:42 local0.warning CMS1 postgres[15926]: [2-4] 2017-04-02 00:38:42 UTC 192.168.44.122(37122) STATEMENT: SELECT op_cospace_request_status FROM create_cospace($1, $2, $3, $4, $5, lower($6), $7, $8, $9, lower($10), $11, $12, lower($13), $14, $15, $16, $17, $18, $19, NULL, $20, $21, $23, $24, $25, $26, $22)
Apr 2 00:38:42 user.err CMS1 host:server: ERROR : cospace operation failed with PGRES_FATAL_ERROR (ERROR: cannot execute INSERT in a read-only transaction#012CONTEXT: SQL statement "INSERT INTO cospaces (cospace_id, name, tenant_id, call_profile_id, conferencing_profile_id, cdr_tag, call_branding_profile_id, client_call_branding_profile_id, non_member_access, owner_id, stream_url) VALUES (inp_guid, inp_name, tenant_id_to_use, inp_call_profile_id, inp_conferencing_profile_id, cdr_tag_to_use, inp_call_branding_profile_id, inp_client_call_branding_pro
Apr 2 00:38:42 user.warning CMS1 host:server: WARNING : error database result status 7 (ERROR: cannot execute INSERT in a read-only transaction#012CONTEXT: SQL statement "INSERT INTO cospaces (cospace_id, name, tenant_id, call_profile_id, conferencing_profile_id, cdr_tag, call_branding_profile_id, client_call_branding_profile_id, non_member_access, owner_id, stream_url) VALUES (inp_guid, inp_name, tenant_id_to_use, inp_call_profile_id, inp_conferencing_profile_id, cdr_tag_to_use, inp_call_branding_profile_id, inp_client_call_branding_profile_id, inp_no

Once the master comes back online, the database will again be writeable. However if there is a major fault and you cannot get the master back online, you are in trouble! Below is a third CMS node that was not part of the original cluster. Trying to join it to the one remaining node will fail as there is no master to attach to. The syslog shows the following output after running database cluster join 192.168.44.122.

Apr 2 00:47:08 user.info CMS3 sfpool: Health check 192.168.44.123: error (up = 0): could not connect to server: No route to host|#011Is the server running on host "192.168.44.123" and accepting|#011TCP/IP connections on port 5432?|
Apr 2 00:47:11 user.info CMS3 dbcluster_watcher: Error: Failed to find master.
Apr 2 00:47:11 user.info CMS3 dbcluster_watcher: Operation Complete

Checking the cluster status on the third node also shows the fault.

It is possible (confirmed in the lab) to initialize the remaining node as a new cluster with database cluster initialize, causing it to create a new cluster from its current database. However, I would not recommend relying on this. Doing this you can then join another node to the cluster (I suggest two, so the result is a three node cluster!) and you will be back up and running with the original database contents. If you tried to join another node back to the cluster before doing this it would get stuck attaching; in the lab to get out of this situation a factory_reset full was required. 

If for whatever reason a three node cluster is not possible, use a single database node and initialize the cluster on it, then connect your second call bridge to it with database cluster connect (not join!). The configuration is then consistent across the call bridges but stored on a single master. Finally, make sure you take regular backups in any mode of deployment.

Helpful links: