Troubleshooting OAuth Token Migration Issues

Cassandra Write Timeout Issue

When the Migration utility runs in a cloud environment (for example, AWS or GCP), it may sometimes fail with the following error after writing a number of tokens to Cassandra.
com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during SIMPLE write query at consistency LOCAL_QUORUM (2 replica were required but only 1 acknowledged the write)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:88)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:66)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:297)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:268)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    ... 25 common frames omitted

This error occurs in cloud cluster setups when a write operation takes longer than the write timeout defined in cassandra.yaml (write_request_timeout_in_ms, default value 2000 ms, that is, 2 seconds).

In this case, run the utility again. On the next run, the utility should resume the write operation.
Note: If the application does not exit after throwing the exception, press Ctrl-C. This exits the application and writes the checkpoint file with the details of the last run.
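
When reviewing the utility's logs, it can help to pull the consistency level and replica counts out of the exception message. The following is a minimal sketch in Python, assuming the message has the same wording as the example above; the function name parse_write_timeout is hypothetical and not part of the Migration utility or the driver.

```python
import re

# Pattern for the WriteTimeoutException message wording shown in the example above.
_TIMEOUT_RE = re.compile(
    r"during (?P<write_type>\w+) write query at consistency (?P<consistency>\w+) "
    r"\((?P<required>\d+) replica were required but only (?P<acked>\d+) acknowledged"
)


def parse_write_timeout(message):
    """Return write type, consistency level and replica counts, or None if no match."""
    match = _TIMEOUT_RE.search(message)
    if match is None:
        return None
    return {
        "write_type": match.group("write_type"),
        "consistency": match.group("consistency"),
        "required": int(match.group("required")),
        "acknowledged": int(match.group("acked")),
    }


sample = (
    "com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout "
    "during SIMPLE write query at consistency LOCAL_QUORUM "
    "(2 replica were required but only 1 acknowledged the write)"
)
info = parse_write_timeout(sample)
print(info)
```

A gap between the required and acknowledged counts shows how many replicas did not respond within the timeout.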

Cassandra Commands to Check Health of Cassandra Cluster

Check the atokens table statistics to verify the write status:
[root@cass-set-0-0 builder]# nodetool tablestats -H oauth2.atokens;
Total number of tables: 57
----------------
Keyspace : oauth2
    Read Count: 0
    Read Latency: NaN ms
    Write Count: 3954158
    Write Latency: 0.06952324439235862 ms
    Pending Flushes: 0
        Table: atokens
        SSTable count: 4
        Space used (live): 83.21 MiB
        Space used (total): 83.21 MiB
        Off heap memory used (total): 0 bytes
        SSTable Compression Ratio: -1.0
        Number of partitions (estimate): 112
        Memtable cell count: 143
        Memtable data size: 31.12 KiB
        Memtable off heap memory used: 0 bytes
        Memtable switch count: 41
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 242672
        Local write latency: NaN ms
        Pending flushes: 0
        Percent repaired: 100.0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 0 bytes
        Bloom filter off heap memory used: 0 bytes
        Index summary off heap memory used: 0 bytes
        Compression metadata off heap memory used: 0 bytes
        Compacted partition minimum bytes: 0
        Compacted partition maximum bytes: 0
        Compacted partition mean bytes: 0
        Average live cells per slice (last five minutes): NaN
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): NaN
        Maximum tombstones per slice (last five minutes): 0
        Dropped Mutations: 1 bytes
This output tells us the following statistics about the Cassandra cluster health:
  • SSTable count - The number of SSTables that contain data for this table. A high value (roughly more than 10) indicates that compaction is not happening regularly.
  • Write count - A constantly increasing value indicates continuous writes to this table. This is a good indicator that the migration utility is actually writing to Cassandra.
  • Percent repaired - Repair is an important mechanism for making sure copies of the data are distributed around the cluster to meet the specified replication factor. If this value is below 50%, more than half of the data has not been repaired, so it may not be replicated properly and could be lost.
  • Dropped Mutations - The number of mutations (INSERTs, UPDATEs or DELETEs) started on this table but not completed. A large number of dropped mutations means that the node is overloaded.
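
The checks above can be sketched as a small Python script that reads saved tablestats output. This is an illustration only: the function names are hypothetical, it assumes the output covers a single table, and the max_dropped threshold is an arbitrary placeholder because the text above does not define what counts as a large number.

```python
def parse_table_stats(text):
    """Collect 'Name: value' pairs from nodetool tablestats output into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            stats[key.strip()] = value.strip()
    return stats


def health_warnings(stats, max_sstables=10, min_percent_repaired=50.0, max_dropped=100):
    """Apply the rules of thumb listed above; max_dropped is an assumed threshold."""
    warnings = []
    if int(stats["SSTable count"]) > max_sstables:
        warnings.append("SSTable count is high; compaction may not be running regularly")
    if float(stats["Percent repaired"]) < min_percent_repaired:
        warnings.append("Percent repaired is low; data may not be replicated properly")
    # With -H the value carries a unit suffix (for example "1 bytes"); keep the number only.
    if float(stats["Dropped Mutations"].split()[0]) > max_dropped:
        warnings.append("Many dropped mutations; the node may be overloaded")
    return warnings


sample = """\
Keyspace : oauth2
    Write Count: 3954158
        Table: atokens
        SSTable count: 4
        Local write count: 242672
        Percent repaired: 100.0
        Dropped Mutations: 1 bytes
"""
stats = parse_table_stats(sample)
warnings = health_warnings(stats)
print(warnings or "no warnings")
```

To confirm that writes are continuing, compare the Write Count value from two runs taken some time apart.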
Check the write latency of the atokens table:
[root@cass-set-1-0 builder]# nodetool tablehistograms oauth2 atokens
oauth2/atokens histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                 
50%             0.00             73.46              0.00               179                10
75%             0.00            126.93              0.00               179                10
95%             0.00            219.34              0.00               179                10
98%             0.00            263.21              0.00               179                10
99%             0.00            315.85              0.00               179                10
Min             0.00             29.52              0.00               150                 9
Max             0.00        1386179.89              0.00               179                10
This output tells us the following statistics about the Cassandra cluster health:
  • Write latency - In the example output above (taken from a cloud setup), the maximum write latency reached about 1.39 seconds, which is below the configured write_request_timeout_in_ms value of 2 seconds (/etc/cassandra/conf/cassandra.yaml). If the latency exceeds the configured value, a WriteTimeoutException may occur while writing records to Cassandra.
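
The comparison between the maximum write latency and the timeout can be worked through in a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, the column order matches the histogram above, and the timeout is taken to be the 2000 ms default rather than read from the configuration file.

```python
WRITE_TIMEOUT_MS = 2000.0  # assumed: the default write_request_timeout_in_ms


def max_write_latency_micros(histogram_text):
    """Return the Write Latency value (in microseconds) from the Max row."""
    for line in histogram_text.splitlines():
        fields = line.split()
        # Columns: Percentile, SSTables, Write Latency, Read Latency, Partition Size, Cell Count
        if fields and fields[0] == "Max":
            return float(fields[2])
    raise ValueError("no Max row found in the histogram output")


sample = """\
99%             0.00            315.85              0.00               179                10
Min             0.00             29.52              0.00               150                 9
Max             0.00        1386179.89              0.00               179                10
"""
max_ms = max_write_latency_micros(sample) / 1000.0  # microseconds to milliseconds
headroom_ms = WRITE_TIMEOUT_MS - max_ms
print(f"max write latency: {max_ms:.2f} ms, headroom before timeout: {headroom_ms:.2f} ms")
```

A small or negative headroom suggests that write timeouts are likely.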
Check overall node status and heap memory:
[root@cass-set-1-0 builder]# nodetool  info
ID                     : 7d134170-9dea-419f-8cc0-9f5692245269
Gossip active          : true
Thrift active          : false
Native Transport active: true
Load                   : 3.04 MiB
Generation No          : 1554892036
Uptime (seconds)       : 170138
Heap Memory (MB)       : 425.22 / 460.81
Off Heap Memory (MB)   : 0.08
Data Center            : dc2
Rack                   : rack1
Exceptions             : 0
Key Cache              : entries 68, size 5.55 KiB, capacity 23 MiB, 82400 hits, 82665 requests, 0.997 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 11 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache            : entries 33, size 2.06 MiB, capacity 83 MiB, 572 misses, 1288690 requests, 1.000 recent hit rate, NaN microseconds miss latency
Percent Repaired       : 0.0%
Token                  : (invoke with -T/--tokens to see all 32 tokens)
[root@cass-set-1-0 builder]#
This output tells us the following statistics about the Cassandra cluster health:
  • Heap memory - If little of the total heap memory is available, operations may be slow. In this case, use the compactionstats and tablestats commands to determine the root cause.
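
As an illustration, the used share of the heap can be computed from the "Heap Memory (MB)" line of the nodetool info output. The Python sketch below uses a hypothetical function name and assumes the line has the "used / total" layout shown above.

```python
import re


def heap_usage_fraction(info_text):
    """Return used heap divided by total heap from nodetool info output."""
    # Anchor at line start so the "Off Heap Memory (MB)" line is never matched.
    match = re.search(
        r"^Heap Memory \(MB\)\s*:\s*([\d.]+)\s*/\s*([\d.]+)", info_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Heap Memory (MB)' line found")
    used_mb, total_mb = float(match.group(1)), float(match.group(2))
    return used_mb / total_mb


sample = """\
Heap Memory (MB)       : 425.22 / 460.81
Off Heap Memory (MB)   : 0.08
"""
fraction = heap_usage_fraction(sample)
print(f"heap in use: {fraction:.1%}")
```

In the example output, most of the heap is in use, which is the situation the bullet above describes.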
Check if there are pending tasks piling up:
[root@cass-set-0-0 builder]# nodetool compactionstats
pending tasks: 0

This output tells us the following statistics about the Cassandra cluster health:
  • Pending tasks - An estimate of the compaction work still to be done. If it keeps increasing, the compaction strategy may not be suitable, or the node may be low on disk space.
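
One way to tell whether pending tasks are piling up is to record the value from several runs of the command and check the trend. The Python sketch below is illustrative only; the function names are hypothetical and the history list is made-up sample data, not measurements from a real cluster.

```python
import re


def pending_tasks(compactionstats_text):
    """Return the 'pending tasks' number from nodetool compactionstats output."""
    match = re.search(r"^pending tasks:\s*(\d+)", compactionstats_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'pending tasks' line found")
    return int(match.group(1))


def is_steadily_increasing(samples):
    """True when every sample is larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


current = pending_tasks("pending tasks: 0\n")

# Hypothetical values collected from repeated runs, oldest first.
history = [0, 3, 7, 12]
if is_steadily_increasing(history):
    print("pending compactions keep growing; check the compaction strategy and disk space")
```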