Wednesday, August 1, 2018

IoT Durable Queues: What we learned from database transaction locking

Data is everything at Warren Rogers. The reliable transfer and durability of our data are paramount to us and to our clients. When building the next generation of IoT sensors, it was clear that we would need a guaranteed-delivery pipeline that stored data in case of failure: several durable queues holding every event until it was successfully transmitted. Since the core application would be written in Java, Apache Camel backed by ActiveMQ durable queues seemed like the obvious choice.
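As a rough idea of where we started, a Camel route feeding a persistent ActiveMQ queue can be as small as the sketch below; the endpoint and queue names are illustrative, not our production configuration.

import org.apache.camel.builder.RouteBuilder;

// A minimal sketch of a Camel route backed by a durable (persistent) ActiveMQ queue.
// Endpoint and queue names are illustrative only.
public class SensorEventRoute extends RouteBuilder {
    @Override
    public void configure() {
        // The broker stores each sensor event before it is forwarded,
        // so nothing is lost if the upload to the cloud service fails mid-flight.
        from("direct:sensorEvents")
            .to("activemq:queue:sensor.events?deliveryPersistent=true");
    }
}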

As with most IoT sensors, the volume of data was quite large, so these queues were writing to disk constantly. Ironically, in attempting to ensure guaranteed delivery, we were writing to disk so often that it caused premature hard drive failure. It became clear that we would need to balance disk writes against the risk of losing data in an uncontrolled shutdown such as a sudden power loss.

For inspiration, we looked to optimistic versus pessimistic transaction locking in databases. If, in the usual case, a database transaction is likely to succeed before some other process has “dirtied” the underlying records, it is generally more efficient to be optimistic and simply run the transaction rather than take locks beforehand. By analogy, if in the normal case we were simply going to send the data to the cloud successfully and then delete the event from disk, why not assume that the operation would succeed unless there was reason to believe otherwise? And if we did fail, we should no longer assume success and would need to persist to disk, becoming more pessimistic.
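To make the analogy concrete, here is the optimistic pattern in miniature; this is an illustration built around an invented VersionedValue class, not a piece of our database layer.

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: optimistic concurrency against a versioned record.
// Instead of locking the record up front (pessimistic), we read it, do our work,
// and only fail if someone else changed it underneath us in the meantime.
class VersionedValue {
    private volatile String value = "";
    private final AtomicLong version = new AtomicLong();

    long readVersion() { return version.get(); }
    String read()      { return value; }

    // Apply the update only if the record is still at the version we read.
    synchronized boolean tryUpdate(long expectedVersion, String newValue) {
        if (version.get() != expectedVersion) {
            return false;            // someone "dirtied" the record; caller retries or falls back to locking
        }
        value = newValue;
        version.incrementAndGet();
        return true;
    }
}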

This analogy from database transaction locking became the inspiration for the new queue implementation we call the Optimistic Queue. It has two modes: Optimistic, where data is not stored to disk, and Pessimistic, where it is. The key is to tune the queue so that, unless there is a real issue communicating with the cloud service (networking problems, for example), the normal mode is Optimistic. Any failure to transmit an event immediately places the queue in Pessimistic mode and saves all events in the queue to disk.
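A rough sketch of the idea, with illustrative names rather than the actual implementation, might look like this:

import java.util.ArrayDeque;
import java.util.Deque;

// A sketch of the two-mode queue; persistence and transport details are elided
// and the names are illustrative, not the production implementation.
class OptimisticQueue<E> {
    enum Mode { OPTIMISTIC, PESSIMISTIC }

    private Mode mode = Mode.OPTIMISTIC;
    private final Deque<E> pending = new ArrayDeque<>();

    // Called by the sender when a transmission to the cloud fails.
    synchronized void onSendFailure() {
        mode = Mode.PESSIMISTIC;
        pending.forEach(this::persist);    // assume the worst: flush everything in memory to disk
    }

    private void persist(E event) { /* write-ahead to local storage */ }
}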

In addition, we came up with a couple of useful dials or configuration parameters: maximumQueueSize and queueSizeToReturnToOptimistic.

The first, maximumQueueSize, is the size at which, even with a healthy connection, we are holding so much data that a catastrophic event (an uncontrolled shutdown, for instance) would be too costly. It also catches a degraded connection, where the cloud service turns out to be too slow a consumer for some reason. On entering Pessimistic mode, the queue persists everything it is currently holding and begins persisting any incoming events.
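Continuing the sketch above (and assuming maximumQueueSize is a configured field on the class), the enqueue path might grow a size check:

// Even with a healthy connection, too large a backlog means an uncontrolled
// shutdown would cost too much, so we go pessimistic preemptively.
synchronized void enqueue(E event) {
    pending.addLast(event);
    if (mode == Mode.OPTIMISTIC && pending.size() >= maximumQueueSize) {
        mode = Mode.PESSIMISTIC;
        pending.forEach(this::persist);    // persist the backlog and every subsequent arrival
    } else if (mode == Mode.PESSIMISTIC) {
        persist(event);
    }
}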

Now assume we are in Pessimistic mode but events are once again being transmitted successfully to the server. We still want to wait until the queue has drained to some point, since it is possible that the issue behind the initial failure persists even though one transmission got through. The second parameter, queueSizeToReturnToOptimistic, is the lower threshold at which we can return to Optimistic mode. It can be 0 if, when data is flowing normally, there is never another event in line behind the one being sent. If instead the normal case is 1, 2, or 5 events queued up at any given time, it may be better to set this a little higher so the queue returns to Optimistic mode quickly rather than waiting for the rare moment when it actually reaches 0.
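In the same sketch, the transition back could be keyed off successful sends; queueSizeToReturnToOptimistic is again assumed to be a configured field, and removeFromDisk is a hypothetical helper.

// Called after an event is acknowledged by the cloud service.
synchronized void onSendSuccess(E event) {
    pending.remove(event);
    removeFromDisk(event);                 // delivered, so the local copy can go
    if (mode == Mode.PESSIMISTIC && pending.size() <= queueSizeToReturnToOptimistic) {
        mode = Mode.OPTIMISTIC;            // the backlog has drained; stop paying for disk writes
    }
}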

The Optimistic Queue dramatically reduced the number of writes to disk without increasing our exposure too much. In some cases it wrote to disk less than a quarter as often. In the very unlikely event of an uncontrolled shutdown while the queue was backed up, we could lose at most maximumQueueSize events. Loosening the constraints on guaranteed delivery allowed for a much longer IoT device life, meaning more uptime and less data loss overall.
