How to avoid the double-spending problem in distributed ledgers?

The backbone of resilient online payment systems.

Prashanth Basappa
7 min read · Sep 7, 2022

E-commerce has exploded in popularity across the world in recent years. What makes every transaction possible is a payment system running behind the scenes. A reliable, scalable, and flexible payment system is essential.

What is a payment system? According to Wikipedia, “a payment system is any system used to settle financial transactions through the transfer of monetary value. This includes the institutions, instruments, people, rules, procedures, standards, and technologies that make its exchange possible.”

A payment system is easy to understand on the surface but is also intimidating for many developers to work on. A small slip could potentially cause significant revenue loss and destroy credibility among users.

Strong Consistency Requirement: any read operation returns the value of the most recent successful write; in other words, a client never sees out-of-date data. In a distributed system with N replicas, a write quorum W, and a read quorum R, requiring W + R > N guarantees that the read set and the write set overlap on at least one node that holds the latest data, which ensures consistency.
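As a minimal sketch (not from the original article), the quorum condition can be expressed directly; N, W, and R here are hypothetical configuration values:

```python
def quorum_is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Return True when the write quorum W and read quorum R overlap on at
    least one of the N replicas, so a read always sees the latest write."""
    return w + r > n

# Example: 5 replicas, writes acknowledged by 3, reads served by 3.
print(quorum_is_strongly_consistent(5, 3, 3))  # True: read and write sets must overlap
print(quorum_is_strongly_consistent(5, 2, 2))  # False: a read may miss the latest write
```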

Imagine you are transferring 100 dollars from account A to account B, and both accounts are in the same bank. After you initiate the transfer, you refresh your screen and your total balance dips: that 100 dollars seems to have vanished into thin air. Account A is 100 dollars lower, but account B is not yet 100 dollars higher. Only after refreshing the screen a couple more times do you see that account B has gained the 100 dollars.

The problem you experience during this transaction is called read skew: an anomaly that happens when you read at an unlucky moment, partway through another transaction's write.

Read skew also becomes a problem when taking a database backup or running an analytics query.

During a backup, we need to copy the entire database, and write requests may keep arriving while the copy is in progress. If read skew occurs, the backup may be inconsistent: some data is in the old version, other data is in the new version. If we later restore from that backup, the inconsistency becomes permanent.

Analytics queries need to scan over large parts of the database, and we periodically scan to check for data corruption. Read skew can make these scans inconsistent, producing misleading results and firing false alerts about data corruption.

Solution:

Versioning and vector clocks are commonly used to solve such inconsistency problems. The idea is to make each transaction read from a consistent snapshot of the database: the transaction sees all the data that other transactions had committed as of the moment it started, and nothing committed afterwards.

This feature is called snapshot isolation, and it is offered in a lot of relational databases.

A consistent snapshot is an isolation level that guarantees each transaction reads from a consistent snapshot of the database, usually the most recent snapshot taken before the transaction was initiated.

Implementing Snapshot Isolation

To implement snapshot isolation, the database needs to keep multiple versions of the data. At the beginning of each transaction, the database hands the transaction the latest committed snapshot, and it tracks which snapshot each transaction is using so that all of that transaction's reads stay consistent.

Each transaction is assigned a transactionId by the database, and these IDs are always increasing. The database tags every version it writes with the transactionId of the transaction that created it, so any later transaction can decide which version falls inside its snapshot.
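To make the visibility rule concrete, here is a simplified, illustrative sketch of multi-version reads; the Version and Row structures are hypothetical stand-ins for a real database's internal storage, and the sketch assumes every listed version was committed:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Version:
    value: int
    created_by: int                    # transactionId that wrote this version
    deleted_by: Optional[int] = None   # transactionId that tombstoned it, if any

@dataclass
class Row:
    versions: list = field(default_factory=list)  # ordered oldest -> newest

def visible_value(row: Row, snapshot_txid: int) -> Optional[int]:
    """Return the newest version created at or before this transaction's
    snapshot and not yet deleted as of that snapshot."""
    for version in reversed(row.versions):
        created_visible = version.created_by <= snapshot_txid
        not_deleted = version.deleted_by is None or version.deleted_by > snapshot_txid
        if created_visible and not_deleted:
            return version.value
    return None

# Account A's balance: written by txn 10, then updated by txn 20.
balance_a = Row(versions=[Version(500, created_by=10), Version(400, created_by=20)])
print(visible_value(balance_a, snapshot_txid=15))  # 500: a txn that started before 20 never sees its write
print(visible_value(balance_a, snapshot_txid=25))  # 400
```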

Implementing snapshot isolation requires a monotonically increasing counter, the transactionId, to determine which version to return to a given transaction. This is hard in a distributed environment, because generating IDs that respect causality requires coordination across nodes. One way to solve this is to use a time-of-day clock that returns a confidence interval (an approach similar to Google's TrueTime) to generate an ever-increasing transactionId.

Lastly, to ensure each transaction gets a consistent snapshot, we can either use a quorum strategy, returning the most recent snapshot acknowledged by a majority of nodes, or use session affinity so that all calls of a transaction are routed to the same database instance.

During a delete, instead of removing the value right away, the database marks the field with a tombstone. The reason not to delete immediately is that earlier transactions may still need that version. A garbage-collection process can then remove the value asynchronously once no running transaction can still see it.
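A sketch of the rule such a garbage collector would apply; the snapshot bookkeeping (knowing the oldest snapshot still in use) is assumed to come from the transaction manager:

```python
from typing import Optional

def can_purge(deleted_by: Optional[int], oldest_active_snapshot: int) -> bool:
    """A tombstoned version may be physically removed only once every
    transaction still running started after the deletion was committed."""
    return deleted_by is not None and deleted_by <= oldest_active_snapshot
```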

These techniques can be used in a ledger, which keeps the financial record of every payment transaction. For example, when a user pays a seller $1, we record it as a $1 debit from the user and a $1 credit to the seller. The ledger is very important in post-payment analysis, such as calculating the total revenue of the e-commerce website or forecasting future revenue, and it is where most of the read traffic is directed.
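As an illustrative sketch (account names and amounts are made up), a double-entry ledger records every payment as a balanced debit/credit pair:

```python
from decimal import Decimal

def record_payment(ledger: list, payer: str, payee: str, amount: Decimal) -> None:
    """Append a balanced pair of entries; debits and credits always cancel out."""
    ledger.append({"account": payer, "debit": amount, "credit": Decimal("0")})
    ledger.append({"account": payee, "debit": Decimal("0"), "credit": amount})

ledger = []
record_payment(ledger, payer="user_123", payee="seller_456", amount=Decimal("1.00"))

# The invariant that makes double spending detectable: total debits == total credits.
assert sum(e["debit"] for e in ledger) == sum(e["credit"] for e in ledger)
```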

Once we have solved the consistency problem at low traffic, we need to improve the resiliency of the system under stress:

1. Lower Your Timeouts

If there’s only a single thing you take away from this post, it should be to investigate and set low timeouts everywhere you can.
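For example, with Python's requests library you can set an explicit connect and read timeout instead of relying on the default, which waits indefinitely; the URL and values below are placeholders:

```python
import requests

try:
    # (connect timeout, read timeout) in seconds, far lower than most defaults.
    response = requests.post(
        "https://payments.example.com/charge",
        json={"amount_cents": 100},
        timeout=(0.5, 2.0),
    )
    response.raise_for_status()
except requests.Timeout:
    # Fail fast and free the worker instead of letting blocked requests pile up.
    ...
```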

2. Install Circuit Breakers

By raising an exception instantly once we detect a service being down, we save resources by not waiting for another timeout we expect to happen. In some cases rescuing these exceptions allows you to provide a fallback.

It requires understanding the ways your application can fail and what falling back could look like. At scale a circuit breaker can still waste a lot of resources (and money) as well. The article Your Circuit Breaker is Misconfigured explains how to fine tune this pattern for maximum performance.
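A minimal sketch of the pattern; the thresholds, cool-down, and wrapped call are hypothetical, and a production breaker also needs a half-open state and the tuning the linked article describes:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While open, fail instantly instead of waiting for another timeout.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # cool-down elapsed, allow a trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```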

3. Understand Capacity

Understanding a bit of queueing theory goes a long way. Slightly summarized, Little’s Law states that “the average number of customers in a system (over an interval) is equal to their average arrival rate, multiplied by their average time in the system.” Put as a formula, Little’s Law is expressed as capacity = throughput * latency, which also means that throughput = capacity / latency. In more practical terms: if our system can hold 50 in-flight requests and each takes an average of 100 milliseconds to process, our maximum throughput is 500 requests per second.
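The arithmetic from that example, as a quick sketch:

```python
# Little's Law: capacity = throughput * latency  =>  throughput = capacity / latency
capacity_in_flight = 50   # requests the system can hold at once
latency_seconds = 0.100   # average time to process one request

throughput = capacity_in_flight / latency_seconds
print(throughput)  # 500.0 requests per second
```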

4. Add Monitoring and Alerting

We need to monitor metrics that tell us when the system is at risk of going down due to overload. Google’s site reliability engineering (SRE) book lists four golden signals a user-facing system should be monitored for:

  • Latency: the amount of time it takes to process a unit of work.
  • Traffic: the rate at which new work comes into the system.
  • Errors: the rate of unexpected things happening.
  • Saturation: how much load the system is under, relative to its total capacity.
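A toy sketch of saturation-based alerting; the threshold and the source of the numbers are assumptions, not a real monitoring API:

```python
def saturation_alert(in_flight_requests: int, capacity: int, threshold: float = 0.8) -> bool:
    """Fire an alert when load relative to total capacity crosses a threshold,
    ideally before latency and error rates start to degrade."""
    saturation = in_flight_requests / capacity
    return saturation >= threshold

print(saturation_alert(in_flight_requests=45, capacity=50))  # True: 90% saturated
```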

5. Use Idempotency Keys

Distributed systems use unreliable networks, even if those networks look reliable most of the time. An idempotency key lets us look up which steps a previous attempt already completed (such as creating a local database record of the transaction) and ensures we send only a single request to our financial partners. If any of these steps fail and a retried request with the same idempotency key is received, recovery steps are run to recreate the same state before continuing. Building Resilient GraphQL APIs Using Idempotency describes how our idempotency mechanism works in more detail.

An idempotency key needs to be unique for the time we want the request to be retry-able, typically 24 hours or less. We prefer using Universally Unique Lexicographically Sortable Identifier (ULID) for these idempotency keys instead of a random version 4 UUID. ULIDs contain a 48-bit timestamp followed by 80 bits of random data. The timestamp allows ULIDs to be sorted, which works much better with the b-tree data structure databases use for indexing. In one high-throughput system at Shopify we’ve seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for idempotency keys.
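A minimal sketch of the mechanism, with an in-memory dict standing in for the database table that would store idempotency keys; the payment call is a placeholder, and the ULID import assumes the ulid-py package (any ULID generator works):

```python
import ulid  # assumption: ulid-py, which exposes ulid.new()

completed: dict[str, dict] = {}  # idempotency_key -> recorded result

def charge_once(idempotency_key: str, amount_cents: int) -> dict:
    """Return the stored result for a retried key instead of charging twice."""
    if idempotency_key in completed:
        return completed[idempotency_key]
    result = {"status": "charged", "amount_cents": amount_cents}  # call the PSP here
    completed[idempotency_key] = result
    return result

key = str(ulid.new())          # 48-bit timestamp + 80 bits of randomness, time-sortable
first = charge_once(key, 100)
retry = charge_once(key, 100)  # network retry with the same key
assert first is retry          # the financial partner is charged only once
```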

6. Reconcile payment information

When system components communicate asynchronously, there is no guarantee that a message will be delivered, or a response will be returned. This is very common in the payment business, which often uses asynchronous communication to increase system performance. External systems, such as PSPs or banks, prefer asynchronous communication as well. So how can we ensure correctness in this case? The answer is reconciliation. This is a practice that periodically compares the states among related services in order to verify that they are in agreement. It is usually the last line of defense in the payment system.
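A simplified sketch of reconciliation: periodically compare our ledger's view of each transaction with the settlement report from the PSP and flag any disagreement (the record shapes are made up):

```python
def reconcile(internal_records: dict, psp_records: dict) -> list:
    """Compare per-transaction amounts between our ledger and the PSP report."""
    mismatches = []
    for txn_id in internal_records.keys() | psp_records.keys():
        ours = internal_records.get(txn_id)
        theirs = psp_records.get(txn_id)
        if ours != theirs:
            mismatches.append((txn_id, ours, theirs))
    return mismatches

internal = {"txn_1": 100, "txn_2": 250}
psp = {"txn_1": 100, "txn_3": 75}   # txn_2 missing at the PSP, txn_3 unexpected
print(reconcile(internal, psp))      # mismatches go to a human or an auto-repair job
```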

7. Incorporate Load testing

Regularly test the limits and protection mechanisms of our systems by simulating large-volume flash sales on specially set up benchmark stores.
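For example, a small load test written with Locust (the checkout endpoint is a placeholder) can be pointed at a dedicated benchmark store and scaled up until the protection mechanisms kick in:

```python
from locust import HttpUser, task, between

class FlashSaleShopper(HttpUser):
    wait_time = between(0.1, 0.5)  # aggressive think time to simulate a flash sale

    @task
    def checkout(self):
        # Placeholder endpoint; point the test at a benchmark store, never production.
        self.client.post("/cart/checkout", json={"sku": "FLASH-SALE-ITEM", "qty": 1})
```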

8. Organize Incident Retrospectives

Aim to have an incident retrospective meeting within a week after the incident occurred. Retrospectives aren’t just good for trying to prevent problems, they’re also a valuable learning tool for new members of the team.
