Web-Services Transactions @ XML JOURNAL

Web-Services Transactions

From Loosely Coupled - The Missing Pieces of Web Services

Most non-programmers think of transactions as associated with buying and selling, credit-card authorizations, and the like. But in the jargon of computer science, the word transaction has a very specific meaning: the interaction and managed outcome of a well-defined set of tasks. If that definition still sounds rather vague or abstract, it's because the scope of what's considered a transaction has expanded over the past two decades, and the older, simpler definitions are no longer adequate. Computer systems have been connected via networks, and applications have become more distributed in nature. The theories and practices of transactions have been repeatedly stretched to their limits, re-evaluated, and extended. Now, because of web services, we're once again expanding that definition to include long-lived, loosely coupled asynchronous transactions.

Transaction Basics
Most database operations are simple, and thus don't qualify as transactions per se. For example, when a customer-service application wants to look up a customer's phone number, the application sends a query message to the database. It's a read-only operation that involves only one record in a single database. But most importantly, it's a one-step (atomic) operation that doesn't interact or conflict with other applications that may be interacting with the same record or even the same database.

More complex database operations require multiple steps that must all be completed for the operation to succeed. We refer to these operations as transactions. The traditional definition of a transaction is a single unit of work composed of two or more tasks. If any of these component tasks cannot be completed, the entire transaction fails, leaving the data in the state it was in before the transaction was initiated. In other words, a transaction is a collection of tasks that either all succeed or all fail. Achieving this consistent termination of a unit of work is the goal of a traditional transaction-processing monitor (TP monitor), which is software that manages lower-level database operations.

An example of a simple transaction is a transfer of funds from one account to another within the same bank. The transaction's unit of work consists of two tasks: the debiting from one account, and the crediting to another. Ideally, both tasks will execute properly (commit), but even more important is that if one task can't be accomplished, neither will be executed (i.e., they'll both abort). It's okay if the matching credit and debit both fail - the application initiating the transaction can always try again. But it's a serious problem if the credit is executed without the associated debit, or vice versa.
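The funds transfer can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the accounts table, the account names, and the helpers make_bank and transfer are hypothetical and not taken from any particular banking system. The point is that the debit and the credit sit inside one unit of work, so a failure in either one rolls back both.

```python
import sqlite3


def make_bank():
    """Create a throwaway in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one unit of work; True means committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown source account or insufficient funds")
            credited = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            if credited.rowcount != 1:
                raise ValueError("unknown destination account")
        return True
    except ValueError:
        return False  # both tasks aborted; the caller may simply try again
```

Calling transfer(make_bank(), "alice", "bob", 500) returns False and leaves both balances untouched, because the debit that had already been applied is rolled back along with everything else in the unit of work.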

ACID
As a result of their theoretical studies of transactions, Theo Härder and Andreas Reuter published a 1983 paper, "Principles of Transaction-Oriented Database Recovery," in which they presented the requirements for systems that could process multiple-task units of work (transactions) without being corrupted by hardware, database, or operating-system failures. The paper is most famous for its specification of the principles of Atomicity, Consistency, Isolation, and Durability (ACID). A system that conforms to these so-called ACID properties guarantees the reliability of its transactions.

Two-Phase Commit
When all of the data involved in a transaction resides on a single database, only one transaction manager (TM) is required to maintain atomicity. But applications and databases are increasingly distributed, such as those linked by web services. The challenge for web services is to maintain atomicity by guaranteeing the mutual success and durability of all of the elements of such a distributed transaction, so named because it involves a distributed unit of work. In other words, multiple steps are required that involve two or more databases.

The traditional method for handling distributed transactions is known as the two-phase commit, which, as its name implies, breaks transactions into two cooperating phases. The two-phase commit protocol is illustrated in Figure 1.

The two-phase commit process assures the atomicity of the distributed transaction. It's clean and simple - except when things go wrong. Due to hardware, software, or communications failures, it's possible that one or more messages may be lost, resulting in an uncertain state for one or more of the resource managers. As it turns out, however, only the loss of a commit message can cause a serious problem. Losses of other message types are less critical. If a resource manager fails to get the request-to-prepare message, it will simply fail to respond. The transaction coordinator will give up waiting for the resource manager's response and send out an abort message. The other resource managers will not have committed any of their changes. The same occurs if one or more of the response messages is lost. And if a done message is lost, no action need be taken, since all of the resource managers will have committed the transaction.
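The message flow described above can be sketched as a small in-process simulation. This is a toy model rather than a real protocol implementation: the ResourceManager class, its can_prepare and reachable switches, and the two_phase_commit function are all hypothetical names invented for illustration, and a vote of None stands in for a coordinator that has given up waiting for a response.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    name: str
    can_prepare: bool = True  # hypothetical switch: simulate a "no" vote
    reachable: bool = True    # hypothetical switch: simulate a lost message
    state: str = "idle"

    def prepare(self):
        """Phase one: answer the request-to-prepare; None means no response."""
        if not self.reachable:
            return None
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self):
        """Phase two: make the prepared changes permanent and report done."""
        self.state = "committed"
        return "done"

    def abort(self):
        """Phase two: discard any prepared changes (if the message arrives)."""
        if self.reachable:
            self.state = "aborted"


def two_phase_commit(managers):
    """Coordinator: commit only if every resource manager votes 'prepared'."""
    votes = [rm.prepare() for rm in managers]   # phase one
    if all(vote is True for vote in votes):     # phase two
        for rm in managers:
            rm.commit()
        return "committed"
    for rm in managers:                         # a "no" or a missing vote
        rm.abort()
    return "aborted"
```

With two healthy resource managers the function returns "committed". Marking either one can_prepare=False or reachable=False makes it return "aborted", and the manager that had already prepared is told to discard its changes. The sketch deliberately leaves out the hard case discussed next, in which the commit or abort message itself is lost.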

The most serious problem occurs when a resource manager prepares for the transaction but never receives either a commit or an abort message from the transaction coordinator. Once a resource manager has sent its prepared response, it's in limbo. It can't commit the transaction, and it can't release any resources locked on behalf of the transaction. (Resource locks are under the control of the individual resource managers, not the transaction coordinator.)

In fact, there's no simple solution to this problem. No two-phase commit protocol can protect against all failures. The possibility will always exist that a communications failure can cause a resource manager to become blocked, or unable to commit or abort. Still, even with its limitations, the two-phase commit protocol remains the mainstay of distributed transactions.

The Web-Services Challenges
The ACID model has been the focus of transaction technologies for twenty years. It's widely used for both local and - via the two-phase commit protocol - distributed transaction systems. But as valuable as the ACID model has proven to be for tightly coupled distributed systems, it falls short for long-lived, loosely coupled asynchronous transactions.

Long-lived transactions
Web services are far more complex in terms of time and space than the transactions for which the ACID concepts were developed. Whereas ACID-based transactions may span many seconds or even a few minutes, loosely coupled web-services transactions may extend over hours or even days. Considerable time can elapse between the preparation and commit phases. Using ACID-style transactions in such long-running business processes would mean that participating resources could be locked and unavailable for extended periods of time - which is unacceptable to the many local applications that use the same databases and must wait until the resources they require are released.

Reliability
ACID-style transactions are designed to cope with failures in hardware, software, and communications, but only in otherwise reliable environments where such failures occur relatively infrequently. Most ACID-style distributed transaction systems are based on synchronous, connection-oriented protocols, which maintain communications paths between transaction coordinators and the participating resource managers for at least the duration of the transaction. These synchronous protocols assist in handling such errors by signaling the transaction-coordinator or resource-manager software when a communication failure occurs, so that the coordinator or resource manager knows it can no longer communicate with the service at the other end of the connection. When a communications link fails, all synchronous transactions that depend on that link are promptly aborted.

Short-term communications failures are therefore fatal errors for tightly coupled synchronous transactions, but they must be routinely handled by the systems that support long-lived, loosely coupled asynchronous transactions. The latter are based on a reliable-messaging infrastructure that delivers messages with a high degree of assuredness, even in cases where the recipient and the intervening infrastructure may be down for extended periods of time.

Trust
Because the resource locks typically used with ACID-style transactions may block applications, it's critical that they be held for as short a time as possible. If an application dies after locking a resource, that resource could be orphaned forever. If the resource in question represents the availability of an airline seat, that seat might never be filled. A resource manager therefore manages its resources like a mother hen, making sure that locked resources are never abandoned. If a local application requests a lock and then terminates, the resource manager must clean up the mess by unlocking the resource. Before a resource manager allows transactions to be initiated by remote transaction coordinators, a great deal of trust must exist among the resource manager, the remote coordinators, and other resource managers participating in the transactions.

Suppose it's not the link that fails, but rather the remote transaction coordinator. Although the messaging software won't signal a communications error (the communications link is still operational), the local resource manager has the ultimate fallback: It can rely on timeouts to protect its resources. Unfortunately, timeouts can't be used for long-lived transactions, because by definition they execute over extended periods. Again, the techniques that support ACID-style transactions won't work with those that are long-lived, loosely coupled, and asynchronous.
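A timeout-protected lock of the kind described above can be sketched as a lease that expires on its own. The LeaseTable class, its method names, and the injectable clock are assumptions made for illustration and do not correspond to any specific resource manager.

```python
import time


class LeaseTable:
    """Locks that expire by themselves, so a dead owner cannot orphan a resource."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock      # injectable so the behavior can be tested
        self._leases = {}       # resource -> (owner, expiry time)

    def acquire(self, resource, owner):
        """Grant the lock unless another owner holds an unexpired lease."""
        now = self.clock()
        holder = self._leases.get(resource)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False
        self._leases[resource] = (owner, now + self.timeout)
        return True

    def release(self, resource, owner):
        """Release early; only the current owner may do so."""
        holder = self._leases.get(resource)
        if holder is not None and holder[0] == owner:
            del self._leases[resource]
```

A timeout of a minute or so protects the resource manager when a short transaction's coordinator disappears. For a transaction that legitimately runs for days, the timeout would have to be at least that long, which brings back exactly the extended locking the timeout was meant to prevent.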

Cancellation risks and abuses
External web services introduce a number of risks just by exposing internal systems to access by others. Allowing externally initiated transactions increases what's known as cancellation risk. For example, consider airline seats purchased at full price a few months before the flight. If they're cancelled at the last minute, the airline may be unable to sell them.

The problem becomes more acute when business processes are automated by web services, because accidental or even intentional abuse can so easily go undetected. For example, imagine how an unethical travel aggregator might exploit an airline-reservation web service. Months in advance, the aggregator reserves every available seat on a particular flight - but at the last minute, cancels them. In a panic to sell the seats, the airline puts them on sale at a deep discount. The unscrupulous travel aggregator then repurchases the same seats at this much lower price.

Accepting a reservation carries an inherent risk of such a last-minute cancellation. This problem exists even without web services, but there are systems in place to detect and prevent most abuses. Airlines manage this risk through overbooking. Concert and theater ticket agencies protect themselves using no-refund policies. But many other businesses - particularly those in wholesale trade - have no formal methods for managing cancellation risks. The risks and abuses of cancellations will probably increase and spread to other industries as external web services are deployed. Web services will ultimately need to express and negotiate the policies under which such transactions are made.

Loosely Coupled Transactions
Clearly, the web-services requirements for transactions far exceed what can be accomplished using traditional technologies. The more loosely we couple systems - separating them in time, space, and control - the more difficult it becomes to manage transactions distributed among them. Loosely coupled transactions, it would seem, come at a cost of increased complexity. That's true, but only so long as we keep trying to apply, refine, and improve traditional approaches based on ACID-style concepts. Instead, let's consider how we can build an all-new transactional system based on loosely coupled web services technologies: asynchronous communications, reliable messaging, and document-style interaction. Let's use an example of a tightly coupled transaction, then see how it can be improved.

You're in your car, listening to the radio, when you hear an announcement that your favorite musician will be performing in your town. You grab your cell phone and dial the ticket-sales agency. A friendly salesperson answers the phone, and you launch into your request - only to be interrupted by the salesperson telling you, "I'm sorry, but our computers are down right now, and we don't know when they'll be back up. You'll have to call again later."

You've just stumbled into one of the drawbacks of synchronous transactions: In this case, there's nothing you can do but abort the transaction. You (the requestor) and the reservation system (the provider) must be available simultaneously. There's no point leaving your information with a salesperson who's just an intermediary, with no store-and-forward capability. Even if the salesperson were willing to take down your information, would you trust that person to complete your order? The responsibility for recovering from the system failure and restarting the transaction falls entirely on you, the requestor.

Half an hour later, you call back (retry), and learn that the system is now available. Of course the context of your transaction has been lost, so you've got to start from the very beginning. As luck would have it, the agent submits your request only to report, "Sorry, but all of the orchestra seats are now sold out. The best I can do is row J, seats 103 and 104 in the upper mezzanine." For a period of a few minutes, the reservation system locks the database records that represent those two seats while you make up your mind. If other customers are placing orders through different agents, they won't be offered those same seats. (This is now a synchronous transaction.)

You tell the agent you'll take the tickets, but your cell phone goes dead just as you're about to jot down your confirmation number. Now what? Did the transaction complete? Do you really have two tickets for the concert, or do you need to call back and place another order? If you do, will you end up with four tickets instead of two? Unfortunately, there's no way to know. Such are the problems of tightly coupled transactions without a reliable asynchronous messaging infrastructure.

Wouldn't it be great if you could just leave a voice-mail message (a self-contained document) including not only the obvious details, but instructions (the business logic) for what to do in case your first-choice seats aren't available? Your voice-mail message would then enter a message queue along with those of other customers, and be processed in sequence. As a result of your request, the ticket agency would call you back or send you an email message confirming your purchase. The acknowledgement would complete this long-lived, loosely coupled asynchronous transaction.

Long-lived transactions
By communicating asynchronously, you've eliminated the real-time constraint of the transaction. You can make your request in the middle of the night. Even if a human agent must review your order, that person need not be available at the time you submit it. Although the vendor's voice-mail system must be able to accept calls at a reasonable rate, the actual transaction system that processes the request is highly scalable. Even if the transaction system goes offline, all orders will get processed in due time as long as customers can submit voice-mail orders. You can see how a reliable asynchronous messaging system is key to long-lived, loosely coupled asynchronous transactions.

Isolation without locking
You've also eliminated the need for record locking. So long as all requests are submitted through a single queue, the ticket agency can process its requests serially. And provided only one ticket request is being processed at a time, the application doesn't need to simulate serialization by locking resources.
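As a rough sketch of this idea, the queue below holds self-contained order documents, each listing the customer's first choice followed by any fallbacks, and a single worker handles them one at a time. The document layout (the "customer" and "choices" keys), the seat labels, and the process_orders function are hypothetical and chosen only for illustration.

```python
from collections import deque


def process_orders(queue, available):
    """Handle one order document at a time, in arrival order.

    queue     -- deque of dicts: {"customer": str, "choices": [[seat, ...], ...]}
    available -- set of unsold seat labels, updated in place
    Returns a list of (customer, granted seats or None) confirmations.
    """
    confirmations = []
    while queue:
        order = queue.popleft()
        granted = None
        for choice in order["choices"]:  # first choice, then the fallbacks
            if all(seat in available for seat in choice):
                available.difference_update(choice)
                granted = choice
                break
        confirmations.append((order["customer"], granted))
    return confirmations


seats = {"orch-1", "orch-2", "J103", "J104"}
orders = deque([
    {"customer": "ann", "choices": [["orch-1", "orch-2"]]},
    {"customer": "you", "choices": [["orch-1", "orch-2"], ["J103", "J104"]]},
])
confirmations = process_orders(orders, seats)
```

Because only one order is examined at a time, no seat needs to be locked while a customer decides: the decision already travels with the document. In this run the second customer's first choice has been sold by the time the order is reached, so the fallback seats are granted instead.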

Compensating Transactions
Once a transaction has been committed, it can no longer be aborted. Yet in the real world, there are often times when the effects of a transaction must be undone. The problem is that some transactions can't be reversed because their effects are permanent, and/or conditions have so changed over time that restoring the previous state would be inappropriate. As an example, consider a transaction that triggers the manufacturing of an item. Materials are consumed, and money is spent. It's impossible to simply wipe out the transaction. You can't un-manufacture the item. Instead, other actions must be taken, such as charging the customer a cancellation fee and offering the item for sale to other parties. In the earlier example of an unscrupulous travel aggregator who cancelled airline tickets at the last minute, we saw how the airline chose to put those seats on sale at a discount in order to make sure they'd be sold and the airplane would be full.

These are examples of compensating transactions that can be applied after an original transaction has been committed in order to undo its effects, without necessarily returning resources to their original states. Many transaction managers support compensating transactions - and as we'll see in the case of long-lived, loosely coupled asynchronous web services, compensating transactions can actually be used instead of resource locking.

Optimism
ACID-style transactions are optimistic, and assume a high likelihood of success. You can imagine a human coordinator of a simple two-phase commit transaction commanding the participants. Phase One: "Okay, here's what you need to do. [Coordinator enumerates the requirements.] Has everyone prepared for the transaction by safely storing the results? Good." Phase Two, after receiving affirmative votes from all participants: "Now everyone...GO!" There's no need for the coordinator to ask whether anyone was unsuccessful, since all of the participants promised in Phase One that they could do as requested. The key to the success (and integrity) of the transaction is the locking of the resources between these two phases.

On the other hand, a loosely coupled transaction coordinator must take a pessimistic view of a transaction's outcome. Even with a reliable messaging protocol, many other errors can occur due to the long-term nature of the transaction. Rather than reserve their resources in advance, loosely coupled participants prepare compensating transactions that will undo the local effects in case the first phase is unsuccessful. If the transaction is later aborted, all participants execute their compensating transactions.

When using compensating transactions, our human coordinator might say in Phase One, "Okay, here's what you need to do. Don't do it yet, but in case this doesn't work, I want each of you to figure out ahead of time how to recover. Now everyone...GO!" Then, in Phase Two: "Great...did that work for everyone, or do we all need to run our back-out scenarios?"
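The coordinator's instructions can be sketched as a list of steps, each paired with a back-out action that is prepared before anything runs. This is a minimal illustration and not the mechanism of any particular transaction manager. The function run_with_compensation and the example steps are hypothetical, and the sketch assumes the compensating actions themselves always succeed.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, most recent first, and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()  # assumed to succeed in this sketch
            return False
        completed.append(compensate)
    return True


log = []


def book_flight():
    raise RuntimeError("no seats left")


ok = run_with_compensation([
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (lambda: log.append("reserve hotel"), lambda: log.append("cancel hotel")),
    (book_flight, lambda: log.append("cancel flight")),
])
```

No resource is locked between steps. The card is really charged and the hotel really reserved, and when the flight booking fails those effects are undone by new actions (a cancellation and a refund) rather than by restoring the earlier state.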

Compensating transactions are one of the technologies that decouple systems from one another, and are a first step towards filling in the missing pieces of complex web services.

Standards
IBM, Microsoft, and BEA are at work on WS-Coordination, a framework that supports multiple coordination types, including WS-AtomicTransaction for short-lived "all-or-nothing" transactions and WS-BusinessActivity for long-lived, loosely coupled transactions using compensation.

Sun, Oracle, IONA, and others have announced plans for WS-CAF, the Web Services Composite Application Framework, for transactions and coordination of interdependent web services. And the OASIS Business Transaction Technical Committee is continuing to develop BTP, the Business Transaction Protocol, but is awaiting implementations before advancing it towards a full OASIS standard.

The issues are both political and technical. Because the traditional mechanisms for handling distributed transactions don't work for web services, the standards for web-services transactions will be some of the last to be developed, agreed to, and adopted. Most experts don't expect much impact from these competing standardization efforts until 2005.

About Doug Kaye
Doug Kaye is the CEO of RDS Strategies LLC and the publisher of the IT Strategy Letter.