Global OTA — Booking Platform Modernisation
Reduced peak-season booking failures by 82%, recovering £14M in annual revenue
A global online travel agency was losing 7.8% of peak-season bookings — £14M annually — because their legacy inventory system could not hold state during carrier API timeouts. Saga-based orchestration on GCP multi-region infrastructure cut failures to 1.4% and produced the first Black Friday with zero regional outages.
The client operated one of the larger online travel agencies in the European market, with £400M of annual gross booking value across flights, hotels, rail, and ground transport. Their technical problem was specific and commercially expensive. The underlying inventory system — built in 2016 and not substantially modified since — executed multi-leg bookings as a synchronous chain of API calls across up to 47 supplier systems. If any single supplier API timed out mid-booking, the legacy system had no mechanism to hold the partial state and resume. The booking would fail, the customer would be returned to the search page, and any legs that had already been confirmed with suppliers would either be released through a best-effort cancellation call or, in a meaningful minority of cases, would remain confirmed without a matching customer reservation — creating a downstream reconciliation problem that the finance team tracked as 'phantom bookings'.
During ordinary trading conditions the failure rate was acceptable at around 1.2%. During peak traffic — Black Friday, summer departures, Christmas — carrier API latency rose sharply and the failure rate climbed to 7.8%. In 2024, the company had measured the commercial cost of those failures at approximately £14M of abandoned revenue, with a further estimated £6M of reconciliation and customer-service cost.
The architectural recommendation was a saga-based orchestration pattern implemented on Google Cloud Platform. We moved the booking flow out of the synchronous chain and into a durable orchestrator running on Cloud Workflows, with each supplier call modelled as an explicit saga step with a defined compensating action. If a hotel booking was confirmed and a subsequent flight booking failed, the orchestrator executed the compensating cancellation against the hotel supplier rather than leaving the reservation orphaned. The orchestrator state itself was persisted durably, so that in-flight bookings could survive orchestrator restarts, regional failovers, and the rare catastrophic supplier outage. The customer experience was redesigned to communicate this: rather than a page refresh with a generic error, customers saw a real-time progress view of their booking as each leg confirmed, with explicit handling for the scenario where a specific leg required manual re-selection.
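The core of the pattern can be sketched in a few lines. This is an illustrative Python sketch of a saga with compensating actions, not the client's Cloud Workflows implementation: the step names and classes (`SagaStep`, `BookingSaga`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # confirm this leg with the supplier
    compensate: Callable[[], None]   # undo the leg if a later step fails

@dataclass
class BookingSaga:
    steps: list[SagaStep]
    completed: list[SagaStep] = field(default_factory=list)

    def execute(self) -> bool:
        for step in self.steps:
            try:
                step.action()
                self.completed.append(step)
            except Exception:
                # A later leg failed: run compensating cancellations in
                # reverse order so no confirmed reservation is orphaned
                # as a 'phantom booking'.
                for done in reversed(self.completed):
                    done.compensate()
                return False
        return True
```

In the production system each step and the `completed` list would be persisted durably between calls, which is what lets an in-flight booking survive orchestrator restarts and regional failovers.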
Multi-region active-active was the second structural piece. The legacy system ran in a single region with a warm standby; during a regional incident, failover took roughly 18 minutes, during which time all bookings were rejected. We redesigned the infrastructure as active-active across two GCP regions with global load balancing and cross-region replicated state. The 99th-percentile latency target was set at 120 ms end-to-end, which was met through a combination of edge caching, supplier connection pooling, and a rewritten pricing engine that pre-computed fare permutations during search rather than at the booking step.
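The pre-computation idea can be illustrated simply: enumerate the fare combinations for the candidate legs at search time and cache the totals, so the booking step becomes a lookup rather than a recomputation under peak load. This is a hypothetical sketch, not the client's pricing engine; the leg structure and function names are assumptions.

```python
from itertools import product

def precompute_fares(
    legs: dict[str, list[tuple[str, float]]],
) -> dict[tuple[str, ...], float]:
    """legs maps a leg name to its candidate (fare_code, price) options.

    Returns a cache keyed by the tuple of chosen fare codes (legs in
    sorted name order) mapping to the total price for that combination.
    """
    leg_names = sorted(legs)
    cache: dict[tuple[str, ...], float] = {}
    for combo in product(*(legs[name] for name in leg_names)):
        key = tuple(code for code, _ in combo)
        cache[key] = round(sum(price for _, price in combo), 2)
    return cache

def quoted_total(cache: dict[tuple[str, ...], float],
                 fare_codes: tuple[str, ...]) -> float:
    # At booking time the price is a dictionary lookup, not a recomputation.
    return cache[fare_codes]
```

The trade-off is the usual one: permutation counts grow multiplicatively with the number of legs, so in practice the search step would bound the candidate set per leg before enumerating.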
The synthetic monitoring layer was the piece the client's platform team had been most exercised about. The 47 supplier integrations had, historically, been monitored only through real-customer traffic — which meant that a supplier outage at three in the morning would typically be discovered when the first customer of the day encountered it. We built a synthetic test suite that executed canary bookings against every supplier every two minutes, with automated alerting to the on-call platform engineer if any supplier exceeded its SLA. The synthetic bookings were genuine transactions that were immediately cancelled, which required negotiating a specific arrangement with each supplier but gave the client a level of supplier-health visibility that its competitors did not possess.
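The shape of a single canary check is straightforward. A minimal sketch, with assumed supplier names, SLA figures, and probe/cancel/alert callables — in the real suite the probe is a genuine booking executed every two minutes and cancelled immediately:

```python
import time

# Hypothetical per-supplier latency SLAs in seconds (illustrative values).
SLA_SECONDS = {"carrier-a": 2.0, "carrier-b": 3.0}

def run_canary(supplier: str, probe, cancel, alert) -> float:
    """Execute one canary booking, cancel it, and alert on SLA breach.

    probe(supplier) places a real booking and returns its reference;
    cancel(supplier, ref) releases the inventory immediately;
    alert(message) pages the on-call platform engineer.
    """
    start = time.monotonic()
    booking_ref = probe(supplier)
    elapsed = time.monotonic() - start
    cancel(supplier, booking_ref)  # never leave synthetic inventory held
    if elapsed > SLA_SECONDS[supplier]:
        alert(f"{supplier} exceeded SLA: {elapsed:.2f}s")
    return elapsed
```

Because the probe exercises the full booking path rather than a health endpoint, a 3 a.m. supplier degradation pages the on-call engineer within one polling interval instead of waiting for the first customer of the day.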
The platform went live in full two weeks before the following Black Friday — a deliberate timing decision that gave the team a single high-pressure test of the new architecture. During the Black Friday window, the booking failure rate ran at 1.4%, down from the previous year's 7.8%. There were zero regional outages, no phantom bookings were generated, and the platform processed 41% more booking volume than the preceding Black Friday on the same infrastructure cost envelope. Over the following trading year, the £14M of previously abandoned revenue was recovered at a run-rate materially aligned with the business case the board had approved at the start of the engagement. The architecture has since been extended to the client's rail and ground transport lines of business.