Migration to Microservices: Lessons Learned, Part 2

Last week we covered starting with one service instead of starting with multiple services at a time. For example, spend the first couple of sprints developing the functionality for multiple domains in the same service and in the same repository, but in different packages. In Part 2 I will discuss starting with services that bring business value, as well as orchestration and data consistency.

Start With Services That Bring Business Value

That might sound obvious, but it is not.
Engineers can get carried away by infrastructure and start thinking about OAuth, a Gateway, Service Discovery, a Config Server, asynchronous communication between services, etc., from day one. There is nothing wrong with keeping all of that in mind, but starting the implementation of a microservices architecture from infrastructure services might be a step in the wrong direction.

Let me elaborate. Imagine you have begun with OAuth. To make it work, you need at least a user- and role-management service. If you don’t have one in place, you will start creating shortcuts and workarounds, or you will be blocked by another team that is developing it at a different pace. Likewise, if you don’t have at least two services, there is no reason to create a unified Gateway, implement Service Discovery, and so on. Use common sense and start with the services that help you achieve your MVP with the least effort. Focus on the business first, not on fancy technical stuff.

Think About Orchestration and Eventual Data Consistency

Life is easier inside a single monolithic application: no distributed transactions, no two-phase commits, everything is ACID, and so on. Life gets more interesting in the microservices world, where one logical business transaction from the UI perspective results in multiple calls to different services that somehow have to be orchestrated and that store data in an eventually consistent state.

Orchestration

Typical scenario: a payment is marked as “complete” only if its order is marked as “confirmed”; otherwise, the payment remains “not verified”. In this simple example, we have two different entities, a payment and an order, owned by two different services, each of which operates its own storage. So, after a payment is created in the payment service’s DB, the corresponding order in the order service’s DB should be updated, too. This can be achieved in various ways, but let me provide a simple scenario (sketched in code after the list):

  1. The order service makes an async HTTP call to the payment service to start a payment.
  2. The corresponding order is given a “not paid” status.
  3. The payment service completes the payment and notifies the order service by making an async HTTP call back to it; the order service then updates the order status with the result.
  4. The UI reloads the order from the database to show the correct status to the end user.
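
To make the scenario above a bit more concrete, here is a minimal Java sketch of the order-side logic. All class, endpoint, and status names (OrderService, /payments, NOT_PAID) are hypothetical, and a real implementation would add persistence, retries, and idempotency:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Hypothetical order-side orchestration; names and endpoints are illustrative only.
public class OrderService {

    private final HttpClient http = HttpClient.newHttpClient();

    // Steps 1-2: create the order as "not paid" and ask the payment service to start a payment.
    public CompletableFuture<Void> placeOrder(String orderId, long amountCents) {
        saveOrderStatus(orderId, "NOT_PAID");

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://payment-service/payments")) // assumed endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"orderId\":\"" + orderId + "\",\"amount\":" + amountCents + "}"))
                .build();

        // Async call: the order service does not block waiting for the payment result.
        return http.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                   .thenAccept(response -> { /* payment started; result arrives via callback */ });
    }

    // Step 3: the payment service calls this back with the final payment result.
    public void onPaymentCompleted(String orderId, boolean success) {
        saveOrderStatus(orderId, success ? "CONFIRMED" : "NOT_VERIFIED");
        // Step 4: the UI reloads the order and sees the updated status.
    }

    private void saveOrderStatus(String orderId, String status) {
        // Placeholder for a real repository call.
        System.out.println("Order " + orderId + " -> " + status);
    }
}
```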

Even for this very simple scenario, the logic gets quite complicated:

  • Should the order service or the payment service initiate the communication?
  • Should this communication be synchronous or asynchronous?
  • What if the payment service is down when the request is made?
  • What if the order service is down when the payment service sends the status back?

In reality, such business flows involve calls between multiple services and are naturally eventually consistent.

Inter-Service Communication

Let’s scratch the surface of inter-service communication and briefly discuss a few possible approaches and their complexity.
First of all, services can communicate with each other directly. In this case, a strict contract has to be defined between the services, the remote service has to be highly available, and a failed request can mean lost data. To avoid data loss, services can instead communicate through middleware. This decoupled approach is data-oriented and almost contract-free, but the middleware itself has to be highly available and responsible for data persistence, not to mention the additional overhead invested in its observability.

In addition to direct vs. decoupled, inter-service communication can be synchronous (based on a blocking API) or asynchronous (based on a non-blocking API). During development we tend to forget that approaches to inter-service communication are not one-size-fits-all; depending on the business scenario, the right choice can be any combination of the above, e.g. direct and asynchronous, direct and synchronous, etc. The selection criteria really depend on the business scenario and on cost: how much time it will take to implement, test, deploy, and monitor, how likely issues are, and what the strategies are for the different failover scenarios.
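
To illustrate the difference, here is a rough Java sketch of both styles: a direct, synchronous HTTP call with a strict contract, and a decoupled, asynchronous publish through middleware (Kafka is used purely as an example broker). Endpoint, topic, and host names are assumptions, and error handling is kept to a minimum:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PaymentNotifier {

    // Direct and synchronous: the order service must be up right now,
    // and both sides share a strict request/response contract.
    public void notifyDirectly(String orderId, String status) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://order-service/orders/" + orderId + "/payment-status")) // assumed endpoint
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"status\":\"" + status + "\"}"))
                .build();
        HttpResponse<Void> response = http.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() != 200) {
            // Without a retry strategy, this is where data can be lost.
            throw new IllegalStateException("Order service rejected the update");
        }
    }

    // Decoupled and asynchronous: the event is persisted by the middleware,
    // so the order service can consume it later, but the broker must be highly available.
    public void notifyViaMiddleware(String orderId, String status) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payment-status", orderId,
                    "{\"orderId\":\"" + orderId + "\",\"status\":\"" + status + "\"}"));
        }
    }
}
```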

Fault Tolerance, or Things Will Break

In the microservices world, services are distributed across the network, and the network will eventually fail, so services might become unavailable. Or the load on the solution can increase, which may slow processing down or even make a service unavailable. On top of that, real-life business cases are implemented behind the scenes as a cascading call stack from service to service. If the network or a specific service fails, the whole stack may fail, and end users may get a stack trace instead of a meaningful message. With a bad design, end users may even have to wait a long time just to get an error. Let’s just admit that things will break and not pretend that we can eliminate every possible source of failure. What we can do is build resilience, in advance, into complex distributed systems where failure is inevitable.

Service coordination during network failures can be addressed by implementing alternative routes or service discovery. The latter pattern provides dynamic service locations and client-side load balancing, and allows services to be scaled horizontally on demand.
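
As a toy illustration of the idea, the sketch below keeps an in-memory registry and picks instances round-robin on the client side; in a real system this role is played by a registry such as Eureka or Consul, and the class name here is made up:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Toy service registry with round-robin client-side load balancing.
public class SimpleServiceRegistry {

    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    // Instances register (and deregister) themselves dynamically as they scale up and down.
    public void register(String serviceName, String baseUrl) {
        instances.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>()).add(baseUrl);
    }

    // Clients resolve a logical name to a concrete instance, round-robin.
    public String resolve(String serviceName) {
        List<String> urls = instances.getOrDefault(serviceName, List.of());
        if (urls.isEmpty()) {
            throw new IllegalStateException("No instances registered for " + serviceName);
        }
        int next = counters.computeIfAbsent(serviceName, k -> new AtomicInteger()).getAndIncrement();
        return urls.get(Math.floorMod(next, urls.size()));
    }
}
```

A caller then resolves a logical name, e.g. registry.resolve("payment-service"), instead of hard-coding a host, which is what lets instances join and leave transparently.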

A high load can be addressed with auto-scaling, scaling services up and down on demand. Obviously, auto-scaling requires proper configuration, testing, and support. In addition, instance healers can watch service instances using health checks and heal (restart or replace) instances if necessary.
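
For the health checks themselves, a bare-bones liveness endpoint is often enough for an orchestrator or instance healer to probe. The sketch below uses only the JDK’s built-in HTTP server; the port and path are arbitrary choices:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal liveness endpoint that an auto-scaler or instance healer can poll.
public class HealthCheckServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/health", exchange -> {
            byte[] body = "{\"status\":\"UP\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```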

Issues with a cascading call stack can be addressed by applying several practices: use service timeouts and timeout propagation to tell a server how long a client is willing to wait for an answer; use cancellation propagation when a caller is no longer interested in the result; and use client-side and server-side flow control to balance computing power and network capacity between a client and a server. Luckily, most of these practices are already implemented out of the box by RPC frameworks such as gRPC.
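
For example, gRPC exposes deadlines directly on the client stub and propagates the remaining deadline to downstream calls made while handling the request. The sketch below assumes a hypothetical PaymentService defined in a proto file, so PaymentServiceGrpc and ChargeRequest stand for generated code that is not shown here:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.TimeUnit;

public class PaymentClient {

    public void charge(String orderId) {
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("payment-service", 6565) // assumed host and port
                .usePlaintext()
                .build();
        try {
            // The client states how long it is willing to wait; gRPC propagates
            // the remaining deadline to downstream RPCs made while handling this call.
            // PaymentServiceGrpc and ChargeRequest are assumed generated classes.
            PaymentServiceGrpc.PaymentServiceBlockingStub stub =
                    PaymentServiceGrpc.newBlockingStub(channel)
                                      .withDeadlineAfter(500, TimeUnit.MILLISECONDS);

            stub.charge(ChargeRequest.newBuilder().setOrderId(orderId).build());
        } catch (StatusRuntimeException e) {
            if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
                // Fail fast instead of letting the whole call stack hang.
                System.err.println("Payment call timed out for order " + orderId);
            } else {
                throw e;
            }
        } finally {
            channel.shutdownNow();
        }
    }
}
```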

The great and popular Circuit Breaker pattern can address cascading failures across multiple services. It is implemented in popular fault-tolerance libraries, such as Hystrix, so you don’t need to code it from scratch. On the negative side, however, it requires additional time for implementation, configuration (thresholds, timeouts, etc.), and monitoring. Also, engineers tend to forget that not all errors should trip the circuit; some reflect normal failures and should be handled as part of the regular logic.
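
As an illustration, a minimal Hystrix command might look like the sketch below; the payment call itself is a placeholder. Note the use of HystrixBadRequestException for “normal” business failures, such as validation errors, that should not trip the circuit:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.exception.HystrixBadRequestException;

// Wraps the remote payment call in a circuit breaker with a fallback.
public class ChargePaymentCommand extends HystrixCommand<String> {

    private final String orderId;

    public ChargePaymentCommand(String orderId) {
        super(HystrixCommandGroupKey.Factory.asKey("PaymentService"));
        this.orderId = orderId;
    }

    @Override
    protected String run() {
        if (orderId == null || orderId.isEmpty()) {
            // A validation error is a "normal" failure:
            // it must not count towards opening the circuit.
            throw new HystrixBadRequestException("orderId is required");
        }
        return callPaymentService(orderId);
    }

    @Override
    protected String getFallback() {
        // Executed when the call fails, times out, or the circuit is open.
        return "PAYMENT_PENDING";
    }

    private String callPaymentService(String orderId) {
        // Placeholder for the actual HTTP/gRPC call to the payment service.
        return "PAYMENT_COMPLETED";
    }
}
```

Running the request through the breaker is then just new ChargePaymentCommand(orderId).execute().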

I hope the few high-level items above have convinced you that fault tolerance is a crucial aspect that needs to be addressed while designing and implementing a microservices architecture. The earlier, the better.

Eventual Data Consistency

Since a microservices architecture follows the AP side of the CAP theorem, data in the system will be eventually consistent. Thus, it would be great to store data in a way that:

  • Guarantees replay to a correct state if one or more of the services are down.
  • Guarantees replay to a proper state if the middleware between services is down.
  • Provides visible auditing and a history of changes.

For instance, event sourcing is a great fit for a microservices architecture. Have a look at the presentation by Chris Richardson and Kenny Bastani.

The strategy might be the following (a code sketch follows the list):

  • Persist events, not the current state.
  • Identify state-changing domain events.
  • Replay events to recreate state if needed.
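
Here is a minimal sketch of the idea, with made-up event and aggregate names: instead of updating an Order row in place, the service appends domain events and rebuilds the current state by replaying them:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain events; real ones would carry timestamps, ids, etc.
interface OrderEvent {}
record OrderCreated(String orderId, long amountCents) implements OrderEvent {}
record PaymentCompleted(String orderId) implements OrderEvent {}

// The aggregate's current state is a pure function of its event history.
class Order {
    private String orderId;
    private String status = "NEW";

    void apply(OrderEvent event) {
        if (event instanceof OrderCreated e) {
            this.orderId = e.orderId();
            this.status = "NOT_PAID";
        } else if (event instanceof PaymentCompleted) {
            this.status = "CONFIRMED";
        }
    }

    // Replay the persisted events to recreate the state, e.g. after a restart.
    static Order replay(List<OrderEvent> history) {
        Order order = new Order();
        history.forEach(order::apply);
        return order;
    }

    String status() { return status; }
}

class EventSourcingDemo {
    public static void main(String[] args) {
        List<OrderEvent> store = new ArrayList<>(); // stands in for the event store
        store.add(new OrderCreated("42", 1999));
        store.add(new PaymentCompleted("42"));
        System.out.println(Order.replay(store).status()); // prints CONFIRMED
    }
}
```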

On top of that, the Command Query Responsibility Segregation (CQRS) pattern can be applied, using a different model to update information than the model used to read it. The combination of event sourcing and CQRS can have many benefits. From a consistency point of view, the write model is consistent, events are ordered, and transactions are quite simple. From a data storage point of view, a normalized model can be used on the command side and a denormalized model on the query side. As a result, scalability comes almost for free, since we can scale the command and query sides independently.
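
Continuing the sketch above, the query side can maintain its own denormalized read model by subscribing to the same events. The projection class below is hypothetical and keeps the view in memory for brevity; in production it would write to a dedicated read store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Query-side projection: a denormalized view optimized for reads,
// updated asynchronously from the same domain events the write side emits.
class OrderSummaryProjection {

    private final Map<String, String> orderStatusView = new ConcurrentHashMap<>();

    // Called for every event consumed from the event store or message broker.
    void on(OrderEvent event) {
        if (event instanceof OrderCreated e) {
            orderStatusView.put(e.orderId(), "NOT_PAID");
        } else if (event instanceof PaymentCompleted e) {
            orderStatusView.put(e.orderId(), "CONFIRMED");
        }
    }

    // The UI reads from this model without touching the command side.
    String statusOf(String orderId) {
        return orderStatusView.getOrDefault(orderId, "UNKNOWN");
    }
}
```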

The drawback (beyond the additional complexity, of course) is that the decision whether to use CQRS and event sourcing has to be made in advance, or at least as soon as possible. Deciding late to move to an event-driven architecture might require a service rewrite and an additional data migration.

Check back later, when I will share my experience with data migration for microservices.