How to run a distributed system handling over 50,000 transactions per day

I recently wrapped up a B2C food delivery project. I'd like to share a brief summary for my own future review, and I hope you can find some inspiration while reading it.

In this project we have 5 physical servers; all of the infrastructure, services, and databases are hosted on them.

Service governance

Kubernetes is the go-to choice in most articles about distributed systems, but in my view it is overkill for most startup scenarios.

To run Kubernetes, we would have to stand up a full Kubernetes stack and find a skilled expert who knows how to manage it and write its obscure configuration. That is neither easy nor cheap. For me, it was not a happy journey even after we finally got it running.

In this project I chose Docker Swarm to orchestrate the services. We have worked with Docker for a long time, and Swarm is lightweight, fast, and requires no additional knowledge.

Deployment strategy

We use Jenkins to deliver the services to the servers. Highly recommended.

To roll out the services I introduced blue-green deployment: the services are split into two groups, hosted on two physical servers and scaled by Docker Swarm.

In this approach only one group serves customers at a time; we call it the hot group. The other one we call the cold group.

We don't shut the cold group down completely; we just assign it the minimum resources needed to maintain basic functionality.

The cold group lets the QA team run regression tests at low cost. Because the cold group runs against production data, the regression results are reliable.

Once we get the green light from the QA team, the PO can push the new version to production from Jenkins with one click.
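
To make the switch concrete, here is a minimal sketch of how such a promotion could be scripted with the Docker CLI. The service names (api_blue, api_green, proxy) and the ACTIVE_GROUP variable the proxy is assumed to read are illustrative, not the project's actual configuration.

```python
# A minimal sketch of the blue/green switch, assuming two hypothetical Swarm
# services named "api_blue" and "api_green" and a front proxy that reads an
# ACTIVE_GROUP environment variable to decide where to route traffic.
import subprocess

def run(cmd):
    # Run a docker CLI command and fail loudly if it errors.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def switch_hot_group(new_hot: str, new_cold: str,
                     hot_replicas: int = 6, cold_replicas: int = 1) -> None:
    """Promote `new_hot` to serve customers and shrink `new_cold`
    to the minimal replicas needed for QA regression tests."""
    # 1. Scale the incoming group up before it takes traffic.
    run(["docker", "service", "scale", f"api_{new_hot}={hot_replicas}"])
    # 2. Tell the proxy which group is hot (how the proxy consumes this
    #    variable is project-specific; this is only an illustration).
    run(["docker", "service", "update", "--env-add",
         f"ACTIVE_GROUP={new_hot}", "proxy"])
    # 3. Keep the cold group alive with minimal resources for QA.
    run(["docker", "service", "scale", f"api_{new_cold}={cold_replicas}"])

if __name__ == "__main__":
    switch_hot_group(new_hot="green", new_cold="blue")
```

In a setup like this, the one-click Jenkins job would simply run such a script as its final step once QA approves.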

Database improvements

About 50,000 transactions are created per day in the MySQL database, which brings plenty of challenges on the database side. I made the improvements below to boost its performance.

  • Switched the physical disks to SSDs. This is crucial: an SSD offers more than 10 times the performance of an HDD. SSDs are more expensive, but please don't be cheap on this one; low I/O performance can destroy your project.
  • Added Redis as a cache layer to reduce the query workload. At this step you have to take care of data consistency; see the cache-aside sketch after this list.
  • Tuned the MySQL configuration for the best performance. In particular, set max_connections, innodb_buffer_pool_size, and innodb_thread_concurrency to reasonable values.
  • Added another MySQL instance as a slave to improve query performance.
  • Balanced requests with ProxySQL, a powerful load-balancing tool for MySQL.
  • Added Elasticsearch to take over the workload of complex queries.
  • Split tables by month to improve write performance.
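
For the Redis cache layer, the consistency rule we care about is simple: read through the cache, and invalidate the cached entry whenever MySQL is updated. Below is a minimal cache-aside sketch; load_order_from_mysql() and save_order_to_mysql() are placeholder names for the existing data-access code, not the project's actual functions.

```python
# A minimal cache-aside sketch for the Redis layer, assuming a local Redis
# instance; the MySQL helpers below are placeholders for the real data layer.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 300  # seconds; a short TTL limits how long a stale entry can live

def load_order_from_mysql(order_id: str) -> dict:
    # Placeholder for the real MySQL query.
    return {"id": order_id, "status": "created"}

def save_order_to_mysql(order: dict) -> None:
    # Placeholder for the real MySQL update.
    pass

def get_order(order_id: str) -> dict:
    key = f"order:{order_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit: MySQL is not touched
    order = load_order_from_mysql(order_id)       # cache miss: read from MySQL
    r.setex(key, CACHE_TTL, json.dumps(order))    # populate the cache
    return order

def update_order(order: dict) -> None:
    save_order_to_mysql(order)                    # write to MySQL first
    r.delete(f"order:{order['id']}")              # then invalidate (not update) the cache
```

Invalidating instead of rewriting the cached value on update keeps the consistency logic simple: the next read repopulates the cache from MySQL, so the cache never holds a value MySQL has not confirmed.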

Database backup strategy

In the first approach we created two MySQL instances, one as the master and one as the slave, both running on HDDs. After total orders passed 3,000,000, the backup strategy broke down because of poor I/O performance.

The master and slave synchronize records through binlogs, which already puts heavy I/O load on the slave instance. When we backed up the database with mysqldump on the slave, the system couldn't handle the extra I/O at all; the flood of I/O requests blocked it. That's why SSDs, and high I/O performance in general, matter so much.

In the current approach, all HDDs have been replaced with SSDs, and I created an additional MySQL instance on an existing server that is used only for backups and the marketing team's complex queries.

We dropped mysqldump and replaced it with xtrabackup. It is an amazing tool: a full backup takes only a few minutes on a 50 GB database, and incremental backups are supported.
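
As an illustration, a backup job built around xtrabackup can look like the sketch below. It assumes Percona XtraBackup is installed, MySQL credentials are picked up from the usual client configuration, and the /data/backups paths are only examples.

```python
# A minimal sketch of a backup job wrapping Percona XtraBackup.
# Paths are illustrative; credentials are assumed to come from my.cnf.
import datetime
import subprocess

BASE_DIR = "/data/backups/base"

def full_backup() -> None:
    # Full physical backup; on SSDs this finishes in a few minutes for ~50 GB.
    subprocess.run(["xtrabackup", "--backup", f"--target-dir={BASE_DIR}"],
                   check=True)

def incremental_backup() -> None:
    # Incremental backup containing only the pages changed since the base backup.
    target = f"/data/backups/inc-{datetime.date.today():%Y%m%d}"
    subprocess.run(
        ["xtrabackup", "--backup", f"--target-dir={target}",
         f"--incremental-basedir={BASE_DIR}"],
        check=True,
    )

if __name__ == "__main__":
    incremental_backup()
```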

Idempotency

We use Redis to generate order IDs, which ensures the IDs are unique and reliable.

This approach works perfectly; the only concern is how to keep the system running if Redis stops responding.

We resolve this by degrading the request to the database layer. Even if Redis is completely down, the system keeps working as expected.
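
A minimal sketch of this idea is shown below: Redis INCR hands out the sequence number on the happy path, and a database-backed counter takes over when Redis does not respond. The key name and the next_id_from_database() helper are illustrative assumptions, not the project's actual code.

```python
# A minimal sketch of Redis-based order ID generation with a database fallback.
import datetime
import redis

r = redis.Redis(host="localhost", port=6379, db=0, socket_timeout=0.2)

def next_id_from_database(day: str) -> int:
    # Placeholder for the degraded path, e.g. reserving the next value from an
    # AUTO_INCREMENT sequence table in MySQL.
    raise NotImplementedError

def next_order_id() -> str:
    day = datetime.date.today().strftime("%Y%m%d")
    try:
        # Normal path: INCR is atomic, so concurrent callers never collide.
        seq = r.incr(f"order:seq:{day}")
    except redis.RedisError:
        # Degraded path: if Redis times out or is down, fall back to the
        # database so order creation keeps working (slower, but still unique).
        seq = next_id_from_database(day)
    return f"{day}{seq:08d}"
```

Because INCR is atomic, concurrent order requests never receive the same ID while Redis is healthy; the fallback trades some latency for the same uniqueness guarantee.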