The Visual Content team I work with in trivago has a sub-team called Image Concepts team, which mainly works on image classification research projects.
Previously, the two teams implemented their own infrastructure and the sub-team heavily depended on the core team’s data. They used the data streams from the
We eventually realized that it would make much more sense to merge the two teams. However, this idea not only needed people to work together, but also, an architectural re-design was required.
Our databases were hosted in separate Amazon RDS (Relational Database Service) instances and as a step towards merging the two teams, we decided to do some research on merging both database instances.
NOTE: The merging we’re talking about here is bringing two databases into the same RDS instance, but not merging the database in the table level.
Evaluating the Merge
Why isn’t it that simple?
- Billions of records on each database.
- Heavy queries and processes running on the databases.
- Read replication and auto-scaling are enabled on our databases.
- Database streamers listening to our tables.
RDS Storage limits
Amazon RDS uses Amazon EBS (Elastic Block Store) to store the data and log files. Depending on the storage required, RDS will automatically resize across multiple EBS instances to improve the performance of your RDS instance. However, it comes with a storage limitation.
- General Purpose SSD
- MySQL derivatives and Postgres = 20 GiB – 32 TiB
- SQL Server variations = 20 GiB–16 TiB
- Provisioned IOPS SSD
- MySQL derivatives and Postgres = 100 GiB – 32 TiB
- SQL Server variations = 20 GiB or 100 GiB – 16 TiB
Because of this, we had to find the individual storage sizes of our databases to make sure it would still fit into the limit after the merge. We checked the sizes of the data files, indexes and bin logs, and then, we used the below query to calculate data length and index lengths:
SELECT table_schema 'Database Name', ROUND(SUM(data_length + index_length) / (1024 * 1024), 1) 'DB Size in MB' FROM information_schema.tables GROUP BY table_schema;
(If you want data length and index length separately, simply update the parameters inside the SUM function.)
Since we are highly dependent on the binary logs of the databases (because we stream the database changes using them), we also checked its sizes. We used the SQL command shown below to see the bin logs:
SHOW BINARY LOGS;
Once you’ve checked the total size of your database, you can verify whether you will comply with the storage limits (-which most probably you will be).
Happy with the storage limitations, so now you think you’re good with merging? Not that easy!
RDS Instance Performance
It’s NOT just the storage that you should consider. What about the performance? There are several factors affecting the performance of your RDS instance. So, you need to evaluate if your RDS instance will still perform well after merging.
RDS IOPS (I/O operations Per Second)
You should consider the IOPS of your instance and if it will be able to handle the load it gets. This is highly dependent on the size of your data on the EBS. AWS has some limitations related to the IOPS and this is calculated with IO credit balance, which represents the available bandwidth for your RDS instance. You may go through the limits in this table and compare it with your data sizes, so you can get an idea about IO delays enforced on your RDS instance.
Read more about burst performance in this article.
RDS Instance Type
When you bring two databases into a single RDS instance, you definitely have to evaluate the idea about upgrading the RDS instance type.
First, find out how much of CPU, Memory, and number of maximum database connections you’ll need. It also depends on the load you’re going to have on the RDS instance. So, you have to check the average resource usage and its spikes. It would be enough if you get an RDS instance to cater your average load, then enable replication (along with some code updates) and auto-scaling to handle the spikes. This will save a lot of $$$.
To benefit from database level load balancing, you can enable read-replication along and implement your read queries to use the read endpoint. Same goes for the write queries to use the write endpoint.
Moreover, you can enable auto-scaling for the RDS instances and handle heavy loads on your database. Then AWS will create temporary RDS read instance(s) to cater the reads and terminate them when the load is less.
Max DB Connections Limit
You may also need to consider the maximum number of simultaneous database connections required if the two databases are brought into one RDS instance. This should also be a major factor when deciding the RDS instance type, and you can check this table to know the limits for your RDS instance type.
If you don’t write your code to reuse the database connections, you might end up easily exceeding the database connection limit. And if you define your auto-scaling policy not to consider the database connections, your instance will be never be scaled automatically. And then, all the new database connections will be refused. Even if you add database connections to the auto-scaling policy, your instances will be scaled sooner (even when it’s not required) if you’re not reusing the database connections in your code.
After we collect these statistics about our databases, then we can compare the pros and cons of bringing the databases into one RDS instance.
What we’ll gain?
Utilize the RDS instance resources better and save money.
If we have one db.r4.2xlarge ($0.64/h) instance and another db.r3.large ($0.32/h) instance, we can bring the two databases into the same RDS instance, but make sure to keep the instance type as db.r4.2xlarge, and enable read-replication and auto-scaling. Then we may save almost all the costs we had spent on the second database.
Reference for the prices: https://www.ec2instances.info/rds/?region=eu-west-1
Query both databases at once
When the two databases are in the same server, we can write queries to join tables from both of the DBs. This is very helpful if you’re going to merge two dependent RDS instances.
We can get rid of the additional infrastructure
If we have database streamers (i.e Debezium), streams (i.e. Kafka, Kinesis) and sinks (i.e. Kafka connect) to sync data from one database to the other, we can get rid of those infrastructure components including the duplicate data.
Read-replication, auto-scaling and backups are not affected.
AWS has updated the RDS services and now read-replication, auto-scaling and backups are handled in RDS instance level, not in the DB level.
If you had enabled read-replication for an RDS instance, and then create another database inside that server, all the changes are replicated into the read-replica instances as well. Since auto-scaling is also applied in the instance level, all of your databases in the source instance will be replicated to the new RDS instance. The same concept is applied to RDS backups, where AWS takes a snapshot of your RDS instance and creates it as a new instance when restored.
What we’ll lose?
Are we violating software engineering concepts?
If we’re merging the databases just for the sake of doing so or only to save money, we might end up with creating infrastructure level dependency, which is BAD!
Exceeding the EBS size limit
If we need to implement a new feature/project and it requires a lot of space in the databases, we might exceed the provisional size limit. But, it’s YAGNI (You Aren’t Going to Need It).
Single point of failure!
If we need to restart the instance, resize it, or apply auto scaling, the consumers of both databases will be affected. Also, when heavy queries are executed on the DB, it can affect other processes using either of the databases.
When you bring the two databases into one RDS instance, you have to change the database connection configs used, especially in the processes which previously used the old RDS instance.
If you re-evaluate the architecture design, you may find some components are not required after the merge. For example, you can get rid of components you used to sink the source tables to the consumer database, such as streamers, streams and sinks.
Monitoring will be hard(er)
If we need to monitor the anomalies in the RDS instance, it’ll not be straightforward to know which database has the highest load. If you already have monitoring tools enabled for your RDS instances, you may have to reconfigure them to consider individual databases (if you need those details).
How to merge?
After considering everything, if your decision was still to bring the databases into one RDS instance, you can use the AWS’s native service called AWS Data Migration Service.
I’m not going to explain the steps, because that’s very simple and straightforward.
Before you need to merge two RDS database instances, you should do some research to determine if it would be beneficial to have the two databases in a single RDS instance.
You may need to compare your data sizes with your RDS instance limits, IOPS, and maximum database connection limits when deciding the output RDS instance type.
Then evaluate if you will truly benefit from bringing the two databases into one RDS instance. If your answer is still yes, then go for it! Let me know your thoughts as well 🙂