Db Migration
Whilst performing the server migration (detailed in some of my previous posts) I had to perform a database migration of a Postgres instance.
5 Million Data Points
This database though not critical is a basic Data Warehouse that is the data storage for a data scraper application. This means that although some down time is acceptable a large window of downtime of the database would result in a large data gap.
Through small in size the main fact table of the data warehouse had 5,183,352 data points (rows).
Migration Plans
I had originally planned to just copy the underlying storage across between the hosts. However this had some complications as the original Postgres instance was hosted in Docker with it’s storage just a Docker Volume and the new Postgres instance (also in Docker) storage is mapped to a NAS over NFS.
This meant that permissions could become an issue and I’ll go over this (with regards to the NAS) in a future blog post. TLDR; the permissions if I was to just copy may not be correct and I would have to try and solve this from the NFS client not the NFS storage.
Unfortunately this is only something that I would have to do once so coming up with a complex, reusable solution would probably take longer than the actual copying of the data.
Solution
Using a simply Jupyter notebook I created a simple ETL process to “lift and shift” the data between the databases.
stmt = select(data_fact_table)
batch_size = 1000
with moya_postgres.connect() as conn:
with docker_postgres.connect() as conn2:
data_facts = conn.execute(stmt).fetchall()
print(f"{len(data_facts)} records to process")
for i in range(0, len(data_facts), batch_size):
rows = data_facts[i:i+batch_size]
for row in rows:
insert_stmt = insert(data_fact_table).values(
uuid = row.uuid,
timestamp = row.timestamp,
value = row.value,
fuel_id = row.fuel_id,
region_id = row.region_id,
power_id = row.power_id,
date_id = row.date_id,
time_id = row.time_id
)
conn2.execute(insert_stmt)
conn2.commit()
print(f"Processed {i+batch_size} records")
Note: Not shown in the example above was reading of all the records from the source database.
I originally tried to insert all the records into the table and then calling “commit()” however this timed out and failed. I then just sliced the array and then inserted the records in a batch of 1,000 and this was able to process the data.
Although this worked it was fast to read the records from the database but very slow to insert the records. In fact to copy all the records it took 475 minutes and 38.5 seconds (~7 hours 55 minutes). I ran this late at night and kept an eye on it for around 3 hours and then decided to go to bed.
When the copying started I also enabled the service on the new server so that it would start scraping the data as well as leaving the original one running. This meant that if the data migration failed no data was lost and I would just have to fix and re-run the migration again.
Another important consideration if you were doing this in a production setting is being able to pick up mid way through so you wouldn’t have to copy all the data from scratch every time you run the migration.
As the data copied successfully I didn’t deep dive into why the insert was so slow but I think it was due to the fact it was calling insert with a single record 1,000 times before committing (saving); instead of calling insert with 1000 records at a time.
In this example from sqlalchemy you can see that they have included multiple records in the insert statement and I tried this quickly however it had a type conversion error so I just went back single record inserts as was working and wanted to get the task completed that night.