Needles in Haystacks - Finding Differences Between Two Multi-Terabyte Datasets

Etai Litov | Software Engineer

When you think about the key functionality of a backup and recovery tool, the ability to restore data quickly is critical. But before that step, you must be able to identify the exact data you need to restore.

At Own, one challenge we need to solve in order to back up and restore massive amounts of data is finding what changed between two backups, and what we need to fix.

Comparison is usually considered an easy operation; most software engineers have, at one point or another, loaded two collections into memory and compared them to find the differences. But when handling large amounts of complex data, processes that used to seem trivial can become much more problematic.
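At small scale, that in-memory approach looks something like the sketch below (a simplified illustration, not our production code), where each snapshot is just a dictionary keyed by record id:

```python
# A simplified sketch of the small-scale approach: key each snapshot by record
# id, load both dictionaries fully into memory, and diff them directly.
# The record shapes and ids below are illustrative.

def diff_records(before, after):
    before_ids, after_ids = set(before), set(after)
    return {
        "deleted": sorted(before_ids - after_ids),
        "added": sorted(after_ids - before_ids),
        "modified": sorted(
            rid for rid in before_ids & after_ids if before[rid] != after[rid]
        ),
    }

snapshot_a = {"001": {"Name": "Acme", "Tier": "Gold"}}
snapshot_b = {"001": {"Name": "Acme", "Tier": "Silver"}, "002": {"Name": "Globex"}}
print(diff_records(snapshot_a, snapshot_b))
# {'deleted': [], 'added': ['002'], 'modified': ['001']}
```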

When a large company compares data within the Own Recover application, we use two snapshots. Each snapshot is a representation of the database at a specific point in time. Each snapshot can hold billions of records from thousands of tables, which can easily add up to tens of terabytes.

Once you reach this kind of scale, two problems arise: whether you can load the data into memory at all, and how fast you can run the compare job.

Because customers are often under pressure to find out what data got corrupted or lost, they need to act fast to respond. Of course, since terabytes of data aren't going to fit into memory, we need to start thinking of different ways to store and load it.
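One general technique that avoids loading everything at once is to stream both snapshots in primary-key order and merge-compare them record by record. The sketch below is a simplified illustration of that idea, assuming the storage layer can hand us records already sorted by id; it is not Own's actual implementation:

```python
from typing import Any, Iterator, Tuple

# Sketch of a streaming merge-compare. Both snapshots are read as iterators of
# (record_id, record) pairs already sorted by id (for example, exported in key
# order), so only two records need to sit in memory at any moment.
_END = (None, None)

def stream_diff(left: Iterator[Tuple[str, Any]], right: Iterator[Tuple[str, Any]]):
    l, r = next(left, _END), next(right, _END)
    while l is not _END or r is not _END:
        if r is _END or (l is not _END and l[0] < r[0]):
            yield ("deleted", l[0])          # present in the old snapshot only
            l = next(left, _END)
        elif l is _END or r[0] < l[0]:
            yield ("added", r[0])            # present in the new snapshot only
            r = next(right, _END)
        else:
            if l[1] != r[1]:
                yield ("modified", l[0])     # same id, different payload
            l, r = next(left, _END), next(right, _END)

# Toy usage with small in-memory iterators standing in for sorted exports.
old = iter([("001", "Acme"), ("002", "Globex")])
new = iter([("001", "Acme Corp"), ("003", "Initech")])
print(list(stream_diff(old, new)))
# [('modified', '001'), ('deleted', '002'), ('added', '003')]
```

Because each iterator is consumed exactly once and only two records are held at a time, memory stays bounded no matter how large the snapshots grow.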

There are a few different considerations when handling big data, whether it lives in flat files, SQL databases, or managed storage services.

First, the snapshots' storage needs to be secure. The data we are holding is some of the most important data companies have, and as a SaaS company that supports multiple tenants, we need to make sure no customer can accidentally access another customer's data. The snapshots' storage also needs to support every operation we run, so we aren't spending too much time migrating data from one place to another.

Many databases have their own powerful compare tools, which are able to provide sophisticated indexing and storage. But in our case, when all the data needs to be compared, does the overhead from using such a tool outweigh its effectiveness?
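When the answer is no, one lightweight alternative is a flat full scan that compares per-record fingerprints instead of full payloads, so most of the bytes never have to move. Here is a rough sketch of that idea, assuming each record can be serialized deterministically; the canonical-JSON choice is illustrative, not a prescription:

```python
import hashlib
import json

# Sketch: reduce each record to a fixed-size fingerprint so a full scan moves
# and compares a few dozen bytes per record instead of the whole payload.
# Canonical JSON (sorted keys) is an illustrative serialization choice.

def fingerprint(record):
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"Name": "Acme", "Tier": "Gold"}
b = {"Tier": "Gold", "Name": "Acme"}
print(fingerprint(a) == fingerprint(b))  # True: field order doesn't change the hash
```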

These are some of the points you need to consider when searching for the optimal solution. But even after solving all of the challenges above, we're still not done.

Since we are comparing two live snapshots from different times, and maybe even different environments, the schemas of the two databases can differ. Figuring out which changes are relevant and how to identify differences between the two snapshots is important and can have a huge impact on the recovery process. Is a removed column important? Or can we ignore it?
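In practice, that means a schema-level pass has to run before the record-level compare, so we know which columns exist on only one side and which can actually be compared. A simplified sketch, with illustrative column names:

```python
# Sketch of a schema-level pass: given the column sets of the same table in
# both snapshots, split them into removed, added, and comparable columns.
# The column names below are illustrative.

def diff_schema(columns_before, columns_after):
    before, after = set(columns_before), set(columns_after)
    return {
        "removed_columns": sorted(before - after),
        "added_columns": sorted(after - before),
        "comparable": sorted(before & after),
    }

print(diff_schema({"Id", "Name", "Fax"}, {"Id", "Name", "Phone"}))
# {'removed_columns': ['Fax'], 'added_columns': ['Phone'], 'comparable': ['Id', 'Name']}
```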

In addition, if some fields or objects are read-only and don't have insert or update permissions, the context of the comparison suddenly changes the picture. Should we waste time comparing objects we don't care about?
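One way to encode that context is to strip fields we couldn't restore anyway from both records before they are compared, so read-only noise never surfaces as a difference. A simplified sketch, assuming the set of writable fields is known up front:

```python
# Sketch: drop fields we could not restore anyway (read-only, no insert or
# update permission) before comparing, so they never surface as differences.
# `writable` here is an assumed, precomputed set of field names.

def restorable_view(record, writable):
    return {field: value for field, value in record.items() if field in writable}

writable = {"Name", "Phone"}
before = {"Name": "Acme", "Phone": "555-0100", "LastModifiedDate": "2023-01-01"}
after = {"Name": "Acme", "Phone": "555-0199", "LastModifiedDate": "2024-01-01"}
print(restorable_view(before, writable) != restorable_view(after, writable))
# True: only the writable Phone field differs; the read-only timestamp is ignored
```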

These scenarios are just some of the challenges you can help solve when working for a company like Own that handles massive amounts of complex data. Scale can transform a trivial problem into a complex one, and striking the right balance to find the optimal solution is what my colleagues and I work hard to do each day.

Own is growing rapidly and is currently hiring for multiple positions, including software engineers. Watch this video to hear what our employees like most about working at Own.
