How to approach duplicating big data sets in a (My)SQL database?
emmalee last edited by
The web application I'm working on is a sort of a project management tool. The underlying database engine is MySQL. Each project is stored in the DB as a highly normalized SQL model (it uses roughly 100 tables for all kinds of things like project pages, project page details, attachments, internationalization of all these etc.). The model also uses the concept of revisions, so the model is roughly as follows:
- primary key: project_id
primary key: project_revision_id
foreign key: project_id
about a hundred of other tables, like project_page, or project_page_attachment, or project_permissions, all refering, directly or indirectly, to project_revision_id.
Now, the problem we're facing is very slow performance of operations related to creating a new revision of a project. Let's say we have project in revision 12, and we want to create a new revision of it. Within the above-described model, how it works is we need to create a duplicate of all data within all these 100 tables - we're basically selecting data from all tables where project_revision_id = 12 and inserting it with project_revision_id = 13. For big projects with a lot of data this can be as much as 1,000,000 records, and it currently takes us about 15 mins to execute such action for such big projects.
I'm wondering which directions we should explore when it comes to making it much more performant. One direction we're exploring already is Hibernate Envers, which is about introducing the 'delta' tables, which would keep track of all changes applied & would keep the relatively easily selectable 'last state' of each table.
But as introducing Envers is a big undertaking, and as we still have some questions about the performance of it in general. Before we invest a lot of time into going down this path, maybe there are other directions worth analyzing? Maybe there's solutions to rapidly duplicate big portions of data in MySQL? Other ideas?
Bogopo last edited by
Is this a way to describe your situation? If not, does this sound "better"?
project_revisionscontains dozens of columns with "ids" pointing to key-value tables with the actual numbers/strings. There is one row per "revision" per "project".
PRIMARY KEY(project_id, revision_number).
projectshas one row per "project". It might have nothing more than two columns: An unchanging
revision_number. Notice how those two columns let it efficiently reach into
project_revisionsto get the "current" revision of a given project:
SELECT pr.* FROM projects AS p JOIN project_revisions AS pr USING(project_id, revision_number) WHERE p.project_id = ?
Add a new revision
- Copy the latest revision into a new row in
- Give it a new
- Change any "attributes" that need changing (see below)
projectsto contain the new
As for changing an attribute (pages, details, attachments, etc), they are key-value with two columns and, for example,
If your processing knows that an attribute is not changing, there is no need to do anything for that attribute. On the other hand, if an attribute is changing (or "might be changing"), you can
INSERT IGNOREthe new
valueinto the table. (
INDEX(value); more discussion below), fetch the id and change the id for that attribute in that revision.
Since you can't have an
TEXTcolumn something different is needed for that case. If you like the design so far, I will elaborate in how to deal with that.
From another angle, I find that there is such a thing as "over-normalization". With "about a hundred" such, I suspect you are well into that.
There are two reasons for normalization:
- Centralizing where something can be changed. This seems to be totally unnecessary since you are keeping a history of old values anyway.
- Performance -- shrink the disk space by avoiding repeated copies. Note that my proposal avoids repeating any attribute when a revision does not change that attribute.
When using normalization for saving space, consider that
- There is space overhead in normalization, hence "short" values (say, less than 10 bytes) do not actually save space. This is due to the "id" in both tables, etc.
- When searching by a range (numeric, date, float, etc), performance becomes really bad if the value is not in the main table. Ditto if the search is via