hierarchical data best practice in relational database
-
we have 1 table that also references itself in
parent_id
table high level structure: (let's assume that data is above 1M rows)
category_table id int name varchar parent_id int
the goal is to get the categories with their parent category names.
i have 2 approach in mind
A) do a select query and join
category_table
to itselfselect a.*, b.name from category_table a left join category_table b on a.id=b.parent_id
B) add another column for the parent name =
parent_name
category_table id int name varchar parent_id int parent_name varchar
then do a single select statement to fetch the data
select * from category_table
my worries are, A is what i think the right approach is but when fetching data it may be slow. B is faster than A however i will introduce a redundant column and will be a challenge when i update the real parent name
would like to know your thoughts on which one is the best way to approach this and if you have any other suggestions
-
As with all denormalization decisions, it boils down to two things:
How important is it to optimize the query performance? If the query with the JOIN provides satisfactory performance, then use that, because the database will enforce its own data integrity to prevent update anomalies. However, if the difference is performance is great enough that you definitely cannot satisfy your performance objective if you use the JOIN solution, then you must use the denormalization.
Do you have a robust plan for accounting for the data anomalies? You correctly identified that you could have a challenge if category names change, because you need to remember to update them in multiple rows. How confident are you that you can write the code perfectly to do that? Alternatively, do you have a plan for detecting and correcting anomalies that occur if updates fail to update the category name in all necessary places? Do the updates and the correction code account for concurrent updates?