How to index according to a property that is a set/collection using a relational database

  • I would like to query in O(log N) time against a property that is a collection.

    How do I set up the indexes for this in a relational database such as MySQL or PostgreSQL?


    book: {
       title: "Hitchhiker's Guide to the Galaxy",
       genre: ["science-fiction", "comedy"]
    }

    I would like to do a query like this in O(log N) time:

    SELECT * FROM books WHERE genre='comedy'
    ( or perhaps: 'comedy' in genre )

    I've done this using Bigtable, which was very easy. I've also done it with Cassandra using secondary indexes maintained by application code, which I regret because of the complexity overhead involved.

    What is a good way to do this using a relational database such as MySQL or PostgreSQL? I would like to be able to configure the table and index(es) such that I do not have to maintain anything extra in code.

    I understand this is probably a very basic question; however, I had a lot of trouble googling this concept, and my relational database experience is very limited. I suspect there is a term/jargon for this situation, but I don't know enough to even know what it is.

  • Welp, since you're interested in a relational database solution (keyword being relational), the first step would be to store your data in the more normalized form that a relational database is designed for. For example, you would probably want to store your data in three related tables: Books, Genres, and BooksGenres. Here are some sample DDL scripts to create those tables:

    CREATE TABLE Books
    (
        BookId INTEGER PRIMARY KEY,
        Title VARCHAR(500)
    );

    CREATE TABLE Genres
    (
        GenreId INTEGER PRIMARY KEY,
        GenreName VARCHAR(100)
    );

    CREATE TABLE BooksGenres
    (
        BookId INTEGER,
        GenreId INTEGER
    );

    ALTER TABLE BooksGenres ADD PRIMARY KEY (BookId, GenreId);

    This will create a table dedicated to your Books data, a table for a unique list of Genres, and a table (BooksGenres) to relate the two together since they have a many-to-many relationship.

    One benefit of structuring it this way instead of using a single denormalized table is that if you change the name of a specific Genre, you only have to update a single record in the Genres table, and all of the Books related to that Genre automatically reflect that change when you join the two tables together.

    In a single denormalized table, you'd have to update every Book record that uses that Genre, which would be much less efficient.

    Now the other thing to note is that in most relational database systems, the PRIMARY KEY field(s) automatically get an index created for them (a CLUSTERED INDEX in SQL Server and MySQL's InnoDB; a plain unique index in PostgreSQL). This is why we did not need to write any additional code to create a specific index: the fields we're about to join on happen to be the PRIMARY KEY fields of each respective table, which makes the following query efficient out of the box:

    SELECT B.*, G.* -- Note using * is an anti-pattern for multiple reasons, so you should actually explicitly list out only the fields you need
    FROM Books AS B
    INNER JOIN BooksGenres AS BG
        ON B.BookId = BG.BookId
    INNER JOIN Genres AS G
        ON BG.GenreId = G.GenreId
    WHERE G.GenreName = 'comedy';
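    If you want to sanity-check this end to end, here's a minimal sketch using Python's built-in sqlite3 module (SQLite standing in for MySQL/PostgreSQL; the table and column names follow the DDL above, and the sample data is made up for illustration):

    ```python
    import sqlite3

    # In-memory database whose schema mirrors the Books / Genres / BooksGenres design above.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Books (BookId INTEGER PRIMARY KEY, Title VARCHAR(500));
        CREATE TABLE Genres (GenreId INTEGER PRIMARY KEY, GenreName VARCHAR(100));
        CREATE TABLE BooksGenres (
            BookId INTEGER,
            GenreId INTEGER,
            PRIMARY KEY (BookId, GenreId)
        );
    """)

    conn.execute("INSERT INTO Books VALUES (?, ?)", (1, "Hitchhiker's Guide to the Galaxy"))
    conn.executemany("INSERT INTO Genres VALUES (?, ?)", [(1, "science-fiction"), (2, "comedy")])
    conn.executemany("INSERT INTO BooksGenres VALUES (?, ?)", [(1, 1), (1, 2)])

    # The same three-way join as above, filtered to one genre.
    rows = conn.execute("""
        SELECT B.Title, G.GenreName
        FROM Books AS B
        INNER JOIN BooksGenres AS BG ON B.BookId = BG.BookId
        INNER JOIN Genres AS G ON BG.GenreId = G.GenreId
        WHERE G.GenreName = 'comedy'
    """).fetchall()
    print(rows)
    ```

    The query returns one row per (book, matching genre) pair, so a book tagged with several matching genres would appear several times unless you add DISTINCT.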

    Now suppose you did want to join or filter by a field that isn't the PRIMARY KEY of the table, and you wanted to index that field appropriately for efficiency, for example the Genres.GenreName field. Then you'd simply create a secondary index (called a NONCLUSTERED INDEX in SQL Server terms) on that field in that table like so:

    CREATE INDEX IX_Genres_GenreName ON Genres (GenreName);
    -- (SQL Server spells this CREATE NONCLUSTERED INDEX; MySQL and PostgreSQL use plain CREATE INDEX)

    The relational database system you use may even use that secondary index for the above query to service the WHERE G.GenreName = 'comedy' part of it, if the query optimizer finds it more efficient than the primary key index for reducing the data.
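    You can check whether the optimizer actually picks up such an index by inspecting the query plan. A small sketch, again using SQLite's EXPLAIN QUERY PLAN as a lightweight stand-in for MySQL's/PostgreSQL's EXPLAIN:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Genres (GenreId INTEGER PRIMARY KEY, GenreName VARCHAR(100))")
    conn.execute("CREATE INDEX IX_Genres_GenreName ON Genres (GenreName)")

    # The plan's detail text reports a SEARCH (an index seek) naming the
    # secondary index, rather than a full table SCAN.
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM Genres WHERE GenreName = 'comedy'"
    ).fetchall()
    detail = plan[0][-1]
    print(detail)
    ```

    The exact wording of the plan text varies between database systems and versions, but the SEARCH-vs-SCAN distinction is the part worth looking for.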

    You typically want to create indexes that cover the fields in your JOIN, WHERE, and HAVING clauses, since these are the clauses that reduce the data as it's being served for your query. Sometimes it's also helpful to index the fields in your ORDER BY clauses too, because the index keeps the data pre-sorted in that order and saves you a sort operation every time a query with that ORDER BY clause is executed.
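    The ORDER BY point is also visible in a query plan. In this sketch (SQLite again, as an assumption-laden stand-in for your actual database), ordering by the indexed column needs no sort step, while ordering by an unindexed expression forces an explicit temporary sort:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Genres (GenreId INTEGER PRIMARY KEY, GenreName VARCHAR(100))")
    conn.execute("CREATE INDEX IX_Genres_GenreName ON Genres (GenreName)")

    def plan(sql):
        # Collect every detail line of the query plan for a statement.
        return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

    # The index already stores rows in GenreName order, so no
    # "USE TEMP B-TREE FOR ORDER BY" step appears in the plan.
    indexed = plan("SELECT GenreName FROM Genres ORDER BY GenreName")

    # Ordering by an expression the index can't serve adds a sort step.
    unindexed = plan("SELECT GenreName FROM Genres ORDER BY GenreId + 0")
    print(indexed)
    print(unindexed)
    ```

    MySQL and PostgreSQL report the same idea differently ("Using filesort" and an explicit Sort node, respectively), but the underlying trade-off is identical.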

    Finally, your example is a relatively simple case, but it gets more complex as your schema evolves with more fields in the tables, and different queries are being executed against those tables.

    For example, with a composite (multi-column) index, the order in which you define the columns matters (most of the time), because it defines the ordering of the data itself (which is generally stored in a B-Tree, giving O(log(n)) search time), and the index can generally only be used efficiently when your query filters on a leftmost prefix (reading left to right) of the fields defined in the index.
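    The leftmost-prefix rule can be sketched with the BooksGenres table from earlier (SQLite standing in for your database; its composite PRIMARY KEY behaves like a composite index here):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE BooksGenres (
            BookId INTEGER,
            GenreId INTEGER,
            PRIMARY KEY (BookId, GenreId)
        )
    """)

    def plan(sql):
        # Return the detail line of the query plan for a statement.
        return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

    # Filtering on the leading column of the (BookId, GenreId) key:
    # the optimizer can seek the index, reported as a SEARCH.
    leading = plan("SELECT * FROM BooksGenres WHERE BookId = 1")

    # Filtering on the trailing column alone: the index ordering can't
    # be seeked, so the plan falls back to a SCAN.
    trailing = plan("SELECT * FROM BooksGenres WHERE GenreId = 2")
    print(leading)
    print(trailing)
    ```

    This is why, if you often query by genre alone, you'd also want a secondary index led by GenreId (or simply query through the Genres join shown earlier).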

    Other considerations sometimes come into play too, such as leading your index definitions with the most selective (most unique) column, because it leads to a more balanced B-Tree and improved efficiency when the index is seeked against.

    There's a lot that can be said about indexing, much more than what fits in a single answer to your particular question, so in addition to what I said, and the embedded resources I linked, you may find these additional resources of interest as well:


    Note: The resources I've linked may refer to a specific database system in practice but the concepts are all generic enough to apply to any modern relational database system you choose.
