query relational tag data, grouping by a different metric



  • Suppose I have the following schema:

    CREATE TABLE tag_group (
      id INTEGER NOT NULL PRIMARY KEY,
      name TEXT NOT NULL
    );

    CREATE TABLE tag (
      id INTEGER NOT NULL PRIMARY KEY,
      name TEXT NOT NULL,
      tag_group_id INTEGER NOT NULL,
      FOREIGN KEY (tag_group_id) REFERENCES tag_group(id)
    );

    CREATE TABLE media (
      id INTEGER NOT NULL PRIMARY KEY,
      file TEXT NOT NULL,
      created_at DATETIME NOT NULL
    );

    CREATE TABLE media_tag (
      id INTEGER NOT NULL PRIMARY KEY,
      tag_id INTEGER NOT NULL,
      media_id INTEGER NOT NULL,
      FOREIGN KEY (tag_id) REFERENCES tag(id),
      FOREIGN KEY (media_id) REFERENCES media(id)
    );

    Essentially this is a database with media (files) that have tags attached to them. Tags in the system would look like this:

    |   tag_group  |          tag |
    |--------------|--------------|
    | 'instrument' |     'guitar' |
    |      'genre' |        'pop' |
    |     'artist' | 'billy_joel' |
    

    Retrieving media based on a tag query would look like this:

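    -- Assume the query is something like "get me all the media tagged 'guns_n_roses', 'rock', and 'guitar'",
    -- where those tags have ids 1, 3, and 5.
    SELECT media.* FROM media
    INNER JOIN media_tag ON media_id = media.id
    WHERE tag_id IN (1, 3, 5)
    GROUP BY media.id  -- group per media item so the count below is tags matched per item
    HAVING COUNT(tag_id) >= 3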
    

    But suppose I want more holistic information about what is in the database. How would I write a query that filters using the same tag search (e.g. 'guns_n_roses', 'rock', 'guitar') and then groups the matching rows by tag group? The resulting data would look something like this:

    |   tag_group  |  media_count  |
    |--------------|---------------|
    | 'instrument' |            7  |
    |      'genre' |           19  |
    |     'artist' |           40  |
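
The tag filter described above can be tried against an in-memory SQLite database. This is only a minimal sketch: the sample rows and file names are made up for illustration, and it assumes the intent is "media carrying all three searched tags", so it groups by media.id.

```python
import sqlite3

# Schema from the post, plus hypothetical sample rows.
# Tag ids 1, 3, 5 stand in for 'guns_n_roses', 'rock', 'guitar'.
SCHEMA = """
CREATE TABLE tag_group (id INTEGER NOT NULL PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tag (
  id INTEGER NOT NULL PRIMARY KEY,
  name TEXT NOT NULL,
  tag_group_id INTEGER NOT NULL,
  FOREIGN KEY (tag_group_id) REFERENCES tag_group(id)
);
CREATE TABLE media (
  id INTEGER NOT NULL PRIMARY KEY,
  file TEXT NOT NULL,
  created_at DATETIME NOT NULL
);
CREATE TABLE media_tag (
  id INTEGER NOT NULL PRIMARY KEY,
  tag_id INTEGER NOT NULL,
  media_id INTEGER NOT NULL,
  FOREIGN KEY (tag_id) REFERENCES tag(id),
  FOREIGN KEY (media_id) REFERENCES media(id)
);
INSERT INTO tag_group VALUES (1, 'artist'), (2, 'genre'), (3, 'instrument');
INSERT INTO tag VALUES
  (1, 'guns_n_roses', 1), (2, 'billy_joel', 1),
  (3, 'rock', 2), (4, 'pop', 2),
  (5, 'guitar', 3), (6, 'piano', 3);
INSERT INTO media VALUES
  (1, 'a.mp3', '2020-01-01'),
  (2, 'b.mp3', '2020-01-02'),
  (3, 'c.mp3', '2020-01-03');
INSERT INTO media_tag VALUES
  (1, 1, 1), (2, 3, 1), (3, 5, 1),   -- a.mp3 has all three searched tags
  (4, 1, 2), (5, 3, 2),              -- b.mp3 is missing 'guitar'
  (6, 2, 3), (7, 4, 3), (8, 6, 3);   -- c.mp3 has none of them
"""

# Keep only media items that matched at least three of the searched tag rows.
MATCH_ALL_TAGS = """
SELECT media.id, media.file
FROM media
INNER JOIN media_tag ON media_tag.media_id = media.id
WHERE media_tag.tag_id IN (1, 3, 5)
GROUP BY media.id
HAVING COUNT(media_tag.tag_id) >= 3
"""


def matching_media(conn):
    """Return (id, file) for every media item that has all searched tags."""
    return conn.execute(MATCH_ALL_TAGS).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
print(matching_media(conn))  # [(1, 'a.mp3')]
```

Note that COUNT(tag_id) >= 3 only means "all three tags" if a media item cannot carry the same tag twice; COUNT(DISTINCT tag_id) would be the safer choice if duplicates are possible.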
    


  • This is the closest I can get, but this query seems like it would perform poorly:

    SELECT tag_group.name AS tag_group, COUNT(media.id) AS media_count FROM media
    INNER JOIN media_tag ON media_id = media.id
    INNER JOIN tag ON tag_id = tag.id
    INNER JOIN tag_group ON tag_group_id = tag_group.id
    WHERE media.id IN (
      -- media items that carry all three searched tags
      SELECT media.id FROM media
      INNER JOIN media_tag ON media_id = media.id
      WHERE tag_id IN (1, 3, 5)
      GROUP BY media.id
      HAVING COUNT(tag_id) >= 3
    )
    GROUP BY tag_group_id
    

    It does appear to return the data that I want to see, but wouldn't this cause a lot of extra loops?

    [edit] Here is the query plan generated from the above statement. It appears that we duplicate the work of selecting from media by primary key, but that is the only duplicated work I can see. Is this the best I can do? I would be curious whether there's a smarter way to write my schema or query.

    |--SEARCH TABLE media USING INTEGER PRIMARY KEY (rowid=?)
    |--LIST SUBQUERY 1
    |  |--SEARCH TABLE media_tag USING COVERING INDEX media_tag (tag_id=?)
    |  `--SEARCH TABLE media USING INTEGER PRIMARY KEY (rowid=?)
    |--SEARCH TABLE media_tag USING COVERING INDEX sqlite_autoindex_media_tag_1 (media_id=?)
    |--SEARCH TABLE tag USING INTEGER PRIMARY KEY (rowid=?)
    |--SEARCH TABLE tag_group USING INTEGER PRIMARY KEY (rowid=?)
    `--USE TEMP B-TREE FOR GROUP BY
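
For anyone who wants to reproduce this, here is a rough sketch that runs the grouped query and prints its plan with EXPLAIN QUERY PLAN. The sample rows are hypothetical, and the subquery is assumed to group by media.id. The plan lines depend on the SQLite version and on which indexes exist, so they will likely differ from the plan above.

```python
import sqlite3

# Schema from the post with hypothetical sample rows.
SCHEMA = """
CREATE TABLE tag_group (id INTEGER NOT NULL PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tag (
  id INTEGER NOT NULL PRIMARY KEY,
  name TEXT NOT NULL,
  tag_group_id INTEGER NOT NULL,
  FOREIGN KEY (tag_group_id) REFERENCES tag_group(id)
);
CREATE TABLE media (
  id INTEGER NOT NULL PRIMARY KEY,
  file TEXT NOT NULL,
  created_at DATETIME NOT NULL
);
CREATE TABLE media_tag (
  id INTEGER NOT NULL PRIMARY KEY,
  tag_id INTEGER NOT NULL,
  media_id INTEGER NOT NULL,
  FOREIGN KEY (tag_id) REFERENCES tag(id),
  FOREIGN KEY (media_id) REFERENCES media(id)
);
INSERT INTO tag_group VALUES (1, 'artist'), (2, 'genre'), (3, 'instrument');
INSERT INTO tag VALUES
  (1, 'guns_n_roses', 1), (2, 'billy_joel', 1),
  (3, 'rock', 2), (4, 'pop', 2),
  (5, 'guitar', 3), (6, 'piano', 3);
INSERT INTO media VALUES
  (1, 'a.mp3', '2020-01-01'),
  (2, 'b.mp3', '2020-01-02'),
  (3, 'c.mp3', '2020-01-03');
INSERT INTO media_tag VALUES
  (1, 1, 1), (2, 3, 1), (3, 5, 1), (4, 6, 1),   -- media 1: all three tags plus 'piano'
  (5, 1, 2), (6, 3, 2), (7, 5, 2),              -- media 2: exactly the three tags
  (8, 2, 3), (9, 4, 3), (10, 6, 3);             -- media 3: none of the searched tags
"""

# The grouped query from the post, with column references qualified.
GROUPED = """
SELECT tag_group.name AS tag_group, COUNT(media.id) AS media_count
FROM media
INNER JOIN media_tag ON media_tag.media_id = media.id
INNER JOIN tag ON media_tag.tag_id = tag.id
INNER JOIN tag_group ON tag.tag_group_id = tag_group.id
WHERE media.id IN (
  SELECT media.id FROM media
  INNER JOIN media_tag ON media_tag.media_id = media.id
  WHERE media_tag.tag_id IN (1, 3, 5)
  GROUP BY media.id
  HAVING COUNT(media_tag.tag_id) >= 3
)
GROUP BY tag.tag_group_id
"""


def grouped_counts(conn):
    """Return (tag_group, media_count) rows for media matching the tag search."""
    return conn.execute(GROUPED).fetchall()


def query_plan(conn):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + GROUPED).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
print(sorted(grouped_counts(conn)))  # [('artist', 2), ('genre', 2), ('instrument', 3)]
for line in query_plan(conn):
    print(line)
```

In this sample, 'instrument' comes out as 3 even though only two media items match, because COUNT(media.id) counts one row per media/tag pair. COUNT(DISTINCT media.id) would count media items instead, if that is what media_count is meant to be.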
    


