Does Greg Rahn's "join order benchmark" reflect real world OLAP join scenarios?



  • Greg Rahn's https://github.com/gregrahn/join-order-benchmark has dozens of inner join queries on the IMDB database.

    Does this benchmark and the database itself reflect common real world OLAP join scenarios? How does a join optimizer's performance on these benchmark correlate to general join performance for the real world?

    For example:

    SELECT MIN(cn.name) AS movie_company,
       MIN(mi_idx.info) AS rating,
       MIN(t.title) AS complete_euro_dark_movie
    FROM complete_cast AS cc,
     comp_cast_type AS cct1,
     comp_cast_type AS cct2,
     company_name AS cn,
     company_type AS ct,
     info_type AS it1,
     info_type AS it2,
     keyword AS k,
     kind_type AS kt,
     movie_companies AS mc,
     movie_info AS mi,
     movie_info_idx AS mi_idx,
     movie_keyword AS mk,
     title AS t
    WHERE cct1.kind = 'crew'
      AND cct2.kind != 'complete+verified'
      AND cn.country_code != '[us]'
      AND it1.info = 'countries'
      AND it2.info = 'rating'
      AND k.keyword IN ('murder',
                    'murder-in-title',
                    'blood',
                    'violence')
      AND kt.kind IN ('movie',
                  'episode')
      AND mc.note NOT LIKE '%(USA)%'
      AND mc.note LIKE '%(200%)%'
      AND mi.info IN ('Sweden',
                  'Norway',
                  'Germany',
                  'Denmark',
                  'Swedish',
                  'Danish',
                  'Norwegian',
                  'German',
                  'USA',
                  'American')
      AND mi_idx.info < '8.5'
      AND t.production_year > 2000
      AND kt.id = t.kind_id
      AND t.id = mi.movie_id
      AND t.id = mk.movie_id
      AND t.id = mi_idx.movie_id
      AND t.id = mc.movie_id
      AND t.id = cc.movie_id
      AND mk.movie_id = mi.movie_id
      AND mk.movie_id = mi_idx.movie_id
      AND mk.movie_id = mc.movie_id
      AND mk.movie_id = cc.movie_id
      AND mi.movie_id = mi_idx.movie_id
      AND mi.movie_id = mc.movie_id
      AND mi.movie_id = cc.movie_id
      AND mc.movie_id = mi_idx.movie_id
      AND mc.movie_id = cc.movie_id
      AND mi_idx.movie_id = cc.movie_id
      AND k.id = mk.keyword_id
      AND it1.id = mi.info_type_id
      AND it2.id = mi_idx.info_type_id
      AND ct.id = mc.company_type_id
      AND cn.id = mc.company_id
      AND cct1.id = cc.subject_id
      AND cct2.id = cc.status_id;
    


  • I don't know who Greg Rahn is, but that's one ugly query, especially with the stuffing of the predicates of the JOINs into the WHERE clause instead of on a proper JOIN clause between each referenced object. That's a non-standard syntax, and I recommend not following it.

    Order of the objects being JOINed depends on the database system we're talking about. Some do explicitly care about the order of the tables (I believe generally by the cardinality of the data set of that object) in order to achieve the most optimal plan but most don't care and automatically try to optimize the JOIN order themselves, such as Microsoft SQL Server.

    Of course the complexity of parsing, analyzing, and determining the best execution plan for a query increases with the more tables that are being joined together. For example, with 3 tables there's 6 ways to order the tables and JOIN them:

    1. A-B-C
    2. A-C-B
    3. B-A-C
    4. B-C-A
    5. C-A-B
    6. C-B-A
    

    And in Microsoft SQL Server, there are 3 types of JOIN operators it can use for any given JOIN: Nested Loops, Hash Match, Merge Join. So with only 3 tables there's actually 18 unique ways to JOIN all the data together then. That's 18 unique execution plans the optimizer needs to analyze and choose the best one from, in a very minimal amount of time. So you can imagine how quickly that number grows when you add a fourth or fifth table to JOIN to, where 15 tables is some number in the billions (maybe more?) execution plans.

    Because of the vast number of execution plans the optimizer needs to choose from and because the slightest change in a query in Microsoft SQL Server results in a new plan being recalculated, re-arranging the order of the tables being JOINed could result in a more optimal plan but just as well could result in a less optimal one too. Again because Microsoft SQL Server doesn't care about the order of the tables per se, rather it does the re-ordering for you under the hood, and it's essentially sheer luck at that point that you've caused a different execution plan to be generated. Though most times you'll likely see the same plan anyway. This is probably true for most database systems that optimize the JOIN order for you.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2