Postgres improvements in conversion functions for bulk load



  • I regularly import data from httparchive.org. The data is a MySQL CSV export and I use pgloader, which handles the quirks of this export (e.g. \N for NULL). I also need to do some additional processing for normalisation:

    • split the url into protocol (http|https) and host parts
    • convert the string date "Mon DD YYYY" to a date object

    Currently, I have some triggers to do this when importing the data but I'm looking for ways to improve this, particularly to see whether some steps can be run in parallel.
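The date conversion step can be prototyped outside the database first. A minimal Python sketch of the "Mon DD YYYY" parse (the sample value is made up):

```python
from datetime import datetime, date

def parse_archive_date(s: str) -> date:
    # "Mon DD YYYY", e.g. "Oct 15 2018" -> date(2018, 10, 15)
    return datetime.strptime(s, "%b %d %Y").date()

print(parse_archive_date("Oct 15 2018"))  # 2018-10-15
```

Inside Postgres, the equivalent conversion is `to_date(col, 'Mon DD YYYY')`, which can run in plain SQL without a trigger.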

    I have the following CTE for extracting the protocol and host:

    with split as (
      select regexp_match(url, '(https|http)://(.+)/') as parts
      from urls
    )
    

    Running locally, this seems faster than tsdebug.
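As a sanity check, the same pattern can be exercised outside Postgres. A Python sketch with a made-up URL; the two `re` groups correspond to `parts[1]` and `parts[2]`:

```python
import re

# Same pattern as the regexp_match() call above.
URL_RE = re.compile(r'(https|http)://(.+)/')

m = URL_RE.match('https://www.example.com/index.html')
protocol, host = m.group(1), m.group(2)
print(protocol, host)  # https www.example.com
```

Note that the greedy `(.+)` captures everything up to the *last* slash, so a URL with a path keeps path segments in the "host" part; if only the hostname is wanted, a non-greedy `(.+?)` would stop at the first slash instead.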

    This works well with a select but seems very slow as an update.

    with split as (
      select regexp_match(url, '(https|http)://(.+)/') as parts
      from urls
    )
    update urls
    set protocol = parts[1],
        host = parts[2]
    from split;
    

    An alternative, especially when working with a text source, would be to split the URL before it goes to Postgres.
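A minimal sketch of that pre-processing idea, assuming a CSV with a `url` column (the column names and sample row are hypothetical): rewrite the file with extra `protocol` and `host` columns before pgloader or COPY picks it up.

```python
import csv
import io
import re

URL_RE = re.compile(r'(https|http)://(.+)/')

def add_url_parts(src, dst):
    """Copy CSV from src to dst, appending protocol and host columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['protocol', 'host'])
    writer.writeheader()
    for row in reader:
        m = URL_RE.match(row['url'])
        row['protocol'] = m.group(1) if m else ''
        row['host'] = m.group(2) if m else ''
        writer.writerow(row)

# Example with in-memory files standing in for the real CSV:
src = io.StringIO('id,url\n1,https://www.example.com/\n')
dst = io.StringIO()
add_url_parts(src, dst)
```

Because this runs outside the database, several input files (or chunks of one file) could be processed in parallel before loading.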

    The uncompressed CSV is 3526430884 bytes (roughly 3.5 GB) and takes around 20 minutes to import with no processing, but more than twice that with the processing. FWIW, I have also tried using a Foreign Data Wrapper, but even after solving various problems with the CSV (nulls, encoding), this leads to a memory error.

    With some help, I've managed to run benchmarks and improve my triggers.

    CREATE OR REPLACE FUNCTION public.extract_protocol()
     RETURNS trigger
     LANGUAGE plpgsql
    AS $function$
    DECLARE
      elements text[];
    BEGIN
      elements := regexp_match(NEW.url, '(https|http)://(.+)/');
      NEW.protocol := elements[1];
      NEW.host := elements[2];
      RETURN NEW;
    END;
    $function$;
    

    This now runs faster than doing a subsequent update, but neither is the limiting factor. The bottleneck is now the overhead of maintaining the indices when inserting the cleaned-up data into the main table. I think my only option there is to weigh the cost of keeping the indices in place during the insert against dropping them first and recreating them afterwards.



  • Your UPDATE statement generates a cross join of the table urls with the split result, which is essentially a cross join of the table with itself.

    You need to have some kind of join condition between the target table and the source. The obvious choice would be the primary key column of the table.

    with split as (
      select pk_column, regexp_match(url, '(https|http)://(.+)/' ) as parts 
      from urls 
    )
    update urls
      set protocol = s.parts[1],
          host = s.parts[2] 
    from split s 
    where urls.pk_column = s.pk_column --<< here
    

    I think your attempt to avoid evaluating the regex twice by using the CTE makes things slower, rather than faster: I would expect the overhead of joining the table with itself to be far bigger than the overhead of evaluating the regex twice.

    So I think you should also try:
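A toy Python sketch of the row-count blow-up caused by the original, condition-less UPDATE, using made-up rows: without a join condition, every target row pairs with every source row, so the update processes n² combinations instead of n.

```python
# Made-up rows standing in for the urls table and the split CTE.
urls = [(1, 'https://a.example/'), (2, 'http://b.example/'), (3, 'https://c.example/')]
split = [(1, ['https', 'a.example']), (2, ['http', 'b.example']), (3, ['https', 'c.example'])]

# No join condition: a cross join, as in the original UPDATE.
cross = [(u, s) for u in urls for s in split]

# With the pk join condition: exactly one source row per target row.
keyed = [(u, s) for u in urls for s in split if u[0] == s[0]]

print(len(cross), len(keyed))  # 9 3
```

On the real table, the n² pairing is why the keyed UPDATE is dramatically cheaper than the original.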

    update urls
      set protocol = (regexp_match(url, '(https|http)://(.+)/'))[1],
          host = (regexp_match(url, '(https|http)://(.+)/'))[2];
    


