Best way to map different JSON keys to same target columns



  • I have CSV import data that comes from multiple vendors with 50+ different columns, and each vendor has slightly different naming.

    Instead of spending hours mapping all of the column names to a master table with tons of columns, I want to store the original CSV row in a JSON column, such as csv_data.

    I also want to store a unique identifier that is shared by all of the vendors, like email.

    So the columns in my `contacts` table would be:

    vendor_id
    email
    csv_data
    

    Then I was hoping I could create a view or materialized view that starts with a small subset of columns I know I will need (and will spend time mapping). If I ever need new columns later, I could modify the view with new mappings.

    I was hoping I could do some kind of case switch, like this:

    SELECT
      CASE 
        WHEN vendor_id = 100 THEN
            csv_data->>'fullname' AS full_name,
            csv_data->>'age' AS age,
            ... etc ...
        WHEN vendor_id = 101 THEN
            csv_data->>'FULL NAME' AS full_name,
            csv_data->>'AGE' AS age,
            ... etc ...
        WHEN vendor_id = 102 THEN
            csv_data->>'full name' AS full_name,
            csv_data->>'cust_age' AS age,
            ... etc ...
      END
    FROM contacts
    

    Unfortunately this is a syntax error... is there any saner way to do this? I couldn't come up with good ideas. I truly don't want to create some enormous mapping, as so many of the column names are messy and will probably never be used. I want to store the data in case I need it, but for my 'results' query I want a nice clean mapping that I only add to when I need it.
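
    (For reference: the reason this is a syntax error is that a CASE expression returns a single value, so it cannot emit multiple aliased columns per branch. A syntactically valid version of the same idea needs one CASE expression per output column, using the key names from above:)

    SELECT
      CASE vendor_id
        WHEN 100 THEN csv_data->>'fullname'
        WHEN 101 THEN csv_data->>'FULL NAME'
        WHEN 102 THEN csv_data->>'full name'
      END AS full_name
    , CASE vendor_id
        WHEN 100 THEN csv_data->>'age'
        WHEN 101 THEN csv_data->>'AGE'
        WHEN 102 THEN csv_data->>'cust_age'
      END AS age
    FROM contacts;

    This works, but repeats the vendor dispatch for every column, which is what makes it painful to maintain.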

    I'm open to better ways of doing this overall as well. I was trying to avoid the nuclear route of the massive table, but I guess I would go that way if it's absolutely advised as the best approach.



  • If you go that route, storing the original JSON data (I suggest jsonb rather than json: cleaner and faster to extract from), then consider a translation table with the key names for every relevant target column per vendor, conveniently as another single jsonb value. Like:

    CREATE TABLE key2col (
      vendor_id int PRIMARY KEY
    , keys jsonb NOT NULL
    );
    

    INSERT INTO key2col VALUES
    (100, '{"full_name":"fullname", "age":"age"}')
    , (101, '{"full_name":"FULL NAME", "age":"AGE"}')
    -- more ...
    ;

    Then your query can simply be:

    SELECT c.vendor_id, c.email
         , c.csv_data->>(k.keys->>'full_name') AS full_name
         , c.csv_data->>(k.keys->>'age') AS age
    FROM   contacts c
    LEFT   JOIN key2col k USING (vendor_id);
    

    db<>fiddle https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=7db98fcee3a2335ba1ff0df114ff1eb3
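
    To get the "small subset of columns" view the question asks for, that query can simply be wrapped in a view. A sketch, assuming the tables above (the view name is hypothetical):

    CREATE VIEW contacts_mapped AS
    SELECT c.vendor_id, c.email
         , c.csv_data->>(k.keys->>'full_name') AS full_name
         , c.csv_data->>(k.keys->>'age') AS age
    FROM   contacts c
    LEFT   JOIN key2col k USING (vendor_id);

    A materialized view works the same way (CREATE MATERIALIZED VIEW ..., then REFRESH MATERIALIZED VIEW after loading new data); adding a *new* target column later means CREATE OR REPLACE VIEW with one more expression, while re-pointing an *existing* column only touches key2col.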

    This even accommodates a varying set of key-to-column translations per vendor. Every field that has no JSON key for the given vendor in key2col.keys will be NULL. The same happens if the vendor did not actually provide the declared key in contacts.csv_data.

    The LEFT JOIN in the query retains rows from contacts without any registered columns. You may want a plain JOIN instead ...
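
    Registering a new vendor, or adding a mapping later, is then a plain data change rather than a query rewrite. A sketch using the vendor-102 key names from the question:

    INSERT INTO key2col VALUES
    (102, '{"full_name":"full name", "age":"cust_age"}');

    -- or extend an existing vendor's mappings in place
    -- with jsonb concatenation (Postgres 9.5+):
    UPDATE key2col
    SET    keys = keys || '{"age":"cust_age"}'
    WHERE  vendor_id = 102;

    The || operator merges the new key/value pair into the stored jsonb, overwriting the old value for "age" if one existed.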



