Importing CSV with a json array of strings in one of the column



  • The columns of the CSV are of the format text (can include empty lines), date and a JSON array of strings (something like ['a', 'b', 'c']. I've been trying to copy that CSV to a PostgreSQL table (using psycopg2's copy_expert, which simply executes the given SQL COPY command, if that matters)

    Table is created with

    CREATE TABLE posts(
        id SERIAL PRIMARY KEY,
        text TEXT NOT NULL,
        created_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        rubrics jsonb[] NOT NULL
    );
    

    and the copy command is

    COPY posts(text, created_date, rubrics)
    FROM STDIN
    WITH CSV HEADER
    

    where STDIN is the CSV file.

    The error I get is

    malformed array literal: "['a', 'b', 'c']"
    DETAIL:  "[" must introduce explicitly-specified array dimensions.
    CONTEXT:  COPY posts, line 15, column rubrics: "['a', 'b', 'c']"
    

    I've tried all 4 JSON-related data types (json and jsonb with and without [] or [3]), including square brackets produces the error above, while omitting them (rubrics jsonb NOT NULL when creating) gives a new one:

    invalid input syntax for type json
    DETAIL:  Token "'" is invalid.
    CONTEXT:  JSON data, line 1: ['...
    COPY posts, line 15, column rubrics: "['a', 'b', 'c']"
    

    Do I have any recourse besides manually fixing the .csv to use {} instead of [] before copying it? It feels like I do, yet I haven't really been able to found anything besides a couple somewhat, but not fully relevant questions.

    Update regarding the comments

    I've changed the table creation to

    CREATE TABLE posts(
        id SERIAL PRIMARY KEY,
        text TEXT NOT NULL,
        created_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        rubrics jsonb NOT NULL
    );
    

    and left a single entry in the CSV to test, so it looks like this

    text,created_date,rubrics
    "Lorem
    

    Ipsum

    Test",2019-07-25 12:42:13,'["f", "o", "o"]'

    Things I've tried and errors I got

    Using the CSV as is:

     extra data after last expected column
    CONTEXT:  COPY posts, line 6: ""Lorem
    

    Ipsum

    Test",2019-07-25 12:42:13,'["f", "o", "o"]'"

    An additional pair of "" around the array ("'["f", "o", "o"]'")

     invalid input syntax for type json
    DETAIL:  Token "'" is invalid.
    CONTEXT:  JSON data, line 1: '...
    COPY posts, line 6, column rubrics: "'[f, o, o]'"
    

    No quotes at all (just the ["f", "o", "o"])

    extra data after last expected column
    CONTEXT:  COPY posts, line 6: ""Lorem
    

    Ipsum

    Test",2019-07-25 12:42:13,["f", "o", "o"]"

    Double quotes instead of single quotes ("["f", "o", "o"]")

    invalid input syntax for type json
    DETAIL:  Token "f" is invalid.
    CONTEXT:  JSON data, line 1: [f...
    COPY posts, line 6, column rubrics: "[f, o, o]"
    

    Double quotes outside, single quotes inside ("['f', 'o', 'o']")

    invalid input syntax for type json
    DETAIL:  Token "'" is invalid.
    CONTEXT:  JSON data, line 1: ['...
    COPY posts, line 6, column rubrics: "['f', 'o', 'o']"
    

    Could it be an issue with the Python library I'm using after all?



  • JSON is a layer of encapsulation with a https://www.ecma-international.org/publications-and-standards/standards/ecma-404/ that the JSON field in your input must respect.

    CSV is another layer of encapsulation that applies on top, with a syntax that is more of less https://www.rfc-archive.org/getrfc?rfc=4180 . As mentioned in the RFC, programs that read or write CSV do not always conform strictly to all these rules, but concerning the quoting rules, Postgres is conformant.

    JSON uses double quotes around string literals, so all the attempts in the question to use single quotes around strings are invalid for the JSON parser

    CSV requires that double quotes are doubled, see rules #5 and #7 of rfc 4180, so all attempts to use double quotes in JSON without doubling them for CSV or without enclosing the entire field in double quotes are invalid for the CSV parser.

    1. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields
    1. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote

    Your example slightly edited to conform to both CSV and JSON does work:

    CREATE TABLE posts(
        id SERIAL PRIMARY KEY,
        text TEXT NOT NULL,
        created_date TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        rubrics jsonb NOT NULL
    );
    

    COPY posts(text, created_date, rubrics)
    FROM STDIN WITH CSV;
    foo,2022-05-17,"[""a"", ""b"", ""c""]"
    .

    select * from posts;
    id | text | created_date | rubrics
    ----+------+---------------------+-----------------
    1 | foo | 2022-05-17 00:00:00 | ["a", "b", "c"]

    If you wanted a Postgres array of JSON objects in rubrics, that would be more complicated, as expressing the Postgres array as text requires another layer of encapsulation with its https://www.postgresql.org/docs/current/arrays.html#ARRAYS-IO , so the production of valid data to import with COPY would have to do

    JSON quoting -> postgres array quoting -> CSV quoting

    But as seen in your example, it seems that you don't need a Postgres array of JSON objects as a column, but rather a single JSON value that contains a JSON array.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2