Optimal format for storage of large text data



  • Please suggest a format for storing a large volume of text data. It should meet the following criteria (at least 3 out of 5). The criteria are listed in order of priority:

    • Convenient representation of the data structure (the input data may be arbitrary)
    • Support in Python libraries (conversion, parsing)
    • Speed and compactness (parsing, loading into a DBMS)
    • Convenient distribution (scaling, sending over the network)
    • Human-readable form (can be read and edited directly by a person)

    I'm considering CSV, XML and JSON. I would be grateful for advice on choosing among these formats, or for your own suggestions.


    UPD. A small clarification: why do I care about the format?

    I have collected a large amount of data for my project (engineering and scientific data).

    The task is to structure and store it, so that if someone suddenly needs the information, I can hand it over. A human-readable form would therefore be very useful.

    Values may change, and to avoid re-parsing everything, it should be possible to fix the file directly in an editor.

    In addition, the resulting data should be imported into a database, in my case PostgreSQL, and anyone who receives my text data should be able to do the same with any DBMS they prefer.



  • So, let's think it through:

    The same data in CSV and JSON:

    csv:

    country|city
    US|New York
    Russia|Moscow
    

    json:

    [{"country":"US", "city":"New York"},{"country":"Russia", "city":"Moscow"}]
    

    Compare the lengths. Which one is longer?

    JSON/XML is convenient because it describes the data schema in a structured way. CSV is convenient because it is very compact and has minimal parsing overhead.
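
    To put a number on the comparison, here is a minimal sketch that serializes the two sample rows from the example above with Python's standard `csv` and `json` modules and measures the result in bytes. The helper names `to_csv` and `to_json` are made up for illustration, and the JSON is written without extra whitespace, which is the most favourable case for it.

```python
import csv
import io
import json

# Sample rows taken from the example above.
rows = [
    {"country": "US", "city": "New York"},
    {"country": "Russia", "city": "Moscow"},
]


def to_csv(records, delimiter="|"):
    """Serialize a list of dicts to delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=list(records[0]),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Serialize the same records as a compact JSON array of objects."""
    return json.dumps(records, ensure_ascii=False, separators=(",", ":"))


csv_text = to_csv(rows)
json_text = to_json(rows)

# Measure encoded bytes rather than characters, since that is what gets stored.
csv_size = len(csv_text.encode("utf-8"))
json_size = len(json_text.encode("utf-8"))
print(f"CSV:  {csv_size} bytes")
print(f"JSON: {json_size} bytes")
```

    JSON repeats the field names in every record, so the gap grows with the number of rows; CSV names each column once, in the header.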

    • Any non-binary format can be edited by a person very easily; some binary formats are simple enough to edit in a hex editor, especially once you are used to it.

    • Formally, JSON is UTF-8 only (RFC 8259 requires UTF-8 for interchange between systems). CSV can be in any encoding.

    • If your data is very complex and highly interrelated, so that it is hard to represent as one or two tables, you should probably look at JSON/XML.
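
    As an illustration of data that does not fit one table, here is a small sketch with a hypothetical record (an experiment with a list of measurements; all field names are invented for the example). JSON stores it as a single nested document, while a tabular format needs two tables linked by an id.

```python
import json

# Hypothetical nested record: one experiment with several measurements.
experiment = {
    "id": 1,
    "name": "tensile test",
    "measurements": [
        {"t": 0.0, "force": 10.5},
        {"t": 1.0, "force": 12.1},
    ],
}


def flatten(exp):
    """Split one nested record into two flat tables linked by experiment id."""
    experiments = [{"id": exp["id"], "name": exp["name"]}]
    measurements = [
        {"experiment_id": exp["id"], **m} for m in exp["measurements"]
    ]
    return experiments, measurements


# JSON keeps the nesting as-is in one document.
as_json = json.dumps(experiment, indent=2)

# A tabular format needs two tables and a key to join them.
experiments, measurements = flatten(experiment)

print(as_json)
print(experiments)
print(measurements)
```

    With one level of nesting the two-table form is still manageable; with deeper or irregular nesting the number of tables and keys grows quickly, which is where JSON/XML starts to pay off.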

    If you're just loading texts into the database and exporting them again, CSV will be fine.
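
    A minimal sketch of that load step, using the standard `csv` module and an in-memory SQLite database. SQLite is only a stand-in here so the example runs without a server; the asker's target is PostgreSQL, where the `COPY` command is the usual bulk-load route. The table name `places` and the helper `load_csv` are assumptions for illustration.

```python
import csv
import io
import sqlite3

# Pipe-delimited sample from the example above.
CSV_TEXT = "country|city\nUS|New York\nRussia|Moscow\n"


def load_csv(conn, text, table="places"):
    """Parse delimited text and bulk-insert its rows into a new table."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    header = next(reader)  # first row holds the column names
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # The remaining rows are passed as parameters, never spliced into SQL.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(conn, CSV_TEXT)
loaded = conn.execute(
    "SELECT country, city FROM places ORDER BY country"
).fetchall()
print(loaded)
```

    The header row maps directly onto column names, which is why flat CSV is so convenient for this kind of import and export.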

    In general, this is a question of choosing an intermediate format: a format for export or for transfer over the network to external systems (nobody keeps data in CSV/JSON/XML as the main live store).

    If you have very large texts, store them in separate text files and in the database, and have the CSV/JSON/XML refer to those files. The structure gets more complicated, but editing gets easier.

    That said, the difference between the formats largely evens out. In short, as always, everything depends on the architecture and the task at hand.
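
    One way this layout could look, sketched with the standard library only: each text goes into its own file, and a small pipe-delimited index holds the metadata plus a relative path. The file names (`index.csv`, `doc_N.txt`), the columns and both helper functions are assumptions made up for the example.

```python
import csv
import tempfile
from pathlib import Path


def write_corpus(root, documents):
    """Write each text to its own file and an index.csv that references it."""
    root = Path(root)
    index_path = root / "index.csv"
    with index_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["title", "path"])
        for n, (title, text) in enumerate(documents.items()):
            rel = f"doc_{n}.txt"  # path stored relative to the index file
            (root / rel).write_text(text, encoding="utf-8")
            writer.writerow([title, rel])
    return index_path


def read_corpus(index_path):
    """Follow the file references in the index and return {title: text}."""
    index_path = Path(index_path)
    with index_path.open(newline="", encoding="utf-8") as f:
        return {
            row["title"]: (index_path.parent / row["path"]).read_text(
                encoding="utf-8"
            )
            for row in csv.DictReader(f, delimiter="|")
        }


docs = {
    "report": "A very long text...\nwith many lines.",
    "notes": "Another long text.",
}
with tempfile.TemporaryDirectory() as tmp:
    index = write_corpus(tmp, docs)
    restored = read_corpus(index)

print(restored == docs)
```

    The index stays small and easy to scan, and each text can be opened and fixed in an ordinary editor without worrying about delimiters or quoting.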



