
The State of CSV and JSON

Our Friends CSV and JSON

In the history of the Internet, two data formats have become ubiquitous: CSV and JSON. CSV, or Comma Separated Values, originates from the 1970s, well before the Web came to life. JSON, or JavaScript Object Notation, on the other hand, was specified in the early 2000s. Both formats are easy to read and write by humans, and easy to interpret by computers. In short, both offer great UX.

CSV

CSV was created as a way to transfer data in and out of spreadsheets and databases. Separating values with commas is natural for humans to read and write. It’s also easy for a computer program to split the values between comma characters. Quotes (typically double quotes) were added in the early 80s as a means to encapsulate values that themselves contained commas. Again, natural for a human to read and write and still simple enough for a computer program to parse.
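
For example, a hypothetical row where one value itself contains a comma is simply wrapped in double quotes:

name,title,city
"Smith, Jane",Engineer,Montreal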

CSV is the de facto interchange format across databases. Look at MySQL Workbench, SQL Server Management Studio, phpMyAdmin, etc. They all support importing and exporting data as CSV, which is easy for a computer program to write.

CSV has a brother called TSV, or Tab Separated Values. It's the format you find in your clipboard when you copy tabular data in applications, for example from a spreadsheet application like Excel. Instead of a comma character, the tab character \t is used to separate values. The values can then easily be pasted into another application and interpreted as tabular data.
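
As a rough sketch (not from the original article), splitting a pasted TSV row in JavaScript is a one-liner; the clipboard content below is made up:

// Hypothetical clipboard content: values separated by tab characters
const row = "Jane Smith\tEngineer\tMontreal";
const values = row.split("\t"); // ["Jane Smith", "Engineer", "Montreal"]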

Although CSV has been around for so long, official specification efforts came very late. In 2005 the IETF or Internet Engineering Task Force published an RFC memo, RFC-4180. That memo documented the CSV format and its “text/csv” media type. More recently, in 2014, the IETF published another memo, RFC-7111, to identify URI fragments in a “text/csv” media type file. Neither memo is an official specification of the CSV format. Instead they aim to document common usage of CSV.

In 2015 the W3C or World Wide Web Consortium started working on specifying the CSV format by publishing a series of recommendations and group notes. Work is underway to standardize the CSV syntax. Why did it take so long? Maybe because the CSV definition, although very loose, has always been “good enough” and taken for granted.

Interestingly enough, within the effort of publishing a standard for CSV, the W3C touches on CSV to JSON conversion. A great segue for me to jump to the second topic of this article, the JSON data format.

JSON

JSON was created by Douglas Crockford in 2001–2002 to solve a problem: transferring JavaScript data through an HTML document. Mr. Crockford figured he could pass a <script> tag containing a JavaScript object. It looked something like this:

<html><head><script>
document.domain = 'fudco';
parent.session.receive(
  { to: "session", do: "test",
    text: "Hello world" }
)
</script></head></html>

He eventually realized that reserved words like do could not be used as keys and added double quotes around keys. In 2002 Mr. Crockford published the JSON grammar and parser on json.org.
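
With quoted keys, the same payload becomes valid JSON even though do is a reserved word:

{ "to": "session", "do": "test", "text": "Hello world" }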

Since then, JSON has grown in popularity, going so far as to replace XML in APIs. It's interesting to compare the Stack Overflow question trends of JSON vs XML to see the rise in popularity of JSON (and the fall of XML).

The rise of APIs, and specifically RESTful APIs, is most likely one of the reasons JSON has become so popular. Looking at the Twitter API, the GitHub API and the Facebook Graph API, they all use JSON as their data format and adhere to RESTful principles. Clients written in many different programming languages consume those APIs, and therefore had to implement JSON parsing and generation.
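
A minimal sketch of consuming such a JSON API in JavaScript; the endpoint below is purely illustrative, not one of the real APIs mentioned above:

// Hypothetical endpoint; real APIs differ in paths and authentication
fetch("https://api.example.com/users/42")
  .then(response => response.json()) // parse the JSON response body
  .then(user => console.log(user.name));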

Configuration files are also more and more often written in JSON format. Every front-end developer recognizes the famous package.json file. It is found in virtually every JavaScript repository on GitHub to define the package and its dependencies.
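
A minimal package.json might look something like this (the name and dependency are made up):

{
  "name": "my-app",
  "version": "1.0.0",
  "dependencies": {
    "lodash": "^4.17.21"
  }
}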

Computers understand JSON as a universal format. Humans, namely developers, understand it as well.

JSON is much simpler for a human to write than XML. First of all, it's based on the most popular programming language in the world: JavaScript. A JavaScript object is very natural and intuitive for programmers to write. Second, it is much less verbose than XML. JSON uses double quotes to open and close keys and string values; curly brackets to open and close an object; and square brackets to open and close an array. Third, JSON defines the elementary data types boolean, string, number, object and array. XML on the other hand does not. Extra metadata is required to specify the type of content of a node. Again, XML is extremely verbose and open to developer interpretation while JSON provides a simple grammar and data types every developer understands.
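
To make the comparison concrete, here is the same hypothetical record in both formats; the XML type attributes are just one common way of conveying the metadata JSON gets for free:

{ "name": "Jane", "age": 34, "admin": true }

<person>
  <name>Jane</name>
  <age type="number">34</age>
  <admin type="boolean">true</admin>
</person>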

JSON allows both computers and humans to represent data structures without ambiguity.

CSV and JSON

Here we are with two data formats used for different things: CSV for representing tabular data, widely used in database import/export, spreadsheet applications and copy/paste-able on every desktop operating system; and JSON for APIs and configuration files. Both formats are complementary but do overlap. One could represent tabular data as JSON by providing an array of arrays. Yet for a human, those square brackets and the lack of new line characters would make writing and reading very inconvenient. Furthermore, spreadsheet applications would need to change their clipboard parsing conventions. Not likely to happen. CSV is here to stay. Not only that, CSV is growing in popularity on an upward trend similar to JSON's.
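
For instance, the same small made-up table as CSV and as a JSON array of arrays:

name,title
Jane,Engineer
John,Designer

[["name","title"],["Jane","Engineer"],["John","Designer"]]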

Developers do need to convert from one format to the other. Hence the W3C published the recommendation Generating JSON from Tabular Data on the Web to that effect. That document “describes the processing of tabular data to create a set of nested objects that must be serialized as JSON”.

A Google Trends search for CSV to JSON also shows the rise in demand for conversion between JSON and CSV.

My own CSV to JSON conversion toolbox website csvjson.com has seen a rise in popularity as well. Over the past year, I’ve had many requests to extend the toolset. So I’ve added new web apps for JSON to CSV conversion and a JSON validator and linter.

More interesting however is Dror Harari's proposal to marry CSV and JSON. The new format is called CSVJSON and represents tabular data as rows of JSON arrays without enclosing square brackets. This clever pairing solves the parsing issues CSV suffers from due to the lack of a formal specification, by imposing the grammar and rules established for JSON (e.g. character encoding and escaping rules). CSVJSON simply says: “to parse a row, surround it with square brackets and use JSON.parse()”. Furthermore, CSVJSON adds support for null, booleans and objects. It therefore becomes an ideal transfer format between databases.
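
As a sketch of that parsing rule, assuming each line of a CSVJSON file is one row (the sample row is made up):

// Hypothetical CSVJSON row: JSON values separated by commas, no enclosing brackets
const line = '"Jane", 34, true, null';
const row = JSON.parse("[" + line + "]"); // ["Jane", 34, true, null]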

Mr. Harari wasn't alone in coming up with that idea. JSON Lines, or Line-delimited JSON, defines a format whereby multiple valid JSON values are delimited by new line characters. The JSON values themselves are stripped of white space and new line characters, so that new line characters only appear as separators between values. The format is ideal for streaming: during transfer, a chunk contains a valid JSON value, and more can follow later. Although this format is not CSV strictly speaking, it does leverage the same idea: one record per row.
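
A small illustrative JSON Lines stream, one JSON value per line, and a naive way to parse it once fully received:

// Two records, each a complete JSON value on its own line
const stream = '{"to":"session","do":"test"}\n{"to":"session","do":"ping"}';
const records = stream.split("\n").map(line => JSON.parse(line));
// records[0].do === "test"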

The Road Ahead

Time will tell if these new derivative CSV/JSON data formats live on to be widely used. I think they have their own niches (database table transfers and JSON streaming) and will most likely remain there. What's interesting to note is that both formats, and JSON as well for that matter, have been specified from the get-go. That wasn't the case for our friend CSV. It's fascinating to see the JSON grammar being used to address the shortcomings of CSV. I'm curious to see the outcome of the W3C effort around tabular data. Finally, one can wonder whether the marriage between the two formats will produce more offspring. Time will tell.