File format for archiving

loriab · November 30, 2015, 5:08pm

What file format is favored for archiving results? That is, floats, strings, arrays (Norb by Nbas or smaller, let’s say), etc. The archive should be at most a few times the size of the output file, just better organized.

xml
json, json-ld (json with some extra markups for linked data)
python pickle, python shelf
plain ascii (please no)
yaml (probably not; we could do with a bit more structure)
some Boost serialization/tree format

Necessary properties, I think:

architecture independent (w/o aux formatting tool)
vi-readable in a pinch (I’d consider json and xml in this class)
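
For illustration, here is a minimal sketch of how JSON meets both properties, using only the Python standard library. The keys and numbers in `results` are made-up placeholders, not output from any real calculation, and the layout is only one of many possible ones.

```python
import json

# Hypothetical results; names and values are placeholders for illustration.
results = {
    "method": "SCF",
    "energy": -76.0266327341,                           # float
    "orbital_energies": [-20.55, -1.34, -0.70],         # 1-D array as a list
    "mo_coefficients": [[0.99, 0.02], [0.02, -0.99]],   # small Norb x Nbas array as nested lists
}

# Plain text, so there is no byte-order or word-size dependence,
# and indent=2 keeps the file readable in a text editor.
text = json.dumps(results, indent=2, sort_keys=True)

# Reading the text back recovers the same Python objects.
restored = json.loads(text)
assert restored == results
```

Nested lists are fine for arrays of this size; anything much larger would bloat the text file.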

So, thoughts on favored formats?

crawdad · November 30, 2015, 5:24pm

@Rollin_King should weigh in on this one, considering the work he’s done recently with Hans-Peter Lüthi on this topic.

Rollin_King · November 30, 2015, 5:59pm

I haven’t heard any compelling reason not to use XML. It’s readable, portable, and there are many tools available for it. The downsides are that 1) the large-scale attempt at a standard (CML/CompChem) is stalled and incomplete (though it does include the vast majority of what users would want, i.e., the kind of data found in an output file); and 2) for large data (densities, MO coefficients for large molecules, etc.) you probably need a binary alternative on the side.
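
As a rough sketch of the "XML plus a binary alternative on the side" idea: small values live in the XML, while a large array is packed into a separate binary blob that the XML only describes. The element and attribute names here (`calculation`, `array`, `file`, `dtype`, `shape`) and the file name `coeffs.bin` are invented for this example and are not CML or CompChem. Only the Python standard library is used.

```python
import struct
import xml.etree.ElementTree as ET


def archive(energy, coeffs, sidecar_name="coeffs.bin"):
    """Return (xml_text, blob): XML metadata plus a binary sidecar payload."""
    rows, cols = len(coeffs), len(coeffs[0])
    flat = [x for row in coeffs for x in row]
    # "<" forces little-endian float64, so the blob is architecture independent.
    blob = struct.pack("<%dd" % len(flat), *flat)
    root = ET.Element("calculation")  # hypothetical tag names throughout
    ET.SubElement(root, "energy", units="hartree").text = repr(energy)
    ET.SubElement(root, "array", name="mo_coefficients", file=sidecar_name,
                  dtype="<f8", shape="%d %d" % (rows, cols))
    return ET.tostring(root, encoding="unicode"), blob


def restore(xml_text, blob):
    """Rebuild (energy, coeffs) from the XML description and the blob."""
    root = ET.fromstring(xml_text)
    energy = float(root.find("energy").text)
    rows, cols = (int(n) for n in root.find("array").get("shape").split())
    flat = struct.unpack("<%dd" % (rows * cols), blob)
    coeffs = [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
    return energy, coeffs


xml_text, blob = archive(-1.5, [[1.0, 2.0], [3.0, 4.0]])
assert restore(xml_text, blob) == (-1.5, [[1.0, 2.0], [3.0, 4.0]])
```

In practice the blob would be written to the file named in the `file` attribute. The XML stays small and editor-readable, at the cost of keeping two files together.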

From a paper currently in print with JOCS:

The Extensible Markup Language (XML) has the advantage that it is both human- and machine-readable, and that it is widely used as a data-exchange format supported by the World Wide Web Consortium (http://www.w3.org/XML/). XML allows expressing content, structure and logic, and the schema can always be adapted to serve a particular application or purpose. The data model in XML databases is built on hierarchical documents archived in collections, different from relational databases that are defined by tables. It offers a good balance between structure and flexibility. Flexibility is an issue, as not every single calculation will deliver the same set of data; the data to be archived will strongly depend on the kind of problem: the optimization of a molecular geometry or the response of this same molecule to an external field (electric or magnetic; static or time-dependent) will create very different sets of outputs. This balance of structure and flexibility has been harnessed by the NoSQL community to develop powerful, yet flexible database platforms such as MongoDB (www.mongodb.org), which is now also being used by the Harvard CEP. Note that XML can be easily adapted to be fed directly into a MongoDB. XML also comes with a large number of tools for analysis, transformation and data archival. Chemical Markup Language (CML) is an established “XML dialect” for chemistry, and its schema covers most of the semantics needed, even for computational quantum chemistry [4]. More recently, some of those authors published CompChem, a CML-based convention for computational chemistry [10]. CompChem has a working CML and convention validator code. Unfortunately, as of 2015, CompChem has seen very limited adoption, has only tentative basis-set and array specifications, and a published extension of CML to the NWChem program [23] deviated from the original CompChem schema [11].

loriab · November 30, 2015, 6:29pm

Thanks, I hadn’t seen the CompChem CML extension before.

My reservation about XML (though admittedly I don’t have a good understanding of all the interconversion tools) is that the XML Schema rather demands that it be the source of structure in the file, resulting in deep nesting (though adding some independence to file interpretation). It could be convenient for something else to control the structure (graph/tree, ontology) and the archiving file to be less structured (and less independent) with labels that enable hooking up data pieces to any of multiple structure definitions.
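
To make that last idea concrete, here is a minimal sketch of a flat, labeled archive with the structure supplied from outside. The labels (`scf.energy`, `mol.natom`, ...), the values, and the two "views" are all hypothetical. A view is just a nested template whose leaves are labels into the flat archive.

```python
# Flat archive: every datum has a label, and the file itself carries no nesting.
# Labels and values are invented for illustration.
archive = {
    "scf.energy": -76.02,
    "scf.iterations": 12,
    "mol.natom": 3,
    "mol.charge": 0,
}

# Two independent structure definitions over the same data.
by_method = {
    "scf": {"energy": "scf.energy", "iterations": "scf.iterations"},
    "molecule": {"natom": "mol.natom", "charge": "mol.charge"},
}
summary = {"system": {"atoms": "mol.natom"}, "result": "scf.energy"}


def apply_view(view, flat):
    """Replace each label at the leaves of `view` with its value from `flat`."""
    if isinstance(view, dict):
        return {key: apply_view(sub, flat) for key, sub in view.items()}
    return flat[view]  # leaf: a label into the flat archive


assert apply_view(summary, archive) == {"system": {"atoms": 3}, "result": -76.02}
assert apply_view(by_method, archive)["scf"]["iterations"] == 12
```

The trade-off is the one noted above: the archive file is simpler but cannot be interpreted without one of the structure definitions.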