I’ve spent a good bit of last week, and again this morning, working on some samples of sharing scientific data to the cloud, and I keep coming back to the question posed by the title. When I started on this project, I felt that this would be one of the easiest aspects… take a bunch of data, run it through “something” and have REST/JSON/AtomPub and possibly SOAP come out the other end. The users would have very clear access to the data and would be able to consume it in whatever tool/platform/means they saw fit.
To a large degree, I would still argue that this is the ideal scenario, but as I dig into the samples and tools used by those I work with, I’m finding that much of the data is stored in custom (well, custom to me) file formats that support self-description (a good thing) and n-dimensional arrays (good, but hard to express in XML). I’m seeing data stored as NetCDF files, HDF5 files and other formats. Many of these formats have a deep history within their respective fields (e.g. computational climatologists would be used to seeing NetCDF files) but don’t have much visibility outside those arenas.
One of the questions I’ve been chewing on is, “is that a bad thing?”… meaning, if the audience is really domain scientists, then providing data in a format consistent with what they are used to is a good thing and doesn’t necessarily diminish the value of the cloud at all. If, on the other hand, the goal is to provide the data in a more general sense, to a possibly more general audience and an unknown client/toolset, I’d argue that a more Internet-friendly format would be preferable.
Unfortunately, targeting the latter case is not as easy as it might seem: many of these datasets are large… measured in the tens of GB to tens of TB… in their currently-optimized binary formats. Converting and exposing those datasets as text (assuming for a moment that a suitable text-based representation of n-dimensional data existed) would bloat the total data size, putting consumption of the data out of reach for many.
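
To make the bloat concern a bit more concrete, here is a rough sketch in Python that stores the same values once as packed 64-bit floats and once as a JSON list. The sample values and the `compare_sizes` helper are made up for illustration; real datasets, and XML rather than JSON, will behave differently in the details.

```python
import json
from array import array


def compare_sizes(values):
    """Return (binary_bytes, text_bytes) for the same list of floats."""
    # Packed 64-bit floats: a fixed 8 bytes per value.
    binary_bytes = len(array("d", values).tobytes())
    # The same values written out as a JSON list, encoded as UTF-8 text.
    text_bytes = len(json.dumps(values).encode("utf-8"))
    return binary_bytes, text_bytes


# Hypothetical sample: 1,000 made-up "measurements".
samples = [i / 7.0 for i in range(1000)]
binary_bytes, text_bytes = compare_sizes(samples)
print(binary_bytes, text_bytes)
```

The text form comes out larger because every value is spelled out digit by digit and separated by punctuation, and that overhead scales with the dataset.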
In a meeting I had last week discussing this topic, a colleague suggested that the text-based version of the dataset could be exposed as a subset or lower-resolution version of the main set, with pointers to the full set for those interested. An alternative might be to provide data services that would, based on a formed query or URL string, return subsets of the full dataset to the caller.
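
As a minimal sketch of the second idea, the Python below turns a URL query string into one slice per dimension and applies it to a small in-memory grid. The query syntax (`name=start:stop[:step]`), the dimension names and the helper functions are all assumptions made up for this example, not an existing service's API; a real service would read from the NetCDF/HDF5 file rather than a nested list. Note that a step greater than 1 also gives a crude lower-resolution view of the data.

```python
from urllib.parse import parse_qsl


def parse_subset_query(query, dim_names):
    """Turn 'lat=0:4:2&lon=1:5' into one slice per dimension, in dim_names order."""
    requested = dict(parse_qsl(query))
    slices = []
    for name in dim_names:
        spec = requested.get(name)
        if spec is None:
            slices.append(slice(None))  # dimension not mentioned: return all of it
            continue
        parts = [int(p) if p else None for p in spec.split(":")]
        if len(parts) not in (2, 3):
            raise ValueError(f"expected start:stop[:step] for {name!r}, got {spec!r}")
        slices.append(slice(*parts))
    return slices


def take_subset(data, slices):
    """Apply one slice per nesting level of a list-of-lists array."""
    if not slices:
        return data
    head, rest = slices[0], slices[1:]
    return [take_subset(item, rest) for item in data[head]]


# Hypothetical 4 x 6 grid standing in for a (lat, lon) variable.
grid = [[row * 10 + col for col in range(6)] for row in range(4)]

# Every second row, and columns 1 and 3.
slices = parse_subset_query("lat=0:4:2&lon=1:5:2", ("lat", "lon"))
subset = take_subset(grid, slices)
print(subset)
```

The caller only pays for the cells they asked for, so the text-encoding overhead applies to a small response rather than to the whole dataset.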
Within the scientific community there exist tools/platforms such as OPeNDAP and Live Access Server(s) but, for the most part, these serve science-community-specific data formats (again, possibly a good thing). The HDF Group provides an XML representation of their data structure, which is interesting and could be used within REST/etc., but the NetCDF format doesn’t appear to have a counterpart function (XML versions of the metadata exist, but the data itself doesn’t appear to be available).
My current thinking is as follows:
- For domain scientists who are working in cloud-computing environments, look for ways to get the full datasets exposed in their native formats as close to the compute as possible (e.g. stored in S3 for Amazon customers, stored in Azure for Microsoft customers, etc.).
- For domain scientists who are working in private “clouds” or otherwise closed environments, look for ways to increase their access to the native file formats. Amazon’s CloudFront and DevPay could be an interesting option here: a CDN gets the data closer to those consuming it, and the bandwidth cost for distribution could possibly be passed on to the consumer.
- For generalists/the unknown client, explore XML-based representations of the common data formats and suitable means of serving that data.
- For generalists/the unknown client, explore ways to easily make lower-res versions of massive datasets available as well as look at means of developing services that provide subsets of the entire set as appropriate.
In all cases, look for ways to make the tooling for consuming/exploring the datasets more “reachable”. Support for *nix platforms is fairly robust, whilst support for Windows-based clients is fairly weak. While one could argue all day about which platform is best for this sort of work, Windows-based clients currently represent the majority of the computers in the world, and data formats that provide no support (or only poor support) on the Windows platform by definition exclude a large section of potential consumers of such data.