One aspect of our project is to evaluate the “right” way to expose large datasets so that they can be consumed by other cloud-oriented tools. For our purposes we are considering datasets ranging from a few GB to a few PB, and it is certainly the upper end of this spectrum that causes the most concern. We are targeting the following general use cases:

  • Tools/Compute currently targeted at Amazon’s S3 service.
  • Tools/Compute currently targeted at Amazon’s EC2 service (not sure if the S3 interfaces solve both issues).
  • Tools/Compute currently targeted at Microsoft’s Azure Storage (blob/table as appropriate).
  • The “General” web client (this scenario makes the case for exposing data in human-readable and/or standards-based formats, so that tools that don’t yet exist can reasonably be expected to consume the data).

Because of the sheer quantity of data, it is our expectation that the data will be stored centrally in some semi-generic fashion that will then be exposed in a number of protocol/format specific means. This may be an incorrect assumption, but it is our current plan.
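
To make that plan a bit more concrete, here is a minimal sketch of the idea in Python: one protocol-neutral store with thin adapters layered on top of it. Everything in it is an assumption for illustration only. The names `CentralStore`, `S3StyleAdapter`, and `WebListingAdapter` are hypothetical, the in-memory dictionary stands in for whatever central storage we end up with, and the adapters only mimic the addressing style of each service, not the real S3 or Azure APIs.

```python
import json


class CentralStore:
    """Hypothetical protocol-neutral store: dataset name -> {key: bytes}."""

    def __init__(self):
        self._data = {}

    def put(self, dataset, key, payload):
        # Store raw bytes under a dataset/key pair.
        self._data.setdefault(dataset, {})[key] = payload

    def get(self, dataset, key):
        # Raises KeyError if the dataset or key is unknown.
        return self._data[dataset][key]

    def keys(self, dataset):
        # Sorted so that every adapter lists keys in the same order.
        return sorted(self._data.get(dataset, {}))


class S3StyleAdapter:
    """Illustrative bucket/key view of the store (not the real S3 API)."""

    def __init__(self, store):
        self.store = store

    def get_object(self, bucket, key):
        # A "bucket" is simply treated as a dataset name here.
        return self.store.get(bucket, key)

    def list_objects(self, bucket, prefix=""):
        return [k for k in self.store.keys(bucket) if k.startswith(prefix)]


class WebListingAdapter:
    """Human-readable JSON listing for the "General" web client."""

    def __init__(self, store):
        self.store = store

    def listing(self, dataset):
        return json.dumps(
            {"dataset": dataset, "keys": self.store.keys(dataset)}, indent=2
        )


# Usage: the data is written once and read through two different views.
store = CentralStore()
store.put("climate", "2008/jan.csv", b"station,temp\nA,1.5\n")
store.put("climate", "2008/feb.csv", b"station,temp\nA,2.0\n")

s3_view = S3StyleAdapter(store)
web_view = WebListingAdapter(store)
print(s3_view.list_objects("climate", prefix="2008/"))
print(web_view.listing("climate"))
```

The point of the sketch is only that the adapters hold no data of their own, so adding another protocol-specific view should not require copying a multi-PB dataset.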

Some general questions that currently exist:

  • Some of the APIs listed above support multiple protocols and formats (REST, JSON, SOAP) – should all be implemented or just REST?
  • How do we make it easy for various research groups to get their data published?
  • Should service-specific (e.g. S3/Azure) interfaces be built, or should general, REST-based (or other) interfaces be built?
  • Should dataset-specific REST interfaces be provided? (rather than generic interfaces)

I’ve been poking around at http://data.gov hoping that something interesting will start to appear (as of yet, it is still a placeholder site). I’ve also looked at Microsoft’s Open Government Data Initiative (http://ogdisdk.cloudapp.net/), which looks interesting, but no code has been posted yet; maybe their starter kit will appear soon.
