May 26, 2009

I’ve been thinking quite a bit lately about the role of cloud computing in scientific research (as hinted at by the title of this site). One possible flaw in my approach is that I’ve been delving into MPI-based compute as much as I can, to wrap my mind around how it works, with the notion of then applying that paradigm to cloud compute. I list it as a flaw only because I wonder whether it is time to think a bit further outside the proverbial box. I’ve been mulling over the following:

How do people actually use multiple machines to solve a problem? – This is really the root question behind all of this work. At one extreme are high-end shared-memory machines (à la Cray supercomputers); I’m going to eliminate that type of compute from the conversation because it simply can’t be replicated well in the cloud as we currently know it. At the opposite extreme is “manual” clustering or map/reduce: someone figures out a problem they want to solve, divvies it up among N nodes, runs a program on each node individually with the appropriate settings, and then manually aggregates the results. This approach is most likely used by ad-hoc projects or by people not familiar with traditional HPC technologies and approaches. Between the two extremes are Map/Reduce implementations and traditional MPI programs targeted at distributed-memory systems.
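To make the “manual” end of that spectrum concrete, here is a minimal Python sketch of the divide, run, and aggregate steps. It is only an illustration under stated assumptions: local threads stand in for the N nodes, the workload (a sum of squares) is made up, and the names `split_range`, `node_job`, and `run_manual_cluster` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, n_nodes):
    """Divide [start, stop) into n_nodes contiguous chunks of near-equal size."""
    size, extra = divmod(stop - start, n_nodes)
    chunks = []
    lo = start
    for i in range(n_nodes):
        # The first `extra` chunks absorb one leftover element each.
        hi = lo + size + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def node_job(chunk):
    """The program each "node" runs on its own slice (made-up workload)."""
    lo, hi = chunk
    return sum(x * x for x in range(lo, hi))


def run_manual_cluster(start, stop, n_nodes):
    """Scatter the chunks, run one job per chunk, then aggregate the partials."""
    chunks = split_range(start, stop, n_nodes)
    # Assumption: threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_job, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(split_range(0, 10, 3))
    print(run_manual_cluster(0, 1000, 4))
```

On real machines the scatter and gather steps would be whatever the researcher does by hand (copying inputs out, collecting output files); the structure of the work stays the same.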

Amazon’s EC2 – very easy to utilize for lower-throughput MPI-based HPC, given that you can get N Linux boxes for $0.10 per CPU per hour and that, because of the vast community that has grown up around it, there are pre-packaged clusters (via AMIs) and even commercial vendors building businesses on providing HPC-style compute in EC2 in an “on-demand” fashion. Further, traditional grid computing platforms such as Nimbus have been radically adapted to provide a rather compelling local-to-cloud story for scientific HPC. It would seem that if you are working in HPC today and simply want to use an HPC cluster “in the cloud” (maybe because you lack access to sufficient hardware), Amazon’s EC2 and the toolsets that sit on top of it, such as Nimbus, are a natural solution.

Microsoft’s Azure – While it is a quickly changing platform (it hasn’t yet been released) and Microsoft has hinted at plans to adapt it based on customer demand, as it currently stands there’s no obvious fit for the traditional HPC model. The Azure customer is given the choice of deploying web roles or worker roles, and one can imagine using worker roles in a fashion analogous to cluster nodes… but there currently isn’t any built-in infrastructure to bind those nodes into a single group/cluster. As it stands now, Azure seems to lean towards the manual approach to large-scale compute. What could change this story completely is if Microsoft decided to offer HPC Pack-enabled nodes as a type of resource you could request, although there’s been nothing to hint that they are planning anything like this.
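One way to picture worker roles acting as loosely coupled cluster nodes is a shared task queue that every worker polls. The sketch below is a local stand-in only: it uses Python’s standard `queue` and `threading` modules rather than any Azure API, and the names `worker` and `run_workers`, as well as the squaring workload, are hypothetical.

```python
import queue
import threading


def worker(tasks, results):
    """Pull work items until a None sentinel arrives, then exit."""
    while True:
        item = tasks.get()
        if item is None:
            break
        # Made-up unit of work: square the input.
        results.put((item, item * item))


def run_workers(work_items, n_workers=3):
    """Start n_workers, feed them through a shared queue, collect the results."""
    tasks = queue.Queue()    # stands in for a cloud queue service
    results = queue.Queue()  # stands in for shared result storage
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in work_items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    collected = {}
    while not results.empty():
        key, value = results.get()
        collected[key] = value
    return collected


if __name__ == "__main__":
    print(run_workers([1, 2, 3, 4]))
```

Nothing here binds the workers into a cluster: they never talk to each other, only to the queue, which is why this pattern sits closer to the manual approach than to MPI.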

Where do we go next? – I’ve been chewing on whether it makes any sense to try to push HPC-style work into Azure, or whether it should simply be relegated to the EC2s of the world… One could conceivably build an implementation of MPI that, rather than relying on the underlying cluster, would provide cloud-style communications between nodes. This could allow those most comfortable with MPI-style apps (or with large existing code bases of them) to continue to use those libraries and applications. But unless the Microsoft pricing (to be announced later this summer) is dramatically cheaper than that of EC2, one has to wonder why anyone would bother to build such a thing (other than out of academic interest, of course). Again, this could be mitigated by Microsoft providing it itself, but the platform would have to offer additional compelling features to pull someone away from what would otherwise be a very comfortable transition (from a local cluster to an EC2-hosted cluster running the same software stacks).

What I’ve really been wondering – is whether it is time to throw MPI out altogether (or, more accurately, the programming paradigm that it represents). Is it time to look for ways to raise the level of abstraction for the computational researcher… and, if so, does something like Azure have a more interesting role? I’m wondering whether some of the abstraction tools (workflow engines, queue services, etc.) will begin to have a role, or whether we need to continue to stretch for every raw bit of horsepower from the system (accepting that abstraction layers are paid for in reduced raw system power). For many of the large simulation models it seems that the raw horsepower is indeed necessary. You also cannot simply ignore the vast collection of existing tools and libraries that target this paradigm. The flip side of the question is this: is there a body of computational research for which, if the cost per CPU-hour were low enough and the increase in development productivity great enough (assuming that the proposed layers of abstraction delivered it), it would not really matter if the job took an additional 30-50% time to run? This is, of course, only salient if we live in a world wherein I can get however many compute nodes I want whenever I want them (no waiting in queue).
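The trade-off in that last question can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: the $0.10 per CPU-hour rate and the 30-50% overhead come from the discussion above, while the job size, the developer rate, and the function names `run_cost` and `break_even_dev_hours` are hypothetical values chosen for illustration.

```python
def run_cost(cpu_hours, rate_per_cpu_hour, overhead=0.0):
    """Cost of a job whose runtime is inflated by `overhead` (0.5 means +50%)."""
    return cpu_hours * (1.0 + overhead) * rate_per_cpu_hour


def break_even_dev_hours(cpu_hours, rate_per_cpu_hour, overhead, dev_rate_per_hour):
    """Developer hours the abstraction must save to pay for its extra runtime."""
    extra = (run_cost(cpu_hours, rate_per_cpu_hour, overhead)
             - run_cost(cpu_hours, rate_per_cpu_hour))
    return extra / dev_rate_per_hour


if __name__ == "__main__":
    # Hypothetical job: 10,000 CPU-hours at $0.10 per CPU-hour.
    for overhead in (0.3, 0.5):
        hours = break_even_dev_hours(10_000, 0.10, overhead, dev_rate_per_hour=50.0)
        print(f"overhead {overhead:.0%}: break-even at about {hours:.1f} developer hours")
```

Under these made-up inputs the extra compute is worth only a handful of developer hours, which is the intuition behind asking whether a slower but more productive abstraction could win for some classes of work. The calculation deliberately ignores time-to-result, which matters a great deal for large simulations.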

My gut tells me we aren’t quite there yet, but I wonder how far out it really is?

May 13, 2009

Data Structures

Posted by: Rob Gillen in Categories: Data.

One of the aspects of our project is to evaluate the “right” way to expose large datasets such that they can be consumed appropriately by other cloud-oriented tools. For our purposes we are considering datasets ranging from a few GB to a few PB – and it is certainly the upper end of this spectrum that causes the most concern. We are targeting the following general use-cases:

  • Tools/Compute currently targeted at Amazon’s S3 service.
  • Tools/Compute currently targeted at Amazon’s EC2 service (not sure if the S3 interfaces solve both issues).
  • Tools/Compute currently targeted at Microsoft’s Azure Storage (blob/table as appropriate).
  • The “General” web client (this scenario makes the case for exposing data in human-readable and/or standards-based formats, so that tools that don’t yet exist can reasonably be expected to consume the data).

Because of the sheer quantity of data, we expect that the data will be stored centrally in some semi-generic fashion and then exposed through a number of protocol- and format-specific interfaces. This may be an incorrect assumption, but it is our current plan.
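As a rough sketch of that plan, the Python below keeps a single central copy of a dataset and renders it on demand through format-specific adapters. Everything in it is an assumption made for illustration: the in-memory `CENTRAL_STORE`, the sample records, and the names `as_json`, `as_csv`, and `expose` are hypothetical, and a real deployment would sit in front of S3, Azure Storage, or a similar backend rather than a dictionary.

```python
import csv
import io
import json

# Hypothetical central store: one generic copy of each dataset (made-up records).
CENTRAL_STORE = {
    "station-42": [
        {"day": "2009-05-01", "temp_c": 18.5},
        {"day": "2009-05-02", "temp_c": 21.0},
    ],
}


def as_json(dataset_id):
    """Render the centrally stored records as a JSON array."""
    return json.dumps(CENTRAL_STORE[dataset_id], sort_keys=True)


def as_csv(dataset_id):
    """Render the same records as CSV with a header row."""
    rows = CENTRAL_STORE[dataset_id]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# One adapter per exposed format; supporting a new format means adding an entry here.
EXPOSERS = {"json": as_json, "csv": as_csv}


def expose(dataset_id, fmt):
    """Look up the adapter for `fmt` and render the dataset through it."""
    return EXPOSERS[fmt](dataset_id)


if __name__ == "__main__":
    print(expose("station-42", "json"))
    print(expose("station-42", "csv"))
```

The appeal of this shape is that petabyte-scale data is stored once, and each additional interface costs only an adapter rather than another copy.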

Some general questions that currently exist:

  • Some of the APIs listed above support multiple protocols and formats (REST, JSON, SOAP) – should all be implemented or just REST?
  • How do we make it easy for various research groups to get their data published?
  • Should service-specific (i.e. S3/Azure) interfaces be built, or should general, REST-based (or other) interfaces be built?
  • Should dataset-specific REST interfaces be provided? (rather than generic interfaces)

I’ve been poking around at http://data.gov hoping that something interesting will start to appear (as yet, it is still a placeholder site). I’ve also been looking at Microsoft’s Open Government Data Initiative (http://ogdisdk.cloudapp.net/), which looks interesting, but no code has been posted yet – maybe their starter kit will appear soon.
