I’ve been thinking quite a bit lately about the role of cloud computing as it applies to scientific research (as hinted at by the title of this site). One possible flaw in my approach is that I’ve been delving into MPI-based compute as much as I can, to wrap my mind around how it works, with the notion of then applying that paradigm to cloud compute. I list it as a flaw only because I wonder whether it is time to think a bit further outside the proverbial box. I’ve been mulling over the following:
How do people actually use multiple machines to solve a problem? – This is really the root question behind all of this work. At one extreme are high-end shared-memory machines (à la Cray supercomputers), and I’m going to eliminate that type of compute from the conversation because it simply can’t be well replicated in the cloud as we currently know it. At the opposite extreme is “manual” clustering or map/reduce: someone figures out a problem they want to solve, divvies it up amongst N nodes, individually runs a program on each node with the appropriate settings, and then manually aggregates the results. This extreme is most likely found in ad-hoc projects or among those not familiar with traditional HPC technologies and approaches. Between the two extremes sit Map/Reduce implementations and traditional MPI programs targeted at distributed-memory systems.
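To make the “manual” extreme concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `split_range`, `node_job` and `aggregate` are hypothetical, the workload (a sum of squares) is a toy, and a plain loop stands in for the N separate machines one would really log into.

```python
def split_range(start, stop, n_nodes):
    """Divide [start, stop) into n_nodes contiguous chunks of near-equal size."""
    base, extra = divmod(stop - start, n_nodes)
    chunks = []
    lo = start
    for i in range(n_nodes):
        # The first `extra` chunks take one additional item each.
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def node_job(lo, hi):
    """The program each node runs independently on its own slice (toy workload)."""
    return sum(i * i for i in range(lo, hi))


def aggregate(partials):
    """The final, by-hand step: combine the per-node results."""
    return sum(partials)


# Hypothetical run: 4 "nodes", simulated here by a simple loop.
chunks = split_range(0, 1000, 4)
partials = [node_job(lo, hi) for lo, hi in chunks]
total = aggregate(partials)
print(chunks, total)
```

The point of the sketch is that nothing binds the nodes together: the split and the aggregation both live in the researcher’s head (or a shell script), which is exactly what MPI and Map/Reduce frameworks take over.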
Amazon’s EC2 – very easy to utilize for lower-throughput MPI-based HPC, given that you can get N Linux boxes for $0.10 per CPU per hour. Because of the vast community that has grown up around it, there are pre-packaged clusters (via AMIs) and even commercial vendors building businesses on top of providing HPC-style compute in EC2 in an “on-demand” fashion. Further, traditional grid computing platforms such as Nimbus have been radically adapted to provide a rather compelling local-to-cloud story for scientific HPC. It would seem that if you are working in HPC today and simply want to utilize an HPC cluster “in the cloud” (maybe because of a lack of access to sufficient hardware), then Amazon’s EC2 and the toolsets that sit on top of it, such as Nimbus and others, are a natural solution.
Microsoft’s Azure – While it is a quickly adapting platform (seeing as it hasn’t yet been released) and Microsoft has hinted at plans to adapt it based on customer demand, if you look at it currently, there’s not an obvious fit for the traditional HPC model. The customer of Azure is given the choice of deploying web or worker roles, and one can imagine using worker roles in a fashion analogous to cluster nodes, but there currently isn’t any built-in infrastructure to bind those nodes into a single group/cluster. As it stands now, Azure seems to lean towards the manual approach to large-scale compute. What could change this story completely is if Microsoft decided to offer HPC Pack-enabled nodes as a type of resource you could request, although there’s been nothing to hint that they are planning anything like this.
Where do we go next? – I’ve been chewing on whether it makes any sense to try to push HPC-style work into Azure, or whether it should simply be relegated to the EC2s of the world. One could conceivably build an implementation of MPI that, rather than relying on the underlying cluster, provides cloud-style communications between nodes. This would allow those most comfortable with MPI-style apps (or with large existing code bases of them) to continue using those libraries and applications. But unless the Microsoft pricing (to be announced later this summer) turns out to be dramatically cheaper than EC2’s, one has to wonder why anyone would bother to build such a thing (other than out of academic interest, of course). Again, this could be mitigated by Microsoft providing such an implementation itself, but the platform would still have to offer additional compelling features to pull someone away from what would otherwise be a very comfortable transition: from a local cluster to an EC2-hosted cluster running the same software stack.
What I’ve really been wondering – is whether it is time to throw MPI out altogether (or, more accurately, the programming paradigm that it represents). Is it time to look for ways to raise the level of abstraction for the computational researcher and, if so, does something like Azure have a more interesting role? I’m wondering whether some of the abstraction tools (workflow engines, queue services, etc.) will begin to have a role, or whether we need to continue to stretch for every raw bit of horsepower from the system (accepting that the cost of abstraction layers is reduced raw system power). For many of the large simulation models it seems that the raw horsepower is indeed necessary. You also cannot simply ignore the vast collection of existing tools and libraries that target this paradigm. The flip side of the question: is there a body of computational research for which, if the cost per CPU-hour were low enough and the increase in development productivity great enough (assuming that the proposed layers of abstraction actually delivered it), it would not really matter if the job took an additional 30-50% of time to run? This is, of course, only salient if we live in a world wherein I can get however many compute nodes I want whenever I want them (no waiting in a queue).
My gut tells me we aren’t quite there yet, but I wonder how far out it really is?
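The trade-off raised above (a 30-50% longer run versus faster development) can be put into rough numbers. The following is only a back-of-the-envelope sketch: the $0.10 per CPU-hour price is the EC2 figure mentioned earlier, while the cluster size, baseline runtime and developer rate are hypothetical placeholders, and the function names are mine.

```python
def compute_cost(nodes, hours, price_per_cpu_hour):
    """Total bill for running `nodes` single-CPU machines for `hours`."""
    return nodes * hours * price_per_cpu_hour


def break_even_dev_hours(extra_compute_cost, dev_rate_per_hour):
    """Developer hours that must be saved to pay for the slower run."""
    return extra_compute_cost / dev_rate_per_hour


PRICE = 0.10       # USD per CPU-hour (the EC2 price mentioned earlier)
NODES = 100        # hypothetical cluster size
BASE_HOURS = 10.0  # hypothetical runtime of the hand-tuned version
DEV_RATE = 50.0    # hypothetical cost of one developer hour, in USD

for slowdown in (1.3, 1.5):
    base = compute_cost(NODES, BASE_HOURS, PRICE)
    slow = compute_cost(NODES, BASE_HOURS * slowdown, PRICE)
    extra = slow - base
    saved = break_even_dev_hours(extra, DEV_RATE)
    print(f"{slowdown:.1f}x runtime: +${extra:.2f} per run, "
          f"offset by {saved:.2f} developer hours saved")
```

Plugging in real prices and team costs is the interesting part; the sketch only shows that the comparison reduces to extra compute spend per run against developer time saved.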
I’ve spent a good bit of last week and again this morning working on some samples of sharing scientific data to the cloud and have found myself coming back to the question posed by the title. When I started on this project, I felt that this would be one of the easiest aspects: take a bunch of data, run it through “something” and have REST/JSON/AtomPub and possibly SOAP come out the other end. The users would have very clear access to the data and would be able to consume it in whatever tool/platform/means they saw fit. To a large degree, I would still argue that this is the ideal scenario, but as I’m digging into the samples and tools used by those I work with, I’m finding that much of the data is stored in custom (well, custom to me) file formats that support self-description (a good thing) and n-dimensional arrays (good, but hard to express in XML). I’m seeing data stored as NetCDF files, HDF5 files and other formats. Many of these formats have a deep history within their respective fields (e.g. computational climatologists would be used to seeing NetCDF files) but don’t have much visibility outside those arenas.
One of the questions I’ve been chewing on is, “is that a bad thing?” If the audience is really domain scientists, then providing data in a format consistent with what they are used to is a good thing and doesn’t necessarily diminish the value of the cloud at all. If, on the other hand, the goal is to provide the data in a more general sense, to a possibly more general audience and supporting the unknown client/toolset, I’d argue that a more Internet-friendly format would be preferable. Unfortunately, targeting the latter case is not as easy as it might seem. Many of these datasets are large, measured in the tens of GB to tens of TB, in their currently optimized binary formats. Converting and exposing those datasets as text (assuming for a moment that a suitable text-based representation of n-dimensional data existed) would bloat the total data size, putting consumption out of reach for many.
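A quick way to get a feel for the bloat is to serialize the same numbers both ways. The sketch below is an illustration under assumptions: the values are random stand-ins for real measurements, JSON is used as a convenient example of a text encoding (not a proposal for representing n-dimensional data), and raw 64-bit floats stand in for a binary format such as NetCDF or HDF5.

```python
import json
import random
import struct

random.seed(0)  # fixed seed so repeated runs compare the same values

# 10,000 hypothetical measurements as double-precision floats.
values = [random.uniform(-1000.0, 1000.0) for _ in range(10_000)]

# Binary: 8 bytes per value, little-endian IEEE 754 doubles.
binary = struct.pack(f"<{len(values)}d", *values)

# Text: each value written out in decimal, plus separators.
text = json.dumps(values).encode("utf-8")

ratio = len(text) / len(binary)
print(f"binary: {len(binary)} bytes, text: {len(text)} bytes, ratio: {ratio:.2f}x")
```

XML would add element tags on top of the decimal digits, so the ratio there would be larger still; compression narrows the gap but adds work on both ends.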
In a meeting I had last week discussing this topic, a colleague suggested that maybe the text-based version of the data set could be exposed as a subset or lower-resolution version of the main set with pointers for the interested to the full set. An alternative might be to provide data services that would, based on a formed query or URL string, return subsets of the full dataset to the caller.
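Both suggestions (a lower-resolution view and query-driven subsets) can be sketched with one small function. This is a hypothetical illustration, not an existing service: the URL layout, the `rows`, `cols` and `stride` parameter names, and the `subset` helper are all assumptions, and a small in-memory 2-D grid stands in for a real n-dimensional dataset.

```python
from urllib.parse import urlparse, parse_qs


def parse_range(spec, size):
    """Parse 'start:stop' (either side optional) into a clamped (start, stop)."""
    start_s, _, stop_s = spec.partition(":")
    start = int(start_s) if start_s else 0
    stop = int(stop_s) if stop_s else size
    return max(0, start), min(size, stop)


def subset(grid, url):
    """Return the slice of `grid` described by the URL's query string.

    Hypothetical parameters: rows=a:b, cols=c:d, stride=n (keep every n-th
    cell in both directions, giving a lower-resolution view).
    """
    query = parse_qs(urlparse(url).query)
    n_rows, n_cols = len(grid), len(grid[0])
    r0, r1 = parse_range(query.get("rows", [":"])[0], n_rows)
    c0, c1 = parse_range(query.get("cols", [":"])[0], n_cols)
    stride = max(1, int(query.get("stride", ["1"])[0]))
    return [row[c0:c1:stride] for row in grid[r0:r1:stride]]


# Toy 4 x 6 grid; the cell value encodes its position (row * 10 + col).
grid = [[r * 10 + c for c in range(6)] for r in range(4)]

window = subset(grid, "http://example.org/data/grid?rows=1:3&cols=2:5")
low_res = subset(grid, "http://example.org/data/grid?stride=2")
print(window)
print(low_res)
```

A real service would read the slice from the native file rather than from memory, which is where the self-describing binary formats help: they can usually hand back a hyperslab without loading the whole dataset.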
Within the scientific community there exist tools/platforms such as OPeNDAP and Live Access Server, but for the most part these serve science-community-specific data formats (again, possibly a good thing). The HDF Group provides an XML representation of their data structure, which is interesting and could be used within REST/etc., but the NetCDF format doesn’t appear to have a counterpart (XML versions of the metadata exist, but the data itself doesn’t appear to be available).
My current thinking is as follows:
- For domain scientists who are working in cloud-computing environments, look for ways to get the full datasets exposed in their native formats as close to the compute as possible (i.e. stored in S3 for Amazon customers, stored in Azure for Microsoft customers, etc.)
- For domain scientists who are working in private “clouds” or otherwise closed environments, look for ways to increase their accessibility to the native file formats (Amazon’s CloudFront and DevPay could be an interesting option here: a CDN gets the data closer to those consuming it, and the bandwidth cost for distribution could possibly be passed on to the consumer).
- For generalists/the unknown client, explore xml-based representations of the common data formats and suitable means of serving that data.
- For generalists/the unknown client, explore ways to easily make lower-res versions of massive datasets available as well as look at means of developing services that provide subsets of the entire set as appropriate.
In all cases, look for ways to make the tooling for consuming/exploring the datasets more “reachable”. Support for *nix platforms is fairly robust, whilst support for Windows-based clients is fairly weak. While one could argue all day about which platform is best for this sort of work, it is currently an indisputable fact that Windows-based clients represent the majority of the computers in the world, and that data formats that provide no support (or only poor support) on the Windows platform by definition exclude a large section of potential consumers of such data.
One of the aspects of our project is to evaluate the “right” way to expose large datasets such that they can be consumed appropriately by other cloud-oriented tools. For our purposes we are considering datasets ranging from a few GB to a few PB – and it is certainly the upper end of this spectrum that causes the most concern. We are targeting the following general use-cases:
- Tools/Compute currently targeted at Amazon’s S3 service.
- Tools/Compute currently targeted at Amazon’s EC2 service (not sure if the S3 interfaces solve both issues).
- Tools/Compute currently targeted at Microsoft’s Azure Storage (blob/table as appropriate).
- The “General” web client (this scenario makes the case for exposing data in a human-readable format and/or standards-based formats such that tools that don’t yet exist can be reasonably expected to be able to consume the data).
Because of the sheer quantity of data, it is our expectation that the data will be stored centrally in some semi-generic fashion that will then be exposed in a number of protocol/format specific means. This may be an incorrect assumption, but it is our current plan.
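One way to picture “stored centrally, exposed in format-specific ways” is a single canonical record with a small renderer per output format. The sketch below is an assumption-laden illustration: the record fields, the dataset name, and the `render` dispatch table are all hypothetical, and only metadata is rendered, not the data itself.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical canonical description of one centrally stored dataset.
RECORD = {"name": "sst_monthly", "units": "kelvin", "shape": [12, 180, 360]}


def to_json(record):
    """Render the record for JSON-speaking clients."""
    return json.dumps(record, sort_keys=True)


def to_xml(record):
    """Render the same record as a small XML document."""
    root = ET.Element("dataset", name=record["name"])
    ET.SubElement(root, "units").text = record["units"]
    shape = ET.SubElement(root, "shape")
    for size in record["shape"]:
        ET.SubElement(shape, "dim").text = str(size)
    return ET.tostring(root, encoding="unicode")


# One entry per exposed format; adding a format means adding a renderer,
# not touching the stored data.
RENDERERS = {"application/json": to_json, "application/xml": to_xml}


def render(record, media_type):
    """Pick the renderer that matches the client's requested media type."""
    return RENDERERS[media_type](record)


print(render(RECORD, "application/json"))
print(render(RECORD, "application/xml"))
```

Service-specific front ends (an S3-style or Azure-style interface) would slot in the same way, as further adapters over the one stored copy.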
Some general questions that currently exist:
- Some of the APIs listed above support multiple protocols (REST, JSON, SOAP) – should all be implemented or just REST?
- How do we make it easy for various research groups to get their data published?
- Should service-specific (i.e. S3/Azure) interfaces be built, or should general, REST-based (or other) interfaces be built?
- Should dataset-specific REST interfaces be provided? (rather than generic interfaces)
I’ve been poking around at http://data.gov hoping that something interesting will start to appear (as of yet, it is still a placeholder site). I’ve also done some poking around at Microsoft’s Open Government Data Initiative (http://ogdisdk.cloudapp.net/), which looks interesting, but no code has been posted yet; maybe their starter kit will appear soon.