The following represents data and results gathered from the second research institution connection cloud transfer test and compares results from Azure’s US North Central data center and Azure’s US South Central data center. The methodology applied during this test is detailed here and should be reviewed prior to considering the results or commentary below.
Test Overview:
- 05561 Cloud Transfer Tests: Research Institution Test 02
- Local Connection: Research Networks
- Started: February 9, 2010
- Finished: February 16, 2010
- Origination Point: Oak Ridge, TN
Disclaimer:
- Standard Disclaimer Applies
Test Objectives:
- Standard objectives apply
- Specific to this test: Test a research institution connection as the researcher’s “workstation” and gather data aimed at building a realistic expectation of performance
Test Setup
- Included File Sizes:
- 2KB, 32KB, 64KB, 128KB, 256KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, 1GB
- Network Connectivity - “research institution”
- Consists of a computer connected to a local network router via 100Mbps hard-wire.
- Multiple switches/routers/firewalls may exist between workstation and the public internet
- There may exist multiple high-speed networks that may be leveraged for connectivity to remote datacenters (ESNet, I2, NLR)
- Reasonable effort has been made to ensure that no other applications or TSRs are running on the source computer for the duration of the test.
- For this test, a newly-installed Windows 7 Professional installation was used, fully patched, with no other applications (beyond the test harness) installed.
Test Execution:
- Standard execution approach applied, except that Azure was tested in both cases, using two different datacenters (see slides for details)
Report Generation
- Standard report generation approach applied
Conventions:
- Standard conventions apply
Resources:
- Standard resources apply - no test-specific customizations beyond adaptations for the specific file sizes included in the test
Results:
As in other tests, the results show some variability that appears to be a result of network traffic conditions. We are continuing to look into this.
In general, the results from the Azure US North Central data center were better than those from US South Central, which is not altogether surprising as we are physically closer to the USNC location.
Slides 171 and 172 remain disturbing as the download values for the 750MB file size continue to be outside of what would be expected.
Slide 172 in particular is of interest as it draws attention to some wide variability across file sizes for the USSC datacenter (not just the 750MB size).
Full results are available in slide form here:
A PDF of the results is available here: http://sciencecloud.us/media/05561_Xfer-Research_02.pdf
The following represents data and results gathered from the first research institution connection cloud transfer test. The methodology applied during this test is detailed here and should be reviewed prior to considering the results or commentary below.
Test Overview:
- 05561 Cloud Transfer Tests: Research Institution Test 01
- Local Connection: Research Networks
- Started: February 8, 2010
- Finished: February 16, 2010
- Origination Point: Oak Ridge, TN
Disclaimer:
- Standard Disclaimer Applies
Test Objectives:
- Standard objectives apply
- Specific to this test: Test a research institution connection as the researcher’s “workstation” and gather data aimed at building a realistic expectation of performance
Test Setup
- Included File Sizes:
- 2KB, 32KB, 64KB, 128KB, 256KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, 1GB
- Network Connectivity - “research institution”
- Consists of a computer connected to a local network router via 100Mbps hard-wire.
- Multiple switches/routers/firewalls may exist between workstation and the public internet
- There may exist multiple high-speed networks that may be leveraged for connectivity to remote datacenters (ESNet, I2, NLR)
- Reasonable effort has been made to ensure that no other applications or TSRs are running on the source computer for the duration of the test.
- For this test, a newly-installed Windows 7 Professional installation was used, fully patched, with no other applications (beyond the test harness) installed.
Test Execution:
- Standard execution approach applied
Report Generation
- Standard report generation approach applied
Conventions:
- Standard conventions apply
Resources:
- Standard resources apply - no test-specific customizations beyond adaptations for the specific file sizes included in the test
Results:
Across both services there is a notable amount of variability that is likely due to intermediate traffic or traffic management issues. Even within the same test run (see the various scatter plots) you can detect “walls” of change, where the values hover around a certain value and then shift to hover around a much higher or lower one (e.g. slides 133 and 134).
There is not a consistent “winner” in this report. For various file sizes one platform would clearly outperform the other, only to have the tables completely reversed for the next file size. This hints at network routing issues. A brief conversation with some of our local networking team indicates that some traffic (in particular Amazon’s) appeared to generally leave via the router connected to ESNet, whereas most of the Microsoft traffic would leave via the router connected to Southern Crossing, with subsequent connections to I2 and NLR. It may well be that inserting some static routes would help address some of the stability issues here.
Of particular interest is the “hump” seen by both services in slide 170. This has been seen in a similar location on the chart in other runs (see slide #82 here: http://www.slideshare.net/rgillen/cloud-storage-upload-tests-02). We don’t yet have a good explanation for this shape in the curve and are hoping to track that down soon.
Further, the shape of the Azure curve in slide 171 is inconsistent with other tests – specifically the data points for the 750MB size. We will continue to compare with other sets/runs to see if this continues or was simply transient.
What remains consistent across all tests so far is that the level of variability tends to be greater with the S3 platform as compared to the Azure Blob storage.
Full results are available in slide form here:
A PDF of the results is available here: http://sciencecloud.us/media/05561_Xfer-Research_01.pdf
The following represents data and results gathered from the first consumer connection cloud transfer test. The methodology applied during this test is detailed here and should be reviewed prior to considering the results or commentary below.
Test Overview:
- 05561 Cloud Transfer Tests: Consumer Connection Test 01
- Local Connection: Comcast Residential
- Started: February 9, 2010
- Finished: February 14, 2010
- Origination Point: Knoxville, TN
Disclaimer:
- Standard Disclaimer Applies
Test Objectives:
- Standard objectives apply
- Specific to this test: Test a consumer/commodity connection as the researcher’s “workstation” and gather data aimed at building a realistic expectation of performance
Test Setup
- Included File Sizes:
- 2KB, 32KB, 64KB, 128KB, 256KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB
- Network Connectivity - “typical home network”
- Consists of a computer connected to a local router via 1GE hard-wire.
- Router is then directly connected to service provider’s modem
- Consumer has a “general” plan for internet connectivity
- Reasonable effort has been made to ensure that no other applications or TSRs are running on the source computer for the duration of the test.
- For this test, a newly-installed Windows 7 Professional installation was used, fully patched, with no other applications (beyond the test harness) installed.
Test Execution:
- Standard execution approach applied
Report Generation
- Standard report generation approach applied
Conventions:
- Standard conventions apply
Resources:
- Standard resources apply - no test-specific customizations beyond adaptations for the specific file sizes included in the test
Results:
In contrast to some other test runs on other networks, in this test Azure seemed to generally (if barely) out-perform the Amazon platform and, consistent with other tests, interaction with Amazon’s platform shows greater variability for a given file size.
The test was limited to file sizes up to and including 100MB so as to avoid being flagged by the residential ISP for poor traffic habits (an issue to be addressed for large-bandwidth users on consumer connections).
Full results are available in slide form here:
A PDF of the results is available here: http://sciencecloud.us/media/05561_Xfer-Consumer_01.pdf
The following describes the methodology applied to some of the data transfer tests we are performing for various cloud storage platforms. In each case, the following approach should be assumed with the exception of test-specific details which will be posted with each result set.
Disclaimer:
- The research team understands that any time the public internet is introduced into a test a number of non-controllable factors are introduced. It is the intent of this project to test various scenarios often enough and with enough variance to obtain a reasonable average and thereby allow the team to make general assumptions about the quality of service (given the constraints stated) that one can reasonably expect to encounter when utilizing a given service.
- It is similarly understood that there may exist environmental factors (e.g. routing paths, proxy servers, firewalls) that affect the transfer rates being tested. In general, it is believed that these factors should affect all tested platforms equally. However, in the case of various research institutes where specialty networks (e.g. ESNet, NLR, I2) exist, there may be routing configurations that particularly favor one service or endpoint over another. It is an objective of these tests to expose these anomalies with the goal of addressing them as appropriate.
- Baseline: For the various services tested, these initial tests were performed using no particular optimization techniques. We took the respective vendor’s shipping SDK, integrated it into a very similar wrapper (source code available for verification) and executed it. Subsequent work should focus on optimizations in the SDKs, or the methods in which the libraries are utilized, etc.
- Not A Stand-Alone Work: This data should not be considered in isolation. Rather, it is a portion of a larger data set (some of which may remain to be published) and should be interpreted for what it is – a portion of a larger collection that aims to provide a more complete view of the entire problem domain.
Test Objectives:
- General: Generate data to set expectations for users of various cloud services focusing on a scenario of local compute combined with cloud-hosted data (blob storage). Note: the reverse scenario as well as cloud-hosted compute/cloud-hosted data will be tested separately
- These tests and data are crucial to our overall objective of improving the experience of researchers interacting with cloud computing assets as they provide a baseline against which any optimizations or alterations may be compared.
Test Setup:
- Test Setup
- A collection of random-data files was generated (RandomFileGenerator.exe). For each file size included in a test, 50 files were generated and stored on standard disks local to the test computer. The range of file sizes is specific to each test set.
- Network Connectivity: specific to each test set
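RandomFileGenerator.exe itself is not reproduced here. Purely as an illustration of the setup step, a minimal Python sketch of the same idea might look like the following. The function name, output directory layout and file naming are assumptions made for this example and do not describe the actual tool.

```python
import os

CHUNK_BYTES = 1024 * 1024  # write in 1 MiB pieces so large files need little memory


def generate_random_files(out_dir, size_bytes, count=50):
    """Write `count` files of `size_bytes` random bytes each into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(count):
        # Hypothetical naming scheme: file_000.bin, file_001.bin, ...
        path = os.path.join(out_dir, f"file_{i:03d}.bin")
        remaining = size_bytes
        with open(path, "wb") as fh:
            while remaining > 0:
                n = min(CHUNK_BYTES, remaining)
                fh.write(os.urandom(n))  # random, effectively incompressible bytes
                remaining -= n
        paths.append(path)
    return paths
```

For example, `generate_random_files("data/2KB", 2 * 1024)` would produce the 50 source files for the 2KB case under that assumed layout.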
Test Execution:
- For each file size, AWS_Console_App1.exe was called to upload the files to Amazon’s US Standard Region and record the duration
.\amazon\aws_console_app1.exe .\data\2KB
- For each file size, DownloadFiles.exe was called to download the files just uploaded to Amazon’s US Standard Region and record the duration
.\downloader\DownloadFiles.exe -i .\amazon_2KB.csv -p 6 -m yes
- For each file size, AzureTesting.exe was called to upload the files to Azure’s US North Central region and record the duration
.\azure\azuretesting.exe .\data\2KB
- For each file size, DownloadFiles.exe was called to download the files just uploaded to Azure’s North Central region and record the duration
.\downloader\DownloadFiles.exe -i .\azure_2KB.csv -p 6 -m yes
- NOTE: immediately following each operation for each file size, the resulting file (log.csv) was renamed to represent the source, transfer direction, and file size
ren log.csv azure_ussc_upload_2KB.csv
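The rename step above could be scripted along the following lines. This is only a sketch: the helper names are hypothetical, and the name pattern (source, direction, file size) is inferred from the single `azure_ussc_upload_2KB.csv` example rather than taken from the actual test scripts.

```python
import os


def log_name(source, direction, size_label):
    """Build a log name such as azure_ussc_upload_2KB.csv."""
    return f"{source}_{direction}_{size_label}.csv"


def archive_log(workdir, source, direction, size_label):
    """Rename workdir/log.csv to its descriptive name and return the new path."""
    src = os.path.join(workdir, "log.csv")
    dst = os.path.join(workdir, log_name(source, direction, size_label))
    os.replace(src, dst)
    return dst
```

Calling `archive_log(".", "azure_ussc", "upload", "2KB")` immediately after an upload run would have the same effect as the `ren` command shown above.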
Report Generation:
- For each service tested, and each file size tested
- For both Uploads and Downloads (separately)
- Scatter plot is generated showing the distribution for the transfer duration (seconds)
- Scatter plot is generated showing the distribution for the transfer rate (Mb/s)
- Transfer duration average (seconds) is calculated
- Transfer duration standard deviation (seconds) is calculated
- Transfer rate average (Mb/s) is calculated
- Transfer rate standard deviation (Mb/s) is calculated
- For each file size tested
- For both Uploads and Downloads (separately)
- A comparison chart (column) is generated showing the average transfer duration (seconds) and error bars indicating one standard deviation (seconds). Also plotted is a dot indicating the associated average transfer rate on the secondary Y axis (Mb/s)
- Summary Charts
- For both Uploads and Downloads (separately)
- A range chart is generated showing the band covered by one standard deviation (per service tested) for the transfer duration (seconds) across the tested file sizes
- A range chart is generated showing the band covered by one standard deviation (per service tested) for the transfer rate (Mb/s) across the tested file sizes
- Presentation
- Once the above charts have been generated, they are assembled into a PowerPoint file
- Once the PowerPoint file has been generated and saved, it is published as a PDF file
- Automation
- All of the above steps are automated via a script (ProcessTransferLogs.ps1)
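ProcessTransferLogs.ps1 is not shown here, but the per-file-size statistics listed above can be sketched in a few lines of Python. Two assumptions are made in this sketch that the text does not confirm: Mb/s is taken to mean 10^6 bits per second, and the standard deviation is the sample standard deviation. The function names are illustrative only.

```python
import statistics


def transfer_rate_mbps(size_bytes, seconds):
    """Transfer rate in Mb/s, assuming 1 Mb = 1,000,000 bits."""
    return (size_bytes * 8) / 1_000_000 / seconds


def summarize(size_bytes, durations):
    """Average and sample standard deviation of duration and rate for one file size."""
    rates = [transfer_rate_mbps(size_bytes, d) for d in durations]
    return {
        "duration_mean_s": statistics.mean(durations),
        "duration_stdev_s": statistics.stdev(durations),
        "rate_mean_mbps": statistics.mean(rates),
        "rate_stdev_mbps": statistics.stdev(rates),
    }
```

Note that the average rate is computed from the per-transfer rates rather than from the average duration; the two generally differ when durations vary, which is one reason to report both.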
Conventions:
- Naming Conventions
- Amazon_USSTD: Amazon’s US Standard region was specified when the bucket was created
- Azure_USNC: Azure’s US North Central region was selected when the storage account was created
- Error Handling
- In most runs, errors were displayed to the screen but not captured to logs.
- The existence of errors (all of which were network-related) is manifested in the logs as collections of fewer than 50 data points (50 being the number of source files per file size)
- Because the respective download tests are based on the upload source files, a download log containing fewer than 50 entries is not necessarily indicative of download errors; it may simply reflect an input file that had fewer than 50 entries. That said, there were more errors on downloads than uploads.
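Since errors only show up as missing entries, a quick check of the logs might look like the sketch below. It assumes each log is a CSV file with one row per successful transfer and no header row; the real log layout is not described here, so treat both the format and the function names as assumptions.

```python
import csv


def count_entries(log_path, has_header=False):
    """Count non-empty rows in a transfer log (one row per completed transfer)."""
    with open(log_path, newline="") as fh:
        rows = [row for row in csv.reader(fh) if row]
    return len(rows) - 1 if has_header and rows else len(rows)


def failed_counts(upload_log, download_log, source_files=50):
    """Estimate failures, measuring downloads against what was actually uploaded."""
    uploaded = count_entries(upload_log)
    downloaded = count_entries(download_log)
    return {
        "upload_failures": source_files - uploaded,
        "download_failures": uploaded - downloaded,
    }
```

Comparing the download count with the upload count, rather than with 50, avoids counting a failed upload a second time as a failed download.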
Resources:
Results: Specific to each test set
I’ve spent a good bit of last week and again this morning working on some samples of sharing scientific data to the cloud and have found myself coming back to the question posed by the title. When I started on this project, I felt that this would be one of the easiest aspects… take a bunch of data, run it through “something” and have REST/JSON/ATOM PUB and possibly SOAP come out the other end. The users would have very clear access to the data and would be able to consume it in whatever tool/platform/means they saw fit.
To a large degree, I would still argue that this is the ideal scenario, but as I’m digging into the samples and tools used by those I work with, I’m finding that much of the data is stored in custom (well, custom to me) file formats that support self-description (a good thing) and n-dimensional arrays (good, but hard to express in XML). I’m seeing data stored as NetCDF files, HDF5 files and other formats. Many of these formats have a long history within their respective fields (e.g. computational climatologists would be used to seeing NetCDF files) but don’t have much visibility outside those arenas.
One of the questions I’ve been chewing on is, “is that a bad thing?”… meaning, if the audience is really domain scientists, then providing data in a format consistent with what they are used to is a good thing and doesn’t necessarily diminish the value of the cloud at all. If, on the other hand, the goal is to provide the data in a more general sense, to a possibly more general audience and supporting the unknown client/toolset, I’d argue that a more Internet-friendly format would be preferable. Unfortunately, targeting the latter case is not as easy as it might seem… meaning, many of these datasets are large… measured in the tens of GB to tens of TB… in their currently-optimized binary formats. Converting and exposing those datasets as text (assuming for a moment that a suitable text-based representation of n-dimensional data existed) would bloat the total data size, putting consumption of the data out of reach for many.
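To give a rough sense of the bloat, the sketch below compares the packed binary size of a list of double-precision values with the size of the same values written as comma-separated decimal text. The function name and the choice of full-precision `repr` formatting are assumptions made for illustration; a real text encoding (XML in particular) would add markup on top of this.

```python
import random
import struct


def binary_and_text_sizes(values):
    """Return (size in bytes as packed float64, size in bytes as comma-separated text)."""
    binary = struct.pack(f"<{len(values)}d", *values)  # 8 bytes per value
    text = ",".join(repr(v) for v in values).encode("ascii")
    return len(binary), len(text)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    sample = [rng.random() for _ in range(1000)]
    n_bin, n_txt = binary_and_text_sizes(sample)
    print(f"binary: {n_bin} bytes, text: {n_txt} bytes, ratio: {n_txt / n_bin:.2f}")
```

Rounding the text to fewer digits shrinks it, but at the cost of precision that the binary form keeps for free.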
In a meeting I had last week discussing this topic, a colleague suggested that maybe the text-based version of the data set could be exposed as a subset or lower-resolution version of the main set with pointers for the interested to the full set. An alternative might be to provide data services that would, based on a formed query or URL string, return subsets of the full dataset to the caller.
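As a sketch of what such a subset service might do internally, the following turns a URL query string into per-dimension ranges and applies them to an n-dimensional array, with an optional step that also gives a crude lower-resolution view. Everything specific here is an assumption for illustration: the URL, the dimension names, the `start:stop[:step]` syntax, and the use of nested Python lists as a stand-in for a real NetCDF/HDF5 array.

```python
from urllib.parse import parse_qsl, urlparse


def parse_ranges(url, dims):
    """Turn ?row=0:2&col=1:3 into one slice per named dimension, in `dims` order."""
    query = dict(parse_qsl(urlparse(url).query))
    slices = []
    for name in dims:
        if name in query:
            parts = [int(p) for p in query[name].split(":")]
            slices.append(slice(*parts))  # start:stop or start:stop:step
        else:
            slices.append(slice(None))  # dimension not mentioned: keep all of it
    return slices


def take(data, slices):
    """Apply one slice per dimension to nested lists."""
    if not slices:
        return data
    head, rest = slices[0], slices[1:]
    return [take(item, rest) for item in data[head]]


if __name__ == "__main__":
    grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    url = "http://example.org/dataset?row=0:2&col=1:3"  # hypothetical endpoint
    print(take(grid, parse_ranges(url, ("row", "col"))))
```

A request such as `?row=0:3:2` would return every second row, which is one simple way the “lower-resolution version” and the “subset” ideas could share a single interface.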
Within the scientific community there exist tools/platforms such as OpeNDAP and Live Access Server(s) but, for the most part, these serve science-community-specific data formats (again, possibly a good thing). The HDF5 group provides an XML representation of their data structure, which is interesting and could be used within REST/etc., but the NetCDF format doesn’t appear to have a counterpart function (XML versions of the metadata exist, but the data itself doesn’t appear to be available).
My current thinking is as follows:
- For domain scientists who are working in cloud-computing environments, look for ways to get the full datasets exposed in their native formats as close to the compute as possible (i.e. stored in S3 for Amazon customers, stored in Azure for Microsoft customers, etc.)
- For domain scientists who are working in private “clouds” or otherwise closed environments, look for ways to increase their accessibility to the native file formats (Amazon’s CloudFront and DevPay could be an interesting option here: a CDN gets the data closer to those consuming it, and the bandwidth cost for distribution could possibly be passed to the consumer of the data).
- For generalists/the unknown client, explore xml-based representations of the common data formats and suitable means of serving that data.
- For generalists/the unknown client, explore ways to easily make lower-res versions of massive datasets available as well as look at means of developing services that provide subsets of the entire set as appropriate.
In all cases, look for ways to make the tooling for consuming/exploring the datasets more “reachable”. The support for *nix platforms is fairly robust whilst the support for Windows-based clients is fairly weak. While one could argue all day about which platform is best for this sort of work, Windows-based clients currently represent the majority of the computers in the world, and data formats that provide no support (or only poor support) on the Windows platform by definition exclude a large section of potential consumers of such data.
One of the aspects of our project is to evaluate the “right” way to expose large datasets such that they can be consumed appropriately by other cloud-oriented tools. For our purposes we are considering datasets ranging from a few GB to a few PB – and it is certainly the upper end of this spectrum that causes the most concern. We are targeting the following general use-cases:
- Tools/Compute currently targeted at Amazon’s S3 service.
- Tools/Compute currently targeted at Amazon’s EC2 service (not sure if the S3 interfaces solve both issues).
- Tools/Compute currently targeted at Microsoft’s Azure Storage (blob/table as appropriate)
- The “General” web client (this scenario makes the case for exposing data in a human-readable format and/or standards-based formats such that tools that don’t yet exist can be reasonably expected to be able to consume the data)
Because of the sheer quantity of data, it is our expectation that the data will be stored centrally in some semi-generic fashion that will then be exposed in a number of protocol/format specific means. This may be an incorrect assumption, but it is our current plan.
Some general questions that currently exist:
- Some of the APIs listed above support multiple protocols (REST, JSON, SOAP) – should all be implemented or just REST?
- How do we make it easy for various research groups to get their data published?
- Should service-specific (e.g. S3/Azure) interfaces be built, or should general, REST-based (or other) interfaces be built?
- Should dataset-specific REST interfaces be provided? (rather than generic interfaces)
I’ve been poking around at http://data.gov hoping that something interesting will start to appear (as of yet, it is still a place-holder site). I’ve also done some poking around at Microsoft’s Open Government Data Initiative (http://ogdisdk.cloudapp.net/), which looks interesting, but code has yet to be posted. Maybe their starter kit will appear soon.