Dallas – Data as a Service

Data as a Service

There has been much interest in the various as-a-service features of the cloud: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). One of the principal benefits driving much of this interest is the demand elasticity exhibited by the various XaaS features. This elasticity is particularly useful in data analytics, where cost associativity, whereby one thousand instances for one hour cost the same as one instance for one thousand hours, means that large datasets can be processed much more quickly than they could previously. The rental model of cloud computing then makes this a cost-effective technique. I think we are at the beginning of a revolutionary change in our ability to perform data analytics.

The drivers of this revolution are the ability to process large amounts of data and the availability of large datasets through Data as a Service offerings from cloud computing providers. These providers can negotiate directly with government and commercial entities for the rights to host and make available the data stored by these entities and, in doing so, democratize access to that data. Amazon AWS has a Public Data Sets offering in which it exposes non-proprietary and public-domain datasets, such as US Census Bureau and human genome data, to developers. At PDC 2009, Microsoft announced Project Dallas, which exposes both commercial and public-domain datasets as a service in Windows Azure.

Access to the Data

The gateway to Dallas is the Dallas portal. This provides pages to manage Dallas accounts, view the Dallas catalog, view subscribed datasets, and view access reports. Each Dallas account can be associated with one or more Dallas accounts which can be used for billing purposes. Furthermore, data access requests can be associated with arbitrary user identifiers so that a company consuming Dallas data and exposing it to its customers can identify who is using its services.

From a developer perspective, Dallas data is exposed through a REST interface. The host endpoint is: https://api.sqlazureservices.com. The query string for the data is provided with the subscription and is specific to the dataset and data series requested. For example, the following query string

/UnService.svc/FAO/3510?&$page=1&$itemsPerPage=100

specifies the United Nations Food and Agriculture Organization FAOSTAT dataset and, specifically, the 3510 data series it contains. It requests the first page of data with a page size of 100 items.
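As a rough sketch of how such a URL could be assembled in code, the helper below joins the host endpoint, the series path, the two paging parameters and any series-specific filters. The class and method names (DallasQuery, BuildUrl) are my own and are not part of Dallas; only the host, the path and the parameter names come from the examples in this post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Dallas portal output.
public static class DallasQuery
{
    // Host endpoint quoted in the post.
    public const String Host = "https://api.sqlazureservices.com";

    // Builds the full request URL for one page of a data series.
    // seriesPath is the dataset-specific path, e.g. "/UnService.svc/FAO/3510".
    // filters holds optional series-specific parameters such as crID or yr.
    public static String BuildUrl(String seriesPath, Int32 page, Int32 itemsPerPage,
        IDictionary<String, String> filters)
    {
        List<String> parts = new List<String>();
        parts.Add("$page=" + page);
        parts.Add("$itemsPerPage=" + itemsPerPage);

        if (filters != null)
        {
            foreach (KeyValuePair<String, String> filter in filters)
            {
                // Escape values so reserved characters survive the query string.
                parts.Add(filter.Key + "=" + Uri.EscapeDataString(filter.Value));
            }
        }

        return Host + seriesPath + "?" + String.Join("&", parts.ToArray());
    }
}
```

Calling DallasQuery.BuildUrl("/UnService.svc/FAO/3510", 1, 100, null) gives the host followed by the path and the two paging parameters from the example above.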

The request must be accompanied by two Dallas-specific headers: $accountKey, specifying the secret authentication key for the Dallas account, and $uniqueUserID, which further identifies which user requested the data. Note that these are simple headers and there is no need to implement any of the sophisticated authentication schemes used with Azure Storage. It is trivial, for example, to issue a Dallas request directly from a tool like Fiddler: it is simply a matter of supplying the URL and the two request headers. The following is an example of a Dallas data request issued in Fiddler:

GET /UnService.svc/FAO/3510?$format=atom10&$page=1&$itemsPerPage=100&crID=4&yr=1988 HTTP/1.1
$accountKey: MY_ACCOUNT
$uniqueUserID: 0e53b882-fbc5-4947-afc5-a4b49d6a042e
Host: api.sqlazureservices.com

In this example, the $uniqueUserID is a random Guid with no specific meaning to the Dallas service itself.
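The same request can be sketched in C# with HttpWebRequest. This is a minimal illustration rather than the code the portal generates: the DallasRequest class and its method names are my own, and only the two header names come from the request shown above. On recent .NET versions WebRequest.Create is flagged as obsolete in favor of HttpClient, but it matches the tooling of the time.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of the Dallas portal output.
public static class DallasRequest
{
    // Prepares (but does not send) a GET request carrying the two Dallas headers.
    public static HttpWebRequest Create(String url, String accountKey, Guid uniqueUserID)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Headers.Add("$accountKey", accountKey);
        request.Headers.Add("$uniqueUserID", uniqueUserID.ToString());
        return request;
    }

    // Sends the request and returns the response body as text.
    // Error handling is deliberately omitted: a failed call surfaces as a WebException.
    public static String Fetch(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Nothing is sent until Fetch calls GetResponse(), so the headers can be inspected on the returned request before any network traffic occurs.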

Example from Food and Agriculture Organization Data FAOSTAT Dataset

The Food and Agriculture Organization (FAO) has made available in Dallas an FAOSTAT dataset containing information pertaining to food and agriculture. This dataset comprises many series of data with different schemas. As with any dataset, the portal can generate a set of classes that simplify access to objects from each of these series. The generated file contains the source code for an item class and a service class for each series in the dataset. The item class is a model class representing an individual entry in a series. The service class exposes a method to retrieve a subset of the data from Dallas.

The FAO 3510 data series is described as Agricultural production index, 1999-2001=100. The declaration of the item class for the FAO3510 series is:

public partial class FAO3510Item {
    // Constructors
    public FAO3510Item();

    // Properties
    public String CountryOrArea { get; set; }
    public Int16 CountryOrAreaCode { get; set; }
    public Int16 SeriesCode { get; set; }
    public Double Value { get; set; }
    public Int16 Year { get; set; }
}

An FAO3510Item object represents a single record in the 3510 series from the FAOSTAT dataset. SeriesCode specifies the series, in this case 3510. CountryOrArea and CountryOrAreaCode are the name and the numeric code of the country or area, while Year specifies the year. The actual value of the data is stored in Value.

The declaration of the service class for the FAO3510 series is:

public partial class FAO3510Service {
    // Constructors
    public FAO3510Service(String accountKey, Guid uniqueUserID);

    // Properties
    public String AccountKey { get; }
    public Int32 CurrentPage { get; set; }
    public Int32 ItemsPerPage { get; set; }
    public Boolean SupportsPaging { get; }
    public Guid UniqueUserID { get; }
    public String Uri { get; }

    // Methods
    private IEnumerable<XElement> InvokeWebService(string url);
    public List<FAO3510Item> Invoke(String crID, String yr);
}

An FAO3510Service object is created by passing an accountKey and a uniqueUserID. The accountKey is one of the authentication keys associated with the Dallas account. The uniqueUserID is an arbitrary user identifier used to identify which user requested the data. SupportsPaging indicates whether or not the series supports paging, while CurrentPage and ItemsPerPage specify the requested page (default of 1) and the number of items per page. Note that during the CTP ItemsPerPage can be no more than 100, which is also the default.

The Invoke() method calls the helper method InvokeWebService() to make the HTTPS calls to Dallas to retrieve the requested data from the series. It then iterates over the retrieved Atom feed containing the data and creates a List of FAO3510Item objects. crID filters the request by CountryOrAreaCode (4 is Afghanistan, for example) and yr filters by year. All the data is retrieved if both parameters are set to null. It is a trivial matter to take this code and modify it to add, for example, more robust error handling. The Azure Platform Training Kit Intro To Dallas hands-on lab parses the retrieved data series into an XDocument and then runs LINQ queries against it.
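To give a feel for what the parsing step might look like, here is a sketch that turns an Atom feed into a list of items using LINQ to XML. It is not the generated code. I am assuming that each Atom entry carries its fields in an ADO.NET Data Services style properties element inside content; the two dataservices namespaces, the SeriesItem stand-in class and the AtomParser name are all assumptions of mine, so check them against a real response before relying on this.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml.Linq;

// Stand-in for the generated FAO3510Item so that the sketch compiles on its own.
public class SeriesItem
{
    public String CountryOrArea { get; set; }
    public Int16 CountryOrAreaCode { get; set; }
    public Int16 SeriesCode { get; set; }
    public Double Value { get; set; }
    public Int16 Year { get; set; }
}

// Hypothetical parser, not part of the Dallas portal output.
public static class AtomParser
{
    static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";
    // ASSUMPTION: the feed uses the ADO.NET Data Services namespaces for its fields.
    static readonly XNamespace Data = "http://schemas.microsoft.com/ado/2007/08/dataservices";
    static readonly XNamespace Meta = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";

    public static List<SeriesItem> Parse(String feedXml)
    {
        List<SeriesItem> items = new List<SeriesItem>();
        foreach (XElement entry in XDocument.Parse(feedXml).Descendants(Atom + "entry"))
        {
            // Throws NullReferenceException if the feed does not match the assumed layout.
            XElement p = entry.Element(Atom + "content").Element(Meta + "properties");
            items.Add(new SeriesItem
            {
                CountryOrArea = (String)p.Element(Data + "CountryOrArea"),
                CountryOrAreaCode = ToInt16(p, "CountryOrAreaCode"),
                SeriesCode = ToInt16(p, "SeriesCode"),
                Value = (Double)p.Element(Data + "Value"),
                Year = ToInt16(p, "Year")
            });
        }
        return items;
    }

    // Reads one named field as an Int16, independent of the current culture.
    static Int16 ToInt16(XElement properties, String name)
    {
        return Int16.Parse(properties.Element(Data + name).Value, CultureInfo.InvariantCulture);
    }
}
```

Keeping the namespaces in one place means that only three lines need to change if the real feed turns out to use a different layout.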

The following example creates an FAO3510Service object and invokes it with no filters. It then iterates over the results.

private void UseServiceClasses( String accountKey, Guid uniqueUserId )
{
    FAO3510Service service = new FAO3510Service(accountKey, uniqueUserId);

    List<FAO3510Item> results = service.Invoke(null, null);

    foreach (FAO3510Item item in results)
    {
        Int16 seriesCode = item.SeriesCode;
        Int16 countryOrAreaCode = item.CountryOrAreaCode;
        String countryOrArea = item.CountryOrArea;
        Int16 year = item.Year;
        Double value = item.Value;
    }
}
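Since ItemsPerPage is capped, retrieving a whole series means walking through the pages. The helper below is one way that loop could look. It is a sketch under an assumption I have not verified against Dallas: that a page holding fewer items than ItemsPerPage marks the end of the series. The Paging class and FetchAll method are hypothetical names; the delegate keeps the helper independent of the generated classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Dallas portal output.
public static class Paging
{
    // Calls fetchPage for pages 1, 2, 3, ... and concatenates the results.
    // ASSUMPTION: a page shorter than itemsPerPage marks the end of the series.
    public static List<T> FetchAll<T>(Func<Int32, List<T>> fetchPage, Int32 itemsPerPage)
    {
        if (itemsPerPage < 1)
        {
            throw new ArgumentOutOfRangeException("itemsPerPage");
        }

        List<T> all = new List<T>();
        Int32 page = 1;
        while (true)
        {
            List<T> batch = fetchPage(page);
            all.AddRange(batch);
            if (batch.Count < itemsPerPage)
            {
                break; // short (or empty) page: nothing more to fetch
            }
            page++;
        }
        return all;
    }

    // With the generated service class, the delegate could be written as:
    //   page => { service.CurrentPage = page; return service.Invoke(null, null); }
}
```

If the series length happens to be an exact multiple of the page size, the loop makes one extra call that returns an empty page and then stops.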

Data Formats

The default data format for data retrieved from Dallas appears to be an Atom feed. Microsoft has indicated that other formats will be available in future. Some datasets may have proprietary data formats and in his PDC talk Zach Owens suggested Microsoft would support these to help those already capable of handling these formats.

One issue with the Atom feed format is that it is very heavy. For example, a single Atom feed entry in the FAO3510 data series consumes almost 350 bytes, even though the FAO3510Item representing it comprises only three Int16 values, one Double and a String. I think a more efficient data representation will be needed for very large datasets.
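To put a number on the gap, the sketch below writes the same five fields with BinaryWriter and returns the resulting bytes. It is only meant to show the arithmetic and says nothing about what format Dallas might adopt; the CompactEncoding class is a hypothetical name of mine.

```csharp
using System;
using System.IO;

// Hypothetical helper used only to measure a compact encoding of one record.
public static class CompactEncoding
{
    // Serializes one record as three Int16 values, one Double and a
    // length-prefixed UTF-8 string (BinaryWriter's default string format).
    public static Byte[] Encode(String countryOrArea, Int16 countryOrAreaCode,
        Int16 seriesCode, Int16 year, Double value)
    {
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            writer.Write(seriesCode);        // 2 bytes
            writer.Write(countryOrAreaCode); // 2 bytes
            writer.Write(year);              // 2 bytes
            writer.Write(value);             // 8 bytes
            writer.Write(countryOrArea);     // length prefix + UTF-8 bytes
            writer.Flush();
            return stream.ToArray();
        }
    }
}
```

For a record for Afghanistan (code 4) in series 3510 for 1988, the numeric fields take 14 bytes and the name takes 12 (a one-byte length prefix plus 11 characters), so the whole record comes to 26 bytes regardless of the Double value, against almost 350 bytes for the Atom entry.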

About Neil Mackenzie

Cloud Solutions Architect. Microsoft

