DataLakeBlog

I had the need to process some data from NOAA in Data Lake Analytics. The data is related to historical weather and I discovered that the data format is old-school - fixed field width and no delimiter style. As I have now realized, the built-in data extractors are limited to the following:

  • CSV - comma separated values
  • TSV - tab separated values
  • Text - general purpose delimited text

Unfortunately, the format of the data file did not match any of these - checkout the table below which describes the format of each row in the data file:

VariableColumnsType
ID1-11Character
LATITUDE13-20Real
LONGITUDE22-30Real
ELEVATION32-37Real
STATE39-40Character
NAME42-71Character
GSNFLAG73-75Character
HCNFLAG77-79Character
WMOID81-85Character
METHOD*87-99Character

A quick search for a fixed-column extractor turned up a blog article from Bryan C Smith that creates an extractor for standard fixed-width columns, but I would have had to think (shudder) and change the range of 1-11 characters (inclusive) and convert them into a length that included a space between fields, etc. As I wanted to limit the likelihood of introducing an error, I decided to create another extractor that could handle the column definitions that are provided. In a nutshell I wanted to be able to specify U-SQL similar to:

@allStations =
    EXTRACT 
        id string,
        latitude double,
        longitude double,
        elevation double,
        state string,
        name string
    FROM 
        @AllStationsInputPath
    USING new StartEndColumnExtractor(
          new List<ColumnDefinition>{
              new ColumnDefinition{ Name = "id", Start = 1, End = 11 },
              new ColumnDefinition{ Name = "latitude", Start = 13, End = 20 },
              new ColumnDefinition{ Name = "longitude", Start = 22, End = 30 },
              new ColumnDefinition{ Name = "elevation", Start = 32, End = 37 },
              new ColumnDefinition{ Name = "state", Start = 39, End = 40 },
              new ColumnDefinition{ Name = "name", Start = 42, End = 71 },
          },
          rowDelimiter : "\n" //unix line ending
    );
  

So, in a similar process to that documented by Bryan C Smith:

  1. I installed the Azure Data Lake Tools for Visual Studio.

  2. I created a Class Library (For U-SQL Application).

  3. I added a class that would represent a column definition:

    public class ColumnDefinition
    {
        public string Name { get; set; }
        public int Start { get; set; }
        public int End { get; set; }
    }
          
  4. I added a class that would implement the IExtractor interface:

    //
    // StartEndColumnExtractor
    // Inspired by an article by Bryan C Smith
    // https://blogs.msdn.microsoft.com/data_otaku/2016/10/27/a-fixed-width-extractor-for-azure-data-lake-analytics/'
    //
    
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using Microsoft.Analytics.Interfaces;
    
    namespace CustomMayd.DataLake.Extractors
    {
        [SqlUserDefinedExtractor]
        public class StartEndColumnExtractor : IExtractor
        {
            private readonly List<ColumnDefinition> _columnDefinitions;
            private readonly Encoding _encoding;
            private readonly byte[] _rowDelimiter;
            private readonly bool _startsAtZeroPosition;
    
            public StartEndColumnExtractor(List<ColumnDefinition> columnDefinitions, bool startsAtZeroPosition = false,
                Encoding encoding = null, string rowDelimiter = "\r\n")
            {
                _columnDefinitions = columnDefinitions;
                _startsAtZeroPosition = startsAtZeroPosition;
                _encoding = encoding ?? Encoding.UTF8;
                _rowDelimiter = _encoding.GetBytes(rowDelimiter);
            }
    
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                foreach (var currentLine in input.Split(_rowDelimiter))
                    using (var lineReader = new StreamReader(currentLine, _encoding))
                    {
                        var line = lineReader.ReadToEnd();
    
                        for (var index = 0; index < _columnDefinitions.Count; index++)
                        {
                            var columnDefinition = _columnDefinitions[index];
                            var startPosition = columnDefinition.Start - (_startsAtZeroPosition ? 0 : 1);
                            var columnWidth = columnDefinition.End - (columnDefinition.Start - 1);
                            var value = line.Substring(startPosition, columnWidth);
                            switch (output.Schema[index].Type.Name)
                            {
                                case "String":
                                    output.Set(index, value.Trim());
                                    break;
                                case "Int32":
                                    output.Set(index, int.Parse(value));
                                    break;
                                case "Double":
                                    output.Set(index, double.Parse(value));
                                    break;
                                case "Float":
                                    output.Set(index, float.Parse(value));
                                    break;
                                case "DateTime":
                                    output.Set(index, DateTime.Parse(value));
                                    break;
                                default:
                                    throw new Exception($"Unknown data type specified: {output.Schema[index].Type.Name}");
                            }
                        }
    
                        yield return output.AsReadOnly();
                    }
            }
        }
    }
          
  5. I compiled the class library and noted the location of the output DLL - (I compiled it in Release mode).

  6. I prefer to have a dedicated U-SQL database where I register my assemblies rather than using the Master database, so I created a DB in U-SQL:

    CREATE DATABASE IF NOT EXISTS WeatherDataDb;
          
  7. To register the assemblies, I use the Data Lake Analytics Explorer view in Visual Studio to connect to my Azure Subscription and:

    • Expand the Data Lake Analytics node
    • Expand the node for my Azure subscription (or the (Local) node if you want to deploy locally)
    • Expand the U-SQL Databases node
    • Right-click on the Assemblies node and select Register Assembly from the context menu
    • Click the ... button to the right of the Choose assembly path field to display the Load Assembly dialog.
    • Ensure Local is selected and click ... to display the Open dialog.
    • Browse to and select the DLL you compiled earlier, then close the Open dialog.
    • On the Load Assembly dialog you should see the Local Path: field is populated - click OK.
    • On the Assembly Registration dialog, you should see the Load assembly from path and the Assembly Name fields are populate - click Submit.

    A job will be submitted to Data Lake that will register the assembly. The Output pane should tell you the job was successful.

  8. Now that the assembly has been registered, the new extractor can be used in a new U-SQL job:

    REFERENCE ASSEMBLY WeatherDataDb.[CustomMayd.DataLake.Extractors];
    
    USING CustomMayd.DataLake.Extractors;
    
    DECLARE @AllStationsInputPath string = "/noaaData/allstations.txt";
    
    DECLARE @OutputFile string = "/output/noaaStations.csv";
    
     @allStations =
        EXTRACT
            id string,
            latitude double,
            longitude double,
            elevation double,
            state string,
            name string
        FROM
            @AllStationsInputPath
        USING new StartEndColumnExtractor(
              new List<ColumnDefinition>{
                  new ColumnDefinition{ Name = "id", Start = 1, End = 11 },
                  new ColumnDefinition{ Name = "latitude", Start = 13, End = 20 },
                  new ColumnDefinition{ Name = "longitude", Start = 22, End = 30 },
                  new ColumnDefinition{ Name = "elevation", Start = 32, End = 37 },
                  new ColumnDefinition{ Name = "state", Start = 39, End = 40 },
                  new ColumnDefinition{ Name = "name", Start = 42, End = 71 },
              },
              rowDelimiter : "\n" //unix line ending
        );
    
    
    OUTPUT @allStations
    TO @OutputFile
    USING Outputters.Csv(outputHeader : true);
          

Submitting and running that jobs reads:

AQC00914000 -14.3167 -170.7667  408.4 AS AASUFOU                                     
AQW00061705 -14.3306 -170.7136    3.7 AS PAGO PAGO WSO AP               GSN     91765
CAW00064757  44.2325  -79.7811  246.0 ON EGBERT 1 W                                  
CQC00914080  15.2136  145.7497  252.1 MP CAPITOL HILL 1      

And creates:

"id","latitude","longitude","elevation","state","name"
"AQC00914000",-14.3167,-170.7667,408.4,"AS","AASUFOU"
"AQW00061705",-14.3306,-170.7136,3.7,"AS","PAGO PAGO WSO AP"
"CAW00064757",44.2325,-79.7811,246,"ON","EGBERT 1 W"
"CQC00914080",15.2136,145.7497,252.1,"MP","CAPITOL HILL 1"

Of course, once you have extracted the content you can perform all the usual DLA magic - I'm just demonstrating the simplest case here of exporting to a CSV.

Thanks to Bryan for writing the article that sent me in the correct direction - I have added other useful resources below.

Resources

Win2D and Composition.md

The Windows 10 April 2018 Update includes a new set of APIs in the Composition namespace that support various geometries such as lines, ellipses, rectangles and paths. However, during the insider cycle, it wasn’t possible to create arbitrary paths as there was a reliance on an implementation of IGeometrySource2D that could not be found. Fast forward until today when a new version of the Win2D package has been released – V1.22.0 and now we can make use of CanvasGeometry class to create our paths.

Here is quick walk-through on getting started with the Geometry APIs.

  1. Create a blank UWP app.

  2. Add the Win2D.UWP nuget package.

  3. Update the MainPage.xaml so that the Grid has the name “RootGrid”:

    <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}" x:Name="RootGrid"> </Grid>
  4. Switch to the MainPage.xaml.cs code behind and update the constructor to:

    public MainPage() { InitializeComponent(); Loaded += OnLoaded; }
  5. Add the OnLoaded method:

    private void OnLoaded(object sender, RoutedEventArgs routedEventArgs) { var c = Window.Current.Compositor; // Need this so we can add multiple shapes to a sprite var shapeContainer = c.CreateContainerShape(); // Rounded Rectangle - just the rounded rect properties var roundedRectangle = c.CreateRoundedRectangleGeometry(); roundedRectangle.CornerRadius = new Vector2(20); roundedRectangle.Size = new Vector2(400, 300); // Need to create a sprite shape from the rounded rect var roundedRectSpriteShape = c.CreateSpriteShape(roundedRectangle); roundedRectSpriteShape.FillBrush = c.CreateColorBrush(Colors.Red); roundedRectSpriteShape.StrokeBrush = c.CreateColorBrush(Colors.Green); roundedRectSpriteShape.StrokeThickness = 5; roundedRectSpriteShape.Offset = new Vector2(20); // Now we must add that share to the container shapeContainer.Shapes.Add(roundedRectSpriteShape); // Let's create another shape var roundedRectSpriteShape2 = c.CreateSpriteShape(roundedRectangle); roundedRectSpriteShape2.FillBrush = c.CreateColorBrush(Colors.Purple); roundedRectSpriteShape2.StrokeBrush = c.CreateColorBrush(Colors.Yellow); roundedRectSpriteShape2.StrokeThickness = 3; roundedRectSpriteShape2.Offset = new Vector2(90); roundedRectSpriteShape2.CenterPoint = new Vector2(200, 150); roundedRectSpriteShape2.RotationAngleInDegrees = 45; // Add it to the container - as it is added after the previous shape, it will appear on top shapeContainer.Shapes.Add(roundedRectSpriteShape2); // Create paths and animate them SetupPathAndAnimation(c, shapeContainer); // Now we need to create a ShapeVisual and add the ShapeContainer to it. var shapeVisual = c.CreateShapeVisual(); shapeVisual.Shapes.Add(shapeContainer); shapeVisual.Size = new Vector2(1000, 1000); // Display the shapeVisual ElementCompositionPreview.SetElementChildVisual(RootGrid, shapeVisual); }
  6. Add the empty (for now) SetupPathAndAnimation:

    private static void SetupPathAndAnimation(Compositor c, CompositionContainerShape shapeContainer) { // Empty for now! }
  7. Compile and run this - you will see the following. I believe this code is pretty self-explanatory.

  8. image

  9. Now we will add a path and animate it. Add the following classes and enum to the project (these are just some helpers I created):

    using System.Collections.Generic; using System.Linq; using System.Numerics; using Microsoft.Graphics.Canvas.Geometry; namespace YourNamespace { public static class PathBuilderExtensions { public static CanvasPathBuilder BuildPathWithLines( this CanvasPathBuilder builder, IEnumerable<Vector2> vectors, CanvasFigureLoop canvasFigureLoop) { var first = true; foreach (var vector2 in vectors) { if (first) { builder.BeginFigure(vector2); first = false; } else { builder.AddLine(vector2); } } builder.EndFigure(canvasFigureLoop); return builder; } public static CanvasPathBuilder BuildPathWithLines( this CanvasPathBuilder builder, IEnumerable<(float x, float y)> nodes, CanvasFigureLoop canvasFigureLoop) { var vectors = nodes.Select(n => new Vector2(n.x, n.y)); return BuildPathWithLines(builder, vectors, canvasFigureLoop); } } public class PathNode { private Vector2 _vector2; public PathNode(Vector2 vector2) { _vector2 = vector2; } } public enum NodeType { Line, Arc, CubicBezier, Geometry, QuadraticBezier } }
  10. Replace the SetupPathAndAnimation() method with:

    private static void SetupPathAndAnimation(Compositor c, CompositionContainerShape shapeContainer) { var startPathBuilder = new CanvasPathBuilder(new CanvasDevice()); // Use my helper to create a W shaped path startPathBuilder.BuildPathWithLines(new(float x, float y)[] { (10, 10), (30, 80), (50, 30), (70, 80), (90, 10) }, CanvasFigureLoop.Open); // Add another path startPathBuilder.BuildPathWithLines(new(float x, float y)[] { (105, 30), (105, 80) }, CanvasFigureLoop.Open); // Create geometry and path that represents the start position of an animation var startGeometry = CanvasGeometry.CreatePath(startPathBuilder); var startPath = new CompositionPath(startGeometry); // Now create the end state paths var endPathBuilder = new CanvasPathBuilder(new CanvasDevice()); endPathBuilder.BuildPathWithLines(new(float x, float y)[] { (10, 10), (30, 10), (50, 10), (70, 10), (90, 10) }, CanvasFigureLoop.Open); endPathBuilder.BuildPathWithLines(new(float x, float y)[] { (105, 30), (105, 80) }, CanvasFigureLoop.Open); var endGeometry = CanvasGeometry.CreatePath(endPathBuilder); var endPath = new CompositionPath(endGeometry); // Create a CompositionPathGeometery from the Win2D GeometeryPath var pathGeometry = c.CreatePathGeometry(startPath); // Create a CompositionSpriteShape from the path var pathShape = c.CreateSpriteShape(pathGeometry); pathShape.StrokeBrush = c.CreateColorBrush(Colors.Purple); pathShape.StrokeThickness = 5; pathShape.Offset = new Vector2(20); // Add the pathShape to the ShapeContainer that we used elsewhere // This will ensure it is rendered shapeContainer.Shapes.Add(pathShape); // Create an animation using the start and endpaths var animation = c.CreatePathKeyFrameAnimation(); animation.Target = "Geometry.Path"; animation.Duration = TimeSpan.FromSeconds(1); animation.InsertKeyFrame(0, startPath); animation.InsertKeyFrame(1, endPath); animation.IterationBehavior = AnimationIterationBehavior.Forever; animation.Direction = AnimationDirection.AlternateReverse; pathGeometry.StartAnimation(nameof(pathGeometry.Path), animation); }
  11. Compile and run the code - you will notice that two paths are now being rendered and the W is animating.

  12. image

This is obviously a very simple example of using the new Composition Geometery APIs, but should be enough to get you started.