niedziela, 17 września 2017

Runtime generated objects serialization

If you think in a generic way about well-implemented RESTful APIs, you will find a pattern that can be easily described and stored in metadata. Most RESTful APIs are just a combination of the following elements:

  • Resource location (URL)
  • HTTP method
  • Header information
  • Input parameters (required and optional)
  • Content type
  • Output parameters
  • Business logic description

Today I would like to discuss an interesting problem that I came across recently. Imagine for a second that you need to implement a RESTful API client which uses a combination of metadata describing the API and user input in order to make HTTP calls.

In such a scenario you will quickly realize that for a subset of API calls you will need to develop custom classes so that they can later be serialized (to JSON or XML) at runtime and sent via POST or PUT requests. This raises a question: do I really need to implement a separate class for every type that the API expects as part of an HTTP request body?

Well, maybe. In my case I decided to use a more generic approach and leverage the simplicity of standard data formats like JSON. The 'hack' is very simple. From a serialization point of view, any class is just a container for properties of specific types - methods and interfaces we can skip, as they have nothing to do with serialization. Let's then simplify the generic description of a class: a class is a collection of key-value pairs. Sounds familiar, doesn't it? Maybe like the description of the JSON format? That is correct... milestone achieved.
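To see the equivalence in action, compare a concrete class with its key-value counterpart - both serialize to the same JSON. This is just a minimal sketch using Json.NET (the Newtonsoft.Json package, as in the examples below); the User class here is only an illustration:

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

public class User
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        // A regular class instance...
        var fromClass = JsonConvert.SerializeObject(new User { Name = "Damian", Age = 12 });

        // ...and a key-value collection describing the same object.
        var fromDict = JsonConvert.SerializeObject(new Dictionary<string, object>
        {
            { "Name", "Damian" },
            { "Age", 12 }
        });

        // Both produce: {"Name":"Damian","Age":12}
        Console.WriteLine(fromClass == fromDict);
    }
}
```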

We already know that we have a key-value pair type of collection. How to describe it from a data structure perspective? Very simple! Actually so simple that we have more than one option available.

We can use the following:
  • Dictionary<string,object>
  • dynamic type
  • Anonymous types and var 
Other:
  • The approach with List<Tuple<string,object>> does not work! The resulting JSON has properties named Item1 and Item2 instead of the expected names:
[{"Item1":"Name","Item2":"Damian"},{"Item1":"Surname","Item2":"Damian"},{"Item1":"Age","Item2":12},{"Item1":"Books","Item2":["Book 1","Book 2"]}]
  • If you know any other method (even a crazy and geeky one) please let me know. Just don't send one using reflection...
OK. It's time to serialize our classes.

1. By using Dictionary<string, object>

      // Dictionary approach
      Dictionary<string, object> userDict = new Dictionary<string, object>();
      userDict.Add("Name", "Damian");
      userDict.Add("Surname", "Zapart");
      userDict.Add("Age", 12);
      userDict.Add("Books", new List<String> { "Book 1", "Book 2" });
 
      JsonConvert.SerializeObject(userDict);

2. By using a dynamic type

       // Dynamic type approach
       dynamic user = new
       {
           Name = "Damian",
           Surname = "Zapart",
           Age = 12,
            Books = new List<String> { "Book 1", "Book 2" },
       };
 
       JsonConvert.SerializeObject(user);




3. By using anonymous types

          // Anonymous approach
          var userAnonymous = new
          {
              Name = "Damian",
              Surname = "Zapart",
              Age = 12,
              Books = new List<String> { "Book 1", "Book 2" }
          };

          JsonConvert.SerializeObject(userAnonymous);

And we're done! It was a pleasure. Next time I will focus on the performance of each solution.

Result JSON:
{"Name":"Damian","Surname":"Zapart","Age":12,"Books":["Book 1","Book 2"]}

wtorek, 27 września 2016

Deep dive in unit testing

These days every product reaching the market is labelled as top quality - no matter if it's a toy, a car or an application. Everyone talks about quality, quality is everywhere, and at the same time quality is by nature a tricky thing to define and measure. To give you an example, imagine two brand new cars from two different manufacturers, for example BMW and Fiat. Dealers of both brands will tell you that their cars are top quality, and in fact that is true! The problem starts when you try to understand what top quality means for each manufacturer - what their standards of quality are. What Fiat considers top quality might be completely unacceptable for BMW. From the client's perspective, what really matters is understanding how to measure quality in a standards-driven way. As an example, let's compare the European car safety performance assessment rating for both brands (NCAP is rated from 1 to 5 stars, where 5 stars is given to the safest cars). In this rating, cars manufactured by Fiat sometimes get 5 stars but usually score below 5. At the same time a BMW must get 5, otherwise the car won't be released. Please remember that both manufacturers claim their cars to be top quality(!), and without a standard in place (like NCAP) it is difficult, from the client's perspective, to understand the quality baseline of each brand.
Having common standards is also very important in the software development world, as the same quality rules apply to all software products. Any application or system might be considered top quality, because without common standards it is just a matter of defining what quality means. For example, having only one production outage per week might be considered top quality by some teams, while other teams do not tolerate any outage at all.
"A unit test is an automated piece of code that invokes a unit of work in the system and then checks a single assumption about the behaviour of that unit of work" (Unit Testing Lessons in Ruby, Java and .NET - The Art Of Unit Testing - Definition of a Unit Test). Above definition describes unit test to be a piece of code calling other small (atomic) piece of product code under test. For people who are new in development process this might sound very bizarre as a benefit of having unit tests in a project is not clearly defined. However unit tests are one of the most useful and powerfully tool in the developer hands. The advantages of having unit tests implemented are:
  • Cut dependencies - the most powerful feature of unit tests is that a piece of code can be executed in isolation (without its dependencies fully in place) by using mocks. Mock objects are simulated objects that mimic the behaviour of real objects in controlled ways. Therefore product code which, for example, calls an external API or a database can easily be tested by 'mocking dependencies' (without having the other components set up) - the dependencies still exist but are replaced by mock objects.
  • Execution time - unit tests are fast, really fast. The short execution time is achieved by cutting dependencies and isolating code, which allows an entire test to execute in memory - without any network traffic.
  • Fail fast - by the nature of defects, it is more expensive and time consuming to fix a bug in Production or UAT than on a local development environment. Unit tests help detect defects at a very early stage of the development process. Thanks to fast execution and mocks, each developer can run unit tests on a local environment. This manual activity (a good habit) can also be automated, as all unit tests can be executed as part of a gated check-in build in TFS (the code commit process). In such a scenario a code change is accepted only if all unit tests pass.
  • Continuous delivery - unit tests are self-contained and therefore don't need any pre-existing data in order to execute. This allows them to run as part of the continuous delivery process, straight after a code merge between, for example, UAT and Production branches. Such an approach ensures that the merge did not introduce any unexpected regression or defect.
  • Enforces good design - implementing unit tests is not that easy, but it's not hard either. A certain product code architecture must be in place before writing the tests themselves. Thankfully, just by following the SOLID principles, code unit-testability can be achieved pretty fast. Moreover, following SOLID makes the code more open to change (agile) as well as easier to maintain.
  • Facilitates change - introducing a change in any piece of code always brings risk. This risk can be easily mitigated when the product has high unit test coverage. Running unit tests after completing changes on a local development environment helps identify a potential regression in the code and fix it straightaway.
  • Allows measuring code coverage - running unit tests against the source code under test provides a useful metric: the percentage of lines of code executed (tested) by the unit tests per project.
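As a sketch of the 'cut dependencies' point above: the repository interface and service below are hypothetical names, and the mock is hand-rolled rather than created with a mocking framework, but it shows how a unit test can exercise business logic without any real database in place:

```csharp
using System;

// Hypothetical dependency that would normally hit a database.
public interface IUserRepository
{
    string GetUserName(int id);
}

// Product code under test.
public class GreetingService
{
    private readonly IUserRepository repository;

    public GreetingService(IUserRepository repository)
    {
        this.repository = repository;
    }

    public string Greet(int userId)
    {
        return "Hello, " + this.repository.GetUserName(userId) + "!";
    }
}

// Hand-rolled mock: mimics the repository in a controlled way.
public class FakeUserRepository : IUserRepository
{
    public string GetUserName(int id)
    {
        return "Damian";
    }
}

public static class GreetingServiceTests
{
    public static void Main()
    {
        // Arrange: the real database dependency is replaced by the mock.
        var service = new GreetingService(new FakeUserRepository());

        // Act + Assert: the unit of work is verified in isolation, in memory.
        if (service.Greet(1) != "Hello, Damian!")
        {
            throw new Exception("Greet returned an unexpected value.");
        }

        Console.WriteLine("Test passed.");
    }
}
```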


I have already described what unit tests are and how they can be used to improve product quality. However, at this stage there are three additional questions regarding unit testing:
Q: Who should develop unit tests?
A: In the Agile Scrum world there is no separation between Developer and Tester roles in a team - all members of a development team are considered developers. To answer this question, I encourage you to think about unit testing as a way for developers (all team members) to prove that the code works as expected. This means it is up to each team member who introduces a change in the code to add/modify/deprecate the relevant unit tests. Without following such an approach, a developer cannot prove that the code he/she implemented is actually functional. A good practice here is to change the standard development process (implement code -> test code) and start using the TDD (Test Driven Development) approach. Additionally, TDD helps to solve another common problem with testing: time allocation. Unfortunately, it is a very common pattern across agile development teams to deprioritize (drop) testing work at the end of a sprint under the pressure of deadlines. With TDD, tests go first (before the actual product code change), so the time allocation is always there.
In summary, unit tests verify the quality of both code changes and the overall product. It should be up to the entire development team to make sure that enough time is allocated for testing and that tests are implemented on time for each change.
Q: How to encourage your team to implement unit tests?
A: The short answer is that you should not have to. All developers should understand how beneficial it is to have unit tests in place. Unfortunately, this is not always the case, and sometimes a mentoring effort is required to explain the benefits of unit testing to your team. Additionally, other aspects of the software development process, like code reviews and measuring code coverage during the build, can be used.
Q: When to run unit tests?
A: The beauty of unit tests is that they are very fast and easy to run. As unit tests use mocks to cut dependencies, each developer can run thousands of them from his/her local development environment in a few seconds. This is very powerful: being able to run a set of tests against a project with high code coverage means that each developer can check the impact of a code change on the entire product. This gives all team members a great tool that helps avoid regressions in the product code, and so saves time. Additionally, unit tests can be executed automatically by CI as part of committing a change to the code repository (code check-in), so a breaking change has no chance to get into the repository - it will be rejected if any unit test fails.
One last question remains: is it worth investing in unit tests? Of course it is! It's never too late to bring some improvements to both your product and your team. The time invested in implementing unit tests might initially be significant, but it will decrease over time and the investment will certainly pay off.

niedziela, 8 lutego 2015

Using Hortonworks Hive in .NET

A few months ago I decided to learn big data. This sounds very complex and of course it is. All these strange names, which mean nothing to a person new to the area, combined with a different way of looking at data storage, make the entire topic even more complex. However, after reading N blogs and watching many, many tutorials, today I finally had a chance to write some code. As last week I managed to set up a Hortonworks distribution of Hadoop, today I decided to connect to it from my .NET based application, and this is what I will describe in this post.

First things first: I didn't set up the entire Hortonworks ecosystem from scratch - I'd love to, but for now that is far beyond my knowledge, so I decided to use a sandbox environment provided by Hortonworks. There are multiple VMs available to download, but in my case I chose the Hyper-V one. You can read more about setting this environment up here.

Picture 1. Up and running sandbox environment.

Now that I have my big data store ready, I need to be able to establish a connection to it and start exchanging data. The problem is that Hadoop is not a database - rather, it is a distributed file system (HDFS) and a map-reduce engine, so neither of these fits my need directly. The tool that I'm looking for is called Hive, a data warehouse infrastructure built on top of Hadoop. Hive provides a SQL-like syntax (called HiveQL) which I can use just as with a normal database. To learn more about Hive syntax, data types and concepts, I'd strongly recommend this tutorial.

OK, so far so good. I decided that I'm going to use Hive as the warehouse that I will be connecting to from my .NET based application. The next step is to take care of a data provider, which is something like a driver that allows my code to interact with Hive. Of course there is no native .NET provider for Hive, but this is where the ODBC middleware API plays its part. The end-to-end tutorial on how to download and set up the ODBC drivers for Hortonworks Hive allowed me to get everything up and running pretty easily and fast, so I could focus on the last part: the C# code.

Picture 2. Hortonworks Hive ODBC Driver setup.

For all developers who have at least some experience with ADO.NET or ODBC programming writing code for communicating with Hive should be very straightforward as overall concept as well as classes are exactly the same. First of all I need to have a connection string to my instance of Hive and I can build it very easily in two ways:
  • Just by specifying a predefined (Picture 2.) DSN name: dsn=Hadoop ODBC
  • By specifying all properties directly in the connection string:
    var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                      
                                            Host=192.168.56.101;
                                            Port=10000;
                                            Schema=default;
                                            HiveServerType=2;
                                            ApplySSPWithQueries=1;
                                            AsyncExecPollInterval=100;
                                            HS2AuthMech=2;
                                            UserName=sandbox;";
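Alternatively, instead of hand-formatting the string, the standard OdbcConnectionStringBuilder from System.Data.Odbc can assemble the same values (a small sketch; the host and credentials below are the sandbox ones from above, and the driver-specific keys simply pass through the indexer):

```csharp
using System;
using System.Data.Odbc;

class ConnectionStringDemo
{
    static void Main()
    {
        var builder = new OdbcConnectionStringBuilder
        {
            // The builder wraps the driver name in braces automatically.
            Driver = "Hortonworks Hive ODBC Driver"
        };

        // Driver-specific keys are set through the indexer.
        builder["Host"] = "192.168.56.101";
        builder["Port"] = "10000";
        builder["Schema"] = "default";
        builder["HiveServerType"] = "2";
        builder["UserName"] = "sandbox";

        Console.WriteLine(builder.ConnectionString);
    }
}
```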

One way or another, after setting up the connection string, the next steps in my code are:
1) Opening a connection to Hive.
2) Creating a simple Hive table called 'Searches' which contains 3 columns plus one partition column called searchTime.
3) Inserting data to 'Searches' table.
4) Retrieving data from 'Searches' table by using a simple Hive SELECT query and OdbcDataReader class.
5) Dropping a 'Searches' table.

The full code for my solution is presented below.

namespace Hadoopclient
{
    using System;
    using System.Data.Odbc;
 
    class Program
    {
        static void Main(string[] args)
        {
            // @"dsn=Hadoop ODBC"
            var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                                        
                                        Host=192.168.56.101;
                                        Port=10000;
                                        Schema=default;
                                        HiveServerType=2;
                                        ApplySSPWithQueries=1;
                                        AsyncExecPollInterval=100;
                                        HS2AuthMech=2;
                                        UserName=sandbox;";
 
            var createTableCommandText = "CREATE TABLE Searches(searchTerm STRING, userid BIGINT,userIp STRING) " +
                                            "COMMENT 'Stores all searches for data' " +
                                            "PARTITIONED BY(searchTime DATE) " +
                                            "STORED AS SEQUENCEFILE;";
 
            using (var connection = new OdbcConnection(connectionString))
            {
                using (var command = new OdbcCommand(createTableCommandText, connection))
                {
                    try
                    {
 
                        connection.Open();
 
                        // Create a table.
                        command.ExecuteNonQuery();
 
                        // Insert row of data.
                        command.CommandText = "INSERT INTO TABLE Searches PARTITION (searchTime = '2015-02-08') " +
                                               "VALUES ('search term', 1, '127.0.0.1')";
 
                        command.ExecuteNonQuery();
 
                        // Reading data from Hadoop.
                        command.CommandText = "SELECT * FROM Searches";
                        using (var reader = command.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                for (var i = 0; i < reader.FieldCount; i++)
                                {
                                    Console.WriteLine(reader[i]);
                                }
                            }
                        }
                    }
                    catch (OdbcException ex)
                    {
                        Console.WriteLine(ex.Message);
                        throw;
                    }
                    finally
                    {
                        // Drop the table (only if the connection was successfully opened).
                        if (connection.State == System.Data.ConnectionState.Open)
                        {
                            command.CommandText = "DROP TABLE Searches";
                            command.ExecuteNonQuery();
                        }
                    }
                }
            }
        }
    }
}


Thank you.

środa, 3 grudnia 2014

Multithread processing of the SqlDataReader - Producer/Consumer design pattern

In today's post I want to describe how to optimize usage of the ADO.NET SqlDataReader class by using multi-threading. To present that, let me introduce the problem that I will try to solve.

Scenario:
In a project we decided to move all data from multiple databases to one data warehouse. It will be a good few terabytes of data, or even more. The data transfer will be done using a custom importer program.

Problem:
After implementing the database-agnostic logic of generating and executing a query, I realized that I can retrieve data from the source databases faster than I can upload it to the big data store through the HTTP client (the importer program). In other words, the data reader is capable of reading data faster than I can process and upload it to my big data lake.

Solution:
As a solution to this problem I would like to propose one of the multi-thread design patterns, called Producer/Consumer. In general this pattern consists of two main classes, where:
  • the Producer class is responsible for adding new items to a shared collection;
  • the Consumer class is responsible for retrieving items from the collection and processing them in a specific way.
There is also a third component in this pattern which allows data to be shared between the consumer and the producer - a thread-safe collection. Of course there can be multiple consumers and multiple producers working concurrently at the same time; the most important part is that no matter how many threads have been created, they all share the same collection.

Using this solution will help me speed up my upload process, because the producer class will be adding new records to the shared collection, and at the same time multiple consumer threads will be reading from it and processing items one by one.


Picture 1. Consumer producer design pattern basic schema.
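Before the full implementation, the shape of the pattern can be sketched in a few lines: a BlockingCollection<T> shared by one producer task and one consumer loop (the integers below just stand in for real records):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        // The shared, thread-safe collection with a bounded capacity.
        var items = new BlockingCollection<int>(boundedCapacity: 25);

        // Producer: adds items, then signals that no more will come.
        var producer = Task.Run(() =>
        {
            for (var i = 0; i < 10; i++)
            {
                items.Add(i);
            }

            items.CompleteAdding();
        });

        // Consumer: GetConsumingEnumerable blocks until items arrive
        // and ends cleanly once CompleteAdding has been called.
        foreach (var item in items.GetConsumingEnumerable())
        {
            Console.WriteLine(item);
        }

        producer.Wait();
    }
}
```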

Implementation:
To implement this design pattern I created a really simple console application, and apart from the Program.cs class, which centralizes the program logic, I put there just a few more files. First I implemented a thread-safe collection because, as I mentioned, it will be the place where data is shared between the producer and consumer instances. As my collection I chose the BlockingCollection<T> type from the System.Collections.Concurrent namespace, as it is purpose-built for Producer/Consumer implementations. It provides a set of list-like functions for adding and retrieving items in a thread-safe way, and it also exposes pattern-specific elements such as the CompleteAdding() function and the IsAddingCompleted property (their purpose is to let the producer notify one or more consumers that adding items has finished and the consumer thread(s) should complete). I wrapped this collection in a new class, so the code for my ItemCollection<T> class looks as follows:

   using System;
   using System.Collections.Concurrent;
 
   /// <summary>
   /// An items collection.
   /// </summary>
   /// <typeparam name="T">Type of the item.</typeparam>
   public class ItemCollection<T>
   {
       /// <summary>
       /// The internal collection of items.
       /// </summary>
       private BlockingCollection<T> collection;
 
       /// <summary>
       /// Gets the collection upper bound.
       /// </summary>
       /// <value>
       /// The upper bound.
       /// </value>
       public uint UpperBound { get; private set; }
 
       /// <summary>
       /// Gets a value indicating whether this adding to the collection has been completed.
       /// </summary>
       public bool IsAddingCompleted
       {
           get
           {
               return this.collection.IsAddingCompleted;
           }
       }
 
       /// <summary>
       /// Initializes a new instance of the <see cref="ItemCollection{T}"/> class.
       /// </summary>
       /// <param name="upperBound">The collection upper bound.</param>
       public ItemCollection(uint upperBound = 25)
       {
           this.UpperBound = upperBound;
           this.collection = new BlockingCollection<T>((int)this.UpperBound);
       }
 
       /// <summary>
       /// Adds the specified item.
       /// </summary>
       /// <param name="item">The item.</param>
       /// <param name="timeoutMiliseconds">The timeout miliseconds.</param>
       /// <returns>Adding result.</returns>
       public bool TryAdd(T item, int timeoutMiliseconds)
       {
           var addResult = this.collection.TryAdd(item, timeoutMiliseconds);
 
           if (!addResult)
           {
               throw new InvalidOperationException("Unable to add item to collection.");
           }
 
           return addResult;
       }
 
       /// <summary>
       /// Try to take an item from collection.
       /// </summary>
       /// <param name="timeoutMiliseconds">The timeout miliseconds.</param>
       /// <returns>An instance of the item.</returns>
       public T TryTake(int timeoutMiliseconds)
       {
           var result = default(T);
 
           if (!this.collection.TryTake(out result, timeoutMiliseconds))
           {
               throw new InvalidOperationException("Unable to get item from collection.");
           }
 
           return result;
       }
 
       /// <summary>
       /// Completes the process of adding.
       /// </summary>
       public void CompleteAdding()
       {
           this.collection.CompleteAdding();
       }
   }

When my collection is ready, the next step is to fill it with data. To do that I implemented a producer class which internally starts a new thread and, within a while loop, keeps adding items to the collection until the producing function returns false. At that point it calls the
CompleteAdding() function on the ItemCollection<T> class to let consumers know that no more items will be added. I mentioned a 'producing function' above, so just to make it clear: in the producer class constructor I expect a Func<T> definition. I did that because I want to keep the producer logic generic, so it can be used in multiple scenarios - in this case the passed function comes from the main thread and is responsible for retrieving data from a source database.

   /// <summary>
   /// The items producer.
   /// </summary>
   public class Producer<T>
        where T : class, new()
   {
       /// <summary>
       /// The collection.
       /// </summary>
       private readonly ItemCollection<T> collection;
 
       /// <summary>
       /// The producing function.
       /// </summary>
       private readonly Func<T> producingFunction;
 
       /// <summary>
       /// Initializes a new instance of the <see cref="Producer{T}"/> class.
       /// </summary>
       /// <param name="producingFunction">The producing function.</param>
       /// <param name="collection">The collection.</param>
       public Producer(Func<T> producingFunction, ItemCollection<T> collection)
       {
           if (producingFunction == null)
           {
               throw new ArgumentNullException("producingFunction");
           }
 
           if (collection == null)
           {
               throw new ArgumentNullException("collection");
           }
 
           this.collection = collection;
           this.producingFunction = producingFunction;
       }
 
       /// <summary>
       /// Starts producing items.
       /// </summary>
       public void Start()
       {
           Task.Factory.StartNew(() =>
           {
               while (this.Produce())
               {
                   continue;
               }
 
               this.collection.CompleteAdding();
           });
       }
 
       /// <summary>
       /// Produces this item.
       /// </summary>
       /// <returns>True is item has been produced and added to collection.</returns>
       public bool Produce()
       {
           var producingResult = this.producingFunction.Invoke();
           var result = false;
 
            // Add the item only if the producing function actually returned one.
            if (producingResult != default(T))
           {
               result = this.collection.TryAdd(producingResult, (int)TimeSpan.FromSeconds(10).TotalMilliseconds);
           }
 
           return result;
       }
   }



The last part of the design pattern is the consumer. I implemented it similarly to the producer: in the class constructor I expect the definition of an Action<T>, which I invoke to consume an item from the shared collection.

/// <summary>
/// A consumer class.
/// </summary>
/// <typeparam name="T">Type of the object to process.</typeparam>
public class Consumer<T>
    where T : class, new()
{
    /// <summary>
    /// The collection.
    /// </summary>
    private readonly ItemCollection<T> collection;
 
    /// <summary>
    /// The consuming function.
    /// </summary>
    private readonly Action<T> consumingAction;
 
    /// <summary>
    /// Initializes a new instance of the <see cref="Consumer{T}"/> class.
    /// </summary>
    /// <param name="collection">The collection.</param>
    /// <param name="consumingFunction">The consuming function.</param>
    public Consumer(ItemCollection<T> collection, Action<T> consumingFunction)
    {
        if (collection == null)
        {
            throw new ArgumentNullException("collection");
        }
 
         if (consumingFunction == null)
         {
             throw new ArgumentNullException("consumingFunction");
         }
 
        this.collection = collection;
        this.consumingAction = consumingFunction;
    }
 
    /// <summary>
    /// Consumes this item from collection.
    /// </summary>
    public void Consume()
    {
        var instance = default(T);
 
        while (!this.collection.IsAddingCompleted)
        {
            instance = this.collection.TryTake((int)TimeSpan.FromMinutes(1).TotalMilliseconds);
 
            if (instance != null)
            {
                this.consumingAction.Invoke(instance);
            }
            else
            {
                throw new InvalidOperationException("Unable to get item from collection.");
            }
        }
    }
}

Lastly, in Program.cs I put all the logic required to marry the pieces together. As you may notice in the code below, I put the producing and consuming function definitions in this class body. Moreover, instead of using just a single thread for the consumer, I decided to start multiple - where the number of threads is equal to the number of logical processors on the host.


static void Main(string[] args)
       {
           var itemCollection = new ItemCollection<User>();
           var consumerTasks = new List<Task>();
           var connection = new SqlConnection(args[0]);
 
           connection.Open();
           var dataReader = GetDataReader(connection);
 
           // Producer initialization.
           var producer = new Producer<User>(() =>
           {
               User user = null;
 
               if (dataReader.Read())
               {
                   user = new User()
                   {
                       Id = dataReader.GetInt32(0),
                       Name = dataReader.GetString(1),
                       Surname = dataReader.GetString(2),
                       Email = dataReader.GetString(3)
                   };
               }
 
               return user;
           }, itemCollection);
 
           producer.Start();
 
            // One task per logical processor.
           Enumerable.Range(0, System.Environment.ProcessorCount)
               .ToList()
               .ForEach(i =>
           {
               var consumer = new Consumer<User>(itemCollection, user =>
               {
                   Program.InsertToBigData(user);
               });
 
               // Start consumption.
               consumerTasks.Add(Task.Factory.StartNew(() => { consumer.Consume(); }));
           });
 
            // Waiting for all tasks to complete before closing the reader and the connection.
            Task.WhenAll(consumerTasks.ToArray())
                .ContinueWith((task) =>
                    {
                        if (!dataReader.IsClosed)
                        {
                            dataReader.Close();
                        }
 
                        connection.Close();
                        connection.Dispose();
                    })
                .Wait();
        }


Thank you