
Using Hortonworks Hive in .NET

A few months ago I decided to learn about big data. This sounds very complex, and of course it is. All the strange names, which tell nothing to a person who is new to the area, combined with a different way of looking at data storage, make the entire topic even more complex. However, after reading many blogs and watching many, many tutorials, today I finally had a chance to try writing some code. Since I managed to set up a Hortonworks distribution of Hadoop last week, today I decided to connect to it from my .NET-based application, and this is what I will describe in this post.

First things first: I didn't set up the entire Hortonworks ecosystem from scratch. I'd love to, but for now it's far beyond my knowledge, so I decided to use the sandbox environment provided by Hortonworks. There are several VM images available to download; in my case I chose the Hyper-V one. You can read more about setting this environment up here.

Picture 1. Up and running sandbox environment.

Now that I have my big data store ready, I need to establish a connection to it and start exchanging data. The problem is that Hadoop is not a database; rather, it's a distributed file system (HDFS) and a MapReduce engine, so neither of these fits my need directly. The tool I'm looking for is called Hive, a data warehouse infrastructure built on top of Hadoop. Hive provides a SQL-like syntax (called HiveQL) which I can use much as I would with a normal database. To learn more about Hive syntax, data types and concepts I'd strongly recommend this tutorial.

OK, so far so good. I decided that I'm going to use Hive as the warehouse that I will connect to from my .NET-based application. The next step is to take care of a data provider, which is something like a driver that allows my code to interact with Hive. Of course there is no native .NET provider for Hive, but this is where the ODBC middleware API plays its part. The end-to-end tutorial on how to download and set up the ODBC driver for Hortonworks Hive allowed me to set it up quickly and easily, so I could focus on the last part, which is the C# code.

Picture 2. Hortonworks Hive ODBC Driver setup.

For all developers who have at least some experience with ADO.NET or ODBC programming, writing code that communicates with Hive should be very straightforward, as the overall concept and the classes are exactly the same. First of all I need a connection string to my instance of Hive, and I can build it in two ways:
  • Just by specifying a predefined (Picture 2.) DSN name: dsn=Hadoop ODBC
  • By specifying all properties directly in the connection string:
    var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                      
                                            Host=192.168.56.101;
                                            Port=10000;
                                            Schema=default;
                                            HiveServerType=2;
                                            ApplySSPWithQueries=1;
                                            AsyncExecPollInterval=100;
                                            HS2AuthMech=2;
                                            UserName=sandbox;";
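
As an alternative to concatenating the string by hand, the same key/value pairs can be assembled with the OdbcConnectionStringBuilder class from System.Data.Odbc. This is only a minimal sketch: the HiveConnectionStrings class and its Build method are hypothetical names introduced here for illustration, the keys and values are copied from the connection string above, and on .NET Core or later it assumes the System.Data.Odbc NuGet package is referenced.

```csharp
using System;
using System.Data.Odbc;
using System.Globalization;

// Hypothetical helper, not part of the original solution.
static class HiveConnectionStrings
{
    // Builds a DSN-less connection string for the Hortonworks Hive ODBC driver.
    public static string Build(string host, int port, string userName)
    {
        var builder = new OdbcConnectionStringBuilder
        {
            // The driver name must match the one registered in the ODBC administrator.
            Driver = "Hortonworks Hive ODBC Driver"
        };

        // Driver-specific keys are set through the indexer; the values mirror
        // the hand-written connection string from this post.
        builder["Host"] = host;
        builder["Port"] = port.ToString(CultureInfo.InvariantCulture);
        builder["Schema"] = "default";
        builder["HiveServerType"] = "2";
        builder["ApplySSPWithQueries"] = "1";
        builder["AsyncExecPollInterval"] = "100";
        builder["HS2AuthMech"] = "2";
        builder["UserName"] = userName;

        return builder.ConnectionString;
    }
}
```

Calling HiveConnectionStrings.Build("192.168.56.101", 10000, "sandbox") should give a string carrying the same settings as the hand-written one, and it keeps the host, port and user name out of a string literal.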

Either way, after setting up the connection string, the following steps in my code are:
1) Opening a connection to Hive.
2) Creating a simple Hive table called 'Searches' which contains 3 columns plus one partition column called searchTime.
3) Inserting data into the 'Searches' table.
4) Retrieving data from the 'Searches' table using a simple Hive SELECT query and the OdbcDataReader class.
5) Dropping the 'Searches' table.

The full code for my solution is presented below.

namespace Hadoopclient
{
    using System;
    using System.Data.Odbc;
 
    class Program
    {
        static void Main(string[] args)
        {
            // Alternatively, use the predefined DSN: @"dsn=Hadoop ODBC"
            var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                                        
                                        Host=192.168.56.101;
                                        Port=10000;
                                        Schema=default;
                                        HiveServerType=2;
                                        ApplySSPWithQueries=1;
                                        AsyncExecPollInterval=100;
                                        HS2AuthMech=2;
                                        UserName=sandbox;";
 
            var createTableCommandText = "CREATE TABLE Searches(searchTerm STRING, userid BIGINT,userIp STRING) " +
                                            "COMMENT 'Stores all searches for data' " +
                                            "PARTITIONED BY(searchTime DATE) " +
                                            "STORED AS SEQUENCEFILE;";
 
            using (var connection = new OdbcConnection(connectionString))
            {
                using (var command = new OdbcCommand(createTableCommandText, connection))
                {
                    try
                    {
 
                        connection.Open();
 
                        // Create a table.
                        command.ExecuteNonQuery();
 
                        // Insert row of data.
                        command.CommandText = "INSERT INTO TABLE Searches PARTITION (searchTime = '2015-02-08') " +
                                               "VALUES ('search term', 1, '127.0.0.1')";
 
                        command.ExecuteNonQuery();
 
                        // Reading data from Hadoop.
                        command.CommandText = "SELECT * FROM Searches";
                        using (var reader = command.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                for (var i = 0; i < reader.FieldCount; i++)
                                {
                                    Console.WriteLine(reader[i]);
                                }
                            }
                        }
                    }
                    catch (OdbcException ex)
                    {
                        Console.WriteLine(ex.Message);
                        throw;
                    }
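                    finally
                    {
                        // Drop the table, but only if the connection was actually opened;
                        // otherwise ExecuteNonQuery would throw and mask the original error.
                        if (connection.State == System.Data.ConnectionState.Open)
                        {
                            command.CommandText = "DROP TABLE IF EXISTS Searches";
                            command.ExecuteNonQuery();
                        }
                    }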
                }
            }
        }
    }
}
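
The read loop above prints every value on its own line, without saying which column it came from. A possible refinement, sketched below, pairs each value with its column name by using GetName and GetValue from the IDataReader interface, which OdbcDataReader implements. The ReaderDump class and the FormatRows method are hypothetical names, and the exact column names Hive returns (for example, whether they carry a table prefix) depend on the driver and the Hive settings.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper, not part of the original solution.
static class ReaderDump
{
    // Formats each row of the reader as "column=value, column=value, ...".
    public static IEnumerable<string> FormatRows(IDataReader reader)
    {
        while (reader.Read())
        {
            var parts = new string[reader.FieldCount];
            for (var i = 0; i < reader.FieldCount; i++)
            {
                // Database NULLs are printed explicitly instead of as empty text.
                var value = reader.IsDBNull(i)
                    ? "NULL"
                    : Convert.ToString(reader.GetValue(i));
                parts[i] = reader.GetName(i) + "=" + value;
            }
            yield return string.Join(", ", parts);
        }
    }
}
```

Inside the using block of the reader, the nested loops could then be replaced with a single foreach over ReaderDump.FormatRows(reader) that writes each line to the console.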


Thank you.
