
Using Hortonworks Hive in .NET

A few months ago I decided to learn about big data. It sounds very complex, and of course it is. All the strange names, which tell nothing to someone new to the area, combined with a different way of looking at data storage, make the whole topic even harder. However, after reading N blogs and watching many, many tutorials, today I finally had a chance to write some code. Last week I managed to set up a Hortonworks distribution of Hadoop, so today I decided to connect to it from my .NET-based application, and that is what I will describe in this post.

First things first: I didn't set up the entire Hortonworks ecosystem from scratch. I'd love to, but for now it's far beyond my knowledge, so I decided to use the sandbox environment provided by Hortonworks. Several VM images are available for download; in my case I chose the Hyper-V one. You can read more about setting this environment up here.

Picture 1. Up and running sandbox environment.

Now that my big data store is ready, I need to establish a connection to it and start exchanging data. The problem is that Hadoop is not a database; rather, it's a distributed file system (HDFS) plus a MapReduce engine, so neither of these fits my need directly. The tool I'm looking for is called Hive, a data warehouse infrastructure built on top of Hadoop. Hive provides a SQL-like syntax (called HiveQL) which I can use much as I would with a normal database. To learn more about Hive syntax, data types and concepts, I'd strongly recommend this tutorial.

OK, so far so good. I decided to use Hive as the warehouse that my .NET-based application connects to. The next step is to take care of a data provider, which is something like a driver that allows my code to interact with Hive. Of course there is no native .NET provider for Hive, but this is where the ODBC middleware API comes in. The end-to-end tutorial on how to download and set up the ODBC driver for Hortonworks Hive let me get it working quickly and easily, so I could focus on the last part, which is the C# code.

Picture 2. Hortonworks Hive ODBC Driver setup.

For any developer with at least some experience of ADO.NET or ODBC programming, writing code to communicate with Hive should be very straightforward, as the overall concept and the classes are exactly the same. First of all I need a connection string to my Hive instance, and I can build it in two ways:
  • By specifying the predefined DSN name (Picture 2): dsn=Hadoop ODBC
  • By specifying all properties directly in the connection string:
    var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                      
                                            Host=192.168.56.101;
                                            Port=10000;
                                            Schema=default;
                                            HiveServerType=2;
                                            ApplySSPWithQueries=1;
                                            AsyncExecPollInterval=100;
                                            HS2AuthMech=2;
                                            UserName=sandbox;";
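
The multi-line verbatim string above carries its line breaks and indentation into the connection string. As a minimal sketch of an alternative, the same settings can be joined into a single line. HiveConnectionString, Build and Sandbox are hypothetical names introduced only for this illustration; they are not part of the driver or of ADO.NET, and the values are simply the ones from this post.

```csharp
using System;
using System.Linq;

static class HiveConnectionString
{
    // Hypothetical helper (an assumption for illustration): joins the pairs
    // into the usual "key=value;" ODBC form, keeping the order given.
    public static string Build(params (string Key, string Value)[] settings)
    {
        return string.Concat(settings.Select(s => s.Key + "=" + s.Value + ";"));
    }

    // The sandbox settings used in this post, without embedded whitespace.
    public static string Sandbox()
    {
        return Build(
            ("DRIVER", "{Hortonworks Hive ODBC Driver}"),
            ("Host", "192.168.56.101"),
            ("Port", "10000"),
            ("Schema", "default"),
            ("HiveServerType", "2"),
            ("ApplySSPWithQueries", "1"),
            ("AsyncExecPollInterval", "100"),
            ("HS2AuthMech", "2"),
            ("UserName", "sandbox"));
    }
}
```

The result of HiveConnectionString.Sandbox() can be passed to the OdbcConnection constructor in the same way as the literal string.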

One way or another, once the connection string is set up, the next steps in my code are:
1) Opening a connection to Hive.
2) Creating a simple Hive table called 'Searches', which contains 3 columns plus one partition column called searchTime.
3) Inserting data into the 'Searches' table.
4) Retrieving data from the 'Searches' table using a simple Hive SELECT query and the OdbcDataReader class.
5) Dropping the 'Searches' table.

The full code for my solution is presented below.

namespace Hadoopclient
{
    using System;
    using System.Data.Odbc;
 
    class Program
    {
        static void Main(string[] args)
        {
            // Alternatively, use the predefined DSN: @"dsn=Hadoop ODBC"
            var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                                        
                                        Host=192.168.56.101;
                                        Port=10000;
                                        Schema=default;
                                        HiveServerType=2;
                                        ApplySSPWithQueries=1;
                                        AsyncExecPollInterval=100;
                                        HS2AuthMech=2;
                                        UserName=sandbox;";
 
            var createTableCommandText = "CREATE TABLE Searches(searchTerm STRING, userid BIGINT,userIp STRING) " +
                                            "COMMENT 'Stores all searches for data' " +
                                            "PARTITIONED BY(searchTime DATE) " +
                                            "STORED AS SEQUENCEFILE;";
 
            using (var connection = new OdbcConnection(connectionString))
            {
                using (var command = new OdbcCommand(createTableCommandText, connection))
                {
                    try
                    {
 
                        connection.Open();
 
                        // Create a table.
                        command.ExecuteNonQuery();
 
                        // Insert row of data.
                        command.CommandText = "INSERT INTO TABLE Searches PARTITION (searchTime = '2015-02-08') " +
                                               "VALUES ('search term', 1, '127.0.0.1')";
 
                        command.ExecuteNonQuery();
 
                        // Reading data from Hadoop.
                        command.CommandText = "SELECT * FROM Searches";
                        using (var reader = command.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                for (var i = 0; i < reader.FieldCount; i++)
                                {
                                    Console.WriteLine(reader[i]);
                                }
                            }
                        }
                    }
                    catch (OdbcException ex)
                    {
                        Console.WriteLine(ex.Message);
                        throw;
                    }
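                    finally
                    {
                        // Drop the table, but only if the connection was actually opened;
                        // otherwise ExecuteNonQuery would throw here and mask the original error.
                        if (connection.State == System.Data.ConnectionState.Open)
                        {
                            command.CommandText = "DROP TABLE IF EXISTS Searches";
                            command.ExecuteNonQuery();
                        }
                    }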
                }
            }
        }
    }
}
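
The read loop in the listing above prints every field on its own line, without column names. As a rough sketch of a more readable output, the helper below formats each row as "column=value" pairs. ReaderFormatting and FormatRows are hypothetical names introduced only for this illustration. The helper takes the IDataReader interface, which OdbcDataReader implements, so it can be tried against an in-memory DataTable without a running Hive instance.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

static class ReaderFormatting
{
    // Hypothetical helper (an assumption for illustration): turns each row
    // of the reader into a single "column=value, column=value" line.
    public static List<string> FormatRows(IDataReader reader)
    {
        var lines = new List<string>();
        while (reader.Read())
        {
            var fields = new string[reader.FieldCount];
            for (var i = 0; i < reader.FieldCount; i++)
            {
                // Show NULL explicitly instead of an empty string.
                var value = reader.IsDBNull(i) ? "NULL" : Convert.ToString(reader.GetValue(i));
                fields[i] = reader.GetName(i) + "=" + value;
            }
            lines.Add(string.Join(", ", fields));
        }
        return lines;
    }
}
```

Inside the using block of the listing, the nested loops could then be replaced with: foreach (var line in ReaderFormatting.FormatRows(reader)) Console.WriteLine(line);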


Thank you.
