
Using Hortonworks Hive in .NET

A few months ago I decided to learn about big data. That sounds very complex, and of course it is. All the strange names, which tell nothing to a person who is new to the area, combined with a different way of looking at data storage, make the entire topic even more complex. However, after reading N blogs and watching many, many tutorials, today I finally had a chance to try to write some code. Last week I managed to set up a Hortonworks distribution of Hadoop, so today I decided to connect to it from my .NET-based application, and this is what I will describe in this post.

First things first: I didn't set up the entire Hortonworks ecosystem from scratch. I'd love to, but for now it's far beyond my knowledge, so I decided to use the sandbox environment provided by Hortonworks. There are several VM images available for download; in my case I chose the Hyper-V one. You can read more about setting up this environment here.

Picture 1. Up and running sandbox environment.

Now that I have my big data store ready, I need to be able to establish a connection to it and start exchanging data. The problem is that Hadoop is not a database; rather, it's a distributed file system (HDFS) and a MapReduce engine, so neither of these fits my need directly. The tool that I'm looking for is called Hive, which is a data warehouse infrastructure built on top of Hadoop. Hive provides a SQL-like syntax (called HiveQL) which I can use much as I would with a normal database. To learn more about Hive syntax, data types and concepts I'd strongly recommend this tutorial.

OK, so far so good. I decided that I'm going to use Hive as the warehouse that I will connect to from my .NET-based application. The next step is to take care of a data provider, which is something like a driver that allows my code to interact with Hive. Of course there is no native .NET provider for Hive, but this is where the ODBC middleware API comes into play. The end-to-end tutorial on how to download and set up the ODBC driver for Hortonworks Hive allowed me to set it up easily and quickly, so I could focus on the last part, which is the C# code.

Picture 2. Hortonworks Hive ODBC Driver setup.

For all developers who have at least some experience with ADO.NET or ODBC programming, writing code that communicates with Hive should be very straightforward, as the overall concept and the classes are exactly the same. First of all I need a connection string to my instance of Hive, and I can build it in two ways:
  • Just by specifying a predefined (Picture 2.) DSN (data source name): dsn=Hadoop ODBC
  • By specifying all properties directly in the connection string:
    var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                      
                                            Host=192.168.56.101;
                                            Port=10000;
                                            Schema=default;
                                            HiveServerType=2;
                                            ApplySSPWithQueries=1;
                                            AsyncExecPollInterval=100;
                                            HS2AuthMech=2;
                                            UserName=sandbox;";
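
As a side note, both variants can also be assembled with the OdbcConnectionStringBuilder class from System.Data.Odbc instead of a hand-written string. The snippet below is only a minimal sketch: the class HiveConnectionStrings and its two methods are names I made up for illustration, and the keys and values are simply the ones from the connection string above.

```csharp
using System;
using System.Data.Odbc;

// Hypothetical helper class, not part of any library.
static class HiveConnectionStrings
{
    // Variant 1: rely on the DSN configured in the ODBC administrator.
    public static OdbcConnectionStringBuilder FromDsn(string dsn)
    {
        return new OdbcConnectionStringBuilder { Dsn = dsn };
    }

    // Variant 2: spell out every property, mirroring the literal above.
    public static OdbcConnectionStringBuilder FromProperties(string host, int port, string userName)
    {
        var builder = new OdbcConnectionStringBuilder
        {
            Driver = "Hortonworks Hive ODBC Driver"
        };

        // Driver-specific keys go through the indexer.
        builder["Host"] = host;
        builder["Port"] = port.ToString();
        builder["Schema"] = "default";
        builder["HiveServerType"] = "2";
        builder["ApplySSPWithQueries"] = "1";
        builder["AsyncExecPollInterval"] = "100";
        builder["HS2AuthMech"] = "2";
        builder["UserName"] = userName;
        return builder;
    }

    static void Main()
    {
        // Print both connection strings; nothing is opened here.
        Console.WriteLine(FromDsn("Hadoop ODBC").ConnectionString);
        Console.WriteLine(FromProperties("192.168.56.101", 10000, "sandbox").ConnectionString);
    }
}
```

The resulting builder.ConnectionString can be passed to the OdbcConnection constructor exactly like the string literal. The main benefit is that host, port and user name become ordinary parameters instead of text buried in a multi-line string.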

Either way, once the connection string is set up, the next steps in my code are:
1) Open a connection to Hive.
2) Create a simple Hive table called 'Searches', which contains 3 columns plus one partition column called searchTime.
3) Insert data into the 'Searches' table.
4) Retrieve data from the 'Searches' table using a simple Hive SELECT query and the OdbcDataReader class.
5) Drop the 'Searches' table.

The full code for my solution is presented below.

namespace Hadoopclient
{
    using System;
    using System.Data.Odbc;
 
    class Program
    {
        static void Main(string[] args)
        {
            // Alternatively, use the predefined DSN: @"dsn=Hadoop ODBC"
            var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver};                                        
                                        Host=192.168.56.101;
                                        Port=10000;
                                        Schema=default;
                                        HiveServerType=2;
                                        ApplySSPWithQueries=1;
                                        AsyncExecPollInterval=100;
                                        HS2AuthMech=2;
                                        UserName=sandbox;";
 
            var createTableCommandText = "CREATE TABLE Searches(searchTerm STRING, userid BIGINT,userIp STRING) " +
                                            "COMMENT 'Stores all searches for data' " +
                                            "PARTITIONED BY(searchTime DATE) " +
                                            "STORED AS SEQUENCEFILE;";
 
            using (var connection = new OdbcConnection(connectionString))
            {
                using (var command = new OdbcCommand(createTableCommandText, connection))
                {
                    try
                    {
 
                        connection.Open();
 
                        // Create a table.
                        command.ExecuteNonQuery();
 
                        // Insert row of data.
                        command.CommandText = "INSERT INTO TABLE Searches PARTITION (searchTime = '2015-02-08') " +
                                               "VALUES ('search term', 1, '127.0.0.1')";
 
                        command.ExecuteNonQuery();
 
                        // Reading data from Hadoop.
                        command.CommandText = "SELECT * FROM Searches";
                        using (var reader = command.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                for (var i = 0; i < reader.FieldCount; i++)
                                {
                                    Console.WriteLine(reader[i]);
                                }
                            }
                        }
                    }
                    catch (OdbcException ex)
                    {
                        Console.WriteLine(ex.Message);
                        throw;
                    }
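                    finally
                    {
                        // Drop the table, but only if the connection was actually opened;
                        // otherwise this call would throw and hide the original error.
                        if (connection.State == System.Data.ConnectionState.Open)
                        {
                            command.CommandText = "DROP TABLE IF EXISTS Searches";
                            command.ExecuteNonQuery();
                        }
                    }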
                }
            }
        }
    }
}
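
The reading loop above prints only raw values, one per line. A possible variation, sketched below, prints each value together with its column name and handles NULLs explicitly. The ReaderDump class and its PrintRows method are hypothetical names of mine; the method accepts any IDataReader, so it should work with the OdbcDataReader returned by command.ExecuteReader(). To keep the sketch runnable without a Hive instance, the demo feeds it an in-memory DataTable whose columns are my assumption of what the 'Searches' table returns; the actual column names reported by the Hive driver may differ.

```csharp
using System;
using System.Data;

// Hypothetical helper class, not part of any library.
static class ReaderDump
{
    // Prints every row as "columnName = value" lines.
    public static void PrintRows(IDataReader reader)
    {
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                // Show database NULLs explicitly instead of an empty line.
                var value = reader.IsDBNull(i)
                    ? "NULL"
                    : Convert.ToString(reader.GetValue(i));
                Console.WriteLine("{0} = {1}", reader.GetName(i), value);
            }
        }
    }

    static void Main()
    {
        // In-memory stand-in for the result of "SELECT * FROM Searches".
        var table = new DataTable("Searches");
        table.Columns.Add("searchTerm", typeof(string));
        table.Columns.Add("userid", typeof(long));
        table.Columns.Add("userIp", typeof(string));
        table.Rows.Add("search term", 1L, "127.0.0.1");
        table.Rows.Add("another term", 2L, DBNull.Value);

        using (var reader = table.CreateDataReader())
        {
            PrintRows(reader);
        }
    }
}
```

In the Hive program the call would simply be ReaderDump.PrintRows(reader) inside the using block that wraps command.ExecuteReader().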


Thank you.
