A few months ago I decided to learn big data. It sounds very complex, and of course it is. All the strange names, which tell nothing to a person who is new to the area, combined with a different way of looking at data storage, make the entire topic even more complex. However, after reading N blogs and watching many, many tutorials, today I finally had a chance to write some code. Last week I managed to set up a Hortonworks distribution of Hadoop, so today I decided to connect to it from my .NET-based application, and that is what I will describe in this post.
First things first: I didn't set up the entire Hortonworks ecosystem from scratch. I'd love to, but for now it's far beyond my knowledge, so I decided to use the sandbox environment provided by Hortonworks. There are several VM images available to download; in my case I chose the Hyper-V one. You can read more about setting this environment up here.
Picture 1. Up and running sandbox environment.
Now that I have my big data store ready, I need to be able to establish a connection to it and start exchanging data. The problem is that Hadoop is not a database; rather, it's a distributed file system (HDFS) and a MapReduce engine, so neither of these fits my need directly. The tool I'm looking for is called Hive, which is a data warehouse infrastructure built on top of Hadoop. Hive provides a SQL-like syntax (called HiveQL) which I can use much as I would with a normal database. To learn more about Hive syntax, data types and concepts I'd strongly recommend this tutorial.
OK, so far so good. I decided that I'm going to use Hive as the warehouse that I will connect to from my .NET-based application. The next step is to take care of a data provider, which is something like a driver that allows my code to interact with Hive. Of course there is no native .NET provider for Hive, but this is where the ODBC middleware API comes into play. The end-to-end tutorial on how to download and set up the ODBC driver for Hortonworks Hive let me configure it quickly and easily, so I could focus on the last part, which is the C# code.
Picture 2. Hortonworks Hive ODBC Driver setup.
For developers who have at least some experience with ADO.NET or ODBC programming, writing code that communicates with Hive should be very straightforward, as the overall concept and the classes are exactly the same. First of all I need a connection string to my instance of Hive, and I can build it very easily in two ways:
- Just by specifying a predefined (Picture 2.) DSN (data source name): dsn=Hadoop ODBC
- By specifying all properties directly in the connection string:
var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver}; Host=192.168.56.101; Port=10000; Schema=default; HiveServerType=2; ApplySSPWithQueries=1; AsyncExecPollInterval=100; HS2AuthMech=2; UserName=sandbox;";
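Instead of concatenating the string by hand, the same settings can be assembled with the OdbcConnectionStringBuilder class from System.Data.Odbc. The sketch below is only an illustration: the class name HiveConnectionStrings and its two methods are hypothetical helpers of mine, and the keys are simply the ones from the string above, passed through to the driver unchanged.

```csharp
using System.Data.Odbc;

// Hypothetical helper (not part of any library): builds the two kinds of
// connection string described above.
public static class HiveConnectionStrings
{
    // Variant 1: rely on a DSN that was predefined in the ODBC administrator.
    public static string FromDsn(string dsnName)
    {
        var builder = new OdbcConnectionStringBuilder { Dsn = dsnName };
        return builder.ConnectionString;
    }

    // Variant 2: spell out every property. The keys mirror the hand-written
    // string; the builder does not validate them, the driver does.
    public static string FromProperties(string host, int port, string userName)
    {
        var builder = new OdbcConnectionStringBuilder
        {
            Driver = "Hortonworks Hive ODBC Driver"
        };
        builder["Host"] = host;
        builder["Port"] = port.ToString();
        builder["Schema"] = "default";
        builder["HiveServerType"] = "2";
        builder["ApplySSPWithQueries"] = "1";
        builder["AsyncExecPollInterval"] = "100";
        builder["HS2AuthMech"] = "2";
        builder["UserName"] = userName;
        return builder.ConnectionString;
    }
}
```

Usage would then look like new OdbcConnection(HiveConnectionStrings.FromProperties("192.168.56.101", 10000, "sandbox")). The benefit is mostly that values are escaped for me and the host or user can come from configuration rather than being baked into one long literal.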
With the connection string in place, my sample application performs the following steps:
1) Opening a connection to Hive.
2) Creating a simple Hive table called 'Searches' which contains 3 columns plus one partition column called searchTime.
3) Inserting data into the 'Searches' table.
4) Retrieving data from the 'Searches' table by using a simple Hive SELECT query and the OdbcDataReader class.
5) Dropping the 'Searches' table.
The full code for my solution is presented below.
namespace Hadoopclient
{
    using System;
    using System.Data;
    using System.Data.Odbc;

    class Program
    {
        static void Main(string[] args)
        {
            // Alternative: use the predefined DSN, @"dsn=Hadoop ODBC"
            var connectionString = @"DRIVER={Hortonworks Hive ODBC Driver}; Host=192.168.56.101; Port=10000; Schema=default; HiveServerType=2; ApplySSPWithQueries=1; AsyncExecPollInterval=100; HS2AuthMech=2; UserName=sandbox;";

            var createTableCommandText =
                "CREATE TABLE Searches(searchTerm STRING, userid BIGINT,userIp STRING) " +
                "COMMENT 'Stores all searches for data' " +
                "PARTITIONED BY(searchTime DATE) " +
                "STORED AS SEQUENCEFILE;";

            using (var connection = new OdbcConnection(connectionString))
            {
                using (var command = new OdbcCommand(createTableCommandText, connection))
                {
                    try
                    {
                        connection.Open();

                        // Create a table.
                        command.ExecuteNonQuery();

                        // Insert a row of data into one partition.
                        command.CommandText =
                            "INSERT INTO TABLE Searches PARTITION (searchTime = '2015-02-08') " +
                            "VALUES ('search term', 1, '127.0.0.1')";
                        command.ExecuteNonQuery();

                        // Read data back from Hadoop and print every column value.
                        command.CommandText = "SELECT * FROM Searches";
                        using (var reader = command.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                for (var i = 0; i < reader.FieldCount; i++)
                                {
                                    Console.WriteLine(reader[i]);
                                }
                            }
                        }
                    }
                    catch (OdbcException ex)
                    {
                        Console.WriteLine(ex.Message);
                        throw;
                    }
                    finally
                    {
                        // Drop the table, but only if the connection was opened;
                        // otherwise this call would throw and hide the original error.
                        if (connection.State == ConnectionState.Open)
                        {
                            command.CommandText = "DROP TABLE IF EXISTS Searches";
                            command.ExecuteNonQuery();
                        }
                    }
                }
            }
        }
    }
}
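The read loop in the listing prints bare values, which gets hard to follow once a table has more than a couple of columns. As a small variation, the sketch below pairs each value with its column name using GetName. ReaderPrinter and FormatRows are hypothetical names of mine; the method takes any IDataReader, so the OdbcDataReader returned by command.ExecuteReader() can be passed in directly. Which column names Hive actually reports depends on the driver and server settings, so treat the exact output as an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper (not part of ADO.NET): formats each row of a reader
// as a line of "column=value" pairs.
public static class ReaderPrinter
{
    public static List<string> FormatRows(IDataReader reader)
    {
        var lines = new List<string>();
        while (reader.Read())
        {
            var fields = new string[reader.FieldCount];
            for (var i = 0; i < reader.FieldCount; i++)
            {
                // NULL columns come back as DBNull, so print them explicitly.
                var value = reader.IsDBNull(i)
                    ? "NULL"
                    : Convert.ToString(reader.GetValue(i));
                fields[i] = reader.GetName(i) + "=" + value;
            }
            lines.Add(string.Join(", ", fields));
        }
        return lines;
    }
}
```

Inside the using block for the reader, the nested loops could then be replaced with foreach (var line in ReaderPrinter.FormatRows(reader)) Console.WriteLine(line);. Because the helper depends only on the IDataReader interface, it can also be tried out without a Hive instance, for example against DataTable.CreateDataReader().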
Thank you.