Connect Python to HCatalog on HDP Sandbox

One of the Hadoop automation components I’m currently working on needs metadata from HCatalog, which is stored in MySQL (at least, that’s how it’s configured in my environment). Now, call me boring, but I prefer to do local dev instead of developing in production. 🙂 To this end, I’ve already installed Hadoop via Homebrew. However, I had to spend a ridiculous amount of time configuring my local dev environment to emulate our production stack, and I didn’t want my team to devote the same amount of time just to use this new tool for local dev. The solution? The Hortonworks Sandbox VM.

The Hortonworks Sandbox VM provides an easy way to get up and running with a local Hadoop dev environment. Granted, it may not perfectly emulate your production environment, but it dramatically narrows the gap. There are other Hadoop VM offerings out there, but I went with the Hortonworks VM because it was already installed, and I’m efficient like that.

So I’m going to pick up this post assuming you already have the Hortonworks Sandbox VM installed and running locally, but before we begin, here’s some versioning info for clarity and posterity:

  • HDP 2.3.1
  • VirtualBox 5.0.0
  • Python 2.7.10

Configuring the HDP Sandbox

The first thing we’ll need to do is configure the VM to permit traffic on the MySQL port (3306) between your laptop and the VM.

Click on Settings for the HDP VM

VirtualBox VM Console

 

Navigate to the Network tab and click on Port Forwarding… to edit its rules

VM Settings

 

Add a new port forwarding rule. In my version of VirtualBox, this is a small green + icon in the upper right-hand corner of the window.

Add a new port forwarding rule

 

Configure forwarding for port 3306 (MySQL’s default port)

  • Name = MySQL
  • Protocol = TCP
  • Host IP = your host IP (mine is 127.0.0.1)
  • Host Port = 3306
  • Guest IP = your guest IP (mine is blank)
  • Guest Port = 3306

Add rules for MySQL port forwarding

Connections on port 3306 should now forward to the VM.
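If you’d rather script this than click through the UI, VBoxManage can add the same NAT rule while the VM is powered off. The VM name below is a placeholder, so use whatever your sandbox is called in VirtualBox:

    # Add the same NAT port-forwarding rule from the command line
    # (VM must be powered off; replace the VM name with yours)
    VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "MySQL,tcp,127.0.0.1,3306,,3306"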

 

Configuring MySQL

Next, we’ll need to connect to MySQL to add a new app user. If you haven’t already, spin up your HDP VM. Once it’s running, it’ll give you a command to SSH into the VM. For me, this is:

Note: the default root password is hadoop (as of the time of this writing)

HDP VM Start-Up Window
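The exact command is printed on the VM’s console window. With the sandbox’s default NAT setup it usually looks something like the sketch below, but the port is an assumption, so use whatever your VM displays:

    # SSH into the sandbox VM (the SSH port is typically forwarded to 2222)
    ssh root@127.0.0.1 -p 2222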

 

Open a MySQL prompt and create a new user for your application:

SSH into the VM and open MySQL
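Once you’re on the VM, opening the prompt is just a matter of running the MySQL client as root. Depending on your sandbox version, the MySQL root account may have no password (or a default one), so adjust accordingly:

    # Open a MySQL prompt on the VM
    mysql -u root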

 

Create a new login and grant access:

Note: This is just for example purposes; change the access as appropriate for your application

MySQL prompt
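Here’s a sketch of what that could look like. The login name, password, and privileges are placeholders (see the note above), so tailor them to your app:

    -- Create an app login and give it read access to the metastore database
    CREATE USER 'hcat_app'@'localhost' IDENTIFIED BY 'SuperSecretPassword';
    GRANT SELECT ON hive.* TO 'hcat_app'@'localhost';
    FLUSH PRIVILEGES;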

 

Now your app can log into MySQL locally. BUT… your app can’t connect from your laptop yet. Conceptually, think of connecting to a VM as connecting to another server. This means “localhost” won’t be sufficient; you’ll need to grant access to your IP (or %, if you live dangerously like that).

Probably the easiest way to find out your IP is by trying to connect to MySQL from your laptop. Open a new Terminal window and run:

Local IP Address
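Something along these lines, assuming the placeholder login from earlier. The connection will fail, but the access-denied error tells you which IP the VM sees you connecting from:

    # Attempt to connect from your laptop; expect an access-denied error
    # whose message includes the IP the VM sees you coming from
    mysql -h 127.0.0.1 -P 3306 -u hcat_app -p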

 

Grab the IP address specified in the error message. Now we’re going to switch back to the MySQL prompt from our SSH connection to the VM:

Note: replace 10.0.2.2 with your own IP address 🙂
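Using the placeholder login from earlier, the grant looks something like this (again, swap in your own IP and credentials):

    -- Allow the app login to connect from your laptop's IP as well
    GRANT SELECT ON hive.* TO 'hcat_app'@'10.0.2.2' IDENTIFIED BY 'SuperSecretPassword';
    FLUSH PRIVILEGES;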

Now return to your local Terminal window:

MySQL Prompt
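Re-run the same connection attempt as before and enter the app login’s password when prompted:

    # Connect from your laptop to MySQL running on the VM
    mysql -h 127.0.0.1 -P 3306 -u hcat_app -p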

 

This should now work and provide you with a MySQL prompt. Go ahead and poke around. 🙂

Note: As of the time of this writing, the default HCatalog metastore is in the hive database.

HCatalog Queries
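For example, a couple of queries against the hive metastore database will list HCatalog’s databases and tables. The DBS and TBLS tables are part of the standard Hive metastore schema, but double-check the names in your version:

    -- Switch to the metastore database
    USE hive;

    -- List HCatalog databases
    SELECT NAME FROM DBS;

    -- List tables along with their parent database
    SELECT d.NAME AS db_name, t.TBL_NAME
    FROM TBLS t
    JOIN DBS d ON d.DB_ID = t.DB_ID
    ORDER BY d.NAME, t.TBL_NAME;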

You should now be ready to connect to HCatalog from Python!

Example Python Script

As an example, here’s a Python script that outputs a list of databases and tables in HCatalog. You can also find this script on my GitHub repo if you prefer or have copy/paste issues.
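The GitHub version is the canonical one; as a rough sketch of the idea, assuming the placeholder app login created above and the pymysql driver (pip install pymysql), it looks something like this:

    # List HCatalog databases and tables by querying the Hive metastore in MySQL
    import pymysql

    # Host/port come from the VirtualBox port-forwarding rule;
    # the user/password are the placeholder app login created earlier
    conn = pymysql.connect(host='127.0.0.1', port=3306,
                           user='hcat_app', password='SuperSecretPassword',
                           database='hive')

    try:
        cursor = conn.cursor()
        cursor.execute("""
            SELECT d.NAME, t.TBL_NAME
            FROM DBS d
            LEFT JOIN TBLS t ON t.DB_ID = d.DB_ID
            ORDER BY d.NAME, t.TBL_NAME
        """)

        # Print each table as database.table; databases with no tables
        # come back with a NULL table name from the LEFT JOIN
        for db_name, tbl_name in cursor.fetchall():
            print('{0}.{1}'.format(db_name, tbl_name or '(no tables)'))
    finally:
        conn.close()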

 

Example script output:

Example Output


Example HiveRC File to Configure Hive Preferences

Anyone who regularly works with query languages invariably develops personal preferences. For T-SQL, it may be a preference for which metadata is included with the results of a query, how quoted identifiers are handled, or what the default behavior should be for transactions. These types of settings can typically be configured at a session level, and Hive is no exception in allowing this. In fact, Hive provides users with an impressive number of configurable session properties. Honestly, you’ll probably never need to change the majority of these settings, and if/when you do, it’ll most likely apply to a specific Hive script (e.g., to improve performance).

However, there are a handful of Hive settings that you may wish to always enable if they’re not already defaulted server-wide, such as displaying column headers. One option is to set these manually at the start of each session using the SET command, but this can quickly get tedious if you have more than one or two settings to change. A better option in those scenarios, and the topic of this blog post, is to use a HiveRC file to configure your personal preferences for Hive’s default behavior.

For those of you not familiar with the concept, Linux commonly uses RC files — which I believe stands for “runtime configuration,” but don’t quote me on that 🙂 — for defining preferences, and various applications support these, typically in the format of .<app>rc. These will usually live in a user’s home directory, and some examples include .bashrc, .pythonrc, and .hiverc.

Now that we have a little context, let’s walk through how to create your personal .hiverc file. Note that all of these steps take place on the same server you use for connecting to Hive.
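A sketch of the first step, assuming vim (any text editor will do):

    # From your home directory on the server, create/open the file
    cd ~
    vi .hiverc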

 

Now, from inside Vim, do the following:
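Press i to enter insert mode, add whichever settings you want as your defaults, then hit Esc and type :wq to save and quit. For example, a .hiverc that turns on column headers and shows the current database in the prompt would look like this:

    -- Display column headers in query results
    set hive.cli.print.header=true;

    -- Show the current database name in the hive> prompt
    set hive.cli.print.current.db=true;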

You should be back at your bash prompt. Now run these commands to verify everything is working as expected.
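One quick sanity check, assuming you added the column-header setting above: open the Hive CLI and echo the property back; it should reflect the value from your .hiverc.

    # Open the Hive CLI
    hive

    -- At the hive> prompt, echo the setting; it should come back as "true"
    SET hive.cli.print.header;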

That’s all there is to it! Not too hard, huh? But to make things even easier, I’ve posted an example of my personal HiveRC file on my Hadoopsie GitHub repo.

hiverc

That’s all for now, you awesome nerds. 🙂
