Connect Python to HCatalog on HDP Sandbox

One of the Hadoop automation components I’m currently working on needs¬†metadata from HCatalog, which is stored in MySQL (at least, that’s how it’s configured in my environment). Now, call me boring, but I prefer to do local dev instead of developing in production. ūüôā To this end, I’ve already installed Hadoop via homebrew. However, I had to spend a ridiculous amount of time configuring my local dev environment to emulate our production stack, and I didn’t want to have my team waste devote the same amount of time in order to use¬†this new tool for local dev. The solution?¬†Hortonworks Sandbox VM.¬†The Hortonworks Sandbox VM provides an easy way to get up and running with a local Hadoop dev environment. Granted, it may not perfectly emulate your production environment, but it dramatically narrows the gap. There are other Hadoop VM offerings out there, but I went with the Hortonworks VM because it was already installed and I’m lazy efficient like that.

So I’m going to pick up this post assuming you already have the Hortonworks Sandbox VM installed and running locally, but before we begin, here’s some versioning info for clarity and posterity:

  • HDP 2.3.1
  • VirtualBox 5.0.0
  • Python 2.7.10

Configuring the HDP Sandbox

The first thing we’ll need to do is¬†configure the VM to permit¬†traffic on the MySQL port (3306) between your laptop and the VM.

Click on Settings for the HDP VM

VirtualBox VM Console

 

Navigate to the Network tab and click on Port Forwarding… to edit its rules

VM Settings

 

Add a new port forwarding rule. In my app, this is a small green + icon in the upper-right-hand corner of the window.

Add a new port forwarding rule

 

Configure forwarding for port 3306 (MySQL’s default port)

  • Name = MySQL
  • Protocol = TCP
  • Host IP = your host IP (mine is 127.0.0.1)
  • Guest IP = your guest IP (mine is blank)
  • Guest Port = 3306

Add rules for MySQL port forwarding

Connections on port 3306 should now forward to the VM.

 

Configuring MySQL

Next, we’ll need to connect to MySQL to add a new app user. If you haven’t already, spin up your HDP VM. Once it’s running, it’ll give you a command to SSH into the VM. For me, this is:

Note: the default root password is hadoop (as of the time of this writing)

HDP VM Start-Up Window

 

Open a MySQL prompt and create a new user for your application:

SSH into the VM and open MySQL

 

Create a new login and grant access:

Note: This is just for example purposes; change the access as appropriate for your application

MySQL prompt

 

Now your app¬†can log into MySQL locally. BUT…¬†your app can’t connect from your laptop yet. Conceptually, think of connecting to¬†a VM as connecting to another¬†server. This means “localhost” won’t be sufficient; you’ll need to grant access to your IP (or %, if you live dangerously like that).

Probably the easiest way to find out your IP is by trying to connect to MySQL from your laptop. Open a new Terminal window and run:

Local IP Address

 

Grab the IP address specified in the error message. Now we’re going to switch back to the¬†MySQL prompt from our SSH connection to the VM:

Note: replace 10.0.2.2 with your own IP address ūüôā

Now return to your local Terminal window:

MySQL Prompt

 

This should now work and provide you with a MySQL prompt. Go ahead and poke around. ūüôā

Note: As of the time of this writing, the default HCatalog metastore is in the hive database.

HCatalog Queries

You should now be ready to connect to HCatalog from Python!

Example Python Script

As¬†an example, here’s a Python script that outputs a list of databases and tables¬†in HCatalog.¬†You can also find this script on my GitHub repo if you prefer or have copy/paste issues.

 

Example script output:

Example Output

Leave a Reply

Your email address will not be published. Required fields are marked *