Connect Python to HCatalog on HDP Sandbox

One of the Hadoop automation components I’m currently working on needs metadata from HCatalog, which is stored in MySQL (at least, that’s how it’s configured in my environment). Now, call me boring, but I prefer to do local dev instead of developing in production. 🙂 To this end, I’ve already installed Hadoop via Homebrew. However, I had to spend a ridiculous amount of time configuring my local dev environment to emulate our production stack, and I didn’t want my team to devote the same amount of time just to use this new tool for local dev. The solution? The Hortonworks Sandbox VM, which provides an easy way to get up and running with a local Hadoop dev environment. Granted, it may not perfectly emulate your production environment, but it dramatically narrows the gap. There are other Hadoop VM offerings out there, but I went with the Hortonworks VM because it was already installed, and I’m lazy (read: efficient) like that.

So I’m going to pick up this post assuming you already have the Hortonworks Sandbox VM installed and running locally, but before we begin, here’s some versioning info for clarity and posterity:

  • HDP 2.3.1
  • VirtualBox 5.0.0
  • Python 2.7.10

Configuring the HDP Sandbox

The first thing we’ll need to do is configure the VM to permit traffic on the MySQL port (3306) between your laptop and the VM.

Click on Settings for the HDP VM

VirtualBox VM Console

 

Navigate to the Network tab and click on Port Forwarding… to edit its rules

VM Settings

 

Add a new port forwarding rule. In my version of VirtualBox, this is a small green + icon in the upper-right-hand corner of the window.

Add a new port forwarding rule

 

Configure forwarding for port 3306 (MySQL’s default port)

  • Name = MySQL
  • Protocol = TCP
  • Host IP = your host IP (mine is 127.0.0.1)
  • Host Port = 3306
  • Guest IP = your guest IP (mine is blank)
  • Guest Port = 3306

Add rules for MySQL port forwarding

Connections on port 3306 should now forward to the VM.
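If you want a quick sanity check before moving on, you can probe the forwarded port from your laptop with netcat once the VM is running:

```bash
# Should report success once the VM is up and MySQL is listening
nc -z 127.0.0.1 3306
```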

 

Configuring MySQL

Next, we’ll need to connect to MySQL to add a new app user. If you haven’t already, spin up your HDP VM. Once it’s running, it’ll give you a command to SSH into the VM. For me, this is:

Note: the default root password is hadoop (as of the time of this writing)
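On a default VirtualBox NAT setup, the SSH command shown in the start-up window typically looks like this:

```bash
# Enter the root password when prompted
ssh root@127.0.0.1 -p 2222
```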

HDP VM Start-Up Window

 

Open a MySQL prompt and create a new user for your application:

SSH into the VM and open MySQL
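On the sandbox, opening the prompt is typically as simple as:

```bash
# From inside the VM; add -p if your MySQL root account requires a password
mysql -u root
```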

 

Create a new login and grant access:

Note: This is just for example purposes; change the access as appropriate for your application

MySQL prompt
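Here’s a minimal example of what I mean; hcat_app and hcat_pass are just placeholder credentials for this walkthrough, and the grant is scoped to read-only access on the hive database:

```sql
-- Placeholder credentials; adjust the scope for your application
CREATE USER 'hcat_app'@'localhost' IDENTIFIED BY 'hcat_pass';
GRANT SELECT ON hive.* TO 'hcat_app'@'localhost';
FLUSH PRIVILEGES;
```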

 

Now your app can log into MySQL locally. BUT… your app can’t connect from your laptop yet. Conceptually, think of connecting to a VM as connecting to another server. This means “localhost” won’t be sufficient; you’ll need to grant access to your IP (or %, if you live dangerously like that).

Probably the easiest way to find out your IP is by trying to connect to MySQL from your laptop. Open a new Terminal window and run:

Local IP Address
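Using the placeholder credentials from above, the attempt should fail with an error that looks something like this; note the IP address in the message:

```bash
mysql -h 127.0.0.1 -P 3306 -u hcat_app -p
# ERROR 1045 (28000): Access denied for user 'hcat_app'@'10.0.2.2' (using password: YES)
```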

 

Grab the IP address specified in the error message. Now we’re going to switch back to the MySQL prompt from our SSH connection to the VM:

Note: replace 10.0.2.2 with your own IP address 🙂
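Using the same placeholder credentials, grant access from that IP:

```sql
-- Replace 10.0.2.2 with the IP from your error message
CREATE USER 'hcat_app'@'10.0.2.2' IDENTIFIED BY 'hcat_pass';
GRANT SELECT ON hive.* TO 'hcat_app'@'10.0.2.2';
FLUSH PRIVILEGES;
```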

Now return to your local Terminal window:

MySQL Prompt
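Retry the same connection command:

```bash
mysql -h 127.0.0.1 -P 3306 -u hcat_app -p
```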

 

This should now work and provide you with a MySQL prompt. Go ahead and poke around. 🙂

Note: As of the time of this writing, the default HCatalog metastore is in the hive database.

HCatalog Queries
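For example, here are a few starter queries; DBS and TBLS are the metastore tables that back HCatalog’s database and table lists:

```sql
USE hive;
SHOW TABLES;                  -- lists all of the metastore tables
SELECT NAME FROM DBS;         -- HCatalog databases
SELECT TBL_NAME FROM TBLS;    -- HCatalog tables
```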

You should now be ready to connect to HCatalog from Python!

Example Python Script

As an example, here’s a Python script that outputs a list of databases and tables in HCatalog. You can also find this script on my GitHub repo if you prefer or have copy/paste issues.
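Here’s a minimal version of the script, assuming the pymysql package (pip install pymysql) and the placeholder hcat_app credentials created earlier:

```python
#!/usr/bin/env python
# Lists HCatalog databases and tables by querying the Hive metastore in MySQL.
# Assumes pymysql and the placeholder hcat_app credentials from this walkthrough.
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306,
                       user='hcat_app', password='hcat_pass', db='hive')
try:
    cursor = conn.cursor()
    # DBS holds one row per database; TBLS holds one row per table
    cursor.execute("""
        SELECT d.NAME, t.TBL_NAME
        FROM DBS d
        LEFT JOIN TBLS t ON t.DB_ID = d.DB_ID
        ORDER BY d.NAME, t.TBL_NAME
    """)
    current_db = None
    for db_name, tbl_name in cursor.fetchall():
        if db_name != current_db:
            print 'Database: %s' % db_name
            current_db = db_name
        if tbl_name:
            print '\t%s' % tbl_name
finally:
    conn.close()
```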

 

Example script output:

Example Output


Example HiveRC File to Configure Hive Preferences

Anyone who regularly works with query languages invariably develops personal preferences. For T-SQL, it may be a preference for which metadata is included with the results of a query, how quoted identifiers are handled, or what the default behavior should be for transactions. These types of settings can typically be configured at a session level, and Hive is no exception in allowing this. In fact, Hive provides users with an impressive number of configurable session properties. Honestly, you’ll probably never need to change the majority of these settings, and if/when you do, it’ll most likely apply to a specific Hive script (e.g., to improve performance). However, there are a handful of Hive settings that you may wish to always enable if they’re not already defaulted server-wide, such as displaying column headers. One option is to set these manually at the start of each session, using the SET command. But this can quickly get tedious if you have more than one or two settings to change. A better option in those scenarios, and the topic of this blog post, is to use a HiveRC file to configure your personal preferences for Hive’s default behavior.
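For instance, enabling column headers and the current-database prompt manually means typing this at the start of every session:

```sql
SET hive.cli.print.header=true;
SET hive.cli.print.current.db=true;
```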

For those of you not familiar with the concept, Linux commonly uses RC files — which I believe stands for “runtime configuration,” but don’t quote me on that 🙂 — for defining preferences, and various applications support these, typically in the format of .<app>rc. These will usually live in a user’s home directory, and some examples include .bashrc, .pythonrc, and .hiverc.

Now that we have a little context, let’s walk through how to create your personal .hiverc file. Note that all of these steps take place on the same server you use for connecting to Hive.
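First, create (or open) the file in your home directory; I’m using Vim here, but any editor will do:

```bash
vim ~/.hiverc
```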

 

Now, from inside Vim, do the following:
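Using the two example settings from above, the steps amount to something like this:

```
# press i to enter insert mode, then add your settings one per line, e.g.:
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
# press Esc, then type :wq to save and quit
```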

You should be back at your bash prompt. Now run these commands to verify everything is working as expected.
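With the two example settings above, a quick check looks something like this (your database list will vary):

```
$ cat ~/.hiverc
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;

$ hive
hive (default)> show databases;
database_name
default
```

Notice that the prompt now shows the current database and the results include a column header; both come from the .hiverc settings.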

That’s all there is to it! Not too hard, huh? But to make things even easier, I’ve posted an example of my personal HiveRC file on my Hadoopsie GitHub repo.


That’s all for now, you awesome nerds. 🙂


My Mac Setup for Hadoop Development

I’ve recently decided to switch to a Mac. Having been such a proponent of all-things-Microsoft in the past, and having invested so much time in my dev skills using a PC, this was a pretty huge move for me. In fact, it took me a very long time to make the decision. But the more time I spent trying to figure out how to do Hadoop dev better, and faster, the clearer it became to me that switching to a Mac would help with these things. After only a few weeks, I’ve already found that many of the things that were very painful on a PC are exceedingly easy on a Mac, such as installing Hadoop locally.

Now, this post isn’t to convince you to switch to a Mac. A person’s OS preference is very personal, and as such, discussions can get almost as heated as religious and political discussions. 🙂 However, for those who are already considering switching to a Mac from a PC, I thought it’d be helpful to outline some of the applications I’ve installed that have improved my Hadoop dev experience.

Applications

HomeBrew
  • What is it? Homebrew installs the stuff you need that Apple didn’t.
  • Why do I use it? Makes local app installs super easy. I used this to install Maven, MySQL, Python, Hadoop, Pig, Spark, & much more.
  • Where to get it? http://brew.sh/

iTerm2
  • What is it? iTerm2 is a replacement for Terminal and the successor to iTerm.
  • Why do I use it? For connecting to Hadoop via SSH. It provides some nice features, such as tabs and status colors, which make it easier to keep track of numerous simultaneous activities in Hadoop.
  • Where to get it? https://www.iterm2.com/

IntelliJ IDEA
  • What is it? The Community Edition is an excellent free IDE.
  • Why do I use it? For development of Pig, Hive, Spark, & Python scripts.
  • Where to get it? https://www.jetbrains.com/idea/

0xDBE (EAP)
  • What is it? A new intelligent IDE for DBAs and SQL developers.
  • Why do I use it? For SQL Server & MySQL development. (And yes, I *do* miss SSMS, but I don’t want to have to run a VM to use it.)
  • Where to get it? https://www.jetbrains.com/dbe/

My config

IntelliJ Plugins

  • Apache Pig Language Plugin
  • Python Community Edition
  • etc

iTerm2

Bash Profile

I also added the following code to my local & Hadoop .bashrc profiles. It changes the title of a Bash window. This isn’t specific to iTerm2, and I could have done this on my PC if I had known about it at the time. So if you are using either Terminal or a PC SSH client (e.g., PuTTY, PowerShell), you may still be able to take advantage of this if your client displays window titles.
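The gist is a small Bash function that emits the xterm title escape sequence. Here’s a minimal sketch (set_title is just my name for it):

```bash
# Set the terminal window/tab title to the supplied text
set_title() {
    echo -ne "\033]0;${1}\007"
}
```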


This is an example of how you would call the code at the start of any new Bash session:
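Something like this, naming the tab for what it’s doing (set_title being the sketch function from above):

```bash
set_title "Hive - prod"
```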


My Dev Process
I have 3 monitors set up, which are typically configured as:

Monitor 1

  • email
  • calendar
  • web browser (Slack, JIRA, etc.)

Monitor 2

  • IntelliJ

Monitor 3

  • iTerm2, with tabs already open for
    • Pig
    • Hive
    • Hadoop Bash (HDFS, log files, etc.)
    • misc (Python, Spark, etc.)
    • local Bash (GitHub commits, etc.)

In general, I write the code in IntelliJ and copy/paste it into iTerm2. This provides nice syntax highlighting and makes it easy to check my code into GitHub when I’m done. Once I’m past the initial dev phase, I SCP the actual scripts over to the prod Hadoop shell box for scheduling. Overall, I’ve found that this approach makes iterative dev much faster.

Those are pretty much the highlights, though I’ll continue to add to this as I stumble across tweaks, hacks, and apps that make my life easier.

For those just starting out on a Mac, I hope you’ve found this post helpful for getting up and running with Hadoop dev. For those who have already made the switch — or who have always used a Mac — did I miss something? Is there a killer app that you love for Hadoop dev? If so, please let me know! 🙂
