Connect Python to HCatalog on HDP Sandbox

One of the Hadoop automation components I’m currently working on needs metadata from HCatalog, which is stored in MySQL (at least, that’s how it’s configured in my environment). Now, call me boring, but I prefer to do local dev instead of developing in production. 🙂 To this end, I’ve already installed Hadoop via Homebrew. However, I had to spend a ridiculous amount of time configuring my local dev environment to emulate our production stack, and I didn’t want my team to have to devote the same amount of time just to use this new tool for local dev. The solution? The Hortonworks Sandbox VM, which provides an easy way to get up and running with a local Hadoop dev environment. Granted, it may not perfectly emulate your production environment, but it dramatically narrows the gap. There are other Hadoop VM offerings out there, but I went with the Hortonworks VM because it was already installed and I’m efficient like that.

So I’m going to pick up this post assuming you already have the Hortonworks Sandbox VM installed and running locally, but before we begin, here’s some versioning info for clarity and posterity:

  • HDP 2.3.1
  • VirtualBox 5.0.0
  • Python 2.7.10

Configuring the HDP Sandbox

The first thing we’ll need to do is configure the VM to permit traffic on the MySQL port (3306) between your laptop and the VM.

Click on Settings for the HDP VM

[Screenshot: VirtualBox VM console]

Navigate to the Network tab and click on Port Forwarding… to edit its rules

[Screenshot: VM Settings window]

Add a new port forwarding rule. In my version of VirtualBox, this is a small green + icon in the upper-right corner of the window.

[Screenshot: Add a new port forwarding rule]

Configure forwarding for port 3306 (MySQL’s default port)

  • Name = MySQL
  • Protocol = TCP
  • Host IP = your host IP (mine is 127.0.0.1)
  • Host Port = 3306
  • Guest IP = your guest IP (mine is blank)
  • Guest Port = 3306

[Screenshot: MySQL port forwarding rule]

Connections on port 3306 should now forward to the VM.

 

Configuring MySQL

Next, we’ll need to connect to MySQL to add a new app user. If you haven’t already, spin up your HDP VM. Once it’s running, it’ll give you a command to SSH into the VM. For me, this is:

Note: the default root password is hadoop (as of the time of this writing)
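With the sandbox’s default VirtualBox networking, the SSH command shown on the start-up screen typically looks something like the following; use whatever host and port your VM actually displays:

    ssh root@127.0.0.1 -p 2222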

[Screenshot: HDP VM start-up window]

Open a MySQL prompt and create a new user for your application:
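Once you’re SSH’d in, opening the prompt is just the standard MySQL client call (enter the MySQL root password when prompted):

    # from inside the VM
    mysql -u root -p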

[Screenshot: SSH into the VM and open MySQL]

Create a new login and grant access:

Note: This is just for example purposes; change the access as appropriate for your application
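As a rough sketch, the statements look something like the following. The user name hcat_app and its password are placeholders, and the grant below is read-only access to the hive metastore database; adjust all of it for your application:

    -- placeholder user/password; scope the privileges to whatever your app needs
    CREATE USER 'hcat_app'@'localhost' IDENTIFIED BY 'SuperSecretPassword';
    GRANT SELECT ON hive.* TO 'hcat_app'@'localhost';
    FLUSH PRIVILEGES;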

[Screenshot: MySQL prompt]

Now your app can log into MySQL locally. BUT… your app can’t connect from your laptop yet. Conceptually, think of connecting to a VM as connecting to another server. This means “localhost” won’t be sufficient; you’ll need to grant access to your IP (or %, if you live dangerously like that).

Probably the easiest way to find out your IP is by trying to connect to MySQL from your laptop. Open a new Terminal window and run:
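For example, attempting to connect through the forwarded port with the placeholder user from above (which has no grant for your laptop’s IP yet) should fail with an “Access denied for user …” error that conveniently includes the IP the VM sees you connecting from; with VirtualBox NAT networking, that’s often 10.0.2.2:

    mysql -h 127.0.0.1 -P 3306 -u hcat_app -p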

[Screenshot: MySQL error message showing the local IP address]

Grab the IP address specified in the error message. Now we’re going to switch back to the MySQL prompt from our SSH connection to the VM:

Note: replace 10.0.2.2 with your own IP address 🙂
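A sketch of the grant, reusing the placeholder user from earlier (with the MySQL versions that ship on the sandbox, GRANT … IDENTIFIED BY should create the remote user if it doesn’t already exist):

    GRANT SELECT ON hive.* TO 'hcat_app'@'10.0.2.2' IDENTIFIED BY 'SuperSecretPassword';
    FLUSH PRIVILEGES;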

Now return to your local Terminal window:
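With the grant in place, the same connection attempt from your laptop should now succeed:

    mysql -h 127.0.0.1 -P 3306 -u hcat_app -p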

[Screenshot: MySQL prompt from the local Terminal]

This should now work and provide you with a MySQL prompt. Go ahead and poke around. 🙂

Note: As of the time of this writing, the default HCatalog metastore is in the hive database.
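For example, here are a couple of quick sanity checks against the metastore tables (the table and column names below follow the standard Hive metastore schema, e.g. DBS and TBLS):

    USE hive;

    -- list the databases HCatalog knows about
    SELECT NAME FROM DBS;

    -- list tables along with the database each belongs to
    SELECT d.NAME AS db_name, t.TBL_NAME
    FROM TBLS t
    JOIN DBS d ON d.DB_ID = t.DB_ID
    ORDER BY d.NAME, t.TBL_NAME;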

[Screenshot: example queries against the HCatalog metastore]

You should now be ready to connect to HCatalog from Python!

Example Python Script

As an example, here’s a Python script that outputs a list of databases and tables in HCatalog. You can also find this script on my GitHub repo if you prefer or have copy/paste issues.
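A minimal sketch of what that script can look like, assuming the MySQLdb driver and the same placeholder credentials used above (the version on GitHub is the authoritative one):

    #!/usr/bin/env python
    """List HCatalog databases and tables by querying the Hive metastore in MySQL."""
    import MySQLdb  # assumes the MySQL-python package is installed

    # placeholder connection settings -- adjust for your environment
    conn = MySQLdb.connect(host='127.0.0.1', port=3306,
                           user='hcat_app', passwd='SuperSecretPassword', db='hive')
    try:
        cursor = conn.cursor()
        cursor.execute("""
            SELECT d.NAME, t.TBL_NAME
            FROM DBS d
            LEFT JOIN TBLS t ON t.DB_ID = d.DB_ID
            ORDER BY d.NAME, t.TBL_NAME
        """)
        for db_name, tbl_name in cursor.fetchall():
            print '%s.%s' % (db_name, tbl_name or '(no tables)')
    finally:
        conn.close()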

 

Example script output:

[Screenshot: example script output]


Example HiveRC File to Configure Hive Preferences

Anyone who regularly works with query languages invariably develops personal preferences. For T-SQL, it may be a preference for which metadata is included with the results of a query, how quoted identifiers are handled, or what the default behavior should be for transactions. These types of settings can typically be configured at a session level, and Hive is no exception in allowing this. In fact, Hive provides users with an impressive number of configurable session properties. Honestly, you’ll probably never need to change the majority of these settings, and if/when you do, it’ll most likely apply to a specific Hive script (e.g., to improve performance). However, there are a handful of Hive settings that you may wish to always enable if they’re not already defaulted server-wide, such as displaying column headers. One option is to set these manually at the start of each session using the SET command, but this can quickly get tedious if you have more than one or two settings to change. A better option in those scenarios, and the topic of this blog post, is to use a HiveRC file to configure your personal preferences for Hive’s default behavior.

For those of you not familiar with the concept, Linux commonly uses RC files — which I believe stands for “runtime configuration,” but don’t quote me on that 🙂 — for defining preferences, and various applications support these, typically in the format of .<app>rc. These will usually live in a user’s home directory, and some examples include .bashrc, .pythonrc, and .hiverc.

Now that we have a little context, let’s walk through how to create your personal .hiverc file. Note that all of these steps take place on the same server you use for connecting to Hive.

 
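A sketch of the first step: create the file in your home directory (I’m assuming Vim here, but any editor works):

    cd ~
    vim .hiverc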

Now, from inside Vim, do the following:
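In Vim: press i to enter insert mode, add your preferred settings (one SET command per line), then press Esc and type :wq to save and quit. As an example, here are two session properties I find broadly useful; your list will differ:

    -- always display column headers with query results
    set hive.cli.print.header=true;
    -- include the current database in the Hive CLI prompt
    set hive.cli.print.current.db=true;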

You should be back at your bash prompt. Now run these commands to verify everything is working as expected.
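A simple check is just to open a new Hive CLI session; the settings in ~/.hiverc are applied automatically. With the example properties above, you’d expect something like:

    # start a new Hive CLI session; ~/.hiverc is read at start-up
    hive
    # the prompt should now include the current database, e.g. "hive (default)>",
    # and query results should include column headers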

That’s all there is to it! Not too hard, huh? But to make things even easier, I’ve posted an example of my personal HiveRC file on my Hadoopsie GitHub repo.

[Embedded gist: hiverc]

That’s all for now, you awesome nerds. 🙂


Hadoop Summit 2015 Sessions

Recordings from Hadoop Summit 2015 sessions are now available! The conference organizers have made the session content (recordings and slides) freely available online.

To be clear, this content isn’t just available to conference attendees; this is freely available to anyone who’s interested in it. So take a few minutes to learn about what’s new in the Hadoop community and what the tech giants are doing with Hadoop.

Shameless Plug: if you’re wondering what the frilly heck to do with all your data, check out my session on Data Warehousing in Hadoop 🙂


List of Free Hadoop Resources

If you’ve read any job trend reports lately, it’s hard to miss the growth in Big Data jobs and the demand for the associated technical and analytical skills. Employers are looking for new hires with Hadoop-related skills — which includes everything from cluster administration to data analysis — but because the field is so new, it’s rare to find someone with more than 1-2 years of experience outside of the Bay Area. What this amounts to is a great opportunity for anyone looking to grow their role, experience, and/or salary. Best of all, because this ecosphere is based on open-source technology, these skills are largely obtainable by anyone with a passion for learning and access to a computer.

To help go-getters get up to speed, I’ve just launched a new page on this blog: Free Hadoop Resources. This is the very beginning of a list I’m compiling of great, free resources for learning Hadoop fundamentals, Hive, Pig, and Spark.

If I missed a great resource, please let me know and I’ll get it added. 🙂


My Mac Setup for Hadoop Development

I’ve recently decided to switch to a Mac. Having been such a proponent of all-things-Microsoft in the past, and having invested so much time in my dev skills using a PC, this was a pretty huge move for me. In fact, it took me a very long time to make the decision. But the more time I spent trying to figure out how to do Hadoop dev better, and faster, the clearer it became to me that switching to a Mac would help with these things. After only a few weeks, I’ve already found that many of the things that were very painful on a PC are exceedingly easy on a Mac, such as installing Hadoop locally.

Now, this post isn’t to convince you to switch to a Mac. A person’s OS preference is very personal, and as such, discussions can get almost as heated as religious and political discussions. 🙂 However, for those who are already considering switching to a Mac from a PC, I thought it’d be helpful to outline some of the applications I’ve installed that have improved my Hadoop dev experience.

Applications

  • Homebrew (http://brew.sh/)
    What is it? “Homebrew installs the stuff you need that Apple didn’t.”
    Why do I use it? It makes local app installs super easy. I used it to install Maven, MySQL, Python, Hadoop, Pig, Spark, and much more.
  • iTerm2 (https://www.iterm2.com/)
    What is it? A replacement for Terminal and the successor to iTerm.
    Why do I use it? For connecting to Hadoop via SSH. It provides some nice features, such as tabs and status colors, which make it easier to keep track of numerous simultaneous activities in Hadoop.
  • IntelliJ IDEA (https://www.jetbrains.com/idea/)
    What is it? The Community Edition is an excellent free IDE.
    Why do I use it? For development of Pig, Hive, Spark, and Python scripts.
  • 0xDBE (EAP) (https://www.jetbrains.com/dbe/)
    What is it? A new intelligent IDE for DBAs and SQL developers.
    Why do I use it? For SQL Server and MySQL development. (And yes, I *do* miss SSMS, but I don’t want to have to run a VM to use it.)

My config

IntelliJ Plugins

  • Apache Pig Language Plugin
  • Python Community Edition
  • etc

iTerm2

Bash Profile

I also added the following code to my local and Hadoop .bashrc profiles. It changes the title of a Bash window. This isn’t specific to iTerm2, and I could have done this on my PC if I had known about it at the time. So if you are using either Terminal or a PC SSH client (e.g. PuTTY, PowerShell), you may still be able to take advantage of this if your client displays window titles.
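A minimal sketch of the idea, using a hypothetical set_title function (the snippet in my actual .bashrc is a bit fancier):

    # set the title of the current terminal window/tab
    # usage: set_title "some descriptive name"
    set_title() {
        echo -ne "\033]0;$1\007"
    }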

 

 

 

This is an example of how you would call the code at the start of any new Bash session:
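For instance (the title text is whatever helps you tell your tabs apart):

    set_title "prod hive"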

[Screenshot: iTerm2 tabs with custom window titles]

My Dev Process
I have 3 monitors set up, which are typically configured as:

Monitor 1

  • email
  • calendar
  • web browser (Slack, JIRA, etc.)

Monitor 2

  • IntelliJ

Monitor 3

  • iTerm2, with tabs already open for
    • Pig
    • Hive
    • Hadoop Bash (HDFS, log files, etc.)
    • misc (Python, Spark, etc.)
    • local Bash (GitHub commits, etc.)

In general, I write the code in IntelliJ and copy/paste it into iTerm2. This provides nice syntax highlighting and makes it easy to check my code into GitHub when I’m done. Once I’m past the initial dev phase, I SCP the actual scripts over to the prod Hadoop shell box for scheduling. Overall, I’ve found that this approach makes iterative dev much faster.
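The copy itself is just standard scp; for example (the host name and paths here are placeholders):

    scp my_etl_script.pig username@prod-hadoop-shell:/home/username/etl/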

Those are pretty much the highlights, though I’ll continue to add to this as I stumble across tweaks, hacks, and apps that make my life easier.

For those just starting out on a Mac, I hope you’ve found this post helpful for getting up and running with Hadoop dev. For those who have already made the switch — or who have always used a Mac — did I miss something? Is there a killer app that you love for Hadoop dev? If so, please let me know! 🙂


Data Warehousing in Hadoop

The slides from my highly-rated Hadoop Summit 2015 session are now available!

Session: Data Warehousing in Hadoop

Session Abstract

How can we take advantage of the veritable treasure trove of data stored in Hadoop to augment our traditional data warehouses? In this session, Michelle will share her experience with migrating GoDaddy’s data warehouse to Hadoop. She’ll explore how GoDaddy has adapted traditional data warehousing methodologies to work with Hadoop and will share example ETL patterns used by her team. Topics will also include how the integration of structured and unstructured data has exposed new insights, the resulting business impact, and tips for making your own Hadoop migration project more successful.

Session Slides: Slideshare - Data Warehousing in Hadoop

Hadoop Summit 2015

I’ll be honest: I didn’t know what to expect from Hadoop Summit. I’ve been surprised at the lack of overlap between the PASS community that I know so well and dearly love, and this new community of open-source aficionados, data hackers, and Ph.D. data scientists. Would this new community be interested in data warehousing, a topic traditionally — and fallaciously, in my opinion 🙂 — associated with all things BI and relational? Combine this with the fact that I’ve never even attended Hadoop Summit before, and well… this was easily the most nervous I’ve been before a presentation since my first major presentation in 2009. However, all my fears were for naught… the session was packed — clearly, folks are interested in this topic! And judging from the quantity of conversations I had with people afterwards — many of whom are from companies you’d readily recognize, too — this is a topic that is only going to grow.

For those who were unable to attend but are interested in this topic, I have good news! The session recording should also be available online within the next couple of weeks. I’ll post the link once it becomes available. 🙂

Lastly, I typically find the conversations I have with session attendees after presentations to be my favorite part of conferences, and this was no exception. Thank you to everyone who attended and reached out to me afterwards! I met some great people, and I regret not doing a better job of exchanging contact information amidst the chaos of the event. If we connected at Hadoop Summit, let’s connect on LinkedIn too. 🙂


Why I prefer Pig for Big Data Warehouse ETL


First, a brief note about this blog. Shortly after I announced this blog, an… event?… was announced, and it seemed prudent to avoid blogging while that event was underway. However, the quiet period is now over and I have several months of blog posts in the queue! So let the blogging commence! (again) 🙂


Last week, I had the pleasure of speaking at Hadoop Summit 2015 on Data Warehousing in Hadoop. There was a lot of interest in this topic… the session was packed, and I received a lot of great questions both during and after the session. One question that kept popping up was why I prefer Pig over Hive for performing Data Warehouse ETL in Hadoop. The question itself wasn’t as surprising as the context it was raised in, i.e. “But I thought Hive was for data warehousing?” These questions were largely from people who were investigating and/or beginning their own data warehouse migration or enrichment project. After a few of these conversations, I came to realize that this was a result of the excellent marketing that Hive has done in billing itself as “data warehouse software.”

Given the confusion, please allow me to clarify my position on this topic: I think Hive and Pig both have a role in a Hadoop data warehouse. The purpose of this post is to explain my opinion 🙂 of the role each technology plays.

I rely on Hive for two primary purposes: definition/exposure of DDL via HCatalog, and ad hoc querying. I can create an awesome data warehouse, but if I don’t expose it in Hive via HCatalog, then data consumers won’t know what’s available to query. Commands such as show databases and show tables wouldn’t return information about the rich and valuable datasets my team produces. So I think it’s actually extremely important to define DDL in Hive as the first step to producing new datasets, e.g.:
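Something along these lines (the table and columns below are made up for illustration):

    CREATE EXTERNAL TABLE IF NOT EXISTS dw.fact_orders (
        order_id        BIGINT,
        customer_id     BIGINT,
        order_ts        TIMESTAMP,
        order_total     DECIMAL(18,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
    LOCATION '/data/dw/fact_orders';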

Also, Hive has done a decent job of ensuring that the core query syntax and functionality from SQL has been ported into Hive. Thus, anyone who has a basic understanding of SQL can easily sit down and start to retrieve data from Hadoop. The importance of this cannot be overstated… quite simply, it has lowered the barrier to entry and has provided analysts with an easier transition from querying legacy DWs to querying Hadoop using HiveQL.

Hive also makes it easy to materialize the results of queries into tables. You can do this either through CTAS (Create-Table-As) statements, which are useful for storing the results of ad hoc queries, or using an INSERT statement. This makes it very easy and natural for someone with a data engineering background in pretty much any enterprise data warehouse project (SQL Server, APS PDW, Teradata, Netezza, Vertica, etc.) to gravitate toward Hive for this type of functionality.
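For instance, both patterns look something like this (again, the table names are made up):

    -- ad hoc: materialize a query result into a new table (CTAS)
    CREATE TABLE tmp_top_customers AS
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM dw.fact_orders
    GROUP BY customer_id;

    -- or load the results into an existing table
    INSERT OVERWRITE TABLE dw.agg_customer_spend
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM dw.fact_orders
    GROUP BY customer_id;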

However, I think leaning on Hive for production ETL is a short-sighted mistake.

Here’s why: when it comes to ETL, my focus is on a robust solution that ensures enterprise-level, production-quality processes that data consumers can rely on and have confidence in. Here are some of the top reasons why I believe Pig fits this role better than Hive:

  1. Hive works very well with structured data, but the whole point of moving our data warehouse to Hadoop is to take advantage of so-called “new data”, also known as unstructured and semi-structured data. Hive does provide support for complex data types, but it can quickly get… well, complex 🙂 when trying to work with this data and the limitations it imposes (lateral views, anyone?). In general, the more complex the data or transformation, the easier it tends to be to handle in Pig than in Hive.
  2. Many of the processes I work with are pipeline-friendly; meaning, I can start with a single dataset, integrate/transform/cleanse it, write out the granular details to a table, then aggregate the same data and write it to a separate table. Pig makes this faster overall by allowing you to build a data pipeline, and it minimizes data quality issues resulting from inconsistent logic between the granular and aggregate table versions.
  3. Hadoop is not meant for serving data; instead, my team writes the final results of ETL to a serving layer, which includes SQL Server, MySQL, and Cassandra. Pig makes it easy to process the data once and write the exact same dataset to each destination server. This works well for both refresh and incremental patterns and, again, minimizes data inconsistencies resulting from the creation of separate ETL packages for each of these destination servers.
  4. Pig’s variable support is better than Hive’s. I can write logic like the sketch shown just after this list…

    Anyone who has written enterprise ETL understands why this is a very good thing.
  5. PigStats makes it easier to identify jobs that may have exceptions, such as jobs that write zero rows or jobs that write a different number of rows to each destination server. This makes it easier to monitor for and raise alerts on these types of conditions.
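To illustrate the variable point in item 4 (and the pipeline point in item 2), here’s a rough Pig sketch; the parameter, paths, and schema are all made up:

    -- default the load date to yesterday, but let the scheduler override it with -param
    %default LOAD_DATE `date -d "yesterday" +%Y-%m-%d`;

    raw_orders   = LOAD '/data/raw/orders/$LOAD_DATE' USING PigStorage('\t')
                       AS (order_id:long, customer_id:long, order_total:double);
    clean_orders = FILTER raw_orders BY order_id IS NOT NULL;

    -- write the granular data, then reuse the same relation for the aggregate
    STORE clean_orders INTO '/data/dw/orders/$LOAD_DATE' USING PigStorage('\t');

    daily_totals = FOREACH (GROUP clean_orders BY customer_id) GENERATE
                       group AS customer_id,
                       SUM(clean_orders.order_total) AS total_spend;
    STORE daily_totals INTO '/data/dw/daily_totals/$LOAD_DATE' USING PigStorage('\t');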

With that said, I do recommend Hive as a great place to start for ad hoc and one-off analyses, or for prototyping new processes. However, once you’re ready to move towards production-quality processes, I think you’d be better served standardizing on Pig for data warehouse ETL and Hive for data warehouse query access.

Your turn: what do you use for ETL in Hadoop? Do you like it or dislike it? 🙂


Is Hadoop better than SQL Server?

Over the past year, I’ve switched my focus from SQL Server and Teradata to Hadoop. As someone who has spent the majority of my professional career focused on SQL Server, and who has been recognized as a Microsoft Most Valuable Professional (MVP) in SQL Server for 4 consecutive years, it comes as no surprise that I often get asked:

“Why are you switching to Hadoop? Is it better than SQL Server?”

I’ll save you the suspense of a long post and answer the second question first: No, it’s not.

SQL Server is Still Relevant
Here’s why. SQL Server does what it does *extremely* well. I would not hesitate to suggest SQL Server in numerous scenarios, such as the database backend for an OLTP application, a data store for small-to-medium-sized data marts or data warehouses, or an OLAP solution for building and serving cubes. Honestly, with little exception, it remains my go-to solution over MySQL and Oracle.

Now that we’ve cleared that up, let’s go back to the first question. If SQL Server is still a valid and effective solution, why did I switch my focus to Hadoop?

Excellent question, dear reader! I’m glad you asked. 🙂

Before I get to the reason behind my personal decision, let’s discuss arguably the biggest challenge we face in the data industry.

Yes, Data Really Is Exploding
We’re in the midst of a so-called Data Explosion. You’ve probably heard about this… it’s one of the few technical topics that has actually made it into mainstream media. But I still think it’s important to understand just how quickly it’s growing.

Every year, EMC sponsors a study called The Digital Universe, which “is the only study to quantify and forecast the amount of data produced annually.” I’ve reviewed each of their studies and taken the liberty of preparing the following graphic* based on past performance and future predictions. Also worth noting is that EMC historically tends to be conservative in their data growth estimates.

[Graphic: Data Growth Rates, per EMC’s The Digital Universe]

* Feel free to borrow this graphic with credit to: Michelle Ufford & EMC’s The Digital Universe

Take a moment and just really absorb this graphic. They say a picture is worth a thousand words. My hope is that this picture explains why the concept of Big Data is so important to all data professionals. DBAs, ETL developers, data warehouse engineers, BI analysts, and more are affected by the fact that data is growing at an alarming rate, and the majority of that data growth is coming in the form of unstructured and semi-structured data.

Throughout my career, I have been focused on using data to do really cool things for the business. I have built systems to personalize marketing offers, predict customer behaviors, and improve the customer experience in our applications. There is no doubt in my mind that Hadoop is absolutely critical to the ability of an enterprise to perform these types of activities.

The Bottom Line
SQL Server isn’t going away. Arguably, the most valuable raw data in an enterprise will still be managed in a SQL Server database, such as inventory, customer information, and order data.

So again: why did I make the decision to focus on Hadoop over the past year?

I once had the pleasure to work for a serial entrepreneur. One day over lunch, he gave me a piece of advice that resonated with me and would come to influence my whole career: “Michelle, to be successful in whatever you do, you need to find the point where your heart and the money intersect.”

My heart is in data, the money is in the ability to effectively consume data, and Hadoop is where they intersect.
