If you’ve read any job trend reports lately, it’s hard to miss the growth in Big Data jobs and the demand for the associated technical and analytical skills. Employers are looking for new hires with Hadoop-related skills — which includes everything from cluster administration to data analysis — but because the field is so new, it’s rare to find someone with more than 1-2 years of experience outside of the Bay Area. What this amounts to is a great opportunity for anyone looking to grow their role, experience, and/or salary. Best of all, because this ecosystem is built on open-source technology, these skills are largely obtainable by anyone with a passion for learning and access to a computer.
To help go-getters get up to speed, I’ve just launched a new page on this blog: Free Hadoop Resources. This is the very beginning of a list I’m compiling of great, free resources for learning Hadoop fundamentals, Hive, Pig, and Spark.
If I missed a great resource, please let me know and I’ll get it added. 🙂
First, a brief note about this blog. Shortly after I announced this blog, an… event?… was announced, and it seemed prudent to avoid blogging while that event was underway. However, the quiet period is now over and I have several months of blog posts in the queue! So let the blogging commence! (again) 🙂
Last week, I had the pleasure of speaking at Hadoop Summit 2015 on Data Warehousing in Hadoop. There was a lot of interest in this topic… the session was packed, and I received a lot of great questions both during and after the session. One question that kept popping up was why I prefer Pig over Hive for performing Data Warehouse ETL in Hadoop. The question itself wasn’t as surprising as the context in which it was raised, i.e. “But I thought Hive was for data warehousing?” These questions were largely from people who were investigating and/or beginning their own data warehouse migration or enrichment projects. After a few of these conversations, I came to realize that this was a result of the excellent marketing Hive has done in billing itself as “data warehouse software.”
Given the confusion, please allow me to clarify my position on this topic: I think Hive and Pig both have a role in a Hadoop data warehouse. The purpose of this post is to explain my opinion 🙂 of the role each technology plays.
I rely on Hive for two primary purposes: definition/exposure of DDL via HCatalog and ad hoc querying. I can create an awesome data warehouse, but if I don’t expose it in Hive via HCatalog, then data consumers won’t know what’s available to query. Commands such as show databases and show tables wouldn’t return information about the rich and valuable datasets my team produces. So I think it’s actually extremely important to define DDL in Hive as the first step to producing new datasets.
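To make that concrete, here’s a minimal sketch of the kind of DDL I mean — the database, table, and column names are hypothetical, and your file format and location will of course differ:

```sql
-- Hypothetical example: expose a new dataset via Hive DDL
-- so it shows up in HCatalog for data consumers.
CREATE DATABASE IF NOT EXISTS dw;

CREATE EXTERNAL TABLE IF NOT EXISTS dw.page_views (
    view_date     STRING,
    user_id       BIGINT,
    page_url      STRING,
    referrer_url  STRING
)
PARTITIONED BY (etl_date STRING)
STORED AS ORC
LOCATION '/data/dw/page_views';
```

Once this exists, `show tables` surfaces the dataset to anyone browsing the warehouse.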
Also, Hive has done a decent job of ensuring that the core query syntax & functionality from SQL has been ported into Hive. Thus, anyone who has a basic understanding of SQL can easily sit down and start to retrieve data from Hadoop. The importance of this cannot be overstated… quite simply, it has lowered the barrier to entry and has provided analysts with an easier transition from querying legacy DWs to querying Hadoop using HiveQL.
Hive also makes it easy to materialize the results of queries into tables. You can do this either through CTAS (CREATE TABLE AS SELECT) statements, which are useful for storing the results of ad hoc queries, or using an INSERT statement. This makes it very easy and natural for someone with a data engineering background in pretty much any enterprise data warehouse project (SQL Server, APS PDW, Teradata, Netezza, Vertica, etc.) to gravitate toward Hive for this type of functionality.
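For example, both patterns look like this in HiveQL (table names are hypothetical):

```sql
-- CTAS: materialize an ad hoc query into a brand-new table
CREATE TABLE dw.daily_page_views
STORED AS ORC
AS
SELECT view_date, COUNT(*) AS page_views
FROM   dw.page_views
GROUP  BY view_date;

-- Or append results to an existing table with INSERT
INSERT INTO TABLE dw.daily_page_views
SELECT view_date, COUNT(*)
FROM   dw.page_views
GROUP  BY view_date;
```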
However, I think that’s a short-sighted mistake.
Here’s why: when it comes to ETL, my focus is on a robust solution that ensures enterprise-level, production-quality processes that data consumers can rely on and have confidence in. Here are some of the top reasons why I believe Pig fits this role better than Hive:
Hive works very well with structured data, but the whole point of moving our data warehouse to Hadoop is to take advantage of so-called “new data,” also known as unstructured and semi-structured data. Hive does provide support for complex data types, but it can quickly get… well, complex 🙂 when you try to work with this data within the limitations Hive imposes (lateral views, anyone?). In general, the more complex the data or transformation, the easier it seems to be to perform it in Pig than in Hive.
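To illustrate the difference, here’s a hypothetical sketch of unnesting an array column in each language (table and column names are made up):

```sql
-- Hive: exploding an array column requires a lateral view
SELECT v.user_id, tag
FROM   dw.page_views_tagged v
LATERAL VIEW explode(v.tags) t AS tag;
```

```pig
-- Pig: the same unnesting is a single FLATTEN in the pipeline
views = LOAD 'dw.page_views_tagged'
        USING org.apache.hive.hcatalog.pig.HCatLoader();
tags  = FOREACH views GENERATE user_id, FLATTEN(tags) AS tag;
```

It’s a small example, but the gap widens as the nesting and transformation logic get deeper.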
Many of the processes I work with are pipeline-friendly, meaning I can start with a single dataset; integrate, transform, and cleanse it; write the granular details out to a table; then aggregate the same data and write it to a separate table. Pig makes this faster overall by allowing you to build a single data pipeline, and it minimizes data quality issues resulting from inconsistent logic between the granular and aggregate versions of the table.
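A hypothetical sketch of that pipeline pattern in Pig — both tables are fed from the same cleansed relation, so the logic can’t drift between them (paths and names are made up):

```pig
-- One pipeline feeds both the granular and the aggregate table
raw      = LOAD '/data/raw/page_views' USING PigStorage('\t')
           AS (view_date:chararray, user_id:long, page_url:chararray);
cleansed = FILTER raw BY user_id IS NOT NULL;

-- write the granular detail
STORE cleansed INTO 'dw.page_views'
      USING org.apache.hive.hcatalog.pig.HCatStorer();

-- aggregate the SAME relation and write the rollup
by_date  = GROUP cleansed BY view_date;
daily    = FOREACH by_date
           GENERATE group AS view_date, COUNT(cleansed) AS page_views;
STORE daily INTO 'dw.daily_page_views'
      USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Pig’s multi-query execution runs both STOREs from one script, sharing the upstream work.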
Hadoop is not meant for serving data; instead, my team writes the final results of ETL to a serving layer, which includes SQL Server, MySQL, and Cassandra. Pig makes it easy to process the data once and write the exact same dataset to each destination server. This works well for both refresh and incremental patterns and, again, minimizes data inconsistencies resulting from the creation of separate ETL packages for each of these destination servers.
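A hedged sketch of the process-once, store-everywhere pattern (the table names and JDBC details are hypothetical, DBStorage ships with piggybank, and the exact storer classes and constructor signatures vary by version and destination — check your distribution’s docs):

```pig
-- Compute the dataset once, then store the identical relation
-- to each serving destination (e.g. MySQL via JDBC below;
-- a Cassandra storer would be a third STORE of the same relation).
REGISTER piggybank.jar;

daily = LOAD 'dw.daily_page_views'
        USING org.apache.hive.hcatalog.pig.HCatLoader();

STORE daily INTO 'ignored' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver', 'jdbc:mysql://mysql-host/serving',
    'etl_user', 'secret',
    'INSERT INTO daily_page_views (view_date, page_views) VALUES (?, ?)');
```

Because every destination receives the same relation, a refresh or incremental run can’t produce three subtly different copies of the data.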
Pig’s variable support is better than Hive’s. I can write logic like…
```pig
example_source_filter = FILTER primary_source_table_name
                        BY example_date_column == ToDate('$etl_date', 'yyyy-MM-dd');
```

Here `$etl_date` is substituted at run time, e.g. via `pig -param etl_date=2015-06-01`.
Anyone who has written enterprise ETL understands why this is a very good thing.
PigStats makes it easier to identify jobs that may have exceptions, such as jobs that write zero rows or jobs that write a different number of rows to each destination server. This makes it easier to monitor for and raise alerts on these types of conditions.
With that said, I do recommend Hive as a great place to start for ad hoc and one-off analyses or for prototyping new processes. However, once you’re ready to move towards production-quality processes, I think you’d be better served standardizing on Pig for data warehouse ETL and Hive for data warehouse query access.
Your turn: what do you use for ETL in Hadoop? Do you like it or dislike it? 🙂
I’ll save you the suspense of a long post and answer the second question first: No, it’s not.
SQL Server is Still Relevant

Here’s why. SQL Server does what it does *extremely* well. I would not hesitate to suggest SQL Server in numerous scenarios, such as the database backend for an OLTP application, a data store for small-to-medium sized data marts or data warehouses, or an OLAP solution for building and serving cubes. Honestly, with little exception, it remains my go-to solution over MySQL and Oracle.
Now that we’ve cleared that up, let’s go back to the first question. If SQL Server is still a valid and effective solution, why did I switch my focus to Hadoop?
Excellent question, dear reader! I’m glad you asked. 🙂
Before I get to the reason behind my personal decision, let’s discuss arguably the biggest challenge we face in the data industry.
Yes, Data Really Is Exploding

We’re in the midst of a so-called Data Explosion. You’ve probably heard about this… it’s one of the few technical topics that has actually made it into mainstream media. But I still think it’s important to understand just how quickly it’s growing.
Every year, EMC sponsors a study called The Digital Universe, which “is the only study to quantify and forecast the amount of data produced annually.” I’ve reviewed each of their studies and taken the liberty of preparing the following graphic* based on past performance and future predictions. Also worth noting: EMC historically tends to be conservative in its data growth estimates.
* Feel free to borrow this graphic with credit to: Michelle Ufford & EMC’s The Digital Universe
Take a moment and just really absorb this graphic. They say a picture is worth a thousand words. My hope is that this picture explains why the concept of Big Data is so important to all data professionals. DBAs, ETL developers, data warehouse engineers, BI analysts, and more are affected by the fact that data is growing at an alarming rate, and the majority of that data growth is coming in the form of unstructured and semi-structured data.
Throughout my career, I have been focused on using data to do really cool things for the business. I have built systems to personalize marketing offers, predict customer behaviors, and improve the customer experience in our applications. There is no doubt in my mind that Hadoop is absolutely critical to the ability of an enterprise to perform these types of activities.
The Bottom Line

SQL Server isn’t going away. Arguably, the most valuable raw data in an enterprise will still be managed in a SQL Server database, such as inventory, customer information, and order data.
So again: why did I make the decision to focus on Hadoop over the past year?
I once had the pleasure to work for a serial entrepreneur. One day over lunch, he gave me a piece of advice that resonated with me and would come to influence my whole career: “Michelle, to be successful in whatever you do, you need to find the point where your heart and the money intersect.”
My heart is in data, the money is in the ability to effectively consume data, and Hadoop is where they intersect.