One of the best features of embedding is how it simplifies writing UDFs and using them right away in the same script without superfluous declarations. Computing a transitive closure is a good example of an algorithm requiring an iteration, a few simple UDFs and an end condition to decide when to stop iterating. Embedding is available in Pig 0.9. Knowledge of both Pig and Python is required to follow. Examples are available on github.
2. The problem Before diving into the solution I will give a brief overview of the problem. We want to find the connected components of a graph by performing a transitive closure. Of course I will assume that the graph does not fit in main memory otherwise what follows is not very useful.
A picture is worth a thousand words; here is a visual statement of the problem:
I highlighted the diameter in red as it plays an important role in the complexity of the algorithm. The diameter of a connected component is the longest shortest path between 2 nodes of the component.
The algorithm will require log2(max(diameter of a component)) iterations because every iteration we double the number of relations that have been followed. This is not the overall complexity but it ensures that the number of iterations will stay low as with just 10 iterations you can handle components with a diameter of 1000+. In most real life use cases the diameter is very low and a few iterations will suffice.
3. The algorithm I wrote the following implementation to illustrate the embedding feature. It may not be optimal but it is a working solution with a reasonable number of iterations. The algorithm does a graph traversal from all the nodes in parallel. It does one hop per iteration (pardon the scientific jargon) and then merges the results per node (using follow() and OR()). As a consequence the diameter of the graph followed from a node doubles every iteration. In this case, the edges of the graph are considered bidirectional and are stored one way (see normalize() and bidirectional()) to save space. The edges are stored in two files: followed (empty at the beginning) and to_follow which will tell us when all the edges have been followed (to_follow is empty). The last part of the script does some formatting to turn a list of edges into group of nodes (using sort() to make sure they are always represented the same way).
4. The code Here is an example of how to solve this problem by embedding Pig in Python.
SPLIT new_links INTO new_followed IF followed != 0, new_to_follow IF followed == 0;
new_followed = FOREACH new_followed GENERATE id1, id2;
new_to_follow = FOREACH new_to_follow GENERATE id1, id2;
STORE new_followed INTO ‘$new_followed’;
STORE new_to_follow INTO ‘$new_to_follow’;
to_follow = “data/tc_data_simple”
followed = “out/tc/followed_0”
# create empty dataset for first iteration
Pig.fs(“mkdir “ + followed)
Pig.fs(“touchz “ + followed + “/part-m-00000”)
for i inrange(10):
new_to_follow = “out/tc/to_follow_” + str(i + 1)
new_followed = “out/tc/followed_” + str(i + 1)
job = Q.bind().runSingle() ifnot job.isSuccessful(): raise‘failed’
to_follow = new_to_follow
followed = new_followed # detect if we are done ifnot job.result(“new_to_follow”).iterator().hasNext(): break
links = LOAD ‘$followed’ AS (id1: chararray, id2: chararray);
links = FOREACH links GENERATE FLATTEN( bidirectional(id1, id2) );
result = DISTINCT (
FOREACH (GROUP links by id1)
STORE result INTO ‘out/tc/groups’;
5. Notables There are a few things to notice in this implementation:
UDFs are just defined by functions in the same script
The output schema of the UDF is defined by using the @outputSchema decorator. Pig needs the schema of the output to interpret the statements following the UDF call.
The native Python structures (dictionary, tuple, list) are used, the conversion to Pig types (Map, tuple, bag) is automatic. This makes the UDF code much more compact.
UDFs are directly available in embedded Pig scripts using the function names. No other declaration is required.
We iterate a maximum of 10 times (max diameter = 210 = 1024) and check if the “to follow” relation is empty to decide if we need to stop the iteration. This is done using the JobStats object returned by runSingle().
ASU's Data-Driven Innovation to Transform Society
With the San Jose DataWorks Summit (June 13-15) just two months away, we’re busy finalizing the lineup of an impressive array of speakers and business use cases. This year our Enterprise Adoption Track will feature Jay Etchings, Director of Operations for Research Computing at Arizona State University. In February we announced Jay’s new book, “Strategies in Biomedical Data…
As we kick off the new year I wanted to thank our customers, partners, Apache community members, and of course the amazing Hortonworks team, for an amazing 2016. Let’s take a step back and look at some of the Hortonworks highlights from last year... IN THE ECOSYSTEM there was tremendous acceleration. At the beginning of…
It’s no secret that there is a data explosion. A recent IDC analyst report from April 2014 indicated the volume of data, known as the digital universe, is doubling in size every two years. And by 2020, there will be as many digital bits as there are stars in the universe. There are many reasons…
Jumpstart Your Digital Transformation with Hadoop...
Guest author: Jeff Kelly, Data Strategist, Pivotal The phrase “digital transformation” gets bandied about a lot these days, but what exactly does it mean? When you strip away the hyperbole, I believe digital transformation is the process by which enterprises evolve from using traditional information technology to merely support existing business models to adopting modern…
People often think about cloud architecture in simplistic terms: you’re either public, private, or hybrid. (In fact, there’s even confusion about the meaning of the term “hybrid” itself—this video helps clear it up: https://www.youtube.com/watch?v=HPKI-U_ef5w In the real world, of course, virtually every implementation is hybrid—no company puts 100% of its IT environment into one single…
The 100% open source and community driven innovation of Apache Hive 2.0 and LLAP (Long Last and Process) truly brings agile analytics to the next level. It enables customers to perform sub-second interactive queries without the need for additional SQL-based analytical tools, enabling rapid analytical iterations and providing significant time-to-value. TRY HIVE LLAP TODAY Read about…
Cloud Computing is one of the big three trends impacting IT architectures today. What some may not realize is that an underlying connected data architecture is not only essential for cloud, but sits at the confluence of all three trends. Here's why. The first big trend is IoT. According to BI Intelligence, we can now…
Insights Aggregation and Predictive Analytics within...
How Hortonworks can help hotel industry capture value through Insights Aggregation and Predictive Analytics Big Data has transformed every industry including the hospitality vertical. Through customer analytics, targeted segmentation, and campaigning, hotels would like to focus on delivering personalized promotions, cross and up-selling travel services. Our objective is to address these challenges through an open-source…
Show us what you can do! Here at Hortonworks, we’ve been showing people how fast and easy it is to use Hortonworks DataFlow, powered by Apache NiFi to easily, quickly and securely move data to where you need it. So we thought we’d test it out - and we are offering a speed test challenge!…
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, HAWQ, Zeppelin, Atlas, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.