
Hortonworks Sandbox Forum

Why is Select Count(*) slower than Select *?

  • #14459

    Just got my VirtualBox Sandbox running and am going through the tutorials (which are amazing btw!) and had a question that I was hoping folks out here could help me with.

    I ran the Hive query (in Tutorial 1) that runs a count on NYSE_Stocks, and then the one that just lists all the stocks (Select *), and I am a bit puzzled as to why the Select count(*) is slower than the Select *. Also, I noticed that the Select count(*) did not spawn any MR jobs.

    Can anyone explain what is going on behind the covers for both these queries and why one is slower than the other?

  • Author
  • #14461
    Carter Shanklin

    In Hive, an aggregation like count(*) requires a MapReduce job, while select * is streamed locally out of a single process.
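    One way to see this difference for yourself is to run EXPLAIN on both queries; this is a sketch, assuming the NYSE_Stocks table from the tutorial:

    ```sql
    -- EXPLAIN prints the plan Hive chooses without executing the query.
    -- The plain SELECT * shows a simple fetch/table-scan stage, while the
    -- count(*) plan contains map and reduce stages.
    EXPLAIN SELECT * FROM NYSE_Stocks;
    EXPLAIN SELECT count(*) FROM NYSE_Stocks;
    ```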

    Tom Hanlon

    There are three types of operations that a Hive query can perform.

    In order from cheapest and fastest to most expensive and slowest, here they are.

    A Hive query can be a metadata-only request.

    Show tables and describe table are examples. For these queries, the Hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.
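    Metadata-only requests look like the following (the table name is just the one from the tutorial):

    ```sql
    -- Neither of these touches HDFS or starts a MapReduce job; Hive
    -- answers both from the metadata database alone.
    SHOW TABLES;
    DESCRIBE NYSE_Stocks;
    ```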

    A Hive query can be an HDFS get request.
    Select * from table would be an example. In this case Hive can return the results by performing an HDFS operation: hadoop fs -get, more or less.
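    In sketch form, again using the tutorial's table:

    ```sql
    -- A bare SELECT * with no filter or aggregation can be satisfied by
    -- streaming the table's files straight out of HDFS, roughly like a
    -- `hadoop fs -get` on the table's warehouse directory.
    SELECT * FROM NYSE_Stocks;
    ```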

    A Hive query can be a MapReduce job.

    Hive has to ship the jar to HDFS, the JobTracker queues the tasks, the TaskTrackers execute the tasks, and the final data is put into HDFS or shipped to the client.

    The MapReduce job itself can take different forms as well.

    It can be a map-only job.
    Select * from table where id > 100, for example: all of that filtering logic can be applied in the mapper, so no reduce phase is needed.
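    As a sketch (table and column names here are illustrative, following the answer's example):

    ```sql
    -- The WHERE predicate is evaluated row by row, so each mapper can
    -- filter its own input split independently; no reducer is required.
    SELECT * FROM some_table WHERE id > 100;
    ```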

    It can be a map-and-reduce job:
    Select min(id) from table;
    Select * from table order by id;

    It can also lead to multiple MapReduce passes, but I think the above summarizes the main behaviors.
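    A query of this last kind might look like the following (a sketch; the table and columns are illustrative):

    ```sql
    -- A GROUP BY aggregation followed by a global ORDER BY is commonly
    -- planned as two chained jobs: one pass to aggregate per symbol,
    -- and a second pass to sort the aggregated result.
    SELECT symbol, count(*) AS cnt
    FROM NYSE_Stocks
    GROUP BY symbol
    ORDER BY cnt DESC;
    ```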


The forum ‘Hortonworks Sandbox’ is closed to new topics and replies.
