
The legacy Hortonworks Forum is now closed. A read-only version of the former site remains available. The site will be taken offline on January 31, 2016.

Hive / HCatalog Forum

Analyze an XML file in Hadoop

  • #28745
    Anupam Gupta

Hi, I have uploaded an XML file to HDFS. Now I want to know how I can analyze/view the XML file in Hadoop. I am new to Hadoop; please help.


  • #28939
    Carter Shanklin


    Hive provides a number of XPath UDFs you can use.


    What is usually done is to load the XML documents into a Hive table with a string column, one document per row. So you might have DDL like: CREATE TABLE xmlfiles (id INT, xmlfile STRING);

    Then you can use any of the UDFs against the XML data.
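    For example, a minimal sketch of that approach (the table name, column names, and XPath expressions below are illustrative, not from this thread):

    ```sql
    -- One XML document per row, stored as a plain string
    CREATE TABLE xmlfiles (id INT, xmlfile STRING);

    -- Hive's built-in XPath UDFs evaluate an expression against the string column
    SELECT id,
           xpath_string(xmlfile, '/catalog/book[1]/title') AS first_title,
           xpath_int(xmlfile, 'count(/catalog/book)')      AS book_count
    FROM xmlfiles;
    ```

    xpath_string returns the text of the first match, while xpath_int casts the result to an integer, which makes it handy for count() expressions.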

    David Novogrodsky

    I am having a similar problem.

    I created a Hive table using one column. Each row contains one XML record. Here is the script I used to create this first table:
    CREATE EXTERNAL TABLE xml_event_table (
        xmlevent STRING)
    LOCATION '/user/cloudera/vector/events';

    Here is part of a sample XML event:
    <Event xmlns=""><System><Provider Name="Microsoft-Windows-Security-Auditing" Guid="54849625-5478-4994-a5ba-3e3b0328c30d"></Provider> <EventID Qualifiers="">4672</EventID> <Version>0</Version>…</Event>

    I want to create a view that contains the EventID, but the XPath is not working correctly:
    CREATE VIEW xpath_xml_event_view01(event_id, computer, user_id) AS
    SELECT
        xpath_string(xmlevent, 'Event/System/EventID'),
        xpath_string(xmlevent, '/Event[1]/System[1]/Computer'),
        xpath_string(xmlevent, '/Event[1]/System[1]/EventID')
    FROM xml_event_table;
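    One way to narrow this down is to evaluate the expression against a literal document, with no table involved. A minimal sanity check, assuming the element structure shown above:

    ```sql
    -- Evaluate the XPath against an inline document; if this returns '4672'
    -- the expression is fine and the problem is in the stored data
    SELECT xpath_string(
      '<Event><System><EventID>4672</EventID><Computer>HOST1</Computer></System></Event>',
      '/Event[1]/System[1]/EventID');
    ```

    Note that Hive's XPath UDFs are namespace-aware: if the real documents carry a non-empty default namespace on <Event> (the xmlns="" in the sample suggests one may have been stripped), unprefixed paths like /Event/System/EventID will silently match nothing and return empty strings.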

    Josh Spiegel

    Oracle XML Extensions for Hive can be used to create a Hive table over XML.

The forum 'Hive / HCatalog' is closed to new topics and replies.
