Hive / HCatalog Forum

analyse xml file in hadoop

  • #28745
    Anupam Gupta

Hi, I have uploaded an XML file to HDFS. Now I want to know how I can analyse/view the XML file in Hadoop. I am new to Hadoop, please help.



  • #28939
    Carter Shanklin


    Hive provides a number of XPath UDFs you can use.


    What is usually done is to load the XML files into a Hive table with string columns, one XML document per row. So you might have a DDL like: CREATE TABLE xmlfiles (id int, xmlfile string);

    Then you can use any of the UDFs against the XML data.
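    As a rough sketch of this approach (the table and column names here are illustrative, not from an actual deployment), the DDL and a query using Hive's built-in XPath UDFs might look like:

    ```sql
    -- One XML document per row, stored in a plain string column.
    CREATE TABLE xmlfiles (id INT, xmlfile STRING);

    -- Hive's XPath UDFs can then be applied directly to the column.
    -- The element names below are assumptions for illustration.
    SELECT id,
           xpath_string(xmlfile, '/Event/System/EventID') AS event_id,
           xpath_int(xmlfile, '/Event/System/Version')    AS version
    FROM xmlfiles;
    ```

    Each row must contain a single well-formed XML document for the UDFs to parse it.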

    David Novogrodsky

    I am having a similar problem.

    I created a Hive table using one column. Each row contains one XML record. Here is the script I used to create this first table:
    CREATE EXTERNAL TABLE xml_event_table (
    xmlevent string)
    LOCATION '/user/cloudera/vector/events';

    Here is part of a sample XML Event:
    <Event xmlns=""><System><Provider Name="Microsoft-Windows-Security-Auditing" Guid="54849625-5478-4994-a5ba-3e3b0328c30d"></Provider> <EventID Qualifiers="">4672</EventID> <Version>0</Version>…</Event>

    I want to create a view that contains the EventID, but the XPath is not working correctly:
    CREATE VIEW xpath_xml_event_view01(event_id, computer, user_id) AS
    SELECT
    xpath_string(xmlevent, 'Event/System/EventID'),
    xpath_string(xmlevent, '/Event[1]/System[1]/Computer'),
    xpath_string(xmlevent, '/Event[1]/System[1]/EventID')
    FROM xml_event_table;
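    One way to narrow down a problem like this (a debugging sketch, not a diagnosis of the exact query above) is to confirm the UDF and expression syntax against a small inline literal before running it over the table:

    ```sql
    -- Sanity-check xpath_string on an inline literal first
    -- (this is the standard form documented for the Hive XPath UDFs):
    SELECT xpath_string('<a><b>bb</b></a>', 'a/b');

    -- Then try the same kind of expression against a single stored row:
    SELECT xpath_string(xmlevent, '/Event[1]/System[1]/EventID')
    FROM xml_event_table
    LIMIT 1;
    ```

    Two common pitfalls worth checking: each row must hold one complete, well-formed XML document, and a non-empty default namespace declaration (xmlns="...") on the root element can prevent unqualified XPath steps like /Event/System from matching, since Hive's XPath UDFs do not bind namespace prefixes.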

    Josh Spiegel

    Oracle XML Extensions for Hive can be used to create a Hive table over XML.
