HDFS Forum

Loading a Fixed Width File

  • #43477
    Seth Ramey

    I’m trying to figure out how to load the file below. I’ve heard that the best way is something to do with streaming, but I don’t know how to actually do this.

    Rows that begin with “01” denote a new record. A variable number of rows can appear between “01” rows, but we are only interested in a handful (those starting with “02”, “03” or “23”, for example). From one of those rows we may need to grab 10 characters beginning at position 7; from another, 7 characters at position 140.

    My question is: how would we iterate through that with the query tools in Pig or Hive, or by some other method?
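The record layout described above can be sketched outside Pig or Hive as a plain line-by-line pass. This is a minimal sketch in Python rather than the C# used later in the thread; the function name `group_records` and the `wanted` parameter are hypothetical, and the only facts taken from the post are that a row starting with "01" opens a new record and that only some row types matter.

```python
from typing import Iterable, Iterator, List, Optional, Set

RECORD_START = "01"  # per the post: a row starting with "01" opens a new record


def group_records(
    lines: Iterable[str], wanted: Optional[Set[str]] = None
) -> Iterator[List[str]]:
    """Yield one list of rows per record, splitting at each "01" row.

    `wanted` is an optional set of two-character row prefixes to keep
    (the "01" row is always kept). Rows seen before the first "01" row
    are dropped, which is an assumption of this sketch.
    """
    current: List[str] = []
    for raw in lines:
        row = raw.rstrip("\r\n")
        if row.startswith(RECORD_START):
            if current:
                yield current  # the previous record is complete
            current = [row]
        elif current and row and (wanted is None or row[:2] in wanted):
            current.append(row)
    if current:
        yield current  # flush the last record
```

Because it only needs an iterable of lines, the same function works on an open file handle, for example `group_records(open(path), wanted={"02", "03", "23"})`, where `path` is whatever local copy of the file is available.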

    I’ve uploaded a picture of a sample of the file below:


  • #43487
    Seth Ramey

    If it helps, here are the rows we are interested in, which I’m currently parsing with a small custom C# program. To obtain the well district, for example, I look at the row that starts with 01, go to character 14 and take the next 2 characters. The assumption is that row positions are zero-based:

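    public struct PositionLength
    {
        // Zero-based start position and length of the field within the row.
        public int position, length;
        // Two-character prefix identifying the row type, e.g. "01".
        public string rowbeginswith;

        public PositionLength(string row, int pos, int len)
        {
            rowbeginswith = row;
            position = pos;
            length = len;
        }
    }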

    static PositionLength pl_district = new PositionLength("01", 14, 2);
    static PositionLength pl_apinumber = new PositionLength("01", 2, 9);
    static PositionLength pl_total_depth = new PositionLength("01", 28, 5);
    static PositionLength pl_isplugged = new PositionLength("01", 90, 1);

    static PositionLength pl_oilgas = new PositionLength("02", 2, 1);

    static PositionLength pl_wellcompletiondate = new PositionLength("03", 39, 8);

    static PositionLength pl_lat = new PositionLength("13", 132, 9);
    static PositionLength pl_lng = new PositionLength("13", 142, 9);

    static PositionLength pl_oilnumber = new PositionLength("23", 50, 6);
    static PositionLength pl_gasnumber = new PositionLength("23", 56, 6);
    static PositionLength pl_operator = new PositionLength("23", 11, 6);
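For anyone who would rather run this on a Hadoop node without .NET, the same position/length table carries over to Python with string slicing. This is a rough translation under the zero-based assumption stated in the post; `FieldSpec`, `FIELDS`, `extract` and `extract_fields` are hypothetical names, and the numbers are copied from the C# definitions above.

```python
from typing import Dict, NamedTuple, Optional


class FieldSpec(NamedTuple):
    row_prefix: str  # two-character row type, e.g. "01"
    position: int    # zero-based start of the field within the row
    length: int      # number of characters to take


# Values copied from the C# PositionLength definitions in the post.
FIELDS: Dict[str, FieldSpec] = {
    "district": FieldSpec("01", 14, 2),
    "apinumber": FieldSpec("01", 2, 9),
    "total_depth": FieldSpec("01", 28, 5),
    "isplugged": FieldSpec("01", 90, 1),
    "oilgas": FieldSpec("02", 2, 1),
    "wellcompletiondate": FieldSpec("03", 39, 8),
    "lat": FieldSpec("13", 132, 9),
    "lng": FieldSpec("13", 142, 9),
    "oilnumber": FieldSpec("23", 50, 6),
    "gasnumber": FieldSpec("23", 56, 6),
    "operator": FieldSpec("23", 11, 6),
}


def extract(row: str, spec: FieldSpec) -> Optional[str]:
    """Return the field described by `spec`, or None if the row does not
    match the spec's prefix or is too short to contain the whole field."""
    end = spec.position + spec.length
    if not row.startswith(spec.row_prefix) or len(row) < end:
        return None
    return row[spec.position:end]


def extract_fields(row: str) -> Dict[str, str]:
    """Return every field from FIELDS that can be read from this row."""
    found: Dict[str, str] = {}
    for name, spec in FIELDS.items():
        value = extract(row, spec)
        if value is not None:
            found[name] = value
    return found
```

Returning None for short rows, instead of a truncated slice, is a design choice of this sketch; it keeps a cut-off row from silently producing a partial value.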

    Carter Shanklin


    Pig might work, but I’m not an expert. Hive: no, don’t use it for this use case.

    Streaming refers to Hadoop Streaming, which lets you use your .NET application in conjunction with MapReduce. This requires Hadoop on Windows. See http://www.nuget.org/packages/Microsoft.Hadoop.MapReduce/ for details. That’s all I know about it, unfortunately.
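Hadoop Streaming is not tied to .NET: a mapper can be any executable that reads lines on standard input and writes tab-separated key/value lines on standard output. As an illustration only, here is a sketch of such a mapper in Python that tags each row of interest with the API number of the "01" row that precedes it. The names `map_rows`, `main` and `WANTED` are hypothetical; the API number position (2, length 9) and the row types come from the C# snippet earlier in the thread.

```python
import sys
from typing import Iterable, Iterator, Optional, TextIO

WANTED = {"02", "03", "13", "23"}  # row types used in the thread's C# snippet
API_START, API_LEN = 2, 9          # from pl_apinumber: row "01", position 2, length 9


def map_rows(lines: Iterable[str]) -> Iterator[str]:
    """Emit "apinumber<TAB>row" for each "01" row and each wanted row.

    The mapper is stateful: it remembers the API number of the most
    recent "01" row. Rows seen before any "01" row are skipped.
    """
    api: Optional[str] = None
    for raw in lines:
        row = raw.rstrip("\r\n")
        prefix = row[:2]
        if prefix == "01":
            api = row[API_START:API_START + API_LEN]
            yield f"{api}\t{row}"
        elif api is not None and prefix in WANTED:
            yield f"{api}\t{row}"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Entry point for a streaming mapper script: stdin in, stdout out."""
    for out in map_rows(stdin):
        stdout.write(out + "\n")

# A real mapper script would end with a call to main().
```

A script like this would be handed to Hadoop Streaming through its `-mapper` option. One caveat worth checking before relying on it: Hadoop may split a large input file between mappers at an arbitrary line, so the rows of one record can land in different mappers, and this sketch simply drops rows that arrive before the first "01" row a mapper sees.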
