Loading a Fixed Width File



This topic contains 2 replies, has 2 voices, and was last updated by  Carter Shanklin 1 year, 4 months ago.

  • Creator
  • #43477

    Seth Ramey

    I’m trying to figure out how to load the file below. I’ve heard that the best way has something to do with streaming, but I don’t know how to actually do this.

    Rows that begin with “01” denote a new record. A variable number of rows can appear between “01” rows, but we are only interested in a handful of them (those starting with “02”, “03”, or “23”, for example). From one of those rows we may need to grab 10 characters beginning at position 7; from another, 7 characters beginning at position 140.
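
    One way to picture the iteration described above, independent of Pig or Hive: walk the rows in order, start a new group whenever a row begins with “01”, and slice fields out of the rows of interest. The following is a minimal Python sketch, not a confirmed solution; the function names `group_records` and `fixed_field`, the sample rows, and the zero-based positions are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def group_records(lines: Iterable[str], start_prefix: str = "01") -> Iterator[List[str]]:
    """Yield lists of rows, starting a new list at each row that begins with start_prefix."""
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # A start row closes the group collected so far (if there is one).
        if line.startswith(start_prefix) and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def fixed_field(row: str, position: int, length: int) -> str:
    """Return `length` characters starting at zero-based `position` (assumed indexing)."""
    return row[position:position + length]


# Made-up sample rows, not taken from the real file.
sample = [
    "01AAAAAAAAA",
    "02X",
    "99ignored",
    "01BBBBBBBBB",
    "23something",
]
records = list(group_records(sample))
```

    Note that any rows appearing before the first “01” row end up in a group of their own here; whether that is acceptable depends on the real file.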

    My question is – how would we iterate through that via the query tools in Pig or Hive or some other method?

    I’ve uploaded a picture of a sample of the file below:




  • Author
  • #43631

    Carter Shanklin


    Pig might work, but I’m not an expert. As for Hive: no, don’t use that for this use case.

    Streaming refers to Hadoop streaming. It will allow you to use your .NET application in conjunction with Map/Reduce. This requires Hadoop on Windows. See http://www.nuget.org/packages/Microsoft.Hadoop.MapReduce/ for details; that’s all I know about it, unfortunately.
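
    As a rough illustration of the streaming idea: Hadoop streaming feeds input lines to a mapper program on standard input and reads tab-separated key/value lines back from its standard output, so the mapper can be any executable. The sketch below uses Python rather than .NET purely for brevity. It keys each wanted row by the 9 characters at zero-based position 2 of the latest “01” row (the API number location given elsewhere in this thread); `map_lines`, `main`, and `WANTED_PREFIXES` are hypothetical names.

```python
import sys
from typing import Iterable, Iterator, TextIO

# Row prefixes to keep: "01" starts a record, the others are examples from the question.
WANTED_PREFIXES = ("01", "02", "03", "23")


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'record_key<TAB>row' for each wanted row, keyed by the latest '01' row."""
    record_key = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("01"):
            # Assumption: 9 characters at zero-based position 2 identify the record.
            record_key = line[2:11]
        # Rows seen before any "01" row have no key and are skipped.
        if record_key is not None and line.startswith(WANTED_PREFIXES):
            yield f"{record_key}\t{line}"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Entry point a streaming mapper script would call: stdin in, stdout out."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")

# A deployed mapper script would end with a call to main().
```

    Because Hadoop may split a large file at arbitrary points, a mapper can start partway through a record. This sketch simply drops rows seen before the first “01” row, which a real job would need to handle deliberately.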


    Seth Ramey

    If it helps, here are the rows we are interested in, which I’m currently parsing with a small custom C# program. To obtain the well district, for example, I look at the row that starts with 01, go to character 14, and read 2 characters starting there. The assumption is that row positions are zero-based:

    public struct PositionLength
    {
        public int position, length;
        public string rowbeginswith;

        public PositionLength(string row, int pos, int len)
        {
            rowbeginswith = row;
            position = pos;
            length = len;
        }
    }

    static PositionLength pl_district = new PositionLength("01", 14, 2);
    static PositionLength pl_apinumber = new PositionLength("01", 2, 9);
    static PositionLength pl_total_depth = new PositionLength("01", 28, 5);
    static PositionLength pl_isplugged = new PositionLength("01", 90, 1);

    static PositionLength pl_oilgas = new PositionLength("02", 2, 1);

    static PositionLength pl_wellcompletiondate = new PositionLength("03", 39, 8);

    static PositionLength pl_lat = new PositionLength("13", 132, 9);
    static PositionLength pl_lng = new PositionLength("13", 142, 9);

    static PositionLength pl_oilnumber = new PositionLength("23", 50, 6);
    static PositionLength pl_gasnumber = new PositionLength("23", 56, 6);
    static PositionLength pl_operator = new PositionLength("23", 11, 6);
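
    The same position/length table can be expressed in other languages. Below is a hedged Python sketch of how such a table might be applied to the rows of one record; `FieldSpec`, `FIELDS`, and `extract_fields` are illustrative names, only a subset of the specs above is included, and the sample rows are made up so that the offsets line up.

```python
from typing import Dict, Iterable, NamedTuple


class FieldSpec(NamedTuple):
    row_begins_with: str
    position: int  # zero-based, as assumed in the post
    length: int


# Subset of the specs listed in the post; the dictionary keys are illustrative.
FIELDS: Dict[str, FieldSpec] = {
    "district": FieldSpec("01", 14, 2),
    "api_number": FieldSpec("01", 2, 9),
    "oil_gas": FieldSpec("02", 2, 1),
    "operator": FieldSpec("23", 11, 6),
}


def extract_fields(record_rows: Iterable[str]) -> Dict[str, str]:
    """Apply every matching spec to each row of one record."""
    values: Dict[str, str] = {}
    for row in record_rows:
        for name, spec in FIELDS.items():
            if row.startswith(spec.row_begins_with):
                end = spec.position + spec.length
                values[name] = row[spec.position:end].strip()
    return values


# Made-up rows, padded so the offsets line up; not real data.
record = [
    "01" + "123456789" + "XXX" + "08" + "rest",
    "02" + "G",
    "23" + "." * 9 + "OPER01",
]
print(extract_fields(record))
```

    If a record contains several rows with the same prefix, later rows overwrite earlier values in this sketch; collecting them into lists would be one alternative.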
