Scalable COBOL Copybook Data Processing Using Cascading


No data originates within the Hadoop platform. For many enterprises, integrating Hadoop with upstream systems means the ability to parse and ingest data generated by mainframe systems.

In this tutorial, we will parse an EBCDIC-encoded file generated by a mainframe. To do that, we will implement a new Scheme to be used with our existing Cascading file tap. The structure of the data is represented by copybook text files containing COBOL code; given a copybook, it is possible to parse the records from the EBCDIC stream.
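EBCDIC is the character encoding used on IBM mainframes, and it differs byte-for-byte from ASCII, which is why raw mainframe files are unreadable on other platforms without translation. As a quick standalone illustration (independent of the tutorial code), the JVM ships with the IBM1047 EBCDIC charset, so decoding EBCDIC text is a one-liner:

```java
import java.nio.charset.Charset;

public class EbcdicDemo {
    public static void main(String[] args) {
        // The bytes below are the word "HELLO" in EBCDIC code page IBM1047.
        byte[] ebcdic = { (byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6 };
        // Decode using the JDK's built-in EBCDIC charset.
        String text = new String(ebcdic, Charset.forName("IBM1047"));
        System.out.println(text); // prints HELLO
    }
}
```

Character data is only part of the story, though: copybook records also contain binary numeric fields, which is why a full parser (as developed below) is needed rather than a simple charset conversion.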

The mainframe input data used in this tutorial was generated randomly by another program. It is not actual customer data.

In this tutorial, we will develop a parser for translating one specific set of copybooks. Next, we will use the parser in a Cascading application to read sample data and display its contents on the screen. Along the way, we will also view the application in Driven, which can offer unique insight into the application’s performance.

Getting Started

Step 1: Install the prerequisites

Please follow the instructions on this page to install the prerequisite software needed to compile and run this tutorial’s code.

Step 2: Compile your program

$ cd cascading-copybook
$ gradle clean fatjar

Step 3: Run your program in one of two ways:

  • on Hadoop:

    $ hadoop dfs -mkdir /tmp
    $ hadoop dfs -copyFromLocal data/ZOS.FCUSTDAT.RDW.bin /tmp
    $ hadoop jar ./build/libs/cascading-copybook-fat.jar /tmp/ZOS.FCUSTDAT.RDW.bin output/custdat.csv
  • or, in local mode:

    $ java -cp ./build/libs/cascading-copybook-fat.jar app.Main data/sample.dat

Step 4: If you ran the program in local mode, verify the output on the screen. The randomly generated sample data file is read and its contents are printed:

-6683146426114,%qU`!wOBsbE0'D |u*C3,-19679,0,-687,-0.04,-884,0.00
662321675420,'V !+XpdZt,vKn=^":"E,59304,0,491,0.00,-238,-0.01
8744669117912,=cL2n&szr7H8(0# <*V$,-93071,0,-785,-0.05,-893,0.04

Step 5: If you ran the application using Hadoop, view the execution of your program through Driven.

Depending on how you configured your Driven Plugin, either click the Driven URL from your console or log into the Driven application.

14/08/28 12:01:53 INFO state.AppStats: shutdown hook finished.
14/08/28 12:01:53 INFO rest.DrivenDocumentService: **
14/08/28 12:01:53 INFO rest.DrivenDocumentService: plugin version 1.2-eap-5
Application view in Driven


You can also use this live link to view the application in Driven.

Code Details

Solution Architecture and End-to-End Flow

This tutorial illustrates how to use Cascading to read data generated by mainframes. The application works with offline data: it does not connect directly to a mainframe computer, but assumes that the data has been exported from the mainframe to the platform where the Cascading jobs are executed. The following diagram shows in yellow the typical steps a mainframe operator performs to prepare the data for processing by Cascading.

Data Preparation Steps


Discussion about Mainframe Data Format

The code for this tutorial can be divided into two broad components:

  1. A set of classes that handle the low-level parsing of copybooks.

  2. Classes specific to Cascading (scheme, fields, etc.) that use the parser classes.

The classes responsible for the low-level translation were generated offline using LegStar and were imported into the project for completeness. In production applications, these classes would typically be imported as a jar dependency.
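To see why generated low-level code is needed at all: mainframe records typically store numbers in binary formats such as COMP-3 (packed decimal), where each byte holds two decimal digits and the final nibble carries the sign. The LegStar-generated classes take care of these conversions for us; purely as an illustration of what such a conversion involves (this class is not part of the tutorial code), a standalone packed-decimal decoder might look like:

```java
import java.math.BigDecimal;

public class PackedDecimal {
    // Decodes a COBOL COMP-3 (packed decimal) field: two decimal digits
    // per byte, with the sign in the low nibble of the last byte
    // (0xC = positive, 0xD = negative, 0xF = unsigned).
    static BigDecimal decode(byte[] packed, int scale) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < packed.length; i++) {
            int hi = (packed[i] >> 4) & 0x0F;
            int lo = packed[i] & 0x0F;
            digits.append(hi);
            if (i < packed.length - 1) {
                digits.append(lo);            // ordinary digit byte
            } else if (lo == 0x0D) {
                digits.insert(0, '-');        // sign nibble: negative
            }
        }
        // Apply the implied decimal point from the copybook PIC clause.
        return new BigDecimal(digits.toString()).movePointLeft(scale);
    }

    public static void main(String[] args) {
        // 0x12 0x34 0x5C encodes +12345; with an implied scale of 2 this is 123.45
        System.out.println(decode(new byte[]{ 0x12, 0x34, 0x5C }, 2)); // prints 123.45
    }
}
```

A field declared `PIC S9(3)V99 COMP-3` in a copybook, for instance, would be decoded with `scale = 2`.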

Let’s first see how we generated the parsers using LegStar.

Parser Code Generation

Step 1: Translate COBOL copybooks to XML schemas and Java beans

As explained earlier, the structure of the data is represented by COBOL copybooks, and these structures must be translated into Java classes. For this tutorial, we use the open source LegStar project to perform the translation. LegStar is a set of development tools and execution runtimes aimed at integrating legacy COBOL applications with Java and SOA technologies.

LegStar’s cob2trans utility can translate COBOL copybooks into XML Schemas. The generated XML schemas contain COBOL annotations that keep back references from each XML element to the originating COBOL item. The COBOL-annotated XML schemas are then processed via JAXB. LegStar has a JAXB plugin that propagates the original COBOL annotations to the Java beans that JAXB generates. Once the JAXB classes are produced, cob2trans invokes the Java compiler and then creates an additional set of classes called binding classes. These classes are key to the runtime performance of the COBOL-to-Java transformation: they avoid the cost of reflection at runtime.

We have done this step offline, but have provided the resulting Java bean classes in the beans subpackage.

Step 2: Translate XML Schemas to Cascading Fields

Cascading models the data stream as a series of records, with each record containing one or more Fields. You can think of Fields as the columns in a database table. In step 1 we parsed the copybooks and created XML schemas and Java bean classes from them. We now need to convert the COBOL-annotated XML schemas into Cascading Field classes. The source code for the translator classes is in the "translate" package. We invoked the Cob2Fields translator and placed the resulting Field classes in the fields subpackage.

For instance, here’s one of the copybooks converted offline to a Fields class:

    public class Field4 extends Fields
      {
      private static final long serialVersionUID = -1L;

      public Field4()
        {
        super(
          new Comparable[]{
            "Ogp03Earner",
            "Ogp03TaxcertHeld",
            "Ogp03VatMarker",
            "Ogp03PartiesToAccount",
            "Ogp03IntCertIss",
            "Ogp03OresCode"
          }, new Type[]{
            short.class,
            java.lang.String.class,
            short.class,
            short.class,
            short.class,
            short.class
          } );
        }
      }

Now that we have the generated code for low-level parsing of the copybooks, let’s use it to build a Cascading Scheme. In a typical application, the code generated in the steps performed so far would be bundled as a jar, and the application building and using the scheme would depend on it. For the purposes of this tutorial, to keep things simple, we have provided the classes in source form.

Cascading Scheme and Client Code

Step 1: Create a Cascading Scheme

A Scheme in Cascading represents the format of the data an application is trying to read or write. Given that we are able to parse the COBOL copybooks and translate them into Cascading Field classes, we are now in a position to develop our scheme.

Let’s examine the constructor of the Scheme class:

    private static final CopybookConfig COPYBOOK_CONFIG = new CopybookConfig();

    public Bdfo27Scheme()
      super( Fields.merge(
      new Fields( "BdfoKey" ),
      Fields.merge( COPYBOOK_CONFIG.getFields().values()
        .toArray( new Fields[ COPYBOOK_CONFIG.getFields().size() ] ) ) ) );

In the code shown above, we first instantiate a helper class, CopybookConfig, which is a container for all the beans and fields specific to our copybooks. For a different copybook, the generated bean and field classes will differ, and this container will hold other beans; you can easily modify this class for your particular use case. In the scheme’s constructor we use the CopybookConfig object to discover the fields specific to this copybook and append them to the account key field.

The main processing logic of the CopybookScheme is encapsulated in its source method, which accepts one input record at a time and converts it to a Cascading Tuple instance.
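Before a record can be converted, it has to be carved out of the byte stream. The "RDW" in the sample file name (ZOS.FCUSTDAT.RDW.bin) refers to the record descriptor word that z/OS prepends to each variable-length record: a 4-byte prefix whose first two bytes hold the record length (big-endian, inclusive of the prefix itself), followed by two reserved bytes. A minimal standalone sketch of splitting such a stream into records, independent of the tutorial’s Cascading code (the class and method names here are illustrative, not from the project):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class RdwReader {
    // Splits a z/OS variable-length record stream on its RDW prefixes.
    static List<byte[]> readRecords(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            int hi = din.read();
            if (hi < 0) break;                    // clean end of stream
            int length = (hi << 8) | din.read();  // big-endian length, incl. the 4-byte RDW
            din.skipBytes(2);                     // reserved/zero bytes of the RDW
            byte[] payload = new byte[length - 4];
            din.readFully(payload);               // the record body itself
            records.add(payload);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Two toy records, "AB" and "XYZ", each preceded by its RDW.
        byte[] stream = {
            0x00, 0x06, 0x00, 0x00, 'A', 'B',
            0x00, 0x07, 0x00, 0x00, 'X', 'Y', 'Z'
        };
        List<byte[]> recs = readRecords(new ByteArrayInputStream(stream));
        System.out.println(recs.size());             // prints 2
        System.out.println(new String(recs.get(1))); // prints XYZ
    }
}
```

Each payload extracted this way is one copybook record, ready to be handed to the generated binding classes and emitted as a Tuple.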

Step 2: Use the Scheme to Read Mainframe Data and Display Results

Now that we have the parser and the scheme, we are ready to wire everything together and create a simple app that reads sample EBCDIC-encoded data and prints the values of some of its fields on the screen. Let’s take a look at the source code of the class app.Main.

In Cascading, data connectivity is provided by Taps. A Tap can read or write data according to the Scheme it is bound to. With the CopybookScheme, we first create an input tap to read the EBCDIC-encoded copybook data:

    String path = args[ 0 ];
    Tap<Properties, InputStream, OutputStream> inTap = new FileTap(new CopybookScheme(), path );

Next, we create an output tap to print the values of selected fields on the screen. The sample data contains many fields, but we will display only a few of them. The fields of interest are passed to the constructor of the TextDelimited scheme, which writes fields separated by a delimiter character; here we use a comma. Note that while we used the CopybookScheme to read the data, we use the built-in Cascading scheme TextDelimited for output.

    SinkTap<Properties, OutputStream> outTap =
             new StdOutTap(new TextDelimited( new Fields( "Key", "Sname",
            "BicIndclass", "TransfFromSortCode",
            "BalIdent_0", "Bal_0",
            "BalIdent_1", "Bal_1" ), true, "," ) );

Finally, we connect these two taps using a copy pipe, create the flow and execute it:

    Pipe copyPipe = new Pipe( "testPipe" );
    FlowDef flowDef = FlowDef.flowDef().addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap )
      .setDebugLevel( DebugLevel.VERBOSE );

    FlowConnector flowConnector = new LocalFlowConnector();
    flowConnector.connect( flowDef ).complete();

Executing this flow reads the randomly generated data file and prints the contents of some of its fields on the screen: the account key, followed by some transaction details. While we only print field values here, once you can connect to the EBCDIC data as shown in this tutorial, you can proceed to more complex data manipulation. Cascading offers a wealth of built-in data processing primitives, such as joins and group-by, with which you can express custom data processing logic.

What’s next?

This tutorial was a quick introduction to the world of mainframe data, showing how you can process EBCDIC data using a robust and scalable framework like Cascading. Using the LegStar tools, you can generate parser code and then develop a Cascading scheme for your own copybook formats.

To understand what you can do next after ingesting the data, we encourage you to try out the ETL tutorial.