Skip to content

library that enables reading COBOL sequential files into the Apache Arrow in-memory data format

License

Notifications You must be signed in to change notification settings

navapbc/cobol-sequential

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cobol Sequential

License

The Cobol Sequential project aims to provide a library for reading COBOL sequential files into the Apache Arrow In-Memory data format. This will make it easier to work with data from legacy COBOL systems in a modern data format.

Cobol Sequential File Format

The Cobol Sequential File Format is made up of records that are meant to be read in sequential order.

File Attributes:

  • A file can contain more than one type of record.
    • Some additional record types may contain meta-data related to a file.
    • Some record types may form a parent and child relationship.
  • Often times files may use a mainframe file encoding such as EBCDIC.

Record Attributes:

  • Each record is made up of several fields with strict data types that are defined by a COBOL Copybook.
  • A record can have a variable or fixed length.
  • A record could have a byte-prefix.
  • A record may or may not be followed by a delimiter depending on if it's fixed or variable length.

There can be multiple variations of the attributes listed above that impact how a file is structured. You can find more detailed documentation on the file format here.

Why Apache Arrow?

Apache Arrow provides a columnar In-Memory data format that is typically used for Analytical based workloads. But it is also designed to be highly interoperable which will help make data that was once difficult to reach compatible with some of the most modern data tools around.

Once you have data in the Apache Arrow format you can:

  • Convert the data to other common formats such as CSV, Parquet, ORC, or Arrow IPC.
  • Share the data across programming languages or processes.
  • Write the data directly over the network via Apache Arrow Flight.
  • Perform computations through the pandas library.
  • Perform computations over large files with the Arrow Compute library.
  • Perform computations over large collections of files with the Arrow DataSet library.

Although, Apache Arrow may not be the final destination for your data, it can help you get there in a way that leaves room for your project to pivot as needs change.

Project Status

Currently the project is developing an early proto-type of copybook parser in rust.

After a basic copybook parser is implemented we will be able to start building the sequential file reader.

Contributing

See the contributing guidelines

This is an open project so feel free to report issues, propose new features, or open pull-requests.

License

see the Apache 2.0 license here

About

library that enables reading COBOL sequential files into the Apache Arrow in-memory data format

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published