An old-school technique for new-school big data

Use this obscure technique to realize the benefits of big data while avoiding flat file bottlenecks.


By compressing the file and streaming the result from source to target, you use less processing power, less memory, and less disk space. And you’ll get the same job done in a fraction of the time.

Sometimes the best solutions to new problems come from using time-tested techniques in a different context. This can be refreshing for developers using traditional techniques, a revelation for younger developers, and potentially eye-opening for both, especially when it comes to big data.

Business users have high expectations of big data. They expect immediate access to it, which can strain existing processes. Extract, transform, and load (ETL) tools need to stream real-time data efficiently. Bulk ETL operations frequently process too much information at once and cannot keep up with incoming data. Out-of-the-box approaches sometimes require too much network bandwidth, memory, processing power, and staff time. Additional techniques can help.

Alternative to CDC

Other legacy data integration techniques can excel when it comes to big data. For instance, change data capture (CDC) flags only changed or updated data. As a result, data can move in smaller, more efficient batches, allowing more frequent queries. But there is a catch with CDC: You cannot track records inserted into flat files.
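
To make the CDC idea concrete, here is a minimal Python sketch of one common flavor, timestamp-based change capture. The function name changed_since, the updated_at field, and the sample rows are hypothetical and only illustrate the principle. A plain flat file typically carries no such change marker, which is the catch noted above.

```python
from datetime import datetime


def changed_since(rows, last_sync):
    """Return only the rows modified after the previous extract."""
    return [row for row in rows if row["updated_at"] > last_sync]


# Hypothetical source rows, each carrying a last-modified timestamp.
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 14, 30)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 8, 15)},
]

# Only rows changed after the last sync move to the target.
last_sync = datetime(2024, 1, 2, 0, 0)
batch = changed_since(rows, last_sync)
```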

So how can you efficiently process flat files and avoid intake bottlenecks? The trick is to shrink the intake flat file and stream it by running it through a compression command, such as the Linux compress command. If you are dealing with sensitive data, you can add an encryption command here as well. A shell script on the source side writes the compressed output to a named pipe, and the target side reads it from a named pipe as well. Once decrypted and decompressed, an otherwise enormous flat file can be streamed and processed efficiently.
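
The flow described above can be sketched in Python. This is a minimal illustration under several assumptions, not the script the article refers to: it uses Python's gzip module in place of the compress command, runs both ends on one machine with a thread standing in for the source system, and leaves out the encryption step. The names send_compressed, receive_records, stream_flat_file, and intake.pipe are hypothetical. It needs a Unix-like system, because os.mkfifo is not available on Windows.

```python
import gzip
import os
import tempfile
import threading


def send_compressed(source_path, pipe_path):
    """Source side: compress the flat file and write it into the named pipe."""
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(source_path, "rb") as src, gzip.open(pipe_path, "wb") as pipe:
        # Copy in fixed-size chunks so the whole file is never held in memory.
        for chunk in iter(lambda: src.read(64 * 1024), b""):
            pipe.write(chunk)


def receive_records(pipe_path):
    """Target side: decompress the stream and yield one record (line) at a time."""
    with gzip.open(pipe_path, "rt", encoding="utf-8") as pipe:
        for line in pipe:
            yield line.rstrip("\n")


def stream_flat_file(source_path):
    """Run both ends locally for demonstration and return the received records."""
    workdir = tempfile.mkdtemp()
    pipe_path = os.path.join(workdir, "intake.pipe")  # hypothetical pipe name
    os.mkfifo(pipe_path)
    writer = threading.Thread(target=send_compressed, args=(source_path, pipe_path))
    writer.start()
    try:
        # Collected into a list only to keep the demo simple.
        records = list(receive_records(pipe_path))
    finally:
        writer.join()
        os.remove(pipe_path)
        os.rmdir(workdir)
    return records
```

Calling stream_flat_file on a flat file returns its lines as a list. In practice the target would more likely process each record as it arrives from receive_records instead of collecting them, so that neither side has to hold the full file in memory or write an uncompressed copy to disk.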

Show me the benefits

This technique can be used, for example, to negotiate the best contract with a vendor. Various business units at a company may buy the same product from the same vendor under different contract terms. By extracting business units’ purchase orders from the flat files, the business can identify where cost savings can be realized. They can then renegotiate contract terms and potentially save millions of dollars.

With this technique, you get the performance, efficiency, scalability, and speed that CDC affords by:

  • Avoiding downtime by making changes to enterprise data while your operational systems are running
  • Providing timely data more often
  • Reducing software, hardware, and personnel costs because you are moving only the changed data

To put it simply, compressing and streaming the file relieves the intake bottleneck. As a result, you use less processing power, less memory, and less disk space, and you get the job done in a fraction of the time. Most importantly, you can analyze otherwise inaccessible data, which is where the real value of big data lies.

For more on using alternative techniques to analyze real-time data, read last edition’s Potential at Work article, “Turn to CDC for real-time data.”
