“A good tool improves the way you work. A great tool improves the way you think.” - Jeff Duntemann
In my previous blog, I described characteristics of various data processing pipelines regardless of use case, persona, context, or data size. To support these use cases and personas, you need the right design tools: Lego-like building blocks and an easy-to-use interface for assembling those blocks into data processing pipelines.
From our extensive experience, here are the key capabilities that have proven to make data processing pipelines successful. Understanding these key capabilities makes it easier to choose the right tools to support your use case.
1. Easy to learn and build: As mentioned in my Top 10 Principles of an Intelligent Data Platform blog, data platform interfaces must be design-led and code-less. The data platform must supply Lego-like building blocks for building data processing pipelines in minutes without requiring complex programming frameworks.
2. Metadata driven: At pipeline design time, you want to explore your data sources, understand datasets, and see where else those data sources are used. Are those data sources authoritative, or do they possess other metrics-driven attributes that establish their viability? To answer these questions, you need excellent metadata. For instance, tooling integrated with an intelligent data catalog allows you to do all this very easily.
3. Comprehensive: A design tool must provide a way to create all types of data processing pipelines including integration, quality, security, B2B, and so on. Tools must support all data types and provide a contextual interface for both batch and streaming. For example, tooling for hierarchical data types must understand and provide interfaces for structures, arrays, and dictionaries. They must also provide an interface for flattening and creating hierarchies.
4. Reusability: Once you build a pipeline, wouldn’t it be great to reuse that logic easily wherever you need it? Yes, of course. Tooling must support capabilities that enable reuse of logic, like parameterization, mapplets, and dynamic mappings.
5. Designed to support both self-service and IT-driven use: A design tool must support a spreadsheet-like self-service interface for business users. A design tool must capture the operations a business user performs on data and translate those operations into a data processing pipeline that IT can operationalize.
6. Intelligent: Even visual and easy-to-use tools can become cumbersome for repetitive tasks like choosing the next transformation in the pipeline. A tool must learn to predict such next steps and recommend them to the designer.
7. Easy to test and debug: Once you’re done with development, you will need to test your logic. How do you test? The design tool must provide a way to interact with your test environment and validate results without interfering with your production environment.
How do you debug a data pipeline if you don’t get the expected results? Design tools must provide a good debugging environment, for example, the ability to preview data at any point in the pipeline.
8. Easy to deploy: When you are done with development and testing, how do you deploy? Design tools must provide a way to package up all required logic and deploy fully or incrementally to the target environment.
9. Well-abstracted: Data processing languages and frameworks evolve very fast. The data pipeline you create must be able to abstract and leverage the most performant framework without requiring you to completely rewrite your data processing pipeline logic.
10. DevOps support: The design tool must provide out-of-the-box functionality or APIs for easy integration into your DevOps infrastructure. It must integrate with source code control systems, artifact repositories, and monitoring tools.
11. Cloud-based or hybrid hosting of the tool: Are you ready to host the design environment on-premises? Does the tool provide a fully cloud-based design environment? Cloud-based SaaS design environments can significantly cut down operational overhead for you.
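Several of the capabilities above, notably flattening hierarchical data (3), reusing logic through parameterization (4), and previewing data at any point in the pipeline (7), can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: `Pipeline`, `flatten_step`, `filter_step`, and the `upto` parameter are hypothetical names invented for this example, not the API of any particular design tool.

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Step = Callable[[List[Record]], List[Record]]


def flatten(record: Record, parent: str = "", sep: str = ".") -> Record:
    """Flatten nested dictionaries into one level using dotted keys."""
    flat: Record = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def flatten_step() -> Step:
    """Building block (hypothetical): flatten every record in the batch."""
    return lambda rows: [flatten(row) for row in rows]


def filter_step(field: str, minimum: float) -> Step:
    """Parameterized block (hypothetical): reuse the same logic with other arguments."""
    return lambda rows: [row for row in rows if row.get(field, 0) >= minimum]


class Pipeline:
    """A pipeline is an ordered list of blocks applied one after another."""

    def __init__(self, *steps: Step) -> None:
        self.steps = list(steps)

    def run(self, rows: List[Record], upto: Optional[int] = None) -> List[Record]:
        """Run every step, or only the first `upto` steps to preview intermediate data."""
        out = list(rows)
        for step in self.steps[:upto]:
            out = step(out)
        return out


orders = [
    {"id": 1, "customer": {"name": "Ada", "country": "UK"}, "total": 120.0},
    {"id": 2, "customer": {"name": "Lin", "country": "SG"}, "total": 35.5},
]

pipeline = Pipeline(flatten_step(), filter_step("total", 100))
preview = pipeline.run(orders, upto=1)  # data as it looks after the flatten block
result = pipeline.run(orders)           # data after the full pipeline
```

Here `preview` holds both orders with flattened keys such as `customer.name`, while `result` keeps only the order whose `total` is at least 100. A visual design tool would hide this code behind drag-and-drop blocks, but the underlying ideas of composable, parameterized steps and inspectable intermediate results are the same.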
When you’re evaluating data platforms to support your data-driven initiative, I highly recommend you take the time to understand the design tools that support data processing pipelines. As the opening quote suggests, good tools improve the way you work and think.
Now that we’ve focused on design tools and data processing pipelines, in the next blog I will discuss data pipeline execution engines.