Hadoop in the Enterprise
As Tom White, one of the foremost experts on Hadoop, put it in his book Hadoop: The Definitive Guide, “The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.” With Hadoop, organizations are discovering and putting into practice data analysis and mining techniques that were previously impractical for performance, cost, and technological reasons. As a result, Hadoop is an increasingly popular option for processing, storing, and analyzing huge volumes of semi-structured, unstructured, or raw data that often comes from disparate sources.
But exactly how and when do you take advantage of Hadoop?
The primary strength of Hadoop is proven, cost-effective scalability on commodity hardware. It supports the processing of all data types – structured, semi-structured, or unstructured – and its open extensibility enables developers to augment it with specialized capabilities for a broad range of applications.
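Hadoop's processing model can be illustrated with a small sketch. The mapper/reducer pair below follows the classic word-count pattern used by Hadoop Streaming jobs; the in-memory shuffle here is a stand-in assumption for the grouping and distribution Hadoop actually performs across commodity nodes:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum the partial counts grouped under one key."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate the shuffle Hadoop performs between map and reduce,
    then apply the reducer to each key's grouped values."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper(l) for l in lines):
        grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())
```

On a real cluster the same mapper and reducer logic would run in parallel across many machines, which is what makes the approach scale to huge data volumes.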
Many organizations are beginning to look at Hadoop as an extension to their existing environments to tackle the volume, velocity, and variety of Big Data. As a result, Hadoop adoption will grow: in a recent survey of large-scale data users, more than half of the respondents said they are considering Hadoop for their environments.
Data Integration and Hadoop
Hadoop does not replace existing systems. Instead, Hadoop augments them by taking on the processing of large volumes of data so existing systems can focus on what they do best. Data integration plays a key role for organizations that want to combine Hadoop with data from multiple systems to realize breakthrough business insights not otherwise possible. The Informatica Platform allows organizations to leverage Hadoop within a hybrid environment to take advantage of the unique strengths of each technology and maximize performance of the overall environment.
Using a Data Integration Platform for Hadoop
Like any emerging technology, Hadoop is not without its challenges. A comprehensive, open and unified data integration platform allows organizations to address these challenges and take full advantage of Hadoop by providing the following capabilities:
- Universal data access – Organizations use Hadoop to store and process diverse data sources and often face challenges in combining and processing all relevant data. A data integration platform makes pre- and post-processing of data into and out of Hadoop easier and more reliable.
- Data parsing and exchange – Hadoop excels at storing diverse data, but deriving meaning from it across all relevant data types is a major challenge. A data integration platform improves productivity for extracting greater value from unstructured data sources – images, text, binaries, industry standards, etc.
- Metadata management – Hadoop lacks metadata management and data auditability, without which project outcomes are suspect and may suffer from inconsistency and poor visibility. A data integration platform supplies full metadata management, with data lineage and auditability, and promotes standardization.
- Data quality and data governance – While some data in Hadoop is kept for storage or experimental tasks that do not require a high level of data quality, many organizations will use Hadoop for end-user reporting and analytics and will find it hard to trust the underlying data. A data integration platform provides capabilities to profile, cleanse, and manage data so organizations can better understand what their data means, increase trust, and manage data growth effectively and securely.
- Mixed workload management – Hadoop cannot manage mixed workloads according to user service-level agreements (SLAs). A data integration platform enables integration of data sets from Hadoop and other transactional sources for real-time business intelligence and analytics as events unfold.
- Resource optimization and reuse – Organizations will need to find and recruit Hadoop skills and create a framework to reuse and standardize data integration tasks. A data integration platform promotes reuse of IT resources across multiple projects and boosts return on investment in recruitment and training, while ensuring the availability of skills supported by the broader ecosystem.
- Interoperability with the rest of the architecture – It is challenging to rationalize Hadoop and incorporate it into the extended environment. A data integration platform’s universal data access and transformation capabilities support adding Hadoop to an end-to-end analytics and data processing cycle, bridging the gap between Hadoop and your existing IT investments.
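The data parsing point above can be made concrete with a small sketch. Assuming web-server access logs as the unstructured source (the log format and field names below are illustrative, not taken from any particular product), a pre-processing step might turn raw lines into delimited records before loading them into Hadoop:

```python
import csv
import io
import re

# Illustrative pattern for an Apache-style common log line; fields
# that fail to match are skipped rather than loaded as bad records.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Turn one raw log line into a structured dict, or None on no match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def to_delimited(lines):
    """Flatten parsed records into CSV text ready for bulk loading."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["ip", "timestamp", "request", "status", "size"]
    )
    writer.writeheader()
    for line in lines:
        record = parse_log_line(line)
        if record:  # skip unparseable lines
            writer.writerow(record)
    return out.getvalue()
```

In practice this kind of parsing would itself run inside the cluster; the point is that turning raw text into structured, auditable records is a distinct integration step, not something Hadoop storage provides on its own.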
A variety of Hadoop projects, including those requiring metadata management, mixed workloads, resource optimization, and interoperability, can benefit from a platform approach to data integration. A platform approach can help you take full advantage of the data processing power of Hadoop while exploiting the proven capabilities of an open, neutral, and complete platform for integrating data.
Informatica for Hadoop
Informatica is uniquely positioned to help you get more from your Hadoop investments and leverage existing data integration and ETL skill sets. With the Informatica Platform, you can:
- Achieve ease and reliability of pre- and post-processing of data into and out of Hadoop
- Improve productivity for extracting greater value from unstructured data sources – images, texts, binaries, industry standards, etc.
- Ensure metadata-driven auditability
- Promote governance, trust and security over siloed activities with Hadoop deployments
- Combine flexibility with high data processing power
- Manage mixed workloads and concurrency with high throughput
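As a rough illustration of what profiling the underlying data involves (the metrics chosen here are an assumption for illustration, not a description of any Informatica component), a minimal column profile might report null counts, distinct values, and the most common value per column:

```python
from collections import Counter

def profile(rows, columns):
    """Summarize completeness and cardinality for each named column
    across a list of record dicts."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v not in (None, "")]
        counts = Counter(non_null)
        report[col] = {
            "nulls": len(values) - len(non_null),   # missing or empty
            "distinct": len(counts),                # cardinality
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return report
```

A profile like this is typically the first step toward trusting data for end-user reporting: it exposes gaps and inconsistencies before cleansing rules are written.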