Informatica's vision for an "intelligent data platform" maps out the next generation of products and solutions that will move our customers confidently into a universe fueled by data. By providing a "data highway" that connects all people, places, and things in an increasingly data-centric world, an intelligent data platform will help increase the efficiency of every individual, process, and application in organizations. The power of our vision is illustrated in three current initiatives under development:
- Self-service data, code-named "Springbok"
- Data-centric security, which we're calling "Secure@Source"
- A managed data lake solution
Springbok self-service data
It's commonly accepted that approximately 80 percent of the data analytics process involves preparing data for use by locating it, cleansing it, and standardizing it. The usual approach to creating or shaping a new dataset requires a highly iterative process between business and IT – a process that is time-consuming and frustrating for both.
Our vision is to solve this by empowering and enabling non-technical users to easily find data and then have the system guide them through the process of enriching and shaping their own data. They won't need deep technical skills or prolonged back-and-forth with IT.
Springbok—part of the data intelligence layer of our intelligent data platform vision—will deliver benefits to both business users and IT professionals.
Benefits for business users: Users will be guided through the data preparation process in a self-service manner via intelligent recommendations or suggestions based on the specific data they're using. They will be able to connect to both internal and external sources and export data to their next-generation BI tool of choice.
Automatic data suggestions will ensure the data is:
- High quality – Users will be able to rapidly cleanse data themselves
- Enriched – Data will automatically be enriched
- Relevant – Relevant data will automatically surface
- Complete – The solution will automatically suggest additional datasets
- Combined – The solution will automatically combine relevant datasets
- Trusted – Key influencers will be able to rate their confidence in the data via social collaboration
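As a rough illustration of how suggestions like those above might be generated, the sketch below profiles a single column and derives preparation hints from null rates and value patterns. This is a minimal assumption-laden example in plain Python, not Springbok's actual implementation; all function names here are hypothetical.

```python
import re

def profile_column(name, values):
    """Compute a tiny profile for one column: null rate and dominant pattern."""
    non_null = [v for v in values if v not in (None, "", "N/A")]
    null_rate = 1 - len(non_null) / len(values)
    # Classify each value by a coarse character-class pattern (digits -> 9, letters -> A).
    patterns = {}
    for v in non_null:
        pat = re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        patterns[pat] = patterns.get(pat, 0) + 1
    dominant = max(patterns, key=patterns.get) if patterns else None
    return {"name": name, "null_rate": null_rate,
            "dominant_pattern": dominant, "patterns": patterns}

def suggest_fixes(profile):
    """Turn the profile into human-readable preparation suggestions."""
    suggestions = []
    if profile["null_rate"] > 0.1:
        suggestions.append(f"Fill or drop missing values in '{profile['name']}'")
    outliers = {p: n for p, n in profile["patterns"].items()
                if p != profile["dominant_pattern"]}
    if outliers:
        suggestions.append(
            f"Standardize '{profile['name']}' to pattern "
            f"{profile['dominant_pattern']} ({sum(outliers.values())} values differ)")
    return suggestions

phones = ["555-1234", "555-9876", "5551234", None, "555-4321"]
print(suggest_fixes(profile_column("phone", phones)))
```

A real system would combine many such profilers with enrichment lookups and collaborative signals, but the shape is the same: profile the user's specific data, then surface targeted suggestions.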
Benefits for IT professionals: With Springbok, IT professionals will no longer need to respond on a request-by-request basis. Instead, they can proactively begin to understand user demand based on what data business users access. And whatever IT professionals do build becomes part of the set of building blocks for users to reuse when creating new datasets.
The Springbok self-service data solution will enable IT professionals to:
- Understand dataset evolution
- Find key data influencers
- Find key external sources
- Anticipate user data requests
- Dynamically translate business user actions into PowerCenter mappings
Ultimately, Springbok will allow IT to scale to effectively provision data to the business.
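One way to picture "translating business user actions into mappings" is to record each self-service step as a declarative operation that can later be compiled into a reusable pipeline. The sketch below is purely an assumption about the approach; it does not show PowerCenter's actual mapping format, and the step schema is invented for illustration.

```python
# Record each self-service action as a declarative step, then "compile"
# the recorded steps into a reusable, replayable pipeline.
recorded_steps = [
    {"op": "filter", "column": "country", "equals": "US"},
    {"op": "rename", "from": "cust_nm", "to": "customer_name"},
]

def compile_pipeline(steps):
    """Return a function that applies the recorded steps to any row stream."""
    def run(rows):
        out = []
        for row in rows:
            row = dict(row)  # avoid mutating the caller's data
            keep = True
            for step in steps:
                if step["op"] == "filter" and row.get(step["column"]) != step["equals"]:
                    keep = False
                    break
                if step["op"] == "rename" and step["from"] in row:
                    row[step["to"]] = row.pop(step["from"])
            if keep:
                out.append(row)
        return out
    return run

pipeline = compile_pipeline(recorded_steps)
rows = [{"country": "US", "cust_nm": "Ada"}, {"country": "DE", "cust_nm": "Max"}]
print(pipeline(rows))  # only the US row survives, with the column renamed
```

Because the steps are data rather than ad hoc scripts, IT can inspect, audit, and reuse them as building blocks for other datasets.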
Secure@Source data-centric security
The majority of today's security strategies involve securing an enterprise's perimeters: data centers, applications, devices, and data. But these measures are not always effective, as evidenced by the thousands of security breaches that occur each year. Moreover, data does not stay within those boundaries; it moves constantly to fuel applications, devices, and decisions. In a data-fueled world, perimeter-based security simply no longer works.
Chief security and risk officers are challenged to properly secure sensitive data. Because they no longer know where sensitive data resides, they cannot implement a common, comprehensive security policy and enforce it wherever the data flows. The risk to organizations only increases as data propagates from the source to potentially thousands of destinations.
Secure@Source represents Informatica's vision for securing today's data. Built on our intelligent data platform vision, it will allow you to discover, locate, and tag sensitive data where it resides and then map it where it proliferates. In this way, you will be able to secure the data at the source while minimizing the risk downstream.
By implementing security at the origin of the data, before it is copied and distributed to many locations, insecure devices, and uncontrolled environments, Secure@Source will protect sensitive data while complementing existing data and network security approaches. It won't replace what you have today; instead, it will provide an additional layer of security that protects data throughout its lifecycle.
Cloud-based Secure@Source will do this by allowing you to visualize the risk, which turns the conversation from one of battling ignorance to one of preventing negligence. Planned features will allow you to:
- Create a "data risk heat map" – It will accomplish this by combining three different indexes: a data sensitivity index that identifies sensitive data from patterns across the enterprise; a data proliferation index that identifies intermediate staging and hops; and a data usage index, which shows who in the organization uses what data, how much of it, and what types of access privileges they have.
- Monitor real-time activities – Secure@Source will perform ongoing identification of data usage patterns and detect suspicious usage patterns, especially among privileged users.
- Protect based on a data risk index – Protection will be tied to compliance regulations and data governance policies and rules, with actions such as masking, alerting, blocking, encryption, and tokenization.
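As a back-of-the-envelope illustration of the heat-map idea, the sketch below combines the three indexes described above (sensitivity, proliferation, usage) into a single per-datastore risk score. The weights and formula are illustrative assumptions, not Secure@Source's actual method, and the store names are invented.

```python
from dataclasses import dataclass

@dataclass
class StoreRisk:
    name: str
    sensitivity: float    # 0..1, share of fields matching sensitive-data patterns
    proliferation: float  # 0..1, normalized count of downstream copies and hops
    usage: float          # 0..1, breadth of users and their privilege levels

def risk_score(s: StoreRisk, weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted combination of the three indexes; weights are assumptions."""
    w_sens, w_prol, w_use = weights
    return round(w_sens * s.sensitivity + w_prol * s.proliferation + w_use * s.usage, 2)

stores = [
    StoreRisk("crm_prod", sensitivity=0.9, proliferation=0.7, usage=0.6),
    StoreRisk("marketing_dw", sensitivity=0.4, proliferation=0.9, usage=0.8),
    StoreRisk("test_sandbox", sensitivity=0.1, proliferation=0.2, usage=0.3),
]
# Sort descending so the "hottest" stores surface first on the heat map.
for s in sorted(stores, key=risk_score, reverse=True):
    print(f"{s.name:14} risk={risk_score(s)}")
```

In practice each index would be computed continuously from discovery scans, lineage metadata, and access logs rather than supplied by hand.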
Managed data lake
Big Data can be a big headache for data scientists, analysts, and IT. When working with data from a machine sensor or device, or with log data from a website, data analysts typically rely on IT to set up and provide access to the data. Next, they face the time-consuming process of preparing the data for analysis, which requires more iteration with IT, especially if requirements are vaguely defined. Worse, analysts may implement their own rules or routines for enriching the data but seldom share those rules with developers or other users, resulting in compliance, audit, and standardization issues.
Part of Informatica's vision for an intelligent data platform, a managed data lake would aim to empower business users to discover and use data while simultaneously enabling IT to meet data governance and compliance mandates.
The main properties of the managed data lake solution will fall into four areas:
- Data collection – The data to be collected could be traditional transactional data, web data, social data, or Big Data like Hadoop. It could even be machine data or a web log that is collected and then persisted in some way in Hadoop or even on the Amazon EC2 cloud.
Such data could be on-boarded quickly with auto-discovery of its structure and format. It could even be a business user's own dataset, with the structure inferred by machine learning. Collection would be enabled by universal connectivity across formats, sizes, and latencies, with automated cataloging and virtually unlimited storage and processing.
- Data refinery – The data would be systematically cleansed and refined through various stages. For instance, data would first be ingested in raw format, then reside in a "sandbox," where business users could begin to shape or cleanse it for prototype reporting and scenario testing. Finally, the data could be further enhanced and cleansed until it reaches its most refined stage. This could include both manual and guided data preparation and versioning.
- Data consumption – This would include consumption by any tool or application. Here a business user could easily search for and discover data to use in a self-service manner. Other capabilities will include metadata search.
- End-to-end data lake governance – This is where capabilities like collaborative governance (Wiki-like rating, authoring, and sharing), data privacy, authorization, and resource management come into play.
Benefits of the managed data lake: The managed data lake solution will work with all types of data (unstructured, unrefined, and structured) and offer organizations:
- A persistent data layer to collect, discover, refine, govern, share securely, and control all data assets and feeds – both internal and external.
- A cost-effective, one-stop shop for data search and consumption at any latency.