Wondering about the four characteristics of big data? Learn the four Vs of big data, what we consider to be the fifth V, and how Informatica solutions address no-limits big data.
Our world has never been more digitized. We are constantly bombarded by technology, in all aspects of life. Mobile phones, smart devices, social networks, sensors, streaming videos, IoT devices—all fuel the massive growth in data in recent decades.
The result is a new class of data problems categorized under the name “big data.” Nearly all organizations struggle with big data: how to manage it, analyze it, protect it, and make it available to everyone from data scientists to marketing leaders.
Successful next-generation analytics solutions require a new approach to accommodate the new environment of no-limits data, demands for no-code solutions, and enhanced operationalization while also being cloud-ready and leveraging AI/ML for automation. To improve business operations, however, it’s important to first understand the characteristics of big data.
What are the four characteristics of big data?
There are a few definitions of big data (read ours here), but it is commonly agreed that big data has these four key characteristics:
- Volume: the amount of data being generated
- Velocity: the speed at which data is being generated
- Variety: the various types of data being generated, which can largely be grouped into three categories: structured data, semi-structured data, and unstructured data
- Veracity: the trustworthiness of the data
Let’s dig deeper into the four Vs and how Informatica can help you tackle each of them.
Data is being produced at a massive scale. For example, think about how much data your mobile phone constantly generates: chats, blogs, SMS, photos/videos, web searches, streaming music, gaming, traffic data, location data, news feeds, emails, and so on. IDC predicts that by 2025 the Global Datasphere will grow to 175 zettabytes, and that nearly 30% of that data will be real-time, created in part by connected users who will have a digital interaction about once every 18 seconds.
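To put the IDC figures above in perspective, here is a small back-of-the-envelope calculation. It treats "nearly 30%" as exactly 30% and "about once every 18 seconds" as exactly 18 seconds, so the results are rough upper-bound illustrations rather than IDC numbers.

```python
# Rough arithmetic on the IDC forecast quoted above.
# Assumption: "nearly 30%" is rounded up to 30 and "about 18 seconds" to 18.
DATASPHERE_ZB = 175            # forecast size of the Global Datasphere, in zettabytes
REALTIME_PERCENT = 30          # share of that data expected to be real-time
SECONDS_PER_INTERACTION = 18   # one digital interaction per connected user
SECONDS_PER_DAY = 24 * 60 * 60

realtime_zb = DATASPHERE_ZB * REALTIME_PERCENT / 100
interactions_per_day = SECONDS_PER_DAY // SECONDS_PER_INTERACTION

print(f"Real-time data: about {realtime_zb} ZB")
print(f"Interactions per connected user per day: about {interactions_per_day}")
```

Under these assumptions, roughly 52.5 zettabytes would be real-time data, and each connected user would account for about 4,800 digital interactions per day.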
All that data does not simply sit on your phone. Instead, it travels through the Internet via your mobile network and Wi-Fi and eventually ends up with the businesses you interacted with. Companies collect and store the data in modern elastic storage platforms like Hadoop, Amazon S3, Azure, Google Cloud, and other cloud storage providers, all of which are designed to host large quantities of data efficiently and economically.
Similarly, big data engines came to life to keep pace with data growth. Computing concepts such as parallel processing, data partitioning, horizontal scaling, and pushing compute to the data are all put to work to meet the demands posed by big data.
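As a minimal sketch of how data partitioning and parallel processing fit together, the plain-Python example below splits records by a hash of their key, aggregates each partition independently, and merges the partial results. The function names are hypothetical, and the thread pool only stands in for the separate machines a real engine would scale out to.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition_records(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is stable across runs, so a given key always lands in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def sum_by_key(partition):
    """Aggregate one partition locally: the work is done where the data sits."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def parallel_sum_by_key(records, num_partitions=4):
    """Partition the records, aggregate each partition in parallel, then merge."""
    partitions = partition_records(records, num_partitions)
    # Hypothetical stand-in for a cluster: one worker thread per partition.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_totals = list(pool.map(sum_by_key, partitions))
    merged = Counter()
    for totals in partial_totals:
        merged.update(totals)  # Counter.update adds the partial sums together
    return dict(merged)


if __name__ == "__main__":
    clicks = [("phone", 3), ("tablet", 1), ("phone", 2), ("laptop", 5)]
    print(parallel_sum_by_key(clicks))
```

Because every record with the same key goes to the same partition, each worker can finish its aggregation without talking to the others. Adding partitions (and workers) is the horizontal-scaling step.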
To address the volume problem, Informatica developed the Big Data Management solution (BDM), which incorporates all the computing concepts mentioned above and runs the big data Spark engine in all Hadoop distributions. BDM enables you to process big data spanning the ingesting, transforming, cleansing, and loading phases of the data—from any source to any target, for any data type, and at any scale. In addition, we are building the next-generation platform in the cloud as an iPaaS solution called Integration at Scale. It uses the latest technology in microservices, serverless computing, Spark, and Kubernetes to take the big data solution to the cloud.
Velocity goes hand-in-hand with volume. When data is being generated at high speeds and continuously, it can accumulate rapidly, creating the volume problem. However, velocity presents another challenge that needs a different kind of solution. Much of the data generated in the modern world is in fact streaming data: log files from mobile apps, telemetry, geolocation data, social media streams, IoT device and instrumentation data, and more. Streaming data often requires immediate attention before the data loses much of its value. Here are a few streaming data examples:
- The traffic sensor data that Google Maps uses to alert the user to the best alternate route when there is an accident on the original route
- Credit card transactions that need to be constantly analyzed in real time to detect potentially fraudulent activity so the bank can proactively decline future suspicious transactions
- Election-day exit-poll tweets that provide valuable insight on early election results when analyzed in a timely fashion
Informatica’s ingestion services allow customers to collect streaming data from the edges and IoT devices and ingest the data into streaming collectors like Kafka or AWS Kinesis. The Big Data Streaming solution (BDS) takes data collected by Kafka or other streaming sources and processes it in real time to produce insights that downstream applications can use to take specific actions. Under the hood, BDS utilizes the big data Spark engine and structured streaming to enable the massive parallel processing of streaming data, in real-time, at big data scale.
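To make the idea of acting on streaming data while it is still fresh more concrete, here is a minimal plain-Python sketch inspired by the credit card example above. It is not BDS or Spark code. The function name, the 60-second window, and the threshold of three transactions are all hypothetical choices for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


def flag_suspicious(transactions, window_seconds=60, max_per_window=3):
    """Yield (card_id, timestamp) whenever a card has more than
    max_per_window transactions within the trailing window_seconds.

    Assumes transactions arrive ordered by timestamp (in seconds).
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for card_id, timestamp in transactions:
        window = recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have aged out of the trailing window.
        while window and timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_per_window:
            yield card_id, timestamp


if __name__ == "__main__":
    stream = [("card-1", 0), ("card-1", 10), ("card-1", 20),
              ("card-1", 30), ("card-2", 35), ("card-1", 100)]
    for card_id, timestamp in flag_suspicious(stream):
        print(f"Review {card_id}: burst of transactions at t={timestamp}s")
```

The generator processes one event at a time and keeps only the timestamps inside the current window, so it can flag a burst as soon as it happens instead of waiting for a batch job to run.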
Variety refers to the different types of data generated by today’s systems and applications. Big data can include:
- Structured data commonly seen in relational database systems, Hive, or flat files
- Unstructured data seen in music or video files, emails, text messages, and social media data
- Semi-structured data popularized by JSON and XML
Historically, data engines focused on optimizing for structured data processing because it was the most common form of data (especially in the transactional world). However, a much greater percentage of unstructured and semi-structured data is now being produced in social, mobile, and streaming apps. Many app-to-app communications are, in fact, done with REST and JSON.
Modern data processing engines like Informatica BDM and BDS have built-in capabilities to handle hierarchical data natively. These solutions understand the native form of the hierarchical data starting from the metadata import and discovery phases, moving into ingestion and transformation, and all the way through to the loading of the data. Both BDM and BDS can handle flat and hierarchical data simultaneously to allow the transformation of both types of data in the same processing pipeline (for example, looking up customer details in the customer table for a purchase order that arrives as JSON streaming input). Both BDM and BDS leverage Spark constructs such as RDDs, the complex types struct, map, and array, and their associated operators to process both types of data in their native form.
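The purchase order example above can be sketched in a few lines of plain Python. This is only an illustration of combining hierarchical and flat data in one step, not how BDM or BDS implement it. The field names (`order_id`, `customer_id`, `items`, `sku`, `quantity`) and the in-memory customer table are hypothetical.

```python
import json

# Hypothetical flat "customer table", keyed by customer ID.
CUSTOMERS = {
    "C-100": {"name": "Ada Example", "tier": "gold"},
}


def enrich_order(order_json, customers):
    """Parse a hierarchical JSON purchase order and return one flat row
    per line item, enriched with details from the flat customer table."""
    order = json.loads(order_json)
    customer = customers.get(order["customer_id"], {})
    rows = []
    for item in order["items"]:  # nested array inside the order document
        rows.append({
            "order_id": order["order_id"],
            "customer_id": order["customer_id"],
            "customer_name": customer.get("name"),  # None if the lookup misses
            "sku": item["sku"],
            "quantity": item["quantity"],
        })
    return rows


if __name__ == "__main__":
    incoming = json.dumps({
        "order_id": "PO-1",
        "customer_id": "C-100",
        "items": [{"sku": "A1", "quantity": 2}, {"sku": "B7", "quantity": 1}],
    })
    for row in enrich_order(incoming, CUSTOMERS):
        print(row)
```

Each nested line item becomes its own flat row, and the customer lookup happens in the same pass, which mirrors the idea of transforming both kinds of data in a single pipeline.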
Veracity refers to the quality of the data, which determines whether the results produced from it will be accurate and trustworthy. Poor data quality produces poor and inconsistent reports, so it is vital to have clean, trusted data for analytics and reporting initiatives. AI/ML models likewise depend on accurate data; without it, they produce low-quality predictions and diminish the value of machine learning.
Informatica’s BDM solution, in combination with the Informatica Data Quality and Governance portfolio, helps customers cleanse and standardize their data. Informatica Enterprise Data Catalog supports data discovery and end-to-end lineage to describe the origin and derivation of the data. Enterprise Data Catalog can also profile the data to automatically associate business semantics. Because it is part of the Informatica Intelligent Data Platform, Enterprise Data Catalog shares the same big data engine as BDM for data profiling, which helps it achieve high performance and availability.
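For readers who want a feel for what profiling and standardizing data involves, the sketch below shows both steps in plain Python. It is a generic illustration under simple assumptions, not the Informatica Data Quality or Enterprise Data Catalog implementation. The function names and the sample `email` column are hypothetical.

```python
def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or str(value).strip() == ""


def profile_column(rows, column):
    """Return basic quality metrics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if not is_missing(v)]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }


def standardize_email(value):
    """Trim whitespace and lowercase an email address; keep missing as None."""
    if is_missing(value):
        return None
    return str(value).strip().lower()


if __name__ == "__main__":
    customers = [
        {"id": 1, "email": " Ada@Example.com "},
        {"id": 2, "email": "ada@example.com"},
        {"id": 3, "email": ""},
    ]
    print("before:", profile_column(customers, "email"))
    cleaned = [{**row, "email": standardize_email(row["email"])} for row in customers]
    print("after: ", profile_column(cleaned, "email"))
```

Profiling before and after cleansing shows the effect of standardization: the two spellings of the same address collapse into one distinct value, while the blank entry is still reported as missing.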
What about Value as a big data characteristic?
Many organizations consider Value to be another big data characteristic, bringing the list up to five Vs of big data. Value corresponds to the usefulness of the data. The key lies in being able to separate and select the most relevant and appropriate data for your need from the large (and fast-moving) pool of big data. Big data has immense amounts of potential value if it can be correctly managed and shared to drive analysis, reporting, and confident decision-making.
Learn more about how to manage, use, and operationalize big data, and how Informatica can help you get the most from your fast-growing data resources.
- Read Next-Gen iPaaS For Dummies, Informatica Special Edition.
- Learn how Informatica uses ML/AI to improve productivity of big data users.
- Read our reference article for more big data basics.
- Watch our webinar for a deep dive into the Integration at Scale and Ingestion at Scale services.
- View an introduction video about Informatica Big Data Streaming.