Big data is a complex subject with many underlying technologies and principles. It can be difficult for nontechnical business users to talk about big data basics without understanding some of these terms. As you get started on your big data journey, here is a glossary to help you unpack some of the key ideas behind this exciting technology discipline.
Analytics explores large volumes of data for new insights. It comes in various forms: Behavioral analytics identifies patterns in people’s actions, while clickstream analytics explores their activity on a website. Predictive analytics uses historical information to forecast future trends, while location analytics overlays geographic mapping on other kinds of data. Text analytics examines written content for meaning, which can include sentiment.
Much of the data used for big data analytics has personally identifiable information (PII) attached to it. Anonymization strips this data away so that big data scientists can identify trends in it without violating individual privacy.
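As a minimal sketch of the idea (the field names, salt, and hashing choice are illustrative assumptions, not from any particular tool), anonymization might replace direct identifiers with irreversible tokens while leaving the analytic fields intact:

```python
import hashlib

# Fields treated as personally identifiable information (illustrative choice)
PII_FIELDS = {"name", "email"}

def anonymize(record, salt="s3cret"):
    """Replace PII fields with a salted one-way hash; keep analytic fields."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            out[key] = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

customer = {"name": "Ada Lovelace", "email": "ada@example.com", "age": 36, "city": "London"}
anon = anonymize(customer)
```

Strictly speaking, salted hashing is pseudonymization rather than full anonymization: whoever holds the salt can reverse the mapping, which is exactly the re-identification scenario described later in this glossary.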
Modern sources generate unprecedented amounts of data. These data sets are now so large and varied that traditional data processing techniques, such as relational databases and data warehouses, can’t handle them. This data is known as big data. Big data is commonly characterized by the “three Vs”: volume (the sheer amount of data), velocity (the speed at which it arrives), and variety (the range of formats it takes).
Data-in-motion is data that is traveling over any kind of network. Also known as data flows, it contrasts with data at rest, which is stored somewhere such as in a database. Data-in-motion often has its own management, security, and encryption requirements.
Data mining algorithms find hidden patterns in data. This process forms the basis of many analytics solutions.
Data science uses interdisciplinary methods, processes, and systems to extract insights from data in its various forms, either structured or unstructured.
A graph database is a type of database used for understanding relationships in large sets of highly connected data. It can be used for a range of applications that traditional relational databases handle poorly, such as mapping relationships between people or managing geographic data.
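The core idea can be illustrated without any real graph database: store each record's relationships directly as an adjacency list and traverse them, rather than reconstructing connections through joins. The data and function names below are invented for illustration.

```python
from collections import deque

# A tiny social graph stored as an adjacency list (illustrative data)
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph, start):
    """Breadth-first traversal: everyone reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

connections = reachable(follows, "alice")
```

A relational database would need a self-join per hop to answer the same question; graph databases make this kind of traversal a first-class operation.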
Originally developed at Yahoo, Hadoop is an open source distributed data management platform that is now managed by the Apache Software Foundation. It excels at storing and processing very large volumes of data by spreading them across many computers.
Apache Hive is a data warehouse system that enables easy data summarization and ad hoc queries over large stored data sets through an SQL-like interface. Hive supports three execution engines: MapReduce, Tez, and Spark. It makes Hadoop data look like relational database tables, which gives SQL developers an easy entry point into Hadoop; it is widely deployed and is often part of an organization’s first Hadoop use case.
Unlike traditional databases, which pull information from hard drives, in-memory databases manipulate their data in the computer’s main memory. This makes them far faster at processing data.
The Internet of Things (IoT) is the generic name for everyday physical objects that are connected to the internet, ranging from household appliances to automobiles and weather sensors. These devices often generate streams of semi-structured data that we can use to better understand activities and patterns in the physical world.
An advanced form of statistical analysis that allows computers to make decisions without being explicitly programmed for them, machine learning uses algorithms such as neural networks to analyze large data sets and find common patterns in a process called training. It then applies what it has learned to new data sets in a process called inference.
Machine learning algorithms can constantly learn from new data, refining their accuracy over time. Machine learning’s more sophisticated sibling, deep learning, uses more layers in neural networks to achieve more accurate results.
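The training/inference split described above can be shown with the simplest possible learner: a one-variable linear fit computed by least squares. This is deliberately far simpler than a neural network, and the numbers are invented for illustration, but the pattern is the same: learn parameters from historical data, then apply them to unseen inputs.

```python
# "Training": fit y ~ slope*x + intercept from historical data (toy example)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]          # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# "Inference": apply the learned parameters to an input the model never saw
def predict(x):
    return slope * x + intercept

estimate = predict(5.0)
```

Feeding the model more historical pairs and recomputing the fit is the least-squares analogue of “constantly learning from new data.”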
This is the computing model that Hadoop uses to process large volumes of data. MapReduce is mainly focused on batch processing. Mapping the job divides it into many parts for different computing nodes to process. Reducing it aggregates all their results into a single answer.
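The classic introductory example is word counting. The sketch below runs in a single process, so it only mimics the shape of the model (real MapReduce distributes the chunks across nodes), but the two phases are exactly as described: map emits intermediate pairs, reduce aggregates them.

```python
from collections import defaultdict

def map_phase(chunk):
    """Each mapper emits a (word, 1) pair for every word in its chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """The reducer sums the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big", "data big"]          # input split across "nodes"
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
```

In Hadoop proper, a shuffle step between the two phases routes all pairs for the same word to the same reducer; here the single `reduce_phase` call stands in for that.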
NoSQL is the name for a class of database systems that doesn’t use the incumbent relational database model. These databases are used for larger, quickly evolving data sets. Their schemas (the structure of their stored data) can be easily changed, whereas the schemas in relational databases are more rigid and unwieldy. NoSQL databases typically don’t use the traditional structured query language (SQL) for data queries.
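The schema flexibility is the key contrast, and a document-style store can be mimicked with plain dictionaries (the records and the toy `find` helper below are illustrative, not any real database’s API):

```python
# A document collection holds records with differing shapes side by side;
# adding a field needs no ALTER TABLE or migration (illustrative documents)
collection = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL", "FLOW-MATIC"]},
]

def find(coll, **criteria):
    """Toy query: return documents matching all given key/value pairs."""
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in criteria.items())]

matches = find(collection, name="Grace")
```

A relational table would force both rows into one fixed set of columns; here each document simply carries whatever fields it has.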
In serial processing, one computer processes all of the data, which creates a bottleneck for large jobs. Often, big data jobs can be broken down into multiple parts, each of which is given to a separate computer to process. These computers can collectively complete the job much faster by processing the parts in parallel.
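The split-then-aggregate pattern can be sketched with Python’s standard thread pool (standing in for separate machines; the workload here is invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def process_part(numbers):
    """Work done independently by one worker on its share of the data."""
    return sum(n * n for n in numbers)

data = list(range(1_000))
parts = [data[i::4] for i in range(4)]         # split the job into four parts

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_part, parts))

total = sum(partial_results)                   # aggregate the partial answers
```

The aggregation step at the end is the same role the reduce phase plays in MapReduce, just at a much smaller scale.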
R is a statistical programming language used to query large data sets. It is the lingua franca for many big data scientists.
Re-identification is the act of putting personally identifiable information back into an anonymized big data set so that it can be used operationally.
Semi-structured data often comes from sources such as log files or IoT sensor streams. It does not conform to the formal structure of data models, but it contains tags or other markers that separate elements and impose hierarchies of records and fields within the data.
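A log line is a good example: it is not a database row, but its embedded `key=value` markers let a parser recover structure. The line format below is a made-up illustration, not a specific product’s log format.

```python
import re

# A hypothetical log line: free-form text with embedded key=value tags
line = '2024-05-01T12:00:00Z level=ERROR service=payments msg="card declined"'

timestamp, rest = line.split(" ", 1)
fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', rest))
record = {"timestamp": timestamp,
          **{k: v.strip('"') for k, v in fields.items()}}
```

Once parsed into records like this, semi-structured input can feed the same analytics pipelines as fully structured data.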
Big data analysis is often conducted on static data that has been collected previously. This is known as batch processing. In many cases, though, data scientists need to analyze data from a constantly updated feed in real or near-real time. Examples of streaming data sources include online player interactions for gaming companies, information from industrial sensors, and real-time changes in stock market prices.
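Streaming computation differs from batch in that each statistic is updated per event rather than recomputed over a stored history. A generator makes a convenient stand-in for a live feed (the prices and function names below are illustrative):

```python
def price_stream():
    """Stand-in for a live feed; in practice this would be a socket or queue."""
    for price in [101.0, 102.5, 99.8, 103.1]:
        yield price

def running_average(stream):
    """Update the statistic per event, without storing the full history."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count

averages = list(running_average(price_stream()))
```

A batch job would instead collect all four prices first and compute one average at the end; the streaming version has an up-to-date answer after every event.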
Structured data conforms to a predetermined format. A customer record may contain names, addresses, telephone numbers, and credit card details, and each of these fields will have clearly defined data types and characteristics. Unstructured data is everything else: all the web pages, social media posts, Word documents, images, video, and PowerPoint slides that can be found in every organization. It can be difficult and time-consuming to harvest intelligence from these documents using big data algorithms, but the rewards can be immense.
YARN (Yet Another Resource Negotiator) is Hadoop’s resource management and job scheduling layer. It allocates cluster resources among applications, letting the Apache Hadoop ecosystem run multiple processing engines against the same data while extending consistent security, governance, and operations across the platform. YARN is a cornerstone of Hadoop’s mainstream adoption by enterprises of all types and sizes for production use cases at scale.
As you map out or expand your big data strategy, knowing these big data basics will help you be successful.
To learn more about how to advance your big data strategy, check out this big data scorecard tool.