March 05, 2018

Big Data Basics: A Glossary of the Terminology You Should Know

Big data is a complex subject with many underlying technologies and principles. It can be difficult for nontechnical business users to talk about big data basics without understanding some of these terms. As you get started on your big data journey, here is a glossary to help you unpack some of the key ideas behind this exciting technology discipline.

Analytics

Analytics explores large volumes of data for new insights. It comes in various forms: Behavioral analytics identifies patterns in people’s actions, while clickstream analytics explores their activity on a website. Predictive analytics uses historical information to forecast future trends, while location analytics overlays geographic mapping on other kinds of data. Text analytics examines written content for meaning, which can include sentiment.

Anonymization

Much of the data used for big data analytics has personally identifiable information (PII) attached to it. Anonymization strips this data away so that big data scientists can identify trends in it without violating individual privacy.
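
As a minimal sketch of the idea, the Python snippet below removes directly identifying fields from a set of records before analysis. The field names in `PII_FIELDS` and the sample customers are hypothetical, and dropping direct identifiers is only the simplest form of anonymization; real data sets usually need more care, because the remaining fields can sometimes still point to an individual.

```python
# Hypothetical field names; a real PII list depends on the data set
# and on the privacy rules that apply to it.
PII_FIELDS = {"name", "email", "phone"}


def anonymize(records, pii_fields=PII_FIELDS):
    """Return copies of the records with the PII fields removed."""
    return [
        {key: value for key, value in record.items() if key not in pii_fields}
        for record in records
    ]


customers = [
    {"name": "Ada", "email": "ada@example.com", "city": "London", "spend": 120},
    {"name": "Bo", "email": "bo@example.com", "city": "Oslo", "spend": 80},
]

anonymized = anonymize(customers)
print(anonymized)
```

The original records are left untouched, so the identifying fields stay in the source system while analysts work on the stripped copy.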

Big Data

Modern sources generate unprecedented amounts of data. These data sets are now so large and varied that traditional data processing techniques, such as relational databases and data warehouses, can’t handle them. This data is known as big data. Big data has several characteristics, including:

  • Volume: the amount of data that is generated and stored
  • Variety: the types of data, from structured to unstructured
  • Velocity: the speed at which data arrives and must be processed

Data-in-Motion

This is the term for data that is traveling over any kind of network. Also known as data flows, it contrasts with data that is stored somewhere, such as a database. Data-in-motion often has its own management, security, and encryption requirements.

Data Mining

Data mining algorithms find hidden patterns in data. This process forms the basis of many analytics solutions.

Data Science

Data science uses interdisciplinary methods, processes, and systems to extract insights from data in its various forms, either structured or unstructured.

Graph Database

This is a type of database used for understanding relationships in large sets of highly connected data. It can be used for a range of applications that traditional relational databases aren’t suited for, such as understanding relationships between people or managing geographic data.
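
To make "relationships between people" concrete, here is a small sketch in plain Python rather than a real graph database. It stores a hypothetical "who knows whom" network as an adjacency list and uses a breadth-first search to count the hops between two people, which is the kind of question graph databases are built to answer quickly.

```python
from collections import deque

# Hypothetical "who knows whom" data, stored as an adjacency list.
KNOWS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: hops between two people, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if person == goal:
            return hops
        for neighbor in graph.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


print(degrees_of_separation(KNOWS, "carol", "erin"))
```

In a relational database, the same question would typically require joining a table to itself once per hop, which becomes unwieldy as the number of hops grows.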

Apache Hadoop

Originally created as part of the Nutch web search project and developed heavily at Yahoo, this is an open source distributed data management platform that is now managed by the Apache Software Foundation. It excels at storing and processing very large volumes of data by spreading the data and the work across many computers.

Apache Hive

Apache Hive is a data warehouse that enables easy data summarization and ad-hoc queries through an SQL-like interface for large stored data sets. Hive supports three execution engines: MapReduce, Tez, and Spark. It makes Hadoop data look just like relational database tables and is used in the majority of clusters. It provides an easy entry point into Hadoop for SQL developers and is often used in a customer’s first use case.

In-Memory Database

Unlike traditional databases, which pull information from hard drives, in-memory databases manipulate their data in the computer’s main memory. This makes them far faster at processing data.
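
One easy way to try the idea is SQLite, which ships with Python and can keep an entire database in memory. The sketch below is only an illustration with made-up sensor readings; dedicated in-memory databases add features such as persistence and clustering that this example does not show.

```python
import sqlite3

# ":memory:" tells SQLite to keep the whole database in RAM, not on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],  # hypothetical sample data
)

# Average reading per sensor, computed entirely in memory.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)
conn.close()
```

Because nothing is written to disk, the data disappears when the connection closes.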

Internet of Things (IoT)

The Internet of Things (IoT) is the generic name for everyday physical objects that are connected to the internet, ranging from household appliances to automobiles and weather sensors. These devices often generate streams of semi-structured data that we can use to better understand activities and patterns in the physical world.

Machine Learning

A technique that allows computers to make decisions without being explicitly programmed for them, machine learning uses algorithms such as neural networks to analyze large data sets and find common patterns in a process called training. It then applies what it has learned to new data in a process called inference.

Machine learning algorithms can keep learning from new data, refining their accuracy over time. Deep learning, a subset of machine learning, uses neural networks with many layers, which can produce more accurate results on complex tasks.
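
The split between training and inference can be shown with a far simpler model than a neural network. The sketch below fits a straight line to a handful of made-up points using ordinary least squares (training), then uses the fitted line to predict a value it has not seen (inference). The function names `train` and `infer` are chosen for this illustration only.

```python
def train(xs, ys):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x):
    """Inference: apply the learned parameters to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data that happens to follow y = 2x + 1.
model = train([1, 2, 3, 4], [3, 5, 7, 9])
print(infer(model, 5))
```

Retraining on a larger or newer set of points would update the slope and intercept, which is the sense in which a model refines itself as more data arrives.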

MapReduce

This is the computing model that Hadoop uses to process large volumes of data. MapReduce is mainly focused on batch processing. Mapping the job divides it into many parts for different computing nodes to process. Reducing it aggregates all their results into a single answer.
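
The classic illustration of this model is counting words. The sketch below imitates the map, group, and reduce steps in a single Python process; it is not Hadoop code, and the function names are chosen for this example. In a real cluster, each document (or block of a file) would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one word, both steps can be spread across many machines without the pieces needing to talk to each other.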

NoSQL

This is a class of database systems that doesn’t use the incumbent relational database model. These databases are used for larger, quickly evolving data sets. Their schemas (the structure of their stored data) can be easily changed, whereas the schemas in relational databases are more rigid. Many NoSQL databases don’t use the traditional structured query language (SQL) for data queries, although some offer SQL-like query languages.

Parallel Processing

In serial processing, one computer processes data. This creates a bottleneck for large data jobs. Often, these big data jobs can be broken down into multiple parts, each of which can be given to a separate computer to process. These computers can collectively complete the job much faster by processing them in parallel.
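
The pattern of splitting a job, processing the parts side by side, and combining the results can be sketched on one machine. The example below uses Python's standard thread pool as a stand-in for separate computers; the job (a sum of squares) and the function names are made up for illustration, and a real big data system would distribute the chunks across a cluster instead.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Break one big job into smaller parts."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """The work each worker does on its own part of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, workers=4, chunk_size=250):
    chunks = chunked(numbers, chunk_size)
    # Each chunk is handed to a worker; results come back in chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Combine the partial answers into the final result.
    return sum(partial_results)


numbers = list(range(1000))
total = parallel_sum_of_squares(numbers)
print(total)
```

The approach works here because each chunk can be processed without knowing anything about the others; jobs whose parts depend on one another are harder to parallelize.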

R

R is a programming language for statistical computing and graphics, widely used to analyze large data sets. It is the lingua franca for many big data scientists.

Re-identification

This is the act of putting personally identifiable information back into an anonymized big data set so that it can be used operationally.

Semi-structured Data

Semi-structured data often comes from sources such as log files or IoT sensor streams. It does not conform to the formal structure of data models, but it contains tags or other markers that separate elements and apply hierarchies of records and fields within the data.
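
JSON log lines are a common example. In the sketch below, the device names and fields are hypothetical: every record is tagged with field names and can nest other records, but no two records are required to carry the same fields, which is what separates this data from a fixed table.

```python
import json

# Hypothetical IoT log lines: tagged fields, but no fixed schema.
LOG_LINES = [
    '{"device": "thermostat-1", "temp_c": 21.5}',
    '{"device": "door-7", "open": true, "battery": {"level": 0.82}}',
    '{"device": "thermostat-1", "temp_c": 22.0, "firmware": "1.4"}',
]


def parse_events(lines):
    """Turn each JSON line into a Python dictionary."""
    return [json.loads(line) for line in lines]


def field_names(events):
    """Collect every top-level field seen across all events."""
    names = set()
    for event in events:
        names.update(event)
    return names


events = parse_events(LOG_LINES)
print(sorted(field_names(events)))
print(events[1]["battery"]["level"])
```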

Streaming

Big data analysis is often conducted on static data that has been collected previously. This is known as batch processing. In many cases, though, data scientists need to analyze data from a constantly updated feed in real or near-real time. This is known as streaming. Examples of streaming data sources include online player interactions for gaming companies, information from industrial sensors, and real-time changes in stock market prices.
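
The difference from batch processing is that results are updated as each value arrives, without waiting for the full data set. The sketch below computes a moving average over a feed one value at a time; `price_feed` is a hypothetical stand-in for a live source such as a message queue or a socket.

```python
from collections import deque


def moving_average(stream, window=3):
    """Yield the average of the most recent values as each one arrives."""
    recent = deque(maxlen=window)  # older values drop out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


def price_feed():
    """Hypothetical stand-in for a live feed of stock prices."""
    yield from [10.0, 12.0, 14.0, 16.0]


averages = list(moving_average(price_feed(), window=3))
print(averages)
```

Only the last few values are held in memory at any time, so the same code keeps working on a feed that never ends.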

Unstructured Data

Structured data conforms to a predetermined format. A customer record may contain names, addresses, telephone numbers, and credit card details, and each of these fields will have clearly defined data types and characteristics. Unstructured data is everything else: all the web pages, social media posts, Word documents, images, video, and PowerPoint slides that can be found in every organization. It can be difficult and time-consuming to harvest intelligence from these documents using big data algorithms, but the rewards can be immense.

YARN

YARN (Yet Another Resource Negotiator) is the resource management layer of Apache Hadoop. It allocates cluster resources and schedules jobs, serving as a common data operating system that lets applications in the Hadoop ecosystem run natively on the same cluster while extending consistent security, governance, and operations across the platform. With YARN as a cornerstone, enterprises of all types and sizes can adopt Hadoop for production use cases at scale.

As you map out or expand your big data strategy, knowing these big data basics will help you be successful.

To learn more about how to advance your big data strategy, check out this big data scorecard tool.
