
Fast, Easy and Secure Big Data Ingestion

Transform Data Ingestion From Months To Minutes

What Is Data Ingestion?

Big data ingestion is the process of moving data, especially unstructured data, from where it originates into a system where it can be stored and analyzed, such as Hadoop.

Data ingestion may be continuous or asynchronous, real-time or batched, or both (a lambda architecture), depending on the characteristics of the source and the destination. In many scenarios the source and the destination do not share the same data timing, format, or protocol, so some type of transformation or conversion is required before the data is usable by the destination system.
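As a minimal illustration of such a format conversion, the sketch below (plain Python with hypothetical field names, not an HDF feature) turns CSV source records into the JSON lines a destination system might expect:

```python
import csv
import io
import json

def csv_to_json_records(csv_text):
    """Convert CSV source records into JSON lines for the destination system."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

# Hypothetical sensor readings arriving as CSV.
source = "device,temp\nsensor-1,21.5\nsensor-2,19.0\n"
for line in csv_to_json_records(source):
    print(line)
```

In a real dataflow this kind of record-level conversion would be one processing step among many, applied as data moves between systems.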

As the number of IoT devices grows, both the volume and variance of data sources are expanding rapidly, and these sources often must be accommodated in real time. Yet extracting data so that the destination system can use it is a significant challenge in terms of time and resources. Making data ingestion as efficient as possible frees resources for big data streaming and analysis, rather than the mundane work of data preparation and transformation.

HDF Makes Big Data Ingest Easy


Traditional approach: complicated, messy, and takes weeks to months to move the right data into Hadoop

With HDF: streamlined, efficient, easy

Typical Problems of Data Ingestion

Complex, Slow and Expensive


Purpose-built and over-engineered tools make big data ingestion complex, time consuming, and expensive


Writing customized scripts and stitching together multiple products to acquire and ingest data, as current big data ingest solutions require, takes too long and prevents the on-time decision making demanded by today’s business environment


Command line interfaces for existing streaming data processing tools create dependencies on developers and restrict access to data and decision making

Security and Trust of Data


The need to share discrete bits of data is incompatible with current transport-layer data security capabilities, which limit access only at the group or role level


Adherence to compliance and data security regulations is difficult, complex and costly


Verifying data access and usage is difficult and time consuming, often requiring a manual process of piecing together different systems and reports to establish where data is sourced from, how it is used, who has used it, and how often

Problems of Data Ingestion for IoT


Difficult to balance the limited power, computing, and bandwidth resources of devices against the volume of data signals generated by big data streaming sources


Unreliable connectivity disrupts communication and causes data loss


Lack of security on most of the world’s deployed sensors puts businesses and safety at risk

Optimizing Data Ingestion with Hortonworks DataFlow

Fast, Easy, Secure


The fastest way to address many big data ingestion problems today


Real-time, interactive point and click control of dataflows


Accelerated data collection and movement for increased big data ROI


Real-time operational visibility, feedback, and control


Business agility and responsiveness


Real-time decision making from big data streaming sources


Unprecedented operational effectiveness achieved by eliminating the dependencies and delays inherent in a custom coding and scripting approach


Off-the-shelf, flow-based programming for big data infrastructure


Secure, reliable and prioritized data collection over geographically dispersed, variable bandwidth environments


End-to-end data provenance that enables a chain of custody for data compliance, data “valuation”, dataflow optimization, and troubleshooting
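The chain-of-custody idea behind data provenance can be illustrated with a small sketch. This is not HDF's actual provenance implementation, only a hash-linked event log in plain Python with hypothetical event types:

```python
import hashlib
import json
import time

def provenance_event(chain, event_type, detail):
    """Append a tamper-evident provenance event, linked to its predecessor by hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    event = {"type": event_type, "detail": detail,
             "ts": time.time(), "prev": prev_hash}
    payload = json.dumps(event, sort_keys=True).encode()
    event["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(event)
    return event

chain = []
provenance_event(chain, "RECEIVE", "sensor-1 reading ingested at edge")
provenance_event(chain, "TRANSFORM", "converted CSV record to JSON")
provenance_event(chain, "SEND", "delivered to Hadoop cluster")

# Verify the chain of custody: each event must reference its predecessor's hash.
for prev, cur in zip(chain, chain[1:]):
    assert cur["prev"] == prev["hash"]
```

Because each event embeds the hash of the one before it, altering any step in the data's history breaks every later link, which is what makes provenance usable for compliance verification.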

Single, Flexible, Adaptive Bi-Directional Real-Time System


Integrated data-source agnostic collection from dynamic, disparate and distributed sources


Adaptive to the fluctuating conditions of remote, distributed data sources over geographically dispersed communication links in varying bandwidth and latency environments


Dynamic, real-time data prioritization at the edge to send, drop or locally store data


Bi-Directional movement of data, commands and contextual data


Equally well designed to run on the small-scale data sources that make up the Internet of Things and on the large-scale clusters in today's enterprise data centers


Visual chain of custody for data (provenance) provides real-time event-level data lineage for verification and trust of data from the Internet of Things
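The edge prioritization described above (send, drop, or locally store data against a limited bandwidth budget) can be sketched roughly as follows. The priority threshold and signal fields are illustrative assumptions, not HDF behavior:

```python
import heapq

def prioritize(signals, bandwidth_budget):
    """Decide per-signal whether to send now, store locally, or drop,
    given a limited bandwidth budget in bytes. Highest priority goes first."""
    # Max-heap on priority (negated, since heapq is a min-heap).
    heap = [(-s["priority"], s["size"], s["id"]) for s in signals]
    heapq.heapify(heap)
    send, store, drop = [], [], []
    while heap:
        neg_prio, size, sid = heapq.heappop(heap)
        if size <= bandwidth_budget:
            bandwidth_budget -= size
            send.append(sid)
        elif -neg_prio >= 5:   # important but no bandwidth left: keep for later
            store.append(sid)
        else:                  # low value and no bandwidth left: discard
            drop.append(sid)
    return send, store, drop

signals = [
    {"id": "alarm",     "priority": 9, "size": 40},
    {"id": "heartbeat", "priority": 2, "size": 10},
    {"id": "video",     "priority": 6, "size": 500},
]
print(prioritize(signals, bandwidth_budget=60))
# → (['alarm', 'heartbeat'], ['video'], [])
```

The point of doing this at the edge is that the high-priority alarm always gets through, bulky data waits for bandwidth, and disposable signals never consume the constrained link.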


Use-Cases of Data Ingestion with Hortonworks Dataflow


On-Ramp Into Hadoop

Reduce the time typically required to move data into Hadoop from months to minutes through a real-time drag-and-drop interface. Read about a real-world use case and see how to move data into HDFS in 30 seconds.



Log Collection / Splunk Optimization

Log data can be complex to capture: it is typically collected in limited amounts and is difficult to operationalize at scale. HDF helps efficiently collect, funnel, and access expanding volumes of log data, and eases integration with log analytics systems such as Splunk, SumoLogic, Graylog, and LogStash for easy, secure, and comprehensive ingest of log files.
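As a rough sketch of what structuring a raw log line involves (a hypothetical common-log-format pattern, not how HDF or any log analytics product actually parses logs):

```python
import re

# Hypothetical common-log-format pattern; real deployments vary widely.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Turn one raw access-log line into a structured record, or None if unparseable."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = '10.0.0.1 - - [03/Feb/2016:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512'
print(parse_log_line(line))
```

Multiplied across thousands of hosts and formats, this parsing and funneling step is exactly the work an ingest layer takes off the analytics system.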



IoT Ingestion

Realizing the promise of real-time decision making enabled by real-time IoT big data streaming is a challenge due to the distributed and disparate nature of IoT data. HDF simplifies data collection and helps push intelligence to the very edge of highly distributed networks.



Deliver data into stream processing engines

Big data ingestion leads to processing that delivers business intelligence. HDF enables streaming data processing for your organization, supporting real-time enterprise use cases with two of the most popular open-source solutions: Apache Storm and Spark Streaming.
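As a toy illustration of the kind of windowed computation a stream processing engine performs downstream of ingestion (plain Python, not Storm or Spark Streaming code):

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds=10):
    """Group a time-ordered stream of (timestamp, key) events into fixed
    non-overlapping windows and count keys per window, as a stream engine might."""
    windows = {}
    for ts, key in events:
        bucket = int(ts // window_seconds) * window_seconds
        windows.setdefault(bucket, Counter())[key] += 1
    return windows

# Hypothetical event stream: (seconds, event type).
events = [(1, "click"), (4, "view"), (8, "click"), (12, "click"), (19, "view")]
print(tumbling_window_counts(events))
# → {0: Counter({'click': 2, 'view': 1}), 10: Counter({'click': 1, 'view': 1})}
```

An ingestion layer's job is to deliver the event stream reliably and in order; the engine then applies windowed logic like this continuously to support real-time use cases.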