Apache Accumulo

A sorted, distributed key-value store with cell-based access control

Apache™ Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google's Big Table design that works on top of Apache Hadoop® and Apache ZooKeeper.

Cell-level access control is important for organizations with complex policies governing who is allowed to see data. It allows data sets with different access control policies to be intermingled in one table, and it allows proper handling of individual data sets that contain some sensitive portions.

Without Accumulo, those policies are difficult to enforce systematically. Accumulo encodes those rules for each individual data cell and allows fine-grained access control.
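
For example, with the Java client API a visibility expression is attached to each cell as it is written, and a scan returns only the cells whose expression is satisfied by the authorizations presented. The following is a minimal sketch using the 1.x-era Connector API; the instance, ZooKeeper host, table, user, and labels are placeholders, and it assumes the table already exists and the user has been granted the labels used.

  import java.util.Map.Entry;
  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;
  import org.apache.accumulo.core.security.ColumnVisibility;

  public class VisibilityExample {
    public static void main(String[] args) throws Exception {
      // Connect to a (placeholder) instance through ZooKeeper.
      Connector conn = new ZooKeeperInstance("accumulo", "zkhost:2181")
          .getConnector("user", new PasswordToken("secret"));

      // Write one cell whose visibility expression requires both labels.
      BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
      Mutation m = new Mutation("patient-001");
      m.put("attrs", "diagnosis", new ColumnVisibility("MEDICAL&AUDIT"),
          new Value("example".getBytes()));
      writer.addMutation(m);
      writer.close();

      // Scan with authorizations the user has been granted; cells whose expressions
      // are not satisfied are filtered out server-side and never reach the client.
      Scanner scanner = conn.createScanner("records", new Authorizations("MEDICAL", "AUDIT"));
      for (Entry<Key,Value> e : scanner) {
        System.out.println(e.getKey() + " -> " + e.getValue());
      }
    }
  }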

What Accumulo Does

Accumulo contains a variety of features for general administration, table design, data integrity and availability, performance, testing, client APIs, extensible behaviors and data management. Some of those features are listed here:

General administration
  • Monitoring dashboard
  • Time tracing of system operations
  • System and table configurations (stored in ZooKeeper)
  • Easy table renaming
Table design and configuration
  • Iterators for filtering and aggregation
  • Cell labels for cell-level access control
  • Configurable constraints for writing to a table
  • Support for sharded document stores
  • Large rows that need not fit in memory
Integrity and availability
  • Master failover with ZooKeeper locks
  • Write-ahead logs for recovery
  • Logical time for properly ordered timestamps when inserting mutations or bulk importing files
  • Fault-tolerant executor (FATE)
  • Scalable master metadata store
  • Scan isolation
Performance
  • Relative encoding for additional compression of similar consecutive keys
  • Native in-memory map for improved performance
  • Parallel server threads for long scans
  • Caching of recently scanned data
  • RFiles with multi-level index trees (for large indices)
  • Binary searches in RFile blocks
Testing
  • Mock implementations for unit testing
  • Mini Accumulo cluster spins up all Accumulo processes in a single JVM for testing
  • Functional tests
  • Scale tests
  • Random walk tests
Client APIs
  • Scanner – looks up a single key or scans over a range of keys in sorted order
  • Batch Scanner – takes a list of Ranges, batches them to the appropriate tablet servers, and returns data as it is received (a usage sketch follows this feature list)
  • Batch Writer – clients buffer writes in memory before sending them in batches to the appropriate servers
  • Bulk Import – instead of writing individual mutations to Accumulo, entire files of sorted key-value pairs can be imported
  • MapReduce – Accumulo can be a source and sink for MapReduce jobs
  • Offline MapReduce – maps over underlying files instead of reading data through an Accumulo tablet server for more efficient use of resources
Extensible behaviors
  • Pluggable balancer for tablet distribution
  • Pluggable memory manager for tablet compaction
Data management
Internal capabilities
  • Group columns within a single file
  • Configure data compaction ratios
  • Throttle ingest with merging minor compactions
  • Load JARs using Apache Commons VFS
  • Automatic fault-tolerant tablet splitting and rebalancing

On-demand capabilities (see the sketch just after this list)
  • Force compactions
  • Add table split points
  • Merge tablets (remove split points)
  • Clone/snapshot tables
  • Compact tablets that fall within a pre-determined range of rows
  • Delete a range of rows from a table
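
These on-demand operations are exposed through the client API's TableOperations interface. The following is a minimal sketch that reuses the Connector from the earlier visibility example; the table names and row boundaries are placeholders.

  // Requires org.apache.accumulo.core.client.admin.TableOperations,
  // org.apache.hadoop.io.Text, and java.util collection classes.
  TableOperations ops = conn.tableOperations();

  // Add split points so the table is served by more tablets.
  SortedSet<Text> splits = new TreeSet<Text>();
  splits.add(new Text("m"));
  ops.addSplits("records", splits);

  // Force a compaction over a pre-determined range of rows (flush first, do not wait).
  ops.compact("records", new Text("a"), new Text("m"), true, false);

  // Clone the table without re-copying its underlying files.
  ops.clone("records", "records_clone", true,
      Collections.<String,String>emptyMap(), Collections.<String>emptySet());

  // Merge tablets back together by removing split points within a range.
  ops.merge("records", new Text("a"), new Text("m"));

  // Delete a range of rows from the table.
  ops.deleteRows("records", new Text("a"), new Text("b"));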

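The Scanner, Batch Scanner, and iterator features listed above can be combined as follows. This sketch again reuses the Connector from the first example; the table name, authorizations, and regular expression are placeholders.

  // Requires org.apache.accumulo.core.client.{BatchScanner,IteratorSetting},
  // org.apache.accumulo.core.data.Range, and org.apache.accumulo.core.iterators.user.RegExFilter.
  BatchScanner bs = conn.createBatchScanner("records", new Authorizations("MEDICAL", "AUDIT"), 4);

  // Fetch several non-contiguous ranges in parallel; results can arrive in any order.
  bs.setRanges(Arrays.asList(Range.exact("patient-001"), Range.prefix("patient-04")));

  // Attach a server-side filter iterator for this scan only:
  // keep entries whose value matches the regular expression.
  IteratorSetting filter = new IteratorSetting(30, "valueFilter", RegExFilter.class);
  RegExFilter.setRegexs(filter, null, null, null, "flu.*", false);
  bs.addScanIterator(filter);

  for (Map.Entry<Key,Value> e : bs) {
    System.out.println(e.getKey().getRow() + " -> " + e.getValue());
  }
  bs.close();
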
How Accumulo Works

Accumulo stores sorted key-value pairs. Sorting data by key allows rapid lookups of individual keys or scans over a range of keys.  Since data is retrieved by key, the keys should contain the information that will be used to do the lookup.

  • If retrieving data by a unique identifier, the identifier should be in the key.
  • If retrieving data by its intrinsic features, such as values or words, the keys should contain those features.

The values may contain anything since they are not used for retrieval.
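
For the second case above, a common key design is an inverted index in which the feature itself becomes the row. A minimal sketch, assuming a BatchWriter and Scanner set up as in the first example (the table layout and document identifier are hypothetical):

  // Index a document under a word it contains: row = word, column qualifier = document id.
  Mutation m = new Mutation("accumulo");
  m.put("doc", "article-1234", new Value(new byte[0]));
  writer.addMutation(m);

  // Because keys are stored sorted, finding every document containing the word
  // "accumulo" is a single scan over one row.
  scanner.setRange(Range.exact("accumulo"));
  for (Map.Entry<Key,Value> e : scanner) {
    System.out.println("document: " + e.getKey().getColumnQualifier());
  }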

The original Big Table design has a row and column paradigm. Accumulo extends the column with an additional “visibility” label that provides the fine-grained access control.
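
Concretely, each Accumulo key is composed of a row identifier, a column (family and qualifier), a visibility label, and a timestamp, so a stored entry can be pictured as:

  row id : column family : column qualifier : [visibility] : timestamp  ->  value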

Accumulo is written in Java, but a Thrift proxy allows users to interact with Accumulo from C++, Python, or Ruby. A pluggable user-authentication system allows LDAP connections to Accumulo. An HDFS class loader loads JARs from the Hadoop Distributed File System (HDFS) onto multiple servers. Accumulo also has connectors to other Apache projects such as Hive and Pig.
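
As one illustration of that VFS-based class loading, Accumulo's site configuration can point at a directory of JARs in HDFS. The sketch below uses the general.vfs.classpaths property from that feature; the HDFS path is a placeholder, so consult the documentation for your release before relying on it.

  <!-- accumulo-site.xml (sketch): load extension JARs from an HDFS directory -->
  <property>
    <name>general.vfs.classpaths</name>
    <value>hdfs://namenode:8020/accumulo/classpath/.*.jar</value>
  </property>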

Try Accumulo with Sandbox

Hortonworks Sandbox is a self-contained virtual machine with HDP running alongside a set of hands-on, step-by-step Hadoop tutorials.

Get Sandbox