Solutions
Apache Hadoop
Building on Apache Hadoop, the open-source framework for reliable, scalable, distributed computing, we offer the distributed processing of large data sets across clusters of computers using simple programming models.
- Scale up from single servers to thousands of machines
- Leverage local computation and storage on each machine
- Detect and handle failures at the application layer
- Deliver a highly available service on top of a cluster of computers
A wide variety of companies and organisations engage us for both research and production solutions.
Our scope of solutions includes the following:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
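The MapReduce model behind the module above can be illustrated with a small local sketch. This is plain Python, not the Hadoop MapReduce API itself; the `map_phase`, `shuffle`, and `reduce_phase` names are illustrative stand-ins for the phases the framework runs in parallel across a cluster:

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit (word, 1) pairs for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: sum the counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases sequentially over an in-memory data set."""
    pairs = [kv for line in lines for kv in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster, the map and reduce phases run on many machines at once and the shuffle moves data between them; the sequential version above only captures the programming model, which is the point of its simplicity.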
Other Hadoop-related projects at Apache include:
- Ambari™
A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (for example, heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance in a user-friendly manner.
- Avro™
A data serialisation system.
- Cassandra™
A scalable multi-master database with no single points of failure.
- Chukwa™
A data collection system for managing large distributed systems.
- HBase™
A scalable, distributed database that supports structured data storage for large tables.
- Hive™
A data warehouse infrastructure that provides data summarisation and ad hoc querying.
- Mahout™
A scalable machine learning and data mining library.
- Pig™
A high-level data-flow language and execution framework for parallel computation.
- Spark™
A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Tez™
A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
- ZooKeeper™
A high-performance coordination service for distributed applications.
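The "simple and expressive programming model" mentioned for Spark above can be approximated locally with plain Python. This is a hypothetical sketch, not the Spark RDD API: lazy generator pipelines play the role of chained transformations (`map`, `filter`), and a terminal aggregation plays the role of an action (`reduce`):

```python
from functools import reduce

# Stand-in for a distributed dataset (a Spark RDD would be partitioned
# across the cluster; here it is a local range).
data = range(1, 11)

# Transformations: lazy, nothing is computed yet, analogous to
# rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# Action: forces evaluation, analogous to rdd.reduce(lambda a, b: a + b)
total = reduce(lambda a, b: a + b, evens)
print(total)  # 220
```

The deferred-evaluation style matters on a cluster: Spark can fuse the transformation chain and schedule it across machines before any data moves, which the local generator pipeline loosely mirrors.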
