Big data is first and foremost data. It is not tools and technologies. There are, however, a number of tools and technologies that have seen the light of day in recent years and that are used to store, process, analyse and otherwise harvest big data. Vendor marketing being what it is (and I am as guilty as anyone else), these are often referred to as “big data tools” or “big data technologies”. In this post I will try to clarify some of the basic concepts and tools for big data.
At the foundation of these technologies is a concept called MapReduce. It provides a massively parallel environment for executing computationally intensive functions in very little time, on a grid of commodity hardware. Exit the Crays and other massively parallel, helium-cooled supercomputers.
MapReduce allows a programmer to express a transformation of data that can be executed on a cluster that may include thousands of computers operating in parallel. At its core, it uses a series of “maps” to divide a problem across multiple parallel servers and then uses a “reduce” to consolidate responses from each map and identify an answer to the original problem.
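The map/reduce split described above can be sketched in a few lines of plain Python. This is a single-process illustration, not a distributed implementation: the word-count data and function names are hypothetical, and each string in `chunks` stands in for the slice of data one server would process.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: each string stands for the chunk of data one node would see.
chunks = ["big data is data", "data needs tools"]

def map_phase(chunk):
    # "map": each node independently emits (key, value) pairs, in parallel.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # "reduce": pairs are grouped by key and consolidated into the final answer.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(chain.from_iterable(map_phase(c) for c in chunks))
# counts now holds the word frequencies across all chunks, e.g. counts["data"] == 3
```

In a real cluster, the map calls run on thousands of machines at once and the framework shuffles the emitted pairs to the reducers; the program structure, however, is exactly this pair of functions.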
MapReduce is the enabling technology behind big data platforms such as Hadoop.
Started at Yahoo! as an implementation of MapReduce in 2005 and released as an open source project in 2007, Apache Hadoop has the basic constructs needed to perform computing: a distributed file system, a framework for writing programs, a way of managing the distribution of those programs over a cluster, and a way of collecting the results of those programs. Ultimately the goal is to create a single result set.
With Hadoop, big data is split into pieces that are spread over a series of nodes running on commodity hardware. Each piece is also replicated several times on different nodes to guard against node failure. The data is not organized into the relational rows and columns expected in traditional persistence, which lends itself to storing structured, semi-structured and unstructured content.
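The replication idea can be illustrated with a toy placement function. This is a simplified sketch, not Hadoop's actual (rack-aware) placement policy; the node names and the round-robin choice of nodes are assumptions made for the example. The default of three replicas per block does match HDFS.

```python
REPLICATION_FACTOR = 3  # HDFS defaults to 3 replicas per block

def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    # Toy policy: pick `replication` distinct nodes for this block,
    # rotating the starting node so blocks spread across the cluster.
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = {blk: place_replicas(blk, nodes) for blk in range(4)}
# Losing any single node still leaves at least two copies of every block.
```

The point of the sketch is the invariant, not the policy: every block lives on several distinct nodes, so the failure of one machine never loses data.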
Hadoop is an Apache top level project.
The Hadoop Ecosystem
A number of projects have sprung up around Hadoop to provide additional features. The main ones for processing big data include:
Pig, to write complex MapReduce transformations using a scripting language, Pig Latin, which defines a set of transformations on a data set such as aggregate, join and sort.
Hive, a data warehouse infrastructure built on top of Hadoop for providing data summarisation, ad-hoc query, and analysis of large datasets.
HBase, a non-relational columnar database, which provides fault-tolerant storage and quick access to large quantities of sparse data.
HCatalog, a table and storage management service for data created using Apache Hadoop that provides a table abstraction over where and how data is stored.
Sqoop, a set of tools that allow Hadoop to interact with traditional relational databases and data warehouses.
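To see what a tool like Pig abstracts away, consider a join: one line of Pig Latin, but several explicit steps when expressed directly in MapReduce terms. The sketch below shows those steps in plain Python on hypothetical data (the `users` and `views` records and all function names are made up for the example); it is a single-process illustration of the pattern, not a Pig or Hadoop implementation.

```python
from collections import defaultdict

# Hypothetical datasets: users and their page views, both keyed by user id.
users = [(1, "alice"), (2, "bob")]
views = [(1, "/home"), (1, "/about"), (2, "/home")]

def map_tagged(records, tag):
    # Map step: tag each record with its source so the reducer can tell them apart.
    return [(key, (tag, value)) for key, value in records]

def reduce_join(pairs):
    # Reduce step: group by key, then pair every user with every matching view.
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    joined = []
    for tagged_values in grouped.values():
        names = [v for t, v in tagged_values if t == "user"]
        pages = [v for t, v in tagged_values if t == "view"]
        joined.extend((name, page) for name in names for page in pages)
    return sorted(joined)

result = reduce_join(map_tagged(users, "user") + map_tagged(views, "view"))
```

Pig Latin lets you write the equivalent of this as a single `JOIN` statement and compiles it down to map and reduce steps for you; Hive does the same for SQL-style queries.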
This post is meant only as a high-level overview of tools for big data. For more detailed information on these technologies (and more), I would recommend a very good Talend white paper: Big Data for the Masses (registration required), or reserve some time with Talend on our stand at 2012 by emailing email@example.com
Yves de Montcheuil – VP of Marketing – Talend