Los números de 2014

Los duendes de las estadísticas de WordPress.com prepararon un informe sobre el año 2014 de este blog.

Aquí hay un extracto:

El Museo del Louvre tiene 8.5 millones de visitantes por año. Este blog fue visto cerca de 150.000 veces en 2014. Si fuese una exposición en el Museo del Louvre, se precisarían alrededor de 6 días para que toda esa gente la visitase.

Haz click para ver el reporte completo.

Create a really big file / Crear un archivo realmente grande

This is sometimes useful when playing with bigdata.

Instead of a dd command and wait the file being created block by clock, we can run:

$ fallocate -l 200G /mnt/reallyBigFile.csv

It essentially “allocates” all of the space you’re seeking, but it doesn’t bother to write anything. So, when you use fallocate to create a 200 GB virtual drive space, you really do get a 200 GB file.

Hadoop 1 vs Hadoop 2

There are a lot of articles about this, but, I just needed a good summary of concepts:

Hadoop 1:

Hadoop1A master process called the JobTracker is the central scheduler for all MapReduce jobs in the cluster.

Nodes have a TaskTracker process that manages tasks on the individual nodes. The TaskTrackers, communicate with and are controlled by the JobTracker. Similar to most resource managers, the JobTracker has two pluggable scheduler modules, Capacity and Fair.

The JobTracker is responsible for managing the TaskTrackers on worker server nodes, tracking resource consumption and availability, scheduling individual job tasks, tracking progress, and providing fault tolerance for tasks.

The TaskTracker is directed by the JobTracker and is responsible for launch and tear down of jobs and provides task status information to the JobTracker.

The TaskTrackers also communicate through heartbeats to the JobTracker: If the JobTracker does not receive a heartbeat from a TaskTracker, it assumes it has failed and takes appropriate action (e.g., restarts jobs).

Hadoop 2 (YARN):


In YARN, the job tracker is split into two different daemons called ResourceManager and NodeManager (node specific).

Also there are new components: an ApplicationMaster and application Containers.

The ResourceManager is a pure scheduler. Its sole purpose is to manage available resources among multiple applications on the cluster. As with version 1, both Fair and Capacity scheduling options are available.

The ApplicationMaster is responsible for accepting job submissions, negotiating resource Containers from the ResourceManager, and tracking progress.

ApplicationMasters are specific to and written for each type of application. For example, YARN includes a distributed Shell framework that runs a shell script on multiple nodes on the cluster. The ApplicationMaster also provides the service for restarting the ApplicationMaster Container on failure.

ApplicationMasters request and manage Containers, which grant rights to an application to use a specific amount of resources (memory, CPU, etc.) on a specific host.

The ApplicationMaster, once given resources by the ResourceManager, contacts the NodeManager to start individual tasks. For example, using the MapReduce framework, these tasks would be mapper and reducer processes.

The NodeManager is the per-machine framework agent that is responsible for Containers, monitoring their resource usage (CPU, memory, disk, network), and reporting back to the ResourceManager.

On previous figure we have two ApplicationMasters running within the cluster, one of which has three Containers (the red client) and one that has one Container (the blue client). Note that the ApplicationMasters run on cluster nodes and not as part of the ResourceManager, thus reducing the pressure on a central scheduler. Also, because ApplicationMasters have dynamic control of Containers, cluster utilization can be improved.


– Hadoop 2 has the concept of containers, while Hadoop 1 has slots. Containers are generic and can run any type of tasks, but a slot can run either a map or a reduce task.

– The same MR program without any modifications can be executed in Hadoop 1 and 2, but needs to be compiled with the appropriate set of jar files.

– Hadoop 1 allows us to write programs in only MR, while Hadoop 2 with  resource management framework (YARN) allows to write programs in multiple distributed computing models.  As example MR, Spark, Hama, Giraph, MPI and HBase coprocessors, among others.

Instalando Maven en instancia Amazon EC2

Maven es una herramienta de software para la gestión y construcción de proyectos Java

Obtenemos maven:

$ wget http://apache.saix.net/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz


$ tar -xzvf apache-maven-3.2.3-bin.tar.gz

Movemos la carpeta a un directorio de instalación permanente:

$ sudo mv /home/ec2-user/apache-maven-3.2.3 /usr/local/maven

Creamos link simbólico a la version current (por si instalamos otras versiones):

$ sudo ln -s /usr/local/maven /usr/local/maven/current

Modificamos el inicio de sesión del usuario para hacer disponible maven:

$ sudo vi ~/.bashrc

Añadimos las siguientes entradas:

export MAVEN_HOME=/usr/local/maven/current
export M2_HOME=/usr/local/maven/current
export M2=/usr/local/maven/current/bin
export PATH=/usr/local/maven/current/bin:$PATH

Cargamos el bash profile con esta modifiación:

$ source ~/.bashrc

Chequeamos la instalación:

$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9;2014-02-14T17:37:52+00:00)
Maven home: /usr/local/maven/current
Java version: 1.7.0_55, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-

Adding a JAR path to Hadoop classpath

This is simple, but it is a frequent question:

If we need to add some specific path pointing to a thirdparty library we can run a command like the following:

$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/*:/home/hadoop/.versions/Cascading-2.5-SDK/binary/cascading/lib/cascading-core/*

Here I am adding two directories to the hadoop classpath:


We can check the hadoop classpath with the following command:

$ hadoop classpath


Apache Hive: dealing with Out of Memory and Garbage Collector errors.


This is the common error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

This error will occur in several Java environments, but, in particular, with Hive, is pretty common when big structures or several thousands objects are stored in memory.

According to Sun, the error will raise if too much time is being spent in garbage collection:

If more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown.

This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small.

If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit.

Also, we can increase the Heap Size, via “-Xmx1024m” option.

Another interesting option is the Concurrent Collector “UseConcMarkSweepGC“: It performs most of its work concurrently (i.e., while the application is still running) to keep garbage collection pauses short.

It is designed for applications with medium to large-sized data sets for which response time is more important than overall throughput, since the techniques used to minimize pauses can reduce application performance.

In Hive, we can set thes parameters with a  command like this:

SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit";


An alternative to avoid storing all the structure in memory:

Write intermediate results to a temporary table in the database instead of hashmaps, a database table table is not memory bound, so use an indexed table is a solution in many cases.

When the intermediate table is complete, execute a sql statement(s) from it instead of from memory.

HBase Basics


HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database.

Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

However, HBase has many features which supports both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and as well as processing capacity. RDBMS can scale well, but only up to a point – specifically, the size of a single database server – and for the best performance requires specialized hardware and storage devices. HBase features of note are:

  • Strongly consistent reads/writes: HBase is not an “eventually consistent” DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
  • Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
  • Automatic RegionServer failover
  • Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
  • MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
  • Java Client API: HBase supports an easy to use Java API for programmatic access.
  • Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
  • Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
  • Operational Management: HBase provides build-in web-pages for operational insight as well as JMX metrics.

When Should I Use HBase?

HBase isn’t suitable for every problem.

1) Make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.

2) Make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.)

An application built against an RDBMS cannot be “ported” to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.

3) Make sure you have enough hardware. Even HDFS doesn’t do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.

HBase can run quite well stand-alone on a laptop – but this should be considered a development configuration only.

How does HBase distribute the data across the cluster ?

HBase stores rows of data in tables. Tables are split into chunks of rows called “regions”. Those regions are distributed across the cluster, hosted and made available to client processes by the RegionServer process.

hbase-architectureA region is a continuous range within the key space, meaning all rows in the table that sort between the region’s start key and end key are stored in the same region. Regions are non-overlapping, i.e. a single row key belongs to exactly one region at any point in time. A region is only served by a single region server at any point in time, which is how HBase guarantees strong consistency within a single row#. Together with the -ROOT- and .META. regions, a table’s regions effectively form a 3 level B-Tree for the purposes of locating a row within a table.

HBase depends on HDFS for data storage. RegionServers collocate with the HDFS DataNodes. This enables data locality for the data served by the RegionServers, at least in the common case. Region assignment, DDL operations, and other book-keeping facilities are handled by the HBase Master process.

hbase-physical-architectureIt uses Zookeeper to maintain live cluster state. When accessing data, clients communicate with HBase RegionServers directly. That way, Zookeeper and the Master process don’t bottle-neck data throughput. No persistent state lives in Zookeeper or the Master. HBase is designed to recover from complete failure entirely from data persisted durably to HDFS.

What Is The Difference Between HBase and Hadoop/HDFS?

HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed “StoreFiles” that exist on HDFS for high-speed lookups.

More info: