monitoring HTTP requests on the fly


Install httpry:

sudo yum install httpry

or

$ sudo yum install gcc make git libpcap-devel
$ git clone https://github.com/jbittel/httpry.git
$ cd httpry
$ make
$ sudo make install

then run:

sudo httpry -i eth0

Output will be like:

httpry version 0.1.8 -- HTTP logging and information retrieval tool
Copyright (c) 2005-2014 Jason Bittel <jason.bittel@gmail.com>
Starting capture on eth0 interface
2016-07-27 14:20:59.598    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:20:59.599    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:22:02.034    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:22:02.034    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:23:04.640    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:23:04.640    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:24:07.122    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:24:07.123    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK

 

 

 

 

Indexing Common Crawl Metadata on Elasticsearch using Cascading


If you want to explore how to parallelize the data ingestion into Elasticsearch, please have a look to this post I have written for Amazon AWS:

http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch

It explains how to index Common Crawl metadata into Elasticsearch using Cascading connector directly from the S3 data source.

Cascading Source Code is available here.

Elasticsearch and Kibana on EMR Hadoop cluster


If you need to add Elasticsearch and Kibana on EMR, please have a look to this post I have written for Amazon AWS:

http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR

It contains all the steps to launch a cluster and perform the basic testings on both tools.

Additionally, here you will find the source code for the bootstrap actions used to configure Elasticsearch and Kibana on the EMR Hadoop cluster:

https://github.com/awslabs/emr-bootstrap-actions/tree/master/elasticsearch


Versión en Español

Si necesitas Elasticsearch y Kibana instalado en un cluster  EMR, por favor, mira esta publicacion que he escrito para Amazon AWS:

http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR

Contiene todos los pasos para crear un cluster y realizar las pruebas basicas en las dos herramientas.

Adicionalmente, aqui encontraras el codigo fuente para las bootstrap actions que uso para instalar Elasticsearch y Kibana en el EMR Hadoop cluster.

https://github.com/awslabs/emr-bootstrap-actions/tree/master/elasticsearch

Create a really big file / Crear un archivo realmente grande


This is sometimes useful when playing with bigdata.

Instead of a dd command and wait the file being created block by clock, we can run:

$ fallocate -l 200G /mnt/reallyBigFile.csv

It essentially “allocates” all of the space you’re seeking, but it doesn’t bother to write anything. So, when you use fallocate to create a 200 GB virtual drive space, you really do get a 200 GB file.

Instalando Maven en instancia Amazon EC2


Maven es una herramienta de software para la gestión y construcción de proyectos Java

Obtenemos maven:

$ wget http://apache.saix.net/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz

Descomprimimos:

$ tar -xzvf apache-maven-3.2.3-bin.tar.gz

Movemos la carpeta a un directorio de instalación permanente:

$ sudo mv /home/ec2-user/apache-maven-3.2.3 /usr/local/maven

Creamos link simbólico a la version current (por si instalamos otras versiones):

$ sudo ln -s /usr/local/maven /usr/local/maven/current

Modificamos el inicio de sesión del usuario para hacer disponible maven:

$ sudo vi ~/.bashrc

Añadimos las siguientes entradas:

export MAVEN_HOME=/usr/local/maven/current
export M2_HOME=/usr/local/maven/current
export M2=/usr/local/maven/current/bin
export PATH=/usr/local/maven/current/bin:$PATH

Cargamos el bash profile con esta modifiación:

$ source ~/.bashrc

Chequeamos la instalación:

$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9;2014-02-14T17:37:52+00:00)
Maven home: /usr/local/maven/current
Java version: 1.7.0_55, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre

Hive: dealing with Out of Memory and Garbage Collector errors.


hive_logo

This is the common error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

This error will occur in several Java environments, but, in particular, with Hive, is pretty common when big structures or several thousands objects are stored in memory.

According to Sun, the error will raise if too much time is being spent in garbage collection:

If more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown.

This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small.

If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit.

Also, we can increase the Heap Size, via “-Xmx1024m” option.

Another interesting option is the Concurrent Collector “UseConcMarkSweepGC“: It performs most of its work concurrently (i.e., while the application is still running) to keep garbage collection pauses short.

It is designed for applications with medium to large-sized data sets for which response time is more important than overall throughput, since the techniques used to minimize pauses can reduce application performance.

In Hive, we can set thes parameters with a  command like this:

SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit";

 

An alternative to avoid storing all the structure in memory:

Write intermediate results to a temporary table in the database instead of hashmaps, a database table table is not memory bound, so use an indexed table is a solution in many cases.

When the intermediate table is complete, execute a sql statement(s) from it instead of from memory.

vi sudo save with root permissions / grabar cambios con permisos de root


Just:

:w !sudo tee %

% is current file.

!sudo tee calls tee with administrator privileges and writes to current file.  But not vi buffered file.

That’s why you will see a warning like this when using the command:

W12: Warning: File "/etc/myfile.txt" has changed and the buffer was changed in Vim as well

Thanks Mandus for this! I feel better now !