Creating Bigtop patches

Posted on August 11, 2016 by hvivani

To contribute to the Bigtop project, we need to submit a patch.

We should follow this process for managing our proposed contributions:

Create a Jira ticket describing the problem. (Note: the ticket should be Minor priority for most things, and Major only if it fixes a bug that prevents something from working as expected. The component will be “deployment” for new charms or bundles, as well as for changes to the Puppet manifests.)
Create a branch in the /hvivani/bigtop fork named BIGTOP-XXXX, where XXXX is the Jira ticket number, e.g. BIGTOP-2417.
Commit the change with a commit message of the form BIGTOP-XXXX: message, where message must match the title of your Jira ticket. Then push it to the branch.
Open a pull request from your branch to the upstream Apache Bigtop repository, e.g. PR 138. Again, ensure the PR title exactly matches the title of your Jira ticket, prefixed by the BIGTOP-XXXX ticket number.
Refresh your Jira ticket. You should see your GitHub PR linked to the ticket. You should also see a comment by ASF GitHub Bot with information about the PR, including a link to the PR’s patch file, which is the PR URL with .patch appended to it. For example, BIGTOP-2417 contains a link to https://github.com/apache/bigtop/pull/138.patch.
Once that is done, click the Submit Patch button in the ticket to inform committers that the ticket has a Patch Available.
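
The naming conventions in the steps above can be sketched in shell. This is only an illustration: the ticket title, the remote name "origin" and the helper function names are hypothetical, while the ticket and PR numbers are the examples already used in the text. The script only prints the git commands; it does not run them.

```shell
#!/bin/sh
# Sketch of the naming conventions described above.
# TICKET and PR are the examples from the text; TITLE is a hypothetical placeholder.
TICKET="BIGTOP-2417"
TITLE="Example ticket title"
PR=138

# Branch name: the Jira ticket id.
branch_name() { printf '%s\n' "$1"; }

# Commit message and PR title: "BIGTOP-XXXX: <Jira ticket title>".
commit_message() { printf '%s: %s\n' "$1" "$2"; }

# Patch file URL: the PR URL with .patch appended.
patch_url() { printf 'https://github.com/apache/bigtop/pull/%s.patch\n' "$1"; }

# Print the git commands to run inside a clone of your fork
# (the remote name "origin" is an assumption).
echo "git checkout -b $(branch_name "$TICKET")"
echo "git commit -a -m \"$(commit_message "$TICKET" "$TITLE")\""
echo "git push origin $(branch_name "$TICKET")"
echo "patch: $(patch_url "$PR")"
```

Keeping the branch name, commit message and PR title derived from the same two values makes it harder for them to drift away from the Jira ticket.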

Building and Deploying Apache Bigtop Applications

Posted on August 9, 2016 by hvivani

If you want to explore how to build an application for Apache Bigtop and then deploy it using EMR, have a look at this blog post I wrote for Amazon AWS:

http://blogs.aws.amazon.com/bigdata/post/TxNJ6YS4X6S59U/Building-and-Deploying-Custom-Applications-with-Apache-Bigtop-and-Amazon-EMR

Change the default Java version

Posted on August 5, 2016 by hvivani

If you have more than one Java version installed on your Linux server (Red Hat flavor), you can change the default using the ‘alternatives’ command:

[hadoop@ip-172-31-36-252 ~]$ sudo /usr/sbin/alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java
   2           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number: 2
[hadoop@ip-172-31-36-252 ~]$ sudo /usr/sbin/alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*  1           /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java
 + 2           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number:
[hadoop@ip-172-31-36-252 ~]$

Show hidden files in Mac OS Finder

Posted on August 1, 2016 by hvivani

 hvivani$ defaults write com.apple.finder AppleShowAllFiles YES
 hvivani$ killall Finder

monitoring HTTP requests on the fly

Posted on July 27, 2016 by hvivani

Install httpry from the package repositories:

sudo yum install httpry

Or, if the package is not available, build it from source:

$ sudo yum install gcc make git libpcap-devel
$ git clone https://github.com/jbittel/httpry.git
$ cd httpry
$ make
$ sudo make install

then run:

sudo httpry -i eth0

The output will look like this:

httpry version 0.1.8 -- HTTP logging and information retrieval tool
Copyright (c) 2005-2014 Jason Bittel <jason.bittel@gmail.com>
Starting capture on eth0 interface
2016-07-27 14:20:59.598    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:20:59.599    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:22:02.034    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:22:02.034    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:23:04.640    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:23:04.640    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:24:07.122    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:24:07.123    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
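
When the capture gets long, a small filter can summarize it. The following is a rough sketch that assumes the default whitespace-separated output format shown above (direction marker in the fifth column, then method, host and URI); the function name count_requests is hypothetical.

```shell
#!/bin/sh
# Count outgoing requests per "METHOD host/uri" from httpry output on stdin.
# Assumes the default column layout shown above; response lines ("<") are skipped.
count_requests() {
    awk '$5 == ">" { count[$6 " " $7 $8]++ }
         END { for (k in count) print count[k], k }'
}

# Example with two lines in the format shown above.
printf '%s\n' \
  '2016-07-27 14:20:59.598    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -' \
  '2016-07-27 14:20:59.599    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK' \
  | count_requests
```

With a live capture, the same function could be fed from a saved log, for example: count_requests < capture.log (the file name is only an example).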

Control Characters on vi Linux editor

Posted on July 25, 2016 by hvivani

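Show hidden/control characters in vi:

:set list

Hide hidden/control characters in vi:

:set nolist

Remove carriage returns (shown as ^M) in vi. Type the ^M as Ctrl-V followed by Ctrl-M:

:%s/^M//g

Alternatively, delete the last character of every line (only safe when every line ends with the control character):

:%s/.$//g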
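
The same cleanup can be done outside the editor. This is a minimal sketch using tr, which is not part of the vi commands above; the function name strip_cr is hypothetical.

```shell
#!/bin/sh
# Remove carriage returns (the ^M characters) from text read on stdin.
strip_cr() {
    tr -d '\r'
}

# Example: DOS-style lines become Unix-style lines.
printf 'hello\r\nworld\r\n' | strip_cr
```

For a file, something like strip_cr < input.txt > output.txt would apply (file names are only examples). Unlike the second vi command, this removes carriage returns only and leaves every other character untouched.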

FileInputFormat vs. CombineFileInputFormat

Posted on July 12, 2016 by hvivani

When you put a file into HDFS, it is split into blocks of 128 MB (the default value for HDFS on EMR). Consider a file big enough to consume 10 blocks. When you read that file from HDFS as input for a MapReduce job, the same blocks are usually mapped, one by one, to splits. In this case, the file is divided into 10 splits (which means 10 map tasks) for processing. By default, the block size and the split size are equal, but the sizes depend on the configuration settings for the InputSplit class.

From a Java programming perspective, the class responsible for this conversion is called an InputFormat, which is the main entry point into reading data from HDFS. From the blocks of the files, it creates a list of InputSplits. For each split, one mapper is created. Then each InputSplit is divided into records by using the RecordReader class. Each record represents a key-value pair.
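
The arithmetic behind the "10 blocks, 10 splits, 10 map tasks" example can be sketched as follows. It is a simplified model that assumes the split size equals the block size and ignores details such as the small tolerance Hadoop allows on the final split; the function name num_splits is hypothetical.

```shell
#!/bin/sh
# Estimate the number of splits (and map tasks) for one file,
# assuming split size == block size (the default described above).
BLOCK_SIZE_MB=128

# num_splits FILE_SIZE_MB [BLOCK_SIZE_MB] -> ceil(file size / block size)
num_splits() {
    file_mb=$1
    block_mb=${2:-$BLOCK_SIZE_MB}
    echo $(( (file_mb + block_mb - 1) / block_mb ))
}

# A 1280 MB file occupies 10 blocks, so it yields 10 splits and 10 map tasks.
num_splits 1280
```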

`FileInputFormat` vs. `CombineFileInputFormat`

Before a MapReduce job is run, you can specify the InputFormat class to be used. The implementation of FileInputFormat requires you to create an instance of the RecordReader, and as mentioned previously, the RecordReader creates the key-value pairs for the mappers.

FileInputFormat is an abstract class that is the basis for a majority of the implementations of InputFormat. It contains the location of the input files and an implementation of how splits must be produced from these files. How the splits are converted into key-value pairs is defined in the subclasses. Some examples of its subclasses are TextInputFormat, KeyValueTextInputFormat, and CombineFileInputFormat.

Hadoop works more efficiently with large files (files that occupy more than 1 block). FileInputFormat converts each large file into splits, and each split is created in a way that contains part of a single file. As mentioned, one mapper is generated for each split.

[Figure: FileInputFormat with a large file]

However, when the input files are smaller than the default block size, many splits (and therefore many mappers) are created. This arrangement makes the job inefficient. The figure below shows how too many mappers are created when FileInputFormat is used for many small files.

[Figure: FileInputFormat with many small files]

CombineFileInputFormat was introduced to avoid this situation. This InputFormat works well with small files, because it packs many of them into one split so there are fewer mappers, and each mapper has more data to process. The figure below shows how CombineFileInputFormat treats the small files so that fewer mappers are created.

[Figure: CombineFileInputFormat with small files]
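
To get a feeling for the difference, here is a rough back-of-the-envelope model. It assumes every file is smaller than one block, uses a hypothetical maximum combined split size of 128 MB, and ignores data locality, so the real mapper counts may differ. The function names are hypothetical.

```shell
#!/bin/sh
# Simplified mapper-count model for many small files.
# Assumes every file is smaller than one block and ignores data locality.

# FileInputFormat: one split (and one mapper) per small file.
mappers_file_input_format() {
    num_files=$1
    echo "$num_files"
}

# CombineFileInputFormat: files are packed into splits of at most max_split_mb,
# estimated here as ceil(total input size / max split size).
mappers_combine_file_input_format() {
    num_files=$1
    file_mb=$2
    max_split_mb=$3
    total_mb=$(( num_files * file_mb ))
    echo $(( (total_mb + max_split_mb - 1) / max_split_mb ))
}

# Example: 1000 files of 1 MB each, combined splits capped at 128 MB.
echo "FileInputFormat:        $(mappers_file_input_format 1000) mappers"
echo "CombineFileInputFormat: $(mappers_combine_file_input_format 1000 1 128) mappers"
```

In this model the job goes from 1000 mappers to 8, which is the effect the figures illustrate.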

Start Hive in Debug Mode

Posted on June 22, 2016 by hvivani

Never go out without it:

hive --hiveconf hive.root.logger=DEBUG,console

mac: installing homebrew

Posted on May 6, 2016 by hvivani

To download and install Homebrew, run the install script on the command line:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

get the driver’s IP in spark yarn-cluster mode

Posted on March 30, 2016 by hvivani

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Sometimes we will have a bunch of logs for a terminated cluster and we need to find out which node was the driver in cluster mode.

Searching for “driverUrl” in the application/container logs, we will find it:

find . -iname "*.gz" | xargs zgrep "driverUrl"
./container_1459071485818_0006_02_000001/stderr.gz:15/03/28 05:10:47 INFO YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://CoarseGrainedScheduler@172.31.16.15:47452,  executorHostname: ip-172-31-16-13.ec2.internal
...
./container_1459071485818_0006_02_000001/stderr.gz:15/03/28 05:10:47 INFO YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://CoarseGrainedScheduler@172.31.16.15:47452,  executorHostname: ip-172-31-16-14.ec2.internal

In this case, the driver was running on 172.31.16.15.
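
To pull out just the address instead of reading it off the log lines, a small sed filter can help. This is a sketch that assumes the driverUrl format shown above with an IPv4 address; the function name driver_ips is hypothetical.

```shell
#!/bin/sh
# Extract the distinct driver IP(s) from lines containing "driverUrl".
# Reads log lines on stdin (for example, the zgrep output shown above).
driver_ips() {
    sed -n 's|.*driverUrl: spark://[^@]*@\([0-9.]*\):[0-9]*.*|\1|p' | sort -u
}

# Example with a sample line in the format found in the logs above.
echo 'INFO YarnAllocator: Launching ExecutorRunnable. driverUrl: spark://CoarseGrainedScheduler@172.31.16.15:47452,  executorHostname: ip-172-31-16-13.ec2.internal' | driver_ips
```

Against real logs, the filter would go at the end of the pipeline: find . -iname "*.gz" | xargs zgrep "driverUrl" | driver_ips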

Hernan Vivani's Blog

Linux, Big Data, AWS, Astronomy, Running, Cycling… and more