Announcing the general availability of AWS Local Zones in Buenos Aires, Copenhagen, Helsinki, and Muscat.


AWS Local Zones are now available in four new metro areas—Buenos Aires, Copenhagen, Helsinki, and Muscat. You can now use these Local Zones to deliver applications that require single-digit millisecond latency or local data processing.

Source: https://aws.amazon.com/es/about-aws/whats-new/2022/11/aws-local-zones-buenos-aires-copenhagen-helsinki-muscat/?nc1=h_ls

The Amazon Distributed Computing Manifesto


Back in 1998, a group of senior engineers at Amazon wrote the Distributed Computing Manifesto, an internal document that would go on to influence the next two decades of system and architecture design at Amazon.

Source (Werner Vogels): https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html

Amazon Project Kuiper – job openings in my team


Are you looking for new challenges?
At Project Kuiper we are working to provide broadband internet service to tens of millions of people around the world who are currently underserved. Come join us!

Do you want to know more about this project? Have a look at this video:


Here are some of our current openings. Feel free to reach out directly to me if you want to know more about these or other positions.

Software Engineer

Senior Systems Development Engineer

Senior ECAD Tools Application Engineer

Systems Dev Engineer Enterprise Engineering

Atlassian Support Engineer

Systems Engineer

Linux Kernel Tuning: task blocked for more than 120 seconds


This might be old school, and maybe even boring reading. But if you care about performance on Linux servers, at some point you will have to look at the kernel messages.

The problem:

When we run very stressful jobs on large servers (with a large number of CPUs and plenty of RAM) where I/O activity is very high, it is pretty common to start seeing these messages in the ‘dmesg’ kernel output:

[24169.372862] INFO: task kswapd1:1140 blocked for more than 120 seconds.
[24169.375623] Tainted: G E 4.9.51-10.52.amzn1.x86_64 #1
[24169.378445] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24169.382533] kswapd1 D 0 1140 2 0x00000000
[24169.385066] ffff8811605c5a00 0000000000000000 ffff8823645dc900 ffff882362844900
[24169.389208] ffff882371357c00 ffffc9001a13ba08 ffffffff8153896c ffffc9001a13ba18
[24169.393329] ffff881163ac92d8 ffff88115e87f400 ffff882362844900 ffff88115e87f46c
[24169.445313] Call Trace:
[24169.446981] [<ffffffff8153896c>] ? __schedule+0x23c/0x680
[24169.449454] [<ffffffff81538de6>] schedule+0x36/0x80
[24169.451790] [<ffffffff8153907e>] schedule_preempt_disabled+0xe/0x10
[24169.454509] [<ffffffff8153a8d5>] __mutex_lock_slowpath+0x95/0x110
[24169.457156] [<ffffffff8153a967>] mutex_lock+0x17/0x27
[24169.459546] [<ffffffffa0966d00>] xfs_reclaim_inodes_ag+0x2b0/0x330 [xfs]
[24169.462407] [<ffffffffa0967be3>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[24169.465203] [<ffffffffa0978f59>] xfs_fs_free_cached_objects+0x19/0x20 [xfs]
...

Why is this happening?

This indicates that a task was blocked waiting on a block device (such as a disk or swap) for more than 120 seconds without its request being fulfilled, at which point the kernel’s hung-task watchdog reported it.
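The 120-second window comes from that hung-task watchdog (the log above even shows how to silence it). As a quick sketch, you can inspect or widen the window while investigating; note this only changes the reporting, it does not fix the underlying I/O stall:

sysctl kernel.hung_task_timeout_secs
# Widen the watchdog window to 300 seconds instead of disabling it outright:
sudo sysctl -w kernel.hung_task_timeout_secs=300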

As mentioned before, the probability of observing this behavior increases on instances with a large number of vCores (e.g. 64+), since the volume of I/O requests can be higher and the kernel buffer queues may not be configured for such load.

This is also very common in clustered environments. On Hadoop/YARN, even though the framework will retry, this impacts performance and might also lead to application failures.

The solution:

To solve this problem, we have to increase the ‘dirty_background_bytes’ kernel setting to accommodate the throughput.

As a base formula, we usually consider a value of dirty_background_bytes = 10 MB for every 40 MB/s of throughput.

The goal is for the page cache to let the OS write to disk asynchronously whenever possible. Otherwise, when writes become synchronous, processes have to wait on them.
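For example, scaling that rule of thumb linearly (the 400 MB/s figure below is an assumed workload, not a measured one):

# Rule of thumb: 10 MB of dirty_background_bytes per 40 MB/s of throughput.
# For an assumed sustained throughput of 400 MB/s: 400 / 40 * 10 MB = 100 MB.
sudo sysctl -w vm.dirty_background_bytes=$((100 * 1024 * 1024))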

In addition to the dirty_background_bytes kernel parameter, we can also set:

  • dirty_background_ratio = 10 (the percentage of system memory that may be dirty before the kernel starts writing data out in the background: once dirty pages exceed 10% of memory, I/O starts and they begin getting flushed to disk)
  • dirty_ratio = 15 (the percentage of system memory at which a process generating writes will itself block and write out dirty pages: once total dirty pages exceed 15%, writes block until some of the dirty pages reach disk. This acts as the upper limit)

We try to keep these two values low to avoid I/O bottlenecks. We can experiment with setting both down to zero to force fast flushes in high-stress environments.
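Before experimenting, it helps to see what the running kernel currently uses:

sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_background_bytes vm.dirty_bytes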

Another parameter we can use is dirty_bytes, which represents the amount of dirty memory at which a process generating disk writes will itself start writeback. Only one of dirty_bytes and dirty_ratio can be in effect at a time: setting dirty_bytes causes dirty_ratio to read as 0, and vice versa.

e.g.

dirty_bytes = 2147483648

More details on these parameters are available here.
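Putting it together, a sketch of the relevant /etc/sysctl.conf entries; the values are the ones discussed above and should be treated as starting points to tune per workload, not universal recommendations:

# /etc/sysctl.conf -- writeback tuning sketch
vm.dirty_background_bytes = 104857600   # 100 MB (10 MB per 40 MB/s, assuming ~400 MB/s)
vm.dirty_bytes = 2147483648             # 2 GiB; setting this makes dirty_ratio read as 0
# Reload with: sudo sysctl -p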

Linux Kernel Tuning: failed to alloc buffer for rx queue


If we put enough pressure on the ENA network driver, we’ll start seeing these “failed to alloc buffer for rx queue” messages in the ‘dmesg’ output.

[56459.833033] ena 0000:00:05.0 eth0: failed to alloc buffer for rx queue 4
[56459.836477] ena 0000:00:05.0 eth0: refilled rx qid 4 with only 85 buffers (from 168)

This message is raised when the NAPI handler fails to refill new Rx descriptors, typically due to a lack of memory. This can degrade performance, since the refill work has to be rescheduled.

The code handling these situations can be found here.

The solution is to increase the “min_free_kbytes” kernel parameter. For example:

vm.min_free_kbytes = 1048576

Place the setting above in /etc/sysctl.conf, then load it with:

sysctl -p

It is recommended to have at least 512 MB, with a minimum of 128 MB for constrained environments. On large instance types running stressful jobs (e.g. 64+ vCores and 256 GiB+ of RAM), this value is typically raised to 1 GB (1048576 kB, as in the example above).
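After applying the change, one way to watch for further allocation failures is sketched below; the exact counter names vary across ENA driver versions, so the grep pattern is a best guess at likely candidates rather than a guaranteed match:

# Kernel log: any new refill failures?
dmesg | grep -i 'failed to alloc buffer'
# Per-queue driver statistics, filtered for refill/allocation counters:
ethtool -S eth0 | grep -i -E 'alloc_fail|refil'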


s3:// vs s3n:// vs s3a:// vs EMRFS


s3://

Apache Hadoop implementation of a block-based filesystem backed by S3. Apache Hadoop has deprecated use of this filesystem as of May 2016.

s3n://

A native filesystem for reading and writing regular files on S3. S3N allows Hadoop to access files on S3 that were written with other tools, and conversely, other tools can access files written to S3N using Hadoop. S3N is stable and widely used, but it is not being updated with any new features. S3N requires a suitable version of the jets3t JAR on the classpath.

  • Uses jets3t

s3a://

Hadoop’s successor to the S3N filesystem. S3A uses Amazon’s libraries to interact with S3. S3A supports accessing files larger than 5 GB, and it provides performance enhancements and other improvements. For Apache Hadoop, S3A is the successor to S3N and is backward compatible with S3N. Using Apache Hadoop, all objects accessible from s3n:// URLs should also be accessible from S3A by replacing the URL scheme.

  • Uses AWS SDK.
  • Amazon EMR does not currently support use of the Apache Hadoop S3A file system.

EMRFS:

On Amazon EMR, both the s3:// and s3n:// URIs are associated with the EMR filesystem and are functionally interchangeable. For consistency’s sake, however, it is recommended to use the s3:// URI in the context of Amazon EMR.

EMRFS can be used by invoking the prefix s3n:// or s3:// or s3a:// depending on the client application implementation.
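As a quick illustration (the bucket and path below are made up), the same listing through the different schemes:

# On Amazon EMR, EMRFS backs both of these:
hadoop fs -ls s3://my-bucket/input/
hadoop fs -ls s3n://my-bucket/input/
# On vanilla Apache Hadoop, use S3A instead:
hadoop fs -ls s3a://my-bucket/input/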

Source: https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/

Indexing Common Crawl Metadata on Elasticsearch using Cascading


If you want to explore how to parallelize data ingestion into Elasticsearch, please have a look at this post I wrote for AWS:

http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch

It explains how to index Common Crawl metadata into Elasticsearch with the Cascading connector, reading directly from the S3 data source.

The Cascading source code is available here.