Apache Spark – How it Works – Performance Notes


This is an oldie but goldie…

Hernan Vivani's Blog

Apache Spark is an open-source cluster computing framework originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.

RDDs:

Spark has a driver program, where execution of the application logic starts, and multiple workers that process data in parallel.
The data is typically collocated with the workers and partitioned across the same set of machines within the cluster. During execution, the driver program passes the code (closure) to the worker machines, where the corresponding partition of data is processed.
The data undergoes successive transformation steps while staying in the same partition as much as possible (to avoid shuffling data across machines). At the end of the execution, actions will be executed at…
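
The driver/worker flow described above can be sketched in plain Python, without Spark itself. This is only an illustration under stated assumptions: threads stand in for worker machines, and `make_partitions`, `run_on_workers`, and `square_evens` are hypothetical names invented for this example, not part of the Spark API.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, num_partitions):
    """Split data into contiguous partitions (hypothetical helper, not Spark API)."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_on_workers(partitions, closure):
    """The 'driver' ships the same closure to every 'worker'.

    Each worker (a thread here, a machine in a real cluster) processes
    only its own partition, so no data moves between partitions.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(closure, partitions))


def square_evens(partition):
    """Transformation applied inside one partition: keep even values, square them."""
    return [x * x for x in partition if x % 2 == 0]


partitions = make_partitions(list(range(10)), 3)
transformed = run_on_workers(partitions, square_evens)

# The final "action" combines the per-partition results on the driver side.
total = sum(sum(part) for part in transformed)
print(partitions)
print(transformed)
print(total)
```

Because the transformation only looks at values inside its own partition, nothing has to be shuffled between workers; only the small per-partition results travel back to the driver for the final sum.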


Announcing the general availability of AWS Local Zones in Buenos Aires, Copenhagen, Helsinki, and Muscat.


AWS Local Zones are now available in four new metro areas—Buenos Aires, Copenhagen, Helsinki, and Muscat. You can now use these Local Zones to deliver applications that require single-digit millisecond latency or local data processing.

Source: https://aws.amazon.com/es/about-aws/whats-new/2022/11/aws-local-zones-buenos-aires-copenhagen-helsinki-muscat/?nc1=h_ls

AWS Regions


AWS Global Infrastructure at a glance!

With a new Region launched in Switzerland 🇨🇭 last week, AWS now has 28 active Regions and 90 Availability Zones! Let’s break down what it actually means ⬇️

What is an AWS Region? It is a separate geographic area that consists of multiple isolated Availability Zones (AZs). AWS offers Regions with a multi-AZ design.

What is an Availability Zone? An Availability Zone (AZ) consists of one or more data centers at a location within an AWS Region. All AZs in an AWS Region are connected through redundant, ultra-low-latency networks.

What is a Data Center? A Data Center (DC) is a physical facility that houses all the necessary IT equipment, such as servers, storage, network systems, routers, and firewalls.

Many Data Centers -> AZ, many AZs -> Region, many Regions -> AWS Global Infrastructure!

Image credit: AWSGeek
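
The containment hierarchy described above (data centers inside AZs, AZs inside a Region) can be modelled as nested records. This is a minimal sketch for illustration only: the class names, the `example-region-1` identifiers, and the data center names are all made up, and it is not how AWS exposes this information.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataCenter:
    name: str  # hypothetical identifier; AWS does not publish data center names


@dataclass
class AvailabilityZone:
    name: str
    data_centers: list[DataCenter] = field(default_factory=list)


@dataclass
class Region:
    name: str
    zones: list[AvailabilityZone] = field(default_factory=list)

    def data_center_count(self) -> int:
        """Total data centers across every AZ in this Region."""
        return sum(len(zone.data_centers) for zone in self.zones)


# Made-up Region with three AZs, each holding one or more data centers.
region = Region("example-region-1", [
    AvailabilityZone("example-region-1a", [DataCenter("dc-1"), DataCenter("dc-2")]),
    AvailabilityZone("example-region-1b", [DataCenter("dc-3")]),
    AvailabilityZone("example-region-1c", [DataCenter("dc-4")]),
])

print(len(region.zones))
print(region.data_center_count())
```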

The Amazon Distributed Computing Manifesto


Back in 1998, a group of senior engineers at Amazon wrote the Distributed Computing Manifesto, an internal document that would go on to influence the next two decades of system and architecture design at Amazon.

Source (Werner Vogels): https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html