Posts

Discovering the Power of AWS Glue in Large Scale Data Analysis

Image
The Challenge A modern technology-driven business is awash in mountains of sorted and unsorted data that can be challenging to transform into something that allows for deep analysis. From a development perspective, defining, building, and normalizing or cataloging data in a way that allows for easier analysis is a daunting task, especially if new data sets are identified as needing analysis.   Developers or database administrators have to create processes to extract the information from whichever repository it is stored in, transform it into a common set of data types or even fields, and ultimately load that information in an indexed format to a location that can then be accessed by business intelligence toolsets.  It is this Extract/Transform/Load (ETL) process that typically is a bottleneck in the overall analysis process. Besides the need to create a new process each time a new dataset is identified, the need to set up routines to run these ELT processes at set times when certain co…

Basics of Spark Performance Tuning & Introducing SparkLens

Image
Apache Spark has a colossal importance in the Big Data field and unless one is living under a rock, every Big Data professional might have used Spark for data processing. Spark may sometimes appear to be a beast that’s difficult to tame, in terms of configuration and tuning queries. Mistakes like long running operations, inefficient queries, high concurrency will individually or collectively affect the Spark job.
Before going further let’s go through some basic jargons of Spark: Executor: An executor is a single JVM process which is launched for an application on a worker node. Executor runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. A single node can run multiple executors, and executors for an application can span multiple worker nodes. Task: A task is a unit of work that can be run on a partition of a distributed dataset, and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks…