A Fundamental Guide to Setting Up PySpark for ETL

Spark is built to handle the massive volumes of data found in many modern use cases. It is an open-source project under the Apache Software Foundation.

Spark can read data stored in a variety of formats, including Parquet files.
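
For example, assuming a SparkSession object named spark already exists (we create one later in this post) and the file path below is just a placeholder, reading a Parquet file into a DataFrame looks like this:

# read a Parquet file into a DataFrame (the path is a placeholder)
df = spark.read.parquet("path/to/your_data.parquet")
df.show(5)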

What is Spark?

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances.

On top of the Spark core data processing engine, there are libraries for SQL, machine learning, etc. Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets.

What Does Spark Do?

It has an extensive set of developer libraries and APIs and supports languages such as Java, Python, R, and Scala. Its flexibility makes it well suited to a range of use cases; in this blog, we will just talk about data integration. The Spark Python API, called PySpark, exposes the Spark programming model to Python for working with structured data.

Data produced by different application systems across a business needs to be processed for reporting and analysis. Spark is used to reduce the cost and time required for this ETL process.
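
To make that concrete, here is a minimal ETL sketch in PySpark. The file paths, column names, and transformation are invented purely for illustration; the pattern is simply extract (read), transform (filter and derive a column), and load (write):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('simple_etl').getOrCreate()

# extract: read raw data produced by an application system (hypothetical path)
orders = spark.read.csv("raw/orders.csv", header=True, inferSchema=True)

# transform: keep completed orders and add a derived column
completed = (orders
             .filter(F.col("status") == "completed")
             .withColumn("total_inc_tax", F.col("total") * 1.1))

# load: write the result out as Parquet for reporting and analysis
completed.write.mode("overwrite").parquet("curated/orders_completed")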

How can we set up PySpark?

There are heaps of ways to set up PySpark, including VirtualBox, Databricks, AWS EMR, AWS EC2, Anaconda, etc. In this blog, I will just talk about setting up PySpark with Anaconda.

  1. Download the Anaconda version for your operating system and install it
  2. Create a new named environment
  3. Install PySpark through the “Anaconda Prompt” terminal; just be careful that the Python environment needs to be set up with Python 3.7 or lower, because PySpark did not support Python 3.8 at the time of writing. The commands, and a quick check that the install worked, are shown below.

conda create --name yournamedenvironment
conda create -n yournamedenvironment python=3.7
conda install -n yournamedenvironment pyspark
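
Once the environment is ready, a quick sanity check (my own suggestion, not part of the official steps) is to run the following in a Python session inside that environment:

# confirm that PySpark can be imported and check its version
import pyspark
print(pyspark.__version__)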

Then we can launch different IDEs from the Anaconda Navigator home screen:

JupyterLab is highly recommended here.

After we launch JupyterLab, a .ipynb file can be created on localhost.

Spark DataFrame Basics

DataFrames and Spark SQL are the things we need to get familiar with in PySpark. If we have worked with pandas in Python, SQL, R, or Excel before, the DataFrame will feel familiar.

Initiating a SparkSession is the essential first step.

# start a simple Spark session (the app name below is arbitrary)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()

After running it in a single cell, we can see whether PySpark is installed successfully.
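
To get a first feel for DataFrames and Spark SQL, here is a small, self-contained sketch that reuses the spark session created above. The column names and values are invented purely for illustration:

# create a tiny DataFrame from in-memory rows
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# DataFrame API: filter rows and display the result
df.filter(df.age > 30).show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()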

If you are interested in or have any problems with PySpark, feel free to contact me.

Or you can connect with me through my LinkedIn.

Originally published at http://jacquiwucom.wordpress.com on August 31, 2020.

Written by

A current BI Analyst in a subsidiary under Webjet, with experience in applying data science techniques to business.
