Now that we have introduced Big Data, let's dive into how we actually write Spark code using a Jupyter-like IDE (one of the IDEs in which we can write Python code against the Spark engine).
Jupyter notebooks, or Databricks notebooks, are browser-based IDEs set up on top of AWS / Azure clusters for working with PySpark (the Spark framework exposed through Python APIs to perform Big Data operations).
Though Spark is written in Scala, it also exposes APIs in Python, R, and Java that let us work with RDDs efficiently (in a declarative style), and since Spark 2.0 DataFrames are the default way of working with the underlying RDDs.
Also, since Spark evaluates lazily (not every step has to be executed immediately), work on DataFrames is split into two kinds of operations, Transformations and Actions, as illustrated in the sketch after this list.
Transformations => only define the recipe to be applied to the dataset.
Actions => actually execute the operation that the recipe defines.
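To make this distinction concrete, here is a minimal sketch (a self-contained toy example, not part of the dataset we use later): the filter and select calls below are transformations that only build up an execution plan, and nothing runs on the cluster until the show action at the end.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").getOrCreate()

# A tiny in-memory DataFrame, purely for illustration
df_demo = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

# Transformations: Spark only records the recipe here, no job is launched
filtered = df_demo.filter(df_demo["id"] > 1)
projected = filtered.select("letter")

# Action: only at this point does Spark actually run a job and compute the result
projected.show()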
We will now create an Azure Databricks notebook and get going with the Azure Databricks service (all the setup is pre-done here).
The initial step is to set up the cluster that will process our large files / datasets, like below:
We will then create a notebook in Azure Databricks, set its language to Python, and attach it to the vamseesamplecluster we created earlier, like below:
Next, we will upload a local CSV file (sample data) into DBFS (Databricks File System, Databricks' distributed file system layer) and create a table from it, as shown below:
Once the new table is created in DBFS, following the steps outlined in the images above, we can access that data through a SparkSession via the PySpark API and load it into a DataFrame using the lines below in our notebook:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Operations").getOrCreate()
df = spark.read.csv('/FileStore/tables/appl_stock.csv',inferSchema=True,header=True)
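Before going further, it is worth a quick sanity check that inferSchema derived sensible column types from the CSV header and sample rows; the two lines below are just an optional verification step:

# Print the schema Spark inferred (column names and data types)
df.printSchema()

# List the column names as Spark sees them
print(df.columns)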
If we just want to view the uploaded table as-is, without any other operations, we can display it with the line below:
spark.sql("SELECT * FROM appl_stock_csv").show()
The syntax shown above looks similar to standard SQL and is termed Spark SQL (there are minor differences between SQL and Spark SQL, which are outside the scope of this blog post).
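For completeness, a DataFrame loaded from a file can also be exposed to Spark SQL by registering it as a temporary view; the view name 'appl_stock_view' below is just an illustrative choice, and the query result is itself a DataFrame, so SQL and DataFrame code can be mixed freely:

# Register the DataFrame so it can be referenced from Spark SQL
df.createOrReplaceTempView("appl_stock_view")

# The query returns a new DataFrame; show() is the action that displays it
spark.sql("SELECT * FROM appl_stock_view LIMIT 5").show()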
If we want the DataFrame to display its contents (i.e. the computed result), we perform the show operation like below:
df.show()
However, do note that defining a DataFrame (and any transformations chained onto it) is lazy: it only records what should be done. The .show() method (like .collect()) is an action performed on that DataFrame, and it triggers an actual Spark job that consumes cluster resources to fetch the data.
So, while working on large files, keep the number of calls to .show() to a minimum and invoke it only once all transformations are defined and the computed result is truly required, as in the sketch below.
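As a rough illustration of that advice, the sketch below chains the transformations first and calls .show() exactly once at the end; the 'Date', 'Open' and 'Close' column names are assumptions based on what a typical stock price CSV contains, so adjust them to the actual schema:

# Transformations only: Spark keeps building a single execution plan
gains = df.filter(df["Close"] > df["Open"]).select("Date", "Open", "Close")

# Single action at the very end: one Spark job computes the whole pipeline
gains.show(5)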
Some shortcuts while working in ADB (Azure Databricks) notebooks are listed below:
Shift + Enter to run a cell
Tab to list the available methods / properties of an object
Shift + Tab to get the docstring of a function
Happy Coding :)