Apache Spark is a core technology for large-scale data analytics. Microsoft Fabric provides support for Spark clusters, enabling you to analyze and process data in a Lakehouse at scale.
Download the files listed above from Fabric - Google Drive, upload them to the Files/DataFiles folder of your lakehouse, and then write the Spark code below to read one of them:
# Read a single CSV file; header is "false" because the files do not contain a header row.
df = spark.read.format("csv").option("header", "false").load("Files/DataFiles/2019.csv")
display(df)
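With header set to "false" and no schema supplied, Spark assigns generic column names (_c0, _c1, ...) and reads every column as a string. If you want to confirm that before defining a schema, a quick optional check is:

# Expect generic names such as _c0, _c1, ... with every column typed as string.
df.printSchema()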
from pyspark.sql.types import *

# Define an explicit schema so every file is read with consistent column names and types.
orderDataSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])
# Read every CSV file in the folder with the explicit schema; the wildcard matches all the yearly files.
df = spark.read.format("csv").schema(orderDataSchema).load("Files/DataFiles/*.csv")
display(df)
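As an optional sanity check (not part of the original steps), you can verify that the wildcard picked up all the files and that orderDataSchema was applied:

# Total row count across all loaded files, plus the declared column names and types.
print(df.count())
df.printSchema()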
# Project just the customer columns, then count and display the distinct customers.
customers = df.select("CustomerName", "Email")
print(customers.distinct().count())
display(customers.distinct())
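An equivalent way to get the same result, if you prefer a single chained expression, is dropDuplicates on the two columns; this is a minimal sketch using the same df as above:

# One row per unique CustomerName/Email pair, same output as customers.distinct().
display(df.select("CustomerName", "Email").dropDuplicates(["CustomerName", "Email"]))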
# Group by Item and sum the only numeric column in the selection (UnitPrice).
sumUnitPrice = df.select("Item", "UnitPrice").groupBy("Item").sum()
display(sumUnitPrice)
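The generated aggregate column is named sum(UnitPrice). If you prefer an explicit aggregation with a friendlier column name, a minimal alternative using the same DataFrame is:

from pyspark.sql import functions as F

# Total UnitPrice per Item, with a readable alias instead of sum(UnitPrice).
sumUnitPrice = df.groupBy("Item").agg(F.sum("UnitPrice").alias("TotalUnitPrice"))
display(sumUnitPrice)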