Apache Spark is a core technology for large-scale data analytics. Microsoft Fabric provides support for Spark clusters, enabling you to analyze and process data in a Lakehouse at scale.
Download the files listed above from Fabric - Google Drive, upload them to the Files/DataFiles folder of your lakehouse, and then write the Spark code below to read one of them:
# Read a single CSV file; header is "false" because the files do not contain a header row.
df = spark.read.format("csv").option("header", "false").load("Files/DataFiles/2019.csv")
display(df)
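With header set to "false" and no schema supplied, Spark assigns generic column names (_c0, _c1, ...) and reads every column as a string. If you want to confirm that before defining a schema, a quick optional check is:

# Expect generic names such as _c0, _c1, ... with every column typed as string.
df.printSchema()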
from pyspark.sql.types import *

# Define an explicit schema so every file is read with consistent column names and types.
orderDataSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])
# Read every CSV file in the folder with the explicit schema; the wildcard matches all the yearly files.
df = spark.read.format("csv").schema(orderDataSchema).load("Files/DataFiles/*.csv")
display(df)
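As an optional sanity check (not part of the original steps), you can verify that the wildcard picked up all the files and that orderDataSchema was applied:

# Total row count across all loaded files, plus the declared column names and types.
print(df.count())
df.printSchema()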
# Project just the customer columns, then count and display the distinct customers.
customers = df.select("CustomerName", "Email")
print(customers.distinct().count())
display(customers.distinct())
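An equivalent way to get the same result, if you prefer a single chained expression, is dropDuplicates on the two columns; this is a minimal sketch using the same df as above:

# One row per unique CustomerName/Email pair, same output as customers.distinct().
display(df.select("CustomerName", "Email").dropDuplicates(["CustomerName", "Email"]))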
# Group by Item and sum the only numeric column in the selection (UnitPrice).
sumUnitPrice = df.select("Item", "UnitPrice").groupBy("Item").sum()
display(sumUnitPrice)
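The generated aggregate column is named sum(UnitPrice). If you prefer an explicit aggregation with a friendlier column name, a minimal alternative using the same DataFrame is:

from pyspark.sql import functions as F

# Total UnitPrice per Item, with a readable alias instead of sum(UnitPrice).
sumUnitPrice = df.groupBy("Item").agg(F.sum("UnitPrice").alias("TotalUnitPrice"))
display(sumUnitPrice)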