fqaiser94 / mse Goto Github PK

Make Structs Easy (MSE)

License: Apache License 2.0

Scala 74.62% Python 25.32% Shell 0.05%

spark nested struct pyspark scala python

mse's Issues

Cannot operate on a struct inside an array

Consider the following spark-shell session, which has loaded the latest published version of both this library, and spark-hofs.

import com.github.fqaiser94.mse.methods._
import za.co.absa.spark.hofs._
import spark.implicits._

val jsonText = """{
     |   "data1": [
     |     {
     |       "vlan": {
     |         "id": "195",
     |         "name": "Subnet-54.14.195"
     |       }
     |     },
     |     {
     |       "vlan": {
     |         "id": "195",
     |         "name": "Subnet-54.14.193"
     |       }
     |     }
     |   ]
     | }""".stripMargin

val df = spark.read.option("multiline", "true").json(Seq(jsonText).toDS())

# this works fine
df.withColumn("data1", transform($"data1", col => col.withFieldRenamed("vlan", "vlan_renamed"))).printSchema

# but what if I want to rename "name" to "device_name" ?
# I thought it would be something like this
df.withColumn("data1", transform($"data1", col => col.withField("vlan", $"data1.vlan".withFieldRenamed("name", "device_name")))).printSchema
# but it fails with:
org.apache.spark.sql.AnalysisException: cannot resolve 'rename_fields(`data1`.`vlan`, 'name', 'device_name')' due to data type mismatch: Only struct is allowed to appear at first position, got: array.;;
'Project [transform(data1#44, lambdafunction(add_fields(lambda elm#58, vlan, rename_fields(data1#44.vlan, name, device_name)), lambda elm#58, false)) AS data1#57]
+- LogicalRDD [data1#44], false

I believe that the problem might be here. If this is an array, then we should descend into the elementType to see if it's a StructType. Does this seem on the right track, or am I off the mark?

SyntaxError: invalid syntax

Hello All,
I am using spark-submit to submit a job by modifying a nested struct column by adding a new field using withField()

Following is the spark-submit command:

spark-submit --master local --packages com.github.fqaiser94:mse_2.11:0.2.4 --py-files mse.zip write_data.py

I have ziped mse, since I dont want to install it globally and also I am going to execute it in AWS EMR.

I am getting following error:

Traceback (most recent call last):
File "/Users/fkj/DataLakeManagement/archive/write_data.py", line 3, in
from mse import *
File "/Users/fkj/DataLakeManagement/archive/mse.zip/mse/init.py", line 1, in
File "/Users/fkj/DataLakeManagement/archive/mse.zip/mse/methods.py", line 5
def __withField(self: Column, fieldName: str, fieldValue: Column):
^
SyntaxError: invalid syntax

Following is my mse.zip:

mse.zip

write_data.py:

from datetime import datetime

from mse import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StringType, TimestampType, IntegerType


def spark():
    """
    This function is invoked to create the Spark Session.

    :return: the spark session
    """
    spark_session = (SparkSession
                     .builder
                     .appName("Data_Experimentation_Framework")
                     .getOrCreate())

    spark_session.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
    return spark_session


address = StructType().add("state", StringType())
schema = StructType().add("_id", IntegerType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())\
    .add("address", address)

employees = [{'_id': 1,
              'employer': 'Microsoft',
              'created_at': datetime.now(),
              'name': 'Noel',
              'address': {
                  "state": "Pennsylvania"
              }

              },
             {'_id': 2,
              'employer': 'Apple',
              'created_at': datetime.now(),
              'name': 'Steve',
              'address': {
                  "state": "New York"
              }
              }
             ]
df = spark().createDataFrame(employees, schema=schema)

df.withColumn("address", f.col("address").withField("country", f.lit("USA")))

df.printSchema()

df.write \
    .format("parquet") \
    .mode("append") \
    .save("/Users/felixkizhakkeljose/Downloads/test12")

Could you help me to identify what am I doing wrong?

I absolutely love this package! I'm far from being a Spark/Scala pro but it makes my live so much easier working with jsons and transforming them.
However, executing it with Spark 3.1 and Scala 2.12 I am getting the following error:
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
How can I make it available for Spark 3.x? Thx!

fqaiser94 / mse Goto Github PK

mse's Issues

Cannot operate on a struct inside an array

SyntaxError: invalid syntax

Availability for Spark 3.x

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent