Code Monkey home page Code Monkey logo

mse's Issues

Availability for Spark 3.x

I absolutely love this package! I'm far from being a Spark/Scala pro but it makes my live so much easier working with jsons and transforming them.
However, executing it with Spark 3.1 and Scala 2.12 I am getting the following error:
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
How can I make it available for Spark 3.x? Thx!

Cannot operate on a struct inside an array

Consider the following spark-shell session, which has loaded the latest published version of both this library, and spark-hofs.

import com.github.fqaiser94.mse.methods._
import za.co.absa.spark.hofs._
import spark.implicits._

val jsonText = """{
     |   "data1": [
     |     {
     |       "vlan": {
     |         "id": "195",
     |         "name": "Subnet-54.14.195"
     |       }
     |     },
     |     {
     |       "vlan": {
     |         "id": "195",
     |         "name": "Subnet-54.14.193"
     |       }
     |     }
     |   ]
     | }""".stripMargin

val df = spark.read.option("multiline", "true").json(Seq(jsonText).toDS())

# this works fine
df.withColumn("data1", transform($"data1", col => col.withFieldRenamed("vlan", "vlan_renamed"))).printSchema

# but what if I want to rename "name" to "device_name" ?
# I thought it would be something like this
df.withColumn("data1", transform($"data1", col => col.withField("vlan", $"data1.vlan".withFieldRenamed("name", "device_name")))).printSchema
# but it fails with:
org.apache.spark.sql.AnalysisException: cannot resolve 'rename_fields(`data1`.`vlan`, 'name', 'device_name')' due to data type mismatch: Only struct is allowed to appear at first position, got: array.;;
'Project [transform(data1#44, lambdafunction(add_fields(lambda elm#58, vlan, rename_fields(data1#44.vlan, name, device_name)), lambda elm#58, false)) AS data1#57]
+- LogicalRDD [data1#44], false

I believe that the problem might be here. If this is an array, then we should descend into the elementType to see if it's a StructType. Does this seem on the right track, or am I off the mark?

SyntaxError: invalid syntax

Hello All,
I am using spark-submit to submit a job by modifying a nested struct column by adding a new field using withField()

Following is the spark-submit command:

spark-submit --master local --packages com.github.fqaiser94:mse_2.11:0.2.4 --py-files mse.zip write_data.py

I have ziped mse, since I dont want to install it globally and also I am going to execute it in AWS EMR.

I am getting following error:

Traceback (most recent call last):
File "/Users/fkj/DataLakeManagement/archive/write_data.py", line 3, in
from mse import *
File "/Users/fkj/DataLakeManagement/archive/mse.zip/mse/init.py", line 1, in
File "/Users/fkj/DataLakeManagement/archive/mse.zip/mse/methods.py", line 5
def __withField(self: Column, fieldName: str, fieldValue: Column):
^
SyntaxError: invalid syntax

Following is my mse.zip:

mse.zip

write_data.py:

from datetime import datetime

from mse import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StringType, TimestampType, IntegerType


def spark():
    """
    This function is invoked to create the Spark Session.

    :return: the spark session
    """
    spark_session = (SparkSession
                     .builder
                     .appName("Data_Experimentation_Framework")
                     .getOrCreate())

    spark_session.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
    return spark_session


address = StructType().add("state", StringType())
schema = StructType().add("_id", IntegerType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())\
    .add("address", address)

employees = [{'_id': 1,
              'employer': 'Microsoft',
              'created_at': datetime.now(),
              'name': 'Noel',
              'address': {
                  "state": "Pennsylvania"
              }

              },
             {'_id': 2,
              'employer': 'Apple',
              'created_at': datetime.now(),
              'name': 'Steve',
              'address': {
                  "state": "New York"
              }
              }
             ]
df = spark().createDataFrame(employees, schema=schema)

df.withColumn("address", f.col("address").withField("country", f.lit("USA")))

df.printSchema()

df.write \
    .format("parquet") \
    .mode("append") \
    .save("/Users/felixkizhakkeljose/Downloads/test12")

Could you help me to identify what am I doing wrong?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.