
Comments (46)

shawnxzq commented on July 28, 2024

This is a known issue and we have a work item to track it; we will update here once it's resolved.

For now, if you choose the "SetOfObjects" pattern in the copy sink, the generated JSON can be consumed in Spark and will perform better than with the "ArrayOfObjects" pattern.
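
For anyone hunting for where that setting lives, here is a minimal sketch of what the copy activity sink could look like in the pipeline JSON, assuming a JSON sink writing to ADLS Gen2 (the store settings type is illustrative, not taken from this thread):

"sink": {
    "type": "JsonSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "setOfObjects"
    }
}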

shawnxzq commented on July 28, 2024

@vignz Please make sure "Document per line" is selected as the document form in the data flow JSON settings.

hmayer1980 commented on July 28, 2024

I think it would be better to keep this open and actually update it when your work item is done and a solution is available to us.

fhljys commented on July 28, 2024

I think the current status is that @shawnxzq provided a workaround.

shawnxzq commented on July 28, 2024

@mguard11 It will be fully available by the end of next month.

For now, you can use it by manually editing the encoding of the JSON dataset to "UTF-8 without BOM".
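
For reference, that manual edit goes into the dataset JSON rather than the UI. A rough sketch of the relevant part of a Json dataset (the location values are placeholders, and the property is assumed to be encodingName, consistent with how text-based datasets declare their encoding):

"typeProperties": {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileName": "output.json",
        "folderPath": "staged",
        "fileSystem": "container"
    },
    "encodingName": "UTF-8 without BOM"
}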

LockTar commented on July 28, 2024

Please kindly create 2 datasets to avoid the issue: in copy sink, select "UTF-8 without BOM", and in dataflow source just use "Default" or select "UTF-8".

@yew is this going to be fixed? It's a bit strange that we must create 2 duplicate datasets... We work a lot with JSON datasets, so this is a real issue for us.

Since this issue is closed, will there be a new one, or will this one be reopened?

jonarmstrong commented on July 28, 2024

I've run into this same problem. If I use a dataset with the following definition, then the JSON file that is written is in UTF-8 with BOM encoding, which makes it incompatible with data flows.

{
    "name": "Raw_Deals_json",
    "properties": {
        "description": "This is the full export of the Deals w/o any additional processing",
        "linkedServiceName": {
            "referenceName": "datawarehouse_int",
            "type": "LinkedServiceReference"
        },
        "folder": {
            "name": "DataLake"
        },
        "annotations": [],
        "type": "Json",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "deals.json",
                "folderPath": "staged",
                "fileSystem": "int"
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

hmayer1980 commented on July 28, 2024

I ran into the same issue.
You should be able to specify both versions, UTF-8 without BOM and UTF-8 with BOM, or default to not writing a BOM at all...

rahash123 commented on July 28, 2024

I was facing the same issue. It got resolved when I set the Encoding to US-ASCII while defining the dataset.

ma185360 commented on July 28, 2024

@rocque57 Any update on this?

Eniemela commented on July 28, 2024

For now, manually setting the encoding to "UTF-8 without BOM" solved this issue for me.

zsponaugle commented on July 28, 2024

Hello @yew @shawnxzq @fhljys

I think this should be re-opened as well. The dual-dataset work-around is not ideal. Like many above me, I am trying to use a Data Flow after a Copy Activity that contains a JSON sink dataset. At the time of this writing, using Azure IR, I tried various settings and here are their outcomes:

one dataset using ISO-8859-1 encoding:
Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the ISO-8859-1 encoding

one dataset using US-ASCII encoding:
Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the US-ASCII encoding

Set of objects / Document per line and two datasets (UTF-8 without BOM, UTF-8 default):
Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

Set of objects / Document per line and one dataset (UTF-8 default):
Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

Array of objects / Array of documents and two datasets (UTF-8 without BOM, UTF-8 default) --> WORKS

Array of objects / Array of documents and one dataset (UTF-8 default):
Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

So, as you can see, the only combination that worked for me was the two JSON datasets (copy activity sink dataset set to UTF-8 without BOM, Data Flow source dataset set to Default (UTF-8)), with the copy sink file pattern set to Array of objects and the Data Flow source options --> JSON settings --> Document form set to Array of documents.

Thank you.
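
For readers trying to reproduce the combination that worked above, a rough sketch of the two halves (dataset names and property names other than the ones named in this thread are assumptions): the copy sink points at a dataset whose JSON carries "encodingName": "UTF-8 without BOM" and sets the array pattern,

"sink": {
    "type": "JsonSink",
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "arrayOfObjects"
    }
}

while the data flow reads through a second dataset that leaves the encoding at Default (UTF-8), with Source options --> JSON settings --> Document form set to "Array of documents".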

jenilChristo commented on July 28, 2024

Facing the same issue with an ADF JSON blob dataset. Is there a workaround to get rid of the BOM while writing a JSON dataset to Blob? I am using the default (UTF-8) encoding.

jashwood commented on July 28, 2024

Actually, this is really not cool; it is only a workaround.
The trouble is that the Azure Data Factory JSON components fail because Spark cannot handle the byte order mark.

vignz commented on July 28, 2024

@shawnxzq: Even after changing to the "SetOfObjects" pattern in the copy sink, the ADF Data Flow is still not able to read the UTF-8 JSON file. It throws the error below.

Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('' (code 65279 / 0xfeff)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.InputStreamReader@372ac23c; line: 1, column: 2]

valfirst commented on July 28, 2024

@fhljys was this issue fixed?

shawnxzq commented on July 28, 2024

@fbdntm As this is a known issue and we have a work item to track it, are you OK with closing the issue here? Thanks!

rocque57 commented on July 28, 2024

The workaround that was suggested did not work. Whether you select SetOfObjects or ArrayOfObjects for your sink, the resulting file still has a BOM.
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Source 'SinkJSON': org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, 10.139.64.11, executor 0): org.apache.spark.SparkException: Malformed records are detected in schema inference. Parse Mode: FAILFAST.\n\tat

shawnxzq commented on July 28, 2024

@rocque57 You are right that JSON files with UTF-8 encoding generated by copy will always have a BOM. However, if you set the pattern to "SetOfObjects" in the copy sink and set the document form to "Document per line" in the data flow source, you won't get the error mentioned above, and this is also the recommended setting from a performance perspective.
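
If it helps to see the data flow side of that recommendation, the document form ends up in the data flow script. A minimal sketch of a JSON source set to read one document per line (the source and dataset names are made up, and the scriptLines layout is assumed from the current data flow JSON shape):

"typeProperties": {
    "sources": [
        {
            "dataset": {
                "referenceName": "JsonSinkDataset",
                "type": "DatasetReference"
            },
            "name": "source1"
        }
    ],
    "scriptLines": [
        "source(allowSchemaDrift: true,",
        "    validateSchema: false,",
        "    documentForm: 'documentPerLine') ~> source1"
    ]
}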

shawnxzq commented on July 28, 2024

@rocque57 As discussed, we have backlog items to track this; no clear ETA at the moment, thanks!

shawnxzq commented on July 28, 2024

@rocque57 This is still in the backlog as a new ask; thanks for your patience!

evogelpohl commented on July 28, 2024

FWIW: I had the same error with Azure Data Factory reading a path full of JSON files written by PowerShell, e.g.:
$res = Invoke-RestMethod ...
$res | ConvertTo-Json | Out-File .../some.json

As above, trying to open those JSON files in an Azure Data Factory Data Flow (using either JSON reading option, Document per line or Array of documents, the latter being what the files actually are) produced the FAILFAST malformed-records error caused by the BOM.

To Out-File, I added -Encoding utf8NoBOM. Azure Data Factory now reads the JSON files as expected.

shawnxzq commented on July 28, 2024

Latest update: this is included in our action plan now; the ETA for the fix is 1-2 months out, thanks!

shawnxzq commented on July 28, 2024

Most of the work is done; it is pending deployment now.

LockTar commented on July 28, 2024

@shawnxzq what will be the end result? Must we update our datasets, pipelines, etc.? Or will the new default be without BOM?
Thanks

mguard11 commented on July 28, 2024

@shawnxzq is there a tentative deployment date?

shawnxzq commented on July 28, 2024

Thanks @Eniemela for confirming. We will have full UX support for this as well; it will come next month.

mguard11 commented on July 28, 2024

Hi @shawnxzq, is this being deployed next week with full UX support?

Thanks
Danesh

mguard11 commented on July 28, 2024

I'm setting the default file encoding to UTF-8 in my sink settings as shown in the screenshots below, but the output file still has UTF-8 with BOM encoding. Is there a quick fix for this?

[screenshots of the sink settings]

yew commented on July 28, 2024

@mguard11, currently the feature is not supported in our UI; please follow @shawnxzq's suggestion and manually edit the encoding of the JSON dataset to "UTF-8 without BOM".
[screenshot]
When the feature is available in our UI, you will see it like this:
[screenshot]

mguard11 commented on July 28, 2024

@yew I set it manually and it's throwing the error below.
[screenshot of the error]

yew commented on July 28, 2024

@mguard11 did you use Azure IR or SHIR? If the latter, please make sure the version is above 5.10.7918.2. You can download the latest version here.

yew commented on July 28, 2024

@mguard11, in our telemetry I can see your SHIR version is 4.15.7611.1; please upgrade it to use the new feature.

shawnxzq commented on July 28, 2024

Closing this since the issue is already fixed.

Eniemela commented on July 28, 2024

This "fix" apparently breaks the functionality again in Synapse! Synapse UI now enables selecting UTF-8 without BOM -but it fails to validate!
image

image

yew commented on July 28, 2024

@Eniemela the feature is only supported in the copy activity. @shawnxzq, is there a way to write a JSON file without a BOM in a data flow?

Eniemela commented on July 28, 2024

We use a JSON dataset as the sink for a Copy activity (which stores JSON with a BOM if Default (UTF-8) is selected) and use the same dataset as the source in a mapping data flow, which breaks if Default (UTF-8) is selected since it expects no BOM. This worked for us when we could manually enter "UTF-8 without BOM" in the dataset, but it broke again when this UI support came out, since it now prevents using this encoding in a dataset used in a mapping data flow. We currently use ISO-8859-1 as a workaround for the dataset, but it is not an optimal solution.

yew commented on July 28, 2024

@Eniemela , oh, I see. Please kindly create 2 datasets to avoid the issue: in copy sink, select "UTF-8 without BOM", and in dataflow source just use "Default" or select "UTF-8".

chr0m commented on July 28, 2024

No offence, but that is a ridiculous solution. My ADF is config-driven from the database; I'm expected to create a whole new dataset just to use UTF-8 without BOM? What's the point of being able to make the encoding dynamic if it doesn't work? Even worse, it used to work and now it doesn't. Not sure how we can rely on this in a production environment.

chr0m commented on July 28, 2024

Is there a fix for this? It shouldn't be this hard to generate a UTF-8 CSV without a BOM.
I'm using the ADF Copy Data activity; I select "UTF-8 without BOM" from the encoding type dropdown, and when I run it I get an error saying that it isn't a valid encoding type. Ridiculous!

Why has this been closed when it clearly hasn't been fixed?

yew commented on July 28, 2024

@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.

chr0m commented on July 28, 2024

@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.

I since discovered that it's the Get MetaData activity throwing the error when the dataset is set to UTF-8 without BOM

yew commented on July 28, 2024

@chr0m Many thanks for your feedback. I can repro the error; we will fix it and update here.

bramvanhautte commented on July 28, 2024

@yew any update on this (closed) issue? Is there a new issue created?

I'm facing similar issues when using the copy activity with "UTF-8 without BOM" and trying to pick up that file in a Data Flow source using the same dataset.

It makes no sense to double up this dataset...

EDIT: I tried doubling up the datasets: it did not work. Still receiving the "Malformed records are detected in schema inference. Parse Mode: FAILFAST" error.
Sink dataset: UTF-8 without BOM
Source dataset: UTF-8

Thanks!

Krumelur commented on July 28, 2024

Confirmed that @zsponaugle's approach also worked here (note that it is now November 2022 and the issue still exists).

SNY019 commented on July 28, 2024

The US-ASCII encoding workaround works for me. (July 2023: the issue is very much alive.)
