
Comments (46)

shawnxzq commented on July 28, 2024

This is a known issue and we have a work item to track it; we will update here once it's resolved.

For now, if you choose the "SetOfObjects" pattern in the copy sink, the generated JSON can be consumed in Spark and will perform better than with the "ArrayOfObjects" pattern.
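
For anyone hunting for where that setting lives, here is a minimal sketch of what the copy activity sink could look like in the pipeline JSON, assuming a JSON sink writing to ADLS Gen2 (the store settings type is illustrative, not taken from this thread):

"sink": {
    "type": "JsonSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "setOfObjects"
    }
}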

shawnxzq commented on July 28, 2024

@vignz Please make sure "Document per line" is selected as the document form in the data flow JSON settings.

hmayer1980 commented on July 28, 2024

I think it would be better to keep this open and actually update it when your work item is done and a solution is available to us.

fhljys commented on July 28, 2024

I think the current status is that @shawnxzq provided a workaround.

shawnxzq commented on July 28, 2024

@mguard11 It will be fully available by the end of next month.

For now, you can use it by manually editing the encoding of the JSON dataset to "UTF-8 without BOM".
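
For reference, that manual edit goes into the dataset JSON rather than the UI. A rough sketch of the relevant part of a Json dataset (the location values are placeholders, and the property is assumed to be encodingName, consistent with how text-based datasets declare their encoding):

"typeProperties": {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileName": "output.json",
        "folderPath": "staged",
        "fileSystem": "container"
    },
    "encodingName": "UTF-8 without BOM"
}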

LockTar commented on July 28, 2024

Please kindly create 2 datasets to avoid the issue: in copy sink, select "UTF-8 without BOM", and in dataflow source just use "Default" or select "UTF-8".

@yew is this going to be fixed? It's a bit strange that we must create 2 duplicate datasets... We work a lot with JSON datasets, so this is a real issue for us.

Since this issue is closed, will there be a new one, or will this one be reopened?

jonarmstrong commented on July 28, 2024

I've run into this same problem. If I use a dataset with the following definition, then the JSON file that is written is in UTF-8 with BOM encoding, which makes it incompatible with data flows.

{
    "name": "Raw_Deals_json",
    "properties": {
        "description": "This is the full export of the Deals w/o any additional processing",
        "linkedServiceName": {
            "referenceName": "datawarehouse_int",
            "type": "LinkedServiceReference"
        },
        "folder": {
            "name": "DataLake"
        },
        "annotations": [],
        "type": "Json",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": "deals.json",
                "folderPath": "staged",
                "fileSystem": "int"
            }
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

hmayer1980 commented on July 28, 2024

I ran into the same issue.
You should be able to specify both versions, UTF-8 without BOM and UTF-8 with BOM, or default to not writing a BOM at all...

rahash123 commented on July 28, 2024

I was facing the same issue. It got resolved when I set the Encoding to US-ASCII while defining the dataset.

ma185360 commented on July 28, 2024

@rocque57 Any update on this?

Eniemela commented on July 28, 2024

For now, manually setting the encoding to "UTF-8 without BOM" solved this issue for me.

zsponaugle commented on July 28, 2024

Hello @yew @shawnxzq @fhljys

I think this should be re-opened as well. The dual-dataset work-around is not ideal. Like many above me, I am trying to use a Data Flow after a Copy Activity that contains a JSON sink dataset. At the time of this writing, using Azure IR, I tried various settings and here are their outcomes:

one dataset using ISO-8859-1 encoding:
Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the ISO-8859-1 encoding

one dataset using US-ASCII encoding:
Job failed due to reason: at Source 'source1': requirement failed: The lineSep option must be specified for the US-ASCII encoding

Set of objects / Document per line and two datasets (UTF-8 without BOM, UTF-8 default):
Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

Set of objects / Document per line and one dataset (UTF-8 default):
Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

Array of objects / Array of documents and two datasets (UTF-8 without BOM, UTF-8 default) --> WORKS

Array of objects / Array of documents and one dataset (UTF-8 default):
Job failed due to reason: at Source 'source1': Malformed records are detected in schema inference. Parse Mode: FAILFAST

So, as you can see, the only combination that worked for me was the two JSON datasets (copy activity sink dataset set to UTF-8 without BOM, Data Flow source dataset set to Default (UTF-8)), with the copy sink file pattern set to Array of objects and the Data Flow source options --> JSON settings --> Document form set to Array of documents.

Thank you.
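
For readers trying to reproduce the combination that worked above, a rough sketch of the two halves (dataset names and property names other than the ones named in this thread are assumptions): the copy sink points at a dataset whose JSON carries "encodingName": "UTF-8 without BOM" and sets the array pattern,

"sink": {
    "type": "JsonSink",
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "arrayOfObjects"
    }
}

while the data flow reads through a second dataset that leaves the encoding at Default (UTF-8), with Source options --> JSON settings --> Document form set to "Array of documents".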

jenilChristo commented on July 28, 2024

Facing the same issue with an ADF JSON blob dataset. Is there a workaround to get rid of the BOM while writing a JSON dataset to Blob? I am using the default (UTF-8) encoding.

jashwood commented on July 28, 2024

Actually, this is really not cool; it is only a workaround.
The trouble is that the Azure Data Factory JSON components fail because Spark cannot handle the byte order mark.

vignz commented on July 28, 2024

@shawnxzq: Even after changing to the "SetOfObjects" pattern in the copy sink, the ADF Data Flow is still not able to read the UTF-8 JSON file. It throws the error below.

Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('' (code 65279 / 0xfeff)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.InputStreamReader@372ac23c; line: 1, column: 2]

valfirst commented on July 28, 2024

@fhljys was this issue fixed?

shawnxzq commented on July 28, 2024

@fbdntm As this is a known issue and we have a work item to track it, are you OK with closing the issue here? Thanks!

rocque57 commented on July 28, 2024

The workaround that was suggested did not work. Whether you select SetOfObjects or ArrayOfObjects for your sink, the resulting file still has a BOM.
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: at Source 'SinkJSON': org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, 10.139.64.11, executor 0): org.apache.spark.SparkException: Malformed records are detected in schema inference. Parse Mode: FAILFAST.\n\tat

shawnxzq commented on July 28, 2024

@rocque57 You are right that JSON files with UTF-8 encoding generated by copy will always have a BOM. However, if you set the pattern to "SetOfObjects" in the copy sink and set the document form to "Document per line" in the data flow source, you won't get the error mentioned above, and this is also the recommended setting from a performance perspective.
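
If it helps to see the data flow side of that recommendation, the document form ends up in the data flow script. A minimal sketch of a JSON source set to read one document per line (the source and dataset names are made up, and the scriptLines layout is assumed from the current data flow JSON shape):

"typeProperties": {
    "sources": [
        {
            "dataset": {
                "referenceName": "JsonSinkDataset",
                "type": "DatasetReference"
            },
            "name": "source1"
        }
    ],
    "scriptLines": [
        "source(allowSchemaDrift: true,",
        "    validateSchema: false,",
        "    documentForm: 'documentPerLine') ~> source1"
    ]
}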

shawnxzq commented on July 28, 2024

@rocque57 As discussed, we have backlog items to track this; no clear ETA at the moment, thanks!

shawnxzq commented on July 28, 2024

@rocque57 This is still in the backlog as a new ask; thanks for your patience!

evogelpohl commented on July 28, 2024

FWIW: I had the same error with Azure Data Factory reading a path full of JSON files written by PowerShell, e.g.:
$res = Invoke-RestMethod ...
$res | ConvertTo-Json | Out-File .../some.json

As above, trying to open those JSON files in an Azure Data Factory Data Flow (using either JSON reading option, Document per line or Array of documents, the latter being what the files actually are) produced the FAILFAST malformed-records error caused by the BOM.

To Out-File, I added -Encoding utf8NoBOM. Azure Data Factory now reads the JSON files as expected.

shawnxzq commented on July 28, 2024

Latest update: this is included in our action plan now; the ETA for the fix is 1-2 months out, thanks!

shawnxzq commented on July 28, 2024

Most of the work is done; it is pending deployment now.

LockTar commented on July 28, 2024

@shawnxzq what will be the end result? Must we update our datasets, pipelines, etc.? Or will the new default be without BOM?
Thanks

mguard11 commented on July 28, 2024

@shawnxzq is there a tentative deployment date?

shawnxzq commented on July 28, 2024

Thanks @Eniemela for confirming. We will have full UX support for this as well; it will come next month.

mguard11 commented on July 28, 2024

Hi @shawnxzq, is this being deployed next week with full UX support?

Thanks
Danesh

mguard11 commented on July 28, 2024

I'm setting the default file encoding to UTF-8 in my sink settings as shown in the screenshots below, but the output file still has UTF-8 with BOM encoding. Is there a quick fix for this?

[screenshots of the sink settings]

yew commented on July 28, 2024

@mguard11, currently the feature is not supported in our UI; please follow @shawnxzq's suggestion and manually edit the encoding of the JSON dataset to "UTF-8 without BOM".
[screenshot]
When the feature is available in our UI, you will see it like this:
[screenshot]

mguard11 commented on July 28, 2024

@yew I set it manually and it's throwing the error below.
[screenshot of the error]

yew commented on July 28, 2024

@mguard11 did you use Azure IR or SHIR? If the latter, please make sure the version is above 5.10.7918.2. You can download the latest version here.

yew commented on July 28, 2024

@mguard11, in our telemetry I can see your SHIR version is 4.15.7611.1; please upgrade it to use the new feature.

shawnxzq commented on July 28, 2024

Closing this since the issue is already fixed.

Eniemela commented on July 28, 2024

This "fix" apparently breaks the functionality again in Synapse! Synapse UI now enables selecting UTF-8 without BOM -but it fails to validate!
image

image

yew commented on July 28, 2024

@Eniemela the feature is only supported in the copy activity. @shawnxzq, is there a way to write a JSON file without a BOM in a data flow?

Eniemela commented on July 28, 2024

We use a JSON dataset as the sink for a Copy activity (which stores JSON with a BOM if Default (UTF-8) is selected) and use the same dataset as the source in a mapping data flow, which breaks if Default (UTF-8) is selected since it expects no BOM. This worked for us when we could manually enter "UTF-8 without BOM" in the dataset, but it broke again when this UI support came out, since it now prevents using this encoding in a dataset used in a mapping data flow. We currently use ISO-8859-1 as a workaround for the dataset, but it is not an optimal solution.

yew commented on July 28, 2024

@Eniemela , oh, I see. Please kindly create 2 datasets to avoid the issue: in copy sink, select "UTF-8 without BOM", and in dataflow source just use "Default" or select "UTF-8".

chr0m commented on July 28, 2024

No offence, but that is a ridiculous solution. My ADF is config-driven from the database; I'm expected to create a whole new dataset just to use UTF-8 without BOM? What's the point of being able to make the encoding dynamic if it doesn't work? Even worse, it used to work and now it doesn't. Not sure how we can rely on this in a production environment.

chr0m commented on July 28, 2024

Is there a fix for this? It shouldn't be this hard to generate a UTF-8 CSV without a BOM.
I'm using the ADF Copy Data activity; I select "UTF-8 without BOM" from the encoding type dropdown, and when I run it I get an error saying that it isn't a valid encoding type. Ridiculous!

Why has this been closed when it clearly hasn't been fixed?

yew commented on July 28, 2024

@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.

chr0m commented on July 28, 2024

@chr0m "UTF-8 without BOM" is supported in ADF copy data activity. Are you using self hosted IR? If yes, please making sure its version is above 5.12.7976.1.

I since discovered that it's the Get MetaData activity throwing the error when the dataset is set to UTF-8 without BOM

yew commented on July 28, 2024

@chr0m Many thanks for your feedback. I can repro the error; we will fix it and update here.

bramvanhautte commented on July 28, 2024

@yew any update on this (closed) issue? Is there a new issue created?

I'm facing similar issues when using the copy activity with "UTF-8 without BOM" and trying to pick up that file in a Data Flow source using the same dataset.

It makes no sense to double up this dataset...

EDIT: I tried doubling up the datasets: it did not work. Still receiving the "Malformed records are detected in schema inference. Parse Mode: FAILFAST" error.
Sink dataset: UTF-8 without BOM
Source dataset: UTF-8

Thanks!

Krumelur commented on July 28, 2024

Confirmed that @zsponaugle's approach also worked here (note that it is now November 2022 and the issue still exists).

SNY019 commented on July 28, 2024

The US-ASCII encoding workaround works for me. (July 2023: the issue is very much alive.)
