
ArcFurnace

ArcFurnace melts, melds, and transforms your scrap data into perfectly crafted data for ingest into applications, analysis, or whatnot. It simplifies common ETL (Extract, Transform, Load) tasks for small to medium data sets via a programmatic DSL. Here's an example:

class Transform < ArcFurnace::Pipeline

    source :marketing_info_csv, type: ArcFurnace::CSVSource, params: { filename: :marketing_filename }

    transform :marketing_info_source, params: { source: :marketing_info_csv } do |row|
      row.delete('Name')
      row
    end

    source :product_attributes,
           type: ArcFurnace::MultiCSVSource,
           params: { filenames: :product_attribute_filenames }

    hash_node :marketing_info,
              params: {
                  key_column: :primary_key,
                  source: :marketing_info_source
              }

    outer_join :join_results,
               params: {
                   source: :product_attributes,
                   hash: :marketing_info
               }

    sink type: ArcFurnace::AllFieldsCSVSink,
         source: :join_results,
         params: { filename: :destination_name }

end

Installation

Add this line to your application's Gemfile:

gem 'arc-furnace', github: 'salsify/arc-furnace'

And then execute:

$ bundle

Usage

ArcFurnace provides a few concepts useful for extracting and transforming data.

Node Types Available

Pipelines

A Pipeline defines a complete transformation: a directed, acyclic graph of operations describing how data flows and is transformed. Each type of node available in a Pipeline is described below; the Pipeline itself defines the network of nodes that transform the data.

Sources

A Source provides a stream of values to a Pipeline. A Pipeline may have many sources; essentially, any node that requires a stream of data (Hash, Transform, Join, Sink) takes one as input.
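The source contract can be illustrated without the gem: each call to a source's row-producing method yields the next Hash, and nil once the stream is exhausted. The class below is a standalone sketch of that idea, not the gem's actual API (its EnumeratorSource handles this for you).

```ruby
# Illustrative stand-in for a source: #row returns the next Hash,
# or nil once the underlying enumerator is exhausted.
class ArraySource
  def initialize(rows)
    @enum = rows.each
  end

  def row
    @enum.next
  rescue StopIteration
    nil
  end
end
```

A downstream node can then pull rows in a loop until it sees nil.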

Hashes

A Hash provides indexed access to a Source by pre-computing an index based on a key. The indexing happens during the prepare stage of pipeline processing. Hashes expose a simple interface, #get(primary_key), for requesting data, and are almost exclusively used as input to one side of a join.
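Conceptually, the prepare stage drains the source and indexes every row by the key column. A dependency-free sketch (method and column names here are illustrative):

```ruby
# Build an index of rows keyed on key_column, as a Hash node
# conceptually does during the prepare stage.
def build_index(rows, key_column)
  rows.each_with_object({}) do |row, index|
    index[row[key_column]] = row # later rows win on duplicate keys
  end
end

index = build_index(
  [{ 'sku' => 'A1', 'color' => 'red' }, { 'sku' => 'B2', 'color' => 'blue' }],
  'sku'
)
index['A1'] # => { 'sku' => 'A1', 'color' => 'red' }
```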

Joins

An InnerJoin or an OuterJoin joins two sources of data (one of which must be a Hash) based upon a key. By default the join key is the key the hash was rolled up on; the key_column option on both InnerJoin and OuterJoin may override this. Note that the default join is an inner join, which will drop source rows if the hash does not contain a matching row.
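The inner/outer distinction can be sketched in plain Ruby, joining a row stream against a pre-built index (this is a conceptual model, not the gem's implementation):

```ruby
# Join each row against an index on key_column. An inner join drops
# rows with no match; an outer join passes them through unmerged.
def join(rows, index, key_column, inner:)
  rows.each_with_object([]) do |row, out|
    match = index[row[key_column]]
    next if match.nil? && inner # inner join: drop unmatched rows
    out << (match ? match.merge(row) : row)
  end
end

rows  = [{ 'sku' => 'A1', 'price' => 10 }, { 'sku' => 'C3', 'price' => 7 }]
index = { 'A1' => { 'sku' => 'A1', 'color' => 'red' } }

join(rows, index, 'sku', inner: true).size  # => 1 (C3 dropped)
join(rows, index, 'sku', inner: false).size # => 2 (C3 passed through)
```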

Filters

A Filter acts as a source, but takes another source as input and decides whether to pass each row to the next downstream node by calling its #filter method. The associated BlockFilter, plus sugar on Pipeline, makes this easy.
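The #filter contract is simple: return truthy to pass a row downstream, falsy to drop it. The class below borrows the BlockFilter name from the text, but the implementation is a standalone sketch rather than the gem's code:

```ruby
# Sketch of the filter contract: #filter returns truthy to keep a row,
# falsy to drop it. A block supplies the predicate.
class BlockFilter
  def initialize(&block)
    @block = block
  end

  def filter(row)
    @block.call(row)
  end
end

in_stock = BlockFilter.new { |row| row['quantity'].to_i > 0 }
```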

Transforms

A Transform acts as a source, but takes another source as input and transforms each row. The BlockTransform and the associated sugar in Pipeline's transform method make this very easy (see the example above).

Unfolds

An Unfold acts as a source, but takes another source as input and produces multiple output rows for each input row. A common use case is splitting rows into multiple rows depending upon their keys. The associated sugar in Pipeline's unfold method makes this fairly easy (see pipeline_spec.rb).
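An unfold is essentially a one-to-many row mapping. A minimal sketch, assuming a hypothetical pipe-delimited 'tags' column:

```ruby
# Unfold sketch: each input row with pipe-delimited tags becomes
# one output row per tag.
def unfold_tags(rows)
  rows.flat_map do |row|
    row['tags'].split('|').map { |tag| row.merge('tags' => tag) }
  end
end

unfold_tags([{ 'id' => '1', 'tags' => 'red|large' }])
# => [{ 'id' => '1', 'tags' => 'red' }, { 'id' => '1', 'tags' => 'large' }]
```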

Observers

An Observer acts as a source: it takes a source as input and passes the stream through unchanged, observing each row so that data can be recorded for use elsewhere.

Sinks

Each Pipeline has a single sink. Pipelines must produce data somewhere, and that data goes to a sink. Sinks implement the #row(hash) interface; each output row is passed to this method for handling.
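A custom sink only needs to honor that interface. The class below is an illustrative in-memory sink, not one of the gem's provided CSVSink types:

```ruby
# Minimal sink sketch honoring the #row(hash) interface:
# each output row of the pipeline is handed to #row.
class MemorySink
  attr_reader :rows

  def initialize
    @rows = []
  end

  def row(hash)
    @rows << hash
  end
end
```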

General pipeline development process

  1. Define a source. Choose an existing Source implementation in this library (CSVSource or ExcelSource), extend EnumeratorSource, or implement the #row method for a new source.
  2. Define any transformations or joins. This may cause you to revisit step 1.
  3. Define the sink. This is generally custom, but may be one of the provided CSVSink types.
  4. Roll it all together in a Pipeline.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release to create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

TODOs

  1. Add examples for ErrorHandler interface.
  2. Add sugar to define a BlockTransform on a Source definition in a Pipeline.

Contributing

  1. Fork it ( https://github.com/[my-github-username]/arc-furnace/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

arc-furnace's People

Contributors

brian-penguin, cdurling, dependabot[bot], jfo84, lee-feigenbaum, patbreault

arc-furnace's Issues

Add logging for failed expectations when processing rows

In many cases ArcFurnace nodes expect incoming rows to have certain fields; this is especially true for Hash and Equijoin nodes. These nodes should be resilient to missing data and properly log when expectations are not met, instead of failing with a stacktrace (which they often do).

Much larger output file sizes than input file sizes

One of our projects has a ProductsPipeline that is a relatively simple implementation of the library. Here's a snippet:

require_relative 'constants'
require 'arc-furnace/pipeline'
require 'arc-furnace/excel_source'
require 'arc-furnace/all_fields_csv_sink'

class ProductsPipeline < ArcFurnace::Pipeline

  include Constants

  # create products source

  source :products_source,
         type: ArcFurnace::ExcelSource,
         params: {
             filename: :product_filename,
             encoding: 'ISO-8859-1'
         }

  transform :products_transform, params: { source: :products_source } do |hash|
    result = hash.deep_dup
    result[SALSIFY_ID] = result.delete(BLAH_ID)
    result
  end

  filter :filtered_products, params: { source: :products_transform, observed_products: :observed_products } do |row, params|
    params.fetch(:observed_products).add(row[BLAH_ID])
  end

  sink type: ArcFurnace::AllFieldsCSVSink,
       source: :filtered_products,
       params: { filename: "#{Dir.pwd}/products_import.csv" }

end

The source file is a 14 MB XLSX file, but the output file is a 71 MB CSV file. The output is five times larger, even though XLSX files tend to be larger than CSV files relative to the information contained. I tried removing the filter and the file size was the same; I spot-checked the files and they look identical. I feel like something is going wrong with the AllFieldsCSVSink.

Cue @dspangen