
dataflows's Introduction

DataFlows


DataFlows is a simple and intuitive way of building data processing flows.

  • It's built for small-to-medium data processing - data that fits on your hard drive, but is too big to load in Excel or as-is into Python, and not big enough to require spinning up a Hadoop cluster...
  • It's built upon the foundation of the Frictionless Data project - which means that all data produced by these flows is easily reusable by others.
  • It's a pattern, not a heavyweight framework: if you already have a bunch of download and extract scripts, this will be a natural fit.

Read more in the Features section below.

QuickStart

Install dataflows via pip install.

(If you are using a minimal UNIX OS, first run sudo apt install build-essential.)

Then use the command-line interface to bootstrap a basic processing script for any remote data file:

# Install from PyPi
$ pip install dataflows

# Inspect a remote CSV file
$ dataflows init https://raw.githubusercontent.com/datahq/dataflows/master/data/academy.csv
Writing processing code into academy_csv.py
Running academy_csv.py
academy:
#     Year           Ceremony  Award                                 Winner  Name                            Film
      (string)      (integer)  (string)                            (string)  (string)                        (string)
----  ----------  -----------  --------------------------------  ----------  ------------------------------  -------------------
1     1927/1928             1  Actor                                         Richard Barthelmess             The Noose
2     1927/1928             1  Actor                                      1  Emil Jannings                   The Last Command
3     1927/1928             1  Actress                                       Louise Dresser                  A Ship Comes In
4     1927/1928             1  Actress                                    1  Janet Gaynor                    7th Heaven
5     1927/1928             1  Actress                                       Gloria Swanson                  Sadie Thompson
6     1927/1928             1  Art Direction                                 Rochus Gliese                   Sunrise
7     1927/1928             1  Art Direction                              1  William Cameron Menzies         The Dove; Tempest
...

# dataflows creates a local package of the data and a reusable processing script which you can tinker with
$ tree
.
├── academy_csv
│   ├── academy.csv
│   └── datapackage.json
└── academy_csv.py

1 directory, 3 files

# Resulting 'Data Package' is super easy to use in Python
[adam] ~/code/budgetkey-apps/budgetkey-app-main-page/tmp (master=) $ python
Python 3.6.1 (default, Mar 27 2017, 00:25:54)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datapackage import Package
>>> pkg = Package('academy_csv/datapackage.json')
>>> it = pkg.resources[0].iter(keyed=True)
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': None, 'Name': 'Richard Barthelmess', 'Film': 'The Noose'}
>>> next(it)
{'Year': '1927/1928', 'Ceremony': 1, 'Award': 'Actor', 'Winner': '1', 'Name': 'Emil Jannings', 'Film': 'The Last Command'}

# You can now run `academy_csv.py` to repeat the process
# and, of course, modify it to add data processing steps

Features

  • Trivial to get started and easy to scale up
  • Set up and run from command line in seconds ...
    • dataflows init => flow.py
    • python flow.py
  • Validate input (and especially the source) quickly (non-zero length, right structure, etc.)
  • Supports caching data from the source, and even between steps
    • so that we can run and test quickly (retrieving is slow)
  • Runs an immediate test so you can look at the output ...
    • Log, debug, rerun
  • Degrades to simple Python
  • Convention over configuration
  • Log exceptions and/or terminate
  • The input to each stage is a Data Package or Data Resource (not a previous task)
    • Data package based and compatible
  • Processors can be a function (or a class) processing row-by-row, resource-by-resource or a full package
  • A decent pre-existing contrib library of Readers (Collectors), Processors and Writers

Learn more

Dive into the Tutorial for a deeper look at everything that dataflows can do. Also review the list of Built-in Processors, which includes an API reference for each one.

dataflows's People

Contributors

akariv, anuveyatsu, cabral, colinmaudry, cschloer, gperonato, jornh, orihoch, pwalsh, roll, rufuspollock, sglavoie, shevron, starsinmypockets


dataflows's Issues

Split into sub packages for modularity

We need at least:

  • dataflows-cli
  • dataflows-core
  • dataflows-stdlib

Installing dataflows would install all of the sub-packages (and would import everything from all of them, for compatibility and usability). A possible layout is sketched below.
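
A minimal sketch of what the meta-package's setup.py could look like, assuming the sub-packages above end up on PyPI under those names (all names here are hypothetical):

from setuptools import setup

# setup.py for the hypothetical "dataflows" meta-package
setup(
    name='dataflows',
    version='0.1.0',
    description='Meta-package pulling in all dataflows sub-packages',
    install_requires=[
        'dataflows-core',    # Flow, DataStreamProcessor, base classes
        'dataflows-stdlib',  # built-in processors (load, join, dump_to_path, ...)
        'dataflows-cli',     # the `dataflows` command-line tool
    ],
)

The top-level dataflows package would then just re-export everything from the sub-packages (e.g. from dataflows_core import *) so existing imports keep working.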

Adding foreign keys

It could be just like the set_primary_key processor; however, adding foreign keys is probably less common. At the moment, the only option I can see is to use the update_resource processor and provide the entire schema of a resource. Is there a way to get the generated schema so that I could just add a new key to it (e.g. foreignKeys)? A possible workaround is sketched below.
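
One possible workaround (a sketch only, not a confirmed API): a package-level processor that reads the schema dataflows has already generated and appends a foreignKeys entry to it. The field and resource names below are hypothetical, and depending on the dataflows/datapackage version a commit() call may also be needed:

from dataflows import Flow, load, dump_to_path

def add_foreign_keys(package):
    # Append a foreignKeys entry to the generated schema of the first resource
    schema = package.pkg.descriptor['resources'][0]['schema']
    schema['foreignKeys'] = [{
        'fields': 'country_id',
        'reference': {'resource': 'countries', 'fields': 'id'},
    }]
    yield package.pkg   # pass on the modified package descriptor
    yield from package  # pass the resources through unchanged

Flow(
    load('data/datapackage.json'),  # hypothetical input
    add_foreign_keys,
    dump_to_path('out'),
).process()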

allow generators to pass a schema on the first row

I think this will be useful

from dataflows import Flow, printer

def generator():
    # The first yielded item carries the resource descriptor instead of data
    yield {"__dataflows_schema": True,
           "name": "my-resource",
           "path": "my-resource.csv",
           "schema": {"fields": [{"name": "i", "type": "string"}]}}
    for i in [1, 'two', 'three']:
        yield {"i": i}

Flow(generator(), printer()).process()

add stream and unstream processors

For more accurate streaming of data: dump_to_path removes some information (e.g. date timezones) and modifies the schema (e.g. the file format). This would also prepare for a possible integration with dpp processors.

child of #59

Unpivoting with regex

How would I unpivot the following table using a regex:

2000,2001,2002
a1,b1,c1,d1
a2,b2,c2,d2

I'd call unpivot like below:

unpivoting_fields = [
    {'name': r'\d{4}', 'keys': {'year': r'\d{4}'}}
]
extra_keys = [
    {'name': 'year', 'type': 'year'}
]
extra_value = {'name': 'value', 'type': 'string'}

unpivot(unpivoting_fields, extra_keys, extra_value)

but this results in:

year,value
\\d{4},a1
\\d{4},a2
\\d{4},b1
...

Am I missing something?

Once we figure it out, I will update the docs as it would be great to have an example for this one 😄

Can't name/rename resources

By default, resources are named res_1, res_2, etc., and the paths to the resources follow the same pattern: res_1.csv, res_2.csv, ...

As a dataflows user, I want to give resources a name of my choice, so that I can reuse them and find them by name, or simply have them look nice.

Acceptance Criteria

  • I can name a resource however I want
  • Paths to the files are appropriate

Analysis

I tried to create a custom processor that changes the name of the resource, but it does not really work.

Option one: modify resource object descriptor:

from dataflows import Flow, dump_to_path

def name_resource(package):
    package.pkg.resources[0].descriptor['name'] = 'countries'
    package.pkg.resources[0].descriptor['path'] = 'countries.csv'
    package.pkg.resources[0].commit()
    yield package.pkg
    yield from package

f = Flow(
      [{'hello': 'world'}],
      name_resource,
      dump_to_path('data'),
)

This kind of works, as the output file is named countries.csv, but nothing is changed inside datapackage.json:

$ cat data/nato_countries_official/datapackage.json 
{
  "name": "m-package",
  "resources": [
    {
      "name": "res_1",
      "path": "res_1.csv",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "format": "default",
            "name": "country_name",
            "type": "string"
          }
        ]
      }
    }
  ]
}

Option 2: modify pkg object descriptor:

def name_resource(package):
    package.pkg.descriptor['resources'][0]['name'] = 'countries'
    package.pkg.descriptor['resources'][0]['path'] = 'countries.csv'
    package.pkg.commit()
    yield package.pkg
    yield from package

f = Flow(
      [{'hello': 'world'}],
      name_resource,
      dump_to_path('data'),
)

This results in an error, as if the resource is gone entirely:

Traceback (most recent call last):
  File "flows/run_all.py", line 4, in <module>
    nato_countries_official.process()
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/flow.py", line 15, in process
    return self._chain().process()
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 66, in process
    for res in ds.res_iter:
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 57, in <genexpr>
    res_iter = (it if isinstance(it, ResourceWrapper) else ResourceWrapper(res, it)
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/processors/dumpers/dumper_base.py", line 80, in process_resources
    for resource in resources:
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 54, in <genexpr>
    res_iter = (ResourceWrapper(get_res(rw.res.name), rw.it)
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 57, in <genexpr>
    res_iter = (it if isinstance(it, ResourceWrapper) else ResourceWrapper(res, it)
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 31, in process_resources
    for res in resources:
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 55, in <genexpr>
    for rw in res_iter)
  File "/home/.virtualenvs/fedex/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 51, in get_res
    assert ret is not None
AssertionError
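
A possible workaround, assuming the built-in update_resource processor (mentioned in other issues) accepts resource properties such as name and path, is a sketch along these lines:

from dataflows import Flow, update_resource, dump_to_path

Flow(
    [{'hello': 'world'}],
    # Rename the auto-generated first resource; 'countries' is a hypothetical name
    update_resource('res_1', name='countries', path='countries.csv'),
    dump_to_path('data'),
).process()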

Suggested feature: Dataflows REPL

see #48 and #49

Example use-case of dataflows REPL for Kubernetes management

$ pip install dataflows-kubernetes
$ pip install dataflows-prometheus

$ dataflows --checkpoints

> kubernetes.get:deployments --namespace=demo5 
Retrieving deployments for namespace demo5
Saved checkpoint 1

> prometheus.get:full_deployment_stats --namespace=demo5 --since=2018-10-01
Loading checkpoint 1
Retrieving full pod stats since 2018-10-01 for pod AAA namespace demo5
Retrieving full pod stats since 2018-10-01 for pod BBB namespace demo5
Saved checkpoint 2

> --inline-row-step 'row["interesting"] = row["avg_cpu_load"] > 90 or row["uptime_percent"] < 90'
Loading checkpoint 2
Saved checkpoint 3

> filter_rows --equals={"interesting":true}
Loading checkpoint 3
Saved checkpoint 4

> printer
deployment_name | datetime | avg_cpu_load | uptime_percent
------------------------------------------------------------------------------------
foobar | 2018-10-15 03:44 | 98 | 5

[proposal] package-set level operations

Right now, dataflows and datapackage-pipelines can only perform operations within one package at a time, but I think adding another layer to the API for handling a stream of multiple packages would make sense. This would be useful for:

  • splitting complex packages for multiple consumers
  • streaming multiple, self contained packages (say, bundles of user-wise data)

Save data as json not working

Traceback (most recent call last):
  File "exercise.py", line 24, in <module>
    dump_to_path(out_path='data', format='json')
  ...
  File "/home/zelima/anaconda3/envs/dataflows/lib/python3.7/site-packages/dataflows/processors/dumpers/file_formats.py", line 30, in __init__
    self.headers = [f.name for f in schema.fields]
AttributeError: 'NoneType' object has no attribute 'fields'

  • Add tests for this scenario
  • Make sure they pass

Excel output processor

Do we have one already? Where do I check this sort of stuff?

As a Developer I want to output the Data Package as an Excel file, with each resource as a separate tab, so that I can share the Excel file with people who use Excel (see the sketch after the list below).

  • Targeting modern Excel (xlsx)
  • Bonus: output metadata in a separate sheet e.g. each resource with its metadata followed by 2 blank lines
  • Bonus: metadata per resource as extra rows at the top of a sheet (?)
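
A rough sketch of what such a processor could look like, assuming openpyxl is available (this is not an existing dataflows processor, just one way it could be written):

from openpyxl import Workbook

def dump_to_excel(path):
    def func(package):
        yield package.pkg
        wb = Workbook()
        wb.remove(wb.active)  # drop the default empty sheet

        def write_sheet(resource):
            # One sheet per resource, named after the resource (31-char Excel limit)
            ws = wb.create_sheet(title=resource.res.name[:31])
            header_written = False
            for row in resource:
                if not header_written:
                    ws.append(list(row.keys()))
                    header_written = True
                ws.append([row[k] for k in row.keys()])
                yield row  # pass rows through unchanged
            wb.save(path)  # (re)save after each resource finishes streaming

        for resource in package:
            yield write_sheet(resource)

    return func

A metadata sheet (the bonus item above) could be added the same way, by writing descriptor rows to an extra sheet before saving.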

0.0.51 breaks passing in named resources to Flow

I need some time to figure out what's up here and to provide you with better documentation (it's possible this is on my end; if so I will close this issue), but there seems to be an issue with passing a name for a resource to load since 0.0.51. Instead of only using the passed-in name, it creates two resources:

  1. One using the passed-in name, containing the correct headers, the correct types, and no rows.
  2. Another using the default empty name (aka the file name), containing the correct headers, the correct rows and no types.

I'll come back to this ASAP to give you more details; for now I'm using v0.0.50.

dataflow vs dataflow*s*

I don't think it's a problem if we have to use the plural for PyPI, but I think we should keep the singular for the command line (and possibly the package import), because it is simpler and makes more sense (you are building a flow, not flows).

head / tail processors

from collections import deque

def head(num_rows=10):

    def step(rows):
        # Yield only the first num_rows rows
        for rownum, row in enumerate(rows):
            if rownum >= num_rows:
                break
            yield row

    return step


def tail(num_rows=10):

    def step(rows):
        # A deque with maxlen keeps only the last num_rows rows
        for row in deque(rows, maxlen=num_rows):
            yield row

    return step
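
Usage might look something like this (a sketch; the input file is hypothetical):

from dataflows import Flow, load, printer

Flow(
    load('data/academy.csv'),  # hypothetical input
    head(5),                   # keep only the first 5 rows
    printer(),
).process()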

concurrency?

Can we please have concurrency? dataflows works great for small datasets, but it's rather time-consuming when it comes to slightly larger datasets.

ImportError: No module named dataflows

Fresh install on macOS 10.13.6 High Sierra. I'm new to Python, following the tutorial at http://okfnlabs.org/blog/2018/08/29/data-factory-data-flows-introduction.html

09:08:25 jonathan:~/Documents/projects/datafactory (master) $ dataflows --help

Usage: dataflows [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  init  Bootstrap a processing pipeline script.

09:08:29 jonathan:~/Documents/projects/datafactory (master) $ dataflows init https://rawgit.com/datahq/demo/_/first.csv

Writing processing code into first_csv.py
Running first_csv.py
Processing failed, here's the error:
python: VERSIONER_PYTHON_VERSION environment variable error (ignored)
Traceback (most recent call last):
  File "first_csv.py", line 1, in <module>
    from dataflows import Flow, load, dump_to_path, dump_to_zip, printer, add_metadata
ImportError: No module named dataflows

09:16:13 jonathan:~/Documents/CatalystIT/projects/datafactory (master) $ dataflows init

Hi There!
    DataFlows will now bootstrap a data processing flow based on your needs.

    Press any key to start...
    
[?] What is the source of your data?: File
 ❯ File
   Remote URL
   SQL Database
   Other

At first I thought maybe the rawgit.com URL is bad (it doesn't serve data now), but
dataflows init https://raw.githubusercontent.com/datahq/demo/master/first.csv
also generates the ImportError: No module named dataflows message.

Why not combine the data resources and the datapackage.json into one JSON file?

I've been using dataflows to process data and dump it to S3, using a custom dumper I wrote based on the datapackage-pipelines-aws package. Everything works pretty well; however, when it comes to version control I've encountered issues. Because the data file (usually a CSV) and the datapackage.json are dumped separately, it is difficult to compare existing versions (e.g. using an md5 checksum): I might end up creating a new version of the datapackage.json but not of the CSV. With the current structure, it's hard to tell whether creating a new datapackage.json should also cache a new CSV.

I was wondering if it would be beneficial to dump the data resources together with datapackage.json in one big JSON file?
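
One way to approximate this today (a sketch, not an existing processor): a custom package-level step that collects the rows and writes them, together with the package descriptor, into a single JSON file. Everything here is illustrative:

import json

def dump_to_single_json(path):
    def func(package):
        yield package.pkg
        descriptor = package.pkg.descriptor
        collected = {}

        def collect(resource):
            rows = collected.setdefault(resource.res.name, [])
            for row in resource:
                rows.append(row)
                yield row
            # Rewrite the file after each resource finishes streaming
            with open(path, 'w') as f:
                json.dump({'datapackage': descriptor, 'data': collected},
                          f, default=str)  # default=str handles dates etc.

        for resource in package:
            yield collect(resource)

    return func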

How would I add minimum/maximum constraints?

E.g., I have a table with time series and I wish to have constraints.minimum and constraints.maximum in the schema so that I have aggregated information about the table.

Is there a standard approach using existing processors or should I go for a custom processor?
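
In the absence of a dedicated processor, a custom package-level step could compute the minimum and maximum and write them into the schema. A rough sketch (the field name is a parameter you supply; whether the late schema change propagates to downstream dumpers may depend on evaluation order):

def add_min_max_constraints(field_name):
    def func(package):
        # Find the field's schema entry so constraints can be attached later
        fields = package.pkg.descriptor['resources'][0]['schema']['fields']
        target = next(f for f in fields if f['name'] == field_name)
        yield package.pkg

        def scan(resource):
            lo, hi = None, None
            for row in resource:
                value = row[field_name]
                if value is not None:
                    lo = value if lo is None else min(lo, value)
                    hi = value if hi is None else max(hi, value)
                yield row
            target['constraints'] = {'minimum': lo, 'maximum': hi}

        for resource in package:
            yield scan(resource)

    return func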

Write up a concrete use case or two

Give an example of a project you did where you used (or now would use) DataFlows and what it replaced.

This makes the idea tangible and makes the benefit clear.

Support arguments to python flow.py

The following commands should work - i.e. you can pass command line options to flow.py

python flow.py help
python flow.py --debug
python flow.py --start-at=...

This implies we need a special cli runner by default in templates that parses cli arguments ...

if __name__ == '__main__':
    dataflows.runcli(...)
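
Until such a runner exists, here is a minimal sketch of what the generated template could contain, using plain argparse (dataflows.runcli above is hypothetical, and so is the source file here):

import argparse
from dataflows import Flow, load, printer

def build_flow(args):
    # Assemble the flow; in a real template this would be the generated pipeline
    return Flow(load('data/academy.csv'), printer())

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Run this data flow')
    parser.add_argument('--debug', action='store_true')
    parser.add_argument('--start-at', default=None)
    args = parser.parse_args()
    build_flow(args).process()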

suggested features: dataflows CLI + auto-numbered checkpoints

Example shell session using suggested features for dataflows CLI and checkpoints:

$ dataflows load "/foo/bar/datapackage.json" | dataflows ./my-flow.py:my_step "arg_a" "arg_b" | dataflows printer
FOO | BAR
-------|-------
aaa  | ccc
^^^^^^^^^^^

$ dataflows ./my-flow.py:my_other_step | dataflows checkpoint
Saved checkpoint 1

$ dataflows checkpoint 1 | dataflows join --source_name=foo --source_key='["my_id"]' --source_delete=false --target_name=bar --target_key='["my_id"] --fields='{"baz": {}}' | dataflows checkpoint
Loading from checkpoint 1
Saving to checkpoint 2

$ dataflows checkpoint last | dataflows printer
Loading from checkpoint 2
FOO | BAR
-------|-------
aaa  | ccc
^^^^^^^^^^^

Could be used to support integration with singer (#16)

$ dataflows singer exchangerates --coin=BTC | dataflows printer
Date | Coin | Rate
------------------------
2017 | BTC | 5000$
2018 | BTC | 20000$
2019 | BTC | 5$

Join takes too long time (or hangs) to process the data

I'm trying to solve this exercise https://github.com/ViderumGlobal/programming-exercise but join takes so long to process the data that I thought it had just hung and could not finish the task. I don't see any while loops in join.py, so I doubt I'm stuck in an infinite loop, which makes me think it's just slow.

I simplified the code:

from dataflows import Flow, load, join, printer, filter_rows

def filter_over_10(rows):
    for row in rows:
        if row.get('order') is not None and row.get('order') > 10:
            continue
        yield row

res = Flow(
        load('data/movies/datapackage.json'),
        load('data/credits/datapackage.json'),
        filter_over_10,
        filter_rows(not_equals=[{'revenue': 0}], resources=['tmdb_5000_movies']),
        filter_rows(not_equals=[{'gender': 0}], resources=['tmdb_5000_credits']),
        join('tmdb_5000_movies', ['id'], 'tmdb_5000_credits', ['id'], fields={'revenue':{}}, full=False),
        printer(),
).results()

  • movies is ~4,000 rows
  • credits is ~40,000 rows after the filter

Comparison with Meltano, Mara, Airflow and other ETL tools

As a potential user of dataflows, I want to understand how it compares to other tools, so that I understand what use cases it was designed for and why (or why not) I should use it (and also deepen my respect for its creators, because I'd know they know their stuff).

As an example of this done very well see VuePress https://vuepress.vuejs.org/guide/#why-not (short) and VueJS https://vuejs.org/v2/guide/comparison.html (long)

Tasks

Add filter rows with callable

Filter rows using a callable, via Python's built-in filter.
This is very useful and can be reused. It might make sense to merge with dataflows.filter_rows.
h/t @shevron

def filter_rows_callable(cb):
    def f(rows):
        yield from filter(cb, rows)
    return f
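
Usage could look like this (a sketch; the input file, lambda and field name are just for illustration):

from dataflows import Flow, load, printer

Flow(
    load('data/academy.csv'),  # hypothetical input
    filter_rows_callable(lambda row: row['Winner'] is not None),
    printer(),
).process()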

[cli] Misc holding page for ideas from Rufus

Options for dataflow init

  • --interactive - default = false
  • --package - produce a full data package layout. default=false

Options for run

  • run looks for flow.py in current directory or in script (flow)/flow.py
  • --step

Normalize resource

Add ability to normalize a resource - i.e.

  • extract a few columns out of a resource into a deduplicated 'lookup table' resource
  • add proper cross indexes and pointers (i.e. replace the extracted columns with the index of the corresponding row in the lookup table)
  • create foreign key relationships between the two resources

when using dump_to_path, it gives blank lines between each row

Another problem occurred when using dump_to_path: it produces blank lines between each row.
Possible solution: https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row

The blank lines between the rows were fixed with the following change:

file: https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/file_dumper.py
line: 92
add newline='' as argument in tempfile.NamedTemporaryFile
from this: temp_file = tempfile.NamedTemporaryFile(mode="w+", delete=False)
to this: temp_file = tempfile.NamedTemporaryFile(mode="w+", delete=False, newline='')
This fix should be checked on Linux and Mac to see how it behaves.

Originally posted by @svetozarstojkovic in #57 (comment)

Add computed field with callable

Add a computed field whose value is computed using a Python callable.
This is very useful and can be reused. It might make sense to merge with dataflows.add_computed_field.
h/t @shevron

def add_computed_field_callable(name, type, callback, **options):
    def func(package):
        # Alter the schema to add a field
        for resource in package.pkg.descriptor['resources']:
            resource['schema']['fields'].append(dict(name=name, type=type, **options))
        yield package.pkg
        
        def value_setter(rows):
            for row in rows:
                row[name] = callback(row)
                yield row

        for resource in package:
            yield value_setter(resource)

    return func
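
Usage could look like this (a sketch; the input file, field name and callback are illustrative):

from dataflows import Flow, load, printer

Flow(
    load('data/academy.csv'),  # hypothetical input
    add_computed_field_callable('is_winner', 'boolean',
                                lambda row: row['Winner'] is not None),
    printer(),
).process()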

Runtime execution contract documentation

From what I can tell, when a flow is processed, each row goes through the entire pipeline before the next is processed (for the most part). Also as discussed in gitter, resources are only available after the package.pkg has been yielded (in a package level processor).

Suggested feature: Dataflows DSL

$ dataflows -c '
load "/foo/bar/datapackage.json"
./my-flow.py:my_step "arg_a" "arg_b"
printer
'
FOO | BAR
-------|-------
aaa  | ccc
^^^^^^^^^^^

Create file: my-flow.dataflow

#!/usr/bin/env dataflows

my_module.steps:my_step "${1}" "${2}" '{"baz":"bax"}'
checkpoint

Run it:

$ chmod +x my-flow.dataflow
$ ./my-flow.dataflow "PARAM_1" "PARAM_2"

Saving checkpoint 1

$ dataflows -c '
checkpoint 1
join --source_name=foo --source_key=["my_id"] \
       --source_delete=false --target_name=bar --target_key=["my_id"] \
       --fields={"baz": {}}
checkpoint
'
Loading from checkpoint 1
Saving to checkpoint 2

Related: #48

Demos (in a demos directory)

  • Load a local csv
  • Load a local xls
  • Load google analytics
  • Load a remote CSV
  • Load a wikipedia page and scrape a table

Using inline data as source raises AttributeError

Trying to load inline data as described in the tabulator docs - https://github.com/frictionlessdata/tabulator-py#inline-read-only

But it raises the following error. It looks like it's expecting a string, not a list, as the source. I thought there would be a way to indicate that the source is inline data, and tried specifying format='inline', but it didn't help:

Traceback (most recent call last):
  File "date.py", line 32, in <module>
    Calendar_Date_Dimension()
  File "date.py", line 28, in Calendar_Date_Dimension
    flow.process()
  File "/Users/anuarustayev/Desktop/repos/sandbox-cubes/cubes/lib/python3.6/site-packages/dataflows/base/flow.py", line 15, in process
    return self._chain().process()
  File "/Users/anuarustayev/Desktop/repos/sandbox-cubes/cubes/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 83, in process
    ds = self._process()
  File "/Users/anuarustayev/Desktop/repos/sandbox-cubes/cubes/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 72, in _process
    datastream = self.source._process()
  File "/Users/anuarustayev/Desktop/repos/sandbox-cubes/cubes/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 72, in _process
    datastream = self.source._process()
  File "/Users/anuarustayev/Desktop/repos/sandbox-cubes/cubes/lib/python3.6/site-packages/dataflows/base/datastream_processor.py", line 75, in _process
    self.datapackage = self.process_datapackage(self.datapackage)
  File "/Users/anuarustayev/Desktop/repos/sandbox-cubes/cubes/lib/python3.6/site-packages/dataflows/processors/load.py", line 88, in process_datapackage
    if self.load_source.startswith('env://'):
AttributeError: 'list' object has no attribute 'startswith'
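
As a workaround, inline data can be passed to Flow directly as a list of dicts instead of going through load (the same pattern as the [{'hello': 'world'}] example elsewhere on this page; the data here is illustrative):

from dataflows import Flow, printer

data = [
    {'date': '2018-01-01', 'value': 1},
    {'date': '2018-01-02', 'value': 2},
]

Flow(
    data,       # a list of dicts works as a source step
    printer(),
).process()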
