gretel-blueprints's Issues

Will the inference API ever allow greater than 50 rows?

Hello!

In the file docs/inference_api_beta/js/example.js the header comment reads:

/*
REQUIREMENTS
...
3. Maximum of 50 rows. Use the standard /models batch API for more.
...

Is there any intention to increase the limit, or will 50 rows always be the maximum possible with the streaming API endpoint at v1/inference/tabular/stream?

Is it possible to stream results from the /models batch API as the suggestion states, or must we wait for the records to be created and then retrieve them?

Thank you!
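In the meantime, one possible client-side workaround is to split a larger dataframe into chunks of at most 50 rows and submit each chunk separately. The sketch below only shows the chunking step with pandas; send_chunk is a hypothetical placeholder for whatever call your code makes to v1/inference/tabular/stream, not a gretel-client function.

import pandas as pd

MAX_STREAM_ROWS = 50  # documented limit of the streaming endpoint

def iter_chunks(df: pd.DataFrame, size: int = MAX_STREAM_ROWS):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

# Usage sketch (send_chunk is hypothetical):
# for chunk in iter_chunks(my_df):
#     send_chunk(chunk)  # e.g. POST the chunk to v1/inference/tabular/stream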

CHECKPOINT_DIR Issue

Getting a KeyError (screenshot attached: "Screenshot 2021-05-31 at 10 34 36").

Can't access the path set by checkpoint_dir = str(Path.cwd() / "checkpoints-synthetics").
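Without the full traceback it's hard to say exactly what raises the KeyError, but one common cause of checkpoint-path problems in notebook environments is that the directory doesn't exist or isn't writable before training starts. A minimal sketch, assuming you just want to guarantee the path exists before passing it into the config:

from pathlib import Path

checkpoint_path = Path.cwd() / "checkpoints-synthetics"
checkpoint_path.mkdir(parents=True, exist_ok=True)  # create it up front if missing
checkpoint_dir = str(checkpoint_path)               # pass this string to the config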

ModuleNotFoundError: No module named 'gretel_helpers'

I got this error while trying out your blueprint for creating synthetic data (link). I'm not sure if the code is out-of-date or if I'm missing a package, but I already have gretel-client, gretel-synthetics, and pandas installed.

Update: gretel-synthetics version 0.15.5, gretel-client version 0.7.11.
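Since the error is a ModuleNotFoundError for gretel_helpers specifically, having gretel-client and gretel-synthetics installed doesn't by itself provide that module. A small diagnostic sketch to confirm which of the packages the blueprint imports are actually present in the active environment (module names are taken from the import paths used in the blueprint):

import importlib.util

for module in ("gretel_client", "gretel_synthetics", "gretel_helpers", "pandas"):
    found = importlib.util.find_spec(module) is not None
    print(f"{module}: {'installed' if found else 'MISSING'}")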

"Invalid Gretel API key. Please check your configuration and try again" when running local_classify.ipynb

Hi,

I am running the local_classify example, but I get an error when running the cell below:

# Create a project and train the synthetic data model

project = create_or_get_unique_project(name="synthetic-data-local")
model = project.create_model_obj(model_config=config)
run = submit_docker_local(model, output_dir="tmp/")
---------------------------------------------------------------------------
GretelClientConfigurationError            Traceback (most recent call last)
<ipython-input-11-3eb3777a3233> in <module>
      1 # Create a project and train the synthetic data model
      2 
----> 3 project = create_or_get_unique_project(name="synthetic-data-local")
      4 model = project.create_model_obj(model_config=config)
      5 run = submit_docker_local(model, output_dir="tmp/")

3 frames
/usr/local/lib/python3.7/dist-packages/gretel_client/projects/projects.py in create_or_get_unique_project(name, desc, display_name)
    469         params will have no affect.
    470     """
--> 471     current_user_dict = get_me()
    472     unique_suffix = current_user_dict["_id"][9:]
    473     target_name = f"{name}-{unique_suffix}"

/usr/local/lib/python3.7/dist-packages/gretel_client/users/users.py in get_me(as_dict)
     17             the only option available.
     18     """
---> 19     api = get_session_config().get_api(UsersApi)
     20     resp = api.users_me()
     21     if as_dict:

/usr/local/lib/python3.7/dist-packages/gretel_client/config.py in get_api(self, api_interface, max_retry_attempts, backoff_factor)
    216                 to determine the time between attempts.
    217         """
--> 218         return api_interface(self._get_api_client(max_retry_attempts, backoff_factor))
    219 
    220     def _check_project(self, project_name: str = None) -> Optional[str]:

/usr/local/lib/python3.7/dist-packages/gretel_client/config.py in _get_api_client(self, max_retry_attempts, backoff_factor)
    166         if not self.api_key.startswith("grt"):
    167             raise GretelClientConfigurationError(
--> 168                 "Invalid Gretel API key. Please check your configuration and try again."
    169             )
    170 

GretelClientConfigurationError: Invalid Gretel API key. Please check your configuration and try again.

I tried inputting an API key when running gretel configure as well.

Btw, great project! I'm hoping to use it to generate a data warehouse for a personal data engineering project in GCP (ideally using the multi table example).
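Judging from the traceback, the client rejects any configured key that does not start with "grt", so the error usually means the key was never picked up (or an empty/placeholder value is being read) rather than that the key itself is invalid. A quick sanity check you can run before create_or_get_unique_project, assuming the key is exposed through gretel configure or a GRETEL_API_KEY environment variable (the env-var fallback here is an assumption, not something the traceback shows):

import os
from gretel_client.config import get_session_config

# api_key is the same attribute the traceback checks in _get_api_client
key = get_session_config().api_key or os.getenv("GRETEL_API_KEY", "")
print("key present:", bool(key), "| starts with 'grt':", key.startswith("grt"))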

Model doesn't capture latitude/longitude properly even though accuracy is high.

I am trying to generate synthetic lat/long data.
The table has columns user_id, timestamp, latitude, and longitude.

Though the accuracy is 80 percent, the results are not usable at all.
(Screenshot: Screen Shot 2021-06-22 at 3 32 29 PM)

Code

train_data has 5442 rows with 10 unique user_id values.

from gretel_helpers.series_models import TimeseriesModel

config_template = {
    "epochs": 1000,
    "early_stopping": True,
    "vocab_size": 200000,
    "reset_states": True,
    "checkpoint_dir": "/content/sample_data/check2/",
    "overwrite": True,
}

model = TimeseriesModel(
    training_df=test_data,
    time_column="ts",
    trend_columns=["latitude", "longitude"],
    other_seed_columns=["user_id"],
    synthetic_config=config_template,
)
model.train()
synthetic_df = model.generate().df

(Screenshot: Screen Shot 2021-06-22 at 3 38 12 PM)

The standard deviation is too high in the synthetic data.

Should I train longer or increase the dataset size to get better results?

Red is synthetic data, blue is real data.

(Screenshots: Screen Shot 2021-06-22 at 3 32 22 PM, Screen Shot 2021-06-22 at 3 32 15 PM)
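Since the concern is that the synthetic spread is much wider than the real one, it may help to quantify the gap per column before tuning anything. A minimal comparison sketch, assuming the training frame (test_data, as passed in the snippet above) and the generated synthetic_df are both still in memory:

import pandas as pd

cols = ["latitude", "longitude"]
comparison = pd.DataFrame({
    "real_std": test_data[cols].std(),
    "synthetic_std": synthetic_df[cols].std(),
})
comparison["std_ratio"] = comparison["synthetic_std"] / comparison["real_std"]
print(comparison)  # std_ratio near 1.0 would indicate the spread is being captured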

Too few records

No matter the size of my dataset (1000, 4000, 10000, or 100000 rows), the following error is thrown:

RuntimeError: Model training failed. Your training data may have too few records in it. Please try increasing your training rows and try again

Load dataset and build synthetic model

from gretel_helpers.synthetics import SyntheticDataBundle

Specify dataset

dataset_path = 'https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/healthcare-analytics-vidhya/train_data.csv'
nrows = 10000
config_template = {
    "checkpoint_dir": "/content/sample_data/checkpoints3",
    "dp": True,          # enable differential privacy in training
    "epochs": 25,        # recommend 15-30 epochs to train production models
    "gen_lines": nrows,  # number of lines to generate in first batch
    # "vocab_size": 20000
}

Gretel helpers to optimize the synthetic model

import pandas as pd

training_df = pd.read_csv(dataset_path)
bundle = SyntheticDataBundle(
    training_df=training_df,
    auto_validate=False,  # build record validators that learn per-column; these ensure generated records have the same composition as the original
    synthetic_config=config_template,  # the config for Synthetics
)
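Before tuning anything, it may be worth confirming that the dataframe which actually reaches SyntheticDataBundle is as large as expected, for example that the CSV downloaded fully and isn't dominated by empty rows. A purely diagnostic sketch, run right after the pd.read_csv call above:

print("rows loaded:", len(training_df))
print("columns:", list(training_df.columns))
print("completely empty rows:", int(training_df.isna().all(axis=1).sum()))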

Create Synthetic Data blueprint fails on differential privacy training

I'm trying to train on a dataset with the Create Synthetic Data blueprint. In the config template I have set "dp": True.

I get this error:

100%|██████████| 67453/67453 [00:01<00:00, 50095.59it/s]
WARNING dp_model.py: Experimental: Differentially private training enabled
WARNING dp_model.py: ******* Patching TensorFlow to utilize new Keras code paths, see: https://github.com/tensorflow/tensorflow/issues/44917 *******
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (64, None, 256)           5120000   
_________________________________________________________________
dropout (Dropout)            (64, None, 256)           0         
_________________________________________________________________
lstm (LSTM)                  (64, None, 256)           525312    
_________________________________________________________________
dropout_1 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
lstm_1 (LSTM)                (64, None, 256)           525312    
_________________________________________________________________
dropout_2 (Dropout)          (64, None, 256)           0         
_________________________________________________________________
dense (Dense)                (64, None, 20000)         5140000   
=================================================================
Total params: 11,310,624
Trainable params: 11,310,624
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gretel_synthetics/tensorflow/train.py in train_rnn(params)
    235                   callbacks=_callbacks,
--> 236                   validation_data=validation_dataset
    237                   )

15 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1182               callbacks.on_train_batch_begin(step)
-> 1183               tmp_logs = self.train_function(iterator)
   1184               if data_handler.should_sync:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    888       with OptionalXlaContext(self._jit_compile):
--> 889         result = self._call(*args, **kwds)
    890 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    932       initializers = []
--> 933       self._initialize(args, kwds, add_initializers_to=initializers)
    934     finally:

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in _initialize(self, args, kwds, add_initializers_to)
    763         self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
--> 764             *args, **kwds))
    765 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _get_concrete_function_internal_garbage_collected(self, *args, **kwargs)
   3049     with self._lock:
-> 3050       graph_function, _ = self._maybe_define_function(args, kwargs)
   3051     return graph_function

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _maybe_define_function(self, args, kwargs)
   3443           self._function_cache.missed.add(call_context_key)
-> 3444           graph_function = self._create_graph_function(args, kwargs)
   3445           self._function_cache.primary[cache_key] = graph_function

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   3288             override_flat_arg_shapes=override_flat_arg_shapes,
-> 3289             capture_by_value=self._capture_by_value),
   3290         self._function_attributes,

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
    998 
--> 999       func_outputs = python_func(*func_args, **func_kwargs)
   1000 

/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in wrapped_fn(*args, **kwds)
    671         with OptionalXlaContext(compile_with_xla):
--> 672           out = weak_wrapped_fn().__wrapped__(*args, **kwds)
    673         return out

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    985             if hasattr(e, "ag_error_metadata"):
--> 986               raise e.ag_error_metadata.to_exception(e)
    987             else:

ValueError: in user code:

    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:855 train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:845 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:1285 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2833 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3608 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:838 run_step  **
        outputs = model.train_step(data)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:799 train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:529 minimize
        loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
    /usr/local/lib/python3.7/dist-packages/tensorflow_privacy/privacy/optimizers/dp_optimizer_keras.py:88 _compute_gradients
        tf.reshape(loss, [self._num_microbatches, -1]), axis=1)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
        return target(*args, **kwargs)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py:195 reshape
        result = gen_array_ops.reshape(tensor, shape, name)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py:8398 reshape
        "Reshape", tensor=tensor, shape=shape, name=name)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py:601 _create_op_internal
        compute_device)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:3565 _create_op_internal
        op_def=op_def)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:2042 __init__
        control_input_ops, op_def)
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:1883 _create_c_op
        raise ValueError(str(e))

    ValueError: Dimension size must be evenly divisible by 64 but is 1 for '{{node Reshape}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32](loss/weighted_loss/value, Reshape/shape)' with input shapes: [], [2] and with input tensors computed as partial shapes: input[1] = [64,?].

From my understanding, the issue comes from how the tensors are processed by the differential privacy algorithm. I can train without issue when "dp": False. It is also worth noting that I could train with "dp": True on the same blueprint notebook without issue three months ago.
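That reading matches the last frame of the traceback: the DP optimizer reshapes the loss into [num_microbatches, -1] (64 microbatches here), but it receives a loss that has already been reduced to a single scalar, and one element cannot be split into 64 microbatches. A minimal sketch reproducing just that reshape failure, independent of gretel-synthetics:

import tensorflow as tf

scalar_loss = tf.constant(1.0)         # a loss already reduced to a single value
try:
    tf.reshape(scalar_loss, [64, -1])  # same reshape the DP optimizer applies per microbatch
except tf.errors.InvalidArgumentError as err:
    print(err)                         # fails: 1 element cannot be divided into 64 microbatches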

Minimum training samples for 'create_synthetic_data_from_csv_or_df'

Hi! Thanks for the repo. May I ask what the minimum number of training samples is for creating synthetic data from a CSV or dataframe? My CSV had fewer than 100 samples, and I received the error 'RuntimeError: Model training failed. Your training data may have too few records in it. Please try increasing your training rows and try again' when running the model.train() cell from a create_synthetic_data_from_csv_or_df notebook.

Question : how to load trained model

During training, the checkpoint directory is used to store the trained model weights. How can I restore, retrain, and reuse them?

And how can I add min/max value-range constraints to a column?

This isn't clear to me from the docs, and I couldn't find an example.
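For the constraints part, I'm not aware of a documented config option for hard min/max column bounds, so as a workaround (not the library's own mechanism) the generated dataframe can be post-processed with pandas: either drop rows that fall outside the allowed range or clip values into it. The column name and bounds below are made-up examples:

COL, LOW, HIGH = "amount", 0.0, 100.0  # hypothetical column and bounds

# Option 1: drop generated rows whose value falls outside the range
filtered = synthetic_df[synthetic_df[COL].between(LOW, HIGH)]

# Option 2: clip out-of-range values to the nearest bound instead of dropping rows
clipped = synthetic_df.copy()
clipped[COL] = clipped[COL].clip(lower=LOW, upper=HIGH)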
