gretelai / gretel-blueprints
Public blueprints for data use cases
Home Page: gretel-blueprints.vercel.app
License: Apache License 2.0
Hello!
In the file docs/inference_api_beta/js/example.js, the header comment reads:
/*
REQUIREMENTS
...
3. Maximum of 50 rows. Use the standard /models batch API for more.
...
Is there any intention to increase the limit, or will 50 rows always be the maximum for the streaming API endpoint at v1/inference/tabular/stream?
Is it possible to stream results from the /models batch API, as the comment suggests? Or must we wait for the records to be created and then retrieve them?
Thank you!
I got this error while trying out your blueprint for creating synthetic data (link). I'm not sure if the code is out of date or if I'm missing a package, but I already have gretel-client, gretel-synthetics, and pandas installed.
Update: gretel-synthetics version 0.15.5, gretel-client version 0.7.11.
Hi,
I am running the local_classify example, but I get an error message when running the cell below:
# Create a project and train the synthetic data model
project = create_or_get_unique_project(name="synthetic-data-local")
model = project.create_model_obj(model_config=config)
run = submit_docker_local(model, output_dir="tmp/")
---------------------------------------------------------------------------
GretelClientConfigurationError Traceback (most recent call last)
<ipython-input-11-3eb3777a3233> in <module>
1 # Create a project and train the synthetic data model
2
----> 3 project = create_or_get_unique_project(name="synthetic-data-local")
4 model = project.create_model_obj(model_config=config)
5 run = submit_docker_local(model, output_dir="tmp/")
3 frames
/usr/local/lib/python3.7/dist-packages/gretel_client/projects/projects.py in create_or_get_unique_project(name, desc, display_name)
469 params will have no affect.
470 """
--> 471 current_user_dict = get_me()
472 unique_suffix = current_user_dict["_id"][9:]
473 target_name = f"{name}-{unique_suffix}"
/usr/local/lib/python3.7/dist-packages/gretel_client/users/users.py in get_me(as_dict)
17 the only option available.
18 """
---> 19 api = get_session_config().get_api(UsersApi)
20 resp = api.users_me()
21 if as_dict:
/usr/local/lib/python3.7/dist-packages/gretel_client/config.py in get_api(self, api_interface, max_retry_attempts, backoff_factor)
216 to determine the time between attempts.
217 """
--> 218 return api_interface(self._get_api_client(max_retry_attempts, backoff_factor))
219
220 def _check_project(self, project_name: str = None) -> Optional[str]:
/usr/local/lib/python3.7/dist-packages/gretel_client/config.py in _get_api_client(self, max_retry_attempts, backoff_factor)
166 if not self.api_key.startswith("grt"):
167 raise GretelClientConfigurationError(
--> 168 "Invalid Gretel API key. Please check your configuration and try again."
169 )
170
GretelClientConfigurationError: Invalid Gretel API key. Please check your configuration and try again.
I tried inputting an API key when running gretel configure as well.
Btw, great project! I'm hoping to use it to generate a data warehouse for a personal data engineering project in GCP (ideally using the multi-table example).
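For anyone else hitting this, a minimal sketch of explicitly configuring the session in the notebook before creating the project, assuming the gretel-client configure_session API (api_key="prompt" asks for the key interactively):
from gretel_client import configure_session
from gretel_client.projects import create_or_get_unique_project

# The traceback above shows the client rejects any API key that does not
# start with "grt", so paste the full key from the Gretel console.
configure_session(api_key="prompt", cache="yes", validate=True)

# With a valid session configured, the failing cell should get past get_me():
project = create_or_get_unique_project(name="synthetic-data-local")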
I am trying to generate synthetic lat/long data.
Table: user_id, timestamp, latitude, and longitude.
Even though accuracy is 80 percent, the results are not usable at all.
train_data has 5442 rows with 10 unique user_id values.
config_template = {
    "epochs": 1000,
    "early_stopping": True,
    "vocab_size": 200000,
    "reset_states": True,
    "checkpoint_dir": "/content/sample_data/check2/",
    "overwrite": True,
}
from gretel_helpers.series_models import TimeseriesModel

model = TimeseriesModel(
    training_df=test_data,
    time_column="ts",
    trend_columns=["latitude", "longitude"],
    other_seed_columns=["user_id"],
    synthetic_config=config_template,
)
model.train()
synthetic_df = model.generate().df
The standard deviation in the synthetic data is much too high. Should I train longer or increase the dataset size to get better results?
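As a quick way to make the spread mismatch concrete, a small sketch comparing per-column standard deviations (train_data and synthetic_df are the frames from the snippet above):
import pandas as pd

# A large synthetic/train ratio confirms the synthetic coordinates are
# over-dispersed relative to the real ones.
cols = ["latitude", "longitude"]
comparison = pd.DataFrame({
    "train_std": train_data[cols].std(),
    "synthetic_std": synthetic_df[cols].std(),
})
comparison["ratio"] = comparison["synthetic_std"] / comparison["train_std"]
print(comparison)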
No matter the length of my dataset (1,000, 4,000, 10,000, or 100,000 rows), the following error is thrown:
RuntimeError: Model training failed. Your training data may have too few records in it. Please try increasing your training rows and try again
import pandas as pd

from gretel_helpers.synthetics import SyntheticDataBundle

dataset_path = 'https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/healthcare-analytics-vidhya/train_data.csv'
nrows = 10000

config_template = {
    "checkpoint_dir": "/content/sample_data/checkpoints3",
    "dp": True,          # enable differential privacy in training
    "epochs": 25,        # recommend 15-30 epochs to train production models
    "gen_lines": nrows,  # number of lines to generate in the first batch
    # "vocab_size": 20000
}

training_df = pd.read_csv(dataset_path)

bundle = SyntheticDataBundle(
    training_df=training_df,
    # build record validators that learn per-column; these are used to ensure
    # generated records have the same composition as the original
    auto_validate=False,
    synthetic_config=config_template,  # the config for Synthetics
)
I'm trying to train a dataset with the Create Synthetic Data blueprint. In the config template I have set "dp": True.
I get this error:
100%|██████████| 67453/67453 [00:01<00:00, 50095.59it/s]
WARNING dp_model.py: Experimental: Differentially private training enabled
WARNING dp_model.py: ******* Patching TensorFlow to utilize new Keras code paths, see: https://github.com/tensorflow/tensorflow/issues/44917 *******
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (64, None, 256) 5120000
_________________________________________________________________
dropout (Dropout) (64, None, 256) 0
_________________________________________________________________
lstm (LSTM) (64, None, 256) 525312
_________________________________________________________________
dropout_1 (Dropout) (64, None, 256) 0
_________________________________________________________________
lstm_1 (LSTM) (64, None, 256) 525312
_________________________________________________________________
dropout_2 (Dropout) (64, None, 256) 0
_________________________________________________________________
dense (Dense) (64, None, 20000) 5140000
=================================================================
Total params: 11,310,624
Trainable params: 11,310,624
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/gretel_synthetics/tensorflow/train.py in train_rnn(params)
235 callbacks=_callbacks,
--> 236 validation_data=validation_dataset
237 )
15 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1182 callbacks.on_train_batch_begin(step)
-> 1183 tmp_logs = self.train_function(iterator)
1184 if data_handler.should_sync:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
888 with OptionalXlaContext(self._jit_compile):
--> 889 result = self._call(*args, **kwds)
890
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
932 initializers = []
--> 933 self._initialize(args, kwds, add_initializers_to=initializers)
934 finally:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in _initialize(self, args, kwds, add_initializers_to)
763 self._stateful_fn._get_concrete_function_internal_garbage_collected( # pylint: disable=protected-access
--> 764 *args, **kwds))
765
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _get_concrete_function_internal_garbage_collected(self, *args, **kwargs)
3049 with self._lock:
-> 3050 graph_function, _ = self._maybe_define_function(args, kwargs)
3051 return graph_function
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _maybe_define_function(self, args, kwargs)
3443 self._function_cache.missed.add(call_context_key)
-> 3444 graph_function = self._create_graph_function(args, kwargs)
3445 self._function_cache.primary[cache_key] = graph_function
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
3288 override_flat_arg_shapes=override_flat_arg_shapes,
-> 3289 capture_by_value=self._capture_by_value),
3290 self._function_attributes,
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
998
--> 999 func_outputs = python_func(*func_args, **func_kwargs)
1000
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py in wrapped_fn(*args, **kwds)
671 with OptionalXlaContext(compile_with_xla):
--> 672 out = weak_wrapped_fn().__wrapped__(*args, **kwds)
673 return out
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
985 if hasattr(e, "ag_error_metadata"):
--> 986 raise e.ag_error_metadata.to_exception(e)
987 else:
ValueError: in user code:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:855 train_function *
return step_function(self, iterator)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:845 step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:1285 run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:2833 call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py:3608 _call_for_each_replica
return fn(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:838 run_step **
outputs = model.train_step(data)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py:799 train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:529 minimize
loss, var_list=var_list, grad_loss=grad_loss, tape=tape)
/usr/local/lib/python3.7/dist-packages/tensorflow_privacy/privacy/optimizers/dp_optimizer_keras.py:88 _compute_gradients
tf.reshape(loss, [self._num_microbatches, -1]), axis=1)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206 wrapper
return target(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py:195 reshape
result = gen_array_ops.reshape(tensor, shape, name)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_array_ops.py:8398 reshape
"Reshape", tensor=tensor, shape=shape, name=name)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
attrs=attr_protos, op_def=op_def)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py:601 _create_op_internal
compute_device)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:3565 _create_op_internal
op_def=op_def)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:2042 __init__
control_input_ops, op_def)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py:1883 _create_c_op
raise ValueError(str(e))
ValueError: Dimension size must be evenly divisible by 64 but is 1 for '{{node Reshape}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32](loss/weighted_loss/value, Reshape/shape)' with input shapes: [], [2] and with input tensors computed as partial shapes: input[1] = [64,?].
From my understanding, the issue comes from the tensors being processed by the differential privacy algorithm. I can train without issue when "dp": False. It is also worth noting that I could train with "dp": True on the same blueprint notebook without issue 3 months ago.
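For context, a sketch of the constraint the DP optimizer imposes, based on the dp_optimizer_keras.py frame in the traceback (this is not the blueprint's code; the optimizer and loss values are assumptions for illustration):
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasAdamOptimizer

# The DP optimizer reshapes the loss into [num_microbatches, -1] before
# clipping, so num_microbatches must evenly divide the batch size of 64
# shown in the model summary above.
optimizer = DPKerasAdamOptimizer(
    l2_norm_clip=1.0,      # assumed value, for illustration
    noise_multiplier=1.1,  # assumed value, for illustration
    num_microbatches=64,
)

# The reshape also requires a per-example loss vector. With the default
# reduction the loss is a scalar (shape [], as in the error's input shapes),
# which yields "Dimension size must be evenly divisible by 64 but is 1".
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE,
)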
Hi! Thanks for the repo. May I ask what the minimum number of training samples is for creating synthetic data from a CSV or DataFrame? My CSV had fewer than 100 samples, and I received the error 'RuntimeError: Model training failed. Your training data may have too few records in it. Please try increasing your training rows and try again' when running the model.train() cell from a create_synthetic_data_from_csv_or_df notebook.
During training, the checkpoint directory is used to store trained model weights. How do I restore those weights to retrain or reuse the model?
Also, how do I add min/max range constraints on a column's values? This is not clear to me from the docs, and I could not find an example.
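In case it helps, a sketch of both, assuming the low-level gretel-synthetics API (TensorFlowConfig and generate_text; the paths and the constrained column are hypothetical):
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.generate import generate_text

# Reuse saved weights by pointing a config at the same checkpoint_dir used
# during training; generation then restores from the checkpoint instead of
# retraining from scratch.
config = TensorFlowConfig(
    checkpoint_dir="/content/sample_data/check2/",
    input_data_path="/content/sample_data/train.csv",
)

def in_range(line: str):
    # Hypothetical min/max constraint on the third CSV field; raising makes
    # generate_text count the record as invalid and retry.
    value = float(line.split(",")[2])
    if not (0.0 <= value <= 100.0):
        raise ValueError("value out of range")

for record in generate_text(config, line_validator=in_range, num_lines=100):
    if record.valid:
        print(record.text)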