
pba's People

Contributors

arcelien, cclauss, prob1995


pba's Issues

inspect.getargspec() deprecated in Python 3

Hi PBA Team,

Spotted this warning in my IDE when looking through your code:

if 'image_size' in inspect.getargspec(self.xform).args:

getargspec() has been deprecated since Python 3.0 in favour of getfullargspec()

It may be worthwhile adding a try/except block for this import, because getfullargspec() isn't available in Python 2.x.

https://docs.python.org/2.7/library/inspect.html#inspect.getargspec
https://docs.python.org/3.7/library/inspect.html#inspect.getargspec
https://docs.python.org/3.7/library/inspect.html#inspect.getfullargspec
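A minimal sketch of the compatibility shim suggested above (hedged: the call site shown is just the line quoted in this issue and may look slightly different in the actual code):

    import inspect

    try:
        # Python 3: getfullargspec replaces the deprecated getargspec
        get_argspec = inspect.getfullargspec
    except AttributeError:
        # Python 2: getfullargspec doesn't exist, fall back to getargspec
        get_argspec = inspect.getargspec

    # usage, mirroring the line quoted above:
    # if 'image_size' in get_argspec(self.xform).args: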

Kind regards,
Michael

Bounding box dataset

Hi everyone, have you tried PBA on a dataset with bounding boxes? If not, which part of the code should I change to make it work with this type of data?

Question about search results in pbt_global.txt

Hello, I ran the search algorithm and got the results in pbt_global.txt.
In pbt_global.txt, a line looks like this, for example:
["2", "0", 2, 3, {"aug_policy": "cifar10", "train_size": 4000, "wrn_size": 32, "explore": "cifar10", "data_path": "/home/jinlin3/pba2/datasets/cifar-10-batches-py", "batch_size": 128, "dataset": "cifar10", "no_aug": false, "recompute_dset_stats": false, "test_batch_size": 25, "gradient_clipping_by_global_norm": 5.0, "use_hp_policy": true, "validation_size": 46000, "num_epochs": 200, "hp_policy": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], "weight_decay_rate": 0.0005, "lr": 0.1, "no_cutout": false, "wrn_depth": 40, "model_name": "wrn"}, {"aug_policy": "cifar10", "train_size": 4000, "wrn_size": 32, "explore": "cifar10", "data_path": "/home/jinlin3/pba2/datasets/cifar-10-batches-py", "batch_size": 128, "validation_size": 46000, "no_aug": false, "recompute_dset_stats": false, "wrn_depth": 40, "test_batch_size": 25, "gradient_clipping_by_global_norm": 5.0, "use_hp_policy": true, "dataset": "cifar10", "num_epochs": 200, "hp_policy": [0, 3, 4, 0, 4, 6, 3, 0, 0, 1, 0, 8, 0, 1, 3, 0, 0, 1, 1, 3, 3, 1, 0, 0, 0, 0, 9, 3, 5, 5, 2, 0, 0, 0, 0, 2, 1, 0, 0, 2, 2, 0, 6, 0, 3, 4, 0, 3, 3, 5, 3, 0, 0, 0, 0, 4, 0, 9, 0, 3], "model_name": "wrn", "lr": 0.1, "no_cutout": false, "weight_decay_rate": 0.0005}]
What do "2" and "0" denote? What do 2 and 3 denote ?

Possible unhandled error from worker

@arcelien I came across an issue when extending PBA to my own network structure.

The error is like this:
ERROR worker.py:1612 -- Possible unhandled error from worker: ray_RayModel:restore_from_object() (pid=49849, host=XXX)
UnreconstructableError: Object 553134abdaa61cb3ed9fa555fcee22eb01000000 is lost (either evicted or explicitly deleted) and cannot be reconstructed.

Will this affect the final performance?

Can't reproduce the results in Table 3

Hi, thanks for providing the nice code.
I've tried to reproduce the results in Table 3 with "bash scripts/table_3_rcifar10.sh wrn_28_10".
The highest test accuracy over two trials is 87.42%, which is much lower than the result in the paper.
My environment is TensorFlow 1.14, Ray 0.7, CUDA 10.0, cuDNN 7.
What would be the reason for the low performance?

use for own dataset

Hello everyone, can you point me to any guide for running augmentation on my own dataset?

requirements.txt and non-square images handling

  • I see you have clearly stated that pba has been tested with TensorFlow 1.10.0 and 1.11.0, and likewise that Ray has been tested with version 0.7.0. It would be great if this could be added to a requirements.txt file as, for example, tensorflow-gpu==1.11.0. Also, does it make sense to add tensorflow to the requirements file as well, for CPU-based execution? (See the sketch after this list.)

  • In pil_unwrap of augmentation_transforms.py, the code takes in only square images. Is there a possibility to handle images of unequal width and height?
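A sketch of the kind of requirements.txt suggested in the first point above; the pins simply echo the versions mentioned in this issue, and the authors may prefer different ones:

    # requirements.txt (sketch)
    tensorflow-gpu==1.11.0   # or: tensorflow==1.11.0 for CPU-based execution
    ray==0.7.0
    numpy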

TuneError when running PBA on Google Colab

Hi PBA Team,

I am trying to run PBA on Google Colab, but I'm running into the error below when running any of the included scripts. Any suggestions on what is going on here? I've searched for related errors but I'm not coming across anything apart from Ray GitHub links to where the message is implemented in code.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From pba/train.py:87: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From pba/train.py:87: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING:tensorflow:From pba/train.py:88: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

WARNING:tensorflow:From /content/drive/My Drive/Colab Notebooks/ResearchProject/pba/pba/setup.py:102: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0229 23:32:49.458530 140466326476672 module_wrapper.py:139] From /content/drive/My Drive/Colab Notebooks/ResearchProject/pba/pba/setup.py:102: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Namespace(aug_policy='cifar10', bs=128, checkpoint_freq=0, cpu=4.0, data_path='/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/datasets/', dataset='svhn', epochs=160, explore='cifar10', flatten=False, gpu=1.0, hp_policy='/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/schedules/rsvhn_16_wrn.txt', hp_policy_epochs=160, local_dir='/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/results/', lr=0.05, model_name='wrn_28_10', name='eval_svhn_wrn_28_10', no_aug=False, no_cutout=True, num_samples=1, proportion=1.0, recompute_dset_stats=False, restore=None, test_bs=25, train_size=1000, use_hp_policy=True, val_size=0, wd=0.01)
I0229 23:32:49.458748 140466326476672 setup.py:102] Namespace(aug_policy='cifar10', bs=128, checkpoint_freq=0, cpu=4.0, data_path='/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/datasets/', dataset='svhn', epochs=160, explore='cifar10', flatten=False, gpu=1.0, hp_policy='/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/schedules/rsvhn_16_wrn.txt', hp_policy_epochs=160, local_dir='/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/results/', lr=0.05, model_name='wrn_28_10', name='eval_svhn_wrn_28_10', no_aug=False, no_cutout=True, num_samples=1, proportion=1.0, recompute_dset_stats=False, restore=None, test_bs=25, train_size=1000, use_hp_policy=True, val_size=0, wd=0.01)
INFO:tensorflow:data path: /content/drive/My Drive/Colab Notebooks/ResearchProject/pba/datasets/
I0229 23:32:49.458855 140466326476672 setup.py:119] data path: /content/drive/My Drive/Colab Notebooks/ResearchProject/pba/datasets/
INFO:tensorflow:overwriting with custom epochs
I0229 23:32:49.458990 140466326476672 setup.py:203] overwriting with custom epochs
INFO:tensorflow:epochs: 160, lr: 0.05, wd: 0.01
I0229 23:32:49.459068 140466326476672 setup.py:207] epochs: 160, lr: 0.05, wd: 0.01
2020-02-29 23:32:49,470	INFO resource_spec.py:212 -- Starting Ray with 6.54 GiB memory available for workers and up to 3.27 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-02-29 23:32:49,929	INFO services.py:1078 -- View the Ray dashboard at localhost:8265
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 185, in from_json
    exp = cls(name, run_value, **spec)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 110, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 218, in register_if_needed
    register_trainable(name, run_object)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/registry.py", line 67, in register_trainable
    _global_registry.register(TRAINABLE_CLASS, name, trainable)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/registry.py", line 106, in register
    self._to_flush[(category, key)] = pickle.dumps(value)
  File "/usr/local/lib/python3.6/dist-packages/ray/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.6/dist-packages/ray/cloudpickle/cloudpickle_fast.py", line 617, in dump
    return Pickler.dump(self, obj)
TypeError: can't pickle _LazyLoader objects

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pba/train.py", line 88, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "pba/train.py", line 83, in main
    run_experiments({FLAGS.name: train_spec})
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 396, in run_experiments
    experiments = convert_to_experiment_list(experiments)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 265, in convert_to_experiment_list
    for name, spec in experiments.items()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 265, in <listcomp>
    for name, spec in experiments.items()
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/experiment.py", line 187, in from_json
    raise TuneError("Improper argument from JSON: {}.".format(spec))
ray.tune.error.TuneError: Improper argument from JSON: {'resources_per_trial': {'cpu': 4.0, 'gpu': 1.0}, 'stop': {'training_iteration': 160}, 'config': {'train_size': 1000, 'validation_size': 0, 'dataset': 'svhn', 'data_path': '/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/datasets/', 'batch_size': 128, 'gradient_clipping_by_global_norm': 5.0, 'explore': 'cifar10', 'aug_policy': 'cifar10', 'no_cutout': True, 'recompute_dset_stats': False, 'lr': 0.05, 'weight_decay_rate': 0.01, 'test_batch_size': 25, 'proportion': 1.0, 'no_aug': False, 'use_hp_policy': True, 'hp_policy': '/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/schedules/rsvhn_16_wrn.txt', 'hp_policy_epochs': 160, 'flatten': False, 'model_name': 'wrn', 'wrn_size': 160, 'wrn_depth': 28, 'num_epochs': 160}, 'local_dir': '/content/drive/My Drive/Colab Notebooks/ResearchProject/pba/results/', 'checkpoint_freq': 0, 'num_samples': 1}.

Text

Excuse me if I missed this in the paper, but is it conceivable to adapt this to text data?

GPU utilization

I wanted to test the search with the test search script. Since GPUs 0 and 1 are being used for another process, I ran this command:

CUDA_VISIBLE_DEVICES=2,3 bash scripts/test_search.sh

The following status log shows 1.96/2 GPUs under resources requested, which makes perfect sense, and every model is taking 2.0 CPUs and 0.49 GPUs:
== Status ==
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 8.0/20 CPUs, 1.96/2 GPUs
Memory usage on this node: 91.1/269.9 GB
Result logdir: /home/sgharbi/pba/results/test_search
Number of trials: 4 ({'RUNNING': 4})
RUNNING trials:

  • RayModel_0: RUNNING, [2.0 CPUs, 0.49 GPUs], [pid=425228], 87 s, 1 iter
  • RayModel_1: RUNNING, [2.0 CPUs, 0.49 GPUs], [pid=425227], 87 s, 1 iter
  • RayModel_2: RUNNING, [2.0 CPUs, 0.49 GPUs], [pid=425241], 87 s, 1 iter
  • RayModel_3: RUNNING, [2.0 CPUs, 0.49 GPUs], [pid=425233], 173 s, 2 iter

But when I look with nvidia-smi, there is no usage of GPUs 2 and 3!

Am I doing something wrong here? If so, how can I make it actually utilize the GPUs? Thank you so much!

Only one GPU shows utilization when I run train.py

I have 4 GPUs. When I run train.py with --num_samples 1 --gpu 4, only one GPU shows utilization.
Is it because the model does not support multiple GPUs?
But when I run search.py with --num_samples 16 --gpu 0.25, all GPUs show utilization.

Question regarding search result

Hi,

Firstly, thanks for sharing the code for your research. Currently I've been trying to reproduce the results using PyTorch. For comparison, I ran your search code with the provided scripts for SVHN. I found that after the search finishes, there are multiple searched-policy outcomes. I'm sure I should select the policy with the best validation result. Interestingly, the differences in validation results between the search outcomes are not that large. Can you briefly explain the reasons, or share some insight? My second question: it seems that depending on the learning schedule, the searched policy can differ. Can you share any observations or experience you have?

Thanks in advance.

Parse search results to get schedules

I tried to run the search example; the results are as follows:
experiment_state-2019-07-04_11-18-12.json
pbt_global.txt
pbt_policy_0.txt
pbt_policy_10.txt
pbt_policy_11.txt
pbt_policy_12.txt
pbt_policy_13.txt
pbt_policy_14.txt
pbt_policy_15.txt
pbt_policy_1.txt
pbt_policy_2.txt
pbt_policy_3.txt
pbt_policy_4.txt
pbt_policy_5.txt
pbt_policy_6.txt
pbt_policy_7.txt
pbt_policy_8.txt
pbt_policy_9.txt
RayModel_0_2019-07-04_11-18-12X5ljyD
RayModel_10_2019-07-04_11-19-03P76U0K
RayModel_11_2019-07-04_11-19-05drsWQC
RayModel_1_2019-07-04_11-18-12AArW21
RayModel_12_2019-07-04_11-19-06UP0tWZ
RayModel_13_2019-07-04_11-19-076AxfVX
RayModel_14_2019-07-04_11-19-070uF1rg
RayModel_15_2019-07-04_11-19-26GJtjgM
RayModel_2_2019-07-04_11-18-12QQx2i_
RayModel_3_2019-07-04_11-18-12KnTGqv
RayModel_4_2019-07-04_11-18-12Wdvaqo
RayModel_5_2019-07-04_11-18-38bTvB_m
RayModel_6_2019-07-04_11-18-39YjVHQe
RayModel_7_2019-07-04_11-18-39e7EF_T
RayModel_8_2019-07-04_11-18-39YhkcS0
RayModel_9_2019-07-04_11-18-39_Q4q5a

The README doesn't mention how to use these generated results to get the schedules. The log file in RayModel*** is empty. Can you give more details on how to use pba/utils.py to parse the result files? @arcelien Many thanks!
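A hedged sketch of how the parsing helper visible later in this thread (parse_log_schedule in pba/utils.py, whose traceback appears in the next issue) might be called, assuming its signature matches that traceback (file_path, epochs); the file name and epoch count are just examples:

    from pba.utils import parse_log_schedule

    # pbt_policy_0.txt is one of the per-trial files listed above;
    # the epoch count should match the number of epochs used during search
    schedule = parse_log_schedule('pbt_policy_0.txt', epochs=200)
    print(len(schedule))
    print(schedule[0])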

BUG? py3: malformed node or string

I use
bash scripts/table_1_cifar10.sh wrn_28_10

I got:
File "/opt/conda/lib/python3.6/site-packages/ray/tune/trainable.py", line 88, in init
self._setup(copy.deepcopy(self.config))
File "pba/train.py", line 25, in _setup
self.trainer = ModelTrainer(self.hparams)
File "/pba/pba/model.py", line 213, in init
self.data_loader = data_utils.DataSet(hparams)
File "/pba/pba/data_utils.py", line 66, in init
self.parse_policy(hparams)
File "/pba/pba/data_utils.py", line 123, in parse_policy
hparams.hp_policy_epochs)
File "/pba/pba/utils.py", line 77, in parse_log_schedule
policy = parse_log(file_path, epochs)
File "/pba/pba/utils.py", line 28, in parse_log
raw_policy = [ast.literal_eval(line) for line in raw_policy]
File "/pba/pba/utils.py", line 28, in
raw_policy = [ast.literal_eval(line) for line in raw_policy]
File "/opt/conda/lib/python3.6/ast.py", line 85, in literal_eval
return _convert(node_or_string)
File "/opt/conda/lib/python3.6/ast.py", line 84, in _convert
raise ValueError('malformed node or string: ' + repr(node))
ValueError: malformed node or string: b"['9', '8', 11, 12, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 3, 3, 7, 1, 9, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 3, 1, 0, 0
, 1, 0, 3, 0, 2, 0, 0, 6, 0, 0, 0, 3, 0, 2, 0, 0, 6, 0, 3, 4, 0, 0, 0, 0, 3, 1, 0, 3]]\n"
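The repr in the error is a bytes object (note the b" prefix), which ast.literal_eval refuses to parse on Python 3. A hedged sketch of the kind of fix, assuming the schedule file is being read in binary mode before it reaches parse_log in pba/utils.py; the path below is hypothetical:

    import ast

    # hypothetical path: whichever schedule file table_1_cifar10.sh points at
    with open('schedules/some_schedule.txt') as f:   # open in text mode
        raw_policy = f.readlines()

    # decode defensively in case lines still arrive as bytes
    raw_policy = [
        ast.literal_eval(line.decode() if isinstance(line, bytes) else line)
        for line in raw_policy
    ]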

RayOutOfMemoryError while doing PBA search.

Hi,
First of all, thanks for the well-written code for this research work. I am using PBA search to find an augmentation policy for a depth estimation task. After training for some epochs, Ray complains about RayOutOfMemoryError.
Package versions:

ray==0.8.4
tensorflow-gpu=1.11.0
python==3.5.6

I initialize ray in the main script search.py with dedicated memory for workers.

    ray.init(
        webui_host='127.0.0.1',
        memory=1024 * 1024 * 1024 * 30,             # setting 30 GB for ray workers
        object_store_memory=1024 * 1024 * 1024 * 6  # setting 6 GB object store
    )

The reported error looks something like this:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 381, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/site-packages/ray/worker.py", line 1513, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ^[[36mray::RayModel.train()^[[39m (pid=80, ip=10.244.16.90)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/site-packages/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/ceph/amanraj/codes/pba-signet/pba/train.py", line 42, in _train
    eval_preds = self.trainer.run_model(self._iteration)
  File "/ceph/amanraj/codes/pba-signet/pba/model.py", line 165, in run_model
    self._run_training_loop(epoch)
  File "/ceph/amanraj/codes/pba-signet/pba/model.py", line 155, in _run_training_loop
    self.session, self.m, self.dataset, self.train_size, curr_epoch, self.comet_exp
  File "/ceph/amanraj/codes/pba-signet/pba/helper_utils.py", line 312, in run_epoch_training
    tgt_img, src_img_stack, tgt_img_aug, src_img_stack_aug, intrinsic = dataset.next_batch(batch_indxs, curr_epoch)
  File "/ceph/amanraj/codes/pba-signet/pba/data_utils.py", line 276, in next_batch
    ) for idx in indexes
ray.exceptions.RayTaskError(RayOutOfMemoryError): ^[[36mray::IDLE^[[39m (pid=83, ip=10.244.16.90)
File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task
  File "/usr/local/lib/python3.6/site-packages/ray/memory_monitor.py", line 120, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node amraj-pba-search5k-t4jfrl6-dm74m is used (45.88 / 48.0 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
75      5.28GiB ray::RayModel.save_to_object()
81      3.96GiB ray::RayModel.train()
84      3.95GiB ray::RayModel.save()
80      3.94GiB ray::RayModel.train()
7       0.18GiB python pba/search.py --local_dir /ceph/amanraj/results/ --kitti_root /ceph/amanraj/data/kitti_proces
78      0.13GiB ray::pba.data_utils.augment_sample()
71      0.12GiB ray::IDLE
76      0.12GiB ray::IDLE
82      0.12GiB ray::IDLE
83      0.12GiB ray::IDLE

In addition, up to 18.59 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray.

My machine has 48 GiB of memory. I don't understand how the Ray object store size can be 18.59 GiB when I have already set it to 6 GiB in ray.init().

Below are the signatures of the relevant methods defined in data_utils.py, to give a gist of the logic for reading and augmenting the next batch of samples; the full code is omitted.

class TrainDataSet(object):
    def __init__(self, hparams):
        self.hparams = hparams
        ......

    def next_batch(self, indexes, iteration):
        batch_size = len(indexes)
        res = ray.get([
            augment_sample.remote(
                idx,
                iteration,
                self.data_loader,
                self.hparams.no_aug_policy,
                self.hparams.use_hp_policy,
                self.good_policies,
                self.policy,
                self.augmentation_transforms,
                self.input_height,
                self.input_width,
                self.hparams.flatten
            ) for idx in indexes
        ])
        for idx in range(0, batch_size):
            tgt_, src_img_stack_, tgt_aug_, src_img_stack_aug_, intrinsic_ = res[idx]
            # create the batch from the returned output of each call and return the batch of samples
            ...

        # batch_size = 8
        # tgt_img, tgt_img_aug = (batch_size, 412, 126, 3)
        # src_img_stack, src_img_stack_aug = (batch_size, 412, 126, 6)
        # intrinsic_batch = (batch_size, 3, 3)
        return tgt_img_batch, src_img_stack_batch, tgt_img_aug_batch, src_img_stack_aug_batch, intrinsic_batch

@ray.remote
def augment_sample(
        sample_idx, iteration, data_loader, no_aug_policy, use_hp_policy,
        good_policies, policy, augmentation_transforms, input_height, input_width, flatten,
):
    """
    :param sample_idx: index of sample to be read
    :param iteration: current epoch of model
    :param data_loader: dataloader implementing __getitem__
    :param no_aug_policy: whether to use any policy or not
    :param use_hp_policy: whether to use hp policy or not
    :param good_policies: autoaugment policy
    :param policy: parsed policy
    :param augmentation_transforms: augmentation function
    :param input_height: height of input image
    :param input_width: width of input image
    :param flatten: randomly select an aug policy from schedule
    :return: return original and augmented sample
    """
    # read sample from disk
    tgt_img, src_img_1, src_img_2, intrinsic = data_loader[sample_idx]
    
    # apply augmentations and return original and augmented form of sample as numpy arrays
    ...

    return tgt_img, src_img_stack, tgt_img_aug, src_img_stack_aug, intrinsic

My remote function returns NumPy arrays. I suspect a memory leak related to Ray, but I am not able to figure out where it is happening. Any suggestions or input will be highly appreciated. @ericl @neocxi

PS: My previous data loader was based on torch.data.Dataset and worked fine without any such crash, but due to the delay in augmentation I adopted purely Ray-based code and parallelized it using ray.remote function calls.

Eval with the schedules

Hi,

As far as I know, in Python 3.5.x or Python 2.x, the 'dict' structure has non-deterministic key order. For example:

If I do this:

    a = {'one': 1, 'two': 2, 'three': 3}
    a.keys()
    ['one', 'two', 'three']   <- this is what I expected
    ['three', 'one', 'two']   <- however, this part is different each time the Python session is renewed

I saw this kind of code in 'augmentation_transforms_hp.py', near the end of the file.

Is this random ordering intentional, as a perturbation?

I found this when I tried to visualize the schedule with pba.ipynb. When I renew the session in the IPython notebook, the schedule changes. For example:

('Brightness', 0.2, 3), ('Auto_contrast', 0.4, 3) .... <- when first tried
('sharpeness', 0.2, 3) ... <- same prob, mag, but the op changed.

Thank you!
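A hedged side note on pinning down the visualization: if the nondeterminism really does come from dict key order, iterating over sorted keys makes it reproducible. The dictionary name below is hypothetical, standing in for whatever mapping augmentation_transforms_hp.py actually uses:

    # hypothetical name-to-transform mapping, not the repo's real one
    NAME_TO_TRANSFORM = {
        'Brightness': lambda img: img,
        'AutoContrast': lambda img: img,
        'Sharpness': lambda img: img,
    }

    # order-dependent: can change between Python 2 / 3.5 sessions
    ops = list(NAME_TO_TRANSFORM.keys())

    # order-independent: identical in every session
    ops = sorted(NAME_TO_TRANSFORM.keys())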

Evaluation parameter settings?

Hi everyone, I've done some PBA searches on some datasets with the wide resnet_40_2 model,
and I'm wondering: in the evaluation part (which uses wide_resnet_28_10 with the schedule from the search, pbt_policy_*.txt), do I have to use the same parameter settings that I used in the search part (lr, wd, epochs)?

Kindly update to use newer Python

As noted at https://pythonclock.org/ , Python 2.7 expires at the end of the year. It is therefore incomprehensible that it was used for this otherwise interesting project. Kindly consider updating it to use a current version of Python, i.e. 3.7 or maybe 3.6 at the oldest. The repo's users shouldn't have to independently go through the pain of converting it. Failing this, it's not possible to integrate the code in a modern codebase. Thanks.

The actor died unexpectedly before finishing this task.

Hi, I run "CUDA_VISIBLE_DEVICES=0 bash scripts/search.sh rsvhn", but reporting the error:
Traceback (most recent call last):
  File "/home/xiaopang/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 446, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/xiaopang/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 316, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/xiaopang/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

python3 and cPickle

File "pba/autoaugment/data_utils.py", line 185, in unpickle File "pba/autoaugment/data_utils.py", line 185, in unpickle
d = cPickle.load(fo)
NameError: name 'cPickle' is not defined

d = cPickle.load(fo)

NameError: name 'cPickle' is not defined
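A hedged sketch of the usual Python 3 fix, assuming the import at the top of pba/autoaugment/data_utils.py is what needs to change:

    try:
        import cPickle                # Python 2
    except ImportError:
        import pickle as cPickle      # Python 3: cPickle was folded into pickle

    # then, as in data_utils.py's unpickle():
    # with open(path, 'rb') as fo:
    #     d = cPickle.load(fo)        # on Python 3 you may also need
    #                                 # pickle.load(fo, encoding='latin1')
    #                                 # for pickles written by Python 2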

Note Python version requirement in Getting Started

The readme does currently note the Python 2 requirement, but it's in a section that users are less likely to read than the Getting Started section. I request also noting the Python 2.7 requirement in the Getting Started section. Thanks.

Tensorflow version issue?

Hi, I tried to search for a schedule on my datasets with ResNet-20.

However, I got one error related to a Namespace issue and one related to a TensorFlow issue.

First, in setup.py line 180, I think the code adds resnet_size to FLAGS and passes it to num_filters, but it doesn't work. When I change 'FLAGS.resnet_size' to just 20, it works.

Second, I use tensorflow-gpu 1.4.1. In resnet.py line 483, tf.reduce_mean fails with a wrong-parameter error. It works when I change keepdims -> keep_dims.
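A hedged sketch of a small compatibility wrapper for the keepdims/keep_dims rename (older TF 1.x releases such as 1.4 only accept keep_dims; roughly TF 1.5 and later accept keepdims); this is a workaround idea, not the repo's own code:

    import tensorflow as tf

    def reduce_mean_keepdims(x, axis):
        """tf.reduce_mean with whichever keep-dims argument this TF version accepts."""
        try:
            return tf.reduce_mean(x, axis, keepdims=True)    # newer TF 1.x
        except TypeError:
            return tf.reduce_mean(x, axis, keep_dims=True)   # older TF, e.g. 1.4.1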

Thanks

Resume pba

Is there any way to pause/resume the PBA search? I know that the checkpoints are saved, but how could we resume from the command line, for cases when the search runs for many days and blackouts may happen?

colab crashes when using a big dataset.

Hello,

I'm attempting to use this library with a dataset of 3,700 high-definition images, but Google Colab seems to crash. Is there a limit on how much data the library can process?
I tried again with 37 images and it worked, so I assume it's a workload issue.
