Comments (10)
Could you also post the network-building code, at least roughly?
@sandyhouse I mainly want to use the three functions below to give each dataset its own classification layer, so that a single network can be trained jointly on multiple datasets.

- `build_program_multi_branch` is written with reference to the `build_program` function in `plsc/entry.py`. Its main job is to split the `emb` features at fixed offsets: for example, if webface and vggface2 each use a batch size of 10 during training, then after the split the first 10 embeddings are features of webface images and the last 10 are features of vggface2 images. A separate classification loss is then computed for webface and for vggface2.
- `minimize_multi_branches` is newly defined in `paddle/fluid/incubate/fleet/collective/__init__.py`, adapted from the `minimize` function in that file, so that it can take multiple classification losses and compute the gradient of each, which are then aggregated (the aggregation is not implemented yet).
- `compute_gradient_multi_branches` is adapted from the `minimize` function in `plsc/models/dist_algo.py`, so that it can compute gradients for each classification loss separately.
```python
def build_program_multi_branch(self,
                               is_train=True,
                               use_parallel_test=False,
                               dist_strategy=None):
    # ... (omitted: same as build_program) ...
    emb = model.build_network(input=image, label=label, is_train=True)
    # split the batch into per-dataset embeddings and labels
    emb_split = fluid.layers.split(emb, batch_size_multi_branch, dim=0)
    label_split = fluid.layers.split(label, batch_size_multi_branch, dim=0)
    loss_split = []
    name_split = []
    for ind in range(len(batch_size_multi_branch)):
        if self.loss_type == "dist_arcface":
            avg_loss = dist_algo.distributed_arcface_classify(
                emb_split[ind], label_split[ind],
                int(self.datasets_info[ind][-2]), num_trainers, trainer_id,
                self.margin, self.scale, self.param_attr,
                self.datasets_info[ind][0])
            loss_split.append(avg_loss)
            name_split.append(self.datasets_info[ind][0])
    optimizer = None
    if is_train:
        # initialize optimizer
        optimizer = self._get_optimizer()
        if self.num_trainers > 1:
            dist_optimizer = fleet.distributed_optimizer(
                optimizer, strategy=dist_strategy)
            # dist_optimizer.minimize(loss_split[ind], self.datasets_info[ind][0])
            dist_optimizer.minimize_multi_branches(loss_split, name_split)
```
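As a sanity check on the split-by-batch-size idea described above, here is a tiny standalone sketch (mine, not from the thread; it assumes the fluid 1.x static-graph API and a fixed total batch of 20): rows 0-9 of `emb` would belong to webface and rows 10-19 to vggface2.

```python
import numpy as np
import paddle.fluid as fluid

# Fixed input shape so the split sections [10, 10] are checkable at compile time.
emb = fluid.layers.data(name='emb', shape=[20, 512], dtype='float32',
                        append_batch_size=False)
parts = fluid.layers.split(emb, [10, 10], dim=0)  # [webface_emb, vggface2_emb]

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
out = exe.run(feed={'emb': np.random.rand(20, 512).astype('float32')},
              fetch_list=[p.name for p in parts])
print(out[0].shape, out[1].shape)  # (10, 512) (10, 512)
```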
```python
def minimize_multi_branches(self,
                            losses,
                            names,
                            startup_program=None,
                            parameter_list=None,
                            no_grad_set=None):
    for ind, name in enumerate(names):
        loss = losses[ind]
        main_program = loss.block.program
        if startup_program is None:
            startup_program = fluid.default_startup_program()
        fleet.startup_program = startup_program
        self._loss = loss
        self._check_collective_mode(main_program, self._optimizer,
                                    self._strategy)
        param_grads = self._optimizer.compute_gradient_multi_branches(
            loss,
            name,
            startup_program=startup_program,
            parameter_list=parameter_list,
            no_grad_set=no_grad_set)
    fleet._origin_program = main_program.clone(for_test=False)
    fleet._transpiled_program = main_program
    fleet.main_program = self._try_to_compile(startup_program, main_program)
```
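The aggregation step described above as unimplemented might look roughly like the following. This is only my sketch, assuming the shared backbone parameters appear in several of the per-loss `params_grads` lists (the names `merge_param_grads`, `pg_webface`, `pg_vggface2` are hypothetical): merge the (param, grad) pairs, summing gradients that target the same parameter, before handing them to `apply_gradients`.

```python
import paddle.fluid as fluid

def merge_param_grads(all_params_grads):
    """Merge per-loss (param, grad) lists; gradients of a parameter shared
    across losses (the backbone) are summed element-wise."""
    merged = {}
    for params_grads in all_params_grads:
        for param, grad in params_grads:
            if param.name in merged:
                _, prev_grad = merged[param.name]
                merged[param.name] = (
                    param, fluid.layers.elementwise_add(prev_grad, grad))
            else:
                merged[param.name] = (param, grad)
    return list(merged.values())

# hypothetical usage:
# self._optimizer.apply_gradients(merge_param_grads([pg_webface, pg_vggface2]))
```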
```python
def compute_gradient_multi_branches(self,
                                    loss,
                                    dataset_name,
                                    startup_program=None,
                                    parameter_list=None,
                                    no_grad_set=None,
                                    callbacks=None):
    assert loss._get_info('shard_logit_{}'.format(dataset_name))

    shard_logit = loss._get_info('shard_logit_{}'.format(dataset_name))
    shard_prob = loss._get_info('shard_prob_{}'.format(dataset_name))
    shard_label = loss._get_info('shard_label_{}'.format(dataset_name))
    shard_dim = loss._get_info('shard_dim_{}'.format(dataset_name))

    op_maker = fluid.core.op_proto_and_checker_maker
    op_role_key = op_maker.kOpRoleAttrName()
    op_role_var_key = op_maker.kOpRoleVarAttrName()
    backward_role = int(op_maker.OpRole.Backward)
    loss_backward_role = int(op_maker.OpRole.Loss) | int(
        op_maker.OpRole.Backward)

    # minimize a scalar of reduce_sum to generate the backward network
    scalar = fluid.layers.reduce_sum(shard_logit)
    block = loss.block
    if not self._use_fp16:
        # ret = self._optimizer.minimize(scalar)
        params_grads = self._optimizer.backward(scalar)
        print(loss, scalar, dataset_name)
        # remove the unnecessary ops
        index = 0
        """
        for i, op in enumerate(block.ops):
            # print(i, op)
            if op.all_attrs()[op_role_key] == loss_backward_role:
                index = i
                break
        """
        for i, op in enumerate(block.ops):
            print(i, dataset_name, block.ops[i])
        """
        assert block.ops[index - 1].type == 'reduce_sum'
        assert block.ops[index].type == 'fill_constant'
        assert block.ops[index + 1].type == 'reduce_sum_grad'
        block._remove_op(index + 1)
        block._remove_op(index)
        block._remove_op(index - 1)
        self.insert_commom_backward_op(block, index, shard_logit, shard_prob,
                                       shard_label, shard_dim, op_role_key,
                                       backward_role, loss_backward_role)
        """
        return params_grads
```
The reason for the extra ops: calling backward for the first loss automatically inserts backward ops into the program, yielding what we can call program1; calling backward for the second loss then inserts backward ops on top of program1, which is why so many extra ops appear. Let me think about how this scenario can be supported.
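This effect is easy to reproduce in isolation. A minimal sketch (mine, assuming the fluid 1.x static-graph API) that prints the op count of the default program after each `append_backward` call:

```python
import paddle.fluid as fluid

x = fluid.layers.data(name='x', shape=[8], dtype='float32')
fc = fluid.layers.fc(input=x, size=4)
loss1 = fluid.layers.reduce_sum(fc)
loss2 = fluid.layers.reduce_mean(fc)

block = fluid.default_main_program().global_block()
print(len(block.ops))                  # forward ops only

fluid.backward.append_backward(loss1)  # program becomes "program1"
print(len(block.ops))                  # + backward ops for loss1

fluid.backward.append_backward(loss2)  # inserted on top of program1
print(len(block.ops))                  # many extra ops, as described above
```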
OK, thanks. Many of my application scenarios require joint training on multiple data sources. Your federated-learning project PaddleFL also trains different tasks on different data sources.
A rough idea: build a separate program for each dataset (combining `with program_guard` and `unique_name`), so that during the backward pass each loss only modifies its own program.
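My reading of this suggestion, as an untested sketch (the `build_branch` helper and all parameter names are mine): each dataset gets its own main/startup program built under `program_guard`, `unique_name.guard()` restarts auto-naming inside each program, and explicit `ParamAttr` names pin the backbone parameters so both programs refer to the same shared weights; only the classification heads get per-dataset parameter names.

```python
import paddle.fluid as fluid

def build_branch(dataset_name, num_classes):
    main_prog, startup_prog = fluid.Program(), fluid.Program()
    with fluid.program_guard(main_prog, startup_prog):
        with fluid.unique_name.guard():  # restart auto-naming per program
            image = fluid.layers.data(name='image', shape=[512], dtype='float32')
            label = fluid.layers.data(name='label', shape=[1], dtype='int64')
            # Backbone: identical ParamAttr names => one shared parameter
            # in the scope, updated by every branch.
            emb = fluid.layers.fc(input=image, size=128,
                                  param_attr=fluid.ParamAttr(name='backbone_w'),
                                  bias_attr=fluid.ParamAttr(name='backbone_b'))
            # Head: per-dataset names => separate classification layers.
            logits = fluid.layers.fc(
                input=emb, size=num_classes,
                param_attr=fluid.ParamAttr(name='%s_head_w' % dataset_name),
                bias_attr=fluid.ParamAttr(name='%s_head_b' % dataset_name))
            loss = fluid.layers.mean(
                fluid.layers.softmax_with_cross_entropy(logits, label))
            # backward ops are appended to this branch's program only
            fluid.optimizer.SGD(learning_rate=0.1).minimize(loss)
    return main_prog, startup_prog, loss
```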
@sandyhouse OK, I'll give it a try first and come back with questions.
Has your problem been solved? @gobigrassland
@sandyhouse Two approaches so far:
1. Keep all branches in a single graph, and in dist_algo.py handle the extra ops introduced by the multiple classification layers the way the original code does: remove them first and then re-insert them. This runs; I am currently verifying whether the metric values are normal.
2. Write it as a multi-task setup following your PALM project; still in progress.
Do you have any better solution?
> A rough idea: build a separate program for each dataset (combining `with program_guard` and `unique_name`), so that during the backward pass each loss only modifies its own program.

For now this still looks like the simpler option, but I am not sure whether other problems would come up in the implementation.
I did look at the `program_guard` and `unique_name` APIs at first, but probably because my understanding of them falls short, I still haven't managed to implement what you described. I'll keep trying this approach. If it's convenient for you, could you write me a simple example to refer to?
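The thread does not include such an example, so here is my own rough sketch of how the `build_branch` programs from the earlier snippet could be trained alternately with one executor and the default shared scope (random data, batch size 10 per dataset, dataset names and class counts made up):

```python
import numpy as np
import paddle.fluid as fluid

prog_a, startup_a, loss_a = build_branch('webface', num_classes=100)
prog_b, startup_b, loss_b = build_branch('vggface2', num_classes=200)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(startup_a)
exe.run(startup_b)  # re-inits the shared backbone too; harmless before training

for step in range(10):
    # alternate batches; backbone params are shared via identical names,
    # each head is updated only by its own program
    for prog, loss, n_cls in [(prog_a, loss_a, 100), (prog_b, loss_b, 200)]:
        feed = {'image': np.random.rand(10, 512).astype('float32'),
                'label': np.random.randint(0, n_cls, size=(10, 1), dtype='int64')}
        loss_val, = exe.run(prog, feed=feed, fetch_list=[loss.name])
```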