
Comments (10)

sandyhouse commented on May 19, 2024

Could you also post the network-construction code, at least roughly?


gobigrassland commented on May 19, 2024

@sandyhouse The goal is to use the three functions below to give each dataset its own classification layer, so that one shared network can be trained jointly on multiple datasets:

  1. build_program_multi_branch is adapted from the build_program function in plsc/entry.py. Its main job is to split the emb features by fixed proportions: for example, if the webface and vggface2 datasets each use a batch size of 10 during training, then after the split the first 10 embeddings belong to webface images and the last 10 to vggface2 images. The classification loss is then computed separately for webface and for vggface2.

  2. minimize_multi_branches is a new function defined in paddle/fluid/incubate/fleet/collective/__init__.py, adapted from the minimize function in that file, so that it can accept multiple classification losses, compute gradients for each of them separately, and then aggregate them (not implemented yet).

  3. compute_gradient_multi_branches is adapted from the minimize function in plsc/models/dist_algo.py, so that it can compute gradients for each of the different classification losses.

def build_program_multi_branch(self,
                               is_train=True,
                               use_parallel_test=False,
                               dist_strategy=None):
    # ... omitted; identical to build_program up to this point
    emb = model.build_network(input=image, label=label, is_train=True)
    # split the concatenated batch back into per-dataset chunks, e.g.
    # [10, 10] for webface and vggface2
    emb_split = fluid.layers.split(emb, batch_size_multi_branch, dim=0)
    label_split = fluid.layers.split(label, batch_size_multi_branch, dim=0)
    loss_split = []
    name_split = []
    for ind in range(len(batch_size_multi_branch)):
        # only the dist_arcface loss type is handled here
        if self.loss_type == "dist_arcface":
            # one distributed arcface classification layer per dataset
            avg_loss = dist_algo.distributed_arcface_classify(
                emb_split[ind], label_split[ind],
                int(self.datasets_info[ind][-2]),
                num_trainers, trainer_id,
                self.margin, self.scale, self.param_attr,
                self.datasets_info[ind][0])
        loss_split.append(avg_loss)
        name_split.append(self.datasets_info[ind][0])

    optimizer = None
    if is_train:
        # initialize optimizer
        optimizer = self._get_optimizer()
        if self.num_trainers > 1:
            dist_optimizer = fleet.distributed_optimizer(
                optimizer, strategy=dist_strategy)
            #dist_optimizer.minimize(loss_split[ind], self.datasets_info[ind][0])
            dist_optimizer.minimize_multi_branches(loss_split, name_split)
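
For context: the split above uses fluid.layers.split with per-dataset section sizes. A minimal standalone sketch, where the [10, 10] sections and the 512-d embedding size are just illustrative:

import paddle.fluid as fluid

# a batch of 20 embeddings: the first 10 rows from webface,
# the last 10 from vggface2
emb = fluid.layers.data(name='emb', shape=[20, 512], dtype='float32',
                        append_batch_size=False)
# split along the batch dimension by the per-dataset batch sizes
webface_emb, vggface2_emb = fluid.layers.split(
    emb, num_or_sections=[10, 10], dim=0)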
def minimize_multi_branches(self,
                            losses,
                            names,
                            startup_program=None,
                            parameter_list=None,
                            no_grad_set=None):
    # compute gradients for each classification loss in turn
    for ind, name in enumerate(names):
        loss = losses[ind]
        main_program = loss.block.program
        if startup_program is None:
            startup_program = fluid.default_startup_program()
        fleet.startup_program = startup_program

        self._loss = loss

        self._check_collective_mode(main_program, self._optimizer,
                                    self._strategy)

        param_grads = self._optimizer.compute_gradient_multi_branches(
            loss,
            name,
            startup_program=startup_program,
            parameter_list=parameter_list,
            no_grad_set=no_grad_set)

        fleet._origin_program = main_program.clone(for_test=False)
        fleet._transpiled_program = main_program
        fleet.main_program = self._try_to_compile(startup_program,
                                                  main_program)
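
The aggregation step from point 2 above is still missing; one hypothetical shape for it is sketched below: sum the gradients of any parameter that appears in more than one branch (i.e. the shared backbone). aggregate_param_grads is a made-up helper, not a PLSC or Paddle API:

from collections import OrderedDict
import paddle.fluid as fluid

def aggregate_param_grads(param_grads_per_branch):
    # param_grads_per_branch: one [(param, grad), ...] list per loss,
    # as returned by compute_gradient_multi_branches for each branch
    merged = OrderedDict()
    for param_grads in param_grads_per_branch:
        for param, grad in param_grads:
            if param.name in merged:
                # shared (backbone) parameter: sum the per-branch gradients
                _, prev_grad = merged[param.name]
                grad = fluid.layers.elementwise_add(prev_grad, grad)
            merged[param.name] = (param, grad)
    return list(merged.values())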
def compute_gradient_multi_branches(self,
                                    loss,
                                    dataset_name,
                                    startup_program=None,
                                    parameter_list=None,
                                    no_grad_set=None,
                                    callbacks=None):
    assert loss._get_info('shard_logit_{}'.format(dataset_name))

    shard_logit = loss._get_info('shard_logit_{}'.format(dataset_name))
    shard_prob = loss._get_info('shard_prob_{}'.format(dataset_name))
    shard_label = loss._get_info('shard_label_{}'.format(dataset_name))
    shard_dim = loss._get_info('shard_dim_{}'.format(dataset_name))

    op_maker = fluid.core.op_proto_and_checker_maker
    op_role_key = op_maker.kOpRoleAttrName()
    op_role_var_key = op_maker.kOpRoleVarAttrName()
    backward_role = int(op_maker.OpRole.Backward)
    loss_backward_role = int(op_maker.OpRole.Loss) | int(
        op_maker.OpRole.Backward)

    # minimize a scalar of reduce_sum to generate the backward network
    scalar = fluid.layers.reduce_sum(shard_logit)
    block = loss.block

    if not self._use_fp16:
        #ret = self._optimizer.minimize(scalar)
        params_grads = self._optimizer.backward(scalar)
        print(loss, scalar, dataset_name)  # debug output

        # locate the unnecessary ops to remove (disabled for debugging)
        index = 0
        """
        for i, op in enumerate(block.ops):
            if op.all_attrs()[op_role_key] == loss_backward_role:
                index = i
                break
        """
        # debug: dump every op in the block for this dataset branch
        for i, op in enumerate(block.ops):
            print(i, dataset_name, op)

        # original single-branch logic: remove the reduce_sum forward and
        # backward ops, then insert the custom backward ops
        # (disabled for debugging)
        """
        assert block.ops[index - 1].type == 'reduce_sum'
        assert block.ops[index].type == 'fill_constant'
        assert block.ops[index + 1].type == 'reduce_sum_grad'
        block._remove_op(index + 1)
        block._remove_op(index)
        block._remove_op(index - 1)

        self.insert_commom_backward_op(block, index, shard_logit, shard_prob,
                                       shard_label, shard_dim, op_role_key,
                                       backward_role, loss_backward_role)
        """
        return params_grads


sandyhouse commented on May 19, 2024

The extra OPs come from the following: calling backward for the first loss automatically inserts backward OPs into the program, giving program1; calling backward for the second loss then inserts backward OPs on top of program1, which is why so many extra OPs appear. Let me think about how this scenario can be supported.
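
This is easy to reproduce in isolation. A minimal sketch with a toy network (the layer sizes are arbitrary), just to show the op count growing with each backward call:

import paddle.fluid as fluid

main = fluid.Program()
startup = fluid.Program()
with fluid.program_guard(main, startup):
    x = fluid.layers.data(name='x', shape=[4], dtype='float32')
    fc = fluid.layers.fc(input=x, size=2)
    loss1 = fluid.layers.reduce_sum(fc)
    loss2 = fluid.layers.reduce_mean(fc)
    n0 = len(main.global_block().ops)   # forward ops only
    fluid.backward.append_backward(loss1)
    n1 = len(main.global_block().ops)   # forward + backward ops for loss1
    fluid.backward.append_backward(loss2)
    n2 = len(main.global_block().ops)   # loss2's backward ops added on top
    print(n0, n1, n2)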


gobigrassland commented on May 19, 2024

OK, thanks. Many of my application scenarios require joint training on multiple data sources. Your federated learning project PaddleFL likewise trains different tasks on different data sources.


sandyhouse commented on May 19, 2024

One rough idea: build a separate program for each dataset (using with program_guard together with unique_name), so that the backward pass effectively only touches the program that owns each loss.
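
Roughly along these lines; a minimal sketch where backbone and classification_head are placeholders for whatever builds the shared network and the per-dataset classification layer:

import paddle.fluid as fluid

programs = {}
for dataset in ['webface', 'vggface2']:
    main_prog = fluid.Program()
    startup_prog = fluid.Program()
    with fluid.program_guard(main_prog, startup_prog):
        # unique_name.guard resets the name generator, so parameters
        # created by the shared backbone get identical names in every
        # program and can therefore share the same values
        with fluid.unique_name.guard():
            image = fluid.layers.data(name='image', shape=[3, 112, 112],
                                      dtype='float32')
            label = fluid.layers.data(name='label', shape=[1], dtype='int64')
            emb = backbone(image)  # placeholder: shared backbone
            loss = classification_head(emb, label, dataset)  # placeholder: per-dataset head
            optimizer = fluid.optimizer.SGD(learning_rate=0.1)
            # backward ops land only in this dataset's program
            optimizer.minimize(loss)
    programs[dataset] = (main_prog, startup_prog)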


gobigrassland commented on May 19, 2024

@sandyhouse OK, I'll give it a try first and come back to you if I have questions.


sandyhouse commented on May 19, 2024

Has your problem been solved? @gobigrassland


gobigrassland commented on May 19, 2024

@sandyhouse I have two approaches in mind:
1. Keep all branches in a single graph, and in dist_algo.py handle the extra ops introduced by the multiple classification layers the way the original code does: remove them first, then insert the replacements. This runs; I am currently verifying whether the metric values are correct.
2. Write it as a multi-task model following your PALM; still in progress.

Do you have any better solution?


sandyhouse commented on May 19, 2024

One rough idea: build a separate program for each dataset (using with program_guard together with unique_name), so that the backward pass effectively only touches the program that owns each loss.

For now this still looks like the simpler approach, though I am not sure whether other issues will come up in the implementation.


gobigrassland commented on May 19, 2024

I also looked at the program_guard and unique_name APIs at the start, but probably did not understand them well enough and still could not implement what you described. I will keep trying this approach. If it is convenient, could you write me a simple example to refer to?

