
Comments (12)

ivanprado commented on September 1, 2024

Instances can keep state, so reusing instances could be dangerous
because the state can mutate between different Combiners. Three options:

  • Keep it as it is now
  • Ignore that and assume that instances can be reused without danger
  • Keep a cache with the binary information and deserialize it from memory for
    each new Combiner (see the sketch below)
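
A minimal sketch of the third option, assuming a hypothetical InstanceSerializer standing in for Pangool's instance (de)serialization: the serialized bytes are cached once per JVM, but every Combiner gets a freshly deserialized instance, so no mutable state is shared between them.

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;

public class CombinerInstanceCache {

  // Hypothetical interface standing in for Pangool's instance (de)serialization.
  public interface InstanceSerializer {
    byte[] readBytes(String instanceFile) throws IOException;
    Object deserialize(byte[] bytes) throws IOException;
  }

  // One entry per instance file, shared by all Combiners in this JVM.
  private static final ConcurrentHashMap<String, byte[]> BYTES =
      new ConcurrentHashMap<String, byte[]>();

  public static Object newInstance(String instanceFile, InstanceSerializer serDe)
      throws IOException {
    byte[] raw = BYTES.get(instanceFile);
    if (raw == null) {
      raw = serDe.readBytes(instanceFile); // read from disk/HDFS only once per JVM
      BYTES.putIfAbsent(instanceFile, raw);
    }
    return serDe.deserialize(raw); // fresh instance for every Combiner
  }
}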

When does the case happen where there are so many Combiners in the same JVM
that the instance SerDe cost is not negligible?

I don't understand point 3). As I understand it, the getLocalCache methods
should always be used.

Looking at the documentation, the methods purgeCache() (
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html#purgeCache(org.apache.hadoop.conf.Configuration))
and releaseCache() (
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html#releaseCache(java.net.URI,
org.apache.hadoop.conf.Configuration)) seem relevant for this case.
Probably we are not releasing the cached files properly after a job
execution.
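
Illustrative only, using the 0.20 signatures from the links above: after a job execution, whatever was registered with DistributedCache.addCacheFile() could be released or purged. Here instanceFileUris is a hypothetical list of the URIs that were added for the job.

// Release each cached instance file once the job is done...
for (URI uri : instanceFileUris) {
  DistributedCache.releaseCache(uri, conf);
}
// ...or, more aggressively, drop everything that is no longer referenced:
DistributedCache.purgeCache(conf);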

2012/12/25 Pere Ferrera [email protected]

Pangool serializes everything (InputFormat, Combiner, Mapper, Reducer,
OutputFormat...) so that we can use instances instead of classes. We try to
use the DistributedCache when possible (it is not possible for InputFormat
and OutputFormat). The current approach works fine; however, instance files
may accumulate in "hadoop.tmp.dir" and, if not deleted, the cost of
loading one increases. There are three things that must be revised:

  1. Deleting instance files after Job finalization, if possible.
  2. DCUtils: if(path.toString().endsWith(filePostFix)) -> It doesn't look like
    we need to do postfix checking here. Getting the file directly should be
    possible, so the cost would be constant (we don't need to iterate over
    all instance files). See the sketch after this list.
  3. Using the local files from the DistributedCache when possible, through a
    boolean parameter, so that InputFormat and OutputFormat pass "false" and
    Combiner, Mapper and Reducer pass "true".
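
A rough sketch of points 2 and 3 combined (the names and the fallback location are assumptions, not the actual DCUtils code): look the file up by its exact name in the task's local DistributedCache when asked to, and otherwise read it straight from the job's temporary directory on HDFS.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class InstanceLocator {

  // preferLocalCache would be "true" for Combiner/Mapper/Reducer and "false"
  // for InputFormat/OutputFormat, which run before the local cache is available.
  public static Path locateInstanceFile(Configuration conf, String fileName,
      boolean preferLocalCache) throws IOException {
    if (preferLocalCache) {
      Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
      if (localFiles != null) {
        for (Path localFile : localFiles) {
          if (localFile.getName().equals(fileName)) { // exact name, no postfix scan
            return localFile;
          }
        }
      }
    }
    // Fallback: read the instance file directly from the job's tmp dir on HDFS.
    return new Path(conf.get("hadoop.tmp.dir"), fileName);
  }
}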

Currently the main bottleneck happens in the Combiner, if there are a lot
of files. The cost of loading the instance for everything else is
negligible. Which makes me think of another idea:

  • We could implement a JVM-local cache for Combiner instances: a
    static HashMap, for instance. Because Combiners are executed in the same
    JVM as Mappers, caching the object instances would allow us to avoid
    looking up the instance cache for every Combiner call. Does it make sense?

In any case, the other things should be done as well.



ivanprado commented on September 1, 2024

After looking at the DCUtils class and the calls to it, I have come to the conclusion that the DistributedCache is not actually used anywhere in the code. The method DCUtils.serializeToDC() calls DistributedCache.addCacheFile()... but no part of the code uses the DC cached version at all.

Instead, it seems that Pangool always loads instances directly from HDFS (see the method DCUtils.locateFileInDC).

If I remember correctly, we are not using the DC because the files are not always present at the proper moment (InputFormat and OutputFormat instantiation?).

Therefore, the problems are:

  • Not using the DC when possible, which is not as efficient as it could be: many file downloads per node.
  • The lack of a system for cleaning up the Pangool instance cache properly.

The difficult part to solve here is the cache cleanup. This cache is located in HDFS and keeps the instances needed for each job. Instances are identified by a generated UUID plus some fixed prefix. The difficulty is knowing whether an instance is no longer needed:

  • The name tells nothing about which Job owns the instance, nor whether that job has finished.
  • The timestamp is not enough to know that either.

A simple idea would be to have a configurable time to live for each instance, but that is far from perfect.
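
A minimal sketch of that TTL idea (an assumption, not something Pangool implements): periodically delete instance files in the cache directory whose modification time is older than the configured TTL.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InstanceCacheJanitor {

  // Deletes every file in cacheDir not modified within the last ttlMillis.
  // The obvious flaw: it can delete files still needed by a long-running job.
  public static void purgeOlderThan(Configuration conf, Path cacheDir, long ttlMillis)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long now = System.currentTimeMillis();
    for (FileStatus status : fs.listStatus(cacheDir)) {
      if (now - status.getModificationTime() > ttlMillis) {
        fs.delete(status.getPath(), false);
      }
    }
  }
}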

Another possibility would be to associate instances with JobIDs. That way, for every instance we could check whether the job it belongs to is still alive. But the JobID is not available until job.submit() is called.

Another possibility is to delete the files after the job has finished (successfully or not). At that point you know all the instance filenames, so you can remove them. Although this sounds good, it is an unsafe procedure: the process that launched the job could be killed or could fail, and in that case the files wouldn't be deleted. It is also not very elegant, as it would require the user to call an extra method. Something like:

job = MRBuilder.build();
try {
   job.submit();
} finally {
   MRBuilder.cleanJob(job);
}

Any smart ideas?

(In conclusion, the three points raised by Pere remain relevant.)


ivanprado commented on September 1, 2024

I made some minor changes:

  1. Renamed the class DCUtils to InstancesDistributor and updated the comments.
  2. Removed the postfix check (point 2 of the initial proposal).


pereferrera commented on September 1, 2024

A possible solution would be to not give a Job instance to the user, but to wrap it in a Pangool class that handles the file cleaning underneath.

This new class should allow for the same functionality that Job allows, such as: running the Job in verbose mode or not, blocking / non-blocking runs, isComplete(), isSuccessful().

This change would not be backwards compatible, but we can deprecate createJob() and remove it in future versions.
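
A minimal sketch of such a wrapper, with assumed names (PangoolJob, cleanInstanceFiles) rather than a final API: it delegates to the wrapped Job and deletes the instance files serialized for that job once the run finishes.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class PangoolJob {

  private final Job job;
  private final List<Path> instanceFiles; // instance files serialized for this job

  public PangoolJob(Job job, List<Path> instanceFiles) {
    this.job = job;
    this.instanceFiles = instanceFiles;
  }

  // Blocking run, verbose or not; cleanup is best effort and is skipped
  // if this JVM dies before the finally block runs.
  public boolean waitForCompletion(boolean verbose)
      throws IOException, InterruptedException, ClassNotFoundException {
    try {
      return job.waitForCompletion(verbose);
    } finally {
      cleanInstanceFiles();
    }
  }

  public boolean isComplete() throws IOException { return job.isComplete(); }

  public boolean isSuccessful() throws IOException { return job.isSuccessful(); }

  private void cleanInstanceFiles() throws IOException {
    FileSystem fs = FileSystem.get(job.getConfiguration());
    for (Path instanceFile : instanceFiles) {
      fs.delete(instanceFile, false);
    }
  }
}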


ivanprado commented on September 1, 2024

That seems the most reasonable solution, but it is still not safe. It seems
that it is not possible to solve this problem properly without changing the
Hadoop codebase.


epalace commented on September 1, 2024

Two things:

1 - I think we should maintain createJob in any case. There are multiple
APIs that deal directly with Job instead of Configuration, and they would
break if we encapsulated it. For example: AvroJob.setOutputKeySchema and
AvroJob.setOutputValueSchema. Avro also includes its own multiple outputs,
called AvroMultipleOutputs, which deals directly with Job. Besides, any
user-custom library that implements Job execution retries and accepts a Job
would break.
2 - About the cleaning, why not just create a couple of static methods that
take the Job and, after completion (or error), clean up the garbage?

Eric.


pereferrera commented on September 1, 2024

I see the point in Eric's concerns.

So I think for now the best solution is to implement the static method and make its usefulness clear by highlighting it in the documentation.

Does everyone agree?


ivanprado commented on September 1, 2024

I agree with Eric. I would add some static methods for Job cleanup, and
would keep an eye on this issue in the future to try to find better
solutions.

Iván


pereferrera commented on September 1, 2024

And the instance files should have a JobID, since jobs can be executed in parallel and only the files from a certain JobID should be cleaned.

I can take care of this issue.


ivanprado commented on September 1, 2024

+1


ivanprado commented on September 1, 2024

Instance files cannot have the JobID, as it is not available until the job
is executed. The UUID trick is there to solve the issue with parallel jobs.
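
A tiny sketch of the naming scheme described above (the prefix shown is an assumption): each instance file gets a random UUID, so parallel jobs never collide on file names.

String instanceFileName = "pangool-instance-" + UUID.randomUUID().toString();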


pereferrera commented on September 1, 2024

I have added a non-static method to TupleMRBuilder and MapOnlyJobBuilder that deletes the files generated by the builders. The builders keep track of all the files generated for each Job, so there are no concurrency issues.
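
A usage sketch of that approach; the cleanup method name (cleanUpInstanceFiles) and the builder calls shown are assumptions about the API, not confirmed in this thread.

TupleMRBuilder builder = new TupleMRBuilder(conf, "my job");
// ... configure schemas, inputs, the TupleReducer and the output here ...
try {
  Job job = builder.createJob();
  job.waitForCompletion(true);
} finally {
  builder.cleanUpInstanceFiles(); // deletes only the files this builder serialized
}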

