Comments (12)
Instances could keep state, so reusing instances could be dangerous
because state can mutate between different Combiners. Three options:
- Keep it as it is now
- Ignore the issue and assume that instances can be reused without danger
- Keep a cache with the binary information and deserialize from memory for
each new Combiner (see the sketch below)
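Just to make option 3 concrete, it could be roughly something like this (class and method names are made up, and plain Java serialization is used only for the example):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.concurrent.ConcurrentHashMap;

public class BinaryInstanceCache {

  // instance file name -> serialized bytes, shared by every task in this JVM
  private static final ConcurrentHashMap<String, byte[]> CACHE =
      new ConcurrentHashMap<String, byte[]>();

  @SuppressWarnings("unchecked")
  public static <T> T newInstance(String fileName, byte[] serialized)
      throws IOException, ClassNotFoundException {
    byte[] bytes = CACHE.get(fileName);
    if (bytes == null) {
      CACHE.put(fileName, serialized);
      bytes = serialized;
    }
    // Deserializing from the in-memory bytes gives every Combiner its own copy,
    // so no mutable state is ever shared between Combiners.
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
    try {
      return (T) in.readObject();
    } finally {
      in.close();
    }
  }
}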
When does the case happen where there are so many Combiners in the same JVM
that the instance SerDe cost is not negligible?
I don't understand point 3). From what I understand, the getLocalCache
methods should always be used.
Looking at the documentation, the methods purgeCache() (
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html#purgeCache(org.apache.hadoop.conf.Configuration))
and releaseCache() (
http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html#releaseCache(java.net.URI, org.apache.hadoop.conf.Configuration))
seem relevant for this case.
Probably we are not releasing the cached files properly after a job
execution.
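For example, something along these lines after the job finishes could do it (just a sketch against the 0.20 API linked above):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheCleanup {

  // Release what was registered in the DistributedCache for this configuration once
  // the job is done, then drop every cached entry for the configuration.
  public static void releaseAll(Configuration conf) throws IOException {
    URI[] cachedFiles = DistributedCache.getCacheFiles(conf);
    if (cachedFiles != null) {
      for (URI cached : cachedFiles) {
        DistributedCache.releaseCache(cached, conf);
      }
    }
    DistributedCache.purgeCache(conf);
  }
}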
2012/12/25 Pere Ferrera [email protected]
Pangool serializes everything (InputFormat, Combiner, Mapper, Reducer,
OutputFormat...) so that we can use instances instead of classes. We try to
use the DistributedCache when possible (it is not possible for InputFormat
and OutputFormat). The current approach works fine; however, instance files
may accumulate in "hadoop.tmp.dir" and, if not deleted, the cost of
loading one increases. There are a few things that must be revised:
- Deleting instance files after Job finalization, if possible.
- DCUtils: if(path.toString().endsWith(filePostFix)) -> It doesn't look like
we need a postfix check here. Getting the file directly should be possible,
so the cost would be constant (we don't need to iterate over all instance files).
- Using the local files from the DistributedCache when possible, through a
boolean parameter, so that InputFormat and OutputFormat pass "false" and
Combiner, Mapper and Reducer pass "true".
Currently the main bottleneck happens in the Combiner, if there are a lot
of files. The cost of loading the instance for everything else is
negligible. Which makes me think of another idea:
- We could implement a JVM-local cache for Combiner instances. A
static HashMap, for instance. Because Combiners are executed in the same
JVM as Mappers, caching the object instances would allow us to avoid
looking up the instance cache for every Combiner call. Does it make sense?
In any case, the other things should be done as well.
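The JVM-local cache described above could be as simple as a static map keyed by instance file name (a rough sketch, names are not actual Pangool code; note that it shares the deserialized instance, which is exactly the state concern from the options above):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CombinerInstanceCache {

  /** Whatever actually reads and deserializes the instance file; hypothetical. */
  public interface Loader {
    Object load(String instanceFile) throws IOException;
  }

  // instance file name -> deserialized instance, shared by Mapper and Combiner
  // tasks running in the same JVM
  private static final Map<String, Object> INSTANCES = new HashMap<String, Object>();

  public static synchronized Object get(String instanceFile, Loader loader) throws IOException {
    Object instance = INSTANCES.get(instanceFile);
    if (instance == null) {
      instance = loader.load(instanceFile);
      INSTANCES.put(instanceFile, instance);
    }
    return instance;
  }
}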
Iván de Prado
CEO & Co-founder
www.datasalt.com
from pangool.
After looking at the DCUtils class and the calls to it, I've come to the conclusion that the DistributedCache is not used for anything in the code. The method DCUtils.serializeToDC() calls the method DistributedCache.addCacheFile()... but no part of the code uses the DC cached version at all.
Instead, it seems that Pangool always loads instances directly from HDFS (see the method DCUtils.locateFileInDC).
If I remember correctly, we are not using the DC because files are not always present at the proper moment (InputFormat and OutputFormat instantiation?).
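Just to illustrate what using the DC would mean in the task code, something like this could look the file up in the node-local copies instead of reading it from HDFS again (a sketch; the file-name matching is an assumption):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class LocalInstanceLookup {

  // Look the instance file up among the node-local copies that the DistributedCache
  // already downloaded, instead of opening it on HDFS again.
  public static Path locateLocalFile(Configuration conf, String instanceFileName)
      throws IOException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    if (localFiles != null) {
      for (Path local : localFiles) {
        if (local.getName().equals(instanceFileName)) {
          return local; // already on local disk, no HDFS read needed
        }
      }
    }
    return null; // not cached yet: fall back to reading from HDFS (current behaviour)
  }
}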
Therefore, the problems are:
- Not using the DC when possible, which is not as efficient as it could be: many file downloads per node.
- Lack of a system for cleaning up the Pangool instances cache properly.
The difficult part to solve here is the cache cleanup. This cache is located in HDFS and keeps the instances needed for each job. Instances are identified by a generated UUID plus some fixed prefix. The difficulty is knowing when an instance is not needed anymore:
- The name tells nothing about which Job owns the instance, or whether that job has finished.
- The timestamp is not enough to know that.
A simple idea would be to have a configurable time to live for each instance, but that is far from perfect.
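For what it's worth, such a TTL sweep could look roughly like this (the property name and the cache directory are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InstanceCacheSweeper {

  // Delete every instance file older than a configurable time-to-live.
  public static void sweep(Configuration conf, Path instanceCacheDir) throws IOException {
    long ttlMs = conf.getLong("pangool.instances.ttl.ms", 24L * 60 * 60 * 1000);
    long now = System.currentTimeMillis();
    FileSystem fs = instanceCacheDir.getFileSystem(conf);
    FileStatus[] files = fs.listStatus(instanceCacheDir);
    if (files == null) {
      return;
    }
    for (FileStatus file : files) {
      if (now - file.getModificationTime() > ttlMs) {
        fs.delete(file.getPath(), false);
      }
    }
  }
}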
Another possibility would be to associate instances with JobIDs. That way, for every instance we could check whether the job it belongs to is still alive. But the JobID is not available until job.submit() is called.
Another possibility is to delete the files after the job has finished (successfully or not). At that point you know all the instance file names, so you can remove them. Although this sounds good, it is an unsafe procedure: the process that launched the job could be killed or could fail, and in that case the files wouldn't be deleted. It is also not very elegant, as it would require a call to a method by the user. Something like:
Job job = MRBuilder.build();
try {
  job.submit();
} finally {
  // Always clean up the instance files, even if submit() fails
  MRBuilder.cleanJob(job);
}
Any smart ideas?
(In conclusion, the three points exposed by Pere remain relevant.)
from pangool.
I made some minor changes:
- Renamed the class DCUtils to InstancesDistributor, updating comments.
- Removed the postfix check (point 2 of the initial proposal).
from pangool.
A possible solution would be to not give a Job instance to the user, but to wrap it in a Pangool class that handles file cleaning underneath.
This new class should provide the same functionality that Job allows, such as: running the Job in verbose mode or not, blocking / non-blocking run, isComplete(), isSuccessful().
This change would not be backwards compatible, but we can deprecate createJob() and remove it in future versions.
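A rough sketch of what I mean (names are made up, not the final API):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PangoolJob {

  /** Hypothetical helper that knows which instance files the builder created. */
  public interface InstanceFilesCleaner {
    void deleteInstanceFiles(Configuration conf) throws IOException;
  }

  private final Job job;
  private final InstanceFilesCleaner cleaner;

  public PangoolJob(Job job, InstanceFilesCleaner cleaner) {
    this.job = job;
    this.cleaner = cleaner;
  }

  /** Blocking run; instance files are deleted whether the job succeeds or fails. */
  public boolean waitForCompletion(boolean verbose) throws Exception {
    try {
      return job.waitForCompletion(verbose);
    } finally {
      cleaner.deleteInstanceFiles(job.getConfiguration());
    }
  }

  /** Non-blocking run; the caller has to poll and call cleanUp() afterwards. */
  public void submit() throws Exception {
    job.submit();
  }

  public boolean isComplete() throws IOException { return job.isComplete(); }

  public boolean isSuccessful() throws IOException { return job.isSuccessful(); }

  public void cleanUp() throws IOException {
    cleaner.deleteInstanceFiles(job.getConfiguration());
  }
}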
from pangool.
That seems the most reasonable solution, but it is still not safe. It seems
that it is not possible to solve this problem properly without changing the
Hadoop codebase.
Iván de Prado
CEO & Co-founder
www.datasalt.com
from pangool.
Two things:
1 - I think we should maintain createJob() in any case. There are multiple
APIs that deal directly with Job instead of Configuration, and they would
break if we encapsulated this. For example: AvroJob.setOutputKeySchema and
AvroJob.setOutputValueSchema. Avro also includes its own multiple outputs,
called AvroMultipleOutputs, which deal directly with Job. Besides, any
user-custom library that implements Job execution retries accepting a Job
would break.
2 - About the cleaning, why not just create a couple of static methods that
deal with Job and, after completion (or error), clean up the garbage?
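Something along these lines, I mean (names are just an example):

import java.io.IOException;
import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class InstanceFilesCleanup {

  // Run the Job and always clean the instance files afterwards, success or failure.
  public static boolean runAndClean(Job job, boolean verbose, Collection<Path> instanceFiles)
      throws Exception {
    try {
      return job.waitForCompletion(verbose);
    } finally {
      clean(job.getConfiguration(), instanceFiles);
    }
  }

  public static void clean(Configuration conf, Collection<Path> instanceFiles) throws IOException {
    for (Path file : instanceFiles) {
      file.getFileSystem(conf).delete(file, false);
    }
  }
}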
Eric.
from pangool.
I see the point in Eric's concerns.
So I think for now the best solution is to implement the static method and make its usefulness clear by highlighting it in the documentation.
Does everyone agree?
from pangool.
I agree with Eric. I would add some static methods for Job cleanup, and
would keep an eye on this issue in the future to try to find better
solutions.
Iván
from pangool.
And the instance files should have a JobID, since jobs can be executed in parallel and only the files from a certain JobID should be cleaned.
I can take care of this issue.
from pangool.
+1
Iván de Prado
CEO & Co-founder
www.datasalt.com
from pangool.
Instance files cannot have the JobID, as it is not available until the job
is executed. The UUID trick is there to solve the issue with parallel
jobs.
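The naming scheme is roughly this (the prefix here is just an example):

import java.util.UUID;

public class InstanceFileNames {

  // A random UUID makes the file name unique even when several jobs are being
  // built in parallel, since no JobID exists yet at build time.
  public static String newInstanceFileName(String instanceDescription) {
    return "pangool-instance-" + instanceDescription + "-" + UUID.randomUUID().toString();
  }
}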
Iván de Prado
CEO & Co-founder
www.datasalt.com
from pangool.
I have added a non-static method to TupleMRBuilder and MapOnlyJobBuilder that deletes the files generated by the builders. The builders keep track of all the files that are generated for each Job, so there are no concurrency issues.
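Usage then looks roughly like this (the cleanup method name below is illustrative; check the builder's Javadoc for the actual one):

TupleMRBuilder builder = new TupleMRBuilder(conf);
// ... configure intermediate schema, inputs, group-by fields, reducer, output ...
Job job = builder.createJob();
try {
  job.waitForCompletion(true);
} finally {
  // Deletes the instance files that this builder uploaded for this Job
  builder.cleanUpInstanceFiles(); // illustrative name
}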
from pangool.