Comments (12)

lewismc commented on September 15, 2024

I can confirm this issue: I can reproduce it with other files in the Usergrid project, namely:

lmcgibbn@LMC-032857 /usr/local/incubator-usergrid(master) $ find . -name "test.html"
./sdks/html5-javascript/examples/persistence/test.html
./sdks/html5-javascript/examples/test/test.html
./sdks/html5-javascript/tests/test.html

These show up only once in the DRAT output:

  /usr/local/drat/deploy/data/jobs/rat/1407484852545/input/roles.html
  /usr/local/drat/deploy/data/jobs/rat/1407484852545/input/shell.html
  /usr/local/drat/deploy/data/jobs/rat/1407484852545/input/test.html
  /usr/local/drat/deploy/data/jobs/rat/1407484852545/input/users-activities.html
  /usr/local/drat/deploy/data/jobs/rat/1407484852545/input/users-feed.html
  /usr/local/drat/deploy/data/jobs/rat/1407484852545/input/users-graph.html

I am pretty sure that this is a bug in RAT and not DRAT... we most likely need to file it over there.
How about discussing it here first?

from drat.

chrismattmann commented on September 15, 2024

Sounds good, thanks Lewis. One thing to realize is that DRAT partitions jobs by MIME type and into different subdirectories of size 100 (configurable as well). That said, are you sure that the other copies of e.g., index.html or test.html aren't simply sitting in some other DRAT job directory for RAT? For example, you can do a File Manager query and see how many files come back for test.html.
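
To make the partitioning concrete, here is a minimal sketch of grouping files by MIME type and then splitting each group into fixed-size chunks. It is not DRAT's actual code: the function name `partition_by_mime`, the `(path, mime_type)` input shape, and the `chunk_size` parameter are assumptions made for illustration.

```python
from collections import defaultdict


def partition_by_mime(files, chunk_size=100):
    """Group (path, mime_type) pairs by type, then split each group into chunks.

    Hypothetical helper for illustration only; returns a list of
    (mime_type, [paths]) tuples, one per job.
    """
    by_type = defaultdict(list)
    for path, mime_type in files:
        by_type[mime_type].append(path)

    jobs = []
    for mime_type, paths in by_type.items():
        # Each slice of at most chunk_size paths becomes its own job.
        for start in range(0, len(paths), chunk_size):
            jobs.append((mime_type, paths[start:start + chunk_size]))
    return jobs


if __name__ == "__main__":
    files = [
        ("sdks/a/test.html", "html"),
        ("sdks/b/test.html", "html"),
        ("bin/run.py", "x-python"),
    ]
    for mime_type, paths in partition_by_mime(files):
        print(mime_type, paths)
```

Under this scheme, two files named test.html can end up in the same job or in different ones, depending on where the chunk boundary falls.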

lewismc commented on September 15, 2024

Based on the above, I'll fire FileMgr back up and try to get to the bottom of it. Thanks for the comments.

chrismattmann commented on September 15, 2024

@lewismc did you get a chance to check this?

lewismc commented on September 15, 2024

Hi @chrismattmann, no, I didn't get a chance to check this out. However, I will update this issue once I run DRAT over the HTrace (http://incubator.apache.org/projects/htrace.html) codebase. We are just making sure that the software grant from Cloudera checks out OK, then I assume the podling will continue with setting up the codebase @apache.
I'll update once I use DRAT next.

chrismattmann commented on September 15, 2024

thanks @lewismc OK!

lewismc commented on September 15, 2024

@chrismattmann I've just been able to verify this bug.

lewismc commented on September 15, 2024

If you look at
https://github.com/apache/oodt/blob/trunk/profile/src/main/java/org/apache/oodt/profile/handlers/lightweight/package.html
you will see that there is no license header.
If you then look at
https://github.com/apache/oodt/blob/trunk/xmlquery/src/main/java/org/apache/oodt/xmlquery/package.html
you will also notice no license header.
Within the DRAT report I posted for the OODT 0.8.1 RC#1 results, you will see the following:

 Unapproved licenses:

   /usr/local/drat/deploy/data/jobs/rat/1422589552487/input/async.html
   /usr/local/drat/deploy/data/jobs/rat/1422589552487/input/images.html
   /usr/local/drat/deploy/data/jobs/rat/1422589552487/input/large.html
   /usr/local/drat/deploy/data/jobs/rat/1422589552487/input/package.html
   /usr/local/drat/deploy/data/jobs/rat/1422589552487/input/prerendered.html
   /usr/local/drat/deploy/data/jobs/rat/1422589552487/input/simple.html

This is incorrect: there should be two package.html entries, one for each of the files above.

lewismc commented on September 15, 2024

I don't know where this bug actually resides to be honest. I have a feeling it is possibly within RAT itself...

lewismc commented on September 15, 2024

The same is true for

   /usr/local/drat/deploy/data/jobs/rat/1422589552402/input/build.xml

which actually corresponds to both of these files:

lmcgibbn@LMC-032857 /usr/local/oodt_trunk(master) $ find . -name "build.xml"

./tools/pdi_plugin/build/build.xml
./tools/pdi_plugin/build.xml
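
For anyone who wants to check a whole source tree for this pattern rather than running find one file name at a time, here is a rough sketch that lists every file name occurring more than once. The helper name `duplicate_basenames` is made up for illustration and is not part of DRAT or RAT.

```python
import os
from collections import defaultdict


def duplicate_basenames(root):
    """Map each file name that occurs more than once under root to its paths.

    Hypothetical helper for illustration only.
    """
    by_name = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            by_name[name].append(os.path.join(dirpath, name))
    # Keep only the names that are shared by two or more files.
    return {name: sorted(paths) for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    for name, paths in sorted(duplicate_basenames(".").items()):
        print(name)
        for path in paths:
            print("   ", path)
```

Any name reported here is a candidate for showing up only once in the report.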

chrismattmann commented on September 15, 2024

So the File Manager catalog currently contains multiple files that have the same name, so that part is correct. I then checked out the MimePartitioner. What it does is look up all files by type, for the following types:

   mimeTypes = ['x-java-source', 'x-c', 'javascript', 'xml', 'html', 'css', 'x-json', 'x-sh', 'x-fortran', 'csv', 'tab-separated-values', 'x-tex', 'x-asm', 'x-diff', 'x-python']

For each of those types, it queries Solr, gets the filename and file location of each matching file, joins these together, and sets the metadata field InputFiles to a multi-valued set for each RAT PGE task.
The RAT PGE task should then write a script file in $DRAT_HOME/data/jobs/rat//sciPgeExeScript_RatCodeAudit, an example of which is:

[usc-secure-wireless-062-114:jobs/rat/1425608188401] mattmann% more sciPgeExeScript_RatCodeAudit 
#!sh
export PATH=$HOME/bin/:${PATH}
shopt -s expand_aliases
alias rat="java -jar /Users/mattmann/drat/deploy/rat/lib/apache-rat-0.9.jar"
echo "Creating working dirs"
mkdir /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/input ; mkdir /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/output; mkdir /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/logs
echo "Staging input to /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/input"
cp -R `python -c "print ' '.join('/Users/mattmann/drat/src/pge/src/main/resources/bin/mime_partitioner/mime_rat_partitioner.py,/Users/mattmann/drat/src/pge/src/main/resources/bin/rat_aggregator/rat_aggregator.py,/Users/mattmann/drat/src/pge/target/classes/bin/mime_partitioner/mime_rat_partitioner.py,/Users/mattmann/drat/src/pge/target/classes/bin/rat_aggregator/rat_aggregator.py'.split(','))"` /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/input
echo "Running Apache RAT on /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/input"
rat /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/input > /Users/mattmann/drat/deploy/data/jobs/rat/1425608188401/output/rat_x-python_1425608188401.log

[usc-secure-wireless-062-114:jobs/rat/1425608188401] mattmann% 

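There is the error, @lewismc: it's in the copy step of the RAT PGE job. The cp -R stages every input file into one flat input directory, so files that share a name overwrite each other and only one copy survives. Looks like we need to check if the files have the same name, and if so, dedup them. Maybe just by replacing '/' with '_' in the path. Let me think.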
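
To illustrate the '/' to '_' idea, here is a minimal sketch that turns each file's path relative to the source root into a flat, unique staging name. This is only an illustration of the idea, not necessarily what the eventual fix does; `flat_name`, `staging_plan`, and the `root` argument are hypothetical names.

```python
import posixpath


def flat_name(path, root):
    """Flatten a path under root into a single file name by replacing '/' with '_'."""
    relative = posixpath.relpath(path, root)
    return relative.replace("/", "_")


def staging_plan(paths, root):
    """Map each flattened target name to its source path.

    Raises ValueError if two sources still flatten to the same name,
    which can happen when directory or file names already contain '_'.
    """
    plan = {}
    for path in paths:
        target = flat_name(path, root)
        if target in plan:
            raise ValueError(f"{path} and {plan[target]} both flatten to {target}")
        plan[target] = path
    return plan


if __name__ == "__main__":
    root = "/usr/local/oodt_trunk"
    paths = [
        root + "/tools/pdi_plugin/build/build.xml",
        root + "/tools/pdi_plugin/build.xml",
    ]
    for target, source in staging_plan(paths, root).items():
        print(source, "->", target)
```

One caveat with this approach: a_b/c and a/b_c flatten to the same name, so the sketch raises an error in that case rather than silently overwriting.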

chrismattmann commented on September 15, 2024

Fixed and tested in #34.
