
quac's People

Contributors

aronwc, gfairchild, jtbates, reidpr, tmills

quac's Issues

implement makereduce --update

The basic idea is pretty simple:

  1. Dump command line to a file in the job dir
  2. Merge the saved command line from the file with the new one (no options can be re-specified except args.inputs; args.jobdir is the same by definition):
    a. Parse new command line (to discover --update).
    b. Save args.inputs.
    c. Parse old command line.
    d. Merge args.inputs.
  3. Rebuild makefile.

Need to update both quacreduce
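
The merge steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the option set, the `cmdline.json` file name, and the function names are hypothetical, not the actual makereduce interface.

```python
import argparse
import json
import os


def build_parser():
    # Hypothetical option set; the real makereduce options may differ.
    parser = argparse.ArgumentParser(prog="makereduce")
    parser.add_argument("--update", action="store_true")
    parser.add_argument("--partitions", type=int, default=4)
    parser.add_argument("jobdir")
    parser.add_argument("inputs", nargs="*")
    return parser


def saved_path(jobdir):
    # Hypothetical name for the file holding the saved command line.
    return os.path.join(jobdir, "cmdline.json")


def parse_with_update(argv):
    parser = build_parser()
    new = parser.parse_args(argv)              # step 2a: discover --update
    path = saved_path(new.jobdir)
    if not new.update:
        with open(path, "w") as fp:            # step 1: dump command line
            json.dump(argv, fp)
        return new
    new_inputs = new.inputs                    # step 2b: save args.inputs
    with open(path) as fp:
        old = parser.parse_args(json.load(fp))  # step 2c: parse old command line
    # Step 2d: keep the old inputs and append new ones not already present.
    old.inputs = old.inputs + [i for i in new_inputs if i not in old.inputs]
    return old
```

The sketch does not reject re-specified options and does not write the merged inputs back to the job directory; both would be needed before step 3 (rebuilding the makefile).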

Make test.sh verify imports correctly

Currently it is having trouble with packages. Possibly:

  1. Create a module parallel to the module being tested that contains only the imports.
  2. Try to import it (should be silent).
  3. Delete it.
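
One possible shape for those three steps, written for Python 3.8+ (the project's test.sh may target an older Python). The function names are hypothetical, and relative imports inside packages are not handled, which is likely the hard part of this issue.

```python
import ast
import importlib.util
import os
import sys
import tempfile


def import_lines(source):
    """Return the source text of every top-level import statement."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in tree.body
            if isinstance(node, (ast.Import, ast.ImportFrom))]


def check_imports(module_path):
    """Import a throwaway sibling module holding only module_path's imports."""
    with open(module_path) as fp:
        lines = import_lines(fp.read())
    directory = os.path.dirname(os.path.abspath(module_path))
    # Step 1: create a module next to the one being tested.
    fd, tmp = tempfile.mkstemp(suffix=".py", dir=directory)
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True   # avoid leaving a __pycache__ behind
    try:
        with os.fdopen(fd, "w") as fp:
            fp.write("\n".join(lines) + "\n")
        # Step 2: import it; a failing import raises ImportError.
        spec = importlib.util.spec_from_file_location("import_check", tmp)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    finally:
        # Step 3: delete it.
        sys.dont_write_bytecode = old_flag
        os.remove(tmp)
```

Because only the import statements are copied, the rest of the module under test is never executed.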

Location_Estimate.likelihood_polygons

Given a Location_Estimate and a list of Polygons, output <polygon_id, probability> tuples, where the probability is the likelihood that the true point is within that polygon. Threshold return list by probability.

Note that likelihood_polygon (without the 's') is already implemented and tested, and seems to do what we want for an individual polygon.
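
A minimal sketch of how the plural method might wrap the existing one. It assumes that likelihood_polygon takes a single polygon and returns a probability, and that the polygons arrive as (polygon_id, polygon) pairs; both are assumptions, as are the threshold default and the standalone-function form.

```python
def likelihood_polygons(estimate, polygons, threshold=0.0):
    """Return (polygon_id, probability) tuples with probability >= threshold.

    estimate: object with a likelihood_polygon(polygon) method (assumed signature)
    polygons: iterable of (polygon_id, polygon) pairs (assumed input shape)
    """
    results = []
    for polygon_id, polygon in polygons:
        probability = estimate.likelihood_polygon(polygon)
        if probability >= threshold:
            results.append((polygon_id, probability))
    return results
```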

cmdtests should do all work in a subdirectory

Currently, the cmdtests tend to leave crud in the test directory after they are done. Rather than using a shared directory, each test should create a subdirectory for itself and work there. Perhaps something that could be done in environment.sh, rewriting $DESTDIR?

Support absolute paths in config files

Currently, paths in the config files must be relative to the location of the config file specified on the command line (specifically, absolute paths don’t work).
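
One way this could work, as a sketch: os.path.join discards the base directory when the second argument is absolute, so a single code path can serve both cases. The helper name resolve_path is hypothetical.

```python
import os


def resolve_path(config_file, path):
    """Resolve path relative to config_file's directory; absolute paths pass through."""
    base = os.path.dirname(os.path.abspath(config_file))
    # os.path.join drops `base` entirely when `path` is absolute.
    return os.path.normpath(os.path.join(base, path))
```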

Put docs on the web

Looks like some moderately hairy scripting (involving submodules?) to get the docs onto the gh-pages branch.

Change Twitter library

Tweetstream isn't widely used and the maintainer isn't responsive. It would be nice if we could use something that we didn't have to maintain (e.g., I added OAuth support myself).

Sphinx theme is kind of terrible

Currently we use the sphinxdoc theme, which is sort of OK but doesn't have a full table of contents in the sidebar (and the page TOC is mislabeled "Table of contents"), and there's no way to make the sidebar sticky.

I tried agogo, which has a real table of contents in the sidebar, but it's kind of ugly. Font too large, ugly page title.

make web is broken

error excerpt:

cd ../doc && git push
Counting objects: 101, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (55/55), done.
Writing objects: 100% (75/75), 43.04 KiB, done.
Total 75 (delta 42), reused 45 (delta 20)
To [email protected]:reidpr/quac.git
   c792935..fc95ab6  gh-pages -> gh-pages
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to '[email protected]:reidpr/quac.git'
hint: Updates were rejected because a pushed branch tip is behind its remote
hint: counterpart. If you did not intend to push that branch, you may want to
hint: specify branches to push or set the 'push.default' configuration
hint: variable to 'current' or 'upstream' to push only the current branch.
make: *** [web] Error 1

Add "how to contribute" to the docs

  • versions aren't stable, they're more of a goal-setting tool
    • crazy stuff happens on branches other than master, master should be in general reasonably stable, we try to do decent testing
  • reporting bugs (probably in README)
  • pick an issue
    • if it's complex, converse before taking it on
    • also submit an issue if you want something we haven't thought of (which is awesome)
    • please don't submit pull requests without a corresponding issue
  • fork & pull requests
  • update What's New
  • PEP 8 plus:
    • 3 spaces per indent
    • parens around if, for, etc.
    • follow surrounding code...
  • branching model (see "branching model that works")
    • merge feature branches with --no-ff --no-commit

Simplify `verbose` arguments

There are several places where a verbose argument could be eliminated by interrogating the logger instead (i.e., are we at DEBUG log level?).
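
As a sketch of the idea, the standard library's Logger.isEnabledFor can stand in for a verbose flag. The helper name is_verbose is hypothetical.

```python
import logging


def is_verbose(logger):
    """True when the logger would emit DEBUG messages."""
    return logger.isEnabledFor(logging.DEBUG)
```

A function that used to accept `verbose=True` could then call `is_verbose(logger)` on the logger it already uses.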

test.sh invocation of find doesn't work on mac

The stock Mac find fails with:

find: -xtype: unknown primary or operator

Fix the find call to be compatible (-type may work, except for something about symlinks, maybe -follow helps?), or document installing findutils with homebrew: this works, except the command is installed as gfind instead of find.

Organize top level better

e.g., add lib (Python code), bin, etc. to reduce the number of files at the top level to fewer than a dozen or so.

`.gitignore` policy

Should we put a .gitignore file with e.g. .pyc in it, or depend on devs to have common stuff like that in a global .gitignore?

u.py tests fail on Mac OS

$ ./test.sh u.py
+ u... 
**********************************************************************
File "/Users/reidpr/Documents/quac/u.py", line 403, in __main__.memory_use
Failed example:
    big = memory_use()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest __main__.memory_use[1]>", line 1, in <module>
        big = memory_use()
      File "/Users/reidpr/Documents/quac/u.py", line 414, in memory_use
        ss = open('/proc/self/status').read()
    IOError: [Errno 2] No such file or directory: '/proc/self/status'
**********************************************************************
1 items had failures:
   6 of   8 in __main__.memory_use
***Test Failed*** 6 failures.
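
A possible fallback, sketched under assumptions: which field of /proc/self/status the real memory_use() reads is not shown in the traceback, so VmRSS is a guess here. Where /proc is unavailable the sketch uses resource.getrusage, which reports peak rather than current usage and uses different units on Mac OS (bytes) and Linux (kilobytes).

```python
import os
import resource
import sys


def memory_use():
    """Return an approximate memory footprint of this process, in bytes."""
    if os.path.exists('/proc/self/status'):
        with open('/proc/self/status') as fp:
            for line in fp:
                if line.startswith('VmRSS:'):      # assumed field of interest
                    return int(line.split()[1]) * 1024   # value is in kB
    # No /proc (e.g., Mac OS): fall back to peak resident set size.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == 'darwin' else peak * 1024
```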

hashsplit is slow

Part of the QUACreduce implementation requires splitting the output of mappers into multiple partitions. The right approach, AFAICT, is to hash each key and take the result modulo the number of partitions.
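
A minimal sketch of that hash-and-modulus idea, not the actual hashsplit script. It assumes tab-separated key/value lines and uses CRC32 as the hash because it is stable across runs; both choices are assumptions made for illustration.

```python
import zlib


def partition(key, n):
    """Map a key to a partition index in range(n) using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % n


def hashsplit(lines, n):
    """Distribute lines into n lists so that equal keys share a partition."""
    parts = [[] for _ in range(n)]
    for line in lines:
        key = line.split("\t", 1)[0]   # assumed format: key<TAB>value
        parts[partition(key, n)].append(line)
    return parts
```

Unlike round-robin distribution, every line with a given key lands in the same partition, which is what the reducers need.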

I'd thought that there would be existing tools to do this, but I haven't been able to find any. I implemented a script, hashsplit, that does this splitting, but it's pretty slow. On my box, it seems to top out at about 30 MB/s, and despite a fair bit of tinkering, I haven't been able to move it past that. In contrast, split breaking the same file into the same number of pieces with round-robin line distribution runs at over 200 MB/s; this is nearly as fast as dd'ing /dev/zero onto the disk (230 MB/s).

Options to move forward that I've come up with so far:

  • Implement hashsplit in C.
  • Implement hashsplit in PyPy or Cython.
  • Patch split in GNU coreutils? (There's already a function to distribute lines round-robin; perhaps this could be modified to do hashing.)

The Python API of QUACreduce could split internally rather than writing to stdout, but this would break extensibility to Hadoop Streaming and might not be a huge win anyway, since the same slow code from hashsplit would be used.

I also tried a Python 3 implementation which wasn't any faster.

I'm putting this on milestone v0.3 to keep it under close consideration for a little while, but it's possible deferring it will be appropriate.

add tweet reader that does not parse timestamp?

tweet.Tweet.from_list() is slowed by the time_.iso8601utc_parse() call.

Currently, tweet.Reader operates at about 35,000 tweets/second (on my workstation); if we replace the timestamp parsing with a simple assignment (leaving the time as a string), that goes up to about 130,000 tweets/second (~4x faster).

It's unclear to me whether this matters and/or is worth the complexity to add the no-parse option, but it did seem worth documenting.
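
One alternative to a separate no-parse reader would be to defer parsing until the timestamp is first read. The sketch below is hypothetical: LazyTweet is not part of the tweet module, and datetime.fromisoformat stands in for time_.iso8601utc_parse().

```python
from datetime import datetime


class LazyTweet:
    """Hypothetical tweet record that parses its timestamp on first access."""

    def __init__(self, time_str):
        self.time_str = time_str   # raw timestamp, stored by plain assignment
        self._time = None

    @property
    def time(self):
        if self._time is None:
            # Stand-in for time_.iso8601utc_parse(); runs at most once.
            self._time = datetime.fromisoformat(self.time_str)
        return self._time
```

Readers that never touch `time` would pay only the cost of the assignment.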
