quac,reidpr

The basic idea is pretty simple:

1. Dump command line to a file in the job dir.
2. Merge saved command line from file with new one (can't re-specify any options except args.inputs; args.jobdir is the same by definition):
   a. Parse new command line (to discover --update).
   b. Save args.inputs.
   c. Parse old command line.
   d. Merge args.inputs.
3. Rebuild makefile.
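
A minimal sketch of the merge step, assuming argparse and a JSON dump of the original argv. The file name cmdline.json, the --partitions option, and the helper names are all made up for illustration, and "merge" is assumed to mean an order-preserving union of the old and new inputs.

```python
import argparse
import json
import os

def build_parser():
    # Hypothetical option set; the real quacreduce options will differ.
    p = argparse.ArgumentParser()
    p.add_argument('--update', action='store_true')
    p.add_argument('--partitions', type=int, default=1)
    p.add_argument('jobdir')
    p.add_argument('inputs', nargs='*')
    return p

def save_command_line(jobdir, argv):
    # Step 1: dump the command line to a file in the job dir.
    with open(os.path.join(jobdir, 'cmdline.json'), 'w') as fp:
        json.dump(argv, fp)

def merged_args(argv):
    parser = build_parser()
    new = parser.parse_args(argv)              # a. discover --update
    if not new.update:
        return new
    new_inputs = new.inputs                    # b. save args.inputs
    with open(os.path.join(new.jobdir, 'cmdline.json')) as fp:
        old = parser.parse_args(json.load(fp)) # c. parse old command line
    # d. merge inputs: keep the old ones, append any new ones not yet seen
    old.inputs = old.inputs + [i for i in new_inputs if i not in old.inputs]
    old.update = True
    return old
```

All other options come from the saved command line, so anything re-specified alongside --update is silently ignored here; a real version should probably complain instead.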

Need to update both quacreduce

Make test.sh verify imports correctly

Currently it is having trouble with packages. Possibly:

Create a module parallel to the module being tested that contains only the imports.
Try to import it (should be silent).
Delete it.
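
A rough sketch of those three steps in Python (3.8 or later, for ast.get_source_segment). The function names and the _importcheck_ prefix are made up. It only looks at top-level import statements, and relative imports would still fail in the probe module, which may be exactly the trouble with packages.

```python
import ast
import importlib
import os
import sys

def extract_imports(source):
    """Return source text containing only the top-level import statements."""
    tree = ast.parse(source)
    nodes = [n for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))]
    return '\n'.join(ast.get_source_segment(source, n) for n in nodes) + '\n'

def imports_ok(path):
    """Create a sibling module holding only path's imports, import it, delete it."""
    dirname, filename = os.path.split(os.path.abspath(path))
    probe_name = '_importcheck_' + os.path.splitext(filename)[0]
    probe_path = os.path.join(dirname, probe_name + '.py')
    with open(path) as fp:
        probe_source = extract_imports(fp.read())
    with open(probe_path, 'w') as fp:
        fp.write(probe_source)
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # avoid leaving __pycache__ behind
    sys.path.insert(0, dirname)
    try:
        importlib.invalidate_caches()
        importlib.import_module(probe_name)  # should be silent
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(dirname)
        sys.modules.pop(probe_name, None)
        sys.dont_write_bytecode = old_flag
        os.remove(probe_path)
```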

Copy bugs from `limitations.rst` to GitHub issue tracker

Location_Estimate.likelihood_polygons

Given a Location_Estimate and a list of Polygons, output <polygon_id, probability> tuples, where the probability is the likelihood that the true point is within that polygon. Threshold the returned list by probability.

Note that likelihood_polygon (without the 's') is already implemented and tested, and seems to do what we want for an individual polygon.
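
One possible shape for this, built directly on likelihood_polygon. This is a sketch only: the real call signature of likelihood_polygon isn't shown here, so it is assumed to take one polygon and return a probability. The (polygon_id, polygon) pair format and the threshold parameter are also assumptions.

```python
def likelihood_polygons(estimate, polygons, threshold=0.0):
    """Return (polygon_id, probability) tuples for polygons at or above threshold.

    estimate: object with a likelihood_polygon(polygon) method (assumed signature)
    polygons: iterable of (polygon_id, polygon) pairs (assumed format)
    """
    results = []
    for (polygon_id, polygon) in polygons:
        p = estimate.likelihood_polygon(polygon)
        if p >= threshold:
            results.append((polygon_id, p))
    return results
```

In the real code this would presumably be a method on Location_Estimate rather than a free function.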

cmdtests should do all work in a subdirectory

Currently, the cmdtests tend to leave crud in the test directory after they are done. Rather than using a shared directory, each test should create a subdirectory for itself and work there. Perhaps something that could be done in environment.sh, rewriting $DESTDIR?

Support absolute paths in config files

Currently, paths in the config files must be relative to the location of the config file specified on the command line (specifically, absolute paths don’t work).
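
A sketch of one way to resolve paths so that both forms work. It relies on os.path.join discarding the base when its second argument is absolute. The function name is hypothetical and this is not how u.py currently does it.

```python
import os.path

def resolve_config_path(config_file, path):
    """Resolve path from a config file: relative paths are taken relative to
    the config file's directory, absolute paths are returned as given."""
    base = os.path.dirname(os.path.abspath(config_file))
    # os.path.join() drops base entirely if path is absolute.
    return os.path.normpath(os.path.join(base, path))
```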

consider reimplementing test.sh in Python

It's becoming a rather large shell script, and shell is kind of horrible to program in.

Add road map & what's new

implement absolute paths in config

Put docs on the web

Looks like some moderately hairy scripting (involving submodules?) to get the docs onto the gh-pages branch.

Make `Geo_GMM` objects printable

Geo_GMM objects can't be printed because of a RuntimeError arising from (IMO) a design error in scikit-learn. I filed a bug report which was not well received.

Update citing.rst to point to the arXiv paper

look into gzip instead of Python zlib for decompressing .json.gz

Geoff tells me it's much faster.
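
Assuming "gzip" here means the external gzip program, a minimal sketch of reading a .json.gz through a pipe looks like this. The function name is made up, and whether it actually beats zlib would need to be measured.

```python
import subprocess

def gunzip_lines(path):
    """Yield raw lines (bytes) from a gzipped file, decompressed by an
    external gzip process rather than Python's zlib."""
    proc = subprocess.Popen(['gzip', '-dc', path], stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Runs on normal exhaustion and if the caller stops iterating early.
        proc.stdout.close()
        proc.wait()
```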

update tests which use sorted() to use pformat() instead

Makes it more obvious what sort of data structure is being tested, and retains the order predictability of sorted().

Also looks like I've used pprint() instead of pformat() in a few places. Fix these too.
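
To illustrate the difference with a made-up function (word_counts is not in the code base): the first doctest flattens the dict into a list of pairs, while the second shows that it's a dict and still prints the keys in a predictable order.

```python
from pprint import pformat

def word_counts(words):
    """Count occurrences of each word.

    With sorted(), the test output hides the fact that this is a dict:

    >>> sorted(word_counts(['b', 'a', 'b']).items())
    [('a', 1), ('b', 2)]

    With pformat(), the structure is visible and keys are still sorted:

    >>> print(pformat(word_counts(['b', 'a', 'b'])))
    {'a': 1, 'b': 2}
    """
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts
```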

add library functions for profiling

Perhaps something in u.ArgumentParser(), u.profile_start(), u.profile_stop()? Example in hashsplit.

ack CNLS funding

Change Twitter library

Tweetstream isn't used all that much and the maintainer isn't responsive. It would be nice if we could use something that we didn't have to maintain (e.g., I added OAuth support myself).

add copyright notice to each file

... and add a test to verify.

Sphinx theme is kind of terrible

Currently we use the sphinxdoc theme, which is sort of OK but doesn't have a full table of contents in the sidebar (and the page TOC is mislabeled Table of contents), and there's no way to make the sidebar sticky.

I tried agogo, which has a real table of contents in the sidebar, but it's kind of ugly. Font too large, ugly page title.

README and overview.rst contain duplicate information

Perhaps split README into three files (README.rst, COPYRIGHT.rst, FUNDING.rst) and include them into the Sphinx stuff?

make web is broken

error excerpt:

cd ../doc && git push
Counting objects: 101, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (55/55), done.
Writing objects: 100% (75/75), 43.04 KiB, done.
Total 75 (delta 42), reused 45 (delta 20)
To git@github.com:reidpr/quac.git
   c792935..fc95ab6  gh-pages -> gh-pages
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'git@github.com:reidpr/quac.git'
hint: Updates were rejected because a pushed branch tip is behind its remote
hint: counterpart. If you did not intend to push that branch, you may want to
hint: specify branches to push or set the 'push.default' configuration
hint: variable to 'current' or 'upstream' to push only the current branch.
make: *** [web] Error 1

Add "how to contribute" to the docs

versions aren't stable, they're more of a goal-setting tool
- crazy stuff happens on branches other than master, master should be in general reasonably stable, we try to do decent testing
reporting bugs (probably in README)
pick an issue
- if it's complex, converse before taking it on
- also submit an issue if you want something we haven't thought of (which is awesome)
- please don't submit pull requests without a corresponding issue
fork & pull requests
update What's New
PEP 8 plus:
- 3 spaces per indent
- parens around if, for, etc.
- follow surrounding code...
branching model (see "branching model that works")
- merge feature branches with --no-ff --no-commit

how to structure contributions of core team: shared writability, or pull requests

Instagram downloads via streaming API

basic time series computation and correlation

One global time series per keyword.

Simplify `verbose` arguments

There are several places where a verbose argument could be eliminated by interrogating the logger instead (i.e., are we at DEBUG log level?).
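
For example, using the standard logging API (the logger name and the is_verbose helper are made up for illustration):

```python
import logging

l = logging.getLogger('quac_example')

def is_verbose(logger=l):
    """True if the logger would emit DEBUG messages, which can stand in
    for an explicit verbose argument."""
    return logger.isEnabledFor(logging.DEBUG)
```

Code that currently takes verbose=True could then call is_verbose() (or isEnabledFor directly) at the point where it decides whether to do the extra work.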

test.sh should find and complain about files that contain tabs
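
If test.sh ends up calling out to Python for this, a sketch of the check might be as follows. The extension list is an assumption, chosen so that Makefiles, which need tabs, are not flagged.

```python
import os

def files_with_tabs(root, extensions=('.py', '.sh', '.rst')):
    """Return sorted paths under root, with a matching extension, that
    contain at least one tab character."""
    hits = []
    for (dirpath, dirnames, filenames) in os.walk(root):
        # Don't descend into the git metadata directory.
        dirnames[:] = [d for d in dirnames if d != '.git']
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as fp:
                    if b'\t' in fp.read():
                        hits.append(path)
    return sorted(hits)
```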

refactor "getting started" and "dependencies" into "installation" and "collecting tweets"

Also probably remove "how to get QUAC" section from README.

revise docs for out-of-dateness

update scripts to use new u.ArgumentParser class

Flickr downloads via streaming API

configuration is pretty messy

Reading the configuration is scattered through different classes and functions in u.py. Evaluate and clean up.

download wikipedia access logs

Remove lib from .gitignore

lib contains .py files, so it should not be ignored.

test.sh invocation of find doesn't work on mac

The stock Mac find fails with:

find: -xtype: unknown primary or operator

Fix the find call to be compatible (-type may work, except for something about symlinks, maybe -follow helps?), or document installing findutils with homebrew: this works, except the command is installed as gfind instead of find.

Organize top level better

e.g., add lib (Python code), bin, etc. to reduce the number of files at the top level to less than a dozen or so.

`.gitignore` policy

Should we put a .gitignore file with e.g. .pyc in it, or depend on devs to have common stuff like that in a global .gitignore?

add blurb to docs about please e-mail us if QUAC is useful to you

u.py tests fail on Mac OS

$ ./test.sh u.py
+ u... 
**********************************************************************
File "/Users/reidpr/Documents/quac/u.py", line 403, in __main__.memory_use
Failed example:
    big = memory_use()
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest __main__.memory_use[1]>", line 1, in <module>
        big = memory_use()
      File "/Users/reidpr/Documents/quac/u.py", line 414, in memory_use
        ss = open('/proc/self/status').read()
    IOError: [Errno 2] No such file or directory: '/proc/self/status'
**********************************************************************
1 items had failures:
   6 of   8 in __main__.memory_use
***Test Failed*** 6 failures.
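
One way around this is to fall back to the resource module when /proc isn't there. This is a sketch, not the current u.py code: it assumes memory_use() should return resident set size in bytes, and note that the fallback reports peak rather than current usage.

```python
import resource
import sys

def memory_use():
    """Return resident memory in bytes (assumed contract).

    Reads /proc/self/status where available (Linux); otherwise falls
    back to getrusage(), which gives the peak resident set size."""
    try:
        with open('/proc/self/status') as fp:
            for line in fp:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) * 1024  # value is in kB
    except IOError:
        pass  # no /proc, e.g. on Mac OS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on Mac OS but in kilobytes on Linux.
    return peak if sys.platform == 'darwin' else peak * 1024
```

The doctests for memory_use would likely need loosening too, since peak usage never goes down.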

add API documentation to Sphinx

At this stage, this just means ingesting the existing docstrings and cleaning up any egregious formatting errors.

hashsplit is slow

Part of the QUACreduce implementation requires splitting the output of mappers into multiple partitions. The right approach, AFAICT, is to hash each key and take modulus number of partitions.
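
The idea in miniature, as a sketch rather than the actual hashsplit code. It assumes tab-separated key/value lines and uses CRC32 as the hash, since Python 3's built-in hash() for strings varies between processes; the function names are made up.

```python
import zlib

def partition_for(key, n):
    """Map a key to a partition number in [0, n) by hashing it and taking
    the result modulo the number of partitions."""
    return zlib.crc32(key.encode('utf8')) % n

def hashsplit(lines, n, sep='\t'):
    """Distribute lines into n lists so that all lines sharing a key
    (the text before the first separator) land in the same partition."""
    partitions = [[] for _ in range(n)]
    for line in lines:
        key = line.split(sep, 1)[0]
        partitions[partition_for(key, n)].append(line)
    return partitions
```

The real script writes to one file per partition instead of building lists in memory, but the per-line work (split, hash, modulus) is the same.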

I'd thought that there would be existing tools to do this, but I haven't been able to find any. I implemented a script, hashsplit, that does this splitting, but it's pretty slow. On my box, it seems to top out at about 30 MB/s, and despite a fair bit of tinkering, I haven't been able to move it past that. In contrast, split breaking the same file into the same number of pieces with round-robin line distribution runs at over 200 MB/s; this is nearly as fast as dding /dev/zero onto the disk (230 MB/s).

Options to move forward that I've come up with so far:

Implement hashsplit in C.
Implement hashsplit in PyPy or Cython.
Patch split in GNU coreutils? (There's already a function to distribute lines round-robin; perhaps this could be modified to do hashing.)

The Python API of QUACreduce could split internally rather than writing to stdout, but this would break extensibility to Hadoop Streaming and might not be a huge win anyway, since the same slow code from hashsplit would be used.

I also tried a Python 3 implementation which wasn't any faster.

I'm putting this on milestone v0.3 to keep it under close consideration for a little while, but it's possible deferring it will be appropriate.

It's unclear to me whether this matters and/or is worth the complexity to add the no-parse option, but it did seem worth documenting.

Infer location of all tweets

The functionality for inference is present, but we need to add it to the pre-processing pipeline.
