The petlx from petl-developers

todataframe

A todataframe() convenience function to load a table into a pandas DataFrame would be useful (probably have to go via numpy structured array).

hook into petl.fluent

Hook packages into petl.fluent so they can be used in the fluent style, e.g., etl().fromgff3() etc.

gff3 utilities

Proposed to add functions fromgff3, gff3unpackinfo and gff3intervaljoin for working with gff3 annotation files.

ipython display support unicode

Ensure display() and displayall() functions in ipython integration module support unicode.

fix link to petl in index.txt

Link points to petl 0.3, very out of date!

fromflagstat

Proposed to add utility function fromflagstat as convenience for parsing outputs of samtools flagstat.
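As a rough illustration of what such a parser might look like, here is a minimal sketch. It assumes each flagstat line has the shape "<passed> + <failed> <description>"; the function name parse_flagstat and the output field names are hypothetical, not part of petlx.

```python
import re

# Assumed line shape: "<qc_passed> + <qc_failed> <description>"
_FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(lines):
    """Yield a header row, then one (description, qc_passed, qc_failed) row per line."""
    yield ("description", "qc_passed", "qc_failed")
    for line in lines:
        m = _FLAGSTAT_LINE.match(line.strip())
        if m is None:
            continue  # skip blank or unrecognised lines
        passed, failed, desc = m.groups()
        yield (desc, int(passed), int(failed))


# Hypothetical sample input, made up for illustration
sample = [
    "1000 + 10 in total (QC-passed reads + QC-failed reads)",
    "20 + 0 duplicates",
    "",
    "950 + 5 mapped (95.00% : 50.00%)",
]
flagstat_table = list(parse_flagstat(sample))
```

The description is kept verbatim, including any parenthesised note, so that lines differing only in that note stay distinguishable.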

fromxlsx flags

There are several flags available when opening an xlsx workbook; proposed to add these to fromxlsx as well and pass them through to openpyxl:

guess_types will enable (default) or disable type inference when reading cells.
data_only controls whether cells with formulae return the formula (default) or the value stored the last time Excel read the sheet.
keep_vba controls whether any Visual Basic elements are preserved or not (default). If they are preserved they are still not editable.

vcf utilities

Proposed to add some simple vcf utility functions for reshaping vcf files into various simpler table forms. E.g., fromvcf, vcfmeltsamples, vcfunpackinfo, vcfunpacksamples, vcfheader.

fromgff3 attributes are none if trailing semicolon

Attributes field is none if the actual field has a trailing semicolon.
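One way to tolerate the trailing semicolon is to skip empty pieces after splitting. The sketch below is only an illustration of that idea, not the petlx implementation; parse_gff3_attributes is a hypothetical helper, and URL-decoding of values is deliberately left out.

```python
def parse_gff3_attributes(field):
    """Parse a GFF3 attributes column such as 'ID=gene1;Name=abc;' into a dict."""
    attributes = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # empty piece produced by a trailing (or doubled) semicolon
        key, sep, value = part.partition("=")
        if not sep:
            continue  # ignore malformed pieces that have no '='
        attributes[key] = value
    return attributes


attrs = parse_gff3_attributes("ID=gene1;Name=abc;")
```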

ipy notebook display()

Add a display() function/method to the ipython integration so you can get multiple tables to display their output from the same code cell.

I wonder if it would be useful to have a wiki for this project. Now that I'm trying to contribute, it would be nice to organize questions, new information etc. in one place. Such things could easily get lost in the Google group. For example, "How do I run the test cases?" and so on.

cache data structures in interval module

Is it worth caching the trees in interval join containers? Currently trees are built each time an iterator is requested.

fromgff3 with region

Proposed to add support for extracting from a GFF3 file for a specific region where the GFF is tabix indexed.

ipython notebook display() caption option

rtree lookups and joins

Proposed to add functions based on bounding box queries via the rtree module - http://toblerity.github.com/rtree/tutorial.html

E.g., bboxlookup(), bboxlookupone(), bboxjoin()

intervalsubtract

Proposed to add function intervalsubtract() to interval module behaving as bedtools subtract.
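The core of such a function could look roughly like the sketch below, which removes from one interval every portion covered by a set of other intervals. It assumes half-open (start, stop) pairs and works on plain tuples rather than petl tables; subtract_intervals is a hypothetical name.

```python
def subtract_intervals(interval, others):
    """Return the parts of (start, stop) not covered by any interval in others.

    Intervals are treated as half-open [start, stop); that is an assumption.
    """
    start, stop = interval
    remaining = []
    cursor = start  # everything before cursor has already been handled
    for o_start, o_stop in sorted(others):
        if o_stop <= cursor or o_start >= stop:
            continue  # no overlap with what is left of the interval
        if o_start > cursor:
            remaining.append((cursor, o_start))  # uncovered gap
        cursor = max(cursor, o_stop)
        if cursor >= stop:
            break
    if cursor < stop:
        remaining.append((cursor, stop))
    return remaining


pieces = subtract_intervals((10, 50), [(0, 15), (20, 30), (45, 60)])
```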

to/from mongodb

Proposed to add to/from functions for working with mongodb.

fromvcf 'chrom' keyword doesn't work if provided without start or stop

Would be nice to select rows from a given chromosome, without also having to provide coords.

petlx.ipython display() default to 5 rows

Change default display to 5 rows similar to petl.head().

interval left join with missing facet key

intervalleftjoin() should work if the right table is missing one of the facet keys found in the left table.

fromxlsx 'Sheet1' is default sheet name

Proposed to add 'Sheet1' as default sheet name.

array() convenience function for values/column

It would be nice to be able to do...

a = tbl.values('foo').array()

RowContainer import fails

RowContainer has moved in petl 0.11, need to update interval and gff3 modules.

fromhtsql

Proposed to add module providing adapters for htsql. Including fromhtsql taking htsql object or connection string followed by query.

simplify toarray/torecarray()?

Currently the dtype for a structured array is inferred one column at a time. However, passing a sample of the data to np.rec.array() would infer a dtype for the whole table in one go, and would simplify the code.

interval joins

It would be useful to have functions that support joining tables by overlapping ranges, rather than exact key values. E.g., to join a list of positions in a genome with records from a gene annotation table.

The proposal is to add functions intervaljoin and facetintervaljoin which join two tables based on overlapping ranges, with the faceted version combining a conventional key-based join with a range join.
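To make the intended semantics concrete, here is a deliberately naive sketch of an inner join on overlapping ranges. It uses a nested loop rather than an interval tree, assumes half-open intervals, and represents tables as lists of tuples with a header row; the function name and signature are hypothetical and do not reflect the proposed petlx API.

```python
def interval_join(left, right, lstart, lstop, rstart, rstop):
    """Inner join of two tables (header row first) on overlapping ranges."""
    lhdr, rhdr = list(left[0]), list(right[0])
    l_start_i, l_stop_i = lhdr.index(lstart), lhdr.index(lstop)
    r_start_i, r_stop_i = rhdr.index(rstart), rhdr.index(rstop)
    out = [tuple(lhdr + rhdr)]
    for lrow in left[1:]:
        for rrow in right[1:]:
            # Half-open intervals overlap when each starts before the other ends
            if lrow[l_start_i] < rrow[r_stop_i] and rrow[r_start_i] < lrow[l_stop_i]:
                out.append(tuple(lrow) + tuple(rrow))
    return out


# Made-up example data: positions joined against gene annotations
positions = [("pos_start", "pos_stop", "snp"),
             (5, 6, "a"), (50, 51, "b"), (120, 121, "c")]
genes = [("gene_start", "gene_stop", "gene"),
         (0, 10, "g1"), (40, 60, "g2"), (55, 70, "g3")]
joined = interval_join(positions, genes,
                       "pos_start", "pos_stop", "gene_start", "gene_stop")
```

Position "c" overlaps no gene and so drops out of the result, as expected for an inner join.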

faceted interval lookups

In addition to the existing interval lookup functions, it is proposed to add faceted versions of these functions, to allow for construction and query of multiple interval trees. The motivating use case is lookup of genomic locations, where you want one interval tree per chromosome.

The proposal is to add functions facetintervallookup, facetintervallookupone, facetintervalrecordlookup, facetintervalrecordlookupone, as faceted versions of the existing interval lookup functions.
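The faceting idea can be sketched as one interval collection per facet value. The example below swaps the interval tree for a plain list with a linear scan to stay dependency-free; the class name FacetIntervalLookup, its find() method and the dict-based rows are all assumptions made for illustration.

```python
from collections import defaultdict


class FacetIntervalLookup:
    """One list of intervals per facet value (e.g. one per chromosome)."""

    def __init__(self, rows, facet, start, stop):
        # rows is an iterable of dicts; facet/start/stop are key names
        self._by_facet = defaultdict(list)
        for row in rows:
            self._by_facet[row[facet]].append((row[start], row[stop], row))

    def find(self, facet_value, start, stop):
        """Return rows in the given facet overlapping half-open [start, stop)."""
        return [row for (s, e, row) in self._by_facet.get(facet_value, [])
                if s < stop and start < e]


# Made-up example data
genes = [
    {"chrom": "chr1", "start": 100, "stop": 200, "name": "A"},
    {"chrom": "chr1", "start": 150, "stop": 300, "name": "B"},
    {"chrom": "chr2", "start": 100, "stop": 200, "name": "C"},
]
lkp = FacetIntervalLookup(genes, "chrom", "start", "stop")
chr1_hits = [r["name"] for r in lkp.find("chr1", 180, 190)]
chr3_hits = lkp.find("chr3", 0, 10)  # unknown facet gives an empty result
```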

cachetag is deprecated

Since petl 0.16 the cachetag convention is deprecated, remove cachetag methods in petlx and dependencies on deprecated members.

fromsav

Proposed to add fromsav using the spss recipe on activestate's website.

to/from hdf5

Proposed to add functions for working with hdf5 via pytables.

fromdta

Proposed to add fromdta using statsmodels.

to/from xls

Add support for working directly with Excel (XLS) files, probably via xlrd.

collapsed intervals

Proposed to add utility function to petlx.interval to return collapsed interval from a table with start, stop coords.
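A minimal sketch of the collapsing step, assuming (start, stop) pairs and treating intervals that merely touch as mergeable (that choice is an assumption); collapse_intervals is a hypothetical name.

```python
def collapse_intervals(intervals):
    """Merge overlapping or touching (start, stop) pairs into a sorted list."""
    collapsed = []
    for start, stop in sorted(intervals):
        if collapsed and start <= collapsed[-1][1]:
            # Overlaps (or abuts) the previous interval: extend it
            prev_start, prev_stop = collapsed[-1]
            collapsed[-1] = (prev_start, max(prev_stop, stop))
        else:
            collapsed.append((start, stop))
    return collapsed


merged = collapse_intervals([(1, 5), (3, 8), (10, 12), (12, 15), (20, 25)])
```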

fromarray

Add fromarray() function to petlx.array module (was postponed from #1).

fix petl.fluent and petl.interactive integration after import pattern change

The import pattern has changed: petl-developers/petl#230

Need to modify the integration module accordingly.

fromxlsx default to first sheet in workbook

Modify fromxlsx to extract from first sheet in workbook, rather than fixed name 'Sheet1'.

pypi distribution does not contain xls.py

ipython display table

Add a package petlx.ipython with function display() which takes table, converts to HTML and inlines in notebook.
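The table-to-HTML conversion might be sketched as follows. This is an illustration only: table_to_html and its limit parameter are hypothetical, and tables are assumed to be lists of tuples with a header row.

```python
from html import escape


def table_to_html(table, limit=5):
    """Render the header and the first `limit` data rows as an HTML table."""
    header, rows = table[0], table[1:limit + 1]
    parts = ["<table>", "<tr>"]
    parts += ["<th>%s</th>" % escape(str(h)) for h in header]
    parts.append("</tr>")
    for row in rows:
        parts.append("<tr>")
        # Escape values so that data containing markup is shown literally
        parts += ["<td>%s</td>" % escape(str(v)) for v in row]
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


html_out = table_to_html([("foo", "bar"), ("a", 1), ("<b>", 2)])
```

In a notebook, the resulting string could then be wrapped in IPython.display.HTML and passed to IPython.display.display to inline it in the cell output.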

fromtabix with no header row

I suspect fromtabix will currently break on files with no header rows.

fromxlsx not wrapped in fluent/interactive

The result of fromxlsx is not properly wrapped when using petl.fluent or petl.interactive.

torecarray()

Request convenience function torecarray() in petlx.array, to save having to type ".view(recarray)" all the time.

fromxlsx add range option

Add a range keyword argument to fromxlsx to allow extracting a table from a specific cell range.

interval use of suffix notation ([]) is inappropriate

Current use of the suffix notation in the petlx.interval module is not appropriate as start and stop values are not indices and so not a slice. Proposed to change to a find() method as per the underlying bx-python module.

interval doc error

>>> from petlx import intervallookup

Should import from petl.interval package.

fromsoup

Implement a fromsoup() method using Beautiful Soup to provide more flexibility and power for extracting tables from XML or HTML.

interval left join as list of values

It would be convenient to be able to perform an interval left join but then have matching values from one or more fields in the right hand table given as a list of values in a new column. I.e., the output would have one row per input row in the left hand table.
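The shape of the output could be sketched like this, using a naive scan over dict rows with fixed "start" and "stop" keys and half-open intervals. All names here (interval_left_join_values, value_key, out_key) are hypothetical.

```python
def interval_left_join_values(left, right, value_key, out_key):
    """For each left row, collect value_key from every overlapping right row."""
    out = []
    for lrow in left:
        matches = [r[value_key] for r in right
                   if lrow["start"] < r["stop"] and r["start"] < lrow["stop"]]
        new_row = dict(lrow)
        new_row[out_key] = matches  # empty list when nothing overlaps
        out.append(new_row)
    return out


# Made-up example data
snps = [
    {"start": 5, "stop": 6, "snp": "a"},
    {"start": 50, "stop": 58, "snp": "b"},
    {"start": 120, "stop": 121, "snp": "c"},
]
gene_rows = [
    {"start": 0, "stop": 10, "gene": "g1"},
    {"start": 40, "stop": 60, "gene": "g2"},
    {"start": 55, "stop": 70, "gene": "g3"},
]
annotated = interval_left_join_values(snps, gene_rows, "gene", "genes")
```

Every left row appears exactly once in the output, whether it has zero, one or several matches.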

fromtabix via pysam

Proposed to add a package petlx.tabix with function fromtabix which supports extracting data from a tab delimited file specifying a sequence region and coords.

support automated database table creation upon loading

This is a placeholder for adding support for creating a database table prior to loading it. I.e., a function similar to the standard petl.todb() function, but automatically generate a schema definition based on the table to be loaded, and execute the table creation, prior to loading.

It looks like sqlalchemy has good support for managing different SQL dialects, so proposed to use sqlalchemy as a dependency.

Moved from petl-developers/petl#225
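To illustrate the create-then-load idea, here is a minimal sketch that uses the standard library sqlite3 module in place of sqlalchemy, so it covers only one SQL dialect. The function create_and_load and its type-guessing rules are assumptions made for illustration, not a proposed design.

```python
import sqlite3


def create_and_load(conn, table_name, table):
    """Create a table whose schema is guessed from the data, then load it."""
    header, rows = table[0], table[1:]

    def sql_type(values):
        # Guess a column type from the Python types present (None is ignored)
        kinds = {type(v) for v in values if v is not None}
        if kinds and kinds <= {int, bool}:
            return "INTEGER"
        if kinds and kinds <= {int, bool, float}:
            return "REAL"
        return "TEXT"

    # Note: identifiers are interpolated without escaping; fine for a sketch only
    columns = ", ".join(
        '"%s" %s' % (name, sql_type([row[i] for row in rows]))
        for i, name in enumerate(header)
    )
    conn.execute('CREATE TABLE "%s" (%s)' % (table_name, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table_name, placeholders), rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_and_load(conn, "example",
                [("foo", "bar", "baz"), ("a", 1, 2.5), ("b", 2, None)])
loaded = conn.execute("SELECT foo, bar, baz FROM example ORDER BY bar").fetchall()
```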

toxlsx using optimized writer

Implement a toxlsx function using the openpyxl optimized writer: http://pythonhosted.org/openpyxl/optimized.html#optimized-writer

to/from numpy structured array

I'd like the most convenient possible method of loading a table of data from a petl row container into a numpy structured array for plotting or numerical processing. You can use the numpy fromiter function, but it would be nice to wrap that to make it even more convenient, with minimal specification of the datatype and no need to duplicate the field names (also maybe even guess the data type for fields if not specified).

The proposal is to add a toarray(tbl, dtype, n) function taking a table (row container) as first positional arg, a dtype as some convenient way of specifying the dtype to use for the structured array (possibly sparse?) and an integer n as a hint on the array size (passed through to fromiter).

It is also proposed to add a fromarray(a) function taking a 1D structured array as input and providing a view as a row container to allow round-tripping to and from numpy arrays and petl transformation functions.
