
cmappy's People

Contributors

adofficer, anupjonchhe, darintay, dllahr, jasiedu, levlitichev, mruffalo, naviddianati, oena, saksham219, tnat1031, tyberiusprime, wayveebot


cmappy's Issues

pypi release

Hi,

the PyPI release is severely outdated relative to the current code. Would it be possible to cut a new release on PyPI, please?

Exception("parse_gctx check_id_validity " + msg)

When I extract data from GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx using cmapPy, this exception occurs. Could you help me with it? I don't think it should happen.

I tried to see what the data looks like by following the code in the tutorial cmapPy_pandasGEXpress_tutorial.ipynb.

Here is the code:
vorinostat_only_gctoo = parse("GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid=vorinostat_ids)

The following is the detail msg of the exception:

No handlers could be found for logger "cmap_logger"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/exeh/exe3/zhaok/.local/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse.py", line 51, in parse
curr = parse_gctx.parse(file_path, convert_neg_666, rid, cid, ridx, cidx, meta_only, make_multiindex)
File "/exeh/exe3/zhaok/.local/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 65, in parse
(sorted_ridx, sorted_cidx) = check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta, col_meta)
File "/exeh/exe3/zhaok/.local/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 107, in check_and_order_id_inputs
col_ids = check_and_convert_ids(col_type, col_ids, col_meta_df)
File "/exeh/exe3/zhaok/.local/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 140, in check_and_convert_ids
check_id_validity(id_list, meta_df)
File "/exeh/exe3/zhaok/.local/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 153, in check_id_validity
raise Exception("parse_gctx check_id_validity " + msg)
Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids: set(['LPROT002_MCF7_6H:P10', 'LJP008_HCC515_24H:A03', 'LPROT002_MCF7_6H:P12', 'LJP009_ASC_24H:A03',

Bug: .GCT files written by cmapPy on Windows have inconsistent line endings

Hi Lev,

Bug: .GCT files written with cmapPy on Windows show alternating blank lines after the top 3 lines when opened in Excel, though they look fine in a code editor (Spyder 5.12.3).

Fix: the line below writes the first 2 lines of a .GCT file and would otherwise default to the OS line terminator \r\n, which conflicts with all the other lines, which are terminated by \n.

The inconsistent line endings probably confuse Excel's automatic line-ending detection.

C:\ProgramData\Anaconda3\Lib\site-packages\cmapPy\pandasGEXpress\write_gct.py #line 102

# Write top_half_df to file

#top_half_df.to_csv(f, header=False, index=False, sep="\t")
top_half_df.to_csv(f, header=False, index=False, sep="\t", line_terminator='\n')
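A minimal, self-contained sketch of the same idea (note: recent pandas spells the keyword lineterminator, while the versions current when this issue was filed used line_terminator):

```python
import io

import pandas as pd


def write_gct_top(df):
    # build the .GCT header and first block entirely with '\n' endings,
    # so pandas never falls back to the OS default of '\r\n' on Windows
    buf = io.StringIO()
    buf.write("#1.3\n")
    df.to_csv(buf, header=False, index=False, sep="\t", lineterminator="\n")
    return buf.getvalue()


text = write_gct_top(pd.DataFrame({"a": [1, 2]}))
print("\r" in text)  # mixed endings would show up as '\r' here
```

When writing to a real file on Windows, opening it with open(path, "w", newline="") has the same effect of suppressing \r\n translation.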

Please incorporate into next version. Screenshots attached.

Thanks,

--Karl
cmapPybug_inconsistentLineEndings.docx

pandasGEXpress parse fails

Download test_l1000.gct (https://github.com/cmap/cmapPy/raw/master/cmapPy/pandasGEXpress/tests/functional_tests/test_l1000.gct)

from cmapPy.pandasGEXpress import parse
parse("test_l1000.gct")

TypeError Traceback (most recent call last)
in <module>()
1 from cmapPy.pandasGEXpress import parse
----> 2 parse("test_l1000.gct")

/usr/local/lib/python3.5/dist-packages/cmapPy/pandasGEXpress/parse.py in parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
57 msg = "parse_gct does not use the argument {}. Ignoring it...".format(unused_arg)
58 logger.warning(msg)
---> 59 curr = parse_gct.parse(file_path, convert_neg_666, row_meta_only, col_meta_only, make_multiindex)
60 elif file_path.endswith(".gctx"):
61 curr = parse_gctx.parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only,

/usr/local/lib/python3.5/dist-packages/cmapPy/pandasGEXpress/parse_gct.py in parse(file_path, convert_neg_666, row_meta_only, col_meta_only, make_multiindex)
121 # Read version and dimensions
122 (version, num_data_rows, num_data_cols,
--> 123 num_row_metadata, num_col_metadata) = read_version_and_dims(file_path)
124
125 # Read in metadata and data

/usr/local/lib/python3.5/dist-packages/cmapPy/pandasGEXpress/parse_gct.py in read_version_and_dims(file_path)
145
146 # Get version from the first line
--> 147 version = f.readline().strip().lstrip("#")
148
149 if version not in ["1.3", "1.2"]:

TypeError: a bytes-like object is required, not 'str'
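The failure is a Python 3 bytes/str mix: read_version_and_dims opens the file in binary mode, so readline() returns bytes, but .lstrip("#") is called with a str argument. A minimal sketch of a fixed header read (a hypothetical helper, not cmapPy's actual code):

```python
def read_version(path):
    # text mode yields str on Python 3, so str methods like
    # .strip() and .lstrip("#") work as intended
    with open(path, "r") as f:
        return f.readline().strip().lstrip("#")
```

Alternatively, the file can stay in binary mode if the line is decoded first: f.readline().decode().strip().lstrip("#").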

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

This happens when I try to read a .gctx file. Is it fixable?

File "C:/Users/Farshid/PycharmProjects/DGEP/gctx2npy.py", line 13, in main
   gctobj = parse.parse(GTEx_GCTX)
 File "C:\Users\Farshid\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cmapPy\pandasGEXpress\parse.py", line 68, in parse
   make_multiindex=make_multiindex)
 File "C:\Users\Farshid\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cmapPy\pandasGEXpress\parse_gctx.py", line 110, in parse
   data_df = parse_data_df(data_dset, sorted_ridx, sorted_cidx, row_meta, col_meta)
 File "C:\Users\Farshid\AppData\Local\Programs\Python\Python35-32\lib\site-packages\cmapPy\pandasGEXpress\parse_gctx.py", line 332, in parse_data_df
   data_array = np.empty(data_dset.shape, dtype=np.float32)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
Traceback (most recent call last):
 File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 382, in _on_run
   r = self.sock.recv(1024)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
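The traceback paths show a 32-bit interpreter (Python35-32), whose process address space cannot hold a full-size Level 5 matrix no matter how much RAM the machine has. For illustration, using the dimensions of the LINCS Level 5 file mentioned elsewhere in these issues, the allocation np.empty(data_dset.shape, dtype=np.float32) is attempting roughly:

```python
import numpy as np

# dimensions from the filename: n473647x12328, stored as float32
n_sigs, n_genes = 473647, 12328
size_gib = n_sigs * n_genes * np.dtype(np.float32).itemsize / 2**30
print(f"{size_gib:.1f} GiB")  # far beyond a 32-bit process's 2-4 GiB limit
```

Switching to a 64-bit Python, or subsetting with rid/cid/ridx/cidx so the full matrix is never materialized, avoids the allocation.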

reading gctx from a non-fs file object

Hi,
In our setting there is often a need to read a .gctx file from something other than a path on disk (i.e. from a Python file object).
Currently this is not possible with cmapPy: the parse method explicitly checks that the path exists on the filesystem:

if not os.path.exists(full_path):

h5py, on the other hand, supports arbitrary file objects.

Would it be possible to rely on duck-typing in the parse function instead, to allow different types of input file objects?
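For reference, h5py (2.9+) does open arbitrary file-like objects, so duck-typing in parse would be enough. A sketch showing h5py reading a GCTX-style dataset straight from an in-memory buffer (the node path /0/DATA/0/matrix follows the GCTX layout; the toy data is made up):

```python
import io

import h5py
import numpy as np

buf = io.BytesIO()
# write a toy HDF5 file with a GCTX-style data node into memory
with h5py.File(buf, "w") as f:
    f.create_dataset("0/DATA/0/matrix", data=np.eye(3, dtype=np.float32))

buf.seek(0)
# no filename and no os.path.exists check -- just a file object
with h5py.File(buf, "r") as f:
    print(f["0/DATA/0/matrix"].shape)  # (3, 3)
```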

cmapPy_pandasGEXpress_tutorial.ipynb fails to parse ids

When running the tutorial notebook ('cmapPy_pandasGEXpress_tutorial.ipynb'), this command creates an error:

from cmapPy.pandasGEXpress.parse import parse
vorinostat_only_gctoo = parse("GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid=vorinostat_ids)
some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:  {'LPROT001_PC3_6H:P12', 'LPROT001_NPC.TAK_6H:O08', 'LJP008_A375_24H:G07', 'LJP008_SKL_24H:G12', 'LJP007_HCC515_24H:A03', 'LJP008_NPC.TAK_24H:G08', 'LJP008_HEPG2_24H:G08', 'LJP008_A375_24H:A03', 'LJP008_NEU_24H:G09', 'LJP008_PC3_24H:G09', 'LJP009_CD34_24H:A03', 'LJP008_ASC.C_24H:G07', 'LJP008_NEU_24H:G12', 'LJP008_ASC.C_24H:G10', 'LJP008_NPC.TAK_24H:G11', 'LPROT002_NPC.TAK_6H:O12', 'LPROT002_NPC.TAK_6H:O08', 'LJP008_HT29_24H:G08', 'LJP008_ASC_24H:G12', 'LJP008_HA1E_24H:G10', 'LJP007_HA1E_24H:A03', 'LJP008_HEPG2_24H:A03', 'LPROT001_A375_6H:P11', 'LPROT001_NPC.TAK_6H:O10', 'LJP009_NEU_24H:A03', 'LJP007_MNEU.E_24H:A03', 'LJP007_SKL.C_24H:A03', 'LJP008_SKL_24H:G10', 'LJP008_NPC.CAS9_24H:G12', 'LJP008_HEPG2_24H:G11', 'LJP008_CD34_24H:G09', 'LJP008_HCC515_24H:A03', 'LJP008_PC3_24H:G11', 'LPROT001_MCF7_6H:O11', 'LJP008_ASC_24H:G09', 'LJP008_NPC.CAS9_24H:G10', 'LJP008_MCF7_24H:G10', 'LJP008_CD34_24H:A03', 'LJP008_HA1E_24H:A03', 'LJP008_SKL_24H:G08', 'LJP008_HME1_24H:G10', 'LJP007_A375_24H:A03', 'LJP008_A375_24H:G08', 'LJP008_ASC.C_24H:A03', 'LJP008_NEU_24H:G11', 'LJP008_HME1_24H:G07', 'LJP008_HEPG2_24H:G10', 'LJP008_HT29_24H:G10', 'LJP008_ASC.C_24H:G11', 'LJP009_HME1_24H:A03', 'LJP008_HA1E_24H:G09', 'LPROT002_NPC.TAK_6H:O10', 'LPROT003_A549_6H:O10', 'LJP007_NPC_24H:A03', 'LPROT003_PC3_6H:O07', 'LJP008_SKL.C_24H:G10', 'LPROT002_MCF7_6H:P08', 'LJP007_CD34_24H:A03', 'LPROT003_NPC_6H:P11', 'LPROT002_MCF7_6H:P10', 'LJP008_PC3_24H:A03', 'LJP008_HUVEC_24H:G07', 'LPROT003_NPC_6H:P09', 'LJP008_HME1_24H:A03', 'LJP008_NPC.TAK_24H:G10', 'LJP008_MCF7_24H:G07', 'LJP008_HME1_24H:G09', 'LJP009_ASC_24H:A03', 'LJP008_ASC_24H:G08', 'LJP008_HT29_24H:G09', 'LJP007_HT29_24H:A03', 'LJP009_HUVEC_24H:A03', 'LPROT002_A549_6H:O09', 'LJP008_SKL_24H:G07', 'LJP008_NPC_24H:G09', 'LJP008_HCC515_24H:G12', 'LJP008_A549_24H:G09', 'LJP008_A549_24H:G08', 'LJP008_HEPG2_24H:G12', 
'LPROT002_MCF7_6H:P12', 'LJP008_HA1E_24H:G07', 'LJP008_HUVEC_24H:G12', 'LJP008_NPC.CAS9_24H:G11', 'LPROT003_A375_6H:P12', 'LJP008_NPC.TAK_24H:A03', 'LPROT003_A549_6H:O12', 'LJP007_HUES3_24H:A03', 'LPROT003_NPC_6H:P07', 'LJP008_ASC.C_24H:G08', 'LPROT001_A375_6H:P07', 'LJP008_HCC515_24H:G09', 'LJP009_HT29_24H:A03', 'LJP008_HT29_24H:G11', 'LJP009_HEPG2_24H:A03', 'LJP008_SKL.C_24H:G11', 'LJP008_A549_24H:G10', 'LJP008_ASC_24H:A03', 'LJP008_A549_24H:A03', 'LJP008_A375_24H:G10', 'LPROT001_NPC.TAK_6H:O12', 'LJP008_MCF7_24H:G08', 'LPROT002_A549_6H:O07', 'LPROT003_A549_6H:O08', 'LJP008_CD34_24H:G07', 'LPROT003_PC3_6H:O09', 'LJP007_SKL_24H:A03', 'LPROT001_PC3_6H:P08', 'LJP008_A375_24H:G09', 'LJP008_HT29_24H:A03', 'LJP008_ASC_24H:G07', 'LJP007_HUVEC_24H:A03', 'LJP008_HUVEC_24H:G10', 'LJP008_HCC515_24H:G10', 'LJP008_ASC_24H:G10', 'LPROT003_PC3_6H:O11', 'LJP008_HT29_24H:G07', 'LJP008_SKL.C_24H:G12', 'LJP008_NPC.CAS9_24H:G09', 'LJP008_MCF7_24H:A03', 'LJP007_HME1_24H:A03', 'LJP007_NPC.CAS9_24H:A03', 'LJP008_HA1E_24H:G12', 'LPROT002_A549_6H:O11', 'LJP007_ASC_24H:A03', 'LJP008_NPC.TAK_24H:G12', 'LJP009_ASC.C_24H:A03', 'LJP008_HEPG2_24H:G09', 'LJP008_NEU_24H:G07', 'LJP008_NPC_24H:G08', 'LPROT001_MCF7_6H:O09', 'LPROT003_A375_6H:P10', 'LPROT003_A375_6H:P08', 'LJP008_CD34_24H:G11', 'LJP009_PC3_24H:A03', 'LJP008_CD34_24H:G12', 'LJP008_A375_24H:G12', 'LJP009_HA1E_24H:A03', 'LJP007_A549_24H:A03', 'LPROT002_A375_6H:P11', 'LJP008_A375_24H:G11', 'LJP007_NPC.TAK_24H:A03', 'LJP008_HT29_24H:G12', 'LJP008_NPC_24H:A03', 'LJP009_NPC_24H:A03', 'LJP008_SKL.C_24H:G07', 'LJP008_HME1_24H:G12', 'LJP009_SKL.C_24H:A03', 'LJP008_NPC_24H:G11', 'LJP008_CD34_24H:G08', 'LJP009_NPC.CAS9_24H:A03', 'LJP008_PC3_24H:G12', 'LJP008_MCF7_24H:G11', 'LJP008_PC3_24H:G10', 'LJP008_ASC.C_24H:G12', 'LPROT001_PC3_6H:P10', 'LJP007_MCF7_24H:A03', 'LJP008_HCC515_24H:G11', 'LJP008_HUVEC_24H:A03', 'LJP009_HCC515_24H:A03', 'LJP007_HEPG2_24H:A03', 'LJP009_A549_24H:A03', 'LJP008_A549_24H:G07', 'LJP008_HA1E_24H:G11', 
'LJP008_PC3_24H:G08', 'LJP008_ASC.C_24H:G09', 'LJP008_SKL.C_24H:G08', 'LJP008_SKL_24H:A03', 'LJP009_A375_24H:A03', 'LJP008_CD34_24H:G10', 'LJP007_JURKAT_24H:A03', 'LJP008_MCF7_24H:G12', 'LJP008_HEPG2_24H:G07', 'LJP008_NPC.TAK_24H:G07', 'LJP007_ASC.C_24H:A03', 'LJP008_SKL_24H:G09', 'LPROT002_A375_6H:P09', 'LPROT001_MCF7_6H:O07', 'LJP008_A549_24H:G11', 'LJP009_SKL_24H:A03', 'LJP008_HME1_24H:G08', 'LJP008_HUVEC_24H:G09', 'LJP008_HME1_24H:G11', 'LJP008_SKL_24H:G11', 'LJP009_MCF7_24H:A03', 'LJP009_NPC.TAK_24H:A03', 'LJP008_SKL.C_24H:G09', 'LJP008_PC3_24H:G07', 'LJP008_HCC515_24H:G08', 'LJP008_NPC.CAS9_24H:G07', 'LJP008_NPC.TAK_24H:G09', 'LPROT001_A375_6H:P09', 'LJP007_NEU_24H:A03', 'LJP008_MCF7_24H:G09', 'LJP008_NPC_24H:G12', 'LJP008_NEU_24H:A03', 'LJP008_NPC.CAS9_24H:A03', 'LJP008_HUVEC_24H:G11', 'LJP008_NPC.CAS9_24H:G08', 'LJP008_HCC515_24H:G07', 'LJP008_NEU_24H:G10', 'LJP008_NEU_24H:G08', 'LJP008_A549_24H:G12', 'LJP008_NPC_24H:G10', 'LJP008_HUVEC_24H:G08', 'LJP008_NPC_24H:G07', 'LPROT002_A375_6H:P07', 'LJP007_PC3_24H:A03', 'LJP008_SKL.C_24H:A03', 'LJP008_HA1E_24H:G08', 'LJP008_ASC_24H:G11'}
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-5-38df3cb1c58c> in <module>()
      1 from cmapPy.pandasGEXpress.parse import parse
----> 2 vorinostat_only_gctoo = parse("GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx", cid=vorinostat_ids)

~/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse.py in parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
     66                               rid=rid, cid=cid, ridx=ridx, cidx=cidx,
     67                               row_meta_only=row_meta_only, col_meta_only=col_meta_only,
---> 68                               make_multiindex=make_multiindex)
     69 
     70     else:

~/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in parse(gctx_file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
    105 
    106         # validate optional input ids & get indexes to subset by
--> 107         (sorted_ridx, sorted_cidx) = check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta, col_meta)
    108 
    109         data_dset = gctx_file[data_node]

~/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta_df, col_meta_df)
    144     ordered_ridx = get_ordered_idx(row_type, row_ids, row_meta_df)
    145 
--> 146     col_ids = check_and_convert_ids(col_type, col_ids, col_meta_df)
    147     ordered_cidx = get_ordered_idx(col_type, col_ids, col_meta_df)
    148     return (ordered_ridx, ordered_cidx)

~/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_and_convert_ids(id_type, id_list, meta_df)
    177         if id_type == "id":
    178             id_list = convert_ids_to_meta_type(id_list, meta_df)
--> 179             check_id_validity(id_list, meta_df)
    180         else:
    181             check_idx_validity(id_list, meta_df)

~/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_id_validity(id_list, meta_df)
    193             mismatch_ids)
    194         logger.error(msg)
--> 195         raise Exception("parse_gctx check_id_validity " + msg)
    196 
    197 

Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:  {'LPROT001_PC3_6H:P12', 'LPROT001_NPC.TAK_6H:O08', 'LJP008_A375_24H:G07', 'LJP008_SKL_24H:G12', 'LJP007_HCC515_24H:A03', 'LJP008_NPC.TAK_24H:G08', 'LJP008_HEPG2_24H:G08', 'LJP008_A375_24H:A03', 'LJP008_NEU_24H:G09', 'LJP008_PC3_24H:G09', 'LJP009_CD34_24H:A03', 'LJP008_ASC.C_24H:G07', 'LJP008_NEU_24H:G12', 'LJP008_ASC.C_24H:G10', 'LJP008_NPC.TAK_24H:G11', 'LPROT002_NPC.TAK_6H:O12', 'LPROT002_NPC.TAK_6H:O08', 'LJP008_HT29_24H:G08', 'LJP008_ASC_24H:G12', 'LJP008_HA1E_24H:G10', 'LJP007_HA1E_24H:A03', 'LJP008_HEPG2_24H:A03', 'LPROT001_A375_6H:P11', 'LPROT001_NPC.TAK_6H:O10', 'LJP009_NEU_24H:A03', 'LJP007_MNEU.E_24H:A03', 'LJP007_SKL.C_24H:A03', 'LJP008_SKL_24H:G10', 'LJP008_NPC.CAS9_24H:G12', 'LJP008_HEPG2_24H:G11', 'LJP008_CD34_24H:G09', 'LJP008_HCC515_24H:A03', 'LJP008_PC3_24H:G11', 'LPROT001_MCF7_6H:O11', 'LJP008_ASC_24H:G09', 'LJP008_NPC.CAS9_24H:G10', 'LJP008_MCF7_24H:G10', 'LJP008_CD34_24H:A03', 'LJP008_HA1E_24H:A03', 'LJP008_SKL_24H:G08', 'LJP008_HME1_24H:G10', 'LJP007_A375_24H:A03', 'LJP008_A375_24H:G08', 'LJP008_ASC.C_24H:A03', 'LJP008_NEU_24H:G11', 'LJP008_HME1_24H:G07', 'LJP008_HEPG2_24H:G10', 'LJP008_HT29_24H:G10', 'LJP008_ASC.C_24H:G11', 'LJP009_HME1_24H:A03', 'LJP008_HA1E_24H:G09', 'LPROT002_NPC.TAK_6H:O10', 'LPROT003_A549_6H:O10', 'LJP007_NPC_24H:A03', 'LPROT003_PC3_6H:O07', 'LJP008_SKL.C_24H:G10', 'LPROT002_MCF7_6H:P08', 'LJP007_CD34_24H:A03', 'LPROT003_NPC_6H:P11', 'LPROT002_MCF7_6H:P10', 'LJP008_PC3_24H:A03', 'LJP008_HUVEC_24H:G07', 'LPROT003_NPC_6H:P09', 'LJP008_HME1_24H:A03', 'LJP008_NPC.TAK_24H:G10', 'LJP008_MCF7_24H:G07', 'LJP008_HME1_24H:G09', 'LJP009_ASC_24H:A03', 'LJP008_ASC_24H:G08', 'LJP008_HT29_24H:G09', 'LJP007_HT29_24H:A03', 'LJP009_HUVEC_24H:A03', 'LPROT002_A549_6H:O09', 'LJP008_SKL_24H:G07', 'LJP008_NPC_24H:G09', 'LJP008_HCC515_24H:G12', 'LJP008_A549_24H:G09', 
'LJP008_A549_24H:G08', 'LJP008_HEPG2_24H:G12', 'LPROT002_MCF7_6H:P12', 'LJP008_HA1E_24H:G07', 'LJP008_HUVEC_24H:G12', 'LJP008_NPC.CAS9_24H:G11', 'LPROT003_A375_6H:P12', 'LJP008_NPC.TAK_24H:A03', 'LPROT003_A549_6H:O12', 'LJP007_HUES3_24H:A03', 'LPROT003_NPC_6H:P07', 'LJP008_ASC.C_24H:G08', 'LPROT001_A375_6H:P07', 'LJP008_HCC515_24H:G09', 'LJP009_HT29_24H:A03', 'LJP008_HT29_24H:G11', 'LJP009_HEPG2_24H:A03', 'LJP008_SKL.C_24H:G11', 'LJP008_A549_24H:G10', 'LJP008_ASC_24H:A03', 'LJP008_A549_24H:A03', 'LJP008_A375_24H:G10', 'LPROT001_NPC.TAK_6H:O12', 'LJP008_MCF7_24H:G08', 'LPROT002_A549_6H:O07', 'LPROT003_A549_6H:O08', 'LJP008_CD34_24H:G07', 'LPROT003_PC3_6H:O09', 'LJP007_SKL_24H:A03', 'LPROT001_PC3_6H:P08', 'LJP008_A375_24H:G09', 'LJP008_HT29_24H:A03', 'LJP008_ASC_24H:G07', 'LJP007_HUVEC_24H:A03', 'LJP008_HUVEC_24H:G10', 'LJP008_HCC515_24H:G10', 'LJP008_ASC_24H:G10', 'LPROT003_PC3_6H:O11', 'LJP008_HT29_24H:G07', 'LJP008_SKL.C_24H:G12', 'LJP008_NPC.CAS9_24H:G09', 'LJP008_MCF7_24H:A03', 'LJP007_HME1_24H:A03', 'LJP007_NPC.CAS9_24H:A03', 'LJP008_HA1E_24H:G12', 'LPROT002_A549_6H:O11', 'LJP007_ASC_24H:A03', 'LJP008_NPC.TAK_24H:G12', 'LJP009_ASC.C_24H:A03', 'LJP008_HEPG2_24H:G09', 'LJP008_NEU_24H:G07', 'LJP008_NPC_24H:G08', 'LPROT001_MCF7_6H:O09', 'LPROT003_A375_6H:P10', 'LPROT003_A375_6H:P08', 'LJP008_CD34_24H:G11', 'LJP009_PC3_24H:A03', 'LJP008_CD34_24H:G12', 'LJP008_A375_24H:G12', 'LJP009_HA1E_24H:A03', 'LJP007_A549_24H:A03', 'LPROT002_A375_6H:P11', 'LJP008_A375_24H:G11', 'LJP007_NPC.TAK_24H:A03', 'LJP008_HT29_24H:G12', 'LJP008_NPC_24H:A03', 'LJP009_NPC_24H:A03', 'LJP008_SKL.C_24H:G07', 'LJP008_HME1_24H:G12', 'LJP009_SKL.C_24H:A03', 'LJP008_NPC_24H:G11', 'LJP008_CD34_24H:G08', 'LJP009_NPC.CAS9_24H:A03', 'LJP008_PC3_24H:G12', 'LJP008_MCF7_24H:G11', 'LJP008_PC3_24H:G10', 'LJP008_ASC.C_24H:G12', 'LPROT001_PC3_6H:P10', 'LJP007_MCF7_24H:A03', 'LJP008_HCC515_24H:G11', 'LJP008_HUVEC_24H:A03', 'LJP009_HCC515_24H:A03', 'LJP007_HEPG2_24H:A03', 'LJP009_A549_24H:A03', 
'LJP008_A549_24H:G07', 'LJP008_HA1E_24H:G11', 'LJP008_PC3_24H:G08', 'LJP008_ASC.C_24H:G09', 'LJP008_SKL.C_24H:G08', 'LJP008_SKL_24H:A03', 'LJP009_A375_24H:A03', 'LJP008_CD34_24H:G10', 'LJP007_JURKAT_24H:A03', 'LJP008_MCF7_24H:G12', 'LJP008_HEPG2_24H:G07', 'LJP008_NPC.TAK_24H:G07', 'LJP007_ASC.C_24H:A03', 'LJP008_SKL_24H:G09', 'LPROT002_A375_6H:P09', 'LPROT001_MCF7_6H:O07', 'LJP008_A549_24H:G11', 'LJP009_SKL_24H:A03', 'LJP008_HME1_24H:G08', 'LJP008_HUVEC_24H:G09', 'LJP008_HME1_24H:G11', 'LJP008_SKL_24H:G11', 'LJP009_MCF7_24H:A03', 'LJP009_NPC.TAK_24H:A03', 'LJP008_SKL.C_24H:G09', 'LJP008_PC3_24H:G07', 'LJP008_HCC515_24H:G08', 'LJP008_NPC.CAS9_24H:G07', 'LJP008_NPC.TAK_24H:G09', 'LPROT001_A375_6H:P09', 'LJP007_NEU_24H:A03', 'LJP008_MCF7_24H:G09', 'LJP008_NPC_24H:G12', 'LJP008_NEU_24H:A03', 'LJP008_NPC.CAS9_24H:A03', 'LJP008_HUVEC_24H:G11', 'LJP008_NPC.CAS9_24H:G08', 'LJP008_HCC515_24H:G07', 'LJP008_NEU_24H:G10', 'LJP008_NEU_24H:G08', 'LJP008_A549_24H:G12', 'LJP008_NPC_24H:G10', 'LJP008_HUVEC_24H:G08', 'LJP008_NPC_24H:G07', 'LPROT002_A375_6H:P07', 'LJP007_PC3_24H:A03', 'LJP008_SKL.C_24H:A03', 'LJP008_HA1E_24H:G08', 'LJP008_ASC_24H:G11'}```
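A workaround until the sig_info file and the .gctx agree again is to drop the missing ids before calling parse. The file's own column ids can be read first (e.g. via parse(..., col_meta_only=True)); the filtering itself is plain set logic (hypothetical helper name):

```python
def ids_present_in_file(requested_ids, file_col_ids):
    # keep the input order, but drop anything the file's column metadata
    # lacks, so check_id_validity never raises
    present = set(file_col_ids)
    return [i for i in requested_ids if i in present]


print(ids_present_in_file(["a", "missing", "b"], ["b", "a", "c"]))  # ['a', 'b']
```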

Bug for the 3.1.1 version

Just want to report a bug: this issue only happens with 3.1.1. I was stuck for a whole day...

from cmapPy.pandasGEXpress import parse
x=parse("/fs0/chenr6/Database_fs0/LINCS/GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx",col_meta_only=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'module' object is not callable
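The TypeError means the name parse is bound to the module, not the function: from cmapPy.pandasGEXpress import parse imports the parse module, and calling a module fails exactly this way. A self-contained miniature of the failure mode, with os.path standing in for cmapPy.pandasGEXpress.parse:

```python
import os.path as parse  # a *module* bound to the name `parse`

try:
    parse("some_file.gctx")
except TypeError as err:
    print(err)  # 'module' object is not callable
```

The fix is to import the function from inside the module: from cmapPy.pandasGEXpress.parse import parse.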

GCTX parsing is not thread safe.

Here is the code I'm using:
`import cmapPy.pandasGEXpress.parse_gctx as parse_gctx
import time
from threading import Thread

res = []
threads = []

def read(idx):
    print(f'Start reading {idx}')
    t = time.time()
    res.append(parse_gctx.parse('GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx', ridx=[idx]))
    t = (time.time() - t)
    print(f'Done reading {idx} in {t} seconds')

threads.append(Thread(target=read, args=(6000,)))
threads.append(Thread(target=read, args=(12000,)))
threads.append(Thread(target=read, args=(5000,)))
threads.append(Thread(target=read, args=(300,)))
threads.append(Thread(target=read, args=(40,)))
threads.append(Thread(target=read, args=(800,)))

all_t = time.time()

for t in threads:
    t.start()

for t in threads:
    t.join()

all_t = time.time() - all_t

print(f'The End in {all_t} seconds')

all_t = time.time()
res = []
for idx in [234, 4351, 6233, 9087, 987, 97]:
    read(idx)

all_t = time.time() - all_t

print(f'The End in {all_t} seconds')

And here is the output:
Start reading 12000
Start reading 6000
Start reading 5000
Start reading 800
Start reading 40
Start reading 300
Done reading 12000 in 337.7198541164398 seconds
Done reading 6000 in 337.7183690071106 seconds
Done reading 800 in 338.19431233406067 seconds
Done reading 300 in 338.36488699913025 seconds
Done reading 5000 in 339.04932618141174 seconds
Done reading 40 in 339.0456030368805 seconds
The End in 339.0754089355469 seconds
Start reading 234
Done reading 234 in 55.63448905944824 seconds
Start reading 4351
Done reading 4351 in 55.87116312980652 seconds
Start reading 6233
Done reading 6233 in 55.85987401008606 seconds
Start reading 9087
Done reading 9087 in 55.898045778274536 seconds
Start reading 987
Done reading 987 in 56.020151138305664 seconds
Start reading 97
Done reading 97 in 56.393441915512085 seconds
The End in 335.67835783958435 seconds

As you can see, it takes about 55 seconds to read one record when I read the records sequentially. When I create parallel threads, the whole batch takes about the same total time as reading sequentially, instead of roughly 55 seconds for all of the threads together.
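Threads are unlikely to help here: the per-read work runs under the GIL, and the HDF5 library (via h5py) serializes access with a global lock, so the six threads effectively take turns. Separate processes are the usual workaround. A self-contained sketch of the pattern, with a dummy read_row standing in for parse_gctx.parse(path, ridx=[idx]):

```python
from concurrent.futures import ProcessPoolExecutor


def read_row(idx):
    # stand-in for: parse_gctx.parse('....gctx', ridx=[idx]).data_df
    return idx * 2


def read_rows_in_parallel(indices):
    # each worker is a separate process with its own h5py/HDF5 state,
    # so reads are not serialized by the GIL or HDF5's global lock
    with ProcessPoolExecutor() as pool:
        return list(pool.map(read_row, indices))


if __name__ == "__main__":
    print(read_rows_in_parallel([6000, 12000, 5000]))
```

Batching is likely even cheaper: one call with all the indices, parse_gctx.parse(path, ridx=[...]), reads the metadata once instead of once per row.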

Cannot read .gctx file

Hi,
When I run this code:
from cmapPy.pandasGEXpress.parse import parse
data = parse("GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx", col_meta_only=True)
I get an error: "OSError: Unable to open file (truncated file: eof = 3244883968, sblock->base_addr = 0, stored_eof = 65110137212)"
How can I solve this problem?
Thank you,
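The error itself carries the diagnosis: the HDF5 superblock records stored_eof = 65110137212 bytes (~60 GiB), but the file on disk ends at 3244883968 (~3 GiB), i.e. the download is incomplete. A small check along these lines (hypothetical helper) confirms it before re-downloading:

```python
import os


def is_truncated(path, stored_eof):
    # a .gctx whose on-disk size is smaller than the size recorded in
    # its HDF5 superblock was not downloaded completely
    return os.path.getsize(path) < stored_eof
```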

pandasGEXpress.write_gctx id in gctx is null

In a Python 3 environment (not Python 2), if I create a GCToo object where, e.g., the index of data_df is an integer (not a string or a float), then calling write_gctx produces a gctx file where all of the row ID entries (/0/META/ROW/id) are the empty string ''.

The problem appears to be this line of code:

hdf5_out.create_dataset(metadata_node_name + "/id", data=[numpy.string_(x) for x in metadata_df.index],

numpy.string_(x) returns b'' for an integer. Note that it works fine if x is a str or a float.

For example, in Python 3 numpy.string_(3) returns b'', whereas in Python 2 it returns '3'.

I've submitted an issue to numpy about this behavior (numpy/numpy#13427), might make sense to wait to hear back from them before taking any action here.
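Until numpy resolves it, coercing each index entry through str before the bytes conversion sidesteps the problem. A sketch (hypothetical helper, not the cmapPy code):

```python
def index_to_hdf5_ids(index):
    # numpy.string_(3) yields b'' on Python 3, but going through str
    # first (here with a plain encode) round-trips any index type
    return [str(x).encode("utf-8") for x in index]


print(index_to_hdf5_ids([1, 2.5, "AIG1"]))  # [b'1', b'2.5', b'AIG1']
```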

parse_gctx takes too much memory when we inquire specific columns and rows

When loading a very large (~24 GB) gctx file on my laptop with 16 GB of RAM using the function cmapPy.pandasGEXpress.parse.parse, I run out of memory with the following error:

Unable to allocate array with shape (473647,) and data type

If I use cidx to select a very small number of columns, the error goes away.
However, when I request certain columns and certain rows, using both cidx and ridx, the same allocation error occurs. This indicates that the row filtering is applied first, followed by the column filtering. That is bad behaviour when dealing with very large CMap files; it would be preferable for both filters to be applied simultaneously, to avoid running out of RAM.

The problem comes from pandasGEXpress.parse_metadata_df, at the line curr_dset.read_direct(temp_array).
read_direct simply reads all of the rows/columns, with no means of filtering.
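h5py can already subset on read, so a combined row-and-column selection never has to materialize a full axis. A sketch of reading just the requested block (assuming the GCTX data node path /0/DATA/0/matrix; this is an illustration, not cmapPy's implementation):

```python
import h5py


def read_block(path, ridx, cidx):
    # h5py fancy indexing wants sorted, unique indices and allows only
    # one fancy axis per access, hence the two-step slice; only the
    # selected rows ever reach memory, never the whole matrix
    r = sorted(set(ridx))
    c = sorted(set(cidx))
    with h5py.File(path, "r") as f:
        dset = f["0/DATA/0/matrix"]
        return dset[r, :][:, c]
```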

Exception while parsing gct files

Hi, I wanted to parse a sample gct file, but a bug in the source code raises an exception during parsing. The problem is in read_version_and_dims() in parse_gct.py: it opens the file in binary mode, so version = f.readline().strip().lstrip("#") fails, because readline returns bytes while lstrip is given a str. I reckon that opening the file in text mode will solve it.
The exception is :
TypeError: a bytes-like object is required, not 'str' when writing to a file

write_gctx TypeError in Python 3.6

I noticed there are other known Python 3 compatibility issues and I'm not sure if this is related, but just in case,

import cmapPy.pandasGEXpress as GEX                                                                                                                                                                                                            
import cmapPy.pandasGEXpress.write_gctx as write_gctx

lincs_path = "path to the 'level 5' .gctx from GEO GSE70138"                                                                                                                           
gctoo_1 = GEX.parse(lincs_path)                                                                                                                                                                                                        
write_gctx.write(gctoo_1, "test.gctx") 

results in

Traceback (most recent call last):
  File "RGES/L1KGCT.py", line 173, in <module>
    write_gctx.write(gctoo_1, "test.gctx")
  File "/home/atwenzel/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/write_gctx.py", line 53, in write
    write_metadata(hdf5_out, "col", gctoo_object.col_metadata_df, convert_back_to_neg_666)
  File "/home/atwenzel/miniconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/write_gctx.py", line 124, in write_metadata
    hdf5_out.create_dataset(metadata_node_name + "/id", data=[str(x) for x in metadata_df.index])
  File "/home/atwenzel/miniconda3/lib/python3.6/site-packages/h5py/_hl/group.py", line 106, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/home/atwenzel/miniconda3/lib/python3.6/site-packages/h5py/_hl/dataset.py", line 100, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1530, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1552, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1613, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U26')
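h5py has no conversion path for numpy's fixed-width unicode dtype ('<U26' here), which is what a list of Python 3 str becomes. Encoding the ids to fixed-width bytes first avoids the error; a sketch with made-up data (in-memory file, GCTX-style node path):

```python
import io

import h5py
import numpy as np

ids = ["LJP007_A375_24H:A03", "LJP007_A549_24H:A03"]

buf = io.BytesIO()
with h5py.File(buf, "w") as f:
    # dtype='S' stores fixed-width bytes, which HDF5 understands;
    # passing the unicode strings directly raises the '<U26' TypeError
    f.create_dataset("0/META/COL/id", data=np.array(ids, dtype="S"))
```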

Error during parsing

I'm trying to read a gctx file downloaded from https://clue.io/data. This is my code:

from cmapPy.pandasGEXpress.parse_gct import parse
from cmapPy.pandasGEXpress.write_gct import write
parse("C:/Users/User/Desktop/gctx/testdatan1000978.gctx")

This is the error:
(base) C:\Users\User\Desktop\hello>D:/anaconda/python.exe c:/Users/User/Desktop/gctx/readgcxtx.py
The given path to the gct file cannot be found. gct_path: C:/Users/User/Desktop/gctx/testdatan1000978.gctx
Traceback (most recent call last):
File "c:/Users/User/Desktop/gctx/readgcxtx.py", line 3, in <module>
parse("C:/Users/User/Desktop/gctx/testdatan1000978.gctx")
File "D:\anaconda\lib\site-packages\cmapPy\pandasGEXpress\parse_gct.py", line 131, in parse
raise Exception(err_msg.format(file_path))
Exception: The given path to the gct file cannot be found. gct_path: C:/Users/User/Desktop/gctx/testdatan1000978.gctx

Error in parsing

When I tried to parse the CMap 2020 data, I got the following errors:

Traceback (most recent call last):
File "get_gct.py", line 17, in <module>
get_gct()
File "get_gct.py", line 13, in get_gct
goo = cp_p.parse('level5_beta_trt_cp_n720216x12328.gctx',cid=brd_cid)
File "/home/wuxiaolong/.conda/envs/my_cmapPy_env/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse.py", line 68, in parse
make_multiindex=make_multiindex)
File "/home/wuxiaolong/.conda/envs/my_cmapPy_env/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 110, in parse
data_df = parse_data_df(data_dset, sorted_ridx, sorted_cidx, row_meta, col_meta)
File "/home/wuxiaolong/.conda/envs/my_cmapPy_env/lib/python2.7/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 338, in parse_data_df
first_subset = data_dset[cidx, :].astype(np.float32)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1490027549092/work/h5py/_objects.c:2846)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1490027549092/work/h5py/_objects.c:2804)
File "/home/wuxiaolong/.conda/envs/my_cmapPy_env/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 474, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "/home/wuxiaolong/.conda/envs/my_cmapPy_env/lib/python2.7/site-packages/h5py/_hl/selections.py", line 90, in select
sel[args]
File "/home/wuxiaolong/.conda/envs/my_cmapPy_env/lib/python2.7/site-packages/h5py/_hl/selections.py", line 392, in __getitem__
mshape = list(count)
UnboundLocalError: local variable 'count' referenced before assignment
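This UnboundLocalError inside h5py's selections.py typically surfaces when the fancy-index list handed to data_dset[cidx, :] contains duplicates or isn't increasing, which older h5py rejects in this unhelpful way. De-duplicating and sorting the requested cids before calling parse is a likely workaround (hypothetical helper):

```python
def clean_idx(indices):
    # h5py point selections must be strictly increasing with no repeats
    return sorted(set(indices))


print(clean_idx([40, 7, 40, 12]))  # [7, 12, 40]
```

Duplicate sig_ids in brd_cid would produce duplicate column indices, so checking the input list is the first thing to try.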

Following tutorial yields error

Following the tutorial cmapPy_pandasGEXpress_tutorial.ipynb currently (2018-March-03) yields an error.

Since it uses an external data set, GEO GSE70138 (rather than a test set contained within cmapPy), it isn't clear whether this error reflects an update to, or a problem within, cmapPy, the tutorial, or GSE70138. (Besides making the tutorial impossible to follow, this error makes it difficult for new users to become familiar with gctx files / cmapPy.)

works: upper part of tutorial

import pandas as pd
sig_info = pd.read_csv("GSE70138_Broad_LINCS_sig_info.txt", sep="\t") # updated file name

vorinostat_ids = sig_info["sig_id"][sig_info["pert_iname"] == "vorinostat"]
# Let us additionally report on the data
print("number of samples treated with vorinostat:", len(vorinostat_ids))
print('\n---- show first ones for debugging ----')
[print(x) for x in vorinostat_ids.values[:5]];

number of samples treated with vorinostat: 210

---- show first ones for debugging ----
LJP007_A375_24H:A03
LJP007_A549_24H:A03
LJP007_ASC.C_24H:A03
LJP007_ASC_24H:A03
LJP007_CD34_24H:A03

creates error: loading of records

from cmapPy.pandasGEXpress import parse
vorinostat_only_gctoo = parse(
    "GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx",   # updated file name
    cid=vorinostat_ids)
/Users/tstoeger/apps/anaconda/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:  {'LJP009_HT29_24H:A03', 'LJP007_SKL.C_24H:A03', 'LJP008_PC3_24H:G08', 'LJP008_HCC515_24H:G07', 'LJP008_HA1E_24H:G07', 'LJP008_ASC_24H:G10', 'LJP008_NPC.CAS9_24H:A03', 'LJP008_SKL_24H:G08', 'LPROT003_PC3_6H:O11', 'LJP008_A549_24H:G09', 'LJP008_HCC515_24H:G10', 'LJP008_MCF7_24H:G07', 'LJP008_PC3_24H:G11', 'LJP008_HUVEC_24H:G11', 'LJP008_PC3_24H:A03', 'LJP009_A375_24H:A03', 'LJP008_ASC_24H:G11', 'LJP008_A549_24H:G11', 'LJP008_HEPG2_24H:G11', 'LJP008_HT29_24H:G07', 'LPROT003_A549_6H:O12', 'LJP008_HUVEC_24H:G07', 'LJP008_HUVEC_24H:G10', 'LJP008_HME1_24H:G12', 'LJP007_A375_24H:A03', 'LJP008_SKL.C_24H:G09', 'LJP008_NPC.CAS9_24H:G09', 'LJP007_HEPG2_24H:A03', 'LJP007_CD34_24H:A03', 'LPROT003_NPC_6H:P11', 'LJP008_HT29_24H:G12', 'LPROT001_A375_6H:P11', 'LJP008_HUVEC_24H:G12', 'LJP008_PC3_24H:G07', 'LJP008_ASC.C_24H:G11', 'LJP008_NEU_24H:G11', 'LJP008_SKL_24H:G10', 'LPROT003_A375_6H:P08', 'LPROT002_MCF7_6H:P12', 'LJP008_NEU_24H:G08', 'LJP008_HCC515_24H:G09', 'LJP008_ASC_24H:G08', 'LJP008_HME1_24H:A03', 'LJP008_NEU_24H:G09', 'LPROT001_PC3_6H:P10', 'LJP008_HEPG2_24H:G08', 'LJP008_HCC515_24H:A03', 'LJP009_SKL.C_24H:A03', 'LPROT003_A549_6H:O08', 'LJP009_HCC515_24H:A03', 'LJP008_ASC.C_24H:G10', 'LJP008_SKL.C_24H:G08', 'LJP008_CD34_24H:G12', 'LJP007_MCF7_24H:A03', 'LJP008_NPC_24H:G08', 'LJP008_SKL.C_24H:A03', 'LJP008_HEPG2_24H:G09', 'LJP008_HT29_24H:A03', 'LJP008_HA1E_24H:A03', 'LJP008_NPC_24H:G12', 'LJP008_A375_24H:G11', 'LJP009_CD34_24H:A03', 'LJP007_HME1_24H:A03', 'LJP009_MCF7_24H:A03', 'LJP008_A549_24H:G07', 'LJP008_NEU_24H:G12', 'LJP007_HT29_24H:A03', 'LJP008_HUVEC_24H:G08', 'LJP008_HUVEC_24H:A03', 'LJP008_A375_24H:G08', 'LJP008_HT29_24H:G10', 'LJP008_NPC.CAS9_24H:G11', 'LJP008_A375_24H:G09', 'LJP008_NEU_24H:G07', 'LJP008_SKL.C_24H:G10', 'LJP008_NEU_24H:A03', 'LJP009_NPC.CAS9_24H:A03', 'LPROT002_A549_6H:O09', 'LJP008_CD34_24H:G11', 'LJP008_NPC.CAS9_24H:G12', 
'LJP009_ASC_24H:A03', 'LJP008_ASC_24H:G09', 'LJP008_HA1E_24H:G08', 'LJP008_SKL_24H:G07', 'LPROT001_MCF7_6H:O11', 'LJP008_A375_24H:A03', 'LJP008_CD34_24H:G07', 'LJP008_NPC.TAK_24H:G08', 'LPROT001_MCF7_6H:O07', 'LJP008_ASC_24H:A03', 'LJP008_PC3_24H:G10', 'LPROT001_A375_6H:P07', 'LPROT003_A375_6H:P10', 'LJP009_ASC.C_24H:A03', 'LPROT002_NPC.TAK_6H:O10', 'LJP009_SKL_24H:A03', 'LJP008_HT29_24H:G08', 'LJP008_PC3_24H:G09', 'LJP008_HCC515_24H:G08', 'LJP008_HME1_24H:G07', 'LJP008_SKL.C_24H:G07', 'LJP008_ASC.C_24H:G07', 'LJP008_ASC.C_24H:G09', 'LJP008_A375_24H:G12', 'LPROT003_NPC_6H:P09', 'LJP008_HT29_24H:G09', 'LPROT001_MCF7_6H:O09', 'LJP009_HA1E_24H:A03', 'LPROT003_PC3_6H:O07', 'LJP008_CD34_24H:A03', 'LJP007_A549_24H:A03', 'LJP008_HA1E_24H:G11', 'LJP007_HUES3_24H:A03', 'LPROT002_A375_6H:P07', 'LJP008_CD34_24H:G08', 'LJP008_MCF7_24H:G11', 'LJP008_A549_24H:G08', 'LJP009_HEPG2_24H:A03', 'LPROT001_PC3_6H:P08', 'LPROT003_NPC_6H:P07', 'LJP008_HME1_24H:G10', 'LJP007_SKL_24H:A03', 'LJP008_HA1E_24H:G10', 'LJP008_PC3_24H:G12', 'LJP008_SKL_24H:G09', 'LPROT001_PC3_6H:P12', 'LJP008_ASC_24H:G07', 'LPROT002_A375_6H:P11', 'LPROT003_A375_6H:P12', 'LJP008_NPC.TAK_24H:G11', 'LJP009_HUVEC_24H:A03', 'LJP009_HME1_24H:A03', 'LJP008_HCC515_24H:G12', 'LJP007_MNEU.E_24H:A03', 'LJP008_SKL_24H:G12', 'LJP008_A375_24H:G10', 'LJP009_NPC_24H:A03', 'LJP008_CD34_24H:G09', 'LJP008_HME1_24H:G09', 'LJP008_NEU_24H:G10', 'LJP008_MCF7_24H:G10', 'LJP008_A549_24H:A03', 'LJP008_HEPG2_24H:A03', 'LJP008_HME1_24H:G08', 'LJP008_NPC_24H:G07', 'LJP008_NPC.CAS9_24H:G08', 'LPROT002_MCF7_6H:P08', 'LJP008_NPC_24H:G09', 'LPROT001_A375_6H:P09', 'LJP008_ASC.C_24H:G08', 'LJP009_PC3_24H:A03', 'LJP008_HT29_24H:G11', 'LJP008_MCF7_24H:A03', 'LJP007_ASC_24H:A03', 'LJP008_NPC.CAS9_24H:G07', 'LPROT002_A549_6H:O07', 'LJP009_NPC.TAK_24H:A03', 'LJP007_NPC.TAK_24H:A03', 'LJP008_HEPG2_24H:G12', 'LJP008_NPC.CAS9_24H:G10', 'LPROT002_NPC.TAK_6H:O12', 'LJP008_NPC.TAK_24H:G10', 'LJP008_SKL_24H:A03', 'LJP008_SKL.C_24H:G11', 
'LPROT001_NPC.TAK_6H:O10', 'LJP008_HCC515_24H:G11', 'LJP008_SKL.C_24H:G12', 'LJP008_ASC.C_24H:G12', 'LJP008_NPC_24H:A03', 'LJP007_NPC_24H:A03', 'LJP008_NPC.TAK_24H:G12', 'LPROT002_A549_6H:O11', 'LJP008_NPC.TAK_24H:A03', 'LJP008_HME1_24H:G11', 'LJP007_ASC.C_24H:A03', 'LJP008_MCF7_24H:G08', 'LJP007_HA1E_24H:A03', 'LJP008_MCF7_24H:G09', 'LJP008_ASC.C_24H:A03', 'LJP008_SKL_24H:G11', 'LJP008_A549_24H:G12', 'LPROT003_PC3_6H:O09', 'LJP007_HUVEC_24H:A03', 'LJP008_NPC_24H:G11', 'LPROT003_A549_6H:O10', 'LJP008_NPC.TAK_24H:G09', 'LJP008_HUVEC_24H:G09', 'LPROT001_NPC.TAK_6H:O08', 'LJP007_NEU_24H:A03', 'LJP008_NPC_24H:G10', 'LJP008_HA1E_24H:G09', 'LJP008_HEPG2_24H:G07', 'LJP008_A375_24H:G07', 'LJP008_MCF7_24H:G12', 'LJP008_NPC.TAK_24H:G07', 'LJP008_HEPG2_24H:G10', 'LPROT001_NPC.TAK_6H:O12', 'LJP007_JURKAT_24H:A03', 'LJP009_A549_24H:A03', 'LJP007_PC3_24H:A03', 'LPROT002_A375_6H:P09', 'LPROT002_NPC.TAK_6H:O08', 'LJP007_NPC.CAS9_24H:A03', 'LPROT002_MCF7_6H:P10', 'LJP008_HA1E_24H:G12', 'LJP009_NEU_24H:A03', 'LJP008_CD34_24H:G10', 'LJP007_HCC515_24H:A03', 'LJP008_ASC_24H:G12', 'LJP008_A549_24H:G10'}
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-f03c31e62771> in <module>()
      2 vorinostat_only_gctoo = parse(
      3     "GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328.gctx",   # updated file name
----> 4     cid=vorinostat_ids)

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse.py in parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
     60     elif file_path.endswith(".gctx"):
     61         curr = parse_gctx.parse(file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only,
---> 62                                 make_multiindex)
     63     else:
     64         err_msg = "File to parse must be .gct or .gctx!"

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in parse(gctx_file_path, convert_neg_666, rid, cid, ridx, cidx, row_meta_only, col_meta_only, make_multiindex)
    101 
    102         # validate optional input ids & get indexes to subset by
--> 103         (sorted_ridx, sorted_cidx) = check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta, col_meta)
    104 
    105         data_dset = gctx_file[data_node]

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta_df, col_meta_df)
    140     ordered_ridx = get_ordered_idx(row_type, row_ids, row_meta_df)
    141 
--> 142     col_ids = check_and_convert_ids(col_type, col_ids, col_meta_df)
    143     ordered_cidx = get_ordered_idx(col_type, col_ids, col_meta_df)
    144     return (ordered_ridx, ordered_cidx)

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_and_convert_ids(id_type, id_list, meta_df)
    173         if id_type == "id":
    174             id_list = convert_ids_to_meta_type(id_list, meta_df)
--> 175             check_id_validity(id_list, meta_df)
    176         else:
    177             check_idx_validity(id_list, meta_df)

~/apps/anaconda/anaconda3/lib/python3.6/site-packages/cmapPy/pandasGEXpress/parse_gctx.py in check_id_validity(id_list, meta_df)
    189             mismatch_ids)
    190         logger.error(msg)
--> 191         raise Exception("parse_gctx check_id_validity " + msg)
    192 
    193 

Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids:  {'LJP009_HT29_24H:A03', 'LJP007_SKL.C_24H:A03', 'LJP008_PC3_24H:G08', 'LJP008_HCC515_24H:G07', 'LJP008_HA1E_24H:G07', 'LJP008_ASC_24H:G10', 'LJP008_NPC.CAS9_24H:A03', 'LJP008_SKL_24H:G08', 'LPROT003_PC3_6H:O11', 'LJP008_A549_24H:G09', 'LJP008_HCC515_24H:G10', 'LJP008_MCF7_24H:G07', 'LJP008_PC3_24H:G11', 'LJP008_HUVEC_24H:G11', 'LJP008_PC3_24H:A03', 'LJP009_A375_24H:A03', 'LJP008_ASC_24H:G11', 'LJP008_A549_24H:G11', 'LJP008_HEPG2_24H:G11', 'LJP008_HT29_24H:G07', 'LPROT003_A549_6H:O12', 'LJP008_HUVEC_24H:G07', 'LJP008_HUVEC_24H:G10', 'LJP008_HME1_24H:G12', 'LJP007_A375_24H:A03', 'LJP008_SKL.C_24H:G09', 'LJP008_NPC.CAS9_24H:G09', 'LJP007_HEPG2_24H:A03', 'LJP007_CD34_24H:A03', 'LPROT003_NPC_6H:P11', 'LJP008_HT29_24H:G12', 'LPROT001_A375_6H:P11', 'LJP008_HUVEC_24H:G12', 'LJP008_PC3_24H:G07', 'LJP008_ASC.C_24H:G11', 'LJP008_NEU_24H:G11', 'LJP008_SKL_24H:G10', 'LPROT003_A375_6H:P08', 'LPROT002_MCF7_6H:P12', 'LJP008_NEU_24H:G08', 'LJP008_HCC515_24H:G09', 'LJP008_ASC_24H:G08', 'LJP008_HME1_24H:A03', 'LJP008_NEU_24H:G09', 'LPROT001_PC3_6H:P10', 'LJP008_HEPG2_24H:G08', 'LJP008_HCC515_24H:A03', 'LJP009_SKL.C_24H:A03', 'LPROT003_A549_6H:O08', 'LJP009_HCC515_24H:A03', 'LJP008_ASC.C_24H:G10', 'LJP008_SKL.C_24H:G08', 'LJP008_CD34_24H:G12', 'LJP007_MCF7_24H:A03', 'LJP008_NPC_24H:G08', 'LJP008_SKL.C_24H:A03', 'LJP008_HEPG2_24H:G09', 'LJP008_HT29_24H:A03', 'LJP008_HA1E_24H:A03', 'LJP008_NPC_24H:G12', 'LJP008_A375_24H:G11', 'LJP009_CD34_24H:A03', 'LJP007_HME1_24H:A03', 'LJP009_MCF7_24H:A03', 'LJP008_A549_24H:G07', 'LJP008_NEU_24H:G12', 'LJP007_HT29_24H:A03', 'LJP008_HUVEC_24H:G08', 'LJP008_HUVEC_24H:A03', 'LJP008_A375_24H:G08', 'LJP008_HT29_24H:G10', 'LJP008_NPC.CAS9_24H:G11', 'LJP008_A375_24H:G09', 'LJP008_NEU_24H:G07', 'LJP008_SKL.C_24H:G10', 'LJP008_NEU_24H:A03', 'LJP009_NPC.CAS9_24H:A03', 'LPROT002_A549_6H:O09', 
'LJP008_CD34_24H:G11', 'LJP008_NPC.CAS9_24H:G12', 'LJP009_ASC_24H:A03', 'LJP008_ASC_24H:G09', 'LJP008_HA1E_24H:G08', 'LJP008_SKL_24H:G07', 'LPROT001_MCF7_6H:O11', 'LJP008_A375_24H:A03', 'LJP008_CD34_24H:G07', 'LJP008_NPC.TAK_24H:G08', 'LPROT001_MCF7_6H:O07', 'LJP008_ASC_24H:A03', 'LJP008_PC3_24H:G10', 'LPROT001_A375_6H:P07', 'LPROT003_A375_6H:P10', 'LJP009_ASC.C_24H:A03', 'LPROT002_NPC.TAK_6H:O10', 'LJP009_SKL_24H:A03', 'LJP008_HT29_24H:G08', 'LJP008_PC3_24H:G09', 'LJP008_HCC515_24H:G08', 'LJP008_HME1_24H:G07', 'LJP008_SKL.C_24H:G07', 'LJP008_ASC.C_24H:G07', 'LJP008_ASC.C_24H:G09', 'LJP008_A375_24H:G12', 'LPROT003_NPC_6H:P09', 'LJP008_HT29_24H:G09', 'LPROT001_MCF7_6H:O09', 'LJP009_HA1E_24H:A03', 'LPROT003_PC3_6H:O07', 'LJP008_CD34_24H:A03', 'LJP007_A549_24H:A03', 'LJP008_HA1E_24H:G11', 'LJP007_HUES3_24H:A03', 'LPROT002_A375_6H:P07', 'LJP008_CD34_24H:G08', 'LJP008_MCF7_24H:G11', 'LJP008_A549_24H:G08', 'LJP009_HEPG2_24H:A03', 'LPROT001_PC3_6H:P08', 'LPROT003_NPC_6H:P07', 'LJP008_HME1_24H:G10', 'LJP007_SKL_24H:A03', 'LJP008_HA1E_24H:G10', 'LJP008_PC3_24H:G12', 'LJP008_SKL_24H:G09', 'LPROT001_PC3_6H:P12', 'LJP008_ASC_24H:G07', 'LPROT002_A375_6H:P11', 'LPROT003_A375_6H:P12', 'LJP008_NPC.TAK_24H:G11', 'LJP009_HUVEC_24H:A03', 'LJP009_HME1_24H:A03', 'LJP008_HCC515_24H:G12', 'LJP007_MNEU.E_24H:A03', 'LJP008_SKL_24H:G12', 'LJP008_A375_24H:G10', 'LJP009_NPC_24H:A03', 'LJP008_CD34_24H:G09', 'LJP008_HME1_24H:G09', 'LJP008_NEU_24H:G10', 'LJP008_MCF7_24H:G10', 'LJP008_A549_24H:A03', 'LJP008_HEPG2_24H:A03', 'LJP008_HME1_24H:G08', 'LJP008_NPC_24H:G07', 'LJP008_NPC.CAS9_24H:G08', 'LPROT002_MCF7_6H:P08', 'LJP008_NPC_24H:G09', 'LPROT001_A375_6H:P09', 'LJP008_ASC.C_24H:G08', 'LJP009_PC3_24H:A03', 'LJP008_HT29_24H:G11', 'LJP008_MCF7_24H:A03', 'LJP007_ASC_24H:A03', 'LJP008_NPC.CAS9_24H:G07', 'LPROT002_A549_6H:O07', 'LJP009_NPC.TAK_24H:A03', 'LJP007_NPC.TAK_24H:A03', 'LJP008_HEPG2_24H:G12', 'LJP008_NPC.CAS9_24H:G10', 'LPROT002_NPC.TAK_6H:O12', 'LJP008_NPC.TAK_24H:G10', 
'LJP008_SKL_24H:A03', 'LJP008_SKL.C_24H:G11', 'LPROT001_NPC.TAK_6H:O10', 'LJP008_HCC515_24H:G11', 'LJP008_SKL.C_24H:G12', 'LJP008_ASC.C_24H:G12', 'LJP008_NPC_24H:A03', 'LJP007_NPC_24H:A03', 'LJP008_NPC.TAK_24H:G12', 'LPROT002_A549_6H:O11', 'LJP008_NPC.TAK_24H:A03', 'LJP008_HME1_24H:G11', 'LJP007_ASC.C_24H:A03', 'LJP008_MCF7_24H:G08', 'LJP007_HA1E_24H:A03', 'LJP008_MCF7_24H:G09', 'LJP008_ASC.C_24H:A03', 'LJP008_SKL_24H:G11', 'LJP008_A549_24H:G12', 'LPROT003_PC3_6H:O09', 'LJP007_HUVEC_24H:A03', 'LJP008_NPC_24H:G11', 'LPROT003_A549_6H:O10', 'LJP008_NPC.TAK_24H:G09', 'LJP008_HUVEC_24H:G09', 'LPROT001_NPC.TAK_6H:O08', 'LJP007_NEU_24H:A03', 'LJP008_NPC_24H:G10', 'LJP008_HA1E_24H:G09', 'LJP008_HEPG2_24H:G07', 'LJP008_A375_24H:G07', 'LJP008_MCF7_24H:G12', 'LJP008_NPC.TAK_24H:G07', 'LJP008_HEPG2_24H:G10', 'LPROT001_NPC.TAK_6H:O12', 'LJP007_JURKAT_24H:A03', 'LJP009_A549_24H:A03', 'LJP007_PC3_24H:A03', 'LPROT002_A375_6H:P09', 'LPROT002_NPC.TAK_6H:O08', 'LJP007_NPC.CAS9_24H:A03', 'LPROT002_MCF7_6H:P10', 'LJP008_HA1E_24H:G12', 'LJP009_NEU_24H:A03', 'LJP008_CD34_24H:G10', 'LJP007_HCC515_24H:A03', 'LJP008_ASC_24H:G12', 'LJP008_A549_24H:G10'}

'utf-8' codec Error when parsing gctx

Here is my code:
"
from cmapPy.pandasGEXpress.parse_gct import parse
f='./GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx'
x=parse(f)
"
It shows:
"
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
"

Version:
cmapPy 4.0
python 3.8.2

Any suggestions? Thanks.

error in 'test_genes_queries.py'

I tried to run 'test_genes_queries.py' but got the error below:

NoSectionError Traceback (most recent call last)
/home/fadl2/cmapPy/cmapPy/clue_api_client/tests/test_genes_queries.py in ()
40 setup_logger.setup(verbose=True)
41
---> 42 cao = test_clue_api_client.build_clue_api_client_from_default_test_config()
43
44 unittest.main()

/home/fadl2/cmapPy/cmapPy/clue_api_client/tests/test_clue_api_client.py in build_clue_api_client_from_default_test_config()
125 #cao = clue_api_client.ClueApiClient(base_url=cfg.get(config_section, "clue_api_url"),
126 # user_key=cfg.get(config_section, "clue_api_user_key"))
--> 127 cao = clue_api_client.ClueApiClient(base_url=cfg.get(config_section, " https://api.clue.io/api/genes"),
128 user_key=cfg.get(config_section, "87474c084256d13f140eaa3227ab48b2"))
129 return cao

/home/fadl2/anaconda3/envs/my_cmapPy_env2/lib/python2.7/ConfigParser.pyc in get(self, section, option)
328 if section not in self._sections:
329 if section != DEFAULTSECT:
--> 330 raise NoSectionError(section)
331 if opt in self._defaults:
332 return self._defaults[opt]

NoSectionError: No section: 'test'

In "clue_api_client.py" I use these two lines:

 self.base_url = "https://api.clue.io/api/genes"
 self.headers = {"user_key": my_key_ID}
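For what it's worth, the NoSectionError means the config file the tests read has no [test] section. Judging from the commented-out lines in test_clue_api_client.py, cfg.get(section, option) expects option *names* ("clue_api_url", "clue_api_user_key"), not the URL or key themselves, and those options are read from a config file that might look like this (values are placeholders):

```ini
[test]
clue_api_url = https://api.clue.io/api
clue_api_user_key = <your_user_key>
```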

Cannot write back GCT file that has been parsed

I am trying a simple test: reading in an existing GCT file and then writing it back, but the write fails. The same GCT file works in cmapR. Thank you for your help!

I'm using Python 3.8.9.

Here's the code:

from cmapPy.pandasGEXpress.parse import parse as parse_gct
from cmapPy.pandasGEXpress.write_gct import write as write_gct

plate34_dda = parse_gct("LINCS_P100_DIA_Plate34_annotated_minimized_2018-05-02_19-56-02.processed.gct")
write_gct(plate34_dda, "test_lvl4")  # this fails

Here's the error:

---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
Input In [16], in <cell line: 1>()
----> 1 write_gct(plate34_dda, "test_lvl4")

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/cmapPy/pandasGEXpress/write_gct.py:43, in write(gctoo, out_fname, data_null, metadata_null, filler_null, data_float_format)
     40 write_version_and_dims(VERSION, dims, f)
     42 # Write top half of the gct
---> 43 write_top_half(f, gctoo.row_metadata_df, gctoo.col_metadata_df,
     44                metadata_null, filler_null)
     46 # Write bottom half of the gct
     47 write_bottom_half(f, gctoo.row_metadata_df, gctoo.data_df,
     48                   data_null, data_float_format, metadata_null)

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/cmapPy/pandasGEXpress/write_gct.py:98, in write_top_half(f, row_metadata_df, col_metadata_df, metadata_null, filler_null)
     95 col_metadata_indices = (range(1, top_half_df.shape[0]),
     96                         range(1 + row_metadata_df.shape[1], top_half_df.shape[1]))
     97 # pd.DataFrame.at to insert into dataframe(python3)
---> 98 top_half_df.at[col_metadata_indices[0], col_metadata_indices[1]] = (
     99     col_metadata_df.astype(str).replace("nan", value=metadata_null).T.values)
    101 # Write top_half_df to file
    102 top_half_df.to_csv(f, header=False, index=False, sep="\t")

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/pandas/core/indexing.py:2273, in _AtIndexer.__setitem__(self, key, value)
   2270     self.obj.loc[key] = value
   2271     return
-> 2273 return super().__setitem__(key, value)

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/pandas/core/indexing.py:2228, in _ScalarAccessIndexer.__setitem__(self, key, value)
   2225 if len(key) != self.ndim:
   2226     raise ValueError("Not enough indexers for scalar access (setting)!")
-> 2228 self.obj._set_value(*key, value=value, takeable=self._takeable)

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/pandas/core/frame.py:3870, in DataFrame._set_value(self, index, col, value, takeable)
   3867     series._set_value(index, value, takeable=True)
   3868     return
-> 3870 series = self._get_item_cache(col)
   3871 loc = self.index.get_loc(index)
   3872 dtype = series.dtype

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/pandas/core/frame.py:3939, in DataFrame._get_item_cache(self, item)
   3934 res = cache.get(item)
   3935 if res is None:
   3936     # All places that call _get_item_cache have unique columns,
   3937     #  pending resolution of GH#33047
-> 3939     loc = self.columns.get_loc(item)
   3940     res = self._ixs(loc, axis=1)
   3942     cache[item] = res

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/pandas/core/indexes/range.py:388, in RangeIndex.get_loc(self, key, method, tolerance)
    386         except ValueError as err:
    387             raise KeyError(key) from err
--> 388     self._check_indexing_error(key)
    389     raise KeyError(key)
    390 return super().get_loc(key, method=method, tolerance=tolerance)

File ~/.local/share/virtualenvs/alphamap-6YdFO0zX/lib/python3.8/site-packages/pandas/core/indexes/base.py:5637, in Index._check_indexing_error(self, key)
   5633 def _check_indexing_error(self, key):
   5634     if not is_scalar(key):
   5635         # if key is not a scalar, directly raise an error (the code below
   5636         # would convert to numpy arrays and raise later any way) - GH29926
-> 5637         raise InvalidIndexError(key)

InvalidIndexError: range(7, 103)

pandasGEXpress all rid, cid default as strings in HDF5

try these for starters - I think one has them as ints, the other as strings:
/cmap/projects/GEO_deposition/2017-01/combined_matrices/GSE70138_Broad_LINCS_Level2_GEX_n345976x978.gctx
/cmap/projects/M1/DATASETS/GEO/matrices/GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx
--DL

Solving environment: failed

I am trying to install cmappy following the instructions in this repo with the current Anaconda:
conda create --name my_cmapPy_env -c bioconda python=2.7.11 numpy=1.11.2 pandas=0.20.3 h5py=2.7.0 requests==2.13.0 cmappy
, however:

PackagesNotFoundError: The following packages are not available from current channels:

  - numpy=1.11.2
  - requests==2.13.0
  - python=2.7.11

Please advise how to proceed with the installation!

parse_gctx: don't sort returned values

Hi @oena @levlitichev

I was thinking about doing a pull request that modifies parse_gctx to not return the dataframes sorted by index/column. The reason I propose this: if the dataframes come back in the order they appear in the file, you can inspect them, choose the ids you are interested in, look up their integer indices, and then use the ridx/cidx options to load just those, which is much faster.

Also, sorting could be kept as an option. What do you think?
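Translating file-order ids into the positional indices described above is straightforward with pandas. A minimal sketch (the ids here are made up; in a real session col_meta would come from parse(gctx_path, col_meta_only=True)):

```python
import pandas as pd

# Stand-in for column metadata in file order, e.g. from
# parse(gctx_path, col_meta_only=True); ids here are made up.
col_meta = pd.DataFrame(index=["sig_c", "sig_a", "sig_b"])

# Ids chosen after inspecting the metadata.
wanted = ["sig_b", "sig_c"]

# Integer positions in file order, usable as parse(..., cidx=cidx)
cidx = [col_meta.index.get_loc(i) for i in wanted]
```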

Weird performance when parsing a gctx file

In the Jupyter notebook cmapPy_pandasGEXpress_tutorial.ipynb, I tried the following code:

from cmapPy.pandasGEXpress import parse
gctx_fn="GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx"
%time my_col_metadata = parse(gctx_fn, meta_only=True)
%time gctx = parse(gctx_fn)

But the result on my computer is strange: parse(gctx_fn, meta_only=True) is much slower than parse(gctx_fn). Here are the timings on my side.

%time my_col_metadata = parse(gctx_fn, meta_only=True)
CPU times: user 4.06 s, sys: 462 ms, total: 4.52 s
Wall time: 4.52 s

%time gctx = parse(gctx_fn)
CPU times: user 67.2 ms, sys: 166 ms, total: 234 ms
Wall time: 887 ms

get row id in cmappy in gctx Files

I want to get row ids in cmapPy the way I did in l1ktools, with code like this:
GTEx = map(lambda x: x.split('.')[0], GTEx_gctobj.get_rids())
How do I do this in cmapPy?
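In pandasGEXpress the parsed object is a GCToo, and the row ids are the index of its data_df (and row_metadata_df), so the l1ktools snippet above translates to a list comprehension. A sketch, with a mock frame standing in for gctoo.data_df (note that in Python 3 map returns an iterator, hence the comprehension):

```python
import pandas as pd

# Mock frame standing in for gctoo.data_df, where in a real session
# gctoo = parse("GTEx.gctx"); the gene ids here are made up.
df = pd.DataFrame({"s1": [1.0, 2.0]},
                  index=["ENSG000001.5", "ENSG000002.1"])

# Equivalent of map(lambda x: x.split('.')[0], gctobj.get_rids())
GTEx = [x.split('.')[0] for x in df.index]
```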

No module named cmap.io.gct

I'm trying to read in a .gctx file in Python and write it as a .gct file. If I follow the instructions in the GitHub readme, I run the following code:
#!/usr/bin/env python
import sys
import numpy as np
import cmap.io.gct as gct
def main():
    infile = sys.argv[1]
    outfile = sys.argv[2]

    gctobj = gct.GCT(infile)
    gctobj.read()

    data = gctobj.matrix[:, :].astype('float64')

    np.save(outfile, data)

if __name__ == '__main__':
    main()

Then, I get the following error:
No module named cmap.io.gct

how I create a conda environment for cmapPy and python 3

Not sure if this is useful for others / worth including in the docs, but here's how I create a conda environment for cmapPy in Python 3. I mention this b/c I have done it other ways and ended up not having the hdf5 command line tools in the environment (which I find useful).

  1. conda create -n cmapPy3 python=3 scikit-learn scipy numpy seaborn matplotlib statsmodels pandas jupyter sympy h5py
  2. conda activate cmapPy3
  3. pip install cmappy
  4. try it out! (you should have the hdf5 command line tools)

(NB: not installing h5py via conda, and instead letting pip install it as a dependency of cmappy, left me without the command line tools for whatever reason.) Obviously this environment also includes some other analysis libraries/tools I find useful.

mismatch between metadata and gctx

I am trying to parse the:

1-GSE70138_Broad_LINCS_Level3_INF_mlr12k_n345976x12328_2017-03-06.gctx.gz
2-GSE70138_Broad_LINCS_Level3_INF_mlr12k_n78980x22268_2015-06-30.gct.gz
3-GSE70138_Broad_LINCS_Level4_ZSPCINF_mlr12k_n113012x22268_2015-12-31.gct.gz

files with:

1-GSE70138_Broad_LINCS_sig_info_2017-03-06.txt.gz
or
2-GSE70138_Broad_LINCS_inst_info_2017-03-06.txt.gz

metadata files. I am trying to subset the files to make the process feasible and easier to handle.

import pandas as pd
sig_info = pd.read_csv("GSE70138_Broad_LINCS_sig_info_2017-03-06.txt", sep="\t")
mcf7_cell = sig_info["pert_id"][sig_info["cell_id"] == "MCF7"][sig_info["pert_idose"]=="10.0 um"][sig_info["pert_itime"]=="24 h"]
from cmapPy.pandasGEXpress.parse import parse
MCF7_details = parse("GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx", cid=mcf7_cell)

Each time I do this with the:
GSE70138_Broad_LINCS_Level3_INF_mlr12k_n345976x12328_2017-03-06.gctx.gz
I see an error:

some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids: {'neratinib'}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse.py", line 65, in parse
out = parse_gctx.parse(file_path, convert_neg_666=convert_neg_666,
File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 107, in parse
(sorted_ridx, sorted_cidx) = check_and_order_id_inputs(rid, ridx, cid, cidx, row_meta, col_meta)
File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 146, in check_and_order_id_inputs
col_ids = check_and_convert_ids(col_type, col_ids, col_meta_df)
File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 179, in check_and_convert_ids
check_id_validity(id_list, meta_df)
File "/home/sysmedicine/anaconda3/envs/my_conda/lib/python3.8/site-packages/cmapPy/pandasGEXpress/parse_gctx.py", line 195, in check_id_validity
raise Exception("parse_gctx check_id_validity " + msg)
Exception: parse_gctx check_id_validity some of the ids being used to subset the data are not present in the metadata for the file being parsed - mismatch_ids: {'neratinib'}

How can I fix this problem???
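One observation: the mismatch id ('neratinib') is a pert_id, while cid= expects the sig_id values that label the matrix columns. Combining the filters with & and selecting sig_id would look roughly like this (mock rows and made-up sig_ids standing in for the real sig_info file):

```python
import pandas as pd

# Mock of a few sig_info rows; in a real session this comes from
# pd.read_csv("GSE70138_Broad_LINCS_sig_info_2017-03-06.txt", sep="\t").
# The sig_id values below are hypothetical.
sig_info = pd.DataFrame({
    "sig_id": ["LJP001_MCF7_24H:A01", "LJP001_PC3_24H:A01"],
    "pert_id": ["neratinib", "neratinib"],
    "cell_id": ["MCF7", "PC3"],
    "pert_idose": ["10.0 um", "10.0 um"],
    "pert_itime": ["24 h", "24 h"],
})

# Combine conditions with & instead of chained indexing.
mask = ((sig_info["cell_id"] == "MCF7")
        & (sig_info["pert_idose"] == "10.0 um")
        & (sig_info["pert_itime"] == "24 h"))

# Select sig_id (the column labels of the matrix), not pert_id.
mcf7_sig_ids = sig_info.loc[mask, "sig_id"].tolist()
# mcf7_sig_ids can then be passed as cid= to parse(...)
```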

Parsing GCT files fails in python3

I don't know whether the code has been made Python 3 compatible. Here is an example that reproduces this issue in Python 3. Link to input file gct file

from cmapPy.pandasGEXpress.parse import parse
parse("gct_file_path")

line 168, in read_version_and_dims
version = f.readline().strip().lstrip("#")
TypeError: a bytes-like object is required, not 'str'

The problem is that my GCT file is read as bytes, so I would have to lstrip(b'#') when applying it to a bytes object. I can perhaps fix this and make the parsing compatible with both bytes ('bytes' type in Python 3) and unicode ('str' type in Python 3) strings.
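A sketch of such a compatibility fix (a hypothetical helper, not the library's actual code): decode bytes input first, then strip as before.

```python
def read_version(line):
    """Strip whitespace and a leading '#' from a GCT version line,
    accepting either bytes (file opened in binary mode) or str input."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    return line.strip().lstrip("#")
```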

Problems parsing gctx due to meta_only

Hi there.
I'm unable to parse a .gctx file.
I think that's because line 51 in parse.py calls parse_gctx.parse with an argument, meta_only, which is not defined.
Thanks

Convert .gctx file to .gct

this is my code for that purpose:

import sys
from cmapPy.pandasGEXpress import gctx2gct


def main():
    gctx2gct.gctx2gct_main(sys.argv)


if __name__ == '__main__':
    main()

and when I run it from the console I get the error below:

Traceback (most recent call last):
  File "gct2npy.py", line 25, in <module>
    main()
  File "gct2npy.py", line 8, in main
    gctx2gct.gctx2gct_main(sys.argv)
  File "C:\Users\Farshid\AppData\Local\Programs\Python\Python35\lib\site-packages\cmapPy\pandasGEXpress\gctx2gct.py", line 51, in gctx2gct_main
    in_gctoo = parse_gctx.parse(args.filename, convert_neg_666=False)
AttributeError: 'list' object has no attribute 'filename'

Indeed, I want to write the cmapPy equivalent of the code below (from the deprecated version of cmap):

import sys
import numpy as np
import cmap.io.gct as gct


def main():
    infile = sys.argv[1]
    outfile = sys.argv[2]
    
    gctobj = gct.GCT(infile)
    gctobj.read()
    
    data = gctobj.matrix[:, :].astype('float64')
    
    np.save(outfile, data)
    
        
if __name__ == '__main__':
    main()

Subsetting picks wrong dimension to subset over first

There is some logic in parse_data_df (pasted below) that attempts to pick the best dimension to subset over first, but it isn't quite right.

def parse_data_df(data_dset, ridx, cidx, row_meta, col_meta):
    if len(ridx) == len(row_meta.index) and len(cidx) == len(col_meta.index):  # no subset
        data_array = np.empty(data_dset.shape, dtype=np.float32)
        data_dset.read_direct(data_array)
        data_array = data_array.transpose()
    elif len(ridx) <= len(cidx):
        first_subset = data_dset[:, ridx].astype(np.float32)
        data_array = first_subset[cidx, :].transpose()
    elif len(cidx) < len(ridx):
        first_subset = data_dset[cidx, :].astype(np.float32)
        data_array = first_subset[:, ridx].transpose()

For example, imagine you're parsing a .gctx with 720216 cols and 12328 rows, and you want to pull out 20000 columns.

The subset logic is going to try to subset on rows first because 12328 < 20000. But we're not even subsetting on rows here, that is all of them. This results in the entire array getting temporarily allocated into memory.

The better heuristic is to minimize the size of the intermediate array - you want to pick the minimum of: len(ridx) * len(col_meta.index) vs len(cidx) * len(row_meta.index)
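That heuristic could be sketched as a small helper (hypothetical, not the library code): compare the size of the intermediate array each order would produce and subset the cheaper axis first.

```python
def choose_first_subset_axis(n_ridx, n_cidx, n_rows_total, n_cols_total):
    """Pick the axis to subset first so the intermediate array is smallest.

    Subsetting rows first yields an intermediate of n_ridx * n_cols_total
    values; subsetting columns first yields n_cidx * n_rows_total.
    Returns "rows" or "cols" accordingly.
    """
    rows_first = n_ridx * n_cols_total
    cols_first = n_cidx * n_rows_total
    return "rows" if rows_first <= cols_first else "cols"
```

With the example above (12328 rows, 720216 cols, pulling 20000 columns and all rows), this picks the column subset first, since 20000 * 12328 is far smaller than 12328 * 720216.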

write_gct adds in the index column

I'm using write_gct to output a GCT file. GCT files normally have the following columns: name, description, sample1, sample2...

But when I use write_gct, it has the following columns: id, name, description, sample1, sample2...

I've tried dropping the index column from my dataframe, but that doesn't fix the problem. My current code is as follows:

       gctoo = GCToo(expression_df)
       write_gct.write(gctoo, output_filename) 

If I was writing the file myself, directly from the Pandas data frame, I would simply use:

expression_df.to_csv(output_filename, sep='\t', index=False)

Unfortunately, there doesn't seem to be an option in write_gct to accomplish this. Any guidance would be appreciated.
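If GCToo's metadata handling gets in the way, a minimal GCT 1.2 file can be written with pandas alone. This is only a sketch of the workaround described above, not a cmapPy feature, and it assumes expression_df's first two columns are Name and Description:

```python
import pandas as pd

def write_simple_gct(expression_df, path):
    """Write a minimal GCT 1.2 file: a version line, a dimensions line,
    then the table itself. Assumes the first two columns of
    expression_df are Name and Description."""
    n_rows = expression_df.shape[0]
    n_samples = expression_df.shape[1] - 2  # minus Name, Description
    with open(path, "w") as f:
        f.write("#1.2\n")
        f.write("{}\t{}\n".format(n_rows, n_samples))
        expression_df.to_csv(f, sep="\t", index=False)
```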
