matteobertozzi / hadoop

Hadoop (Utilities, Patches and Examples)

Home Page: http://th30z.blogspot.com

C 3.63% Java 16.94% Python 72.22% Shell 7.22%


hadoop's Issues

Does the SequenceFile.Reader support LzoCodec?

I have a sequence file compressed with LzoCodec that I am unable to read through the module.

from hadoop.io import SequenceFile
fh='/home/ekta/my_file'
reader = SequenceFile.Reader(fh)

The first few lines of the file I am trying to read:

SEQ org.apache.hadoop.io.Text com.bloomreach.proto.PwfPixelLog #com.hadoop.compression.lzo.LzoCodec [binary data follows]
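The readable part of that dump is the SequenceFile header, which names the key class, the value class, and the codec. A minimal sketch of decoding it by hand, assuming the version 5 or later header layout and class names shorter than 128 bytes (so each length prefix is a single byte); `read_short_string` and `read_header` are hypothetical helpers, not part of python-hadoop:

```python
import io
import struct


def read_short_string(stream):
    """Read a string whose length prefix fits in one signed byte (0..127)."""
    (length,) = struct.unpack("b", stream.read(1))
    if length < 0:
        # Longer strings use a multi-byte length prefix, not handled here.
        raise ValueError("multi-byte length prefix not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_header(stream):
    """Decode the leading fields of a SequenceFile header (version >= 5 assumed)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile")
    header = {"version": stream.read(1)[0]}
    header["key_class"] = read_short_string(stream)
    header["value_class"] = read_short_string(stream)
    header["compressed"] = stream.read(1) != b"\x00"
    header["block_compressed"] = stream.read(1) != b"\x00"
    # The codec class name is only present when the file is compressed.
    header["codec_class"] = read_short_string(stream) if header["compressed"] else None
    return header
```

Run against the file above, this would be expected to report `com.hadoop.compression.lzo.LzoCodec` as the codec class, which is the name the reader then tries to resolve.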

It seems to me that it is searching for a decompressor but is unable to find one. If this is supported, what am I doing wrong?
Also, I installed hadoop-lzo from https://github.com/twitter/hadoop-lzo - though I see that the
com.hadoop.compression.lzo

Traceback (most recent call last):
  File "/home/ekta/CUSTOM_WORK/protobuf.py", line 3, in <module>
    reader = SequenceFile.Reader(fh)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/SequenceFile.py", line 288, in __init__
    self._initialize(path, start, length)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/SequenceFile.py", line 478, in _initialize
    self._codec = CodecPool().getDecompressor(codec_class)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/compress/CodecPool.py", line 34, in getDecompressor
    codec_class = ReflectionUtils.hadoopClassFromName(class_path)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/util/ReflectionUtils.py", line 24, in hadoopClassFromName
    return classFromName(class_path)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/util/ReflectionUtils.py", line 44, in classFromName
    module = __import__(module_name, globals(), locals(), [str(class_name)], -1)
ImportError: No module named com.hadoop.compression.lzo

In the hadoop-lzo package, I do see "com.hadoop.compression.lzo". Is the program unable to find this class in hadoop-lzo? In dist-packages I have Hadoop-0.1.4-py2.7.egg, _lzo.so*, lzo.py, and python_lzo-1.0.egg-info.

I believe that com.hadoop.compression.lzo.LzoCodec.java might be needed to read my file as above?

:~/Downloads/hadoop-lzo$ tree
[..more ]
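The traceback ends in an `__import__` call, which suggests the reader turns the codec class name from the file header into a Python module path. Under that reading, a Java package such as hadoop-lzo cannot satisfy the lookup, because Python needs an importable module called `com.hadoop.compression.lzo`. The sketch below reproduces the mechanism with the standard library only; `class_from_name` and `find_codec` are hypothetical names, and the real `ReflectionUtils` may differ in detail:

```python
import importlib


def class_from_name(class_path):
    """Resolve 'package.module.ClassName' to a Python class object."""
    module_name, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_name)  # raises ImportError if missing
    return getattr(module, class_name)


def find_codec(class_path):
    """Return the codec class, or None when no Python module provides it."""
    try:
        return class_from_name(class_path)
    except ImportError:
        return None
```

With this lookup, `find_codec("com.hadoop.compression.lzo.LzoCodec")` returns None unless a Python package with that exact dotted path is on `sys.path`, regardless of which Java jars are installed.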

Does it support sequence file from pyspark?

Hi,

Recently I used PySpark to write images to a sequence file.

I use scikit-image and numpy to convert image data to a byte array and back, but I failed to restore the image from the sequence file.

Here is how I write the image to the sequence file in Spark:

import numpy as np
from skimage import io

# image_file_name and filename are defined elsewhere in the job (not shown).
def write():
    bg = io.imread(image_file_name)
    # Sanity check: the raw bytes must reshape back to the original array.
    np.frombuffer(bg.tobytes(), dtype=np.uint8).reshape(bg.shape)
    # The key carries the file name and the array shape; the value is the raw bytes.
    return [('image:%s-%d-%d-%d' % (filename[0], bg.shape[0], bg.shape[1], bg.shape[2]), bg.tobytes())]

But restoring the image from the sequence file fails:

    import numpy as np
    from hadoop.io import SequenceFile

    IMAGE_KEY = 'image:'  # key prefix used by the writer above

    reader = SequenceFile.Reader('image.seq')

    key_class = reader.getKeyClass()
    value_class = reader.getValueClass()
    print(type(value_class))

    key = key_class()
    value = value_class()
    print(type(value))

    #reader.sync(4042)
    position = reader.getPosition()
    while reader.next(key, value):
        key_str = key.toString()
        if key_str.startswith(IMAGE_KEY):
            # split() returns strings, so the dimensions are converted to int before reshape.
            filename, rows, cols, channels = key_str[len(IMAGE_KEY):].rsplit('-', 3)
            # This is the step that fails to convert back to an image.
            np.frombuffer(value.getBytes(), dtype=np.uint8).reshape(int(rows), int(cols), int(channels))
        position = reader.getPosition()

    reader.close()

Here is the sequence file:

https://drive.google.com/file/d/0B18-oWPEXrIWMVpkME9RUFdCOEE/view?usp=sharing

Thanks for the help.
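One way to narrow down a failure like this is to check the key and the payload size before reshaping. The sketch below uses only the standard library; `parse_image_key` and `payload_matches` are hypothetical helpers, and the key layout (`image:<name>-<rows>-<cols>-<channels>`) is taken from the writer shown in the question:

```python
IMAGE_KEY = "image:"


def parse_image_key(key):
    """Split 'image:<name>-<rows>-<cols>-<channels>' into (name, shape)."""
    if not key.startswith(IMAGE_KEY):
        raise ValueError("not an image key: %r" % key)
    # rsplit from the right so that file names containing '-' stay intact.
    name, rows, cols, channels = key[len(IMAGE_KEY):].rsplit("-", 3)
    return name, (int(rows), int(cols), int(channels))


def payload_matches(key, payload):
    """Return True when the payload holds exactly rows * cols * channels bytes."""
    _, (rows, cols, channels) = parse_image_key(key)
    return len(payload) == rows * cols * channels
```

If `payload_matches` returns False for a record, the bytes were changed somewhere between writing and reading (for example by a text encoding step or by a value buffer that is longer than the record), and no reshape can succeed until that is resolved.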

Reading custom serialization for SequenceFiles

Serialized HDFS files can be tricky to read, because sometimes they are:

Compressed
Encoded in a non-Writable SequenceFile format (Thrift, Avro, ...)

I wonder if I can use this API to read Thrift SequenceFiles in Python?

The SequenceFile.Reader class (https://github.com/matteobertozzi/Hadoop/tree/master/python-hadoop/hadoop) appears to be built on base classes that could allow a more advanced SequenceFile reader to be implemented, one that handles custom serialization formats on top of the Hadoop ones.

I would potentially be able to work with you on implementing this for Thrift. Feel free to contact me directly!
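As a rough illustration of what such an extension could look like, here is a sketch of a value class that reads a raw payload and hands it to a pluggable decoder. It assumes the reader instantiates the value class and fills it through a `readFields`-style method, mirroring Java's Writable interface; `RawPayloadWritable` and its `decoder` argument are hypothetical, and the wire layout (4-byte big-endian length followed by the bytes) is the one Hadoop's BytesWritable uses. A Thrift or Avro deserializer would be passed in as the decoder:

```python
import io
import struct


class RawPayloadWritable:
    """Hypothetical value class: keeps the record bytes and decodes on demand."""

    def __init__(self, decoder=None):
        # decoder turns raw bytes into an object; identity by default.
        self._decoder = decoder or (lambda raw: raw)
        self._raw = b""

    def set(self, raw):
        self._raw = bytes(raw)

    def get(self):
        return self._decoder(self._raw)

    def write(self, stream):
        # 4-byte big-endian length, then the payload itself.
        stream.write(struct.pack(">i", len(self._raw)))
        stream.write(self._raw)

    def readFields(self, stream):
        (length,) = struct.unpack(">i", stream.read(4))
        self._raw = stream.read(length)
```

Keeping the decoder separate from the framing means the same class could serve several serialization formats without touching the reader.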

Add value to a Text Writable

Hi, first, thanks for the library; it seems very useful.

Now the question: I haven't been able to find how to set the value of a Text Writable. Are these methods missing?

Thanks,

Trouble installing Python module

Hi,

I'm having trouble installing the contents of the python-hadoop subfolder as a Python module. Here is what I tried:

1. Clone the repository
2. Add the python-hadoop folder to my PYTHONPATH
3. In a script executed under Python 3.5.2, import hadoop.io

I get the following error:

<my_working_dir>/Hadoop/python-hadoop/hadoop/__init__.py in <module>()
     18
     19 import io
---> 20 import util
     21

ImportError: No module named 'util'

I have tried various other import statements like

import hadoop
from hadoop import io
but none of those work either. This seems to be another issue with Python submodules and relative imports in the __init__.py files, but I don't quite know how to fix it. Any ideas?
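For context, Python 3 removed implicit relative imports: inside a package's `__init__.py`, `import util` searches only for a top-level module named `util`, while `from . import util` refers to the sibling submodule. The sketch below demonstrates the difference with a throwaway package; `try_import`, the package names, and the `helpers_demo` submodule are all made up for the demonstration and do not touch the hadoop package itself:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(init_source, package_name):
    """Build a temporary package with the given __init__.py and try to import it.

    Returns the submodule's VALUE on success, or None if the import fails.
    """
    root = Path(tempfile.mkdtemp())
    package_dir = root / package_name
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text(init_source)
    (package_dir / "helpers_demo.py").write_text("VALUE = 42\n")
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(package_name)
        return module.helpers_demo.VALUE
    except ImportError:
        return None
    finally:
        sys.path.remove(str(root))


# Python 2 style: fails under Python 3 because helpers_demo is not top-level.
implicit = try_import("import helpers_demo\n", "pkg_implicit_demo")
# Explicit relative import: works under Python 3.
explicit = try_import("from . import helpers_demo\n", "pkg_explicit_demo")
```

Applied to this repository, that would mean the `import io` and `import util` lines in `hadoop/__init__.py` need the explicit relative form before the package can load under Python 3; other Python 2 constructs in the code base may still need porting after that.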

SequenceFile reader fails when file ends with sync

Then it proceeds to read records without checking whether or not reading that sync placed it at the EOF.
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L361

Example of why you might have a sync right before EOF:
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L154
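The behaviour described here can be sketched with a simplified record loop. It assumes the uncompressed record layout (4-byte record length, 4-byte key length, key bytes, value bytes) and a record length of -1 announcing a 16-byte sync marker; `read_int` and `iter_records` are hypothetical helpers rather than the library's code. The point is that the loop re-checks for end of file after consuming a sync marker instead of assuming another record follows:

```python
import io
import struct

SYNC_ESCAPE = -1  # record-length value that announces a sync marker
SYNC_SIZE = 16    # size of the sync marker in bytes


def read_int(stream):
    """Return the next big-endian int32, or None at a clean end of file."""
    raw = stream.read(4)
    if not raw:
        return None
    if len(raw) < 4:
        raise EOFError("truncated length field")
    return struct.unpack(">i", raw)[0]


def iter_records(stream, sync):
    """Yield (key_bytes, value_bytes) pairs, tolerating a trailing sync marker."""
    while True:
        record_length = read_int(stream)
        if record_length is None:
            return  # clean EOF, possibly right after a sync marker
        if record_length == SYNC_ESCAPE:
            if stream.read(SYNC_SIZE) != sync:
                raise IOError("sync marker mismatch")
            continue  # go back and check for EOF before reading a record
        key_length = read_int(stream)
        if key_length is None:
            raise EOFError("truncated record")
        body = stream.read(record_length)
        if len(body) < record_length:
            raise EOFError("truncated record")
        yield body[:key_length], body[key_length:]
```

A file whose last bytes are a sync marker then ends the iteration normally instead of raising while trying to parse a record that is not there.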


Append to an existing SequenceFile

Hadoop

SequenceFile reader fails when file ends with sync
