
hadoop's People

Contributors

artem-garmash, b11z, calmofthestorm, haohui, jkahn, kai5263499, martinmev, matteobertozzi, pcmdx


hadoop's Issues

Does SequenceFile.Reader support LzoCodec?

I have a sequence file compressed with LzoCodec that I am unable to read with this module.

from hadoop.io import SequenceFile
fh='/home/ekta/my_file'
reader = SequenceFile.Reader(fh)

The first few lines of the file I am trying to read:

SEQ org.apache.hadoop.io.Text com.bloomreach.proto.PwfPixelLog #com.hadoop.compression.lzo.LzoCodec [binary data]

It seems to me that it is searching for a decompressor but is unable to find one. If this is supported, what am I doing wrong?
Also, I installed hadoop-lzo from here, https://github.com/twitter/hadoop-lzo - though I see that the
com.hadoop.compression.lzo

Traceback (most recent call last):
  File "/home/ekta/CUSTOM_WORK/protobuf.py", line 3, in <module>
    reader = SequenceFile.Reader(fh)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/SequenceFile.py", line 288, in __init__
    self._initialize(path, start, length)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/SequenceFile.py", line 478, in _initialize
    self._codec = CodecPool().getDecompressor(codec_class)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/compress/CodecPool.py", line 34, in getDecompressor
    codec_class = ReflectionUtils.hadoopClassFromName(class_path)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/util/ReflectionUtils.py", line 24, in hadoopClassFromName
    return classFromName(class_path)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/util/ReflectionUtils.py", line 44, in classFromName
    module = __import__(module_name, globals(), locals(), [str(class_name)], -1)
ImportError: No module named com.hadoop.compression.lzo

In the hadoop-lzo package I do see "com.hadoop.compression.lzo". Is the program unable to find this class in hadoop-lzo? In dist-packages I have Hadoop-0.1.4-py2.7.egg, _lzo.so*, lzo.py, and python_lzo-1.0.egg-info.

I believe that LzoCodec.java from com.hadoop.compression.lzo might be needed to read the file above?

:~/Downloads/hadoop-lzo$ tree
[..more ]

| | |-- com
| | | | |-- hadoop
| | | | | |-- compression
| | | | | | `-- lzo
| | | | | | |-- CChecksum.java
| | | | | | |-- DChecksum.java
| | | | | | |-- DistributedLzoIndexer.java
| | | | | | |-- GPLNativeCodeLoader.java
| | | | | | |-- LzoCodec.java
| | | | | | |-- LzoCompressor.java
| | | | | | |-- LzoDecompressor.java
| | | | | | |-- LzoIndex.java
| | | | | | |-- LzoIndexer.java
| | | | | | |-- LzoInputFormatCommon.java
| | | | | | |-- LzopCodec.java
| | | | | | |-- LzopDecompressor.java
| | | | | | |-- LzopInputStream.java
| | | | | | |-- LzopOutputStream.java
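The traceback above suggests that the reader takes the codec class name stored in the file header and tries to import it as a Python module. If that reading is right, installing the Java hadoop-lzo sources would not help, because no Python module named com.hadoop.compression.lzo exists on the import path. The following is only a sketch of that kind of name-to-module lookup, written with the standard importlib module; class_from_name and describe_lookup are hypothetical helpers, not the library's actual code.

```python
import importlib


def class_from_name(class_path):
    """Resolve a dotted 'package.module.ClassName' string to a Python object.

    Hypothetical helper: it only illustrates the general technique of
    importing a module by name and fetching an attribute from it.
    """
    module_name, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def describe_lookup(class_path):
    """Return the resolved class name, or the import error as a string."""
    try:
        return class_from_name(class_path).__name__
    except ImportError as exc:
        return "ImportError: %s" % exc


# A name that maps to a real Python module resolves normally.
print(describe_lookup("collections.OrderedDict"))

# A Java class name has no Python module behind it, so the import fails,
# regardless of whether the Java sources are installed somewhere on disk.
print(describe_lookup("com.hadoop.compression.lzo.LzoCodec"))
```

Under this assumption, reading the file would need a Python implementation of the codec that the lookup can import, rather than the Java class.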

Does it support sequence files written by pyspark?

Hi,

I recently used pyspark to write images to a sequence file.

I use scikit-image and numpy to convert the image data to a byte array and back, but I failed to restore the image from the sequence file.

Here is how I write the image to the sequence file in Spark:

import numpy as np
from skimage import io  # io.imread below comes from scikit-image

def write():
    bg = io.imread(image_file_name)
    # check that the raw bytes can be restored to the image array
    np.frombuffer(bg.tobytes(), dtype=np.uint8).reshape(bg.shape)
    return [('image:%s-%d-%d-%d' % (filename[0], bg.shape[0], bg.shape[1], bg.shape[2]), bg.tobytes())]

But restoring the image from the sequence file fails:

    reader = SequenceFile.Reader('image.seq')

    key_class = reader.getKeyClass()
    value_class = reader.getValueClass()
    print type(value_class)

    key = key_class()
    value = value_class()
    print type(value)

    #reader.sync(4042)
    position = reader.getPosition()
    while reader.next(key, value):
        #  print '*' if reader.syncSeen() else ' ',
        #  print '[%6s] %6s %6s' % (position, key.toString(), value.toString())
        key_str = key.toString()
        if key_str.startswith(IMAGE_KEY):
            # split from the right so a '-' inside the filename is preserved;
            # the key stores shape[0], shape[1], shape[2] in that order
            filename, height, width, channel = key_str[len(IMAGE_KEY):].rsplit('-', 3)
            # failed to convert to image
            # (the dimensions are strings here, so convert them before reshaping)
            np.frombuffer(value.getBytes(), dtype=np.uint8).reshape(int(height), int(width), int(channel))
        position = reader.getPosition()

    reader.close()

Here is the sequence file:

https://drive.google.com/file/d/0B18-oWPEXrIWMVpkME9RUFdCOEE/view?usp=sharing

Thanks for the help.
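One thing worth checking independently of the library is the key round trip: the writer packs the array shape into the key as text, so the reader has to turn those fields back into integers before reshaping, and the payload length should equal their product. Below is a minimal standard-library sketch of that check. The IMAGE_KEY value 'image:' is assumed from the key format used in the writer, and make_image_key, parse_image_key, and payload_matches_shape are hypothetical helper names.

```python
IMAGE_KEY = "image:"  # assumed prefix, taken from the writer's key format


def make_image_key(filename, shape):
    """Build a key like 'image:<filename>-<dim0>-<dim1>-<dim2>'."""
    return "%s%s-%d-%d-%d" % (IMAGE_KEY, filename, shape[0], shape[1], shape[2])


def parse_image_key(key_str):
    """Recover the filename and an integer shape tuple from a key."""
    if not key_str.startswith(IMAGE_KEY):
        raise ValueError("not an image key: %r" % key_str)
    # rsplit keeps any '-' characters that belong to the filename itself.
    filename, dim0, dim1, dim2 = key_str[len(IMAGE_KEY):].rsplit("-", 3)
    return filename, (int(dim0), int(dim1), int(dim2))


def payload_matches_shape(shape, payload):
    """True when a uint8 payload has exactly one byte per array element."""
    return len(payload) == shape[0] * shape[1] * shape[2]


key = make_image_key("my-cat.png", (2, 3, 3))
filename, shape = parse_image_key(key)
print(filename, shape, payload_matches_shape(shape, bytes(18)))
```

If the length check fails for the bytes returned by the reader, the value probably carries extra framing or padding, and the reshape would fail no matter how the key is parsed.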

Reading custom serialization for SequenceFiles

Serialized HDFS files can be tricky to read, because sometimes they are:

  • Compressed
  • Encoded in a non-Writable SequenceFile format (Thrift, Avro, ...)

I wonder if I can use this API to read Thrift sequence files in Python?

The SequenceFile.Reader class and the base classes it uses, found here: https://github.com/matteobertozzi/Hadoop/tree/master/python-hadoop/hadoop

look like they could allow for a more advanced SequenceFile reader that handles custom serialization on top of the Hadoop formats.

I would potentially be able to work with you on implementing this for Thrift. Feel free to contact me directly!

Add value to a Text Writable

Hi, first, thanks for the library; it seems very useful.

Now the question: I haven't been able to find how to set the value of a Text Writable. Are these methods missing?

Thanks,

Trouble installing Python module

Hi,

I'm having trouble installing the contents of the python-hadoop subfolder as a Python module. Here is what I tried:

  • Clone the repository
  • Add the python-hadoop folder to my PYTHONPATH
  • In a script executed under Python 3.5.2, import hadoop.io

I get the following error:

<my_working_dir>/Hadoop/python-hadoop/hadoop/__init__.py in <module>()
     18
     19 import io
---> 20 import util
     21

ImportError: No module named 'util'

I have tried various other import statements like

  • import hadoop
  • from hadoop import io

but none of those work either. This seems to be another issue with Python submodules and relative imports in the __init__.py files, but I don't quite know how to fix it. Any ideas?
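The error is consistent with the package relying on Python 2 style implicit relative imports: inside a package, Python 3 treats a bare "import util" as an absolute import and no longer looks in the package directory. The sketch below reproduces this with a throwaway package and shows the explicit relative form that Python 3 accepts. The names demo_pkg, demo_helpers, and try_import are hypothetical and exist only for this illustration; whether the same change is enough for the real package is an assumption.

```python
import importlib
import os
import shutil
import sys
import tempfile


def try_import(init_source):
    """Create a temporary package whose __init__.py is init_source,
    import it, and return the helper's value or the import error text."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, "demo_pkg")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as handle:
        handle.write(init_source)
    with open(os.path.join(pkg_dir, "demo_helpers.py"), "w") as handle:
        handle.write("ANSWER = 42\n")

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("demo_pkg")
        return module.demo_helpers.ANSWER
    except ImportError as exc:
        return "ImportError: %s" % exc
    finally:
        # Leave no trace so the function can be called repeatedly.
        sys.path.remove(root)
        sys.modules.pop("demo_pkg", None)
        sys.modules.pop("demo_pkg.demo_helpers", None)
        shutil.rmtree(root, ignore_errors=True)


# Python 2 style: fails on Python 3 because the sibling module is not
# searched for implicitly.
print(try_import("import demo_helpers\n"))

# Explicit relative import: works on Python 3 (and on Python 2.5+).
print(try_import("from . import demo_helpers\n"))
```

Note that "import io" in the real __init__.py does not fail on Python 3 only because it silently picks up the standard library io module instead of the package's own.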

Append to an existing SequenceFile

Hi Matteo,
I am wondering if there is a workaround for appending to a previously closed, already existing SequenceFile. The current implementation of the writer does not seem to support it. Is there a plan to add append support in the future?

SequenceFile reader fails when file ends with sync

The check for EOF appears here:
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L346

If that check reports that we are not at EOF, then it attempts to read any sync.

Then it proceeds to read records without checking whether or not reading that sync placed it at the EOF.
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L361

Example of why you might have a sync right before EOF:
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L154
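To make the control flow concrete, here is a toy reader over a simplified format that is only loosely modeled on SequenceFile: each record is a 4-byte big-endian length followed by the payload, and a length of -1 announces a 16-byte sync marker. This is a sketch of the fix being suggested, not the library's code; read_records, encode_record, and encode_sync are hypothetical names. The point is that after consuming a sync the loop goes back to the EOF check instead of assuming a record follows.

```python
import io
import struct

SYNC_ESCAPE = -1  # a record length of -1 announces a sync marker
SYNC_SIZE = 16    # number of sync marker bytes that follow the escape


def encode_record(payload):
    """Length-prefixed record in the toy format."""
    return struct.pack(">i", len(payload)) + payload


def encode_sync():
    """Sync escape followed by a dummy 16-byte marker."""
    return struct.pack(">i", SYNC_ESCAPE) + b"\x00" * SYNC_SIZE


def read_records(stream):
    """Read all records, tolerating a sync marker right before EOF."""
    records = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return records  # EOF, including EOF directly after a sync
        (length,) = struct.unpack(">i", header)
        if length == SYNC_ESCAPE:
            stream.read(SYNC_SIZE)
            continue  # re-run the EOF check before expecting a record
        records.append(stream.read(length))


data = encode_record(b"ab") + encode_sync() + encode_record(b"cde") + encode_sync()
print(read_records(io.BytesIO(data)))
```

A reader that instead read the sync and then unconditionally parsed a record header would fail on the trailing sync in this input.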
