matteobertozzi / hadoop
Hadoop (Utilities, Patches and Examples)
Home Page: http://th30z.blogspot.com
I have a sequence file compressed with LzoCodec that I am unable to read with this module.
from hadoop.io import SequenceFile
fh='/home/ekta/my_file'
reader = SequenceFile.Reader(fh)
SEQ org.apache.hadoop.io.Text com.bloomreach.proto.PwfPixelLog com.hadoop.compression.lzo.LzoCodec [binary data omitted]
It seems to me that it is searching for a decompressor but is unable to find one. If this is supported, what am I doing wrong?
Also, I installed hadoop-lzo from here, https://github.com/twitter/hadoop-lzo - though I see that the
com.hadoop.compression.lzo
Traceback (most recent call last):
  File "/home/ekta/CUSTOM_WORK/protobuf.py", line 3, in <module>
    reader = SequenceFile.Reader(fh)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/SequenceFile.py", line 288, in __init__
    self._initialize(path, start, length)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/SequenceFile.py", line 478, in _initialize
    self._codec = CodecPool().getDecompressor(codec_class)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/io/compress/CodecPool.py", line 34, in getDecompressor
    codec_class = ReflectionUtils.hadoopClassFromName(class_path)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/util/ReflectionUtils.py", line 24, in hadoopClassFromName
    return classFromName(class_path)
  File "/home/ekta/Downloads/Hadoop/python-hadoop/hadoop/util/ReflectionUtils.py", line 44, in classFromName
    module = __import__(module_name, globals(), locals(), [str(class_name)], -1)
ImportError: No module named com.hadoop.compression.lzo
In the hadoop-lzo package I do see "com.hadoop.compression.lzo". Is the program unable to find this class in hadoop-lzo? In dist-packages I have Hadoop-0.1.4-py2.7.egg, _lzo.so*, lzo.py, and python_lzo-1.0.egg-info.
I believe that com.hadoop.compression.lzo.LzoCodec.java might be needed to read my file as above?
:~/Downloads/hadoop-lzo$ tree
[..more ]
| | |-- com
| | | | |-- hadoop
| | | | | |-- compression
| | | | | | `-- lzo
| | | | | | |-- CChecksum.java
| | | | | | |-- DChecksum.java
| | | | | | |-- DistributedLzoIndexer.java
| | | | | | |-- GPLNativeCodeLoader.java
| | | | | | |-- LzoCodec.java
| | | | | | |-- LzoCompressor.java
| | | | | | |-- LzoDecompressor.java
| | | | | | |-- LzoIndex.java
| | | | | | |-- LzoIndexer.java
| | | | | | |-- LzoInputFormatCommon.java
| | | | | | |-- LzopCodec.java
| | | | | | |-- LzopDecompressor.java
| | | | | | |-- LzopInputStream.java
| | | | | | |-- LzopOutputStream.java
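The traceback above suggests that the reader takes the codec class name stored in the file header and tries to import it as a Python module. A Java source tree such as hadoop-lzo does not provide a Python module, so the import would fail regardless of what the Java package contains. Below is a minimal sketch of that kind of dotted-name lookup. `class_from_name` is a hypothetical helper written for illustration with `importlib`; it is not the library's own code.

```python
import importlib


def class_from_name(class_path):
    """Resolve 'package.module.ClassName' to a Python object by importing it."""
    module_name, _, class_name = class_path.rpartition('.')
    module = importlib.import_module(module_name)  # raises ImportError if absent
    return getattr(module, class_name)


# A name that exists as a Python module resolves normally.
ordered_dict = class_from_name('collections.OrderedDict')

# The codec name from the file header is a Java class, not a Python module,
# so the lookup fails unless a Python package with that path is importable.
try:
    class_from_name('com.hadoop.compression.lzo.LzoCodec')
    lzo_found = True
except ImportError:
    lzo_found = False
```

Under this reading, reading the file would need a Python implementation of the codec that is importable under the expected name, which is an assumption about the library's design rather than something the traceback confirms.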
Hi,
Recently I used PySpark to write images to a sequence file.
I use scikit-image and NumPy to convert image data to and from a byte array, but I failed to restore the image from the sequence file.
Here is how I write the image to the sequence file in Spark:
from PIL import Image
from io import BytesIO
import numpy as np
from skimage import io  # provides io.imread used below

def write():
    # image_file_name and filename are defined elsewhere in the Spark job
    bg = io.imread(image_file_name)
    # check if it can restore to images
    np.fromstring(bg.tobytes(), dtype = np.uint8).reshape((bg.shape[0], bg.shape[1], bg.shape[2]))
    return [('image:%s-%d-%d-%d' %(filename[0], bg.shape[0], bg.shape[1], bg.shape[2]), bg.tobytes())]
But restoring the image from the sequence file fails:
from hadoop.io import SequenceFile
import numpy as np

IMAGE_KEY = 'image:'  # assumed: prefix used in the keys written above
reader = SequenceFile.Reader('image.seq')
key_class = reader.getKeyClass()
value_class = reader.getValueClass()
print type(value_class)
key = key_class()
value = value_class()
print type(value)
#reader.sync(4042)
position = reader.getPosition()
while reader.next(key, value):
    # print '*' if reader.syncSeen() else ' ',
    # print '[%6s] %6s %6s' % (position, key.toString(), value.toString())
    key_str = key.toString()
    if key_str.startswith(IMAGE_KEY):
        filename, width, height, channel = key_str[len(IMAGE_KEY):].split('-')
        # failed to convert to image
        np.fromstring(value.getBytes(), dtype=np.uint8).reshape(width, height, channel)
    position = reader.getPosition()
reader.close()
Here is the sequence file
https://drive.google.com/file/d/0B18-oWPEXrIWMVpkME9RUFdCOEE/view?usp=sharing
thanks for the help.
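One thing worth checking in the reading code is that the dimensions parsed out of the key are strings, while `reshape` needs integers. The sketch below is a NumPy-only round trip under that assumption. It does not touch the sequence file, so it says nothing about whether the bytes returned by the reader match the bytes that were written. The helper names `encode` and `decode` and the `'image:'` prefix are illustrative assumptions.

```python
import numpy as np

IMAGE_KEY = 'image:'  # assumed key prefix, taken from the writer's key format


def encode(filename, array):
    """Build the (key, value) pair: shape goes in the key, raw bytes in the value."""
    rows, cols, channels = array.shape
    key = '%s%s-%d-%d-%d' % (IMAGE_KEY, filename, rows, cols, channels)
    return key, array.tobytes()


def decode(key_str, raw):
    """Recover the file name and the array from a (key, value) pair."""
    # rsplit keeps file names that themselves contain '-' intact.
    filename, rows, cols, channels = key_str[len(IMAGE_KEY):].rsplit('-', 3)
    shape = (int(rows), int(cols), int(channels))  # the split parts are strings
    return filename, np.frombuffer(raw, dtype=np.uint8).reshape(shape)


# Small synthetic "image" so the example needs no file on disk.
original = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
key, value = encode('my-image.png', original)
name, restored = decode(key, value)
```

If this round trip succeeds but the sequence file version still fails, comparing the length of the value read back with the product of the three dimensions would be a reasonable next check.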
Serialized HDFS files can be tricky to read, because sometimes they are
I wonder if I can use this API to read Thrift sequence files in Python?
The SequenceFile.Reader class (https://github.com/matteobertozzi/Hadoop/tree/master/python-hadoop/hadoop)
appears to use base classes that could allow implementing a more advanced SequenceFile reader, one that handles custom serialization and Hadoop formats.
I would potentially be able to work with you on implementing this for Thrift. Feel free to contact me directly!
Hi, first, thanks for the library. It seems very useful.
Now the question: I haven't been able to find how to set the value of a Text Writable. Are these methods missing?
Thanks,
Hi,
I'm having trouble installing the contents of the python-hadoop subfolder as a Python module. Here is what I tried:
import hadoop.io
I get the following error:
<my_working_dir>/Hadoop/python-hadoop/hadoop/__init__.py in <module>()
18
19 import io
---> 20 import util
21
ImportError: No module named 'util'
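The quoted module name in `No module named 'util'` is the Python 3 message format, and `import util` inside a package's `__init__.py` is a Python 2 style implicit relative import, which Python 3 no longer resolves against the package directory. That would explain the error if the code is being run under Python 3. The sketch below reproduces the behaviour with a throwaway package. The package and module names are made up for the demonstration, and it is meant to be run under Python 3.

```python
import importlib
import os
import sys
import tempfile


def try_import(package_name, init_source):
    """Create a throwaway package with the given __init__.py and import it."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, package_name)
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
        f.write(init_source)
    # Sibling module that lives inside the package, like hadoop/util.
    with open(os.path.join(pkg_dir, 'demo_sibling_util.py'), 'w') as f:
        f.write('VALUE = 1\n')
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        importlib.import_module(package_name)
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(root)


# Python 2 style: the sibling is looked up as a top-level module in Python 3.
implicit_ok = try_import('demo_implicit', 'import demo_sibling_util\n')
# Explicit relative import: resolved inside the package.
explicit_ok = try_import('demo_explicit', 'from . import demo_sibling_util\n')
```

If this is the cause, the options would be to run the package under Python 2 or to convert its imports to the explicit relative form.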
I have tried various other import statements like
import hadoop
from hadoop import io
__init__.py
files. But I don't quite know how to fix this. Any ideas?
Hi Matteo,
I am wondering if there is a workaround for appending to an existing SequenceFile that has already been closed. The current implementation of the writer does not seem to support it. Is there a plan to add append in the future?
The check for EOF appears here.
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L346
If that check reports that we are not at EOF, the reader then attempts to read a sync marker.
It then proceeds to read records without checking whether reading that sync marker left it at EOF.
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L361
Example of why you might have a sync right before EOF:
https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/hadoop/io/SequenceFile.py#L154
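The control flow described above can be illustrated with a toy reader. This is a sketch under simplifying assumptions and not the library's code: records are just a 4-byte length followed by a payload, a length of -1 announces a 16-byte sync marker, and all names are hypothetical. The point is that the EOF check is repeated after a sync marker is consumed, so a sync at the very end of the stream does not lead to reading a record that is not there.

```python
import io
import struct

SYNC_ESCAPE = -1   # length value that announces a sync marker
SYNC_SIZE = 16     # size of the marker in bytes


def read_records(stream, sync):
    """Read length-prefixed records, tolerating a sync marker at the very end."""
    records = []
    while True:
        header = stream.read(4)
        if len(header) < 4:          # EOF check, repeated on every iteration
            return records
        (length,) = struct.unpack('>i', header)
        if length == SYNC_ESCAPE:
            if stream.read(SYNC_SIZE) != sync:
                raise IOError('unexpected sync marker')
            continue                 # go back to the EOF check before reading on
        records.append(stream.read(length))


# Build a stream with two records followed by a trailing sync marker.
sync = b'\x07' * SYNC_SIZE
data = io.BytesIO()
for payload in (b'a', b'bc'):
    data.write(struct.pack('>i', len(payload)) + payload)
data.write(struct.pack('>i', SYNC_ESCAPE) + sync)   # sync is the last thing written
data.seek(0)

records = read_records(data, sync)
```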