This DXRAM connector lets you run Apache Hadoop or HBASE jobs directly on data in DXRAM instead of HDFS.
It is still in pre Alpha state! and is still working in a /tmp/myfs/
folder instead of dxram only!
German: Final Report or PDF
since 6. Sep. 2018: exists
, mkdir
, list
, isdir
, size
, delete
and rename
works in DXRAM,
but mkdir
, delete
and rename
does not with more than ref_ids_each_fsnode
entries.
UTF8/16 chars or a path with more than max_pathlength_chars
are a problem, too.
Stop Codeing Part of the Project.
- last:
- implementing a non-hadoop dxnet Application to test my rpc-like API
DxramFile.create()
in hadoopDxramOutputStream
in hadoop- getting FsNode, Blockinfos, Block from a file
- read the last
BlockChunk
into buffer - modify FsNode, Blockinfos, Block from a file local
- doing a
flush()
of local data, to transfer to DxramFsApp and (may) enlarge refIds in FsNode - not yet tested, because creating initial FsNode fails
- fail on:
- 1 create FsNode for a new file
- 2 delete, rename
Not supporting remove operation if chunk locks are disable
- 3 transfer complete FsNode "ROOT" initial to a new (2nd) peer
- next:
DxramFile.open()
andDxramInputStream
in hadoop, to get bytes from file - other TODOs:
DxramFile.getFileBlockLocations()
- using EXT-type in FsNode to store more things in folders and long files
- using chunk locks and a kind of atomar procedures in the filesystem
- try hadoop unit-tests on dxramfs
- far away: testing mapreduce and HBASE examples (multi node)
Get the final report (PDF, German) from here.
... to develop a hadoop-like FS and test it
- Spark&Hadoop Google Cloud Storage Connector
- Filesystem Compatibility with Apache Hadoop
- hadoop FS guide
unzip HadoopDxramFS.zip
cd HadoopDxramFS/connector
mvn clean
mvn package
cp -f target/hadoop-dxram-fs-*.jar /hadoop/common/lib/hadoopDxramFs.jar
cp -f lib/*.jar /hadoop/common/lib/
To configure the connector you have to modify core-site.xml
of your hoadoop.
unzip HadoopDxramFS.zip
cd HadoopDxramFS/dxram_part
cp DxramFsApp.conf /dxram/dxapp/
cd dxapp
./build.sh
cp build/libs/dxapp-dxramfs-1.0.jar /dxram/dxapp/
To reduce confusion, here are some simple keywords to communicate over different project parts:
- the hadoop part
- it is a client
- it requests Filedata
- it is a connector to a DXRAM Peer
- it uses DXNet to connect to a DXRAM Peer/Application
- the DXRAM part
- it is a server
- it serves Filedata
- it handles DXNet Messages with DXRAM
- it is a DXRAM Application running on a Peer
Hadoop splits processes and lets calculate them with blockdata on nodes .
DXRAM splits Memory requests and get/set their data as chunks on peers.
Start my Envorinment (and take a look into this bash file):
. ./my-env.sh
use hadoop fs CLI to access dxram://namenode:9000
from core-site.xml
- ROOT is a FsNodeChunk
- FsNodeChunk builds a tree with ID (dxram chunk id) and a referenceId (parent FsNodeChunk ID)
- FsNodeChunk stores data about a file or a folder
- FsNodeChunk has an array of blockinfoIds (if it is full, extID refer to a FsNodeChunk with a new blockinfoIds array)
- blockinfoIds are dxram chunk ids to BlockinfoChunks
- BlockinfoChunk stores informations about a BlockChunk
- every BlockinfoChunk refer with storageId (a dxram chunk id) to a BlockChunk
- BlockChunk stores the bytes of a file
- extract dxnet or dxram hostname/ip/port from hadoop fs-scheme!
- switch to dxnet gradle (since 4. Sep 2018) in connector and dxram_part
- Check,if ASCII-only filenames, no append() and timestamp with 0 in Filesystem is a problem for mapreduce or HBASE
ok with dxram + /tmp/myfs folder:
bin/hadoop fs -mkdir /user
bin/hadoop fs -mkdir /user/tux
bin/hadoop fs -ls /user
bin/hadoop fs -mv /user/tux /user/other
bin/hadoop fs -rm -f /user/other
bin/hadoop fs -ls /
because Not supporting remove operation if chunk locks are disabled
a mv
creates two folders with same new name!
not working:
- utf8 chars
- storing/deleting/renameing files
- move/store a file or folder into / or move or rename something in /
Note 2018-12-03: hbase-site.xml
and the hbase libs needs a jar file and similar fs.dxram.impl
configs, too!
File hadoop-2.8.2-src/hadoop-dist/target/hadoop-2.8.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.dxram.impl</name>
<value>de.hhu.bsinfo.dxramfs.connector.DxramFileSystem</value>
<description>The FileSystem for dxram.</description>
</property>
<property>
<name>fs.AbstractFileSystem.dxram.impl</name>
<value>de.hhu.bsinfo.dxramfs.connector.DxramFs</value>
<description>
The AbstractFileSystem for dxram
</description>
</property>
<property>
<name>fs.defaultFS</name>
<!-- value>file:///tmp/tee/</value -->
<!-- value>hdfs://abook.localhost.fake:9000</value -->
<value>dxram://localhost:9000</value>
</property>
<property>
<name>dxram.file_blocksize</name>
<!-- blocksize is smaller than chunksize (dxram: jan 2018 max was 8MB) -->
<value>4194304</value>
</property>
<property>
<name>dxram.ref_ids_each_fsnode</name>
<value>128</value>
</property>
<property>
<name>dxram.max_pathlength_chars</name>
<value>512</value>
</property>
<property>
<name>dxram.max_filenamelength_chars</name>
<value>128</value>
</property>
<property>
<name>dxram.max_hostlength_chars</name>
<value>80</value>
</property>
<property>
<name>dxram.max_addrlength_chars</name>
<value>48</value>
</property>
<property>
<name>dxnet.me</name>
<value>0</value>
</property>
<property>
<name>dxnet.to_dxram_peers</name>
<!-- me is talking to localhost:65221 or localhost:65222, and them are talking to localhost:22222 or 22223. -->
<!-- the dxnet-dxram peer mapping localhost:65221 at localhost:22222 is good, to identify the location of a block. -->
<value>[email protected]:65220@,[email protected]:[email protected]:22222,[email protected]:[email protected]:22223,[email protected]:65223@</value>
</property>
</configuration>
If you use org.slf4j
and ...
LOG.info(Thread.currentThread().getStackTrace()[1].getMethodName()+"({})", p);
... in your code, you have to take a look to
vim etc/hadoop/hadoop-env.sh
and set export log4j_logger_org_apache_hadoop=INFO
to see your logs
while using bin/hadoop fs -<command> ...
!
open issue: using LOG.debug()
and export ...=DEBUG
did not work.
cd /EXAMPLE/hadoop-2.8.2-src/hadoop-dist/target/hadoop-2.8.2/
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dxram[a-z.]+'
2018-02-23: this examples works on dxramfs (via /tmp/myfs/ folder, not dxram):
- I got the right result
- JobRunner (is a part of yarn, but you do not have to start-yarn.sh) runs on local fs. hdfs on single node do it local, too.
To get a big (300MB) text file, there is a MR example RandomTextWriter. You need a
config file etc/hadoop/mapred-site.xml
to configure the 300222000 Byte output
insteat of 1099511627776 Bytes (default):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.randomtextwriter.minwordskey</name>
<value>5</value>
</property>
<property>
<name>mapreduce.randomtextwriter.maxwordskey</name>
<value>10</value>
</property>
<property>
<name>mapreduce.randomtextwriter.minwordsvalue</name>
<value>20</value>
</property>
<property>
<name>mapreduce.randomtextwriter.maxwordsvalue</name>
<value>100</value>
</property>
<property>
<name>mapreduce.randomtextwriter.totalbytes</name>
<!-- value>1099511627776</value -->
<value>300222000</value>
</property>
</configuration>
If you are the user tux and have a /user/tux/
home dir in hdfs or dxramfs you can
run this:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar randomtextwriter outrand
The result is written to /user/tux/outrand/part-m-00000
with "300MB":
bin/hadoop fs -ls /user/tux/outrand/
Found 2 items
-rw-rw-rw- 0 0 1970-01-01 01:00 /user/tux/outrand/_SUCCESS
-rw-rw-rw- 0 307822548 1970-01-01 01:00 /user/tux/outrand/part-m-00000
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar wordcount outrand/part-m-00000 wcout
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
Modify the etc/hadoop/mapred-site.xml
file:
...
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048m</value>
</property>
...
to set java options e.g. a bigger heapsize. Remove /user/tux/wcout
before redo ;-D
Maybe 300MB file is realy to big! Try 30 MB as input.
You need
- kerberos (does
kinit
work?) - hadoop "binaries" (maybe part of hbase)
- zookeeper (part of hbase)
You need an old protobuf version
git clone https://github.com/google/protobuf.git
cd protobuf
git checkout tags/v2.5.0
unsure: ./autogen.sh
./configure --prefix=/usr
make
sudo make install
sudo ldconfig
reboot ?!
Get Hadoop:
gunzip hadoop-*
tar -xvf hadoop-*
cd hadoop-2.8.2-src/
mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests
or use for offline:
mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests -o
you got an error and you fix a single line e.g. in the hadoop-hdfs project, restart maven on that place, where the error comes (and is fixed):
mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests -o -rf :hadoop-hdfs
Compile HDFS only:
Backup your etc/hadoop/*.xml
and etc/hadoop/hadoop-env.sh
files !!! It may change.
- edit
src/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
- cd to
src/hadoop-hdfs-project/hadoop-hdfs-client/
- do
mvn clean
- cd to
src/
- do
mvn package -Pdist -Pdoc -Psrc -Dtar -DskipTests -o
- hope for a libprotoc 2.5.0 (sometimes a system upgrade makes it 3.x)
- copy new jar files to the right place
I have a bash script for the last point:
cp ${HADOOP_HOME}/../../../hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.8.2.jar \
${HADOOP_HOME}/share/hadoop/common/lib/
cp ${HADOOP_HOME}/../../../hadoop-hdfs-project/hadoop-hdfs-native-client/target/hadoop-hdfs-native-client-2.8.2.jar \
${HADOOP_HOME}/share/hadoop/hdfs/
cp ${HADOOP_HOME}/../../../hadoop-hdfs-project/hadoop-hdfs-client/target/hadoop-hdfs-client-2.8.2.jar \
${HADOOP_HOME}/share/hadoop/hdfs/lib/
And finaly check the etc/hadoop/*.xml
and hadoop-env.sh
file !!! It may change.
For the pre-last point: goto protobuf
folder (you got it with git clone) and redo a make install
. If
you got a new gcc version, make clean
and ./configure
before make install
is a good choice!
export JAVA_HOME=/usr
export HADOOP_CONF_DIR="/EXAMPLE/hadoop-2.8.2-src/hadoop-dist/target/hadoop-2.8.2/etc/hadoop/"
export HADOOP_HOME="/EXAMPLE/hadoop-2.8.2-src/hadoop-dist/target/hadoop-2.8.2/"
export HBASE_CONF_DIR="/etc/hbase/"
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin/:$HADOOP_HOME/sbin/:$PATH
start:
hdfs namenode -format
start-dfs.sh
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/tux
start-hbase.sh
kinit
klist
note: hdfs dfs -mkdir /user
is equal to bin/hadoop fs -mkdir /user
if
hdfs is your defaultFS in core-site.xml
stop:
stop-hbase.sh
stop-dfs.sh