ripe-ncc / hadoop-pcap

Hadoop library to read packet capture (PCAP) files

License: GNU Lesser General Public License v3.0

Java 100.00%

hadoop-pcap's Introduction

Hadoop PCAP library

License

This library is distributed under the LGPL.
See: https://raw.github.com/RIPE-NCC/hadoop-pcap/master/LICENSE

Repository

UPDATE: Since Bintray has been discontinued, the latest releases of hadoop-pcap are not available there, and you have to build them from source.

<repositories>
  <repository>
    <id>hadoop-pcap</id>
    <url>http://dl.bintray.com/hadoop-pcap/hadoop-pcap</url>
  </repository>
</repositories>

Screencast

We have created a screencast showing the use of the Hadoop PCAP SerDe in Hive using Amazon Elastic MapReduce.
You can find the video on YouTube: http://www.youtube.com/watch?v=FLxeQciax-Q

Components

This project consists of two components:

Library

Bundles the code used to read PCAPs. Can be used within MapReduce jobs to natively read PCAP files.
See: https://github.com/RIPE-NCC/hadoop-pcap/tree/master/hadoop-pcap-lib

SerDe

Implements a Hive Serializer/Deserializer (SerDe) to query PCAPs using SQL like commands.
See: https://github.com/RIPE-NCC/hadoop-pcap/tree/master/hadoop-pcap-serde

hadoop-pcap's Issues

headerKeys in HttpPcapReader is never populated

Lines 122 to 129:
private void propagateHeaders(HttpPacket packet, Header[] headers) {
    LinkedList headerKeys = new LinkedList();
    for (Header header : headers) {
        String headerKey = HEADER_PREFIX + header.getName().toLowerCase();
        packet.put(headerKey, header.getValue());
    }
    packet.put(HttpPacket.HTTP_HEADERS, Joiner.on(',').join(headerKeys));
}

headerKeys is initialized but never populated, so the joined HTTP_HEADERS value is always empty.
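A minimal sketch of the likely fix, assuming the only missing step is adding each key to the list inside the loop. The class name, the prefix value and the key name below are hypothetical stand-ins (the real code uses HttpPacket, Header and Guava's Joiner); plain JDK types are used so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the header propagation step; not the library's actual code.
class HeaderPropagationSketch {
    static final String HEADER_PREFIX = "header_";     // assumed prefix value
    static final String HTTP_HEADERS = "http_headers"; // assumed key name

    // headers: {name, value} pairs in arrival order; packet: the field map to fill.
    static void propagateHeaders(Map<String, Object> packet, String[][] headers) {
        List<String> headerKeys = new ArrayList<>();
        for (String[] header : headers) {
            String headerKey = HEADER_PREFIX + header[0].toLowerCase(Locale.ROOT);
            packet.put(headerKey, header[1]);
            headerKeys.add(headerKey); // the step missing from the quoted snippet
        }
        packet.put(HTTP_HEADERS, String.join(",", headerKeys));
    }
}
```

With this change the HTTP_HEADERS field lists every header key that was stored on the packet, in the order the headers appeared.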

Parsed packets with NULL source and destination

Hi all,

Using the latest jar, I tried to extract packet properties such as src, dst and ts from 160 pcap files using PcapReader. However, many packets have NULL values for src, dst, protocol, src_port, dst_port, etc.

I am not sure whether this is a bug or something that can legitimately happen with PCAP files.

Is there any documentation for hadoop-pcap, or a diagram of the processing flow for pcaps?

Thanks,
Mustafa

Releasing 1.2 in Maven repository?

Hello all,

Great work here, I use this code in my research. I was wondering if we could get version 1.2 with the newer Hadoop 2 structuring pushed to the Maven repository? I actually use 1.2 in my code, but I would like to move the dependency management to Maven rather than having to attach the 1.2 jar.

Thanks!

Fragment ordering

At the moment fragments received out-of-order can cause re-assembly of incomplete payloads. The re-assembly should strictly check the linking.
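One way to picture the strict check is sketched below: fragments are kept sorted by offset, and a payload is only returned once they cover the whole range with no gaps or overlaps. This is an illustrative sketch under those assumptions, not the library's implementation; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: order-independent reassembly that refuses incomplete payloads.
class FragmentReassemblySketch {
    private final TreeMap<Integer, byte[]> fragments = new TreeMap<>(); // keyed by byte offset
    private int totalLength = -1; // unknown until the final fragment has been seen

    void add(int offset, byte[] data, boolean moreFragments) {
        fragments.put(offset, data);
        if (!moreFragments) {
            totalLength = offset + data.length;
        }
    }

    // Returns the payload only if the fragments link up exactly; otherwise null.
    byte[] reassemble() {
        if (totalLength < 0) {
            return null; // final fragment not seen yet
        }
        byte[] out = new byte[totalLength];
        int expected = 0;
        for (Map.Entry<Integer, byte[]> e : fragments.entrySet()) {
            byte[] data = e.getValue();
            if (e.getKey() != expected || expected + data.length > totalLength) {
                return null; // gap, overlap, or data past the declared end
            }
            System.arraycopy(data, 0, out, expected, data.length);
            expected += data.length;
        }
        return expected == totalLength ? out : null;
    }
}
```

Because completeness is checked on every call, the final fragment arriving early no longer produces a truncated payload; the caller simply retries after each new fragment.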

Not able to extract packet payload data

There is no function to get a hex dump of a packet's payload. PcapReader.java in the net.ripe.hadoop.pcap package has a readPayload() function, but it takes packetdata, payloadDataStart and payloadLength parameters.
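Until such a function exists, a small helper along these lines could format the payload bytes once the start offset and length are known. This is a hypothetical helper, not part of hadoop-pcap; its parameters only mirror the ones named above.

```java
// Hypothetical helper: formats a slice of the raw packet bytes as hex, 16 bytes per line.
class PayloadHexDumpSketch {
    static String hexDump(byte[] packetData, int payloadStart, int payloadLength) {
        if (payloadStart < 0 || payloadLength < 0
                || payloadStart > packetData.length - payloadLength) {
            throw new IllegalArgumentException("Payload range is outside the packet");
        }
        StringBuilder sb = new StringBuilder(payloadLength * 3);
        for (int i = 0; i < payloadLength; i++) {
            if (i > 0) {
                sb.append(i % 16 == 0 ? '\n' : ' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02x", packetData[payloadStart + i] & 0xff));
        }
        return sb.toString();
    }
}
```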

Refactor HttpPcapReader

The current implementation of HttpPcapReader from pull request #4 only supports HTTP headers in a defined order. Needs to be refactored to use a proper implementation such as Apache HttpClient.

Cannot use hadoop-pcap-lib with Hadoop 1.2.1

Hi,

I am trying to use hadoop-pcap-lib with Hadoop 1.2.1. I am writing a driver, but the driver expects the argument to setInputFormatClass to be a class extending mapreduce.FileInputFormat rather than mapred.FileInputFormat. Do you have a newer version of PcapInputFormat.java?

Thanks
Rahul

Re-assembly for out-of-order packets

Current implementations for UDP and TCP re-assembly will only work if the final packet is the last packet of the transmission. If packets are received out-of-order and the final packet triggers the re-assembly it will result in a failed re-assembly of the datastream.

NegativeArraySizeException when reading PCAP

2011-10-31 17:35:27,369 WARN org.apache.hadoop.mapred.TaskTracker (main): Error running child
java.lang.NegativeArraySizeException
    at net.ripe.hadoop.pcap.PcapReader.nextPacket(PcapReader.java:65)
    at net.ripe.hadoop.pcap.PcapReader.access$100(PcapReader.java:12)
    at net.ripe.hadoop.pcap.PcapReader$PacketIterator.fetchNext(PcapReader.java:182)
    at net.ripe.hadoop.pcap.PcapReader$PacketIterator.hasNext(PcapReader.java:187)
    at net.ripe.hadoop.pcap.io.reader.PcapRecordReader.next(PcapRecordReader.java:47)
    at net.ripe.hadoop.pcap.io.reader.PcapRecordReader.next(PcapRecordReader.java:17)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.next(CombineHiveRecordReader.java:87)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.next(CombineHiveRecordReader.java:36)
    at org.apache.hadoop.hive.shims.Hadoop20Shims$CombineFileRecordReader.next(Hadoop20Shims.java:155)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:194)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:178)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
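The trace does not say why the array size was negative. One plausible cause, assumed here, is a captured-length field that is corrupt or read with the wrong byte order and then used directly as an array size. The sketch below reads a classic PCAP record header (ts_sec, ts_usec, incl_len, orig_len, 4 bytes each) and validates incl_len before allocating. The class name and the size limit are hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: validate the captured length before allocating the packet buffer.
class RecordLengthCheckSketch {
    static final int RECORD_HEADER_SIZE = 16;   // ts_sec, ts_usec, incl_len, orig_len
    static final long MAX_PACKET_SIZE = 262144; // assumed sanity limit, not a library constant

    static byte[] readPacket(InputStream in, ByteOrder order) throws IOException {
        DataInputStream din = new DataInputStream(in);
        byte[] header = new byte[RECORD_HEADER_SIZE];
        din.readFully(header);
        // incl_len is an unsigned 32-bit value at offset 8; widen it to avoid a negative int.
        long inclLen = ByteBuffer.wrap(header).order(order).getInt(8) & 0xFFFFFFFFL;
        if (inclLen > MAX_PACKET_SIZE) {
            throw new IOException("Implausible captured length " + inclLen
                    + "; the file may be corrupt or read with the wrong byte order");
        }
        byte[] data = new byte[(int) inclLen];
        din.readFully(data);
        return data;
    }
}
```

Failing with a descriptive IOException makes the bad record visible to the caller, which can then decide whether to stop or skip the rest of the file.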

Porting code to Spark 2.0

Does this code work on Spark 2.0? I tried it and got an NPE. If you know anyone who has done this, let me know, or I can try my hand at it.

No module in PySpark

Is this module compatible with PySpark? Every time I try to import it, it fails. It works fine in Scala.

Cannot read the ts_usec field

I am having trouble reading the ts_usec field from the Hive table. I followed exactly the steps listed in the example: https://github.com/RIPE-NCC/hadoop-pcap/tree/master/hadoop-pcap-serde.
I tried the latest release, "hadoop-pcap-serde-1.0-jar-with-dependencies.jar", with Hadoop 2.4.1 and Hive 0.13.1, but still no luck.

The error message is:
hive (default)> select * from pcaps limit 10;
OK
pcaps.ts pcaps.ts_usec pcaps.protocol pcaps.src pcaps.src_port pcaps.dst pcaps.dst_port pcaps.len pcaps.ttl pcaps.dns_queryid pcaps.dns_flags pcaps.dns_opcode pcaps.dns_rcode pcaps.dns_question pcaps.dns_answer pcaps.dns_authority pcaps.dns_additional
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

From the error message, it looks like the conversion from Java BigDecimal to HiveDecimal failed, but I am not sure how to fix it.

Get URL in HTTP_HEADERS

Hi,
How can I get the URL with your project? When I try, the URL does not appear in the output.
Do you have an example?

IPv6 support

At this point we are only decoding IPv4 packets. With the advance of IPv6 it becomes more important to be able to decode those packets too.
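As a starting point, a decoder can branch on the version field, which is the high nibble of the first byte of the IP header. The sketch below only distinguishes the two versions and returns the base header length; it is a hypothetical helper, not code from the library, and it does not walk IPv6 extension headers.

```java
// Hypothetical sketch: pick the header layout from the IP version nibble.
class IpVersionSketch {
    // The version is stored in the top four bits of the first header byte.
    static int version(byte[] ipHeader) {
        return (ipHeader[0] >> 4) & 0x0F;
    }

    static int headerLength(byte[] ipHeader) {
        switch (version(ipHeader)) {
            case 4:
                return (ipHeader[0] & 0x0F) * 4; // IHL counts 32-bit words
            case 6:
                return 40; // fixed IPv6 header; extension headers are not counted
            default:
                throw new IllegalArgumentException("Unsupported IP version " + version(ipHeader));
        }
    }
}
```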

new MapReduce API support would be nice

Hi,

I was glad to find this project, and had hoped to use it in some work I'm doing with spark. However once I got to spark streaming, I learned that the old MR API is not supported by spark streaming. So I hacked my way through porting the RecordReader and InputFormat to the new MR API, which makes it possible to use with Spark Streaming.

My fork is here: https://github.com/vicenteg/hadoop-pcap

The diff: master...vicenteg:master

If this looks reasonable to you, I'll go ahead and submit the pull request.

Thanks for this project!

Export maven artifact to public repository

Opening an issue as requested by Wolfgang:
#7

I noticed that the hadoop-pcap-lib jar is not deployed to a Maven repository (there is no distributionManagement tag). Would it be possible to register it in a public repository so we can declare it as a dependency in the Mobicents Media Server pom.xml?

Thank you,

Ivelin

Reverse-endian PCAP files

I'm a little confused by the comment below in the code:

//To read reversed-endian PCAPs; the header is the only part that switches
private boolean reverseHeaderByteOrder = false;

Can you please explain why we should ignore the endianness of the packet?

This is not a bug report, but I had no way to ask other than raising an issue.

Thanks!
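Some background that may help: in the classic PCAP format the file header and the per-packet record headers are written in the byte order of the capturing host, while the packet bytes themselves are stored exactly as they appeared on the wire. Only the header fields therefore need swapping. The sketch below (hypothetical class, not the library's code) detects the order from the magic number and uses it to read the snaplen field at offset 16 of the 24-byte file header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: byte order applies to PCAP header fields, not to packet data.
class PcapByteOrderSketch {
    static final int MAGIC = 0xA1B2C3D4;

    static ByteOrder detectOrder(byte[] globalHeader) {
        int magicBigEndian = ByteBuffer.wrap(globalHeader).order(ByteOrder.BIG_ENDIAN).getInt(0);
        if (magicBigEndian == MAGIC) {
            return ByteOrder.BIG_ENDIAN;
        }
        if (Integer.reverseBytes(magicBigEndian) == MAGIC) {
            return ByteOrder.LITTLE_ENDIAN;
        }
        throw new IllegalArgumentException("Not a classic PCAP file");
    }

    // snaplen is an unsigned 32-bit field at offset 16 of the file header.
    static long snapLen(byte[] globalHeader) {
        ByteOrder order = detectOrder(globalHeader);
        return ByteBuffer.wrap(globalHeader).order(order).getInt(16) & 0xFFFFFFFFL;
    }
}
```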

Re-compile to include HiveDecimal

Hi all,
I downloaded the SerDe and installed it on a Cloudera distribution of Hadoop (version 5) and got the following error: [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal.

This led me to attempt a recompile, but I'm missing the package org.apache.hadoop.hive.common.type and others.

Reference and background postings on the Google user group for Cloudera and Hive:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/Qlc1hfo51cg

Regards

TCP stream reassembly

To be able to decode the payload of higher level protocols reliably it would be very useful to be able to reassemble a TCP stream so it can be consumed as a regular InputStream.

How to read incoming network traffic information through the hadoop-pcap-lib and hadoop-pcap-serde libraries

I am a beginner with Hadoop. How can I read incoming network traffic information from the internet through the hadoop-pcap libraries? Please provide guidelines for working with them.

java.lang.RuntimeException: java.lang.RuntimeException: class net.ripe.hadoop.pcap.DnsPcapReader not net.ripe.hadoop.pcap.PcapReader

Hi,

I am using CDH 5.4.0 and followed the steps below, but got the following runtime error. Please help.

hive> add jar hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar;
Added [hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar] to class path
Added resources: [hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar]
hive> SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> SET mapred.max.split.size=104857600;
hive> SET net.ripe.hadoop.pcap.io.reader.class=net.ripe.hadoop.pcap.DnsPcapReader;
hive> CREATE EXTERNAL TABLE pcaps (ts bigint,
> ts_usec decimal,
> protocol string,
> src string,
> src_port int,
> dst string,
> dst_port int,
> len int,
> ttl int,
> dns_queryid int,
> dns_flags string,
> dns_opcode string,
> dns_rcode string,
> dns_question string,
> dns_answer array,
> dns_authority array,
> dns_additional array)
> ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer'
> STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION 'hdfs://localhost:8020/pcaps/';
OK
Time taken: 0.355 seconds
hive> select * from pcaps limit 2;

OK
java.lang.RuntimeException: java.lang.RuntimeException: class net.ripe.hadoop.pcap.DnsPcapReader not net.ripe.hadoop.pcap.PcapReader
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2144)
at net.ripe.hadoop.pcap.io.PcapInputFormat.initPcapReader(PcapInputFormat.java:56)
at net.ripe.hadoop.pcap.io.PcapInputFormat.initPcapRecordReader(PcapInputFormat.java:50)
at net.ripe.hadoop.pcap.io.PcapInputFormat.getRecordReader(PcapInputFormat.java:38)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:667)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:323)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:445)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:227)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:756)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.RuntimeException: class net.ripe.hadoop.pcap.DnsPcapReader not net.ripe.hadoop.pcap.PcapReader
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2138)
... 21 more
Failed with exception java.io.IOException:java.lang.NullPointerException
Time taken: 0.162 seconds

Best Regards,
Mustafa

NullPointerException at "NewTrackingRecordReader.initialize(MapTask.java:548)"

Hi, I'm trying to use hadoop-pcap to work with some pcap files on hadoop, but I get the issue in the title.

Here follows my configuration.

I'm using cloudera 5.5 virtual machine and I compiled hadoop-pcap by cloning the repo and issuing the following command:
mvn -Dskip-tests clean package
This produced the file hadoop-pcap-lib-1.2-SNAPSHOT.jar.

I imported this file in Eclipse and I rewrote your example found here to respect new hadoop standards. Here is my code (I had to attach .txt files because github doesn't accept .java)
PcapCount.txt
PcapMapper.txt
PcapReducer.txt

With Eclipse I created a jar file (pcapcount.jar) and I tried to execute it in hadoop with the following commands:

export HADOOP_CLASSPATH=/home/cloudera/pcap-tests/hadoop-pcap-lib-1.2-SNAPSHOT.jar
hadoop jar /home/cloudera/pcap-tests/pcapcount.jar mypkg.PcapCount -libjars /home/cloudera/pcap-tests/hadoop-pcap-lib-1.2-SNAPSHOT.jar /user/cloudera/pcapcount/input /user/cloudera/pcapcount/output

The mapreduce job can't start and this is the error trace: ErrorTrace.txt.

I searched for the error on the web and found this useful information:

This should be the class where the NullPointerException happens (line 548): org.apache.hadoop.mapred.MapTask
I think the problem is in your class PcapInputFormat.java. As stated in this thread, Hadoop: NullPointerException with Custom InputFormat, your custom InputFormat should set the backend variable.

It would be great if you can look into this or give me some advice. I am quite new to hadoop.
Thanks.

java.lang.ClassCastException: net.ripe.hadoop.pcap.io.PcapInputFormat cannot be cast to org.apache.hadoop.mapred.InputFormat

I was trying to use the class in Spark, but it fails every time with the above error. It works great in Hive, but fails every time in Spark:

hc = HiveContext(sc)
hc.sql("ADD JAR /home/hadoop/jars/hadoop-pcap-serde-1.1-jar-with-dependencies.jar")
hc.sql("ADD JAR /home/hadoop/jars/p3-lib.jar")  # customized version implementing an Oracle dissector
hc.sql("SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
hc.sql("SET mapred.max.split.size=104857600")
hc.sql("SET net.ripe.hadoop.pcap.io.reader.class=net.ripe.hadoop.pcap.OraclePCapReader")

secSQL = """CREATE EXTERNAL TABLE ora_pcap_sec (id bigint, ts_usec double, src string, src_port int, dst string, dst_port int, tns_auth_user string, tns_aut_app_user string, tns_auth_passwd string, tns_auth_terminal string, tns_auth_program_nm string, tns_auth_machine string)
ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer' WITH SERDEPROPERTIES("serialization.encoding"='SJIS')
STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///pcaps/'"""

hc.sql(secSQL)

sSec = "FROM ora_pcap_sec select *"

pgSec = hc.sql(sSec)

pgSec.write.jdbc(postgre_url,"ora_pcap_sec", "append", myPgProps)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:923)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:923)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2305)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2305)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2305)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2304)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:670)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:446)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: net.ripe.hadoop.pcap.io.PcapInputFormat cannot be cast to org.apache.hadoop.mapred.InputFormat
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:188)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more

Am I missing something?

Ignore malformed packets

Some packets can cause various Exceptions (NPE, ArrayIndexOutOfBounds). We should have functionality to allow the user to specify if they wish to simply ignore such packets and proceed with the remaining stream of data.
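The requested option could look roughly like the sketch below: a flag decides whether a runtime exception from decoding one packet aborts the run or just skips that packet. The class, method and parameter names are hypothetical and the decoder is passed in as a plain function, so this shows the control flow only, not the library's reader API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: optionally skip packets whose decoding throws.
class LenientDecodeSketch {
    static <T> List<T> decodeAll(List<byte[]> rawPackets,
                                 Function<byte[], T> decoder,
                                 boolean ignoreMalformed) {
        List<T> decoded = new ArrayList<>();
        for (byte[] raw : rawPackets) {
            try {
                decoded.add(decoder.apply(raw));
            } catch (RuntimeException e) { // e.g. NPE or ArrayIndexOutOfBoundsException
                if (!ignoreMalformed) {
                    throw e; // strict mode: fail on the first bad packet
                }
                // lenient mode: drop this packet and continue with the rest
            }
        }
        return decoded;
    }
}
```

In practice a counter of skipped packets would be worth adding so that silently dropped data remains visible.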

dns_flags as a map

Has anyone considered returning the dns_flags as a map, instead of a string? Thus you can run queries that check, for example, if it's a recursive query, like

SELECT * from pcap where dns_flags['rd'];

currently that can be expressed as

select * from pcap where array_contains(split(dns_flags, ' '), 'rd')

This idea can be extended as well to parse options in the OPT RR, to extract EDNS info or extended flags such as NSID
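A rough sketch of the conversion, assuming dns_flags is the space-separated string that the second query above splits on. The class and method names are hypothetical; a real change would also need the SerDe to expose the column as a Hive map type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turn a space-separated flags string into a lookup map.
class DnsFlagsSketch {
    static Map<String, Boolean> toMap(String dnsFlags) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (String flag : dnsFlags.trim().split("\\s+")) {
            if (!flag.isEmpty()) {
                flags.put(flag, Boolean.TRUE); // only flags that are set get an entry
            }
        }
        return flags;
    }
}
```

Flags that are not set have no entry, so a lookup such as dns_flags['rd'] would yield NULL rather than false for them.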

New fields addition

I did not find any help on how to extract additional fields from the binary files. For example, I want to extract bytes_transferred for each packet in the ndttrace files, but none of the changes I make take effect. Kindly suggest.

"Not a PCAP file" error

I dump a file like this:
sudo tcpdump -U -s 0 -w dump1.pcap

When I try to use a DnsPcapReader on it, it throws the "Not a PCAP file (Couldn't find magic number)" error.

If I hexdump the file, I do see the magic number near the start:
00000000 0a 0d 0d 0a 94 00 00 00 4d 3c 2b 1a 01 00 00 00

I tried skipping those first 8 bytes, and looking inside PcapReaderUtil, it looks like it is actually accessing the correct bytes to do the reverse evaluation (finding a1, b2, c3, d4)--but it still fails evaluation. Any idea what I am doing wrong here?
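One observation on the hexdump: the leading bytes 0a 0d 0d 0a are the block type of a pcapng Section Header Block, and 4d 3c 2b 1a is the pcapng byte-order magic (0x1A2B3C4D), so the file appears to be pcapng rather than classic PCAP. A reader that only knows the a1 b2 c3 d4 magic would reject it regardless of byte skipping. The sketch below (hypothetical class, not library code) tells the formats apart from the first four bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: identify the capture file format from its first four bytes.
class CaptureFormatSketch {
    enum Format { PCAP, PCAP_SWAPPED, PCAPNG, UNKNOWN }

    static Format detect(byte[] head) {
        if (head.length < 4) {
            return Format.UNKNOWN;
        }
        int first = ByteBuffer.wrap(head).getInt(0); // ByteBuffer reads big-endian by default
        switch (first) {
            case 0xA1B2C3D4: return Format.PCAP;         // file bytes a1 b2 c3 d4
            case 0xD4C3B2A1: return Format.PCAP_SWAPPED; // file bytes d4 c3 b2 a1
            case 0x0A0D0D0A: return Format.PCAPNG;       // Section Header Block type
            default:         return Format.UNKNOWN;
        }
    }
}
```

If the check reports PCAPNG, converting the capture to classic PCAP before loading it is one possible workaround.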

The method close() of type PcapRecordReader must override a superclass method

I'm getting this error: The method close() of type PcapRecordReader must override a superclass method
when using the library with Spark to get an RDD. I'm using hadoopFile. I'm doing the same thing as the person who posted this issue: #42
But I still get that error. Any help, please?

How To Run Hadoop PCAP File In Eclipse

I am a beginner with the hadoop-pcap library. How do I run it and add it to a project?