
kite's Introduction

Kite

Kite is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

The goals of Kite are:

  • Codify expert patterns and practices for building data-oriented systems and applications.
  • Let developers focus on business logic, not plumbing or infrastructure.
  • Provide smart defaults for platform choices.
  • Support piecemeal adoption via loosely-coupled modules.

Eric Sammer recorded a webinar in which he talks about the goals of the project, which was then called CDK (the Cloudera Development Kit).

This project is organized into modules. Modules may be independent or have dependencies on other modules within Kite. When possible, dependencies on external projects are minimized.

Modules

The following modules currently exist.

Kite Data

The data module provides logical abstractions on top of storage subsystems (e.g. HDFS) that let users think and operate in terms of records, datasets, and dataset repositories. If you're looking to read or write records directly to/from a storage system, the data module is for you.
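A minimal sketch of the record-oriented API, assuming a dataset already exists at the URI shown (the URI and dataset name are illustrative, not part of this README):

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;

public class ReadEvents {
  public static void main(String[] args) {
    // Load an existing dataset by URI; "events" is a hypothetical name.
    Dataset<GenericRecord> events = Datasets.load(
        "dataset:hdfs:/tmp/data/events", GenericRecord.class);

    // Iterate over records, closing the reader when done.
    DatasetReader<GenericRecord> reader = events.newReader();
    try {
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    } finally {
      reader.close();
    }
  }
}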

Kite Maven Plugin

The Kite Maven Plugin provides Maven goals for packaging, deploying, and running distributed applications.
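To pull the plugin into a build, declare it in the project POM. The group and artifact IDs below follow Kite's naming convention; the version is an assumption:

<plugin>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-maven-plugin</artifactId>
  <version>1.1.0</version>
</plugin>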

Kite Morphlines

The Morphlines module reduces the time and skills necessary to build and change Hadoop ETL stream processing applications that extract, transform and load data into Apache Solr, Enterprise Data Warehouses, HDFS, HBase or Analytic Online Dashboards.
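To give a flavor of the configuration format, here is a minimal morphline sketch. The command names follow the morphline examples quoted in the issues below; treat the exact settings as assumptions:

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      # log every record flowing through the pipeline
      { logInfo { format : "record: {}", args : ["@{}"] } }
      # attach a static field to each record
      { addValues { source : etl } }
    ]
  }
]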

Kite Tools

The tools module provides command-line tools and APIs for performing common tasks with Kite.
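For example, the kite-dataset CLI used throughout the issues below can infer an Avro schema from a CSV file:

kite-dataset csv-schema sample.csv --class Sample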

Examples

Example code demonstrating how to use Kite can be found in the separate kite-examples GitHub repository: https://github.com/kite-sdk/kite-examples

License

Kite is provided under the Apache Software License 2.0. See the file LICENSE.txt for more information.

Building

To build using the default CDH dependencies, use:

mvn install

For Hadoop 2:

mvn install -Dhadoop.profile=2

For Hadoop 1:

mvn install -Dhadoop.profile=1

By default, Java 7 is used. To use Java 6, add -DjavaVersion=1.6 and -DjavaTargetVersion=1.6, e.g.

mvn install -DjavaVersion=1.6 -DjavaTargetVersion=1.6

kite's People

Contributors

abayer, awarring, bbaugher, bbrownz, bmahe-tango, busbey, dlyle65535, edwardskoviak, esammer, grchanan, jarcec, joey, lfrancke, markrmiller, mkwhitacre, mladkov, phunt, rbrush, rdblue, rvs, scheeser, smola, sodre, szvasas, tomwheeler, tomwhite, williamstw


kite's Issues

Kite CLI csv-import reads in only null values when --no-header flag not set

When importing a CSV file without a header and without specifying the --no-header flag, Kite gives no error message but imports only null values instead of the correct values.

$ ./kite-dataset csv-import path/to/sample.csv sample
> Added 3255 records to sample
$ ./kite-dataset show sample --num-records 5
> {"key0": null, "key1": null, "key2": null, "key3": null, "key4": null, "key5": null},
  {"key0": null, "key1": null, "key2": null, "key3": null, "key4": null, "key5": null},
  {"key0": null, "key1": null, "key2": null, "key3": null, "key4": null, "key5": null},
  {"key0": null, "key1": null, "key2": null, "key3": null, "key4": null, "key5": null},
  {"key0": null, "key1": null, "key2": null, "key3": null, "key4": null, "key5": null}

The Avro schema was generated with Kite and looks something like this:

  ...
  "fields": [
      {
          "name": "key0",
          "type": ["null", "string"]
      }, ...
  ]
  ...

I am using Kite version 1.1.0

Best,
Christoph

Running the kite-sdk commands in mapreduce mode

Hi,

I had a look at the Kite dataset code and found that Kite internally uses Apache Crunch to run MapReduce pipelines.

In my case, I invoke the Kite CLI from Oozie to import JSON data. But I noticed that by default, the Apache Crunch program runs MapReduce in LocalRunner mode. If I want to run the program in distributed MapReduce mode, how do I achieve that?

Regards,
Malathi

Spark with Sqoop and Kite - Mismatch in Command?

Trying to dig into this one. When Sqoop is used without Kite (i.e., no Parquet) there are no issues. The moment the job runs to export to Parquet, everything blows up. It seems like Kite may be the offender, but if you have somewhere else to point me I will gladly work upstream.

System:

  • Debian 9
  • Hadoop 2.9
  • Spark 2.3

Installed Dependencies (JARs):

  • sqoop-1.4.7-hadoop260
  • kite-data-mapreduce-1.1.0
  • kite-hadoop-compatibility-1.1.0.jar
  • kite-data-crunch-1.1.0
  • kite-data-core-1.1.0
  • avro-tools-1.8.2.jar
  • mysql-connector-java-5.1.42
  • parquet-tools-1.8.3

Error:

19/07/09 17:55:28 INFO mapreduce.Job: Job job_1562682312457_0020 failed with state FAILED due to: Job setup failed : java.lang.IllegalArgumentException: Parquet only supports generic and specific data models, type parameter must implement IndexedRecord
	at org.kitesdk.shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
	at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:96)
	at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:128)
	at org.kitesdk.data.spi.filesystem.FileSystemDataset$Builder.build(FileSystemDataset.java:687)
	at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:199)
	at org.kitesdk.data.Datasets.load(Datasets.java:108)
	at org.kitesdk.data.Datasets.load(Datasets.java:165)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.load(DatasetKeyOutputFormat.java:542)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.loadOrCreateJobDataset(DatasetKeyOutputFormat.java:569)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.access$300(DatasetKeyOutputFormat.java:67)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.setupJob(DatasetKeyOutputFormat.java:369)
	at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobSetup(CommitterEventHandler.java:255)
	at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:235)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


19/07/09 17:55:28 INFO mapreduce.Job: Counters: 2

Again, it only fails on the final conversion. I am not sure of the full details since the command is inside a parallel process. Any direction would be appreciated.

package.json file error when it updates from 0.97.0 to 0.98.0

I receive the error below when I try to update the latest version.

PS C:\Users> apm update kite
Package Updates Available (1)
└── kite 0.97.0 -> 0.98.0

Would you like to install these updates? (yes) yes

Installing kite@0.98.0 to C:\Users\user\.atom\packages failed

npm ERR! Windows_NT 10.0.17134
npm ERR! argv "C:\\Users\\user\\AppData\\Local\\atom\\app-1.28.1\\resources\\app\\apm\\bin\\node.exe" "C:\\Users\\user\\AppData\\Local\\atom\\app-1.28.1\\resources\\app\\apm\\node_modules\\npm\\bin\\npm-cli.js" "--globalconfig" "C:\\Users\\user\\.atom\\.apm\\.apmrc" "--userconfig" "C:\\Users\\user\\.atom\\.apmrc" "install" "C:\\Users\\user\\AppData\\Local\\Temp\\d-11869-56324-b01nm2.gtuyehr529\\package.tgz" "--runtime=electron" "--target=2.0.4" "--arch=x64" "--global-style" "--msvs_version=2015"
npm ERR! node v6.9.5
npm ERR! npm  v3.10.10
npm ERR! file C:\Users\user\.atom\.apm\md5\2.2.1\package\package.json
npm ERR! code EJSONPARSE

npm ERR! Failed to parse json
npm ERR! Unexpected token '\u0000' at 1:1
npm ERR!
npm ERR! ^
npm ERR! File: C:\Users\user\.atom\.apm\md5\2.2.1\package\package.json
npm ERR! Failed to parse package.json data.
npm ERR! package.json must be actual JSON, not just JavaScript.
npm ERR!
npm ERR! This is not a bug in npm.
npm ERR! Tell the package author to fix their package.json file. JSON.parse

npm ERR! Please include the following file with any support request:
npm ERR!     C:\Users\user\AppData\Local\Temp\apm-install-dir-11869-56324-1uic1wf.vlgt2fn7b9\npm-debug.log

It says I need to tell the package author to fix this.

kite-dataset fails on Mac OS X due to case insensitive filesystem while unpacking the JAR

The kite-tools-1.1.0-binary.jar fails on Mac OS X because the HFS+ filesystem is case-insensitive and the JAR contains both META-INF/LICENSE and META-INF/license. By default, HFS+ doesn't allow two filenames that differ only in case; it's case-preserving but case-insensitive.

You can verify that the JAR indeed contains both a license and a LICENSE entry with the command jar tvf kite-tools-1.1.0-binary.jar | grep -i license

This filename clash renders the JAR unusable: when Hadoop tries to unpack it, it throws an IOException: Mkdirs failed to create <tmpdir>.../hadoop-unjar/.../META-INF/license:

kite-dataset csv-schema movies.csv --record-name Movie                                                                                                                     
/Users/ecerulm/bin/kite-dataset debug: Using HADOOP_COMMON_HOME=/Users/ecerulm/.local/stow/hadoop-2.8.1/
/Users/ecerulm/bin/kite-dataset debug: Using HADOOP_MAPRED_HOME=/Users/ecerulm/.local/stow/hadoop-2.8.1//../hadoop-mapreduce
/Users/ecerulm/bin/kite-dataset debug: Using HBASE_HOME=/Users/ecerulm/.local/stow/hadoop-2.8.1//../hbase
/Users/ecerulm/bin/kite-dataset debug: Using HIVE_HOME=/Users/ecerulm/.local/stow/hadoop-2.8.1//../hive
/Users/ecerulm/bin/kite-dataset debug: Using HIVE_CONF_DIR=/Users/ecerulm/.local/stow/hadoop-2.8.1//../hive/conf
/Users/ecerulm/bin/kite-dataset debug: Using HADOOP_CLASSPATH=/Users/ecerulm/bin/kite-dataset::
Exception in thread "main" java.io.IOException: Mkdirs failed to create /var/folders/j5/8yjty44917v3_ydfjyy0gz0c0000gn/T/hadoop-unjar7609709732056315890/META-INF/license
	at org.apache.hadoop.util.RunJar.ensureDirectory(RunJar.java:140)
	at org.apache.hadoop.util.RunJar.unJar(RunJar.java:109)
	at org.apache.hadoop.util.RunJar.unJar(RunJar.java:85)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:222)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

Is it possible to change the JAR build process to rename the META-INF/license dir to META-INF/licenses? Googling around, I found that the Maven [ApacheLicenseResourceTransformer](https://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ApacheLicenseResourceTransformer) may solve the problem.
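A sketch of what that shade-plugin configuration might look like; its placement in the Kite build is an assumption, and the transformer class comes from the linked Maven documentation:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- prevents duplicated META-INF/LICENSE entries in the shaded jar -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>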

Alternatively, maybe move or rename the META-INF/LICENSE (Jackson JSON processor license).

Is this possible? Otherwise, kite-dataset cannot be used (as far as I understand) on Mac OS X.

Can't disable codec for avro

In Avro, a null codec means no compression (http://avro.apache.org/docs/current/spec.html#Required+Codecs); however, it looks like in Kite v17 there is no option to disable compression. If you don't set a compression type for Avro, it defaults to snappy.

I assumed that CompressionType.Uncompressed would disable the compression but instead it throws:

java.lang.IllegalStateException: Format avro doesn't support compression format uncompressed
    at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
    at org.kitesdk.data.DatasetDescriptor.checkCompressionType(DatasetDescriptor.java:1010)
    at org.kitesdk.data.DatasetDescriptor.<init>(DatasetDescriptor.java:126)
    at org.kitesdk.data.DatasetDescriptor$Builder.build(DatasetDescriptor.java:940)
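For reference, a minimal sketch of the call that triggers this; the builder methods are assumed from the stack trace and the CompressionType name above:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.kitesdk.data.CompressionType;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Formats;

public class UncompressedAvroRepro {
  public static void main(String[] args) {
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("id").endRecord();
    // Expected to throw: Format avro doesn't support compression
    // format uncompressed
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(schema)
        .format(Formats.AVRO)
        .compressionType(CompressionType.Uncompressed)
        .build();
  }
}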

I think the choice should be left to the user rather than forcing these settings.

Upgrade to Solr 5.1

Solr 4.x clients are incompatible with Solr 5.x, so Kite should upgrade to version 5.1. Note that the relevant class has been renamed from ConcurrentUpdateSolrServer to ConcurrentUpdateSolrClient.

Getting AvroRuntimeException when using @Nullable on POJO field with 0.17.0

Hi,
I have a simple POJO with a field annotated @Nullable, and I get an exception trying to write it to a dataset. The generated schema looks fine, but the exception is thrown while writing the data.

Pojo:

import org.apache.avro.reflect.Nullable; // assumed: Avro's reflect annotation, per the stack trace below

public class TestPojo {
    Integer id;
    Long timestamp;
    @Nullable
    String description;

    // setters and getters

}

Full code is here: https://github.com/trisberg/kite-pojo

The same code seems to work fine with 0.16.0 and earlier, so I'm guessing something changed in 0.17.0 or I'm doing something stupid.

Full exception:

Caused by: org.apache.avro.AvroRuntimeException: Nested union: ["null",["null","string"]]
    at org.apache.avro.Schema$UnionSchema.<init>(Schema.java:766)
    at org.apache.avro.Schema.createUnion(Schema.java:166)
    at org.apache.avro.reflect.ReflectData.makeNullable(ReflectData.java:406)
    at org.kitesdk.data.spi.DataModelUtil$AllowNulls.createFieldSchema(DataModelUtil.java:56)
    at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:354)
    at org.apache.avro.specific.SpecificData.getSchema(SpecificData.java:154)
    at org.kitesdk.data.spi.DataModelUtil.getReaderSchema(DataModelUtil.java:147)
    at org.kitesdk.data.spi.DataModelUtil.resolveType(DataModelUtil.java:124)
    at org.kitesdk.data.spi.AbstractDataset.<init>(AbstractDataset.java:44)
    at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:85)
    at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:115)
    at org.kitesdk.data.spi.filesystem.FileSystemDataset$Builder.build(FileSystemDataset.java:541)
    at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:143)
    at org.kitesdk.data.spi.AbstractDatasetRepository.create(AbstractDatasetRepository.java:34)
    at com.springdeveloper.data.WritePojosApp.init(WritePojosApp.java:53)
    at com.springdeveloper.data.WritePojosApp.run(WritePojosApp.java:31)
    at org.springframework.boot.SpringApplication.runCommandLineRunners(SpringApplication.java:677)
    ... 11 more

Any ideas?

Can't use external Configuration with Parquet dataset

Short story: the way Kite currently writes/reads Parquet datasets, it's not possible to use a custom Hadoop Configuration. This comes up with Spring Hadoop when Hadoop's minicluster is used (the Configuration is built by the minicluster classes).

This leads to the exception shown below:

java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:53548/tmp/dataset/test/simplepojo/c5716ae2-df6a-4ce1-b240-85255d40d728.parquet, expected: file:///

More details can be found from SHDP jira ticket https://jira.spring.io/browse/SHDP-358. We're currently using Kite version 0.13.0.

Partition writers get closed early and often

Hi,

Having some strangeness when writing partitioned dataset.

Even when the number of buckets in the partition is no more than 10, I see some partition writers being closed prematurely.

I have a sample project here: https://github.com/trisberg/hdfs-examples/tree/master/kite-dataset

Am I doing something wrong here? I'd expect all partitions to contain a single file at the end of the run.

Partial logging output:

2014-06-05 18:06:57.763 DEBUG 27109 --- [ main] o.k.d.s.f.PartitionedDatasetWriter : Closing writer:org.kitesdk.data.spi.filesystem.FileSystemWriter@f5d6449 for partition:StorageKey{values=[0]}
2014-06-05 18:06:57.792 DEBUG 27109 --- [ main] o.k.d.s.f.PartitionedDatasetWriter : Closing writer:org.kitesdk.data.spi.filesystem.FileSystemWriter@6890f7c5 for partition:StorageKey{values=[4]}
2014-06-05 18:06:57.798 DEBUG 27109 --- [ main] o.k.d.s.f.PartitionedDatasetWriter : Closing writer:org.kitesdk.data.spi.filesystem.FileSystemWriter@35f0ade4 for partition:StorageKey{values=[1]}
2014-06-05 18:06:57.828 DEBUG 27109 --- [ main] o.k.d.s.f.PartitionedDatasetWriter : Closing writer:org.kitesdk.data.spi.filesystem.FileSystemWriter@690c1955 for partition:StorageKey{values=[7]}
2014-06-05 18:06:57.835 DEBUG 27109 --- [ main] o.k.d.s.f.PartitionedDatasetWriter : Closing writer:org.kitesdk.data.spi.filesystem.FileSystemWriter@1cb08785 for partition:StorageKey{values=[0]}
2014-06-05 18:06:57.859 DEBUG 27109 --- [ main] o.k.d.s.f.PartitionedDatasetWriter : Closing writer:org.kitesdk.data.spi.filesystem.FileSystemWriter@2fe48c89 for partition:StorageKey{values=[1]}

Hive Dataset as external table with HDFS Dataset

I create a dataset on HDFS with a schema and partition strategy:

kite-dataset create dataset:hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw --schema sensorRecord.avsc --partition-by partition.json

and use Gobblin to continuously ingest data from Kafka to HDFS. The partition strategy looks like:

[
  {"type": "identity", "source": "src", "name": "source"},
  {"type": "year",     "source": "timestamp"},
  {"type": "month",    "source": "timestamp"},
  {"type": "day",      "source": "timestamp"},
  {"type": "hour",     "source": "timestamp"}
]

This part works well.

Then I try to use Hive to query this data, so I create a new Hive dataset as an external table by assigning the --location parameter:

kite-dataset create depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw

Then I can find the table default/depa_raw and data in Hive.

But one thing is wrong: as data keeps coming from Kafka to HDFS, new partition directories appear in HDFS, but no partitions are created automatically in the Hive table! This means I can't see newly arrived data in Hive.

So what can I do to solve this problem? (I just want to see newly arriving data in Hive.)

  • I tried kite-dataset delete depa_raw, and wanted to create a new external Hive table, but all the data on HDFS was gone after the command.
  • I tried kite-dataset update depa_raw --location hdfs://10.0.1.63:8020/user/pnda/PNDA_datasets/datasets/kafka/depa_raw but nothing happened.

"not" command breaks the flow

The "not" command does not continue processing children when the condition fails. Consider this morphline:

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        tryRules {
          throwExceptionIfAllRulesFailed : true
          rules : [
            {
              commands : [
                # todo: test with a command that replicates the record multiple times
                { not { fail { } } }
                { logInfo { format : "hello" } }                
                { addValues { foo : bar } }
                { setValues { myHeader : [myTest] } }
                { copyTest { name : iter, count : 2 } }
                # { fail {} }
              ]
            }

            {
              commands : [
                { logInfo { format : "hello2" } }                
                { addValues { foo2 : bar2 } }
                { setValues { myHeader : [myTest] } }
              ]
            }

          ]
        }
      }
      { equals { myHeader : [myTest] } }
    ]
  }
]

I would expect execution of the first rule to continue after { not { fail { } } }. However, no more commands are executed, and the rule is considered successful. So the record is lost.

Create HBase Dataset ERROR

When I tried to create an HBase dataset:

kite-dataset create dataset:hbase:10.0.1.214:2181/sensor -s sensorRecord.avsc -p partition.json -m map.json

it returns:

IO error: Cannot open schema table

Those are my config files:

schema:

{
  "fields": [
    { "name": "timestamp", "type": "long" },
    { "name": "sensor_group", "type": "string" },
    { "name": "sensor", "type": "string" },
    { "name": "uuid", "type": "string"},
    { "name": "value", "type": "string" },
    { "name": "src", "type": "string"}
  ],
  "name": "sensorRecord",
  "type": "record"
}

partition:

[
  {"type": "identity", "source": "timestamp"},
  {"type": "year",     "source": "timestamp"},
  {"type": "month",    "source": "timestamp"},
  {"type": "day",      "source": "timestamp"},
  {"type": "hour",     "source": "timestamp"}
]

mapping:

[ {
  "source" : "timestamp",
  "type" : "key"
}, {
  "source" : "sensor_group",
  "type" : "column",
  "family" : "v",
  "qualifier" : "sensor_group"
}, {
  "source" : "sensor",
  "type" : "column",
  "family" : "v",
  "qualifier" : "sensor"
}, {
  "source" : "uuid",
  "type" : "column",
  "family" : "v",
  "qualifier" : "uuid"
}, {
  "source" : "value",
  "type" : "column",
  "family" : "v",
  "qualifier" : "value"
}, {
  "source" : "src",
  "type" : "column",
  "family" : "v",
  "qualifier" : "src"
} ]

Reporting a security issue

I have one security-related question about Kite. I am not sure whether it's going to be considered as a vulnerability or a security enhancement - it depends on how one looks at it. Before opening an issue, I'd like to share it privately with the project maintainers. Who would be the best contact? Unfortunately, I didn't find any contact for that. In fact, the project doesn't look active anymore.

@szvasas @whoschek @ebogi (found you in the commit history)

extractAvroPaths does not traverse arrays with the '[]' notation when the array is part of a union

When an Avro schema contains an array within a union, the '[]' operator does not traverse the array.

Given schema:
{ "namespace": "org.kitesdk.morphline.avro", "type": "record", "name": "ArrayInUnionTestRecord", "fields": [ {"name": "items", "type": {"type": "array", "items": "string"}}, {"name": "itemsInUnion", "type": ["null", {"type": "array", "items":"string"}], "default": null} ] }

and configuration:

extractAvroPaths {
  paths : {
    "/items[]" : "/items[]"
    "/itemsInUnion[]" : "/itemsInUnion[]"
  }
}

/items[] is properly extracted but /itemsInUnion[] is not.

Support Timestamp type in HiveSchemaConverter

Avro supports a timestamp data type as of version 1.8.1, but kite-sdk still cannot handle timestamp types and depends on an older version of Avro.

Exception stack trace while trying to convert a Hive table containing a timestamp to Avro:

java.lang.IllegalArgumentException: Cannot convert unsupported type: timestamp

at org.kitesdk.shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
at org.kitesdk.data.spi.hive.HiveSchemaConverter.convert(HiveSchemaConverter.java:199)
at org.kitesdk.data.spi.hive.HiveSchemaConverter.convertField(HiveSchemaConverter.java:173)
at org.kitesdk.data.spi.hive.HiveSchemaConverter.convertTable(HiveSchemaConverter.java:132)

Hive metastore server "out of sequence response": not thread safe?

    at org.kitesdk.data.filesystem.PartitionedDatasetWriter.write(PartitionedDatasetWriter.java:96)
    at com.chinanetcenter.mtools.task.MockWriteTask.run(MockWriteTask.java:78)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: Hive metastore exception
    at org.kitesdk.data.hcatalog.HCatalog.addPartition(HCatalog.java:151)
    at org.kitesdk.data.hcatalog.HCatalog.addPartition(HCatalog.java:138)
    at org.kitesdk.data.hcatalog.HCatalogMetadataProvider.partitionAdded(HCatalogMetadataProvider.java:95)
    at org.kitesdk.data.filesystem.PartitionedDatasetWriter$DatasetWriterCacheLoader.load(PartitionedDatasetWriter.java:181)
    at org.kitesdk.data.filesystem.PartitionedDatasetWriter$DatasetWriterCacheLoader.load(PartitionedDatasetWriter.java:154)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
    at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834)
    at org.kitesdk.data.filesystem.PartitionedDatasetWriter.write(PartitionedDatasetWriter.java:90)
    ... 8 more
Caused by: org.apache.thrift.TApplicationException: append_partition_by_name_with_environment_context failed: out of sequence response
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1296)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.append_partition_by_name_with_environment_context(ThriftHiveMetastore.java:1280)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:436)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.appendPartition(HiveMetaStoreClient.java:430)
    at org.kitesdk.data.hcatalog.HCatalog.addPartition(HCatalog.java:145)

Can't use Dataset features with Hadoop 1.2.1

I'm assuming the Kite SDK isn't just targeting Hadoop v2. The PathIterator class uses FileStatus.isFile(), which is only available in Hadoop v2. In v1 there is an isDir() method, which is deprecated in v2. Not sure if you would want to use the deprecated method or use reflection to check for available methods before invoking.
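A sketch of the reflection fallback, assuming FileStatus.isFile() exists on Hadoop 2 and the deprecated isDir() exists on both lines:

import org.apache.hadoop.fs.FileStatus;

public class FileStatusCompat {
  // Prefer the Hadoop 2 FileStatus.isFile() when present; otherwise fall
  // back to the Hadoop 1 isDir(), deprecated in v2 but available in both.
  public static boolean isFile(FileStatus status) {
    try {
      return (Boolean) FileStatus.class.getMethod("isFile").invoke(status);
    } catch (NoSuchMethodException e) {
      return !status.isDir(); // Hadoop 1: not a directory means a file
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }
}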

Kite errors out when quoted csv header contains space

I tried the Kite SDK to create an Avro schema from a CSV file using the command below:
kite-dataset csv-schema test.CSV --class Sample

It fails when a column header has spaces:
"field 1","field 2","field 3"
"Agwam","Agwam, MA","25007"

with the following error:

Unknown error: Bad header for field, should start with a character or _ and can contain only alphanumerics and _ 0: "field 1"

Success example:
"field1","field2","field3"
"Agwam","Agwam, MA","25007"

Not sure if there is a way to handle it.

What dependencies to set when using SBT

I'm trying to build a Scala app that uses KiteSDK to put some data in HDFS. I'm using SBT as my build tool, so I can't use a parent POM. I'm running into problems getting the right dependencies in, which leads to this error when trying to build a DatasetDescriptor:

scala> val descriptor = new DatasetDescriptor.Builder().schema(accountSchema).build()
warning: Class org.apache.hadoop.fs.Path not found - continuing with a stub.
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
  ... 43 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
  at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 43 more

My sbt file:

name := "myapp"

organization := "com.datadudes"

version := "0.1-SNAPSHOT"

scalaVersion := "2.11.4"

libraryDependencies ++= Seq(
  "org.scala-lang.modules"    %% "scala-xml"        % "1.0.2",
  "org.apache.avro"           % "avro"              % "1.7.5",
  "com.force.api"             % "force-wsc"         % "33.0.1" exclude("org.antlr", "ST4"),
  "com.force.api"             % "force-partner-api" % "33.0.1",
  "com.datadudes"             %% "wsdl2avro"        % "0.1-SNAPSHOT",
  "org.kitesdk"               % "kite-data-core"    % "0.17.1",
  "org.specs2"                %% "specs2-junit"     % "2.4.15"    % "test"
)

What other dependencies do I need, next to kite-data-core in order to be able to use Kite Datasets?
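Not a confirmed answer, but kite-data-core expects Hadoop classes to be provided by the runtime, and the missing org.apache.hadoop.fs classes suggest adding a Hadoop client dependency to the build (the version below is an assumption):

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.1"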

Logic to infer data type and locale specific number formats

private static final Pattern LONG = Pattern.compile("\\d+");
private static final Pattern DOUBLE = Pattern.compile("\\d*\\.\\d*[dD]?");
private static final Pattern FLOAT = Pattern.compile("\\d*\\.\\d*[fF]?");

I suggest that locale-specific number formats should also be supported. What do you think about custom format recognizers for types like dates, UUIDs, etc.?

Also, it looks like only the 1st line of the CSV is used for type inference, not the first 25 as expected:
private static final int DEFAULT_INFER_LINES = 25;

I have a file where the 1st row of data contains a column like "device model" with digits only, while the 2nd row also contains letters. The inferred schema contains the union type "null", "string", and the import failed on that same 2nd row.

GCS Support

You have S3 support. It's pretty much a copy/paste to add GCS support, I would wager...

NoClassDefFoundError on csv-import (HDFS to Hive)

After building the latest from source (e1d3d6f), attempting to run csv-import from HDFS to a Hive dataset fails during the MR with the following:

2016-12-08 14:39:59,046 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-08 14:39:59,095 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2016-12-08 14:39:59,095 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (appAttemptId { application_id { id: 2 cluster_timestamp: 1481206905363 } attemptId: 2 } keyId: 1120164819)
2016-12-08 14:39:59,417 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.
2016-12-08 14:39:59,419 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter set in config null
2016-12-08 14:40:00,098 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/HiveOutputFormat
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.kitesdk.data.spi.hive.HiveUtils.getHiveParquetOutputFormat(HiveUtils.java:462)
at org.kitesdk.data.spi.hive.HiveUtils.(HiveUtils.java:94)
at org.kitesdk.data.spi.hive.Loader.newHiveConf(Loader.java:161)
at org.kitesdk.data.spi.hive.Loader.access$100(Loader.java:41)
at org.kitesdk.data.spi.hive.Loader$ManagedBuilder.getFromOptions(Loader.java:104)
at org.kitesdk.data.spi.hive.Loader$ManagedBuilder.getFromOptions(Loader.java:99)
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:111)
at org.kitesdk.data.Datasets.load(Datasets.java:103)
at org.kitesdk.data.Datasets.load(Datasets.java:165)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.load(DatasetKeyOutputFormat.java:549)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.getOutputCommitter(DatasetKeyOutputFormat.java:506)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:517)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:499)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1598)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:499)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:285)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1556)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1553)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1486)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.HiveOutputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 37 more

Running this against HDP 2.5; I didn't have this issue with the binary packaged for 1.1.0 (though that had an issue, KITE-1073, that prevented the MR from working, failing with a wrong-FS exception).

Cheers
Jason

importCommands using globs does not work with Java 9

the "importCommands" parsing code for dealing with prefix based globs (ie: org.kitesdk.** or com.foo.morphlines.*) Doesn't work using Java9.

This seems to be because the underlying classpath scanning is built on the ClassPath.from(ClassLoader) API shaded from Guava, which has a very notable limitation documented...

* <p>Currently only {@link URLClassLoader} and only {@code file://} urls are supported.

...but in Java 9, URLClassLoader is (apparently) rarely used as a result of the new (Jigsaw) module system.

The only workaround seems to be to change all morphline configs to remove * globs from importCommands declarations and enumerate every CommandBuilder implementation needed for the config.

I suggest morphlines switch to SPI-based scanning for CommandBuilders, since that is a good plugin API that has been supported by the JVM for a long time and continues to work in Java 9.
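A minimal sketch of the SPI approach, assuming each jar registers its builders in META-INF/services/org.kitesdk.morphline.api.CommandBuilder:

import java.util.ServiceLoader;
import org.kitesdk.morphline.api.CommandBuilder;

public class SpiScan {
  public static void main(String[] args) {
    // ServiceLoader reads META-INF/services entries directly; it does not
    // scan the classpath or depend on URLClassLoader internals, so it
    // keeps working on Java 9.
    for (CommandBuilder builder : ServiceLoader.load(CommandBuilder.class)) {
      System.out.println(builder.getNames());
    }
  }
}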

See also: https://issues.apache.org/jira/browse/SOLR-8876

Your project kite-sdk kite is using buggy third-party libraries [WARNING]

Hi, there!

We are a research team working on third-party library analysis. We have found that some widely used third-party libraries in your project have major/critical bugs, which will degrade the quality of your project. We highly recommend you update those libraries to new versions.

We have attached the buggy third-party libraries and the corresponding Jira issue links below so you can get more detailed information. We analyzed the API calls related to the following libraries and found one library whose API calls might invoke buggy methods from earlier releases.

  1. commons-logging commons-logging
    version: 1.1.1
    Jira issues:
    Unit tests fail on linux with java16
    deadlock on re-registration of logger
    Potential missing privileged block for class loader
    Log4JLogger uses deprecated static members of Priority such as INFO
    LogFactory/LogFactoryImpl ingore Throwable
    LogFactory.nullClassLoaderFactory is not properly synchronized
    SimpleLog.log - unsafe update of shortLogName
    BufferedReader is not closed properly
  2. commons-io commons-io
    version: 2.5
    Jira issues:
    ant test fails - resources missing from test classpath
    Exceptions are suppressed incorrectly when copying files.
    ThresholdingOutputStream.thresholdReached() results in FileNotFoundException
    Tailer.run race condition runaway logging
    Thread bug in FileAlterationMonitor#stop(int)
    2.5 ExceptionInInitializerError
  3. commons-codec commons-codec
    version: 1.4
    API call in your project: org.apache.commons.codec.binary.Base64.setInitialBuffer(byte[],int,int)
    Jira issues:
    Base64InputStream#read(byte[]) incorrectly returns 0 at end of any stream which is multiple of 3 bytes long
    ArrayIndexOutOfBoundsException when doing multiple reads() on encoding Base64InputStream
    Base64 encoding issue for larger avi files
    org.apache.commons.codec.net.URLCodec.ESCAPE_CHAR isn't final but should be
    org.apache.commons.codec.language.RefinedSoundex.US_ENGLISH_MAPPING should be package protected MALICIOUS_CODE
    org.apache.commons.codec.language.Soundex.US_ENGLISH_MAPPING should be package protected MALICIOUS_CODE
    Caverphone encodes names starting and ending with "mb" incorrectly.
    All links to fixed bugs in the "Changes Report" http://commons.apache.org/codec/changes-report.html point nowhere; e.g. http://issues.apache.org/jira/browse/34157. Looks as if all JIRA tickets were renumbered.
    Regression: Base64.encode(chunk=true) has bug when input length is multiple of 76
    DigestUtils: MD5 checksum is not calculated correctly on linux64-platforms
    new Base64().encode() appends a CRLF, and chunks results into 76 character lines
    Base64 encode() method is no longer thread-safe, breaking clients using it as a shared BinaryEncoder
    Base64 default constructor behaviour changed to enable chunking in 1.4
    Base64InputStream causes NullPointerException on some input
    Base64.encodeBase64String() shouldn't chunk
  4. commons-lang commons-lang
    version: 2.5
    Jira issues:
    Testing with JDK 1.7
    Some StringUtils methods should take an int character instead of char to use String API features.
    SystemUtils.getJavaVersionAsFloat throws StringIndexOutOfBoundsException on Android runtime/Dalvik VM
    NumberUtils createNumber throws a StringIndexOutOfBoundsException when argument containing "e" and "E" is passed in
    FastDateFormat.format() outputs incorrect week of year because locale isn't respected
    RandomStringUtils.random(count, 0, 0, false, false, universe, random) always throws java.lang.ArrayIndexOutOfBoundsException
    Exception when combining custom and choice format in ExtendedMessageFormat

Sincerely~
FDU Software Engineering Lab
March 14th, 2019

Kite Avro SDK: Merge Can Create 'type' Lists with Default Not as First Element

From the Avro spec

Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.

However, merging two record schemas does not always enforce this rule. For example: "type":["string","null"], "default":null which should be "type":["null", "string"], "default":null. I can reproduce this bug with the following:

  1. A Json String with a null value for key X
  2. A Json String with a non-null value for key X
  3. A Json String with no entry for X (DNE)
  4. Schemas inferred from each of the sample Json Strings
  5. The schemas merged in the following way: merge(non-null, merge(null, dne)) or merge(non-null, merge(dne, null))
scala> val nul = """{"key":null}"""
nul: String = {"key":null}

scala> val dne = """{"other":3}"""
dne: String = {"other":3}

scala> val str = """{"key":"hello"}"""
str: String = {"key":"hello"}

scala> def stream(s: String): InputStream = new ByteArrayInputStream(s.getBytes("UTF-8"))
stream: (s: String)java.io.InputStream

scala> val nulSchema = JsonUtil.inferSchema(stream(nul), "com.example", 1)
nulSchema: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"key","type":"null","doc":"Type inferred from 'null'"}]}

scala> val dneSchema = JsonUtil.inferSchema(stream(dne), "com.example", 1)
dneSchema: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"other","type":"int","doc":"Type inferred from '3'"}]}

scala> val nPlusDne = SchemaUtil.merge(dneSchema, nulSchema)
nPlusDne: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"other","type":["null","int"],"doc":"Type inferred from '3'","default":null},{"name":"key","type":"null","doc":"Type inferred from 'null'","default":null}]}

scala> val strSchema = JsonUtil.inferSchema(stream(str), "com.example", 1)
strSchema: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"key","type":"string","doc":"Type inferred from '\"hello\"'"}]}

scala> val merged = SchemaUtil.merge(strSchema, nPlusDne)
[WARNING] Avro: Invalid default for field key: null not a ["string","null"]
merged: org.apache.avro.Schema = {"type":"record","name":"example","namespace":"com","fields":[{"name":"key","type":["string","null"],"doc":"Type inferred from '\"hello\"'","default":null},{"name":"other","type":["null","int"],"doc":"Type inferred from '3'","default":null}]}

The final merge produces "type":["string","null"], "default":null, despite the type of the default value needing to be the first element of the type list.

JsonUtil.inferSchema is outputting incorrect names.

Hello,

I have been testing the Kite-sdk maven dependency due to a recommendation made by this question:

https://stackoverflow.com/questions/46556614/is-there-a-way-to-programmatically-convert-json-to-avro-schema/46566592#46566592

The problem is, when I use the small script I wrote:

package avro_schema_builder;

import java.io.File;
import java.io.FileReader;

import org.json.simple.JSONArray;
import org.json.simple.parser.JSONParser;
import org.kitesdk.data.spi.JsonUtil;

public class avro_schema_builder {

	public static void main(String[] args) throws Exception {

		File file = new File("src/avro_schema_builder/avro.json");
		String path = file.getCanonicalPath();
		JSONParser jsonParser = new JSONParser();
		FileReader reader = new FileReader(path);
		Object obj = jsonParser.parse(reader);
		JSONArray json = (JSONArray) obj;
		String avroSchema = JsonUtil.inferSchema(JsonUtil.parse(json.toString()), "test").toString();
		System.out.println(avroSchema);

	}

}

And a random online JSON I found:

[
	{
		"webapp": {
			"asd": {
				"ase": {
					"asf": 0
				}
			}
		},
		"servlet": [
			{
				"servletname": "cofaxCDS",
				"servletclass": "org.cofax.cds.CDSServlet",
				"initparam": {
					"configGlossary_installationAt": "Philadelphia, PA",
					"configGlossary_adminEmail": "[email protected]",
					"configGlossary_poweredBy": "Cofax",
					"configGlossary_poweredByIcon": "/images/cofax.gif",
					"configGlossary_staticPath": "/content/static",
					"templateProcessorClass": "org.cofax.WysiwygTemplate",
					"templateLoaderClass": "org.cofax.FilesTemplateLoader",
					"templatePath": "templates",
					"templateOverridePath": "",
					"defaultListTemplate": "listTemplate.htm",
					"defaultFileTemplate": "articleTemplate.htm",
					"useJSP": false,
					"jspListTemplate": "listTemplate.jsp",
					"jspFileTemplate": "articleTemplate.jsp",
					"cachePackageTagsTrack": 200,
					"cachePackageTagsStore": 200,
					"cachePackageTagsRefresh": 60,
					"cacheTemplatesTrack": 100,
					"cacheTemplatesStore": 50,
					"cacheTemplatesRefresh": 15,
					"cachePagesTrack": 200,
					"cachePagesStore": 100,
					"cachePagesRefresh": 10,
					"cachePagesDirtyRead": 10,
					"searchEngineListTemplate": "forSearchEnginesList.htm",
					"searchEngineFileTemplate": "forSearchEngines.htm",
					"searchEngineRobotsDb": "WEBINF/robots.db",
					"useDataStore": true,
					"dataStoreClass": "org.cofax.SqlDataStore",
					"redirectionClass": "org.cofax.SqlRedirection",
					"dataStoreName": "cofax",
					"dataStoreDriver": "com.microsoft.jdbc.sqlserver.SQLServerDriver",
					"dataStoreUrl": "jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon",
					"dataStoreUser": "sa",
					"dataStorePassword": "dataStoreTestQuery",
					"dataStoreTestQuery": "SET NOCOUNT ON;select test='test';",
					"dataStoreLogFile": "/usr/local/tomcat/logs/datastore.log",
					"dataStoreInitConns": 10,
					"dataStoreMaxConns": 100,
					"dataStoreConnUsageLimit": 100,
					"dataStoreLogLevel": "debug",
					"maxUrlLength": 500
				}
			},
			{
				"servletname": "cofaxEmail",
				"servletclass": "org.cofax.cds.EmailServlet",
				"initparam": {
					"mailHost": "mail1",
					"mailHostOverride": "mail2"
				}
			},
			{
				"servletclass": "org.cofax.cds.AdminServlet"
			},
			{
				"servletname": "fileServlet",
				"servletclass": "org.cofax.cds.FileServlet"
			},
			{
				"servletname": "cofaxTools",
				"servletclass": "org.cofax.cms.CofaxToolsServlet",
				"initparam": {
					"templatePath": "toolstemplates/",
					"log": 1,
					"logLocation": "/usr/local/tomcat/logs/CofaxTools.log",
					"logMaxSize": "",
					"dataLog": 1,
					"dataLogLocation": "/usr/local/tomcat/logs/dataLog.log",
					"dataLogMaxSize": "",
					"removePageCache": "/content/admin/remove?cache=pages&id=",
					"removeTemplateCache": "/content/admin/remove?cache=templates&id=",
					"fileTransferFolder": "/usr/local/tomcat/webapps/content/fileTransferFolder",
					"lookInContext": 1,
					"adminGroupID": 4,
					"betaServer": true
				}
			}
		],
		"servletmapping": {
			"cofaxCDS": "/",
			"cofaxEmail": "/cofaxutil/aemail/*",
			"cofaxAdmin": "/admin/*",
			"fileServlet": "/static/*",
			"cofaxTools": "/tools/*"
		},
		"taglib": {
			"tagliburi": "cofax.tld",
			"tagliblocation": "/WEBINF/tlds/cofax.tld"
		}
	}
]

I get the wrong output:

{
    "type": "array",
    "items": {
        "type": "record",
        "name": "test",
        "fields": [{
                "name": "webapp",
                "type": {
                    "type": "record",
                    "name": "webapp",
                    "fields": [{
                            "name": "asd",
                            "type": {
                                "type": "record",
                                "name": "webapp",
                                "namespace": "asd",
                                "fields": [{
                                        "name": "ase",
                                        "type": {
                                            "type": "record",
                                            "name": "webapp",
                                            "namespace": "ase.asd",
                                            "fields": [{
                                                    "name": "asf",
                                                    "type": "int",
                                                    "doc": "Type inferred from '0'"
                                                }
                                            ]
                                        },
                                        "doc": "Type inferred from '{\"asf\":0}'"
                                    }
                                ]
                            },
                            "doc": "Type inferred from '{\"ase\":{\"asf\":0}}'"
                        }
                    ]
                },
                "doc": "Type inferred from '{\"asd\":{\"ase\":{\"asf\":0}}}'"
            }, {
                "name": "servletmapping",
                "type": {
                    "type": "record",
                    "name": "servletmapping",
                    "fields": [{
                            "name": "cofaxAdmin",
                            "type": "string",
                            "doc": "Type inferred from '\"/admin/*\"'"
                        }, {
                            "name": "cofaxCDS",
                            "type": "string",
                            "doc": "Type inferred from '\"/\"'"
                        }, {
                            "name": "cofaxEmail",
                            "type": "string",
                            "doc": "Type inferred from '\"/cofaxutil/aemail/*\"'"
                        }, {
                            "name": "fileServlet",
                            "type": "string",
                            "doc": "Type inferred from '\"/static/*\"'"
                        }, {
                            "name": "cofaxTools",
                            "type": "string",
                            "doc": "Type inferred from '\"/tools/*\"'"
                        }
                    ]
                },
                "doc": "Type inferred from '{\"cofaxAdmin\":\"/admin/*\",\"cofaxCDS\":\"/\",\"cofaxEmail\":\"/cofaxutil/aemail/*\",\"fileServlet\":\"/static/*\",\"cofaxTools\":\"/tools/*\"}'"
            }, {
                "name": "taglib",
                "type": {
                    "type": "record",
                    "name": "taglib",
                    "fields": [{
                            "name": "tagliburi",
                            "type": "string",
                            "doc": "Type inferred from '\"cofax.tld\"'"
                        }, {
                            "name": "tagliblocation",
                            "type": "string",
                            "doc": "Type inferred from '\"/WEBINF/tlds/cofax.tld\"'"
                        }
                    ]
                },
                "doc": "Type inferred from '{\"tagliburi\":\"cofax.tld\",\"tagliblocation\":\"/WEBINF/tlds/cofax.tld\"}'"
            }, {
                "name": "servlet",
                "type": {
                    "type": "array",
                    "items": {
                        "type": "record",
                        "name": "servlet",
                        "fields": [{
                                "name": "servletclass",
                                "type": "string",
                                "doc": "Type inferred from '\"org.cofax.cds.CDSServlet\"'"
                            }, {
                                "name": "initparam",
                                "type": ["null", {
                                        "type": "record",
                                        "name": "servlet",
                                        "namespace": "initparam",
                                        "fields": [{
                                                "name": "cachePackageTagsTrack",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '200'",
                                                "default": null
                                            }, {
                                                "name": "redirectionClass",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"org.cofax.SqlRedirection\"'",
                                                "default": null
                                            }, {
                                                "name": "jspFileTemplate",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"articleTemplate.jsp\"'",
                                                "default": null
                                            }, {
                                                "name": "cacheTemplatesRefresh",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '15'",
                                                "default": null
                                            }, {
                                                "name": "dataStorePassword",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"dataStoreTestQuery\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreClass",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"org.cofax.SqlDataStore\"'",
                                                "default": null
                                            }, {
                                                "name": "cacheTemplatesTrack",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '100'",
                                                "default": null
                                            }, {
                                                "name": "configGlossary_poweredByIcon",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/images/cofax.gif\"'",
                                                "default": null
                                            }, {
                                                "name": "searchEngineFileTemplate",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"forSearchEngines.htm\"'",
                                                "default": null
                                            }, {
                                                "name": "configGlossary_adminEmail",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"[email protected]\"'",
                                                "default": null
                                            }, {
                                                "name": "defaultFileTemplate",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"articleTemplate.htm\"'",
                                                "default": null
                                            }, {
                                                "name": "templateProcessorClass",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"org.cofax.WysiwygTemplate\"'",
                                                "default": null
                                            }, {
                                                "name": "configGlossary_installationAt",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"Philadelphia, PA\"'",
                                                "default": null
                                            }, {
                                                "name": "searchEngineListTemplate",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"forSearchEnginesList.htm\"'",
                                                "default": null
                                            }, {
                                                "name": "cachePagesStore",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '100'",
                                                "default": null
                                            }, {
                                                "name": "useDataStore",
                                                "type": ["null", "boolean"],
                                                "doc": "Type inferred from 'true'",
                                                "default": null
                                            }, {
                                                "name": "configGlossary_poweredBy",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"Cofax\"'",
                                                "default": null
                                            }, {
                                                "name": "templateLoaderClass",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"org.cofax.FilesTemplateLoader\"'",
                                                "default": null
                                            }, {
                                                "name": "cachePagesTrack",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '200'",
                                                "default": null
                                            }, {
                                                "name": "searchEngineRobotsDb",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"WEBINF/robots.db\"'",
                                                "default": null
                                            }, {
                                                "name": "cachePagesDirtyRead",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '10'",
                                                "default": null
                                            }, {
                                                "name": "cachePackageTagsStore",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '200'",
                                                "default": null
                                            }, {
                                                "name": "cachePackageTagsRefresh",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '60'",
                                                "default": null
                                            }, {
                                                "name": "configGlossary_staticPath",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/content/static\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreConnUsageLimit",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '100'",
                                                "default": null
                                            }, {
                                                "name": "useJSP",
                                                "type": ["null", "boolean"],
                                                "doc": "Type inferred from 'false'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreLogLevel",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"debug\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreUrl",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon\"'",
                                                "default": null
                                            }, {
                                                "name": "templatePath",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"templates\"'",
                                                "default": null
                                            }, {
                                                "name": "cacheTemplatesStore",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '50'",
                                                "default": null
                                            }, {
                                                "name": "jspListTemplate",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"listTemplate.jsp\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreTestQuery",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"SET NOCOUNT ON;select test='test';\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreMaxConns",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '100'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreName",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"cofax\"'",
                                                "default": null
                                            }, {
                                                "name": "maxUrlLength",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '500'",
                                                "default": null
                                            }, {
                                                "name": "templateOverridePath",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"\"'",
                                                "default": null
                                            }, {
                                                "name": "cachePagesRefresh",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '10'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreDriver",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"com.microsoft.jdbc.sqlserver.SQLServerDriver\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreUser",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"sa\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreLogFile",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/usr/local/tomcat/logs/datastore.log\"'",
                                                "default": null
                                            }, {
                                                "name": "defaultListTemplate",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"listTemplate.htm\"'",
                                                "default": null
                                            }, {
                                                "name": "dataStoreInitConns",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '10'",
                                                "default": null
                                            }, {
                                                "name": "mailHost",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"mail1\"'",
                                                "default": null
                                            }, {
                                                "name": "mailHostOverride",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"mail2\"'",
                                                "default": null
                                            }, {
                                                "name": "dataLogMaxSize",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"\"'",
                                                "default": null
                                            }, {
                                                "name": "log",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '1'",
                                                "default": null
                                            }, {
                                                "name": "logMaxSize",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"\"'",
                                                "default": null
                                            }, {
                                                "name": "fileTransferFolder",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/usr/local/tomcat/webapps/content/fileTransferFolder\"'",
                                                "default": null
                                            }, {
                                                "name": "removeTemplateCache",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/content/admin/remove?cache=templates&id=\"'",
                                                "default": null
                                            }, {
                                                "name": "dataLogLocation",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/usr/local/tomcat/logs/dataLog.log\"'",
                                                "default": null
                                            }, {
                                                "name": "lookInContext",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '1'",
                                                "default": null
                                            }, {
                                                "name": "removePageCache",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/content/admin/remove?cache=pages&id=\"'",
                                                "default": null
                                            }, {
                                                "name": "adminGroupID",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '4'",
                                                "default": null
                                            }, {
                                                "name": "betaServer",
                                                "type": ["null", "boolean"],
                                                "doc": "Type inferred from 'true'",
                                                "default": null
                                            }, {
                                                "name": "logLocation",
                                                "type": ["null", "string"],
                                                "doc": "Type inferred from '\"/usr/local/tomcat/logs/CofaxTools.log\"'",
                                                "default": null
                                            }, {
                                                "name": "dataLog",
                                                "type": ["null", "int"],
                                                "doc": "Type inferred from '1'",
                                                "default": null
                                            }
                                        ]
                                    }
                                ],
                                "doc": "Type inferred from '{\"cachePackageTagsTrack\":200,\"redirectionClass\":\"org.cofax.SqlRedirection\",\"jspFileTemplate\":\"articleTemplate.jsp\",\"cacheTemplatesRefresh\":15,\"dataStorePassword\":\"dataStoreTestQuery\",\"dataStoreClass\":\"org.cofax.SqlDataStore\",\"cacheTemplatesTrack\":100,\"configGlossary_poweredByIcon\":\"/images/cofax.gif\",\"searchEngineFileTemplate\":\"forSearchEngines.htm\",\"configGlossary_adminEmail\":\"[email protected]\",\"defaultFileTemplate\":\"articleTemplate.htm\",\"templateProcessorClass\":\"org.cofax.WysiwygTemplate\",\"configGlossary_installationAt\":\"Philadelphia, PA\",\"searchEngineListTemplate\":\"forSearchEnginesList.htm\",\"cachePagesStore\":100,\"useDataStore\":true,\"configGlossary_poweredBy\":\"Cofax\",\"templateLoaderClass\":\"org.cofax.FilesTemplateLoader\",\"cachePagesTrack\":200,\"searchEngineRobotsDb\":\"WEBINF/robots.db\",\"cachePagesDirtyRead\":10,\"cachePackageTagsStore\":200,\"cachePackageTagsRefresh\":60,\"configGlossary_staticPath\":\"/content/static\",\"dataStoreConnUsageLimit\":100,\"useJSP\":false,\"dataStoreLogLevel\":\"debug\",\"dataStoreUrl\":\"jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon\",\"templatePath\":\"templates\",\"cacheTemplatesStore\":50,\"jspListTemplate\":\"listTemplate.jsp\",\"dataStoreTestQuery\":\"SET NOCOUNT ON;select test='test';\",\"dataStoreMaxConns\":100,\"dataStoreName\":\"cofax\",\"maxUrlLength\":500,\"templateOverridePath\":\"\",\"cachePagesRefresh\":10,\"dataStoreDriver\":\"com.microsoft.jdbc.sqlserver.SQLServerDriver\",\"dataStoreUser\":\"sa\",\"dataStoreLogFile\":\"/usr/local/tomcat/logs/datastore.log\",\"defaultListTemplate\":\"listTemplate.htm\",\"dataStoreInitConns\":10}'",
                                "default": null
                            }, {
                                "name": "servletname",
                                "type": ["null", "string"],
                                "doc": "Type inferred from '\"cofaxCDS\"'",
                                "default": null
                            }
                        ]
                    }
                },
                "doc": "Type inferred from '[{\"servletclass\":\"org.cofax.cds.CDSServlet\",\"initparam\":{\"cachePackageTagsTrack\":200,\"redirectionClass\":\"org.cofax.SqlRedirection\",\"jspFileTemplate\":\"articleTemplate.jsp\",\"cacheTemplatesRefresh\":15,\"dataStorePassword\":\"dataStoreTestQuery\",\"dataStoreClass\":\"org.cofax.SqlDataStore\",\"cacheTemplatesTrack\":100,\"configGlossary_poweredByIcon\":\"/images/cofax.gif\",\"searchEngineFileTemplate\":\"forSearchEngines.htm\",\"configGlossary_adminEmail\":\"[email protected]\",\"defaultFileTemplate\":\"articleTemplate.htm\",\"templateProcessorClass\":\"org.cofax.WysiwygTemplate\",\"configGlossary_installationAt\":\"Philadelphia, PA\",\"searchEngineListTemplate\":\"forSearchEnginesList.htm\",\"cachePagesStore\":100,\"useDataStore\":true,\"configGlossary_poweredBy\":\"Cofax\",\"templateLoaderClass\":\"org.cofax.FilesTemplateLoader\",\"cachePagesTrack\":200,\"searchEngineRobotsDb\":\"WEBINF/robots.db\",\"cachePagesDirtyRead\":10,\"cachePackageTagsStore\":200,\"cachePackageTagsRefresh\":60,\"configGlossary_staticPath\":\"/content/static\",\"dataStoreConnUsageLimit\":100,\"useJSP\":false,\"dataStoreLogLevel\":\"debug\",\"dataStoreUrl\":\"jdbc:microsoft:sqlserver://LOCALHOST:1433;DatabaseName=goon\",\"templatePath\":\"templates\",\"cacheTemplatesStore\":50,\"jspListTemplate\":\"listTemplate.jsp\",\"dataStoreTestQuery\":\"SET NOCOUNT ON;select test='test';\",\"dataStoreMaxConns\":100,\"dataStoreName\":\"cofax\",\"maxUrlLength\":500,\"templateOverridePath\":\"\",\"cachePagesRefresh\":10,\"dataStoreDriver\":\"com.microsoft.jdbc.sqlserver.SQLServerDriver\",\"dataStoreUser\":\"sa\",\"dataStoreLogFile\":\"/usr/local/tomcat/logs/datastore.log\",\"defaultListTemplate\":\"listTemplate.htm\",\"dataStoreInitConns\":10},\"servletname\":\"cofaxCDS\"},{\"servletclass\":\"org.cofax.cds.EmailServlet\",\"initparam\":{\"mailHost\":\"mail1\",\"mailHostOverride\":\"mail2\"},\"servletname\":\"cofaxEmail\"},{\"servletclass\":\"org.cofax.cds.AdminServlet\"},{\"servletclass\":\"org.cofax.cds.FileServlet\",\"servletname\":\"fileServlet\"},{\"servletclass\":\"org.cofax.cms.CofaxToolsServlet\",\"initparam\":{\"dataLogMaxSize\":\"\",\"log\":1,\"logMaxSize\":\"\",\"templatePath\":\"toolstemplates/\",\"fileTransferFolder\":\"/usr/local/tomcat/webapps/content/fileTransferFolder\",\"removeTemplateCache\":\"/content/admin/remove?cache=templates&id=\",\"dataLogLocation\":\"/usr/local/tomcat/logs/dataLog.log\",\"lookInContext\":1,\"removePageCache\":\"/content/admin/remove?cache=pages&id=\",\"adminGroupID\":4,\"betaServer\":true,\"logLocation\":\"/usr/local/tomcat/logs/CofaxTools.log\",\"dataLog\":1},\"servletname\":\"cofaxTools\"}]'"
            }
        ]
    }
}

In the first segment, webapp should not be repeated more than twice. It should be:

name: webapp, type: { name: webapp, type: record, fields: [ { name: asd, type: { type: record, name: asd } } ] }

not

name: webapp, namespace: asd

P.S. Is there a way to get rid of the "doc" attributes and also avoid the dot notation?
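
Spelled out as Avro JSON, the shape being asked for would look roughly like this (a sketch based on the shorthand above; the empty fields list for asd is a placeholder):

{
    "name": "webapp",
    "type": {
        "type": "record",
        "name": "webapp",
        "fields": [{
            "name": "asd",
            "type": {
                "type": "record",
                "name": "asd",
                "fields": []
            }
        }]
    }
}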

Support Secure Hadoop?

16:10:47.549 [main] WARN  org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
16:10:47.579 [main] INFO  hive.metastore - Trying to connect to metastore with URI thrift://master123:50041
Search Subject for Kerberos V5 INIT cred (<<DEF>>, sun.security.jgss.krb5.Krb5InitCredential)
Debug is  true storeKey false useTicketCache false useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is /home/sinclair/workspace/cnc-works/mtools/target/classes/sec/sinclair.keytab refreshKrb5Config is false principal is sinclair@XXXXX.COM tryFirstPass is false useFirstPass is false storePass is false clearPass is false
principal is sinclair@XXXXX.COM
Will use keytab
Commit Succeeded 

16:10:47.865 [main] INFO  hive.metastore - Connected to metastore.
16:10:47.979 [main] WARN  org.kitesdk.data.hcatalog.HCatalog - Using a local Hive MetaStore (for testing only)
16:10:48.020 [main] WARN  org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
16:10:59.010 [main] INFO  o.k.d.h.HCatalogExternalMetadataProvider - Creating an external Hive table named: a_b_c_staging
Search Subject for Kerberos V5 INIT cred (<<DEF>>, sun.security.jgss.krb5.Krb5InitCredential)
Debug is  true storeKey false useTicketCache false useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is /home/sinclair/workspace/cnc-works/mtools/target/classes/sec/sinclair.keytab refreshKrb5Config is false principal is sinclair@XXXXX.COM tryFirstPass is false useFirstPass is false storePass is false clearPass is false
principal is sinclair@XXXXX.COM
Will use keytab
Commit Succeeded 

Exception in thread "main" org.kitesdk.data.DatasetRepositoryException: Cannot access data location
    at org.kitesdk.data.filesystem.FileSystemDatasetRepository.ensureExists(FileSystemDatasetRepository.java:388)
    at org.kitesdk.data.filesystem.AccessorImpl.ensureExists(AccessorImpl.java:62)
    at org.kitesdk.data.hcatalog.HCatalogExternalMetadataProvider.create(HCatalogExternalMetadataProvider.java:86)
    at org.kitesdk.data.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:123)
    at com.xxxxx.mtools.Main.createDateSet(Main.java:561)
    at com.xxxxx.mtools.Main.main(Main.java:259)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: sinclair@XXXXX.COM is not allowed to impersonate sinclair
    at org.apache.hadoop.ipc.Client.call(Client.java:1409)
    at org.apache.hadoop.ipc.Client.call(Client.java:1362)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy37.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:699)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy38.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1757)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
    at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
    at org.kitesdk.data.filesystem.FileSystemDatasetRepository.ensureExists(FileSystemDatasetRepository.java:384)
    ... 10 more
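
For context on the AuthorizationException above: HDFS rejects the request because the Kerberos principal is trying to act on behalf of its short name and impersonation is not configured. The usual fix is a proxy-user entry in core-site.xml on the NameNode; a minimal sketch (the user name is taken from the log above, and the wildcard values are for illustration only):

<property>
  <name>hadoop.proxyuser.sinclair.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.sinclair.groups</name>
  <value>*</value>
</property>

Mapping the full principal to its short name via hadoop.security.auth_to_local rules is another angle worth checking.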

SolrCellBuilder appears to fail to configure Tika to extract embedded contents

To configure Tika to parse embedded documents recursively, you need to set the embedded parser in the parse context. If my reading of SolrCellBuilder is correct, Tika will only pull the contents out of the container document and will miss attachments.

See: https://issues.apache.org/jira/browse/SOLR-7189 and http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201507.mbox/%3CCAN4YXve24W++MKK1U-n0rp6JKNf-FQB10_ggRw4W4-Xy8dgP-w@mail.gmail.com%3E
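
For reference, the usual way to enable recursive extraction with Tika's programmatic API is to register the parser in the ParseContext; here is a minimal standalone sketch (the input file name is hypothetical):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class RecursiveExtraction {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Without this line Tika parses only the container document;
        // registering the parser makes it recurse into embedded documents.
        context.set(Parser.class, parser);

        try (InputStream in = Files.newInputStream(Paths.get("container.doc"))) {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            parser.parse(in, handler, metadata, context);
            System.out.println(handler); // includes text from attachments
        }
    }
}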

Out of date POM?

I could not find a mailing list. Kindly redirect me to a discussion list if one exists and the issue tracker is not appropriate for user questions.

I am trying to install the Kite SDK using mvn install -Dhadoop.profile=2, since I have Hadoop 2+ "installed" along with Maven on my local machine (the binaries were added to the environment PATH).

I am getting stuck on the following step of the Kite build:

[INFO] Reactor Summary:
[INFO]
[INFO] Kite Development Kit ............................... SUCCESS [ 0.804 s]
[INFO] Kite Hadoop Dependencies Module .................... SUCCESS [ 0.767 s]
[INFO] Kite Hadoop Default Dependencies Module ............ SUCCESS [ 0.244 s]
[INFO] Kite Hadoop Default Test Dependencies Module ....... SUCCESS [ 0.118 s]
[INFO] Kite Hadoop-1 Dependencies Module .................. SUCCESS [ 0.113 s]
[INFO] Kite Hadoop-1 Test Dependencies Module ............. SUCCESS [ 0.040 s]
[INFO] Kite Hadoop CDH4 Dependencies Module ............... SUCCESS [ 0.079 s]
[INFO] Kite Hadoop CDH4 Test Dependencies Module .......... SUCCESS [ 0.060 s]
[INFO] Kite Hadoop CDH5 Dependencies Module ............... SUCCESS [ 0.150 s]
[INFO] Kite Hadoop CDH5 Test Dependencies Module .......... SUCCESS [ 0.088 s]
[INFO] Kite Hadoop Compatibility Module ................... SUCCESS [ 7.327 s]
[INFO] Kite HBase Dependencies Module ..................... SUCCESS [ 0.034 s]
[INFO] Kite HBase Default Dependencies Module ............. SUCCESS [ 0.128 s]
[INFO] Kite HBase Default Test Dependencies Module ........ SUCCESS [ 0.071 s]
[INFO] Kite HBase Hadoop-1 Dependencies Module ............ SUCCESS [ 0.057 s]
[INFO] Kite HBase Hadoop-1 Test Dependencies Module ....... SUCCESS [ 0.031 s]
[INFO] Kite HBase CDH4 Dependencies Module ................ SUCCESS [ 0.046 s]
[INFO] Kite HBase CDH4 Test Dependencies Module ........... SUCCESS [ 0.032 s]
[INFO] Kite HBase CDH5 Dependencies Module ................ SUCCESS [ 0.076 s]
[INFO] Kite HBase CDH5 Test Dependencies Module ........... SUCCESS [ 0.047 s]
[INFO] Kite Data Module ................................... SUCCESS [ 0.384 s]
[INFO] Kite Data Core Module .............................. SUCCESS [02:11 min]
[INFO] Kite Data Oozie Module ............................. SUCCESS [ 11.892 s]
[INFO] Kite Data Hive Module .............................. SUCCESS [01:35 min]
[INFO] Kite Data S3 Module ................................ SUCCESS [ 5.735 s]
[INFO] Kite Data HBase Module ............................. SUCCESS [02:17 min]
[INFO] Kite Data MapReduce Module ......................... SUCCESS [ 43.277 s]
[INFO] Kite Data Crunch Module ............................ SUCCESS [04:03 min]
[INFO] Kite Data Flume Module ............................. SUCCESS [ 7.773 s]
[INFO] Kite Data Spark Module ............................. FAILURE [ 17.148 s]
[INFO] Kite Application POM for CDH4 ...................... SKIPPED
[INFO] Kite Application POM for CDH5 ...................... SKIPPED
[INFO] Kite Application POM Modules ....................... SKIPPED
(etc.; I am omitting the rest since it is all skipped.)
This is the generated error (running the same command with -e and -X):

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project kite-data-spark: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/carlos/hadoop/kite/kite-data/kite-data-spark/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project kite-data-spark: There are test failures.

Please refer to /Users/carlos/hadoop/kite/kite-data/kite-data-spark/target/surefire-reports for the individual test results.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures.

Please refer to /Users/carlos/hadoop/kite/kite-data/kite-data-spark/target/surefire-reports for the individual test results.
at org.apache.maven.plugin.surefire.SurefireHelper.reportExecution(SurefireHelper.java:91)
at org.apache.maven.plugin.surefire.SurefirePlugin.handleSummary(SurefirePlugin.java:320)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:892)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:755)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :kite-data-spark

This post was the only solution I've found so far. I attempted to run mvn dependency:tree as suggested, which may have fixed that issue (the build now gets further, but is stuck on the CDH5 step):

[INFO] Kite Data Crunch Module ............................ SUCCESS [ 0.116 s]
[INFO] Kite Data Flume Module ............................. SUCCESS [ 0.044 s]
[INFO] Kite Data Spark Module ............................. SUCCESS [ 0.173 s]
[INFO] Kite Application POM for CDH4 ...................... SUCCESS [ 42.860 s]
[INFO] Kite Application POM for CDH5 ...................... FAILURE [01:18 min]
[INFO] Kite Application POM Modules ....................... SKIPPED
[INFO] Kite Maven Plugin .................................. SKIPPED
[INFO] Kite Tools Module .................................. SKIPPED
[INFO] Kite Tools Runtime Module .......................... SKIPPED
[INFO] Kite Minicluster ................................... SKIPPED

The error now is:

[ERROR] Failed to execute goal on project kite-app-parent-cdh5: Could not resolve dependencies for project org.kitesdk:kite-app-parent-cdh5:pom:1.1.1-SNAPSHOT: Could not find artifact org.apache.parquet:parquet-avro:jar:1.5.0-cdh5.4.2 in com.cloudera.releases (https://repository.cloudera.com/artifactory/cloudera-repos/) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :kite-app-parent-cdh5

Exploring the repository URL provided in the error, however, the artifact appears to be out of date. Could someone clarify the best way to fix this? Which version should I point to instead, and where do I edit it so the build resolves from the appropriate URL?
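
For what it's worth: in CDH releases around 5.4, Parquet artifacts were published under the com.twitter group ID rather than org.apache.parquet, so the artifact named in the error may simply not exist in the Cloudera repository. One possible workaround, assuming the parent POM exposes the Parquet version as a Maven property (the property name and version below are hypothetical; check the top-level pom.xml and the repository index for real values), is to override it to a version that is actually present:

mvn install -Dhadoop.profile=2 -Dparquet.version=1.5.0-cdh5.7.0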

Thanks.

Enhance ipynb support

I have been using VS Code for Python development for almost two years, and Jupyter in VS Code gives me a lot of convenience and fun while coding. But VS Code falls short on autocompletion, especially in .ipynb files; I have searched for many extensions but nothing very handy has turned up. Kite looks promising, but .ipynb files are not supported by Kite in VS Code. I really hope this can be enhanced by either the Kite or the VS Code developers.
Many thanks!

Java command - class resolving with Java 1.8

The class is not correctly resolved in the ScriptEvaluator() -> FastJavaScriptEngine.compile() call.
I have this Java code snippet in a java command:

byte[] bytes = (byte[]) record.getFirstValue("time");
long time = (Long)org.apache.phoenix.schema.types.PLong.INSTANCE.toObject(bytes);
. . .
return child.process(record);

When this is evaluated in JavaBuilder -> ScriptEvaluator, I get this exception:
javax.script.ScriptException: Cannot compile script: ... caused by java.lang.NoSuchMethodException: edu.umd.cs.findbugs.annotations.SuppressWarnings.eval(org.kitesdk.morphline.api.Record, com.typesafe.config.Config, org.kitesdk.morphline.api.Command, org.kitesdk.morphline.api.Command, org.kitesdk.morphline.api.MorphlineContext, org.slf4j.Logger)
at org.kitesdk.morphline.scriptengine.java.ScriptEvaluator.throwScriptCompilationException(ScriptEvaluator.java:141)
at org.kitesdk.morphline.scriptengine.java.ScriptEvaluator.<init>(ScriptEvaluator.java:108)
at TestScriptEvaluator.main(TestScriptEvaluator.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
It fails when compiling to Java 1.8 bytecode but works when 1.6 is set. However, 1.8 is required by the project.

ScriptEvaluator builds up a string representation of a temporary Java class and calls FastJavaScriptEngine to compile it. FastJavaScriptEngine then tries to find the class where the eval method is expected. This is the point of failure under 1.8 bytecode: here classBytes is a map of size 2, holding the edu.umd.cs.findbugs.annotations.SuppressWarnings and org.kitesdk.morphline.scriptengine.java.scripts.MyJavaClass1 classes. The logic of the parse() method loads the first of these classes, which is unfortunately SuppressWarnings, and so the NoSuchMethodException is thrown.
While analyzing this I found that SuppressWarnings appears because of the line:
...org.apache.phoenix.schema.types.PLong.INSTANCE.toObject(bytes);
although I don't see any compilation warning that could be raised here. But that is not important for this issue.

ScriptEvaluator knows the name of the class that contains the command code snippet and could pass it to FastJavaScriptEngine so it doesn't have to search for the class. This is also the workaround I used in my case (via the mainClass context attribute).
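
To illustrate the kind of fix being suggested, here is a standalone sketch (not the actual FastJavaScriptEngine code; names are illustrative): instead of loading whichever compiled class comes first, select the one that actually declares the expected eval method.

import java.lang.reflect.Method;
import java.util.Map;

final class EvalClassSelector {
    static Class<?> findEvalClass(Map<String, Class<?>> compiledClasses, int evalArity) {
        for (Class<?> candidate : compiledClasses.values()) {
            for (Method m : candidate.getDeclaredMethods()) {
                // Skip incidental classes (such as annotation types) that do not
                // declare an eval method with the expected number of parameters.
                if (m.getName().equals("eval") && m.getParameterCount() == evalArity) {
                    return candidate;
                }
            }
        }
        throw new IllegalStateException("No compiled class declares eval(...)");
    }
}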

Please verify this behaviour when Java 1.8 is used and consider a fix.
I have attached a TestScriptEvaluator class which simulates this error.

https://gist.github.com/brunatm/2c7b8b7a56c4a4a5ca96167b9bdb2786

Creating dataset gives IncompatibleSchemaException when using schema with union to allow null values

I just upgraded to 0.15.0 and code that used to work now throws IncompatibleSchemaException when I create a new dataset.

It only seems to happen when the schema I use has unions to allow for null values. Here is an example:

I have a POJO with three fields:

public class TestPojo {
    private Long id;
    private String name;
    private Date birthDate; // java.util.Date
}

Using this schema (created using ReflectData.AllowNull.get().getSchema(datasetClass)):

{
    "type": "record",
    "name": "TestPojo",
    "namespace": "org.springframework.data.hadoop.store.dataset",
    "fields": [{
        "name": "id",
        "type": ["null", "long"],
        "default": null
    }, {
        "name": "name",
        "type": ["null", "string"],
        "default": null
    }, {
        "name": "birthDate",
        "type": ["null",
        {
            "type": "record",
            "name": "Date",
            "namespace": "java.util",
            "fields": []
        }],
        "default": null
    }]
}
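
For completeness, the schema above can be reproduced with plain Avro reflection (a minimal sketch, assuming the TestPojo class shown earlier is on the classpath):

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class SchemaRepro {
    public static void main(String[] args) {
        // AllowNull wraps each reflected field type in a union with "null".
        Schema schema = ReflectData.AllowNull.get().getSchema(TestPojo.class);
        System.out.println(schema.toString(true)); // pretty-printed JSON
    }
}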

I get the following exception:

org.kitesdk.data.IncompatibleSchemaException: The reader schema derived from class org.springframework.data.hadoop.store.dataset.TestPojo is not compatible with the dataset's given writer schema.
    at org.kitesdk.data.spi.DataModelUtil.resolveType(DataModelUtil.java:104)
    at org.kitesdk.data.spi.AbstractDataset.<init>(AbstractDataset.java:44)
    at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:83)
    at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:109)
    at org.kitesdk.data.spi.filesystem.FileSystemDataset$Builder.build(FileSystemDataset.java:526)
    at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:142)
    at org.kitesdk.data.spi.AbstractDatasetRepository.create(AbstractDatasetRepository.java:35)
    at org.springframework.data.hadoop.store.dataset.DatasetUtils.getOrCreateDataset(DatasetUtils.java:88)
    at org.springframework.data.hadoop.store.dataset.AvroPojoDatasetStoreWriter.write(AvroPojoDatasetStoreWriter.java:57)
    at org.springframework.data.hadoop.store.dataset.DatasetTemplate.write(DatasetTemplate.java:270)
    at org.springframework.data.hadoop.store.dataset.DatasetTemplateTests.testReadSavedPojoWithNullValues(DatasetTemplateTests.java:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:74)
    at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:83)
    at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:233)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:87)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
    at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:71)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:176)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

csv-import from hdfs fails

I am trying to run
kite-dataset csv-import hdfs:/user/Florian/ratings.csv ratings

but it fails with the following error message:

org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:hive://{fullyqualifiedname}:9083/default/ratings
Check that JARs for hive datasets are on the classpath

I have created a schema and a table by running:

kite-dataset csv-schema ratings.csv -o rating.avsc
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet

I use Kite version "1.0.0-cdh5.8.0".
