
datasync's Introduction

Socrata Datasync

Last updated: June 2, 2017

Looking for the latest release? Get it here: https://github.com/socrata/datasync/releases

General Information

DataSync is an executable Java application that serves as a general solution for automating the publishing of data to the Socrata platform. It can be used through an easy-to-use graphical interface or as a command-line tool ('headless mode'). Whether you are a non-technical user, a developer, or an ETL specialist, DataSync makes data publishing simple and reliable. DataSync takes a CSV or TSV file on a local machine or networked hard drive and publishes it to a Socrata dataset so that the dataset stays up to date. DataSync can also publish geospatial files such as zipped shapefiles, geoJSON, KML, and KMZ files. DataSync jobs can be integrated into an ETL process, scheduled using a tool such as the Windows Task Scheduler or cron, or used to perform updates or create new datasets in batches. DataSync works on any platform that runs Java version 1.8 or higher (e.g. Windows, Mac, and Linux). This simple yet powerful publishing tool lets you update Socrata datasets programmatically and on a schedule, without writing a single line of code.
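For example, once a job has been saved as a .sij file through the GUI, it can be run headlessly with a single command (a sketch; the jar name depends on the release you downloaded and the job path is a placeholder):

java -jar DataSync-1.8.2-jar-with-dependencies.jar /path/to/job.sij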

Comprehensive DataSync Documentation

The Socrata University Class: Socrata Introduction to Integration

Standard Jobs

Standard jobs can be set up to take a CSV data file from a local machine or networked folder and publish it to a specific dataset. A job can easily be automated using the Windows Task Scheduler or a similar tool to run it at specified intervals (e.g. once per day).
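For instance, on Linux or Mac a crontab entry like the following (illustrative only; the schedule, jar name, and job path are placeholders) would run a saved job every day at 2:00 AM:

0 2 * * * java -jar /path/to/DataSync-1.8.2-jar-with-dependencies.jar /path/to/job.sij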

GIS Jobs

GIS jobs can be set up to handle geospatial datasets, such as zipped shapefiles, geoJSON, KML, or KMZ files and replace specific datasets on Socrata. The job can then be automated in a similar fashion to standard jobs.

Port Jobs

Port jobs are used for moving data around that is already on the Socrata platform. Users with publisher rights can make copies of datasets with this tool. Port jobs allow copying both dataset schemas (metadata and columns) and data (rows).

Developers

This repository is our development basecamp. If you find a bug or have questions, comments, or suggestions, you can contribute to our issue tracker.

Apache Maven

DataSync uses Maven for building and package management. For more information: What is Maven?

To build the project, you'll first need to create an application token on your profile page. Put the random string it produces in a file called "api-key.txt" in the root directory of this project, then run:

mvn clean install

To compile the project into an executable JAR file (including all dependencies) run:

mvn clean compile -Dmaven.test.skip=true assembly:single

This puts the JAR file into the "target" directory inside the repo. To launch DataSync, simply run:

cd target
java -jar DataSync-1.8.2-jar-with-dependencies.jar

Java SDK

DataSync can also be used as a Java SDK. For detailed documentation, refer to: http://socrata.github.io/datasync/guides/datasync-library-sdk.html
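As a rough illustration, publishing a file through the library might look like the sketch below. This is a hedged example: the class and method names follow the SDK guide linked above but may differ between DataSync versions, and the domain, credentials, dataset ID, and file path are all placeholders.

import com.socrata.datasync.PublishMethod;
import com.socrata.datasync.job.IntegrationJob;
import com.socrata.datasync.job.JobStatus;
import com.socrata.datasync.config.userpreferences.UserPreferencesJava;

public class PublishExample {
    public static void main(String[] args) {
        // Save connection details (all placeholder values) to the persisted preferences
        UserPreferencesJava prefs = new UserPreferencesJava();
        prefs.saveDomain("https://data.example.com");
        prefs.saveUsername("user@example.com");
        prefs.savePassword("secret");

        // Configure a standard job: replace the dataset's rows with the CSV contents
        IntegrationJob job = new IntegrationJob(prefs);
        job.setDatasetID("abcd-1234");
        job.setFileToPublish("/path/to/data.csv");
        job.setPublishMethod(PublishMethod.replace);
        job.setFileToPublishHasHeaderRow(true);

        // Run the job and report any failure
        JobStatus status = job.run();
        if (status.isError()) {
            System.err.println("Job failed: " + status.getMessage());
        }
    }
}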

datasync's People

Contributors

aescobarcruz, alaurenz, bhwilliamson, catstavi, charlottewest, chitang, courtneyspurgeon, dependabot[bot], gregorrichardson, louisfettet, malindac, michaelb990, peteraustinmoore, rjmac, spaceballone, urmilan


datasync's Issues

Implement "SmartUpdate" capability for highly efficient replace operations

SmartUpdate will allow customers to efficiently perform replace operations in DataSync on very large datasets (1 million+ rows). SmartUpdate works by having DataSync send a zipped CSV file with all data over FTP; the Socrata platform then determines which rows have been added or updated and publishes only those to the dataset. This leads to dramatic gains in efficiency and performance, so customers' uploads complete faster and data is geocoded and indexed for searching much more quickly. SmartUpdate will enable using the "replace" method for datasets with upwards of 1 million rows, which should remove the need for customers to determine which rows have been added/updated/deleted and use upsert. Instead, it is as simple as dumping all the data into a CSV and uploading it through DataSync "SmartUpdate".

Trim leading and trailing whitespace in CSV header column

If the header row of a CSV file has leading or trailing whitespace around any of the column names, the upload will fail for those columns. This can be fixed by automatically trimming the whitespace around each header column name before publishing the CSV, as in the sketch below.
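A minimal sketch of the idea (illustrative only, not the actual DataSync code):

// Trim leading/trailing whitespace from each header name before publishing
static String[] trimHeaders(String[] headers) {
    String[] trimmed = new String[headers.length];
    for (int i = 0; i < headers.length; i++) {
        trimmed[i] = headers[i].trim();
    }
    return trimmed;
}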

UnrecognizedPropertyException occurs when publishing via 'upsert' on some datasets

When running jobs that publish data to certain datasets (with specific sharing configurations), this error occurs when using 'upsert' in DataSync:

com.sun.jersey.api.client.ClientHandlerException: org.codehaus.jackson.map.exc.UnrecognizedPropertyException: Unrecognized field "userEmail" (Class com.socrata.model.importer.Grant), not marked as ignorable

This is a bug in the soda-java library, which was fixed in version 0.9.4.

Enable user to (optionally) specify a :deleted column marking rows to be deleted

This issue is related to another issue: #14

There should be an upload method called append/upsert/delete which allows specifying a column in the CSV with the header :deleted, where any value of true in that column triggers a delete of that row (see the example below). Be sure to notify the user if no Row ID is set in the dataset. Include a Help icon explaining how this works.
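For example, with hypothetical data and assuming the dataset's Row ID is the id column, the following CSV would upsert rows 1 and 3 as usual and delete row 2:

id,name,:deleted
1,Alice,
2,Bob,true
3,Carol,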

Add the ability to specify a job (.sij) file to open when launching DataSync from the command line or a shortcut

This would allow for creating shortcuts that not only launch DataSync but also open a specified job file, similar to what one can do with many applications that open/edit files. I know you can specify a job file when running a job, but there could be situations where someone wants to open it before running it interactively.

It is possible this can already be done and I am just not figuring out how to do it.

Thanks.

Trying to populate a URL Type field with an invalid URL format produces an error

A dataset will allow an invalid URL (although it greys it out), but DataSync produces an error message. My preference would be that if the portal itself allows the value, DataSync should allow it too. Dropping invalid values might be OK if necessary.

I am not in a position to give the specific example to reproduce the error since it is a private dataset under development (happy to provide the link to Adrian or someone else at Socrata but cannot post it quite this publicly) and I am changing the field to Plain Text in order to be able to proceed for now. However, I suspect it would be easy to reproduce with a test dataset. Leaving off the top level domain and using a comma instead of a period both create this error, although they are not the only conditions that produce it.

Restructure Publish methods to be less confusing

Although it is not made clear in the existing DataSync UI, the 'upsert' and 'append' methods behave in exactly the same way. When an upsert or append is performed and no Row ID is set for the dataset being published to, all the rows are appended. If a Row ID is set, then whether you use 'upsert' or 'append', the uploaded rows that match existing rows will update them.

Proposed change: when the user selects append as the publish method, DataSync should check whether a Row ID is set on the dataset. If one is set, a message should appear warning the user about it. A help bubble explaining all of this should also appear next to the "Publish method" field in the UI.

Also need to update knowledge base articles.

Enable uploading CSV files that do not contain a column header row

DataSync should give the user the option of uploading a CSV file without column headers. Provide a checkbox the user can check to tell DataSync that their CSV does not contain a column header row. If this box is checked, DataSync should rely on the order of the columns in the CSV.

It might also be useful to give the user feedback in the UI if the column headers of the CSV file (or lack thereof) do not match those of the dataset.

Migrate to maven

Use Maven for package management. Move mail, soda-java, and org.json packages to be imported from Maven. Also update the README to reflect this change.

Integrate DataPort into DataSync

DataPort is a command-line tool that allows you to transfer data from one Socrata dataset to another or to duplicate an existing Socrata dataset. DataSync will gain a new job type called a Port job, which can be configured to do the things DataPort supports as part of DataSync.

Implement SmartUpdate feature for efficient publishing of data from very large files

Enable normal DataSync jobs to be 'smart' about the data they publish by publishing only the rows of data that were updated or added in the CSV (the file to publish). DataSync will maintain a record of the changes made to the CSV file since the last publish operation was performed (i.e. the last time the job was run) in order to identify which rows were updated or added since the last run. When the job runs, it will then publish only those newly updated or added rows (rather than all rows in the CSV, as it does now). This feature will only work for append/upsert (not replace).

Optimize DataSync upserting API calls and chunking strategy

To optimize the upsert API calls we should investigate using the upsert CSV method directly (but this may only work for files that contain headers). This may be challenging to support with chunking.

It makes a lot more sense to have chunk size be based on file size rather than number of rows.
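For instance, a byte-based chunking heuristic could estimate rows per chunk from the average row size (an illustrative sketch, not DataSync's actual strategy):

// Estimate how many rows fit in a chunk of roughly targetChunkBytes
static int rowsPerChunk(long fileSizeBytes, long rowCount, long targetChunkBytes) {
    long avgRowBytes = Math.max(1, fileSizeBytes / Math.max(1, rowCount));
    return (int) Math.max(1, targetChunkBytes / avgRowBytes);
}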

Allow saving settings on a machine that is not able to run the GUI interface

With the GUI interface, one can set certain settings and preferences once and have them applied to all jobs (unless specifically overridden by a command-line parameter, I assume). They are saved in the Windows registry or other OS-specific location.

However, this only works on a machine capable of running the GUI. In some cases, such as a Linux server accessed through a terminal emulator, that is not possible. It would be great if there were a way to save these settings through the command line. Maybe some parameter that effectively said "Save all these other parameters as if entered through the GUI"? Would that allow for substantial reuse of the code the GUI uses to save these settings?

I realize this does not add a lot of security vs. using command-line parameters or a configuration JSON file, but it adds a little (even if only security through obscurity) and adds convenience vs. the command line.

Thank you.

Silent failure when running a job

I ran a job (interactively) and it failed silently -- meaning no alert, no apparent change to the dataset, and no entry in the Log Dataset. The only signal is that the Run Job Now button becomes clickable again.

It is a Replace job and the source CSV is 76 MB. I have the chunking threshold at 64 MB. I initially had the chunk size at the default of 25,000 rows but lowered it as low as 10,000, with no success.

What other diagnostic information can I provide?

Thank you.

Enable file chunking support for 'replace' method

Currently chunking (which enables uploading very large files) is only supported by 'append' and 'upsert'.

This should be implemented by creating a working copy of the dataset via the .copySchema method and then pushing rows to the resulting working copy in chunks. This prevents the dataset from being left in a bad/inconsistent state if a job fails part-way through.
