
datasync's Introduction

Socrata Datasync

Last updated: June 2, 2017

Looking for the latest release? Get it here: https://github.com/socrata/datasync/releases

General Information

DataSync is an executable Java application that serves as a general solution for automating the publishing of data to the Socrata platform. It can be used through an easy-to-use graphical interface or as a command-line tool ('headless mode'). Whether you are a non-technical user, a developer, or an ETL specialist, DataSync makes data publishing simple and reliable. DataSync takes a CSV or TSV file on a local machine or networked hard drive and publishes it to a Socrata dataset so that the dataset stays up to date. DataSync can also publish geospatial files such as zipped shapefiles, geoJSON, KML, and KMZ files. DataSync jobs can be integrated into an ETL process, scheduled using a tool such as the Windows Task Scheduler or cron, or used to perform updates or create new datasets in batches. DataSync works on any platform that runs Java version 1.8 or higher (e.g. Windows, Mac, and Linux). This simple yet powerful publishing tool lets you update Socrata datasets programmatically and on a schedule, without writing a single line of code.
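For example, once a job has been saved as a .sij file through the GUI, it can be run headlessly with a single command (a sketch; the jar name depends on the release you downloaded and the job path is a placeholder):

java -jar DataSync-1.8.2-jar-with-dependencies.jar /path/to/job.sij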

Comprehensive DataSync Documentation

The Socrata University Class: Socrata Introduction to Integration

Standard Jobs

Standard jobs can be set up to take a CSV data file from a local machine or networked folder and publish it to a specific dataset. A job can easily be automated using the Windows Task Scheduler or a similar tool to run it at specified intervals (e.g. once per day).
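For instance, on Linux or Mac a crontab entry like the following (illustrative only; the schedule, jar name, and job path are placeholders) would run a saved job every day at 2:00 AM:

0 2 * * * java -jar /path/to/DataSync-1.8.2-jar-with-dependencies.jar /path/to/job.sij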

GIS Jobs

GIS jobs can be set up to handle geospatial datasets, such as zipped shapefiles, geoJSON, KML, or KMZ files and replace specific datasets on Socrata. The job can then be automated in a similar fashion to standard jobs.

Port Jobs

Port jobs are used for moving data around that is already on the Socrata platform. Users with publisher rights can make copies of datasets with this tool. Port jobs allow copying both dataset schemas (metadata and columns) and data (rows).

Developers

This repository is our development basecamp. If you find a bug or have questions, comments, or suggestions, you can contribute to our issue tracker.

Apache Maven

DataSync uses Maven for building and package management. For more information: What is Maven?

To build the project, you'll first need to create an application token on your profile page. Put the random string it produces in a file called "api-key.txt" in the root directory of this project, then run:

mvn clean install

To compile the project into an executable JAR file (including all dependencies) run:

mvn clean compile -Dmaven.test.skip=true assembly:single

This puts the JAR file into the "target" directory inside the repo. To launch DataSync, simply run:

cd target
java -jar DataSync-1.8.2-jar-with-dependencies.jar

Java SDK

DataSync can also be used as a Java SDK. For detailed documentation, refer to: http://socrata.github.io/datasync/guides/datasync-library-sdk.html
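As a rough illustration, publishing a file through the library might look like the sketch below. This is a hedged example: the class and method names follow the SDK guide linked above but may differ between DataSync versions, and the domain, credentials, dataset ID, and file path are all placeholders.

import com.socrata.datasync.PublishMethod;
import com.socrata.datasync.job.IntegrationJob;
import com.socrata.datasync.job.JobStatus;
import com.socrata.datasync.config.userpreferences.UserPreferencesJava;

public class PublishExample {
    public static void main(String[] args) {
        // Save connection details (all placeholder values) to the persisted preferences
        UserPreferencesJava prefs = new UserPreferencesJava();
        prefs.saveDomain("https://data.example.com");
        prefs.saveUsername("user@example.com");
        prefs.savePassword("secret");

        // Configure a standard job: replace the dataset's rows with the CSV contents
        IntegrationJob job = new IntegrationJob(prefs);
        job.setDatasetID("abcd-1234");
        job.setFileToPublish("/path/to/data.csv");
        job.setPublishMethod(PublishMethod.replace);
        job.setFileToPublishHasHeaderRow(true);

        // Run the job and report any failure
        JobStatus status = job.run();
        if (status.isError()) {
            System.err.println("Job failed: " + status.getMessage());
        }
    }
}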

datasync's People

Contributors

aescobarcruz, alaurenz, bhwilliamson, catstavi, charlottewest, chitang, courtneyspurgeon, dependabot[bot], gregorrichardson, louisfettet, malindac, michaelb990, peteraustinmoore, rjmac, spaceballone, urmilan


datasync's Issues

Implement "SmartUpdate" capability for highly efficient replace operations

SmartUpdate will allow customers to efficiently perform replace operations in DataSync on very large datasets (1 million+ rows). SmartUpdate works by having DataSync send a zipped CSV file with all data over FTP; the Socrata platform then determines which rows have been added or updated and publishes only those to the dataset. This leads to dramatic gains in efficiency and performance, so customers' uploads complete faster and data is geocoded and indexed for searching much more quickly. SmartUpdate will enable using the "replace" method for datasets with upwards of 1 million rows, which should remove the need for customers to determine which rows have been added/updated/deleted and use upsert. Instead, it is as simple as dumping all the data into a CSV and uploading it through DataSync "SmartUpdate".

Trim leading and trailing whitespace in CSV header column

If the header row of a CSV file has leading or trailing whitespace around any of the column names, the upload will fail for those columns. This can be fixed by automatically trimming the whitespace around each header column name before publishing the CSV, as in the sketch below.
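A minimal sketch of the idea (illustrative only, not the actual DataSync code):

// Trim leading/trailing whitespace from each header name before publishing
static String[] trimHeaders(String[] headers) {
    String[] trimmed = new String[headers.length];
    for (int i = 0; i < headers.length; i++) {
        trimmed[i] = headers[i].trim();
    }
    return trimmed;
}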

UnrecognizedPropertyException occurs when publishing via 'upsert' on some datasets

When running jobs that publish data to certain datasets (with specific sharing configurations), this error occurs when using 'upsert' in DataSync:

com.sun.jersey.api.client.ClientHandlerException: org.codehaus.jackson.map.exc.UnrecognizedPropertyException: Unrecognized field "userEmail" (Class com.socrata.model.importer.Grant), not marked as ignorable

This is a bug in the soda-java library, which was fixed in version 0.9.4.

Enable user to (optionally) specify a :deleted column marking rows to be deleted

This issue is related to another issue: #14

There should be an upload method called append/upsert/delete which allows specifying a column in the CSV with the header :deleted, where any value of true in that column triggers a delete of that row (see the example below). Be sure to notify the user if no Row ID is set in the dataset. Include a Help icon explaining how this works.
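For example, with hypothetical data and assuming the dataset's Row ID is the id column, the following CSV would upsert rows 1 and 3 as usual and delete row 2:

id,name,:deleted
1,Alice,
2,Bob,true
3,Carol,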

Add the ability to specify a job (.sij) file to open when launching DataSync from the command line or a shortcut

This would allow for creating shortcuts that not only launch DataSync but also open a specified job file, similar to what one can do with many applications that open/edit files. I know you can specify a job file when running a job, but there could be situations where someone wants to open it before running it interactively.

It is possible this can already be done and I am just not figuring out how to do it.

Thanks.

Trying to populate a URL Type field with an invalid URL format produces an error

A dataset will allow an invalid URL (although it greys it out), but DataSync produces an error message. My preference would be that if the portal itself allows the value, DataSync should allow it too. Dropping invalid values might be OK if necessary.

I am not in a position to give the specific example to reproduce the error since it is a private dataset under development (happy to provide the link to Adrian or someone else at Socrata but cannot post it quite this publicly) and I am changing the field to Plain Text in order to be able to proceed for now. However, I suspect it would be easy to reproduce with a test dataset. Leaving off the top level domain and using a comma instead of a period both create this error, although they are not the only conditions that produce it.

Restructure Publish methods to be less confusing

Although it is not made clear in the existing DataSync UI, the 'upsert' and 'append' methods behave in exactly the same way. When an upsert or append is performed and no Row ID is set for the dataset being published to, all the rows are appended. If a Row ID is set, then whether you use 'upsert' or 'append', the uploaded rows that match existing rows will update them.

Proposed change: when the user selects append as the publish method, DataSync should check whether a Row ID is set on the dataset. If one is set, a message should appear warning the user about it. A help bubble explaining all of this should also appear next to the "Publish method" field in the UI.

Also need to update knowledge base articles.

Enable uploading CSV files that do not contain a column header row

DataSync should give the user the option of uploading a CSV file without column headers. Provide a checkbox the user can check to tell DataSync that their CSV does not contain a column header row. If this box is checked, DataSync should rely on the order of the columns in the CSV.

It might also be useful to give the user feedback in the UI if the column headers of the CSV file (or lack thereof) do not match those of the dataset.

Migrate to maven

Use Maven for package management. Move mail, soda-java, and org.json packages to be imported from Maven. Also update the README to reflect this change.

Integrate DataPort into DataSync

DataPort is a command-line tool that allows you to transfer data from one Socrata dataset to another or to duplicate an existing Socrata dataset. DataSync will gain a new job type called a Port job, which can be configured to do the things DataPort supports as part of DataSync.

Implement SmartUpdate feature for efficient publishing of data from very large files

Enable normal DataSync jobs to be 'smart' about the data they publish by publishing only the rows of data that were updated or added in the CSV (the file to publish). DataSync will maintain a record of the changes made to the CSV file since the last publish operation was performed (i.e. the last time the job was run) in order to identify which rows were updated or added since the last run. When the job runs, it will then publish only those newly updated or added rows (rather than all rows in the CSV, as it does now). This feature will only work for append/upsert (not replace).

Optimize DataSync upserting API calls and chunking strategy

To optimize the upsert API calls we should investigate using the upsert CSV method directly (but this may only work for files that contain headers). This may be challenging to support with chunking.

It makes a lot more sense to have chunk size be based on file size rather than number of rows.
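For instance, a byte-based chunking heuristic could estimate rows per chunk from the average row size (an illustrative sketch, not DataSync's actual strategy):

// Estimate how many rows fit in a chunk of roughly targetChunkBytes
static int rowsPerChunk(long fileSizeBytes, long rowCount, long targetChunkBytes) {
    long avgRowBytes = Math.max(1, fileSizeBytes / Math.max(1, rowCount));
    return (int) Math.max(1, targetChunkBytes / avgRowBytes);
}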

Allow saving settings on a machine that is not able to run the GUI interface

With the GUI interface, one can set certain settings and preferences once and have them applied to all jobs (unless specifically overridden by a command-line parameter, I assume). They are saved in the Windows registry or other OS-specific location.

However, this only works on a machine capable of running the GUI. In some cases, such as a Linux server accessed through a terminal emulator, that is not possible. It would be great if there were a way to save these settings through the command line. Maybe some parameter that effectively said "Save all these other parameters as if entered through the GUI"? Would that allow for substantial reuse of the code the GUI uses to save these settings?

I realize this does not add a lot of security vs. using command-line parameters or a configuration JSON file, but it adds a little (even if only security through obscurity) and adds convenience vs. the command line.

Thank you.

Silent failure when running a job

I ran a job (interactively) and it failed silently -- meaning no alert, no apparent change to the dataset, and no entry in the Log Dataset. The only signal is that the Run Job Now button becomes clickable again.

It is a Replace job and the source CSV is 76 MB. I have the chunking threshold at 64 MB. I initially had the chunk size at the default of 25,000 rows but lowered it as low as 10,000, with no success.

What other diagnostic information can I provide?

Thank you.

Enable file chunking support for 'replace' method

Currently chunking (which enables uploading very large files) is only supported by 'append' and 'upsert'.

This should be implemented by creating a working copy of the dataset via the .copySchema method and then pushing rows to the resulting working copy in chunks. This prevents the dataset from being left in a bad/inconsistent state if a job fails part-way through.
