
dspace-packer's Introduction

##About

This is a script I'm writing to port our current workflow to a better bash solution (consolidating code, etc.). Ideally, this script should be able to replace everything in this repository:

https://github.com/yorkulibraries/dspacescripts

The function of this script is to create DSpace Simple Archives for batch uploading. The output of the script is a set of directories, each containing the following files: the object for upload, the contents file, and a valid dublin_core.xml.
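To make the output layout concrete, here is a minimal sketch of building one such item directory by hand. It is not taken from the script itself: the function name `make_item_dir`, its arguments, and the single `title` field are assumptions for illustration, and the title is assumed to contain no characters that need XML escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dspace-packer): build one item directory.
# Usage: make_item_dir <item_dir> <object_file> <title>
make_item_dir() {
  item_dir=$1
  object=$2
  title=$3

  mkdir -p "$item_dir"

  # 1. The object for upload.
  cp "$object" "$item_dir/"

  # 2. The contents file: one object filename per line.
  basename "$object" > "$item_dir/contents"

  # 3. A minimal dublin_core.xml (title assumed to be XML-safe).
  cat > "$item_dir/dublin_core.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<dublin_core>
  <dcvalue element="title" qualifier="none">$title</dcvalue>
</dublin_core>
EOF
}
```

Calling `make_item_dir archive/item_001 objects/thesis.pdf "A Title"` would leave `thesis.pdf`, `contents`, and `dublin_core.xml` side by side in `archive/item_001`.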

##Installation

  1. Clone the repo to wherever you want it to live:

    git clone https://github.com/nmpx/dspace-packer.git

  2. Install the dependency/submodule

    The script has one submodule dependency: a Python script (xlsx2csv) that converts the XLSX to a CSV with escaped newlines.

    Grab the submodule by running this command within the script repo:

    git submodule update --init --recursive

  3. Make the script executable

    chmod +x dspace-simple-archive-packager.sh

##Usage

A typical call to the script will look like this:

./dspace-simple-archive-packager.sh -d [delimiter] -o [path/to/objects/] -s [filetype] foo.xlsx

Explanation:

The script takes three mandatory flags:

  1. -d sets the delimiter for the resulting CSV. Pick a character that does not appear in any field; otherwise the parsing will be wrong.
  2. -o is the directory of the objects/items you want to ingest/upload into DSpace.
  3. -s is the suffix of the objects, without the period (e.g. 'pdf' or 'jpg', but not '.pdf' or '.jpg').

The argument the script takes is the metadata for the objects as an XLSX spreadsheet.

All these elements are necessary for a successful execution of the script.
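As an illustration only, the requirements above could be checked with `getopts` along these lines. This is a sketch, not the script's actual code: the function name `parse_args` and the variable names `delimiter`, `objects_dir`, `suffix`, and `spreadsheet` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parse -d, -o, -s and one positional XLSX argument.
parse_args() {
  OPTIND=1
  delimiter="" objects_dir="" suffix="" spreadsheet=""

  while getopts ":d:o:s:" opt; do
    case $opt in
      d) delimiter=$OPTARG ;;
      o) objects_dir=$OPTARG ;;
      s) suffix=$OPTARG ;;
      *) echo "usage: -d delimiter -o objects_dir -s suffix file.xlsx" >&2
         return 1 ;;
    esac
  done
  shift $((OPTIND - 1))
  spreadsheet=${1:-}

  # All three flags and the spreadsheet are mandatory.
  if [ -z "$delimiter" ] || [ -z "$objects_dir" ] || \
     [ -z "$suffix" ] || [ -z "$spreadsheet" ]; then
    echo "error: -d, -o, -s and an XLSX file are all required" >&2
    return 1
  fi

  # The suffix is given without the leading period.
  case $suffix in
    .*) echo "error: give the suffix without the period (pdf, not .pdf)" >&2
        return 1 ;;
  esac
}
```

With this sketch, `parse_args -d '|' -o objects/ -s pdf foo.xlsx` succeeds, while leaving out any flag or passing `-s .pdf` returns a non-zero status.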

##Why xlsx2csv

My initial hope was that I could write something that would just allow us to use the regular 'save as... csv' option in Excel or LibreOffice. But this proved impossible.

The major stumbling block? Newlines. Our data often includes fields, such as someone's abstract, that contain newlines. When you use a GUI tool to export to CSV, these newlines are not escaped. That essentially ruins your ability to work with the resulting CSV on the command line, where tools process text files line by line: a single record whose abstract contains newlines might actually occupy three lines.
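A small sketch makes the problem concrete. The sample data and the function names `sample_csv` and `count_lines` are made up for illustration; the CSV has one header and two records, but the first record's abstract spans three physical lines.

```shell
#!/usr/bin/env bash
# Made-up sample: 1 header + 2 records, with unescaped newlines in one field.
sample_csv() {
  printf '%s\n' \
    'title,abstract' \
    'First,"Line one' \
    'line two' \
    'line three"' \
    'Second,"Single line"'
}

# Naive line-by-line processing, as most command-line tools do it.
count_lines() {
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
  done
  echo "$n"
}

sample_csv | count_lines   # prints 5, not the 3 rows the spreadsheet had
```

Any loop that treats one line as one record would split the first record into three broken pieces.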

And bash-only tools are rather limited in their ability to manipulate CSVs. I'm sure that converting from xlsx to csv using bash tools is possible, but it's beyond my current skill level.

xlsx2csv was the only tool I was able to find that escaped and preserved newlines, so that when the csv is converted to xml, the escaped newlines can be expanded again, to ensure that the metadata looks the way it ought to.
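The expansion step can be sketched as follows. This assumes the escaped form in the CSV is the literal two-character sequence `\n` (a backslash followed by `n`); the function name `expand_newlines` is hypothetical and the script itself may do this differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn literal "\n" sequences back into real newlines.
# Assumes the field was escaped as backslash + "n" in the CSV.
expand_newlines() {
  # printf '%s' prints the argument as-is, without interpreting escapes;
  # awk then replaces each backslash-n pair with an actual newline.
  printf '%s\n' "$1" | awk '{ gsub(/\\n/, "\n"); print }'
}

expand_newlines 'First paragraph.\nSecond paragraph.'
```

The single escaped line comes back out as two lines, ready to be written into the metadata value.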

I'd love it if someone were to point me in the direction (either by tutorial or help with the code) of doing this wholly within bash to remove the python dependency.

##One note of warning...

Depending on your system, you may or may not need to adjust the python command to include 'sudo'. In my local environment I need this to run python scripts but not on our server.

Something to keep in mind.
