linkedconnections / gtfs2lc
GTFS to Linked Connections converter
Home Page: http://linkedconnections.org
License: MIT License
fast-csv was updated from 3.2.0 to 3.3.0. This version is covered by your current version range, and after updating it in your project the build failed.
fast-csv is a direct dependency of this project, and it is very likely causing the break. If other packages depend on yours, this update is probably also breaking those in turn.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
In order to achieve a faster translation, we could split the connections.txt file into parts and launch multiple gtfs2lc.js workers, depending on the number of cores of the machine.
If both are unavailable, we just discard this connection right now. However, when the Real-Time version is then run, it will not be able to match an RT update with this connection. We should keep it in there after all.
If a lot of calendar entries are present in a GTFS file for the same trip id, the script will crash. This is caused in calendar.js:88.
RangeError: Maximum call stack size exceeded
at RegExp.test (<anonymous>)
at expandFormat (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:627:48)
at configFromStringAndFormat (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:2407:18)
at prepareConfig (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:2575:13)
at createFromConfig (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:2544:44)
at createLocalOrUTC (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:2631:16)
at createLocal (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:2635:16)
at hooks (/home/bert/Desktop/linked-connections-server/node_modules/moment/moment.js:12:29)
at StreamIterator.next (/home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/StreamIterator.js:34:56)
at CalendarToServices._processCalendarDates (/home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/services/calendar.js:86:33)
at /home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/services/calendar.js:88:14
at StreamIterator.next (/home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/StreamIterator.js:36:5)
at CalendarToServices._processCalendarDates (/home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/services/calendar.js:86:33)
at /home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/services/calendar.js:88:14
at StreamIterator.next (/home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/StreamIterator.js:36:5)
at CalendarToServices._processCalendarDates (/home/bert/Desktop/linked-connections-server/node_modules/gtfs2lc/lib/services/calendar.js:86:33)
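The trace alternates between StreamIterator.next and _processCalendarDates, which suggests one recursive hop per calendar entry. A minimal sketch of the usual fix (not the actual gtfs2lc code) is to drain the entries in a loop so the stack stays flat:

```javascript
// Sketch only: a loop-based drain instead of the recursive
// next() -> callback -> next() pattern suggested by the stack trace.
// `iterator` is assumed to return null when exhausted.
function processAll(iterator, handleEntry) {
  let entry;
  while ((entry = iterator.next()) !== null) {
    handleEntry(entry); // handle one calendar entry, then loop
  }
}
```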
Example GTFS file (Too big for git): https://filehost.net/89f4172762918be7
I am aware that this GTFS file does not follow GTFS best practices (calendar.txt is not used; calendar_dates.txt is used instead), but if mass adoption is to follow, it might be best to support this.
We are starting to talk about using Itinero-transit commercially, but not having platform numbers is a bit of a deal breaker for that.
I know this is a bit of an issue, so I wanted to log this here and (re)start the effort to get the data.
calendar.txt is required according to the GTFS reference, yet not all GTFS feeds actually provide one. We should handle the case where no calendar.txt is available.
GTFS block_ids are local identifiers that might change from one version to the next. By exposing the information of their associated trips, we enable the creation of stable URIs via the URI template configuration.
Connection’s headsign is incorrect according to the spec.
fast-csv was updated from 4.0.3 to 4.1.0. This version is covered by your current version range, and after updating it in your project the build failed.
fast-csv is a direct dependency of this project, and it is very likely causing the break. If other packages depend on yours, this update is probably also breaking those in turn.
The new version differs by 4 commits.
682710d
v4.1.0
b9dd314
Merge pull request #327 from C2FO/v4.1.0-rc
22e4fb7
Added benchmarks for files of 1000 and 10000
c0d8f72
Added headers
event #321
See the full diff
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
How are we going to give a unique and persistent identifier to the connections, even when the data gets erased and added again?
First of all, for a real-world connection published by us, I suggest using http://id.linkedconnections.org/{feedid}/{version}/{localid}.
The localid is then composed out of:
- the gtfs:trip local identifier
- a YYYY-MM-DD string for the day the trip is executed
Mind that this URI strategy is specific to gtfs2lc. A different URI strategy can be chosen for different systems.
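A minimal sketch of the localid composition above (parameter names and function shape are this example's assumptions, not a gtfs2lc API):

```javascript
// Sketch of the suggested URI strategy:
// http://id.linkedconnections.org/{feedid}/{version}/{localid}
// where localid = the trip's local identifier + the day it is executed.
function connectionUri(feedid, version, tripId, serviceDate /* 'YYYY-MM-DD' */) {
  return `http://id.linkedconnections.org/${feedid}/${version}/${tripId}/${serviceDate}`;
}
```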
I have created a separate repository for redirecting (303) GET requests to URIs of pages containing this connection.
A trip is executed with a certain mode. E.g., Bus, Tram, Gondola, etc.
How can we map this into gtfs2lc?
Wait for #60
Make sure we can add -s and -e to the command line parameters with a startDate and endDate for what we want to transform. Both parameters should be optional
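A sketch of how optional -s/-e could behave (a plain argv scan for illustration; gtfs2lc's real CLI wiring may differ). ISO YYYY-MM-DD dates compare correctly as plain strings, so the window check stays simple:

```javascript
// Sketch only: parse optional -s/--startDate and -e/--endDate from argv.
function parseDateRange(argv) {
  const opts = { startDate: null, endDate: null };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '-s' || argv[i] === '--startDate') opts.startDate = argv[++i];
    else if (argv[i] === '-e' || argv[i] === '--endDate') opts.endDate = argv[++i];
  }
  return opts;
}

// A connection is kept when its departure day falls inside the window.
// Both bounds are optional; a missing bound means "no limit on that side".
function inRange(departureDate, { startDate, endDate }) {
  if (startDate && departureDate < startDate) return false;
  if (endDate && departureDate > endDate) return false;
  return true;
}
```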
When trying to convert the attached GTFS file, GTFS2LC crashes
3|datasets | Indexing services and routes succesful!
3|datasets | Error: Unhandled "error" event. (Did not find this trip id in trips.txt: 55408414)
3|datasets | at ConnectionsBuilder.emit (events.js:186:19)
3|datasets | at ConnectionsBuilder.onerror (_stream_readable.js:663:12)
3|datasets | at emitOne (events.js:116:13)
3|datasets | at ConnectionsBuilder.emit (events.js:211:7)
3|datasets | at onwriteError (_stream_writable.js:417:12)
3|datasets | at onwrite (_stream_writable.js:439:5)
3|datasets | at ConnectionsBuilder.afterTransform (_stream_transform.js:90:3)
3|datasets | at _expandTrip.then.catch (/var/www/dk.lc.bertmarcelis.be/node_modules/gtfs2lc/lib/ConnectionsBuilder.js:79:5)
3|datasets | at <anonymous>
3|datasets | at process._tickCallback (internal/process/next_tick.js:189:7)
My guess: FeedValidator does not complain about unused trips. Only trips that are in trips.txt should be used, not all trips in stop_times.txt.
Removing references to this trip from stop_times resolves the issue.
routes.txt still doesn’t get UTF-16 artefacts removed
GTFS to test with: https://www.data.gouv.fr/en/datasets/offre-de-transport-rbus-de-la-c-a-de-rochefort-ocean/
Fix by adding routes.txt to gtfs2lc-sort.
Thanks @l-vincent-l for showing this bug!
After installing with "npm install" the modules csv, level, n3, q and unzip, I started the execution of gtfs2lc at 10:35 am: ./gtfs-csv2connections path/to/data/transitEMT.zip > path/to/data/emtConnections.ttl
Let me know if you want to access the EMT GTFS data.
After completing some steps, at 4:15 pm it crashed with the following message:
Draining Agencies
Transforming Calendar
Transforming CalendarDates
Transforming Frequencies
Transforming Routes
Draining Shapes and Shape Segments
Draining Stops
Transforming Stop Times
Transforming Trips
Transforming GTFS store to arrival/departures
FATAL ERROR: JS Allocation failed - process out of memory
Aborted (core dumped)
The system crash message is titled "nodejs crashed with SIGABRT in v8::Function::Call()". A crash report was created at /var/crash/ (~90MB).
The following folders were created at the execution path: arrivals, dates, departures, and stop_times. They all contain .ldb documents and a LOG, among others. Let me know if you need to see any of the logs.
I am using Ubuntu 14.04.1 LTS "Trusty Tahr", running on a Toshiba Portégé with Intel CORE i7 and 16GB RAM.
Different options to implement the identifier strategy for e.g., keeping a block ID persistent:
Using just local identifiers for e.g., a block id will give 2 problems:
Suggested solution to introduce a global identifier:
https://example.org/blocks/{block_id}
This solves the problem of federating over different sources, but not yet the problem of making it work when an updated GTFS file gets translated to LC (unless, for your GTFS feed, block ids are incremental over time and you can rely on this).
So we need to scope it to the specific GTFS feed and this brings us to another problem: how do you identify this specific GTFS feed or the fact it got translated to RDF here.
Suggestion for a version number: use feed_version from feed_info.txt.
Design issue: don’t include the patch version, so that a block id stays the same when the major and minor version numbers didn’t change (e.g., 1.2.0 → 1.2.1)?
The gtfs2lc URI template for e.g. a block then becomes:
https://example.org/blocks/{feed_version}/{block_id}
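Expanding such a template can be sketched with a small replacer (an illustration, not gtfs2lc's actual template engine):

```javascript
// Sketch: expand {name} placeholders in a URI template from a values map.
function expandTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (_, key) => values[key]);
}
```

For example, with feed_version "1.2" and block_id "42" the template yields https://example.org/blocks/1.2/42.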
When converting to jsonld (or mongold), our process takes a lot longer because jsonld compacting takes place after the raw triples are brought together in a JSON object.
We could, however, do the conversion to jsonld a lot faster by just converting the JSON objects that come out of the transformer, instead of using the jsonld-stream library.
When modifying the trip rule object to add its start time we need to make sure a new trip rule object gets created for each iteration of this for loop to avoid wrongful mutation when doing fast stream reading.
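The per-iteration copy can be sketched as below (names are illustrative, not the actual gtfs2lc internals); spreading the base rule into a fresh object per start time means no two connections share a mutable rule:

```javascript
// Sketch: build a fresh rule object per start time instead of mutating one
// shared object, which can be overwritten mid-flight during fast stream reads.
function expandTripRules(baseRule, startTimes) {
  return startTimes.map((startTime) => ({ ...baseRule, startTime }));
}
```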
Right now, example.org is used in every RDF output. This should be changed to something configurable.
I suggest -b --baseUris <baseUri>: a mapping file with base URIs for RDF outputs.
Add a way to import the data into mongodb
\n vs. \r\n: De Lijn uses them randomly. Just make sure that when sorting, we also normalize every line ending to a single \n.
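The normalization can be sketched in one line; something like this could run before sorting (an illustration, not the actual gtfs2lc-sort implementation):

```javascript
// Sketch: normalize \r\n (and any stray \r) to a single \n before sorting,
// so mixed line endings cannot split or merge records.
function normalizeNewlines(text) {
  return text.replace(/\r\n|\r/g, '\n');
}
```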
Make sure that if a leveldb already exists in the directory, it's thrown away before starting
moment.js has performance issues (date-fns/date-fns#275 (comment))
We should move to date-fns for our dates and work on top of the native Date JS object
https://github.com/date-fns/date-fns
@julianrojas87 Also relevant for GTFS-RT2LC and the LC Server
I tried to convert the 2021-02-12 VBB GTFS feed.
npm init --yes
npm i gtfs2lc -D
wget -r --no-parent --no-directories -P gtfs -N 'https://vbb-gtfs.jannisr.de/2021-02-12/'
# rename all .csv to #.txt …
env NODE_ENV=production gtfs2lc gtfs -f jsonld | head -n 3
# GTFS to linked connections converter use --help to discover more functions
# Indexing of stops, services, routes and trips completed successfully!
# Created worker thread (PID 1)
# Created worker thread (PID 2)
# Created worker thread (PID 3)
# Created worker thread (PID 4)
# [Error: ENOENT: no such file or directory, open 'gtfs/connections_0.txt'] {
# errno: -2,
# code: 'ENOENT',
# syscall: 'open',
# path: 'gtfs/connections_0.txt'
# }
# Error: Worker stopped with exit code 1
# at Worker.<anonymous> (/Users/j/playground/vbb-gtfs-lc/node_modules/gtfs2lc/lib/gtfs2connections.js:145:27)
# at Worker.emit (node:events:378:20)
# at Worker.[kOnExit] (node:internal/worker:260:10)
# at Worker.<computed>.onexit (node:internal/worker:187:20)
# [Error: ENOENT: no such file or directory, open 'gtfs/connections_1.txt'] {
# errno: -2,
# code: 'ENOENT',
# syscall: 'open',
# path: 'gtfs/connections_1.txt'
# }
# Error: Worker stopped with exit code 1
# at Worker.<anonymous> (/Users/j/playground/vbb-gtfs-lc/node_modules/gtfs2lc/lib/gtfs2connections.js:145:27)
# at Worker.emit (node:events:378:20)
# at Worker.[kOnExit] (node:internal/worker:260:10)
# at Worker.<computed>.onexit (node:internal/worker:187:20)
# [Error: ENOENT: no such file or directory, open 'gtfs/connections_2.txt'] {
# errno: -2,
# code: 'ENOENT',
# syscall: 'open',
# path: 'gtfs/connections_2.txt'
# }
# Error: Worker stopped with exit code 1
# at Worker.<anonymous> (/Users/j/playground/vbb-gtfs-lc/node_modules/gtfs2lc/lib/gtfs2connections.js:145:27)
# at Worker.emit (node:events:378:20)
# at Worker.[kOnExit] (node:internal/worker:260:10)
# at Worker.<computed>.onexit (node:internal/worker:187:20)
# [Error: ENOENT: no such file or directory, open 'gtfs/connections_3.txt'] {
# errno: -2,
# code: 'ENOENT',
# syscall: 'open',
# path: 'gtfs/connections_3.txt'
# }
# Error: Worker stopped with exit code 1
# at Worker.<anonymous> (/Users/j/playground/vbb-gtfs-lc/node_modules/gtfs2lc/lib/gtfs2connections.js:145:27)
# at Worker.emit (node:events:378:20)
# at Worker.[kOnExit] (node:internal/worker:260:10)
# at Worker.<computed>.onexit (node:internal/worker:187:20)
It has also created 4 files inside gtfs:
ls -l gtfs
# -rw-r--r--@ 1 j staff 3537 Feb 22 01:10 agency.txt
# -rw-r--r-- 1 j staff 79382 Feb 22 01:14 calendar.txt
# -rw-r--r-- 1 j staff 859354 Feb 22 01:14 calendar_dates.txt
# -rw-r--r--@ 1 j staff 64 Feb 22 01:10 frequencies.txt
# -rw-r--r--@ 1 j staff 140 Feb 22 01:10 pathways.txt
# -rw-r--r-- 1 j staff 0 Feb 22 01:23 raw_0.json
# -rw-r--r-- 1 j staff 0 Feb 22 01:23 raw_1.json
# -rw-r--r-- 1 j staff 0 Feb 22 01:23 raw_2.json
# -rw-r--r-- 1 j staff 0 Feb 22 01:23 raw_3.json
# -rw-r--r--@ 1 j staff 48812 Feb 22 01:10 routes.txt
# -rw-r--r--@ 1 j staff 143590907 Feb 22 01:10 shapes.txt
# -rw-r--r-- 1 j staff 269753688 Feb 22 01:14 stop_times.txt
# -rw-r--r--@ 1 j staff 4723089 Feb 22 01:10 stops.txt
# -rw-r--r--@ 1 j staff 4200935 Feb 22 01:10 transfers.txt
# -rw-r--r--@ 1 j staff 14019736 Feb 22 01:10 trips.txt
I stumbled upon this because, when building gtfs-utils, I identified this as a bug: GTFS Time values are defined relative to "12 hours before noon" of the service day, so the implementation in this repo seems to fail during DST switches.
gtfs2lc/lib/ConnectionsBuilder.js
Lines 15 to 17 in cb0bdac
related: google/transit#15
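A sketch of the spec-correct computation (timezone handling omitted for brevity; `serviceDay` is assumed to be local midnight of the service day). Anchoring on "noon minus 12 hours" instead of midnight keeps times correct across a DST switch, because local noon is unaffected by the transition:

```javascript
// Sketch: interpret a GTFS Time 'HH:MM:SS' as "noon of the service day
// minus 12 hours, plus HH:MM:SS", as the spec defines it. Hours may
// exceed 24 for trips running past midnight.
function gtfsTimeToDate(serviceDay /* Date at local midnight */, gtfsTime) {
  const [h, m, s] = gtfsTime.split(':').map(Number);
  const noon = new Date(serviceDay);
  noon.setHours(12, 0, 0, 0); // local noon of the service day
  const base = noon.getTime() - 12 * 3600 * 1000; // "12 hours before noon"
  return new Date(base + (h * 3600 + m * 60 + s) * 1000);
}
```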
We need to map the minimum transfer times to Linked Connections as well. Not sure how to model that. Any ideas?
Basically we want to express that if you’re at a gtfs:Station, you can transfer from its gtfs:Stops only if you take into account a minimum transfer time of X seconds.
The path is always needed. Thus, don’t require the -p option any longer.
We need to process frequencies as well...
This means reading an extra file if it exists, and adding extra connections based on the connectionRules stream
Support:
When installed globally, gtfs2lc is copied to /bin/, but then the file stoptimes2connections.js cannot be found anymore, as it is not at $CURDIR/stoptimes2connections.js.
It is valid GTFS to define stop_times with both an empty arrival_time and an empty departure_time. However, we cannot have Connections with empty departure or arrival times.
According to the spec, this implies that the consumer needs to interpolate these stop times, which is difficult in this case given that Connections are created in a streaming way.
A possible fix could be to do this as part of the pre-processing step that already takes place to order stop_times. Reusing an existing tool that handles this scenario might make things easier.
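Linear interpolation over a trip's ordered stop times could be sketched as below (times as seconds since the service day start, null meaning blank; this assumes, as the spec requires, that the first and last stop have explicit times):

```javascript
// Sketch: fill null stop times by spacing them evenly between the
// surrounding known times of the same trip.
function interpolateStopTimes(times) {
  const out = times.slice();
  for (let i = 0; i < out.length; i++) {
    if (out[i] !== null) continue;
    const prev = i - 1; // assumed known: first/last stop have explicit times
    let next = i;
    while (out[next] === null) next++;
    const gap = next - prev;
    for (let j = prev + 1; j < next; j++) {
      out[j] = out[prev] + ((out[next] - out[prev]) * (j - prev)) / gap;
    }
    i = next;
  }
  return out;
}
```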
Make sure that timezone is configurable or readable from the feed info
I'm trying to run gtfs2lc on macOS but some problems appear:
dchaves$ gtfs2lc-sort metro/
Converting newlines dos2unix
Removing UTF-8 artifacts in directory metro/
Sorting files in directory metro/
dchaves$ gtfs2lc metro/
GTFS to linked connections converter use --help to discover more functions
The same dataset in an Ubuntu dist works perfectly.
Make another script which is exposed through the gtfs2lc command which ensures the ordering of the extracted GTFS files.
After thinking about this with @smazzoleni, we need the flexibility to generate the baseURIs with arbitrary JavaScript functionality instead of only URI templates.
The solution we favored was to turn the baseURIs.json config file into a JS file with methods instead. Every kind of GTFS feed would then extend this class for its own system.
One such class could actually be an implementation where a config file with URI templates is taken into account (for backwards compatibility), possibly linked as follows for slightly more functionality (idea by @smazzoleni):
{
"stop": "http://data.gtfs.org/example/stops/{stop_id}",
"route": "http://data.gtfs.org/example/routes/{route_short_id}",
"trip": "http://data.gtfs.org/example/trips/{trip_id}/{trip_startTime}",
"connection": "http://example/linkedconnections.org/connections/{trip_startTime}/{departureStop}/{trip_id}",
"resolve": {
"route_short_id": "connection.trip.route.route_id.substring(0,5)",
"trip_id": "connection.trip.trip_id",
"trip_startTime": "format(connection.trip.startTime, 'YYYYMMDDTHHMM');",
"departureStop": "connection.departureStop"
}
}
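Evaluating the resolve entries could be sketched with the Function constructor (an assumption about the mechanism, not an implemented design; a real version would need sandboxing and helpers such as format in scope):

```javascript
// Sketch: evaluate each "resolve" expression against a connection object
// and return a map of template variables. Each configured expression may
// reference the `connection` parameter.
function resolveVars(resolve, connection) {
  const vars = {};
  for (const [name, expr] of Object.entries(resolve)) {
    // new Function is used for illustration only; it runs the configured
    // expression with `connection` bound as a parameter.
    vars[name] = new Function('connection', `return (${expr});`)(connection);
  }
  return vars;
}
```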
Does this expect milliseconds instead of ISO 8601?
The NMBS operates several trains which are split into two independent trains at some point in their journey. The reverse situation also occurs, with two trains merging into one. Splitting or merging always takes place in a station where travellers can (dis)embark the train.
First of all, the train drives the first part of its journey as a whole, during which it is identified by a single identifier, in this case IC4310. When the train splits, one part of the train keeps the identifier (IC4310) for the remainder of its journey. The other part gets a new identifier, in this case IC4410. Even though this is clearly one train, the NMBS website indicates that there are two trips. This causes route planning to think a transfer is needed, even though travellers can remain seated if they are in the right part of the train. NMBS' own route planning takes this into account, but we need a way to determine this for 3rd-party route planning. NMBS does not publish data on which carriages travel where.
IC4310 in trips.txt:
220,000095,88____:007::8821006:8832409:13:1102:20180625,Hamont,4310,,5481,,1
IC4410 in trips.txt:
225,000297,88____:007::8832409:8831005:9:1152:20180625,Hasselt,4410,,5658,,1
Merging trains are similar, and have different identifiers until they merge. Once they merge, one of the trains will continue to travel with the identifier of one of said two trains. It should be noted that the train which loses its identifier on merging doesn't seem to have platform information. Again, route planning will consider this a transfer, even though passengers can remain seated (and thus don't have to transfer). NMBS' own route planning takes this into account, but we need a way to determine this for 3rd-party route planning.
A possible solution for this would be to label the common parts of a journey with both trip ids, by publishing two connection objects for each connection, both objects being identical except for the trip/route id.
Another possible solution would be to create an index of splitting trains, where each row contains two identifiers which belong together in one splitting or merging train. This would be less 'intrusive' to the Linked Connections list, and would prevent certain edge cases (splitting Linked Connections into fragments of a certain filesize could cause two "identical" connections to be split over multiple pages, having to combine and check multiple connection objects would be more complicated compared to only checking a list to determine whether a transfer is 'real' or a split/merge during routeplanning).
Current status
This data is not published in GTFS. There doesn't seem to be any field which implies a train split, or that a train drove part of its journey attached to another train.