adobe / helix-data-embed Goto Github PK

View Code? Open in Web Editor NEW

4.0 34.0 2.0 25.06 MB

Turn data into embed-friendly JSON arrays

License: Apache License 2.0

JavaScript 100.00%

helix data embed helix2 service

helix-data-embed's Introduction

Helix Data Embed

Turns structured data from data sources around the web into nice JSON arrays that can be embedded using Helix-Pipeline

Status

Installation

Usage

curl https://adobeioruntime.net/api/v1/web/helix/helix-services/data-embed@v3/https://blogs.adobe.com/psirt/?feed=atom

While the above is simple to type, it is more safe to escape the url and optionally pass it as src query parameter:

curl https://adobeioruntime.net/api/v1/web/helix/helix-services/data-embed@v3/https%3A%2F%2Fblogs.adobe.com%2Fpsirt%2F%3Ffeed%3Datom

curl https://adobeioruntime.net/api/v1/web/helix/helix-services/data-embed@v3?src=https%3A%2F%2Fblogs.adobe.com%2Fpsirt%2F%3Ffeed%3Datom

Data Sources

Supported data sources include:

Microsoft Excel (in Excel, share with [email protected] then copy the sharable URL)
Google Sheets (in Google Sheets, share with [email protected] and copy the URL from the browser)
Atom Feeds (must have atom somewhere in the URL)

Filtering Results

helix-data-embed supports the AEM Query Builder syntax for reducing the result set.

In order to avoid collisions with existing URL parameters, each QueryBuilder parameter must start with hlx_. For example to filter entries that have a property bar with the value foo, append the following to the URL:

hlx_property=foo&hlx_value=bar

If you want to restrict by range, use:

hlx_rangeproperty.property=age&hlx_rangeproperty.lowerBound=18&hlx_rangeproperty.upperBound=99

The predicates supported so far include:

(Just remember to add hlx_ before each URL parameter name)

Furthermore, it is possible to limit the result set using hlx_p.limit and page through the result set using hlx_p.offset.

For more, see the API documentation.

Working with Excel and Google Sheets

The sheet inside an Excel workbook or Google spreadsheet can be addressed using the sheet parameter.
Only sheets having the helix- prefix can be addressed.
If the workbook or spreadsheet does not have any helix- prefixed sheets, the first sheet is returned.
By default, the used range of the selected sheet is returned.
For excel, A table can be addressed using the table request parameter, which can be a table name or an index. For example, table=Table1 will return the table with the name Table1, table=1 will return the second table in the sheet.

Development

Deploying Helix Data Embed

Deploying Helix Data Embed requires the wsk command line client, authenticated to a namespace of your choice. For Project Helix, we use the helix namespace.

All commits to main that pass the testing will be deployed automatically. All commits to branches that will pass the testing will get commited as /helix-services/data-embed@ci<num> and tagged with the CI build number.

helix-data-embed's People

Contributors

Stargazers

Watchers

Forkers

isabella232 ghas-results

helix-data-embed's Issues

Explore Microsoft Lists

MS Lists is an AirTable lookalike. If it has a reasonable API, it could become another useful data source for us. If it doesn't it might have a good Excel integration, which would be as good.

Filter values using URL parameters

If you have a long table of data, but want to restrict the response to a small subset (maybe just one row), it would be useful to have a number of URL parameters that can be used to filter the result set.

I'm seeing following options:

port the AEM QueryBuilder – the query builder was designed to express complex queries in the form of URL parameters that are easily assembled and would provide one additional layer of compatibility with AEM
use ODATA Query Parameters – this would make the implementation for Excel trivial because we can just pass on the parameter
use MongoDB Query Operators as URL parameters – relatively easy to build with sift

@davidnuescheler @rofe @tripodsan which of the three options would you like to see?

https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl&property=url&property.value=https://theblog.adobe.com/silka-miesnieks-designing-immersive-world/&property.operation=equals

https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl&$filter=url eq 'https://theblog.adobe.com/silka-miesnieks-designing-immersive-world/'

https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl&query={"url": "https://theblog.adobe.com/silka-miesnieks-designing-immersive-world/"}

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: config
Error type: Invalid allowedVersions
Message: The following allowedVersions does not parse as a valid version or range: "<15>"

Allow access token to be passed via authentication header

instead of authenticating in word2md it should be possible to pass an access token via the authentication header.

CI Build is failing due to unexpected arguments

https://app.circleci.com/pipelines/github/adobe/helix-data-embed/1/workflows/3406f8ed-9e2d-4d23-8b91-d5149f8206aa/jobs/1

#!/bin/sh -eo pipefail
# Error calling workflow: 'build'
# Error calling job: 'semantic-release'
# Error calling command: 'helix-post-deploy/monitoring'
# Unexpected argument(s): incubator
# 
# -------
# Warning: This configuration was auto-generated to show you the message above.
# Don't rerun this job. Rerunning will have no effect.
false

support passing the drive-item-id as src parameter

Is your feature request related to a problem? Please describe.
the sharelink to itemid is done twice: once in the content-proxy and once in data-embed. if data-embed would support datasources based on item ids, the 2nd lookup could be omitted.

with the new uri support in onedrive-support, the content-proxy can now pass the item it as onedrive:/drive/... uri.
see adobe/helix-onedrive-support#68

Data Discrepancy between Excel and API

$ curl "https://adobeioruntime.net/api/v1/web/helix-pages/helix-services/data-embed@v1?src=https:%2F%2Fadobe.sharepoint.com%2F:x:%2Fs%2FTheBlog%2FEaikC7JAUNpAoZy604FUP7YB2oTdxZVtrWNihbJ6k-BCnA?email%3D%26e%3DNd2OF4"
[
  {
    "Destination": "https://blog.adobe.com/en/topics/covid-19.html",
    "Notes": "",
    "Source": "/tag/coronavirus/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/adobe-culture.html",
    "Notes": "",
    "Source": "/tag/inside-adobe/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/xd-updates.html",
    "Notes": "No such tag in the taxonomy - these need to be user facing for us to redirect to them",
    "Source": "/tag/xd-updates/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/adobe-summit.html",
    "Notes": "go to Adobe Summit topic page",
    "Source": "/series/adobe-summit-2020-sneaks/ "
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/covid-19.html",
    "Notes": "go to COVID-19 topic tag page  ",
    "Source": "/series/covid-19/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/data--privacy.html",
    "Notes": "go to Content Authenticity topic page",
    "Source": "/series/content-authenticity-initiative/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/customer-stories.html",
    "Notes": "go to Customer Stories topic/menu page - this is internal tag",
    "Source": "/series/behind-the-transformation/ "
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/education.html",
    "Notes": "go to Education topic page",
    "Source": "/series/doing-good-with-design-education/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "Go to Partner Stories topic page - this is internal tag so cant redirect",
    "Source": "/series/adobe-partners/ "
  },
  {
    "Destination": "https://blog.adobe.com/",
    "Notes": "go to Events topic/menu page - no such tag in taxonomy",
    "Source": "/series/imagine-2019/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "go to Digital Transformation topic/menu page",
    "Source": "/series/adobechat/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "go to Digital Transformation topic/menu page",
    "Source": "/series/experience-business-overheard-in-the-c-suite/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "Go to Digital Transformation topic/menu page",
    "Source": "/series/experience-business-now-do-it-at-scale/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "go to Digital Transformation topic/menu page",
    "Source": "/series/the-data-rush-how-to-strike-it-rich/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/events.html",
    "Notes": "go to Events topic/menu page",
    "Source": "/series/cannes-lions/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/creativity.html",
    "Notes": "go to Visual Trends topic page - this is internal so cant redirect",
    "Source": "/series/2018-visual-trends/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/stock.html",
    "Notes": "go to Adobe Stock topic page",
    "Source": "/series/adobe-stock-contributors/ "
  },
  {
    "Destination": "https://blog.adobe.com/",
    "Notes": "go to Document Cloud product topic page - dont havea product landing pages in new system",
    "Source": "/series/sales-essentials/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-literacy.html",
    "Notes": "",
    "Source": "/series/digital-literacy/"
  },
  {
    "Destination": "https://blog.adobe.com/feed.xml",
    "Notes": "RSS Feed URL",
    "Source": "/feed"
  }
]

i.e. only the first 20 entries are included.

Cache recently updated spreadsheets less agressively?

We are caching all spreadsheets for 10 minutes now. What about reducing that time to 1 minute for sheets updated in the last day?

Excel Support is Broken

https://adobeioruntime.net/api/v1/web/helix/helix-services/[email protected]/https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl returns an error 500.

simplify multi-sheet responses

in order to remove unexpected, unused properties we should change the format (again):

single sheet response:

HTTP/2 200
content-type: application/json
x-helix-data-version: 4
x-helix-data-type: sheet
x-helix-data-names: sheet1

{
  "offset": 0,
  "limit": 100,
  "total": 200,
  "data": [ ... ]
}

multi sheet response:

HTTP/2 200
content-type: application/json
x-helix-data-version: 4
x-helix-data-type: multi-sheet
x-helix-data-names: sheet1, sheet2

{
  "sheets": {
    "sheet1": {
      "offset": 0,
      "limit": 100,
      "total": 200,
      "data": [
        ...
      ]
    },
    "sheet2": {
      "offset": 0,
      "limit": 100,
      "total": 200,
      "data": [
        ...
      ]
    }
  }
}

/cc @trieloff @davidnuescheler

Explore GitHub issues

GitHub issues could give us a lightweight database by providing a wrapper around the GitHub API

Remove query-builder support?

As we have been breaking compatibility with v3 anyway, should we remove query builder support while we are at it?

Set MY_GITHUB_TOKEN secret

Return last modified date of a file in a header

In order to detect changes to a file, it would be helpful if data-embed would return the last modified date of that file, e.g. in the Last-Modified HTTP response header.

This feature would particularly help in solving adobe/theblog#550

escape or don't use `__ow_path` to pass external resource url

Description
this action receives the external resource URL as unescaped path suffix via the __ow_path parameter. this has the drawback, that the query parameters of the resource URL bleed into the action parameters. this makes it volatile to changes in the resource URL, since each extractor needs to anticipate the potential query parameters needed. also, it prevents using query params for real action params.

also see:

Suggestion A

The minimal change that could be done is to url-escape the external resource.

so instead of:

https://adobeioruntime.net/api/v1/web/helix/helix-serivces/data-embed@v1/https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl

we'd use:

https://adobeioruntime.net/api/v1/web/helix/helix-serivces/data-embed@v1/https%3A%2F%2Fadobe.sharepoint.com%2Fsites%2FTheBlog%2F_layouts%2F15%2Fguestaccess.aspx%3Fshare%3DESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew%26email%3Dhelix%2540adobe.com%26e%3Dhx0OUl

Suggestion B

an alternative approach is not to use the __ow_path but a normal action parameter, eg:

https://adobeioruntime.net/api/v1/web/helix/helix-serivces/data-embed@v1?r=https%3A%2F%2Fadobe.sharepoint.com%2Fsites%2FTheBlog%2F_layouts%2F15%2Fguestaccess.aspx%3Fshare%3DESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew%26email%3Dhelix%2540adobe.com%26e%3Dhx0OUl

Support Data Queries

based on helix-run-query

@MarquiseRosier let me know if you want to take this one.

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses. Learn more

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

Check this box to trigger a request for Renovate to run again on this repository

Add option to save data to s3

@trieloff's idea to store (larger) data directly in the underlying storage would make the data processing faster, especially for larger data-sets.

suggest to:

content-bus generates a presigned url for the respective destination object
send request to content-proxy, including a presignedStorageUrl parameter
content-proxy sends along the presignedStorageUrl to data-embed, ideally using a PUT¹
data-embed stores the .json directly in the storage using the presignedStorageUrl
data-embeds responds with a 307², including a location header to the location of the stored object (if possible)
content-proxy returns the same

¹ I'm not sure about using PUT or GET, but writing content on GET feels wrong
² I'm not sure about the redirect response. maybe a 200 when using PUT is better.

add support for per-path metadata extraction

TBD

Modernize data representation

Description
If the response size of the dataset is too large, the response is invalid. better to truncate it.

use workbook API from onedrive-support

Is your feature request related to a problem? Please describe.
onedrive-support has support for workbooks.

Expose real file name

The data are requested via a json request which means the Excel / Sheet document is known via its url. When you need to expose a "friendly" name of the document, only option at the moment is to extract it from the URL. Main issue: the real file name can be different from the original document: 7174_Helix_WS_Pilot_Sep would become 7174-Helix-WS-Pilot-Sep.

2 options:

educate the author to use only dashes and be aware of the URL rules
expose the real file name in the json response as a name property

While 1 is important for what is exposed externally (protect url namespace), we are starting to use spreadsheets for work tracking (translation batches) where here the filename matches user project names and won't be exposed as real public urls.
2 is simple and should not hurt.

For publicly exposed urls, the name property will anyway be equal to the url last segment (minus the extension).

@trieloff @tripodsan WDYT ?

support addressing excel sheets

Is your feature request related to a problem? Please describe.
currently it is not possible to address a specific sheet in an excel workbook. Further it is not possible to define which sheet should be the default.

Describe the solution you'd like
use a query parameter, eg sheet= to select the sheet to fetch. in order to make it impossible to fetch sheets that are considered 'private', only sheets with helix- or helix_ prefix should be addressable.
furthermore, if there is a sheet names helix-default or helix_default, it should be used instead of the first sheet.

/cc @trieloff @davidnuescheler

Pass offset and limit query parameters to MS Graph API

The current implementation of data-embed for excel sheets fetches the usedRange of a sheet, which includes not only all values, but also:

all cell formulas
all cell values as text
all cell value types

and more (see Range for a list of properties implicitly returned as result). The complete response size for the current index definition amounts to 7MB.

It then converts the values into JSON and finally filters them for the rows requested by offset and limit.

Instead, one could compute the requested range and limit the result returned as follows:

Get only the address of the usedRange, e.g. A1-I6200
Compute the address of the requested rows, e.g. A3073-I3328 (for offset=3072&limit=256)
Get the column names (using syntax /worksheets/('sheet')/range(address='A1:I1')?$select=values)
Get the column values (using syntax /worksheets/('sheet')/range(address='A3073:I3328')?$select=values)
Create the expected JSON output

selecting multiple sheets doesn't work

https://master--helix-test--tripodsan.hlx.page/data-embed-unit-tests/multisheet.json?sheet=foo&sheet=tables

src parameter and query builder don't mix

# works
$ http "https://adobeioruntime.net/api/v1/web/helix-pages/helix-services/data-embed@v1/https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/Doc.aspx?sourcedoc=%7BE7BF6B3A-F323-4076-968B-30A1565F2373%7D&file=query-index.xlsx&action=default&mobileredirect=true&hlx_p.limit=1" --print Hhb

# doesn't (returns all records)
http "https://adobeioruntime.net/api/v1/web/helix-pages/helix-services/data-embed@v1?src=https%3A%2F%2Fadobe.sharepoint.com%2Fsites%2FTheBlog%2F_layouts%2F15%2FDoc.aspx%3Fsourcedoc%3D%257BE7BF6B3A-F323-4076-968B-30A1565F2373%257D%26file%3Dquery-index.xlsx%26action%3Ddefault%26mobileredirect%3Dtrue&hlx_p.limit=1" --print Hhb

adobe/helix-content-proxy#55

fetching data from sharepoint is very slow

Description
the requests to fetch the query index from sharepoint are very slow:

It takes about 4-5 seconds for each request:

getting the worksheets
checking for tables
returning the used range

multisheet support

if a workbook / sheets has several helix-* sheets return all of them.

if no sheet is specified and the workbook has helix-* sheets, return all of the helix-* sheets
unless it has a helix-default then only this one is returned
if a workbook has no helix-* sheets, just return the first sheet (as before)
sheet(s) can be selected with one or several sheet= query parameter

Examples

select 1 sheet

...?sheet=one

{
    "limit": 10,
    "offset": 5,
    "data": [...]
}

select multiple sheets

...?sheet=one&sheet=two&limit=10&offset=5&limit=20&offset=0

or

...?sheet=one&sheet=two&limit.one=10&offset.one=5&limit.two=20&offset.two=0

{
  "type": "multi-sheet",
  "names": ["helix-one", "helix-two"],
  "helix-one": {
    "limit": 10,
    "offset": 5,
    "data": [...]
  },
  "helix-two": {
    "limit": 20,
    "offset": 0,
    "data": [...]
  }
}

migrate to helix-deploy

see adobe/project-helix#508

migrate code to use helix-deploy adapter
update CI to use helix-deploy