Code Monkey home page Code Monkey logo

helix-data-embed's Introduction

Helix Data Embed

Turns structured data from data sources around the web into nice JSON arrays that can be embedded using Helix-Pipeline

Status

codecov CircleCI GitHub license GitHub issues LGTM Code Quality Grade: JavaScript semantic-release

Installation

Usage

curl https://adobeioruntime.net/api/v1/web/helix/helix-services/data-embed@v3/https://blogs.adobe.com/psirt/?feed=atom

While the above is simple to type, it is more safe to escape the url and optionally pass it as src query parameter:

curl https://adobeioruntime.net/api/v1/web/helix/helix-services/data-embed@v3/https%3A%2F%2Fblogs.adobe.com%2Fpsirt%2F%3Ffeed%3Datom

or

curl https://adobeioruntime.net/api/v1/web/helix/helix-services/data-embed@v3?src=https%3A%2F%2Fblogs.adobe.com%2Fpsirt%2F%3Ffeed%3Datom

Data Sources

Supported data sources include:

  • Microsoft Excel (in Excel, share with [email protected] then copy the sharable URL)
  • Google Sheets (in Google Sheets, share with [email protected] and copy the URL from the browser)
  • Atom Feeds (must have atom somewhere in the URL)

Filtering Results

helix-data-embed supports the AEM Query Builder syntax for reducing the result set.

In order to avoid collisions with existing URL parameters, each QueryBuilder parameter must start with hlx_. For example to filter entries that have a property bar with the value foo, append the following to the URL:

hlx_property=foo&hlx_value=bar

If you want to restrict by range, use:

hlx_rangeproperty.property=age&hlx_rangeproperty.lowerBound=18&hlx_rangeproperty.upperBound=99

The predicates supported so far include:

(Just remember to add hlx_ before each URL parameter name)

Furthermore, it is possible to limit the result set using hlx_p.limit and page through the result set using hlx_p.offset.

For more, see the API documentation.

Working with Excel and Google Sheets

  • The sheet inside an Excel workbook or Google spreadsheet can be addressed using the sheet parameter.
  • Only sheets having the helix- prefix can be addressed.
  • If the workbook or spreadsheet does not have any helix- prefixed sheets, the first sheet is returned.
  • By default, the used range of the selected sheet is returned.
  • For excel, A table can be addressed using the table request parameter, which can be a table name or an index. For example, table=Table1 will return the table with the name Table1, table=1 will return the second table in the sheet.

Development

Deploying Helix Data Embed

Deploying Helix Data Embed requires the wsk command line client, authenticated to a namespace of your choice. For Project Helix, we use the helix namespace.

All commits to main that pass the testing will be deployed automatically. All commits to branches that will pass the testing will get commited as /helix-services/data-embed@ci<num> and tagged with the CI build number.

helix-data-embed's People

Contributors

adobe-bot avatar dominique-pfister avatar marquiserosier avatar renovate-bot avatar renovate[bot] avatar rofe avatar semantic-release-bot avatar stefan-guggisberg avatar trieloff avatar tripodsan avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

helix-data-embed's Issues

Expose real file name

The data are requested via a json request which means the Excel / Sheet document is known via its url. When you need to expose a "friendly" name of the document, only option at the moment is to extract it from the URL. Main issue: the real file name can be different from the original document: 7174_Helix_WS_Pilot_Sep would become 7174-Helix-WS-Pilot-Sep.

2 options:

  1. educate the author to use only dashes and be aware of the URL rules
  2. expose the real file name in the json response as a name property

While 1 is important for what is exposed externally (protect url namespace), we are starting to use spreadsheets for work tracking (translation batches) where here the filename matches user project names and won't be exposed as real public urls.
2 is simple and should not hurt.

For publicly exposed urls, the name property will anyway be equal to the url last segment (minus the extension).

@trieloff @tripodsan WDYT ?

support passing the drive-item-id as src parameter

Is your feature request related to a problem? Please describe.
the sharelink to itemid is done twice: once in the content-proxy and once in data-embed. if data-embed would support datasources based on item ids, the 2nd lookup could be omitted.

with the new uri support in onedrive-support, the content-proxy can now pass the item it as onedrive:/drive/... uri.
see adobe/helix-onedrive-support#68

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: config
Error type: Invalid allowedVersions
Message: The following allowedVersions does not parse as a valid version or range: "<15>"

escape or don't use `__ow_path` to pass external resource url

Description
this action receives the external resource URL as unescaped path suffix via the __ow_path parameter. this has the drawback, that the query parameters of the resource URL bleed into the action parameters. this makes it volatile to changes in the resource URL, since each extractor needs to anticipate the potential query parameters needed. also, it prevents using query params for real action params.

also see:

Suggestion A

The minimal change that could be done is to url-escape the external resource.

so instead of:

https://adobeioruntime.net/api/v1/web/helix/helix-serivces/data-embed@v1/https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl

we'd use:

https://adobeioruntime.net/api/v1/web/helix/helix-serivces/data-embed@v1/https%3A%2F%2Fadobe.sharepoint.com%2Fsites%2FTheBlog%2F_layouts%2F15%2Fguestaccess.aspx%3Fshare%3DESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew%26email%3Dhelix%2540adobe.com%26e%3Dhx0OUl

Suggestion B

an alternative approach is not to use the __ow_path but a normal action parameter, eg:

https://adobeioruntime.net/api/v1/web/helix/helix-serivces/data-embed@v1?r=https%3A%2F%2Fadobe.sharepoint.com%2Fsites%2FTheBlog%2F_layouts%2F15%2Fguestaccess.aspx%3Fshare%3DESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew%26email%3Dhelix%2540adobe.com%26e%3Dhx0OUl

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses. Learn more

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.


  • Check this box to trigger a request for Renovate to run again on this repository

Explore Microsoft Lists

MS Lists is an AirTable lookalike. If it has a reasonable API, it could become another useful data source for us. If it doesn't it might have a good Excel integration, which would be as good.

simplify multi-sheet responses

in order to remove unexpected, unused properties we should change the format (again):

single sheet response:

HTTP/2 200
content-type: application/json
x-helix-data-version: 4
x-helix-data-type: sheet
x-helix-data-names: sheet1

{
  "offset": 0,
  "limit": 100,
  "total": 200,
  "data": [ ... ]
}

multi sheet response:

HTTP/2 200
content-type: application/json
x-helix-data-version: 4
x-helix-data-type: multi-sheet
x-helix-data-names: sheet1, sheet2

{
  "sheets": {
    "sheet1": {
      "offset": 0,
      "limit": 100,
      "total": 200,
      "data": [
        ...
      ]
    },
    "sheet2": {
      "offset": 0,
      "limit": 100,
      "total": 200,
      "data": [
        ...
      ]
    }
  }
}

/cc @trieloff @davidnuescheler

Remove query-builder support?

As we have been breaking compatibility with v3 anyway, should we remove query builder support while we are at it?

support addressing excel sheets

Is your feature request related to a problem? Please describe.
currently it is not possible to address a specific sheet in an excel workbook. Further it is not possible to define which sheet should be the default.

Describe the solution you'd like
use a query parameter, eg sheet= to select the sheet to fetch. in order to make it impossible to fetch sheets that are considered 'private', only sheets with helix- or helix_ prefix should be addressable.
furthermore, if there is a sheet names helix-default or helix_default, it should be used instead of the first sheet.

/cc @trieloff @davidnuescheler

CI Build is failing due to unexpected arguments

https://app.circleci.com/pipelines/github/adobe/helix-data-embed/1/workflows/3406f8ed-9e2d-4d23-8b91-d5149f8206aa/jobs/1

#!/bin/sh -eo pipefail
# Error calling workflow: 'build'
# Error calling job: 'semantic-release'
# Error calling command: 'helix-post-deploy/monitoring'
# Unexpected argument(s): incubator
# 
# -------
# Warning: This configuration was auto-generated to show you the message above.
# Don't rerun this job. Rerunning will have no effect.
false

Add option to save data to s3

@trieloff's idea to store (larger) data directly in the underlying storage would make the data processing faster, especially for larger data-sets.

suggest to:

  1. content-bus generates a presigned url for the respective destination object
  2. send request to content-proxy, including a presignedStorageUrl parameter
  3. content-proxy sends along the presignedStorageUrl to data-embed, ideally using a PUT1
  4. data-embed stores the .json directly in the storage using the presignedStorageUrl
  5. data-embeds responds with a 3072, including a location header to the location of the stored object (if possible)
  6. content-proxy returns the same

1 I'm not sure about using PUT or GET, but writing content on GET feels wrong
2 I'm not sure about the redirect response. maybe a 200 when using PUT is better.

Filter values using URL parameters

If you have a long table of data, but want to restrict the response to a small subset (maybe just one row), it would be useful to have a number of URL parameters that can be used to filter the result set.

I'm seeing following options:

  1. port the AEM QueryBuilder – the query builder was designed to express complex queries in the form of URL parameters that are easily assembled and would provide one additional layer of compatibility with AEM
  2. use ODATA Query Parameters – this would make the implementation for Excel trivial because we can just pass on the parameter
  3. use MongoDB Query Operators as URL parameters – relatively easy to build with sift

@davidnuescheler @rofe @tripodsan which of the three options would you like to see?

https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl&property=url&property.value=https://theblog.adobe.com/silka-miesnieks-designing-immersive-world/&property.operation=equals

https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl&$filter=url eq 'https://theblog.adobe.com/silka-miesnieks-designing-immersive-world/'

https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/guestaccess.aspx?share=ESR1N29Z7HpCh1Zfs_0YS_gB4gVSuKyWRut-kNcHVSvkew&email=helix%40adobe.com&e=hx0OUl&query={"url": "https://theblog.adobe.com/silka-miesnieks-designing-immersive-world/"}

Explore GitHub issues

GitHub issues could give us a lightweight database by providing a wrapper around the GitHub API

src parameter and query builder don't mix

# works
$ http "https://adobeioruntime.net/api/v1/web/helix-pages/helix-services/data-embed@v1/https://adobe.sharepoint.com/sites/TheBlog/_layouts/15/Doc.aspx?sourcedoc=%7BE7BF6B3A-F323-4076-968B-30A1565F2373%7D&file=query-index.xlsx&action=default&mobileredirect=true&hlx_p.limit=1" --print Hhb

# doesn't (returns all records)
http "https://adobeioruntime.net/api/v1/web/helix-pages/helix-services/data-embed@v1?src=https%3A%2F%2Fadobe.sharepoint.com%2Fsites%2FTheBlog%2F_layouts%2F15%2FDoc.aspx%3Fsourcedoc%3D%257BE7BF6B3A-F323-4076-968B-30A1565F2373%257D%26file%3Dquery-index.xlsx%26action%3Ddefault%26mobileredirect%3Dtrue&hlx_p.limit=1" --print Hhb

adobe/helix-content-proxy#55

multisheet support

if a workbook / sheets has several helix-* sheets return all of them.

  • if no sheet is specified and the workbook has helix-* sheets, return all of the helix-* sheets
  • unless it has a helix-default then only this one is returned
  • if a workbook has no helix-* sheets, just return the first sheet (as before)
  • sheet(s) can be selected with one or several sheet= query parameter

Examples

select 1 sheet

...?sheet=one

{
    "limit": 10,
    "offset": 5,
    "data": [...]
}

select multiple sheets

...?sheet=one&sheet=two&limit=10&offset=5&limit=20&offset=0

or

...?sheet=one&sheet=two&limit.one=10&offset.one=5&limit.two=20&offset.two=0

{
  "type": "multi-sheet",
  "names": ["helix-one", "helix-two"],
  "helix-one": {
    "limit": 10,
    "offset": 5,
    "data": [...]
  },
  "helix-two": {
    "limit": 20,
    "offset": 0,
    "data": [...]
  }
}

Data Discrepancy between Excel and API

Screen Shot 2020-08-11 at 21 15 30

$ curl "https://adobeioruntime.net/api/v1/web/helix-pages/helix-services/data-embed@v1?src=https:%2F%2Fadobe.sharepoint.com%2F:x:%2Fs%2FTheBlog%2FEaikC7JAUNpAoZy604FUP7YB2oTdxZVtrWNihbJ6k-BCnA?email%3D%26e%3DNd2OF4"
[
  {
    "Destination": "https://blog.adobe.com/en/topics/covid-19.html",
    "Notes": "",
    "Source": "/tag/coronavirus/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/adobe-culture.html",
    "Notes": "",
    "Source": "/tag/inside-adobe/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/xd-updates.html",
    "Notes": "No such tag in the taxonomy - these need to be user facing for us to redirect to them",
    "Source": "/tag/xd-updates/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/adobe-summit.html",
    "Notes": "go to Adobe Summit topic page",
    "Source": "/series/adobe-summit-2020-sneaks/ "
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/covid-19.html",
    "Notes": "go to COVID-19 topic tag page  ",
    "Source": "/series/covid-19/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/data--privacy.html",
    "Notes": "go to Content Authenticity topic page",
    "Source": "/series/content-authenticity-initiative/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/customer-stories.html",
    "Notes": "go to Customer Stories topic/menu page - this is internal tag",
    "Source": "/series/behind-the-transformation/ "
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/education.html",
    "Notes": "go to Education topic page",
    "Source": "/series/doing-good-with-design-education/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "Go to Partner Stories topic page - this is internal tag so cant redirect",
    "Source": "/series/adobe-partners/ "
  },
  {
    "Destination": "https://blog.adobe.com/",
    "Notes": "go to Events topic/menu page - no such tag in taxonomy",
    "Source": "/series/imagine-2019/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "go to Digital Transformation topic/menu page",
    "Source": "/series/adobechat/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "go to Digital Transformation topic/menu page",
    "Source": "/series/experience-business-overheard-in-the-c-suite/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "Go to Digital Transformation topic/menu page",
    "Source": "/series/experience-business-now-do-it-at-scale/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-transformation.html",
    "Notes": "go to Digital Transformation topic/menu page",
    "Source": "/series/the-data-rush-how-to-strike-it-rich/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/events.html",
    "Notes": "go to Events topic/menu page",
    "Source": "/series/cannes-lions/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/creativity.html",
    "Notes": "go to Visual Trends topic page - this is internal so cant redirect",
    "Source": "/series/2018-visual-trends/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/stock.html",
    "Notes": "go to Adobe Stock topic page",
    "Source": "/series/adobe-stock-contributors/ "
  },
  {
    "Destination": "https://blog.adobe.com/",
    "Notes": "go to Document Cloud product topic page - dont havea product landing pages in new system",
    "Source": "/series/sales-essentials/"
  },
  {
    "Destination": "https://blog.adobe.com/en/topics/digital-literacy.html",
    "Notes": "",
    "Source": "/series/digital-literacy/"
  },
  {
    "Destination": "https://blog.adobe.com/feed.xml",
    "Notes": "RSS Feed URL",
    "Source": "/feed"
  }
]

i.e. only the first 20 entries are included.

fetching data from sharepoint is very slow

Description
the requests to fetch the query index from sharepoint are very slow:

image

It takes about 4-5 seconds for each request:

  • getting the worksheets
  • checking for tables
  • returning the used range

Pass offset and limit query parameters to MS Graph API

The current implementation of data-embed for excel sheets fetches the usedRange of a sheet, which includes not only all values, but also:

  • all cell formulas
  • all cell values as text
  • all cell value types

and more (see Range for a list of properties implicitly returned as result). The complete response size for the current index definition amounts to 7MB.

It then converts the values into JSON and finally filters them for the rows requested by offset and limit.

Instead, one could compute the requested range and limit the result returned as follows:

  1. Get only the address of the usedRange, e.g. A1-I6200
  2. Compute the address of the requested rows, e.g. A3073-I3328 (for offset=3072&limit=256)
  3. Get the column names (using syntax /worksheets/('sheet')/range(address='A1:I1')?$select=values)
  4. Get the column values (using syntax /worksheets/('sheet')/range(address='A3073:I3328')?$select=values)
  5. Create the expected JSON output

migrate to helix-deploy

see adobe/project-helix#508

  • migrate code to use helix-deploy adapter
  • update CI to use helix-deploy

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.