
helix-theblog-importer's Introduction

Helix Service

Helix TheBlog importer downloads the content associated with the provided URL (a blog post entry) and creates a markdown version stored in OneDrive.

  • check that the url is not already part of the urls list, an XLSX file stored in OneDrive (/importer/urls.xlsx)
  • download the url content
  • parse the DOM, remove undesired blocks, extract the author, post, products and topics
  • transform the various snippets into markdown
  • upload to OneDrive
  • update the urls list
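
A minimal sketch of that flow, assuming hypothetical helpers (isInUrlsList, fetchHtml, extractEntities, toMarkdown, uploadToOneDrive and addToUrlsList are illustrative names, not the importer's actual API):

// Hypothetical sketch of the import flow above; every helper is an
// illustrative stand-in, not part of the importer's real code.
async function importPost(url) {
  if (await isInUrlsList(url)) return;             // already imported, skip
  const html = await fetchHtml(url);               // download the blog post
  const { author, post, products, topics } = extractEntities(html); // parse and clean the DOM
  const markdown = toMarkdown({ author, post, products, topics });  // convert the snippets
  await uploadToOneDrive(url, markdown);           // store the markdown
  await addToUrlsList(url);                        // record the url as imported
}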

The importer cannot be called directly but is invoked by the scanner.

Options

  • FASTLY_SERVICE_ID: Service ID for "theblog"
  • FASTLY_TOKEN: a Fastly API Token

If you don't provide FASTLY_SERVICE_ID and FASTLY_TOKEN, then no redirects will be created for imported blog posts.


Setup

Installation

Deploy the action:

npm run deploy

Required env variables:

Connection to OneDrive:

  • AZURE_ONEDRIVE_CLIENT_ID
  • AZURE_ONEDRIVE_CLIENT_SECRET
  • AZURE_ONEDRIVE_REFRESH_TOKEN

Blob storage credentials (store images):

  • AZURE_BLOB_URI
  • AZURE_BLOB_SAS

OneDrive shared folder that contains the /importer/urls.xlsx file:

  • AZURE_ONEDRIVE_ADMIN_LINK

OneDrive shared folder that is the destination of the markdown files:

  • AZURE_ONEDRIVE_CONTENT_LINK

OpenWhisk credentials to invoke the helix-theblog-importer action:

  • OPENWHISK_API_KEY
  • OPENWHISK_API_HOST

Coralogix credentials for logging:

  • CORALOGIX_API_KEY
  • CORALOGIX_LOG_LEVEL

Fastly credentials to store keys in dictionary (url shortcuts mapping):

  • FASTLY_SERVICE_ID
  • FASTLY_TOKEN

Development

Deploying Helix Service

Deploying Helix Service requires the wsk command line client, authenticated to a namespace of your choice. For Project Helix, we use the helix namespace.

All commits to master that pass the tests are deployed automatically. All commits to branches that pass the tests are deployed as /helix-theblog/helix-theblog-importer@ci<num> and tagged with the CI build number.

helix-theblog-importer's People

Contributors

dominique-pfister, greenkeeper[bot], kptdobe, renovate[bot], rofe, semantic-release-bot, trieloff, tripodsan


helix-theblog-importer's Issues

Change import destination to en/drafts/migrated

With the new folder structure, the new import destination needs to be (temporarily) changed from en/archive to en/drafts/migrated with yyyy/mm/dd subfolders.

PS: the final import destination will be en/publish directly
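
A hedged sketch of how such a destination path could be computed from the post date (the function name and the date source are assumptions, not the importer's actual code):

// Illustrative only: build en/drafts/migrated/yyyy/mm/dd from a post date.
function buildImportDestination(postDate) {
  const yyyy = postDate.getFullYear();
  const mm = String(postDate.getMonth() + 1).padStart(2, '0');
  const dd = String(postDate.getDate()).padStart(2, '0');
  return `en/drafts/migrated/${yyyy}/${mm}/${dd}`;
}

// buildImportDestination(new Date(2019, 3, 1)) => 'en/drafts/migrated/2019/04/01'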

Detect and fix incomplete URLs

Articles can contain incomplete links (e.g. missing the scheme prefix) and will therefore get rendered as relative links. I think if a URL starts with www, it is safe to assume that https:// should be added.

An example has been posted here: adobe/theblog#314
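
A minimal sketch of that heuristic (the function name is an illustration, not the importer's actual code):

// Illustrative heuristic only: prefix scheme-less links with https://
function fixIncompleteUrl(href) {
  if (!href) return href;
  if (/^\/\//.test(href)) return `https:${href}`;     // protocol-relative link
  if (/^www\./i.test(href)) return `https://${href}`; // missing scheme
  return href;
}

// fixIncompleteUrl('www.adobe.com/creativecloud.html') => 'https://www.adobe.com/creativecloud.html'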

Map topics and products to categories and filtered product list

/admin/importer/mappings.xlsx contains two tables:

  • a mapping between topic names found on theblog and new category names
  • a mapping between product names found on theblog and filtered / cleaned up product names

This allows topics and products to be imported and cleaned up on the fly. The importer should consume that data and map accordingly during import.
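
A hedged sketch of that lookup, assuming the two tables have been read into plain objects (the variable and function names are illustrative, not the importer's real API):

// Illustrative only: map raw topic / product names through the mappings tables.
// topicMappings and productMappings stand for the two tables in
// /admin/importer/mappings.xlsx, loaded as { 'name on theblog': 'new name' }.
function mapTopics(rawTopics, topicMappings) {
  return rawTopics
    .map((t) => topicMappings[t])
    .filter((t) => !!t); // drop topics that have no category mapping
}

function mapProducts(rawProducts, productMappings) {
  return rawProducts
    .map((p) => productMappings[p])
    .filter((p) => !!p); // keep only products present in the filtered list
}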

Decode file names before saving as md

Files like en/publish/2019/04/01/k%d3%93rcher-cleans-up-software-licensing-with-an-adobe-etla-mobilizing-for-growth.md must be saved as en/publish/2019/04/01/kärcher-cleans-up-software-licensing-with-an-adobe-etla-mobilizing-for-growth.md otherwise the file is not found during rendering (404).
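
A minimal sketch of that fix (the helper name is an illustration, not the importer's actual code):

// Illustrative only: percent-decode the file name before saving the markdown.
function decodeFileName(name) {
  try {
    return decodeURIComponent(name);
  } catch (e) {
    // malformed escape sequence: keep the original name
    return name;
  }
}

// decodeFileName('caf%C3%A9-article.md') => 'café-article.md'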

Fix import identified issues

Two types of issues:

Pure import process issues
Errors resulting from import edge cases:

  • the blob storage image copy fails if the image URL is a redirect (x-ms-copy-source does not support redirects)
  • some images have weird / unsupported urls; these need to be examined case by case

Unsupported md data structures

Non-exhaustive list (WIP).

Cannot re-import an entry

Remove a line from the urls.xlsx file and run the import for the removed url. The import happens, but it fails in the last step:

[ERROR] Duplicate dictionary_item: 'dictionary_id=3iRICZS9xdH9gSxfkWupUT item_key=/10-must-haves-for-your-data-toolkit service_id=6v0sHgrPTGUGS5PHOXZ0H1'
[ERROR] error { params: { url: 'https://theblog.adobe.com/10-must-haves-for-your-data-toolkit/', force: true }, error: { FastlyError: Duplicate dictionary_item: 'dictionary_id=3iRICZS9xdH9gSxfkWupUT item_key=/10-must-haves-for-your-data-toolkit service_id=6v0sHgrPTGUGS5PHOXZ0H1'
       at request.then (/Users/acapt/work/dev/helix/theblog/helix-theblog-importer/node_modules/@adobe/fastly-native-promises/src/httpclient.js:107:17)
       at process._tickCallback (internal/process/next_tick.js:68:7) data: { msg: 'Duplicate record', detail: 'Duplicate dictionary_item: \'dictionary_id=3iRICZS9xdH9gSxfkWupUT item_key=/10-must-haves-for-your-data-toolkit service_id=6v0sHgrPTGUGS5PHOXZ0H1\'' }, status: 409, code: 'Duplicate record', name: 'FastlyError' } }

We should be able to re-import.
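
One possible approach, sketched against Fastly's public dictionary-item API with node-fetch (an assumption about how the fix could look, not the importer's actual code): treat the redirect entry as an upsert instead of a create, so re-running the import does not trip over the existing key.

const fetch = require('node-fetch');

// Illustrative only: PUT on a dictionary item upserts it in the Fastly API,
// so re-importing an already-registered shortcut no longer fails with a 409.
async function upsertRedirect(serviceId, dictionaryId, key, value, token) {
  const res = await fetch(
    `https://api.fastly.com/service/${serviceId}/dictionary/${dictionaryId}/item/${encodeURIComponent(key)}`,
    {
      method: 'PUT',
      headers: {
        'Fastly-Key': token,
        'Content-Type': 'application/x-www-form-urlencoded',
      },
      body: `item_value=${encodeURIComponent(value)}`,
    },
  );
  if (!res.ok) {
    throw new Error(`Fastly upsert failed: ${res.status} ${await res.text()}`);
  }
  return res.json();
}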

Action required: Greenkeeper could not be activated 🚨


☝️ Important announcement: Greenkeeper will be saying goodbye 👋 and passing the torch to Snyk on June 3rd, 2020! Find out how to migrate to Snyk and more at greenkeeper.io


🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet.
We recommend using:

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.

superfluous updates to product md files

The importer seems to be modifying product (and topic) .md files in SharePoint a lot (see the SharePoint modification history screenshot from 2020-03-23).

I think we should avoid that, unless we really need to change something.
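
A hedged sketch of one way to avoid that, assuming hypothetical getExistingContent and upload helpers for the OneDrive/SharePoint destination (not the importer's real API):

// Illustrative only: skip the upload when the generated markdown is identical
// to what is already stored, so product / topic files are not touched needlessly.
async function uploadIfChanged(path, markdown, { getExistingContent, upload }) {
  const existing = await getExistingContent(path); // assume null if the file does not exist
  if (existing !== null && existing.trim() === markdown.trim()) {
    return false; // nothing changed, leave the file alone
  }
  await upload(path, markdown);
  return true;
}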

Topics: only store the leaves

Recent changes store the topics including the whole parent tree (all based on the mappings document). We should change that to store only the leaves.
We will still be able to create queries based on the taxonomy document to find topics / categories based on their parents. Likewise, to display the full tree at the bottom of the posts (list of topics), we can use the taxonomy document to display everything if needed.
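
A minimal sketch of that change, assuming each mapped topic comes back as a parent chain ordered root to leaf (the input shape is an assumption, not the mappings document's actual format):

// Illustrative only: keep just the leaf of each mapped topic chain and
// de-duplicate, instead of storing the whole parent tree.
function leafTopics(topicChains) {
  const leaves = topicChains
    .filter((chain) => chain.length > 0)
    .map((chain) => chain[chain.length - 1]);
  return [...new Set(leaves)];
}

// leafTopics([['Parent Category', 'Leaf Topic']]) => ['Leaf Topic']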

cc @rofe

Importer recreates topics that have been moved to .docx

The importer checks whether a file (topics, products, authors...) already exists so as not to override it, but it only checks for the .md version. When a file requires modification, authors "migrate" the file to .docx and delete the .md version, which then comes back on the next import...
The importer should check for both the .md and the .docx version of the file.
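
A hedged sketch of that check, assuming a hypothetical fileExists helper against the OneDrive destination (not the importer's real API):

// Illustrative only: consider an entity already present if either the .md
// or the .docx version of the file exists in the destination folder.
async function entityAlreadyExists(basePath, fileExists) {
  return (await fileExists(`${basePath}.md`)) || (await fileExists(`${basePath}.docx`));
}

// usage: if (!(await entityAlreadyExists('en/topics/some-topic', fileExists))) { /* create it */ }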

Trends Research should be Trends & Research

According to the link in the top navigation, the category's name is "Trends & Research", not "Trends Research". This error stems from the mapping document where the ampersand has been omitted.

embeds not being imported properly

When importing from the original blog, we search iframes for the src:

const EMBED_PATTERNS = [{
  // w.soundcloud.com/player
  match: (node) => {
    const f = node.find('iframe');
    const src = f.attr('src');
    return src && src.match(/w.soundcloud.com\/player/gm);
  },
  extract: async (node, logger) => {
    const f = node.find('iframe');
    const src = f.attr('src');
    try {
      const html = await rp({
        uri: src,
        timeout: 60000,
        simple: false,
        headers: {
          // does not give the canonical rel without the UA.
          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
        },
      });
      if (html && html !== '') {
        const $ = cheerio.load(html);
        return $('link[rel="canonical"]').attr('href') || src;
      }
    } catch (error) {
      logger.warn(`Cannot resolve soundcloud embed ${src}`);
      return src;
    }
    return src;
  },
}, {
  // www.instagram.com
  match: (node) => node.find('.instagram-media').length > 0,
  extract: async (node) => node.find('.instagram-media').data('instgrm-permalink'),
}, {
  // www.instagram.com v2
  match: (node) => node.find('.instagram-media').length > 0,
  extract: async (node) => node.find('.instagram-media a').attr('href'),
}, {
  // twitter.com
  match: (node) => node.find('.twitter-tweet a').length > 0,
  extract: async (node) => {
    // latest <a> seems to be the link to the tweet
    const aTags = node.find('.twitter-tweet a');
    return aTags[aTags.length - 1].attribs.href;
  },
}, {
  // spark
  match: (node) => node.find('a.asp-embed-link').length > 0,
  extract: async (node) => node.find('a.asp-embed-link').attr('href'),
}, {
  // media.giphy.com
  match: (node) => {
    const img = node.find('img');
    const src = img ? img.attr('src') : null;
    return src && src.match(/media.giphy.com/gm);
  },
  extract: async (node) => {
    const img = node.find('img');
    return img.attr('src');
  },
}, {
  // fallback to iframe src
  match: (node) => {
    const f = node.find('iframe');
    return f.attr('src') || f.data('src');
  },
  extract: (node) => {
    const f = node.find('iframe');
    return f.attr('src') || f.data('src');
  },
}];

The problem is that these links don't work with our embed services; what we need is a link to the original content, not the link given in the src attribute of an <iframe>.

For example, these links cannot be embedded properly by Iframely but have been extracted during an import. You can check them at https://iframely.com/embed/:

https://embed.spotify.com/?uri=spotify4gzpq5DPGxSnKTe4SA8HAU

https://www.linkedin.com/video/embed/live/urn:li:ugcPost:6666809526391570432

https://w.soundcloud.com/player/?url=https%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F673363193&color=%23ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&show_teaser=true&visual=true

@trieloff @kptdobe do you think this should be a helix-embed issue? I was thinking maybe we can make helix-embed more robust to these kinds of links.
