
helix-theblog-importer's Introduction

Helix Service

Helix TheBlog importer downloads the content associated with the provided URL (a blog post entry) and creates a markdown version stored in OneDrive.

  • check that the url is not already part of the urls list, an XLSX file stored in OneDrive (/importer/urls.xlsx)
  • download the url content
  • parse the DOM, remove undesired blocks, extract the author, post, products and topics
  • transform the various snippets into markdown
  • upload to OneDrive
  • update the urls list
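
A minimal sketch of that flow, assuming hypothetical helpers (isInUrlsList, fetchHtml, extractEntities, toMarkdown, uploadToOneDrive and addToUrlsList are illustrative names, not the importer's actual API):

// Hypothetical sketch of the import flow above; every helper is an
// illustrative stand-in, not part of the importer's real code.
async function importPost(url) {
  if (await isInUrlsList(url)) return;             // already imported, skip
  const html = await fetchHtml(url);               // download the blog post
  const { author, post, products, topics } = extractEntities(html); // parse and clean the DOM
  const markdown = toMarkdown({ author, post, products, topics });  // convert the snippets
  await uploadToOneDrive(url, markdown);           // store the markdown
  await addToUrlsList(url);                        // record the url as imported
}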

The importer cannot be called directly but is invoked by the scanner.

Options

  • FASTLY_SERVICE_ID: Service ID for "theblog"
  • FASTLY_TOKEN: a Fastly API Token

If you don't provide FASTLY_SERVICE_ID and FASTLY_TOKEN, then no redirects will be created for imported blog posts.


Setup

Installation

Deploy the action:

npm run deploy

Required env variables:

Connection to OneDrive:

  • AZURE_ONEDRIVE_CLIENT_ID
  • AZURE_ONEDRIVE_CLIENT_SECRET
  • AZURE_ONEDRIVE_REFRESH_TOKEN

Blob storage credentials (store images):

  • AZURE_BLOB_URI
  • AZURE_BLOB_SAS

OneDrive shared folder that contains the /importer/urls.xlsx file:

  • AZURE_ONEDRIVE_ADMIN_LINK

OneDrive shared folder that is the destination of the markdown files:

  • AZURE_ONEDRIVE_CONTENT_LINK

OpenWhisk credentials to invoke the helix-theblog-importer action:

  • OPENWHISK_API_KEY
  • OPENWHISK_API_HOST

Coralogix credentials for logging:

  • CORALOGIX_API_KEY
  • CORALOGIX_LOG_LEVEL

Fastly credentials to store keys in dictionary (url shortcuts mapping):

  • FASTLY_SERVICE_ID
  • FASTLY_TOKEN

Development

Deploying Helix Service

Deploying Helix Service requires the wsk command line client, authenticated to a namespace of your choice. For Project Helix, we use the helix namespace.

All commits to master that pass the tests are deployed automatically. All commits to branches that pass the tests are deployed as /helix-theblog/helix-theblog-importer@ci<num> and tagged with the CI build number.

helix-theblog-importer's People

Contributors

dominique-pfister, greenkeeper[bot], kptdobe, renovate[bot], rofe, semantic-release-bot, trieloff, tripodsan


helix-theblog-importer's Issues

Change import destination to en/drafts/migrated

With the new folder structure, the new import destination needs to be (temporarily) changed from en/archive to en/drafts/migrated with yyyy/mm/dd subfolders.

PS: the final import destination will be en/publish directly
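
A hedged sketch of how such a destination path could be computed from the post date (the function name and the date source are assumptions, not the importer's actual code):

// Illustrative only: build en/drafts/migrated/yyyy/mm/dd from a post date.
function buildImportDestination(postDate) {
  const yyyy = postDate.getFullYear();
  const mm = String(postDate.getMonth() + 1).padStart(2, '0');
  const dd = String(postDate.getDate()).padStart(2, '0');
  return `en/drafts/migrated/${yyyy}/${mm}/${dd}`;
}

// buildImportDestination(new Date(2019, 3, 1)) => 'en/drafts/migrated/2019/04/01'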

Detect and fix incomplete URLs

Articles can contain incomplete links (e.g. missing the scheme prefix) and will therefore get rendered as relative links. I think if a URL starts with www, it is safe to assume that https:// should be added.

An example has been posted here: adobe/theblog#314
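
A minimal sketch of that heuristic (the function name is an illustration, not the importer's actual code):

// Illustrative heuristic only: prefix scheme-less links with https://
function fixIncompleteUrl(href) {
  if (!href) return href;
  if (/^\/\//.test(href)) return `https:${href}`;     // protocol-relative link
  if (/^www\./i.test(href)) return `https://${href}`; // missing scheme
  return href;
}

// fixIncompleteUrl('www.adobe.com/creativecloud.html') => 'https://www.adobe.com/creativecloud.html'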

Map topics and products to categories and filtered product list

/admin/importer/mappings.xlsx contains two tables:

  • a mapping between topic names found on theblog and new category names
  • a mapping between product names found on theblog and filtered / cleaned up product names

This allows topics and products to be imported and cleaned up on the fly. The importer should consume that data and map accordingly during import.
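
A hedged sketch of that lookup, assuming the two tables have been read into plain objects (the variable and function names are illustrative, not the importer's real API):

// Illustrative only: map raw topic / product names through the mappings tables.
// topicMappings and productMappings stand for the two tables in
// /admin/importer/mappings.xlsx, loaded as { 'name on theblog': 'new name' }.
function mapTopics(rawTopics, topicMappings) {
  return rawTopics
    .map((t) => topicMappings[t])
    .filter((t) => !!t); // drop topics that have no category mapping
}

function mapProducts(rawProducts, productMappings) {
  return rawProducts
    .map((p) => productMappings[p])
    .filter((p) => !!p); // keep only products present in the filtered list
}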

Decode file names before saving as md

Files like en/publish/2019/04/01/k%d3%93rcher-cleans-up-software-licensing-with-an-adobe-etla-mobilizing-for-growth.md must be saved as en/publish/2019/04/01/kärcher-cleans-up-software-licensing-with-an-adobe-etla-mobilizing-for-growth.md otherwise the file is not found during rendering (404).
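
A minimal sketch of that fix (the helper name is an illustration, not the importer's actual code):

// Illustrative only: percent-decode the file name before saving the markdown.
function decodeFileName(name) {
  try {
    return decodeURIComponent(name);
  } catch (e) {
    // malformed escape sequence: keep the original name
    return name;
  }
}

// decodeFileName('caf%C3%A9-article.md') => 'café-article.md'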

Fix import identified issues

Two types of issues:

Pure import process issues
Errors resulting from import edge cases:

  • the blob storage image copy fails if the image URL is a redirect (x-ms-copy-source does not support redirects)
  • some images have weird / unsupported urls; these need to be examined case by case

Unsupported md data structures

Non-exhaustive list (WIP).

Cannot re-import an entry

Remove a line from the urls.xlsx file and run the import for the removed url. The import happens, but it fails in the last step:

[ERROR] Duplicate dictionary_item: 'dictionary_id=3iRICZS9xdH9gSxfkWupUT item_key=/10-must-haves-for-your-data-toolkit service_id=6v0sHgrPTGUGS5PHOXZ0H1'
[ERROR] error { params: { url: 'https://theblog.adobe.com/10-must-haves-for-your-data-toolkit/', force: true }, error: { FastlyError: Duplicate dictionary_item: 'dictionary_id=3iRICZS9xdH9gSxfkWupUT item_key=/10-must-haves-for-your-data-toolkit service_id=6v0sHgrPTGUGS5PHOXZ0H1'
       at request.then (/Users/acapt/work/dev/helix/theblog/helix-theblog-importer/node_modules/@adobe/fastly-native-promises/src/httpclient.js:107:17)
       at process._tickCallback (internal/process/next_tick.js:68:7) data: { msg: 'Duplicate record', detail: 'Duplicate dictionary_item: \'dictionary_id=3iRICZS9xdH9gSxfkWupUT item_key=/10-must-haves-for-your-data-toolkit service_id=6v0sHgrPTGUGS5PHOXZ0H1\'' }, status: 409, code: 'Duplicate record', name: 'FastlyError' } }

We should be able to re-import.
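
One possible approach, sketched against Fastly's public dictionary-item API with node-fetch (an assumption about how the fix could look, not the importer's actual code): treat the redirect entry as an upsert instead of a create, so re-running the import does not trip over the existing key.

const fetch = require('node-fetch');

// Illustrative only: PUT on a dictionary item upserts it in the Fastly API,
// so re-importing an already-registered shortcut no longer fails with a 409.
async function upsertRedirect(serviceId, dictionaryId, key, value, token) {
  const res = await fetch(
    `https://api.fastly.com/service/${serviceId}/dictionary/${dictionaryId}/item/${encodeURIComponent(key)}`,
    {
      method: 'PUT',
      headers: {
        'Fastly-Key': token,
        'Content-Type': 'application/x-www-form-urlencoded',
      },
      body: `item_value=${encodeURIComponent(value)}`,
    },
  );
  if (!res.ok) {
    throw new Error(`Fastly upsert failed: ${res.status} ${await res.text()}`);
  }
  return res.json();
}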

Action required: Greenkeeper could not be activated 🚨


☝️ Important announcement: Greenkeeper will be saying goodbye 👋 and passing the torch to Snyk on June 3rd, 2020! Find out how to migrate to Snyk and more at greenkeeper.io


🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet.
We recommend using:

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.

superfluous updates to product md files

The importer seems to be modifying product (and topic) .md files in SharePoint a lot (see the SharePoint modification history screenshot from 2020-03-23).

I think we should avoid that, unless we really need to change something.
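
A hedged sketch of one way to avoid that, assuming hypothetical getExistingContent and upload helpers for the OneDrive/SharePoint destination (not the importer's real API):

// Illustrative only: skip the upload when the generated markdown is identical
// to what is already stored, so product / topic files are not touched needlessly.
async function uploadIfChanged(path, markdown, { getExistingContent, upload }) {
  const existing = await getExistingContent(path); // assume null if the file does not exist
  if (existing !== null && existing.trim() === markdown.trim()) {
    return false; // nothing changed, leave the file alone
  }
  await upload(path, markdown);
  return true;
}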

Topics: only store the leaves

Recent changes store the topics including the whole parent tree (all based on the mappings document). We should change that to store only the leaves.
We will still be able to create queries based on the taxonomy document to find topics / categories based on their parents. Likewise, to display the full tree at the bottom of the posts (list of topics), we can use the taxonomy document to display everything if needed.
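
A minimal sketch of that change, assuming each mapped topic comes back as a parent chain ordered root to leaf (the input shape is an assumption, not the mappings document's actual format):

// Illustrative only: keep just the leaf of each mapped topic chain and
// de-duplicate, instead of storing the whole parent tree.
function leafTopics(topicChains) {
  const leaves = topicChains
    .filter((chain) => chain.length > 0)
    .map((chain) => chain[chain.length - 1]);
  return [...new Set(leaves)];
}

// leafTopics([['Parent Category', 'Leaf Topic']]) => ['Leaf Topic']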

cc @rofe

Importer recreates topics that have been moved to .docx

The importer checks whether a file (topics, products, authors...) already exists so as not to override it, but it only checks for the .md version. When a file requires modification, authors "migrate" the file to .docx and delete the .md version, which then comes back on the next import...
The importer should check for both the .md and the .docx version of the file.
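
A hedged sketch of that check, assuming a hypothetical fileExists helper against the OneDrive destination (not the importer's real API):

// Illustrative only: consider an entity already present if either the .md
// or the .docx version of the file exists in the destination folder.
async function entityAlreadyExists(basePath, fileExists) {
  return (await fileExists(`${basePath}.md`)) || (await fileExists(`${basePath}.docx`));
}

// usage: if (!(await entityAlreadyExists('en/topics/some-topic', fileExists))) { /* create it */ }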

Trends Research should be Trends & Research

According to the link in the top navigation, the category's name is "Trends & Research", not "Trends Research". This error stems from the mapping document where the ampersand has been omitted.

embeds not being imported properly

When importing from the original blog, we search iframes for the src:

const EMBED_PATTERNS = [{
  // w.soundcloud.com/player
  match: (node) => {
    const f = node.find('iframe');
    const src = f.attr('src');
    return src && src.match(/w.soundcloud.com\/player/gm);
  },
  extract: async (node, logger) => {
    const f = node.find('iframe');
    const src = f.attr('src');
    try {
      const html = await rp({
        uri: src,
        timeout: 60000,
        simple: false,
        headers: {
          // does not give the canonical rel without the UA.
          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
        },
      });
      if (html && html !== '') {
        const $ = cheerio.load(html);
        return $('link[rel="canonical"]').attr('href') || src;
      }
    } catch (error) {
      logger.warn(`Cannot resolve soundcloud embed ${src}`);
      return src;
    }
    return src;
  },
}, {
  // www.instagram.com
  match: (node) => node.find('.instagram-media').length > 0,
  extract: async (node) => node.find('.instagram-media').data('instgrm-permalink'),
}, {
  // www.instagram.com v2
  match: (node) => node.find('.instagram-media').length > 0,
  extract: async (node) => node.find('.instagram-media a').attr('href'),
}, {
  // twitter.com
  match: (node) => node.find('.twitter-tweet a').length > 0,
  extract: async (node) => {
    // latest <a> seems to be the link to the tweet
    const aTags = node.find('.twitter-tweet a');
    return aTags[aTags.length - 1].attribs.href;
  },
}, {
  // spark
  match: (node) => node.find('a.asp-embed-link').length > 0,
  extract: async (node) => node.find('a.asp-embed-link').attr('href'),
}, {
  // media.giphy.com
  match: (node) => {
    const img = node.find('img');
    const src = img ? img.attr('src') : null;
    return src && src.match(/media.giphy.com/gm);
  },
  extract: async (node) => {
    const img = node.find('img');
    return img.attr('src');
  },
}, {
  // fallback to iframe src
  match: (node) => {
    const f = node.find('iframe');
    return f.attr('src') || f.data('src');
  },
  extract: (node) => {
    const f = node.find('iframe');
    return f.attr('src') || f.data('src');
  },
}];

The problem is that these links don't work with our embed services; what we need is a link to the original content, not the link given in the src attribute of an <iframe>.

For example, these links cannot be embedded properly by Iframely but have been extracted during an import. You can check them at https://iframely.com/embed/:

https://embed.spotify.com/?uri=spotify4gzpq5DPGxSnKTe4SA8HAU

https://www.linkedin.com/video/embed/live/urn:li:ugcPost:6666809526391570432

https://w.soundcloud.com/player/?url=https%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F673363193&color=%23ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false&show_teaser=true&visual=true

@trieloff @kptdobe do you think this should be a helix-embed issue? I was thinking maybe we can make helix-embed more robust to these kinds of links.
