When rebuilding the search index on a site that I recently moved from IIS to Azure, I

Index PDF files on Azure about c1-cms-foundation HOT 4 OPEN

slacto commented on August 17, 2024

Index PDF files on Azure

from c1-cms-foundation.

Comments (4)

burningice2866 commented on August 17, 2024

You can use this class as a drop-in solution to index PDF files. It uses PdfSharp and PdfSharpTextExtractor:

// Also requires using directives for the Composite C1, PdfSharp and PdfSharpTextExtractor namespaces.
using System;
using System.Linq;
using System.Text;

public class PdfContentSearchExtension : ISearchDocumentBuilderExtension
{
    public void Populate(SearchDocumentBuilder searchDocumentBuilder, IData data)
    {
        // Only media files are candidates for PDF indexing.
        if (!(data is IMediaFile mediaFile))
        {
            return;
        }

        // Skip documents that already have text and a URL from another extension.
        if (searchDocumentBuilder.TextParts.Any() && !String.IsNullOrEmpty(searchDocumentBuilder.Url))
        {
            return;
        }

        var mimeType = MimeTypeInfo.GetCanonical(mediaFile.MimeType);
        if (!IsIndexableMimeType(mimeType))
        {
            return;
        }

        var text = GetText(mediaFile);
        if (String.IsNullOrWhiteSpace(text))
        {
            return;
        }

        searchDocumentBuilder.TextParts.Add(text);
        searchDocumentBuilder.Url = MediaUrls.BuildUrl(mediaFile, UrlKind.Internal);

        Log.LogInformation("PdfContentSearchExtension", $"{mediaFile.FileName} indexed successfully");
    }

    // Extracts the text of every page, one page per block of lines.
    private static string GetText(IMediaFile mediaFile)
    {
        var sb = new StringBuilder();

        // Dispose the media stream explicitly as well as the PDF document.
        using (var stream = mediaFile.GetReadStream())
        using (var pdfDocument = PdfReader.Open(stream, PdfDocumentOpenMode.ReadOnly))
        {
            var extractor = new Extractor(pdfDocument);
            foreach (var page in pdfDocument.Pages)
            {
                extractor.ExtractText(page, sb);

                sb.AppendLine();
            }
        }

        return sb.ToString();
    }

    private static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType == "application/pdf";
    }
}

Just register it in your startup handler like this:

public static void ConfigureServices(IServiceCollection serviceCollection)
{
    serviceCollection.AddSingleton<ISearchDocumentBuilderExtension>(new PdfContentSearchExtension());

    Log.LogInformation("Searching", "PdfContentSearchExtension registered");
}

slacto commented on August 17, 2024

Fantastic... Thanks, it works!

It sounds like Orckestra.Search.MediaContentIndexing cannot be used at all on Azure. Is that correct?

burningice2866 commented on August 17, 2024

Since all indexing in MediaContentIndexing relies on the IFilter interface, it seems safe to assume that it can't do any indexing on Azure.

The code above was actually written for a regular Windows server. I believe IFilter has been deprecated for many years, not only on Azure.

That leaves docx and other non-PDF file types unindexed and unsearchable unless you write custom code for them too.

burningice2866 commented on August 17, 2024

It should be fairly easy, though, to replace PdfSharpTextExtractor with TikaOnDotnet.TextExtractor, a library that on paper supports a wide variety of formats:

https://kevm.github.io/tikaondotnet/
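
A rough sketch of what that swap might look like, assuming the TikaOnDotnet.TextExtractor NuGet package is installed. The TikaTextHelper class and its method names are made up for illustration, and the MIME type list is only an example. None of this has been tested on Azure.

```csharp
using System.Collections.Generic;
using System.IO;
using TikaOnDotNet.TextExtraction;

// Hypothetical helper, not part of C1 CMS or TikaOnDotnet.
public static class TikaTextHelper
{
    // Example list only; extend it with whatever formats you need indexed.
    private static readonly HashSet<string> IndexableMimeTypes = new HashSet<string>
    {
        "application/pdf",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    };

    public static bool IsIndexableMimeType(string mimeType)
    {
        return mimeType != null && IndexableMimeTypes.Contains(mimeType);
    }

    // Buffers the stream into memory and lets Tika detect the format and extract the text.
    public static string GetText(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);

            var result = new TextExtractor().Extract(buffer.ToArray());
            return result.Text;
        }
    }
}
```

In the class above, GetText(IMediaFile) could then open mediaFile.GetReadStream() in a using block and pass it to TikaTextHelper.GetText, and IsIndexableMimeType could delegate to the helper. Keep in mind that this buffers the whole file in memory, so very large media files may need a different approach.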
