Elasticsearch full-text PDF search

Expertrec Marketing

Apr 29, 2023


How to create a PDF full-text search engine using Elasticsearch

Ingest Attachment Processor Plugin

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.

The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then.
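
As a rough illustration of the base64 requirement, the sketch below defines a pipeline with a single attachment processor and sends one PDF through it. It uses only the Python standard library. The pipeline name attachment, the field name data, the index name samplepdf, and the localhost URL are assumptions made for the example, not values required by the plugin.

```python
import base64
import json
import os
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication
PIPELINE_ID = "attachment"        # hypothetical pipeline name
INDEX = "samplepdf"               # hypothetical index name


def build_pipeline():
    """Pipeline with one attachment processor that reads the `data` field."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": "data"}}],
    }


def build_document(pdf_bytes, name):
    """Base64-encode the raw PDF bytes, as required when the request is JSON."""
    return {"name": name, "data": base64.b64encode(pdf_bytes).decode("ascii")}


def put_json(url, body):
    """Send a JSON body with HTTP PUT and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def index_pdf(path, doc_id):
    """Create (or overwrite) the pipeline, then index one PDF through it."""
    put_json(f"{ES_URL}/_ingest/pipeline/{PIPELINE_ID}", build_pipeline())
    with open(path, "rb") as handle:
        document = build_document(handle.read(), os.path.basename(path))
    return put_json(f"{ES_URL}/{INDEX}/_doc/{doc_id}?pipeline={PIPELINE_ID}", document)
```

Calling index_pdf("report.pdf", "1") against a running cluster with the plugin installed should store the document with the extracted text under attachment.content, next to the original base64 data field.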

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install ingest-attachment

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
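
One way to confirm that no node was missed is to compare the node list with the output of the cat plugins API. The sketch below is a minimal check under the assumption of an unauthenticated cluster at localhost:9200; the function names are made up for this example.

```python
import json
import urllib.request


def nodes_missing_plugin(node_names, cat_plugins_rows, plugin="ingest-attachment"):
    """Return the sorted names of nodes that do not report the given plugin."""
    have = {row["name"] for row in cat_plugins_rows if row.get("component") == plugin}
    return sorted(set(node_names) - have)


def fetch_json(url):
    """GET a URL and decode the JSON response."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def check_cluster(es_url="http://localhost:9200"):
    """Ask the cluster for its nodes and plugins, then list nodes lacking the plugin."""
    nodes = [row["name"] for row in fetch_json(f"{es_url}/_cat/nodes?format=json&h=name")]
    plugins = fetch_json(f"{es_url}/_cat/plugins?format=json")
    return nodes_missing_plugin(nodes, plugins)
```

An empty list from check_cluster() means every node reports the plugin; any names returned are nodes that still need the install and restart.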

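For an offline install, the plugin can be downloaded from https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.5.0.zip. The plugin version (7.5.0 in this URL) must match the version of your Elasticsearch installation.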

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove ingest-attachment

Example

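The following Python script reads each PDF in a directory, extracts its text with PyPDF2, and posts the text to Elasticsearch. Because the text is extracted on the client, this approach does not need the ingest attachment plugin.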

import json
import os

import PyPDF2
import requests


class ElasticModel:
    # Document sent to Elasticsearch: the PDF file name and its extracted text.
    name = ""
    msg = ""

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__,
                          sort_keys=True, indent=4)


def __readPDF__(path):
    # Open the PDF in binary mode; the with block closes the file when done.
    with open(path, 'rb') as pdfFileObj:
        # PdfReader, .pages and extract_text() are the current PyPDF2 names.
        # Older releases called them PdfFileReader, getPage() and extractText().
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        # number of pages in the PDF
        print(len(pdfReader.pages))
        # Extract the text of every page so the whole document is searchable.
        pages = [pageObj.extract_text() or "" for pageObj in pdfReader.pages]
    # Join the pages and turn line breaks into spaces so words stay separated.
    line = " ".join(pages).replace("\n", " ")
    print(line)
    return line


def __prepareElasticModel__(line, name):
    eModel = ElasticModel()
    eModel.name = name
    eModel.msg = line
    return eModel


def __sendToElasticSearch__(elasticModel):
    print("Name : " + elasticModel.name)

    ############################################
    ####  CHANGE INDEX NAME IF NEEDED
    ############################################
    index = "samplepdf"

    url = "http://localhost:9200/" + index + "/_doc?pretty"
    data = elasticModel.toJSON()
    # Encode explicitly as UTF-8 so non-ASCII text from the PDF is sent intact.
    response = requests.post(url, data=data.encode("utf-8"), headers={
        'Content-Type': 'application/json'
    })
    print("Url : " + url)
    print("Data : " + str(data))
    print("Response : " + str(response.status_code) + " " + response.text)


#################################
# Change pdf dir path
#################################
pdfdir = "C:/Users/muthali/Desktop/TemplatesPDF/SamplePdf"

for file in os.listdir(pdfdir):
    # Skip anything in the directory that is not a PDF.
    if not file.lower().endswith(".pdf"):
        continue
    path = pdfdir + "/" + file
    print(path)

    line = __readPDF__(path)
    eModel = __prepareElasticModel__(line, file)
    __sendToElasticSearch__(eModel)
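
Once documents are indexed this way, a full-text query can be run against the msg field. The sketch below is one possible way to do that with a match query, assuming the samplepdf index and the name and msg fields from the script above; the function names and the localhost URL are illustrative only.

```python
import json
import urllib.request


def build_search(text, size=10):
    """Match query on the extracted text, with highlighted snippets requested."""
    return {
        "size": size,
        "query": {"match": {"msg": text}},
        "highlight": {"fields": {"msg": {}}},
    }


def summarize_hits(response):
    """Return (file name, relevance score) pairs from a search response."""
    return [(hit["_source"]["name"], hit["_score"]) for hit in response["hits"]["hits"]]


def search(text, es_url="http://localhost:9200", index="samplepdf"):
    """POST the query to the _search endpoint and summarize the hits."""
    request = urllib.request.Request(
        f"{es_url}/{index}/_search",
        data=json.dumps(build_search(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return summarize_hits(json.loads(response.read().decode("utf-8")))
```

For example, search("invoice") would return the names of the PDFs whose text matches "invoice", ordered by relevance score.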

No-code PDF search engine using Expertrec

If you want to skip all the coding, you can create a PDF search engine using Expertrec.

How do I index a PDF as an Elasticsearch index?

Follow these steps to index a PDF file as an Elasticsearch index:

Install the ingest attachment plugin (optional): Elasticsearch does not parse PDF files on its own. If you want Elasticsearch to do the extraction, install the ingest attachment plugin described above on every node and send the file as base64-encoded data. If you extract the text yourself, as in the Python example above, you can skip this step.

Extract the text from the PDF file: When you are not using the plugin, convert the PDF to plain text before indexing it. Open-source libraries such as Apache PDFBox, Apache Tika, and Poppler can do this.

Create an Elasticsearch index: Create an index to store the extracted text. You can do this from Kibana or with the Elasticsearch REST API.

Index the PDF file: Finally, index the document by sending a PUT or POST request to the index with the extracted text in the request body.
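
The last two steps can be sketched in Python as follows. This is a minimal example that assumes Elasticsearch 7 or later (no mapping types), an unauthenticated cluster at localhost:9200, and the name and msg fields used earlier in this article; the helper names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication


def build_index_body():
    """Mapping for the two fields used in this article: file name and extracted text."""
    return {
        "mappings": {
            "properties": {
                "name": {"type": "keyword"},
                "msg": {"type": "text"},
            }
        }
    }


def build_requests(index, doc_id, name, text):
    """Return (method, url, body) triples: create the index, then index one document."""
    return [
        ("PUT", f"{ES_URL}/{index}", build_index_body()),
        ("PUT", f"{ES_URL}/{index}/_doc/{doc_id}", {"name": name, "msg": text}),
    ]


def send(method, url, body):
    """Send one JSON request; urlopen raises HTTPError on a 4xx or 5xx response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run(index, doc_id, name, text):
    """Execute both requests in order and return the responses."""
    return [send(*step) for step in build_requests(index, doc_id, name, text)]
```

Calling run("samplepdf", "1", "example.pdf", extracted_text) creates the index and stores one document. Creating an index that already exists is rejected by Elasticsearch, so on later runs only the second request is needed.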

FAQs

How do you create an indexed link in a PDF?

Use PDF editing software that supports bookmarks and hyperlinks to create an indexed link in a PDF. The steps are as follows:

In your PDF editing software, open the PDF file.
Select the text or object that should become the link, using either the text selection tool or the object selection tool.
Right-click the selected text or object and select “Create Hyperlink” or “Create Link” from the context menu.
Depending on the kind of link you want to make, select “Go to a named position” or “Go to a page view” in the “Create Link” or “Create Hyperlink” dialog box.
If you select “Go to a page view”, choose the page the link should point to. You can also set the zoom level and select a particular region of the page to show.
If you select “Go to a named position”, you must first create a bookmark. To do this, select the text or object you want to bookmark, right-click, and pick “Add Bookmark”. Give the bookmark a name and click OK.
In the “Create Link” or “Create Hyperlink” dialog box, pick “Go to a named position” and select the bookmark you just created. To save the link, click OK.
