How to create a PDF full-text search engine using Elasticsearch
Ingest Attachment Processor Plugin
The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.
The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then.
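As a rough illustration of how the processor is typically wired up, the sketch below builds the two request bodies involved: an ingest pipeline containing an attachment processor, and a document whose source field holds the base64-encoded file. The pipeline name pdf_attachment, the field name data, the index name samplepdf, the file name sample.pdf, and the localhost URL are all assumptions made for this example, not values required by the plugin.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node
PIPELINE = "pdf_attachment"       # hypothetical pipeline name
FIELD = "data"                    # hypothetical source field name


def pipeline_body(field=FIELD):
    """Pipeline definition with a single attachment processor."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def document_body(raw_bytes, field=FIELD):
    """Document whose source field is the base64-encoded file content."""
    return {field: base64.b64encode(raw_bytes).decode("ascii")}


def put_json(url, body):
    """Send a JSON body with HTTP PUT and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def index_pdf(path, doc_id, index="samplepdf"):
    """Create the pipeline, then index one PDF through it.

    Needs a running node with the ingest attachment plugin installed.
    """
    put_json(f"{ES_URL}/_ingest/pipeline/{PIPELINE}", pipeline_body())
    with open(path, "rb") as pdf_file:
        doc = document_body(pdf_file.read())
    return put_json(f"{ES_URL}/{index}/_doc/{doc_id}?pipeline={PIPELINE}", doc)
```

Calling index_pdf("sample.pdf", 1) against a node with the plugin installed should store the document with the extracted text under the processor's default target field, attachment (for example attachment.content).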
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install ingest-attachment
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
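Because every node needs the plugin, it can be worth checking the cluster after the restarts. The sketch below is one possible way to do that, assuming a node reachable at http://localhost:9200: it compares the node names from the _cat/nodes API with the rows from the _cat/plugins API, which lists only nodes that have at least one plugin. The function names here are made up for the example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node


def get_json(url):
    """Fetch a URL and return the parsed JSON response."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def nodes_missing_plugin(node_names, plugin_rows, plugin="ingest-attachment"):
    """Return the sorted node names that do not report the given plugin.

    plugin_rows is the list returned by _cat/plugins?format=json, where
    each row has a node "name" and a plugin "component".
    """
    have_plugin = {
        row["name"] for row in plugin_rows if row["component"] == plugin
    }
    return sorted(set(node_names) - have_plugin)


def check_cluster():
    """Query the cluster and list the nodes still missing the plugin."""
    nodes = get_json(f"{ES_URL}/_cat/nodes?format=json&h=name")
    plugins = get_json(f"{ES_URL}/_cat/plugins?format=json")
    return nodes_missing_plugin([node["name"] for node in nodes], plugins)
```

An empty list from check_cluster() suggests that every node reports the plugin; any names returned are nodes that still need the install and restart.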
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.5.0.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove ingest-attachment
Example
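The following Python script reads each PDF in a directory, extracts the text of its first page with PyPDF2, and posts it to Elasticsearch. Note that it extracts the text on the client side and does not use the ingest attachment plugin.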
import json
import os

import PyPDF2
import requests

# Change the index name if needed.
INDEX = "samplepdf"
# Change the PDF directory path.
PDF_DIR = "C:/Users/muthali/Desktop/TemplatesPDF/SamplePdf"


class ElasticModel:
    """Document sent to Elasticsearch: the file name and the extracted text."""

    def __init__(self, name="", msg=""):
        self.name = name
        self.msg = msg

    def to_json(self):
        return json.dumps(self.__dict__, sort_keys=True, indent=4)


def read_pdf(path):
    """Return the text of the first page of the PDF at path."""
    with open(path, "rb") as pdf_file:
        # PdfReader, pages and extract_text() replace the deprecated
        # PdfFileReader, getPage() and extractText(), removed in PyPDF2 3.0.
        reader = PyPDF2.PdfReader(pdf_file)
        print(len(reader.pages))  # number of pages in the PDF
        # Only the first page is extracted.
        text = reader.pages[0].extract_text()
    text = text.replace("\n", "")
    print(text)
    return text


def send_to_elasticsearch(model):
    """Post one document to the index and print the outcome."""
    print("Name : " + model.name)
    url = "http://localhost:9200/" + INDEX + "/_doc?pretty"
    data = model.to_json()
    response = requests.post(url, data=data, headers={
        "Content-Type": "application/json",
        "Accept-Language": "en",
    })
    print("Url : " + url)
    print("Data : " + data)
    print("Response : " + str(response))


if __name__ == "__main__":
    for file_name in os.listdir(PDF_DIR):
        path = PDF_DIR + "/" + file_name
        print(path)
        text = read_pdf(path)
        send_to_elasticsearch(ElasticModel(file_name, text))
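Once documents are indexed, full-text search is a query against the field holding the extracted text. The sketch below is a minimal example under stated assumptions: it targets the samplepdf index and the name and msg fields used by the script above, assumes a node at http://localhost:9200, and uses a plain match query. The helper names are invented for this illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node
INDEX = "samplepdf"               # index used by the indexing script


def search_body(text, size=10):
    """Build a match query on the msg field (the extracted PDF text)."""
    return {"size": size, "query": {"match": {"msg": text}}}


def hit_names(response):
    """Return (file name, score) pairs from a search response."""
    return [
        (hit["_source"]["name"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]


def search(text):
    """Run the query against the index and return the matching file names."""
    request = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(search_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return hit_names(json.loads(response.read().decode("utf-8")))
```

For example, search("invoice") would return the names of the indexed PDFs whose first-page text matches the word, ordered by relevance score.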
No-code PDF search engine using expertrec
If you want to skip all the coding, you can just create a PDF search engine using expertrec.