pip install pypdfium2以下函数以PDF作为输入,并将PDF的每一页作为图像列表返回。
def convert_pdf_to_images(file_path, scale=300/72): // 堆代码 duidaima.com pdf_file = pdfium.PdfDocument(file_path) page_indices = [i for i in range(len(pdf_file))] renderer = pdf_file.render( pdfium.PdfBitmap.to_pil, page_indices = page_indices, scale = scale, ) final_images = [] for i, image in zip(page_indices, renderer): image_byte_array = BytesIO() image.save(image_byte_array, format='jpeg', optimize=True) image_byte_array = image_byte_array.getvalue() final_images.append(dict({i:image_byte_array})) return final_images现在,我们可以使用display_images函数来可视化PDF文件的所有页面。
def display_images(list_dict_final_images): all_images = [list(data.values())[0] for data in list_dict_final_images] for index, image_bytes in enumerate(all_images): image = Image.open(BytesIO(image_bytes)) figure = plt.figure(figsize = (image.width / 100, image.height / 100)) plt.title(f"----- Page Number {index+1} -----") plt.imshow(image) plt.axis("off") plt.show()通过组合上述两个函数,我们可以得到以下结果:
convert_pdf_to_images = convert_pdf_to_images('Experimentation_file.pdf') display_images(convert_pdf_to_images)
pip install pytesseract以下的辅助函数使用了 Pytesseract 的 image_to_string() 函数从输入图像中提取文本。
from pytesseract import image_to_string def extract_text_with_pytesseract(list_dict_final_images): image_list = [list(data.values())[0] for data in list_dict_final_images] image_content = [] for index, image_bytes in enumerate(image_list): image = Image.open(BytesIO(image_bytes)) raw_text = str(image_to_string(image)) image_content.append(raw_text) return "\n".join(image_content)可以使用 extract_text_with_pytesseract 函数提取文本,如下所示:
text_with_pytesseract = extract_text_with_pytesseract(convert_pdf_to_images) print(text_with_pytesseract)成功执行以上代码将生成以下结果:
This document provides a quick summary of some of Zoumana’s article on Medium. It can be considered as the compilation of his 80+ articles about Data Science, Machine Learning and Machine Learning Operations. ... Pytesseract was able to extract the content of the image. Here is how it managed to do it! Pytesseract starts by identifying rectangular shapes within the input image from top-right to bottom-right. Then it extracts the content of the individual images, and the final result is the concatenation of those extracted content. This approach works perfectly when dealing with column-based PDFs and image documents. ...Pytesseract 首先通过从图像的右上角到右下角识别矩形形状。然后它提取各个图像的内容,最终的结果是这些提取内容的串联。这种方法在处理基于列的 PDF 和图像文档时效果非常好。
!pip install opencv-python-headless==4.1.2.30根据您的操作系统,安装 Pytorch 模块的方法可能不同。但所有的说明都可以在官方页面上找到。现在我们来安装 easyOCR 库:
!pip install easyocr在使用 easyOCR 时,因为它支持多语言,所以在处理文档时需要指定语言。通过其 Reader 模块设置语言,指定语言列表。例如,fr 用于法语,en 用于英语。语言的详细列表在此处可用。
from easyocr import Reader # Load model for the English language language_reader = Reader(["en"])文本提取过程在extract_text_with_easyocr 函数中实现:
def extract_text_with_easyocr(list_dict_final_images): image_list = [list(data.values())[0] for data in list_dict_final_images] image_content = [] for index, image_bytes in enumerate(image_list): image = Image.open(BytesIO(image_bytes)) raw_text = language_reader.readtext(image) raw_text = " ".join([res[1] for res in raw_text]) image_content.append(raw_text) return "\n".join(image_content)我们可以如下执行上述函数:
text_with_easy_ocr = extract_text_with_easyocr(convert_pdf_to_images) print(text_with_easy_ocr)
图2.easyOCR 的结果
!pip install PyPDF2提取逻辑实现在 extract_text_with_pyPDF 函数中:
def extract_text_with_pyPDF(PDF_File): pdf_reader = PdfReader(PDF_File) raw_text = '' for i, page in enumerate(pdf_reader.pages): text = page.extract_text() if text: raw_text += text return raw_text text_with_pyPDF = extract_text_with_pyPDF("Experimentation_file.pdf") print(text_with_pyPDF)
!pip install langchain(1) 从图像中提取文本
from langchain.document_loaders.image import UnstructuredImageLoader以下是提取文本的函数:
def extract_text_with_langchain_image(list_dict_final_images): image_list = [list(data.values())[0] for data in list_dict_final_images] image_content = [] for index, image_bytes in enumerate(image_list): image = Image.open(BytesIO(image_bytes)) loader = UnstructuredImageLoader(image) data = loader.load() raw_text = data[index].page_content image_content.append(raw_text) return "\n".join(image_content)现在,我们可以提取内容:
text_with_langchain_image = extract_text_with_langchain_image(convert_pdf_to_images) print(text_with_langchain_image)
from langchain.document_loaders import UnstructuredFileLoader def extract_text_with_langchain_pdf(pdf_file): loader = UnstructuredFileLoader(pdf_file) documents = loader.load() pdf_pages_content = '\n'.join(doc.page_content for doc in documents) return pdf_pages_content text_with_langchain_files = extract_text_with_langchain_pdf("Experimentation_file.pdf") print(text_with_langchain_files)类似于 PyPDF 模块,langchain 模块能够生成准确的结果,同时保持原始字体大小。