Extract PDF images text using custom LLM model and placing image doc in proper order of pdf #1049

Navanit-git · 2025-02-25T06:21:11Z

Hi,
I am working on extracting the text and images from a PDF.
This is the PDF I am using: link

I used the code below:

# Import paths as of early 2025; they may differ in other docling versions.
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
)

IMAGE_RESOLUTION_SCALE = 2.0

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# Set the language on the EasyOCR options directly: assigning a new
# ocr_options object later would discard a language set beforehand.
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=["es"])
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.CUDA
)

This runs on my 4 GB GPU.

In parallel, I am also using a VLM for the images:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)


messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "image_link",
            },
            {"type": "text", "text": "Extract the text from the image"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow device_map="auto" rather than hard-coding "cuda"
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

This takes around 16 GB of GPU memory. Is there a way to combine the two, so that the image details appear in place of the image placeholders in the Markdown file?
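A rough back-of-the-envelope estimate helps explain the 16 GB figure. The sketch below assumes about 7 billion parameters (inferred from the "7B" in the model name, not measured) and counts the weights only; activations, the KV cache, and image inputs add to this. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: ~7e9 parameters, inferred from the "7B" in the model name.
NUM_PARAMS = 7e9

# Compare a few common weight precisions (bytes per parameter).
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = weight_memory_gib(NUM_PARAMS, bytes_per_param)
    print(f"{label} weights: ~{gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone come to about 13 GiB, which is in line with the roughly 16 GB observed once runtime overhead is included. It also suggests that a much smaller vision model would be needed to share a 4 GB GPU with the OCR pipeline.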

Also, for page three of the PDF, the Markdown output looks like this:


## Step 2: Verification and Authentication.

- A Confirm the mobile number displayed or enter a mobile phone number to receive a 6-digit code, a one-time passcode for authentication. Select Continue to proceed
- If a mobile number is on file; it will be pre-populated in the field
- You can choose to change the mobile number to receive a 6-digit code
- If a mobile device is not available, select Continue for the 6-digit code to be sent via email (Option B)
- B If you do not have a mobile phone number; we can send the 6-digit passcode to your email address
- If there is an email address on file; it will pre'populate in the field
- You can choose to change the email address to receive the 6-digit code
- Use pre-populated email or input your email address and select Continue to proceed
- Select Return to verify by phone if you wish to g0 back and receive the 6-digit code via a mobile device
- C Enter the 6-digit code received either via mobile phone or email and select Continue
- Note: do not mistake the sender's phone number in the text message for the 6-digit code
- If you did not receive the 6-digit code using a mobile number or email on file, then you must enter your residential zip code as an additional authentication step

A

<!-- image -->

B.

<!-- image -->

C.

<!-- image -->

<!-- image -->

So how can I make image A appear directly below the text for A (and likewise for B and C)? And instead of the image placeholder, I want the image description from the VLM, without overusing the GPU.

dolfim-ibm · 2025-02-25T07:00:07Z

Maintainer

I think what you are doing is what the picture description option allows. See https://ds4sd.github.io/docling/examples/pictures_description/. It lets you define the vision model you prefer.
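For readers following along, here is a minimal sketch of how that option might be wired up, loosely based on the linked example. The function names `build_converter`, `picture_descriptions`, and `main` are hypothetical, the model id and PDF path are placeholders, and the docling import paths and the `kind == "description"` check are assumptions that may differ between versions.

```python
def build_converter(repo_id: str, prompt: str):
    """Create a DocumentConverter that asks a VLM to describe each picture.

    docling is imported locally so this module loads without it installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        PictureDescriptionVlmOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.images_scale = 2.0
    pipeline_options.do_picture_description = True
    pipeline_options.picture_description_options = PictureDescriptionVlmOptions(
        repo_id=repo_id, prompt=prompt
    )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def picture_descriptions(doc) -> list[str]:
    """Return one description string per picture, in the order docling stores them.

    Pictures without a description annotation yield an empty string.
    """
    results = []
    for picture in doc.pictures:
        texts = [
            annotation.text
            for annotation in picture.annotations
            # Assumption: description annotations carry kind == "description".
            if getattr(annotation, "kind", None) == "description"
        ]
        results.append(" ".join(texts))
    return results


def main(pdf_path: str) -> None:
    """Convert one PDF and print the generated picture descriptions."""
    converter = build_converter(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",  # placeholder model id
        prompt="Extract the text from the image.",
    )
    doc = converter.convert(pdf_path).document
    for text in picture_descriptions(doc):
        print(text)
```

Calling `main("document.pdf")` (a hypothetical path) would run the conversion. A small vision model is used here only because it is more likely to fit next to OCR on a small GPU; any model supported by the option could be substituted.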

5 replies

Navanit-git Feb 25, 2025
Author

Thanks, I was able to do so.
But how do I save these image descriptions in the Markdown file in place of the image placeholders?

Navanit-git Feb 25, 2025
Author

https://ds4sd.github.io/docling/reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown

I can see this, but the image description is not included in its output.

dolfim-ibm Feb 25, 2025
Maintainer

This is part of the serialization of the picture data, which will be available in upcoming releases.
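Until that serialization is available, one possible stop-gap is to post-process the exported Markdown yourself. The sketch below is not a docling feature: `fill_image_placeholders` is a hypothetical helper, and it assumes the export uses the `<!-- image -->` placeholder shown earlier in this thread and that you have one description per placeholder, in the same order.

```python
PLACEHOLDER = "<!-- image -->"


def fill_image_placeholders(markdown: str, descriptions: list[str]) -> str:
    """Replace each image placeholder with the matching description, in order.

    An empty description leaves the original placeholder untouched.
    """
    parts = markdown.split(PLACEHOLDER)
    n_placeholders = len(parts) - 1
    if n_placeholders != len(descriptions):
        raise ValueError(
            f"found {n_placeholders} placeholders "
            f"but got {len(descriptions)} descriptions"
        )
    out = [parts[0]]
    for description, tail in zip(descriptions, parts[1:]):
        out.append(description if description else PLACEHOLDER)
        out.append(tail)
    return "".join(out)


# Made-up sample mirroring the layout of the output quoted in the question.
sample = "A\n\n<!-- image -->\n\nB.\n\n<!-- image -->\n"
print(fill_image_placeholders(sample, ["Phone number form", "Email form"]))
```

The count check is deliberate: if the number of descriptions and placeholders ever disagree, failing loudly is safer than silently attaching a description to the wrong image.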

Navanit-git Feb 25, 2025
Author

Thank you for the prompt response, @dolfim-ibm. Is there a timeline for when this will be released, or might it take a while?

Navanit-git Feb 25, 2025
Author

Also, I was not able to point it at a local path where I have downloaded the VLM. That feature was not available, so I have raised a PR: #1051
