Automation solution for PDF content validation using Java
I am doing a POC on PDF content validation, which will basically validate the content of PDF files, but I haven't found any solutions.
The solution I am looking for:
- Read the PDF file from a specific location
- Extract PDF content and maybe put it in some structured format
- Validate the actual extracted content against expected values
First of all, you mentioned a few different tags like Python and Java. You need to make it clear what language you want to use. I'd suggest using a language that you already know and/or your colleagues know, a language that's already used on your project(s), and a language that's generally used in your company. What you do in this example should be consistent with other projects and situations.
In Python, there are a few ways; one of them is PyPDF2, more precisely its extractText() method (spelled extract_text() in newer releases). Read the documentation and try it out on your example; it might not work well in all cases. It also depends on what exactly you want to check: text might be a bit more difficult than e.g. the title, number of pages, author, etc.
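As a rough illustration of the three steps from the question (read, extract, validate), here is a minimal sketch. It assumes PyPDF2 2.0 or newer, where the reader class is PdfReader and the method is extract_text(). The function names, the shape of the expected-values dictionary and its text_contains key are made up for this example, so adapt them to your project.

```python
def extract_pdf_content(path):
    """Read the PDF at `path` and return its content as a plain dict."""
    # Third-party dependency (pip install PyPDF2), imported lazily so the
    # validation part below can be used and tested without it.
    # Assumption: PyPDF2 >= 2.0 naming (PdfReader, extract_text).
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    metadata = reader.metadata  # can be None if the PDF has no info dictionary
    return {
        "pages": len(reader.pages),
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        # extract_text() may return an empty string for scanned/image pages
        "text": "\n".join(page.extract_text() or "" for page in reader.pages),
    }


def validate_content(actual, expected):
    """Compare extracted content with expected values.

    Returns a list of human-readable mismatch messages (empty list = OK).
    The hypothetical key "text_contains" holds snippets that must appear
    in the text; every other key is compared for equality.
    """
    errors = []
    # Collapse whitespace, since extracted text often has odd spacing/newlines
    normalized_text = " ".join(actual.get("text", "").split())
    for key, want in expected.items():
        if key == "text_contains":
            for snippet in want:
                if snippet not in normalized_text:
                    errors.append(f"text: missing snippet {snippet!r}")
        elif actual.get(key) != want:
            errors.append(f"{key}: expected {want!r}, got {actual.get(key)!r}")
    return errors
```

You would call it along the lines of validate_content(extract_pdf_content("report.pdf"), {"pages": 2, "text_contains": ["Total: 100 EUR"]}), where report.pdf is a placeholder path. Metadata checks such as page count or author tend to be reliable, while the text check depends heavily on how the PDF was produced.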
Selenium, however, will not be the solution here: it's a framework for testing web applications, not a tool/framework/library for reading data/text from PDF files. Some basic information can be found here on Wikipedia.