I scanned a bunch of documents before I figured out that I could have them OCR'd automatically. I know that I can go back and OCR them so they are content-searchable, but I'm wondering if there is a way in Everything to give me a list of all of the PDFs that have not been OCR'd yet.
Any ideas? If not in Everything, any other tools out there?
Thanks.
Find all PDFs that are not searchable (image-based, non-OCRd)?
-
- Posts: 2
- Joined: Thu Oct 21, 2021 1:51 pm
Re: Find all PDFs that are not searchable (image-based, non-OCRd)?
In XYplorer I have a script which uses the xpdf tools to do that.
It looks like this and returns an S if a PDF is searchable.
$tool = "C:\Tools\xpdf-tools\pdftotext.exe";
$output = trim(runret("""$tool"" -simple -nopgbrk ""<cc_item>"" -", %TEMP%, 65001), <crlf>, "R");
if ($output) { return "S"; }
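For anyone without XYplorer, the same check can be sketched in Python: run pdftotext on each PDF and treat the file as searchable if any non-whitespace text comes out. This is only a rough sketch, assuming pdftotext is on the PATH; the function names are made up for illustration, and the -simple flag from the script above is left out because not every pdftotext build has it.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def extract_text_with_pdftotext(pdf_path: Path, tool: str = "pdftotext") -> str:
    """Run pdftotext and return whatever text it extracts.

    The trailing "-" tells pdftotext to write to stdout instead of a file.
    `tool` is an assumption: point it at your own pdftotext executable.
    """
    result = subprocess.run(
        [tool, "-nopgbrk", str(pdf_path), "-"],
        capture_output=True,
        check=False,  # a failed extraction simply yields empty output
    )
    # Only emptiness matters here, so undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


def is_searchable(
    pdf_path: Path,
    extract: Callable[[Path], str] = extract_text_with_pdftotext,
) -> bool:
    """A PDF counts as searchable if extraction yields non-whitespace text."""
    return bool(extract(pdf_path).strip())


def find_unsearchable(
    pdf_paths: Iterable[Path],
    extract: Callable[[Path], str] = extract_text_with_pdftotext,
) -> List[Path]:
    """Return the PDFs with no extractable text (likely image-only scans)."""
    return [path for path in pdf_paths if not is_searchable(path, extract)]
```

To scan a folder tree you could pass something like Path(r"C:\Scans").rglob("*.pdf") to find_unsearchable (the folder name is just an example). The extract parameter is only there so the check can be tried without pdftotext installed.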
Re: Find all PDFs that are not searchable (image-based, non-OCRd)?
XYplorer forum thread: https://www.xyplorer.com/xyfc/viewtopic.php?f=3&t=22803
Total Commander forum thread: https://www.ghisler.ch/board/viewtopic.php?t=73928
Everything forum thread: viewtopic.php?f=5&t=9621
Re: Find all PDFs that are not searchable (image-based, non-OCRd)?
If we assume that every readable PDF contains the letter "e", but image PDFs do not, then this search term should do ya. Seems to work on my end.
*.pdf !content:"e"
There's more to read about content indexing in Everything 1.5 Alpha to speed things up across multiple queries.
(This probably won't work in Windows 7 with no PDF iFilter. I don't know where to get a PDF iFilter in Windows 7.)
((I looked through several examples of PDF 1.3, 1.4, 1.5 and 1.6 for any catchall verb in the specification that identifies the presence of printable text, but I could find none, even staring at a hex editor. But your non-OCR'd and post-OCR'd PDFs probably each have something in common, so they might be identifiable by other signatures left behind by the authoring software. Try *.pdf ansicontent:%PDF-1.3 and *.pdf ansicontent:%PDF-1.4 to find PDF files of different format versions.))
(((You could also look at the date-created or date-modified times to determine if you created this PDF before or after you started using OCR software.)))
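If you would rather check the version signature outside Everything, here is a rough Python sketch that reads the %PDF-x.y header from the start of a file. The function names are made up for illustration. Note that the version alone says nothing about OCR; it only helps if your pre-OCR and post-OCR files happen to come from software that writes different versions. Also, unlike ansicontent:, this only looks at the first 1024 bytes.

```python
import re
from pathlib import Path
from typing import Optional

# PDF files start with a header such as "%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version_from_header(head: bytes) -> Optional[str]:
    """Return the version found in the given leading bytes, or None."""
    match = _HEADER.search(head)
    return match.group(1).decode("ascii") if match else None


def pdf_version(pdf_path: Path) -> Optional[str]:
    """Read the first bytes of a file and return its PDF version, if any."""
    with open(pdf_path, "rb") as handle:
        head = handle.read(1024)  # the header is expected near the start
    return pdf_version_from_header(head)
```

Grouping the results of pdf_version over your scan folder would show whether the two batches actually differ by version before you rely on it.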
Last edited by raccoon on Fri Oct 22, 2021 9:02 am, edited 5 times in total.
Re: Find all PDFs that are not searchable (image-based, non-OCRd)?
raccoon wrote (Fri Oct 22, 2021 8:09 am):
If we assume that every readable PDF contains the letter "e", but image PDFs do not, then this search term should do ya. Seems to work on my end.
*.pdf !content:"e"
There's more to read about content indexing in Everything 1.5 Alpha to speed things up across multiple queries.
Works perfectly together with content indexing.
Re: Find all PDFs that are not searchable (image-based, non-OCRd)?
Thanks all! I'll give these ideas a shot.