I probably will not have time to look at it within the next 24 hours. no rush at all, i have working solutions and this isn't an urgent project for me.
Could you try to use a code structure similar to the one in the binary (cli/Cli.hs#printText) in your program? Namely something like:
    import qualified Data.Text.IO as T

    f <- openFile "file.pdf"
    case f of
      Just doc -> do
        txt <- pdftotextIO Physical doc
        T.putStrLn txt
      _ -> putStrLn "Is not a valid PDF document"
same crashes.
i stuck some debug prints after each call to Pdftotext.Internal. after succeeding on several files in the directory of PDFs i sent you, it reliably fails on the same page in the middle of the same file every time, in pageTextIO. when i remove that file, the same thing happens on a different file. after removing ~10/30 files, it reliably succeeds on all of the remainder. when i then add one of those failing files back, it fails on a different (previously succeeding) file. when i move all the failing files to a different directory and run it on that directory, they all succeed.
code attached.
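For anyone trying to reproduce this, the per-file tracing described above could look roughly like the sketch below. This is not the attached code. The names `traceFiles` and `extract` are hypothetical, and `extract` is only a stand-in for whatever call actually runs the text extraction on one file. Files are sorted so every run visits them in the same order, which matters here because the failure seems to depend on which files were processed earlier.

    import Control.Exception (SomeException, try)
    import Control.Monad (forM_)
    import Data.List (sort)
    import System.Directory (listDirectory)
    import System.FilePath (takeExtension, (</>))
    import System.IO (hFlush, hPutStrLn, stderr)

    -- Run a per-file action (hypothetical stand-in for the real extraction)
    -- over every .pdf in a directory, logging before and after each file.
    traceFiles :: (FilePath -> IO ()) -> FilePath -> IO ()
    traceFiles extract dir = do
      names <- listDirectory dir
      -- Sort so the processing order is identical on every run.
      let pdfs = [dir </> n | n <- sort names, takeExtension n == ".pdf"]
      forM_ pdfs $ \path -> do
        logLine ("start " ++ path)
        -- 'try' only catches Haskell exceptions; a crash inside foreign
        -- code would still kill the process right after the "start" line.
        r <- try (extract path) :: IO (Either SomeException ())
        case r of
          Left e  -> logLine ("error " ++ path ++ ": " ++ show e)
          Right _ -> logLine ("done  " ++ path)
      where
        -- Flush after every line so the log is complete up to the crash.
        logLine s = hPutStrLn stderr s >> hFlush stderr

    main :: IO ()
    main = traceFiles (\p -> putStrLn ("would extract " ++ p)) "."

If the process dies, the last "start" line without a matching "done" or "error" line names the file being processed at that moment.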
Duplicated, see #4.