If your main text did not extract well from the PDF, and the main text comes out as “gobble-di-gook”, it very likely won’t extract any better if you remove the “watermark” from the original PDF file before applying pdftotext.
If your “watermark” text looks like gobble-di-gook in output.txt, then either the font type (Type 3) or the font encoding (custom) used for the watermark text does not allow easy text extraction, or a valid “ToUnicode” map is missing for the font used.
Anyway, without seeing the file it’s impossible to say whether a watermark can be removed, because it depends on how the watermark was applied. There are at least 3 different ways that I can think of off-hand, and for 2 of those I could remove the watermark afterwards. There is unlikely to be a ‘watermark layer’ in the PDF file.
Without seeing the file, there’s really very little I can offer. The watermark may be applied as an annotation, in which case you could choose not to render the annotation. Otherwise it may be applied as text in the body of the document, in which case it can be removed by decompressing the content stream and replacing the text with white space. Or it may be done by writing vector or image data into the content stream, in which case it would be possible but harder to pull the same trick. There is no way to tell without looking at the file. Or at least a similar one.
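To make the second case concrete, here is a minimal sketch of the “decompress and replace with white space” trick, written in Python using only the standard library. It assumes the page content stream is Flate-compressed and that the watermark is drawn as a single literal string operand, e.g. `(Non-negotiable) Tj`. The function name `blank_text_in_stream` is made up for illustration. It will not match text stored as hex strings, split across a `TJ` array, or written with a custom font encoding, and a real file would also need the stream’s `/Length` entry and the cross-reference offsets updated afterwards.

```python
import zlib


def blank_text_in_stream(compressed: bytes, watermark: bytes) -> bytes:
    """Overwrite a literal watermark string in a Flate-compressed stream.

    Hypothetical helper: decompresses the stream, replaces every literal
    string operand equal to `watermark` with the same number of spaces,
    and recompresses the result.
    """
    content = zlib.decompress(compressed)
    # A literal string operand looks like "(text)". Keep the parentheses and
    # blank only the characters between them so the operators stay valid.
    target = b"(" + watermark + b")"
    blanked = b"(" + b" " * len(watermark) + b")"
    return zlib.compress(content.replace(target, blanked))


# Toy content stream with a watermark and one line of ordinary text.
raw = (
    b"BT /F1 48 Tf 100 400 Td (Non-negotiable) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (Pay to the order of) Tj ET"
)
cleaned = zlib.decompress(blank_text_in_stream(zlib.compress(raw), b"Non-negotiable"))
```

Padding with spaces keeps the decompressed stream the same length, but the compressed length will usually change, which is why the surrounding PDF structure has to be fixed up as well.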
The fact that “the watermark only appears on the first page” isn’t a bug, it’s exactly what this code was written to do. I see no attempt to modify the code to add the watermark to every PDF page in C#. Be honest about the situation and what effort you’ve put in to change it, even if “None”.
To be honest, this sounds like trying to circumvent copyright. Obviously I can’t tell because I haven’t seen your original PDF file, but watermarks are often applied to ‘demo’ or paid-for PDF files.
So the problem must be either in the loop or in the writing. Can anyone help me figure it out?
I don’t want the TIFF to have the watermark that is on every page of the PDF. Is there an option to ignore the watermark layer when writing out to a TIFF?
I thought that it was in fact adding it during the loop. I understand now from your comment that the loop was only building the PDF, and not adding the watermark. That wasn’t clear to me just from looking at the code.
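For anyone hitting the same confusion, the difference comes down to whether the merge happens inside the loop or once before it. The sketch below is in Python rather than C#, and since the original code isn’t shown, the method names `merge_page` and `add_page` follow pypdf and are an assumption. The helper `stamp_every_page` is hypothetical.

```python
def stamp_every_page(pages, watermark_page, writer):
    """Merge the watermark onto each page, then add that page to the writer.

    `pages` is any iterable of page objects with a merge_page() method and
    `writer` is any object with an add_page() method (pypdf-style, assumed).
    Returns the number of pages processed.
    """
    count = 0
    for page in pages:
        # The stamp must happen inside the loop. Merging once before the
        # loop watermarks only the single page that was merged.
        page.merge_page(watermark_page)
        writer.add_page(page)
        count += 1
    return count


# With pypdf installed, a call might look like this (file names are made up):
#   from pypdf import PdfReader, PdfWriter
#   reader = PdfReader("input.pdf")
#   stamp = PdfReader("watermark.pdf").pages[0]
#   writer = PdfWriter()
#   stamp_every_page(reader.pages, stamp, writer)
#   with open("watermarked.pdf", "wb") as fh:
#       writer.write(fh)
```

If the watermark only shows up on page one, check whether the merge call sits above the loop and the loop body only adds pages to the writer.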
Now, if your “watermark” is indeed also text, that watermark string will also appear in your output text for each page. If it is unwanted, it should be easy to remove that string from the text (and replace it by nothing).
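A minimal sketch of that clean-up step, in Python, assuming the watermark comes out as one known string (here “Non-negotiable”) and that pages are separated by form feed characters, as pdftotext does by default. The function name `strip_watermark` is made up for illustration.

```python
def strip_watermark(text: str, watermark: str) -> str:
    """Remove a known watermark string from extracted text.

    Lines that contained nothing but the watermark are dropped entirely.
    A form feed (page break) on such a line is kept.
    """
    kept = []
    for line in text.split("\n"):
        if watermark in line:
            line = line.replace(watermark, "")
            # "\f" is deliberately not stripped here, so page breaks survive.
            if not line.strip(" \t\r"):
                continue  # the line held only the watermark
        kept.append(line)
    return "\n".join(kept)


sample = "Invoice 42\nNon-negotiable\nTotal: 10\n\fNon-negotiable\nPage two\n"
result = strip_watermark(sample, "Non-negotiable")
```

This only works when the watermark extracts as clean, contiguous text. If it comes out interleaved with the body text or as garbage characters, a plain string replacement will not find it.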
The PDF has an embedded font problem that causes incorrect text extraction. I am trying to convert the PDF to TIFF, then use an OCR tool to create a searchable PDF, and then extract the text. The watermark is not a copyright notice, it’s a diagonal “Non-negotiable” notice which causes text extraction problems, so I’d prefer to remove it before extraction.
Now, if your “watermark” is not text, but some kind of image or vector graphic, it will not be part of your output.txt.
If you post a link to the original PDF file, I can take a look at it.
I am working with the book ‘Automate the Boring Stuff with C#’ and I am trying to run the code to watermark a .pdf on all pages, but the watermark only appears on the first page.