PDF to odt/docx conversion has me weeping!

Maroon@lemmy.world · 4 months ago

PDF to odt/docx conversion has me weeping!

fossilesque@mander.xyz · 4 months ago

StirlingPDF does this. I’ll DM you the one I host for my writing group.

anamethatisnt@sopuli.xyz · 4 months ago

Would an alternative be to simply edit the pdfs?

The German software FlexiPDF still lets you buy a yearly version for a one-off sum, and it offers a free trial (with a watermark) so you can check whether it works well enough for you before you buy.
https://www.softmaker.com/en/products/flexipdf

Botzo@lemmy.world · 4 months ago

https://pdf2docx.readthedocs.io/ seems to fit the bill. I can’t vouch for it.
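
For anyone who wants to try it, here is a minimal sketch of how pdf2docx is typically driven from Python. It assumes the package is installed (`pip install pdf2docx`) and uses its `Converter` class; the helper names `docx_path_for` and `convert_pdf_to_docx` are made up for this example, and how well the layout survives depends entirely on the PDF.

```python
from pathlib import Path


def docx_path_for(pdf_path):
    """Hypothetical helper: put the .docx next to the source PDF."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf_to_docx(pdf_path, docx_path=None, start=0, end=None):
    """Convert pages [start, end) of a PDF to .docx; end=None means the last page."""
    # Imported lazily so the path helper above works without the package installed.
    from pdf2docx import Converter  # third-party: pip install pdf2docx

    target = Path(docx_path) if docx_path else docx_path_for(pdf_path)
    converter = Converter(str(pdf_path))
    try:
        converter.convert(str(target), start=start, end=end)
    finally:
        # Always release the underlying PDF handle, even if conversion fails.
        converter.close()
    return target
```

Calling `convert_pdf_to_docx("report.pdf")` would write `report.docx` alongside the input, where `report.pdf` is a placeholder file name.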

PDF is such a curse. I say this as a person currently tasked with deploying new mysteriously complex enterprise PDF conversion software for technical documents. The rabbit hole is so deep.

observantTrapezium@lemmy.ca · 4 months ago

It’s a curse because it’s used for things other than what it’s intended for. It does a good job representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.

Botzo@lemmy.world · 4 months ago

This is probably my first time ever using it for an appropriate purpose as this team’s technical docs are destined for the press (and digital distribution). They just have no idea how to software, so I was brought in to build bridges between and ultimately simplify all their tools.

mesa@piefed.social · 4 months ago

As a dev: the reason PDF is so strange is that it’s a compound format. It can be just images strung together. It can also be pure text with fonts, etc.

If you open the file as a text file, you can see this. It’s many different formats in a trenchcoat.
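
A rough way to see the trenchcoat without scrolling through the raw file is to count a few well-known PDF name tokens in the bytes. This is only a sketch: the marker list is my own pick, and objects packed into compressed object streams (common in newer PDFs) will not show up in a plain byte scan, so low counts do not prove absence.

```python
import re
from collections import Counter

# Name tokens that hint at what kind of content a PDF carries.
# The selection is illustrative, not exhaustive.
MARKERS = {
    "fonts": rb"/Type\s*/Font\b",
    "images": rb"/Subtype\s*/Image\b",
    "JPEG streams (DCTDecode)": rb"/DCTDecode\b",
    "deflate streams (FlateDecode)": rb"/FlateDecode\b",
    "embedded files": rb"/EmbeddedFile\b",
}


def pdf_version(data):
    """Return the version from the '%PDF-x.y' header, or None if it is missing."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


def summarize_pdf_bytes(data):
    """Count occurrences of each marker in the raw, undecoded file bytes."""
    counts = Counter()
    for label, pattern in MARKERS.items():
        counts[label] = len(re.findall(pattern, data))
    return counts
```

Feed it the raw bytes, e.g. `summarize_pdf_bytes(open("some.pdf", "rb").read())`, where `some.pdf` is a placeholder. A scan-only PDF tends to show images and no fonts; a text PDF shows the reverse.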

Botzo@lemmy.world · 4 months ago

Yeah, also a dev here. I’d be so happy if they’d parted ways with the 90s legacy bits at some point. Just glad there are enough parsing libraries that I’ll never need to care (right? Please tell me I’m right!).

JASN_DE@feddit.org · 4 months ago

I haven’t tested that part of it yet, but the self-hostable StirlingPDF offers conversion from PDF to a number of formats.

The rest I use it for works fine, so maybe that could be an option.

observantTrapezium@lemmy.ca · 4 months ago

I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.

What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
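
Roughly, that pipeline can be sketched like this. The tool choice is an assumption on my part (Poppler’s `pdftoppm` for rasterizing and the `tesseract` CLI for OCR, both of which need to be installed and on the PATH), and the function names are made up for this example.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, out_prefix, dpi=300):
    """Build the pdftoppm call that renders each page to <out_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_cmd(image, lang="eng"):
    """Build the tesseract call; tesseract appends '.txt' to the output base."""
    output_base = Path(image).with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pdf(pdf, workdir, dpi=300, lang="eng"):
    """Render every page to PNG, OCR each image, and return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page", dpi), check=True)

    pages = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_cmd(image, lang), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Higher DPI usually helps OCR accuracy at the cost of speed, and you lose all layout information, so this only makes sense when the text is what you are after.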

fossilesque@mander.xyz · 4 months ago

StirlingPDF is basically one size fits all.

observantTrapezium@lemmy.ca · 4 months ago

Interesting, I’ll keep it in mind next time I have to deal with this problem (hopefully never but who knows).

A few years ago I was in contact with researchers who were developing an AI tool to parse PDFs (I think they cared about extracting data rather than converting to editable formats). From their material I got the impression that it’s extremely difficult to do right using traditional algorithms.

fossilesque@mander.xyz · 4 months ago

https://news.ycombinator.com/item?id=44287043

cmnybo@discuss.tchncs.de · 4 months ago

The only real solution is to always keep your source files. PDFs are not intended to be edited.

whimsy@lemmy.zip · 4 months ago

Maybe LibreOffice Draw can help you out? It has PDF editing capabilities.

bizdelnick@lemmy.ml · 4 months ago

There’s no solution. It is impossible to convert from PDF to any editable format correctly. The exception is a “hybrid PDF” that has an embedded editable document. If you need to edit PDFs that you created yourself, store them in hybrid format.

grue@lemmy.world · 4 months ago

Your question is like asking how to convert a cake into flour, sugar, milk, butter and (unbroken!) eggs.
