As you may have seen, I have posted about this before, asking for suggestions on which software I can use to convert PDF to docx/odt.

I am a teacher. During my time as a researcher I wrote a lot of documents, and I regularly draw upon them to teach my students. I often have to take the text, modify it, or build upon it. A lot of my material is bound up in PDFs. Sometimes I have grant applications to write where a previous draft of mine was stored as a PDF. Converting these PDFs to text has become the bane of my life.

I am forced to use online tools because none of the software I have seems to do the trick. A lot of people keep suggesting pandoc, but pandoc cannot read PDF as input; PDF can only be an output format.

Is there a magic open source solution that I have missed?

  • Botzo@lemmy.world
    12 days ago

    https://pdf2docx.readthedocs.io/ seems to fit the bill. I can’t vouch for it.
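For anyone who wants to try it, here is a minimal sketch of how pdf2docx is typically driven from Python, assuming the package is installed (`pip install pdf2docx`). The helper names `docx_path_for` and `convert_pdf_to_docx` are made up for illustration; only `Converter`, `convert`, and `close` come from the library. Results will vary a lot with how the PDF was produced.

```python
from pathlib import Path


def docx_path_for(pdf_path: str) -> Path:
    """Return the output path: same folder and name, with a .docx suffix."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf_to_docx(pdf_path: str) -> Path:
    """Convert one PDF to DOCX with pdf2docx and return the new file's path."""
    # Imported lazily so the path helper above works without the package.
    from pdf2docx import Converter

    out_path = docx_path_for(pdf_path)
    converter = Converter(pdf_path)
    try:
        # With no page arguments, every page is converted.
        converter.convert(str(out_path))
    finally:
        converter.close()
    return out_path
```

Calling `convert_pdf_to_docx("draft.pdf")` should leave a `draft.docx` next to the original. Scanned (image-only) PDFs will not yield editable text this way.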

    PDF is such a curse. I say this as a person currently tasked with deploying new mysteriously complex enterprise PDF conversion software for technical documents. The rabbit hole is so deep.

    • observantTrapezium@lemmy.ca
      12 days ago

      It’s a curse because it’s used for things other than what it’s intended for. It does a good job of representing printed material, but unfortunately people very commonly expect it to be something more akin to a word processor file.

      • Botzo@lemmy.world
        11 days ago

        This is probably my first time ever using it for an appropriate purpose as this team’s technical docs are destined for the press (and digital distribution). They just have no idea how to software, so I was brought in to build bridges between and ultimately simplify all their tools.

    • mesa@piefed.social
      12 days ago

      Speaking as a dev, the reason PDF is so strange is that it’s a compound format. It can be just images strung together. It can also be pure text with fonts, etc.

      If you open the file as a text file, you can see this. It’s many different formats in a trenchcoat.
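A rough way to see this without reading the raw file by eye is to count a few well-known PDF markers in the bytes. This is only a sketch: the function name `summarize_pdf_bytes` is made up, and because newer PDFs often keep objects inside compressed streams, the counts are lower bounds rather than a real parse.

```python
import re
from pathlib import Path


def summarize_pdf_bytes(data: bytes) -> dict:
    """Count a few raw markers that hint at what a PDF is made of."""
    return {
        # Every PDF starts with a "%PDF-x.y" header.
        "is_pdf": data.startswith(b"%PDF-"),
        # Font dictionaries suggest real, extractable text.
        "font_refs": len(re.findall(rb"/Type\s*/Font\b", data)),
        # Image XObjects suggest scanned pages or embedded pictures.
        "image_refs": len(re.findall(rb"/Subtype\s*/Image\b", data)),
        # Streams hold page content, images, fonts, and more.
        "streams": data.count(b"endstream"),
    }


def summarize_pdf_file(path: str) -> dict:
    """Read a file from disk and summarize its raw markers."""
    return summarize_pdf_bytes(Path(path).read_bytes())
```

A file with many image references and no font references is probably a scan, which is the case where plain text extraction gives nothing useful.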

      • Botzo@lemmy.world
        11 days ago

        Yeah, also a dev here. I’d be so happy if they’d parted ways with the 90s legacy bits at some point. Just glad there are enough parsing libraries that I’ll never need to care (right? Please tell me I’m right!).

  • JASN_DE@feddit.org
    12 days ago

    I haven’t tested that part of it yet, but the self-hostable StirlingPDF offers conversion from PDF to a number of formats.

    The rest I use it for works fine, so maybe that could be an option.

  • observantTrapezium@lemmy.ca
    12 days ago

    I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.

    What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
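The images-then-OCR route can be sketched roughly like this. It is not the exact setup from that project; it assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed and on the PATH, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, prefix, dpi=300):
    """Build the command that renders each PDF page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the OCR command; tesseract appends .txt to the output base."""
    out_base = Path(image_path).with_suffix("")
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Render pages to images, OCR each one, and return the joined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, work / "page"), check=True)

    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work.glob("page-*.png")):
        subprocess.run(tesseract_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Something like `ocr_pdf("scan.pdf", "ocr_work")` returns plain text only; layout, headings, and tables are lost, so expect to clean up the output by hand.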

  • whimsy@lemmy.zip
    12 days ago

    Maybe LibreOffice Draw can help you out? It has PDF editing capabilities.

  • bizdelnick@lemmy.ml
    11 days ago

    There is no such solution. It is impossible to convert a PDF to an editable format correctly. The exception is a “hybrid PDF”, which has an editable document embedded in it. If you need to edit PDFs that you created yourself, store them in hybrid format.