How do I copy PDF pages and make them into a new PDF?



TaiLiu
2022-09-20, 11:34 PM
For those computer-inclined, this question probably has an obvious answer. I'm not, so here I am.

I have multiple PDFs. I want to copy (not extract/remove) pages from them and then put them together. Is there a tool that'll let me do that? CLI tools preferred, but I'll take GUI, too.

Thank you! :smallsmile:


I used ripgrep-all on a directory and now have a list of pages with the words "foo bar". Ideally I'd like for the PDF copier-put-togetherer tool to work from my search results. That probably won't happen, so I'm happy to do it manually.

Whoracle
2022-09-21, 02:45 AM
IIRC you're on an Arch based Linux distribution, so here we go:

Bash script to search for STRING in DIRECTORY, copying all pages that contain said string and merging them together:


#!/usr/bin/env bash
# usage:
# pass search string as first argument
# pass search path as second argument
# pass output file as third argument
#
# ex: bash myscript.sh "My String" pdffiles/ merged.pdf

# get everything to extract as file:page lines,
# dropping repeated hits on the same page
mylist=$(pdfgrep -Hnr "$1" "$2" | cut -d: -f1-2 | sort -u)

# make a folder to hold the extracts
tmpdir=pdfextract
mkdir -p "$tmpdir"

# create var with output filename for better readability
mergedfile=$3


# iterate over the results line by line, so paths with spaces stay intact
while IFS= read -r line; do
    # skip the empty line we get when nothing matched
    [ -n "$line" ] || continue

    # get the file path to extract stuff from
    infile=$(echo "$line" | cut -d: -f1)

    # get the pagenumber
    page=$(echo "$line" | cut -d: -f2)

    # generate names for the outfiles in format filename_pagenumber.pdf
    filename=$(basename -- "$infile")
    extension="${filename##*.}"
    filename="${filename%.*}"
    tmpfile="${filename}_${page}.${extension}"

    # extract pagenumber from infile, storing it in tmpdir/tmpfile
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
        -dFirstPage="$page" -dLastPage="$page" \
        -sOutputFile="$tmpdir/$tmpfile" "$infile" < /dev/null > /dev/null 2>&1
done <<< "$mylist"

echo "Found the following pages:"
ls "$tmpdir"

# merge the results
echo # insert a newline
echo # insert a newline
read -p "Do you want to merge these files? " -n 1 -r
echo # insert a newline
if [[ $REPLY =~ ^[Yy]$ ]]
then
    # merge the extracted pages together
    gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile="$mergedfile" -dBATCH "$tmpdir"/* > /dev/null 2>&1
    echo "Merged file written to $mergedfile"
fi

# clean up after us
read -p "Do you want to delete tmpfiles? " -n 1 -r
echo # insert a newline
if [[ $REPLY =~ ^[Yy]$ ]]
then
    rm -f "$tmpdir"/*
    rmdir "$tmpdir"
fi


ex: bash myscript.sh MYSTRING DIRECTORY OUTFILE

Dependencies:
- ghostscript (most likely installed on your system)
- pdfgrep (on arch-based systems that's in the community repo - install with pacman -S pdfgrep)

Script will
- recursively search for MYSTRING in DIRECTORY with pdfgrep
- store the filepath<->pagenumber bit of the output in a list
- create a tmpdir to hold copied pages
- iterate over the list, copying pages out to the tmpdir in the format originalfilename_pagenumber.pdf
- ask you if you want to merge these (in alphabetical order) into OUTFILE
- ask you if you want to remove the created tmpdir and its contents
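
One thing to keep in mind about the alphabetical merge order: a name like report_10.pdf sorts before report_2.pdf, so pages from the same source file can end up out of order in OUTFILE. Below is a minimal sketch of one way around that, assuming zero-padded page numbers in the temp file names are acceptable. The function name tmpname_for and the four-digit width are illustrative choices and not part of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn one "path/to/file.pdf:PAGE" line (the format the
# script gets from pdfgrep -Hn piped through cut) into a temp file name whose
# page number is zero-padded, so that a shell glob lists pages in page order.
tmpname_for() {
    local line=$1
    local infile=${line%:*}        # everything before the last colon
    local page=${line##*:}         # everything after the last colon
    local filename extension

    filename=$(basename -- "$infile")
    extension=${filename##*.}
    filename=${filename%.*}

    # 10# forces base 10, so a page written as "08" is not read as octal
    printf '%s_%04d.%s\n' "$filename" "$((10#$page))" "$extension"
}

tmpname_for "pdffiles/report.pdf:2"     # prints report_0002.pdf
tmpname_for "pdffiles/report.pdf:10"    # prints report_0010.pdf
```

With names like these, the $tmpdir/* glob in the merge step would hand the pages of each file to gs in ascending page order.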

Telok
2022-09-21, 10:06 AM
I've used PDF Shuffler (http://pdfshuffler.sourceforge.net/) to reorganize and put PDFs together. But you'll lose some metadata and any JavaScript behind calculating forms.

TaiLiu
2022-09-22, 11:45 PM
Oh, wow, thank you! It must've taken a lot of time to put the script together. I only understand the broadest strokes of the script, even with your comments, but this is great. :smallsmile:

The last line turned out to be missing a quotation mark, but it's simple to just add it. But for some reason the PDF it generates just has empty pages. That's okay, though. I'll figure it out. Thanks so much again!


Oh, thank you! I'll look it up. Not interested in the metadata and it's not a form, luckily. :smallsmile:

Whoracle
2022-09-23, 01:44 AM
You're welcome, but it didn't take that long. That's like bash scripting 102 or maybe 103, and I've been using Linux professionally for two decades now. It was maybe 15 minutes including making it look nice-ish, and I was bored anyways :)


As for the empty pages: If you say "N" on the last prompt (or comment out the last few lines, after "# clean up after us") you can see the extracted pages in a newly created folder named "pdfextract". If those are empty, something went wrong with the extraction; if not, then it was the merge.

Depending on what it is you'll need to play around with the gs options - package is named "ghostscript" for easier googling.
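
To narrow that down, here is a rough sketch of a check over the extracted files. check_extracts is a made-up name, and a non-zero byte count only shows that gs wrote something, not that the page has visible content. Since the script sends all gs output to /dev/null, it may also help to rerun one of the gs lines by hand without the "> /dev/null 2>&1" part and read what it prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every extracted file in a directory with its size
# in bytes and flag files that were left at zero bytes.
check_extracts() {
    local dir=${1:-pdfextract}     # default matches tmpdir in the script
    local f size

    for f in "$dir"/*; do
        # an unmatched glob stays literal, which means the folder is empty
        [ -e "$f" ] || { echo "no extracted pages in $dir"; return 1; }
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -eq 0 ]; then
            echo "EMPTY  $f"
        else
            echo "ok     $f ($size bytes)"
        fi
    done
}

# usage: check_extracts pdfextract
```

If every file shows up as "ok" but still looks blank in a viewer, the problem is more likely in the source PDFs or the gs options than in the merge.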

I tested it with a bunch of my PDFs, but those were all created pretty sanely by myself in LibreOffice, so maybe it's something with the source files. If you can and want to, you can always zip me a few example files and the search string and I'll have a look. If so, let me know here (not via DMs - I tend to overlook those) and I'll fix you an upload directory on my server.

Otherwise: Have fun figuring it out - that'll help you do stuff like this in the future, and IMHO it's always worth trying stuff for yourself. The hard part usually is finding out how to get started :smallbiggrin:

TaiLiu
2022-09-24, 02:37 PM
This would've taken me hours. I would never even have thought of using a shell script. I just use the command line for basic operations.


Oh, thank you! This is helpful. I'm excited to start learning. I'll probably take a look at a book on Bash or something. :smallsmile: