Extract

Pdf4me Extract lets you extract pages from a PDF document. The result is a new PDF consisting of the pages extracted from the existing document. You can extract single pages or a range of pages.

Feature | Parameter | Response | Action | Description | Links
extract | Extract | ExtractRes | ExtractAction | Generates a new PDF consisting of the pages extracted from a given PDF. | swagger, sample
extractPages | pageNrs, file | file stream | | List of the pages which will be extracted. Page number 1 corresponds to the first page. | swagger, sample
extractResources | ExtractResources | ExtractResourcesRes | ExtractResourcesAction | Extracts resources, such as metadata, from a PDF document. | swagger, sample
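
The page-numbering rule (page 1 is the first page) and the comma-separated pageNrs value used in the ExtractPages samples can be sketched in Python. This is a minimal illustration and not part of any Pdf4me SDK: the function names and the "3-5" range notation are assumptions made for this example. Ranges are expanded locally, so only explicit page numbers would ever be sent.

```python
def expand_pages(spec):
    """Expand a spec such as "1,3-5" into explicit 1-based page numbers.

    The "a-b" range notation is a local convenience (an assumption of this
    sketch); only explicit page numbers would be sent to the service.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid page or range: {part!r}")
        for page in range(first, last + 1):
            if page not in pages:
                pages.append(page)
    return pages


def to_page_nrs(pages):
    """Join page numbers into the comma-separated form used for pageNrs."""
    return ",".join(str(page) for page in pages)


if __name__ == "__main__":
    print(to_page_nrs(expand_pages("1,3-5")))
```

Rejecting page 0 and reversed ranges before the request is built keeps the 1-based convention explicit on the caller's side.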

Samples

Extract

  • curl
  • C#
  • Java
  • JavaScript
  • PHP
  • Python
  • Ruby
curl
No sample available.

C#
// create extract object
Extract extract = new Extract()
{
    // document
    Document = new Document()
    {
        DocData = File.ReadAllBytes("myPdf.pdf"),
        Name = "myPdf.pdf",
    },
    // action
    ExtractAction = new ExtractAction()
    {
        // list of pages to be extracted
        ExtractPages = new System.Collections.Generic.HashSet<int>() { 1, 4 },
    }
};

// extraction
ExtractRes res = await Pdf4meClient.Pdf4me.Instance.ExtractClient.ExtractAsync(extract);

// extracting the generated PDF and writing it to disk
byte[] extractedPdf = res.Document.DocData;
File.WriteAllBytes("extractedPdf.pdf", extractedPdf);

Java
// setup the extractClient
ExtractClient extractClient = new ExtractClient(pdf4meClient);

// create extract object
Extract extract = new Extract();
// document
Document document = new Document();
document.setDocData(Files.readAllBytes(Paths.get("myPdf.pdf")));
extract.setDocument(document);
// action
ExtractAction extractAction = new ExtractAction();
extractAction.setExtractPages(Arrays.asList(1, 4));
extract.setExtractAction(extractAction);

// extraction
ExtractRes res = extractClient.extract(extract);

// extracting the generated PDF and writing it to disk
byte[] extractedPdf = res.getDocument().getDocData();
FileUtils.writeByteArrayToFile(new File("extractedPdf.pdf"), extractedPdf);

JavaScript
// setup the pdf4meClient
const pdf4meClient = pdf4me.createClient('YOUR API KEY')

// create extract object
const extractReq = {
  // document
  document: {
    docData: fs.readFileSync(path.join(__dirname, 'myPdf.pdf')).toString('base64'),
  },
  // action
  extractAction: {
    extractPages: [1, 4],
  },
}

// extraction
pdf4meClient.extract(extractReq)
  .then(function(extractRes) {
    // extracting the generated PDF and writing it to disk
    const pdfDocument = Buffer.from(extractRes.document.docData, 'base64')
    fs.writeFileSync(path.join(__dirname, 'extractedPdf.pdf'), pdfDocument)
  })
  .catch(error => {
    console.log(error)
  })

PHP
// create extract object
$create_extract = [
    // document
    "document" => [
        "docData" => $client->getFileData('myPdf.pdf')
    ],
    // action
    "extractAction" => [
        "extractPages" => [
            1,
            4
        ]
    ]
];

// extraction
$extractRes = $client->pdf4me()->extract($create_extract);

// decoding the generated PDF and writing it to disk
$extractedPdf = base64_decode($extractRes->document->docData);
file_put_contents('extractedPdf.pdf', $extractedPdf);

Python
# setup the extract_client
extract_client = ExtractClient(pdf4me_client)

# create the extract object
extract = Extract(
    # document
    document=Document(
        doc_data=FileReader().get_file_data('myPdf.pdf')
    ),
    # action
    extract_action=ExtractAction(
        extract_pages=[1,4]
    )
)

# extraction
res = extract_client.extract(extract=extract)

# extracting the generated PDF and writing it to disk 
extracted_pdf = base64.b64decode(res['document']['doc_data'])
with open('extractedPdf.pdf', 'wb') as f:
    f.write(extracted_pdf)

Ruby
file_path = './myPdf.pdf'

action = Pdf4me::Extract.new(
  # document
  document: Pdf4me::Document.new(
    doc_data: Base64.encode64(File.open(file_path, 'rb', &:read))
  ),
  # action
  extract_action: Pdf4me::ExtractAction.new(
    extract_pages: [1, 4]
  )
)
response = action.run

# saving extracted pages
File.open('./extractedPdf.pdf', 'wb') do |f|
  f.write(Base64.decode64(response.document.doc_data))
end

ExtractPages

  • curl
  • C#
  • Java
  • JavaScript
  • PHP
  • Python
  • Ruby
curl
curl https://api.pdf4me.com/Extract/ExtractPages ^
    -H "Authorization: Basic DEV-KEY" ^
    -F pageNrs=1,4 ^
    -F "file=@./myPdf.pdf" ^
    -o ./extractedPdf.pdf

C#
// extraction
byte[] extractedPdf = await Pdf4meClient.Pdf4me.Instance.ExtractClient.ExtractPagesAsync(File.ReadAllBytes("myPdf.pdf"),"1,4");
// and writing the generated PDF to disk
File.WriteAllBytes("extractedPdf.pdf", extractedPdf);

Java
// setup the extractClient
ExtractClient extractClient = new ExtractClient(pdf4meClient);

// extraction and writing the generated PDF to disk
byte[] extractedPdf = extractClient.extractPages("1,4", new File("myPdf.pdf"));
FileUtils.writeByteArrayToFile(new File("extractedPdf.pdf"), extractedPdf);

JavaScript
// setup the extractClient
const extractClient = new pdf4me.ExtractClient(pdf4meClient);

// extraction
extractClient.extractPages('1,4', fs.createReadStream('./myPdf.pdf'))
    .then(pdf => {
        fs.writeFileSync('./extractedPdf.pdf', pdf);
    })
    .catch(err => {
        console.log(err);
    });

PHP
// extraction
$extractPages = $client->pdf4me()->extractPages(
    [
        "pageNrs" => "1,4",
        "file" => __DIR__.'/myPdf.pdf'
    ]
);

// writing it to file
file_put_contents('extractedPdf.pdf', $extractPages);

Python
# setup the extract_client
extract_client = ExtractClient(pdf4me_client)

# extraction
extracted_pdf = extract_client.extract_pages(
    page_nrs='1,4',
    file=FileReader().get_file_handler(path="myPdf.pdf")
)
# writing the generated PDF to disk
with open('extractedPdf.pdf', 'wb') as f:
    f.write(extracted_pdf)

Ruby
a = Pdf4me::ExtractPages.new(
  file: '/myPdf.pdf',
  pages: [4],
  save_path: 'extractedPdf.pdf'
)
a.run

ExtractResources

  • curl
  • C#
  • Java
  • JavaScript
  • PHP
  • Python
  • Ruby
curl
No sample available.

C#
// create extract resource object
var req = new ExtractResources()
{
    //document
    Document = new Document()
    {
        DocData = File.ReadAllBytes("myPdf.pdf"),
        Name = "myPdf.pdf",
    },
    //action
    ExtractResourcesAction = new ExtractResourcesAction()
    {
        ExtractFonts = true,
        ExtractImages = true,
        Outlines = true,
        XmpMetadata = true,
        ListFonts = true,
        ListImages = true
    }
};

//extracting resources
var res = Pdf4me.Instance.ExtractClient.ExtractResourcesAsync(req).GetAwaiter().GetResult();

//saving extracted resource info to a json file
File.WriteAllText("extractResources_result.json", JsonConvert.SerializeObject(res));

JavaScript
// setup the pdf4meClient
const pdf4meClient = pdf4me.createClient('YOUR API KEY')

// create extract resource object
const extractResourcesReq = {
  // document
  document: {
    docData: fs.readFileSync(path.join(__dirname, 'myPdf.pdf')).toString('base64'),
  },
  // action
  extractResourcesAction: {
    extractFonts: true,
    extractImages: true,
    listFonts: true,
    listImages: true,
    outlines: true,
    xmpMetadata: true,
  },
}

// extract resources
pdf4meClient
  .extractResources(extractResourcesReq)
  .then(function(extractResourcesRes) {
    // and writing it to disk
    fs.writeFileSync(path.join(__dirname, 'extractResources_result.json'), JSON.stringify(extractResourcesRes, null, 2))
  })
  .catch(error => {
    console.log(error)
    process.exit(1)
  })

PHP
// create extract resource object
$create_extract_resource = [
    'document'=> [
        'name' => 'PDF_10pages.pdf',
        'docData' => $pdf4meclient->getFileData('PDF_10pages.pdf')
    ],
    'ExtractResourcesAction' => [
        'outlines' => 0,
        'xmpMetadata' => 1,
        'listFonts' => 1,
        'extractFonts' => 1,
        'extractImages' => 1,
        'listImages' => 1
    ]
];
 
// extract resources
$res = $pdf4meclient->pdf4me()->extractResources($create_extract_resource);

echo $res["pdfResources"];

Python
# setup the extract_client
extract_client = ExtractClient(pdf4me_client)

# create the extract resources object
extract_resources = ExtractResources(
    # document
    document=Document(
        doc_data=FileReader().get_file_data('PDF_10pages.pdf')
    ),
    # action
    extract_resources_action=ExtractResourcesAction(
        extract_fonts=1,
        extract_images=1,
        list_fonts=1,
        list_images=0,
        outlines=1,
        xmp_metadata=1
    )
)

# extraction
res = extract_client.extract_resources(extract_resources=extract_resources)

# writing it to disk
with open('extractResources_result.json', 'w') as f:
    json.dump(res, f)

Models

Extract

Name Type Description Notes
document Document
extractAction ExtractAction
jobId String [optional]
jobIdExtern String [optional]
integrations [String] [optional]
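
To make the Extract model concrete, the following Python sketch assembles a request body as a plain dictionary. It is a hedged illustration only: the field names come from the model tables and the JavaScript sample, docData is base64-encoded as in that sample, and the helper name build_extract_request is hypothetical. The sketch does not call the service.

```python
import base64
import json


def build_extract_request(pdf_bytes, pages, name=None):
    """Build an Extract request body (hypothetical helper, no network call)."""
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    # docData carries the document bytes as base64 text, as in the JavaScript sample.
    document = {"docData": base64.b64encode(pdf_bytes).decode("ascii")}
    if name is not None:
        document["name"] = name
    return {
        "document": document,
        "extractAction": {"extractPages": list(pages)},
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real PDF file.
    request = build_extract_request(b"%PDF-1.4 placeholder", [1, 4], name="myPdf.pdf")
    print(json.dumps(request, indent=2))
```

The optional fields jobId, jobIdExtern and integrations are left out here because the samples do not set them.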

ExtractAction

Name Type Description Notes
extractPages [Integer] Page numbers of the pages to extract from the document. [optional]

ExtractRes

Name Type Description Notes
document Document PDF consisting of the extracted pages.

Document

Name Type Description Notes
jobId String JobId of Documents WorkingSet.
documentId String Document Id
name String Filename including file type.
docStatus String Status of the Document, e.g. Stamped.
pages Page Description of pages.
docData [byte] Document bytes.
docMetadata DocMetadata Document metadata such as title and pageCount.
docLogs DocLog Logging information about the request, e.g. timestamp.

Page

Name Type Description Notes
documentId String Globally unique Id.
pageId String Globally unique Id.
pageNumber Integer PageNumber, starting with 1.
rotate double By how much the page was rotated from its original orientation.
thumbnail byte Thumbnail representing this particular page.
sourceDocumentId String Id of the document it was created from, e.g. in case of an extraction, the result's sourceDocumentId is the Id of the PDF the pages have been extracted from.
sourcePageNumber Integer Page number of the original page in the original document. For example, if document B consists of page 4 of document A (extraction), the sourcePageNumber of document B's only page is 4.
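
The relationship between pageNumber and sourcePageNumber can be illustrated with a short Python sketch. It models only the two page-number fields and sourceDocumentId from the Page table. The function name, the example document id, and the assumption that extracted pages keep ascending source order are illustrative and not documented behavior.

```python
def describe_extracted_pages(source_document_id, extract_pages):
    """Return Page-like dicts for the document produced by an extraction.

    Assumption of this sketch: the result lists pages in ascending source order.
    """
    pages = []
    for new_number, source_number in enumerate(sorted(set(extract_pages)), start=1):
        pages.append({
            "pageNumber": new_number,                # position in the new document
            "sourceDocumentId": source_document_id,  # document the page came from
            "sourcePageNumber": source_number,       # position in the original
        })
    return pages


if __name__ == "__main__":
    # Document B consists only of page 4 of document A.
    for page in describe_extracted_pages("document-A", [4]):
        print(page)
```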

DocMetadata

Name Type Description Notes
title String Title of document.
subject String Subject of document.
pageCount long Number of pages.
size long Size of the document in bytes.
isEncrypted boolean Whether the document is encrypted.
pdfCompliance String PDF compliance, e.g. PDF/A.
isSigned boolean Whether the document is signed.
uploadedMimeType String Uploaded MimeType, e.g. application/bson.
uploadedFileSize long Uploaded file size.

DocLog

Name Type Description Notes
messageType String MessageType, e.g. PdfALog.
message String Message itself, e.g. a warning.
timestamp dateTime Timestamp.
docLogLevel String Type of message. Supported values: "verbose", "info", "warning", "error", "timing".
durationMilliseconds long Timing for requested log information [ms].
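
As a usage illustration for docLogs, the Python sketch below filters DocLog entries by docLogLevel and sums the durations of "timing" entries. It treats each entry as a plain dictionary keyed by the field names in the table. The helper names and the sample entries are made up for this example.

```python
SUPPORTED_LEVELS = {"verbose", "info", "warning", "error", "timing"}


def select_logs(doc_logs, levels):
    """Return the log entries whose docLogLevel is one of the given levels."""
    unknown = set(levels) - SUPPORTED_LEVELS
    if unknown:
        raise ValueError(f"unsupported log level(s): {sorted(unknown)}")
    return [log for log in doc_logs if log.get("docLogLevel") in levels]


def total_timing_ms(doc_logs):
    """Sum durationMilliseconds over all entries of level "timing"."""
    timing_logs = select_logs(doc_logs, {"timing"})
    return sum(log.get("durationMilliseconds", 0) for log in timing_logs)


if __name__ == "__main__":
    # Made-up entries for illustration only.
    logs = [
        {"docLogLevel": "info", "message": "started"},
        {"docLogLevel": "warning", "message": "font substituted"},
        {"docLogLevel": "timing", "message": "extract", "durationMilliseconds": 120},
    ]
    print(select_logs(logs, {"warning", "error"}))
    print(total_timing_ms(logs))
```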
