While working on one of my hackathon projects, I ran into a tricky issue when uploading a PDF document to a web-based application. Since it was a race against time, I quickly decided to use pdf.js to get things going. As a reminder, PDF (Portable Document Format) is a file format developed by Adobe.

Setup

Since I already use Notepad++, I installed the HEX-Editor plugin via Plugins > Plugins Admin…: simply search for it there and install it.

[Image: scholarscan-add-hex-editor-plugin]

For the purposes of this post, I will use the research paper Attention Is All You Need as the example PDF.

To manipulate and inspect the PDF structure, I installed qpdf, a C++ library and command-line tool that enables structural, content-preserving transformations on PDF files. The latest version (e.g., qpdf 12.2.0) can be downloaded from GitHub.
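
Once installed, you can confirm the binary is on your PATH:

qpdf --version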

Hex Editing with Notepad++

Dragging the Attention Is All You Need PDF into Notepad++ gives us a hex representation of the file. However, since it’s in raw hex, it’s not easy to interpret.
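
Every PDF begins with a %PDF-x.y version header, which is exactly what the first bytes in that hex view encode. If you'd rather peek programmatically, a few lines of Python show the same bytes (a sketch against the same file):

with open("1706.03762v7.pdf", "rb") as f:
    head = f.read(16)

print(head)           # b'%PDF-1.5\n...'
print(head.hex(" "))  # the byte-for-byte view the hex editor shows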

[Image: scholarscan-hex-editor-view]

Using qpdf for Extraction

Let’s now try some CLI commands. Using the --qdf option, we can create a human-readable version of the PDF:

qpdf 1706.03762v7.pdf --qdf file.pdf

Here’s a snippet of the output:

%PDF-1.5
%¿÷¢þ
%QDF-1.0

%% Original object ID: 5953 0
1 0 obj
<<
  /Names 3 0 R
  /OpenAction 4 0 R
  /Outlines 5 0 R
  /PageMode /UseOutlines
  /Pages 6 0 R
  /Type /Catalog
>>
endobj

%% Original object ID: 5954 0
2 0 obj
<<
  /Author ()
  /CreationDate (D:20240410211143Z)
  /Creator (LaTeX with hyperref)
  /Keywords ()
  /ModDate (D:20240410211143Z)
  /PTEX.Fullbanner (This is pdfTeX, Version 3.141592653-2.6-1.40.25 \(TeX Live 2023\) kpathsea version 6.3.5)
  /Producer (pdfTeX-1.40.25)
  /Subject ()
  /Title ()
  /Trapped /False
>>
endobj

%% Original object ID: 5952 0
3 0 obj
<<
  /Dests 7 0 R
>>
endobj
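
A nice property of QDF files is that they are meant to be edited by hand. After changing anything, the companion tool fix-qdf (shipped with qpdf) repairs the object offsets and stream lengths:

fix-qdf file.pdf > fixed.pdf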

We can also output the PDF in JSON format:

qpdf 1706.03762v7.pdf --json-output inline.json

Here’s a sample:

{
  "qpdf": [
    {
      "jsonversion": 2,
      "pdfversion": "1.5",
      "pushedinheritedpageresources": false,
      "calledgetallpages": false,
      "maxobjectid": 5957
    },
    {
      "obj:1 0 R": {
        "value": {
          "/D": "u:section.1",
          "/S": "/GoTo"
        }
      },
      "obj:2 0 R": {
        "stream": {
          "data": "eNrFWltv20YWftevmMf2odTcL0BQIE7qJLubNIiD3e4mwYKWxrIaRTQo0o7z6/sdiqZEiRKVhEEBmyOSwzPfucy5kYJxZpjizDLBBQtMKIFfTASJn0xKDJopiduOKYt7nhkhmBTMaDvCbcslk4ZZjcEyFwKTgXkViGgwiinFQsCgmZAcVx1GDZIehJVlGiS1EyM8LkzwTBsmLJ7RtCIuakByTjIDTN5ZBooicIygF4xj+BMBYI0HJsy3hM25kQUcCTDWMKmwuLUYA/gEPg28jgM16DgwaYDDaYxeMufABnh1oOfAkAc9Z8UIt6Q3hnnQ86DjLRjUjnliFA/hkhJgKgiMIC4gFogN3HCSAacfmKKIXyyntPYjwFbO41nMCEqyoJnmjsTGtISQA8SjTGDBQSxcMxKPhWDAlnaYJKAoHUhCmGwENyMhIAnwgSsKIgJ/AuozxkJqwkBYOBMC0vOGbjloD1oXUKoVmp6CwBRdkZCUJqxQtMXskSBVOyAXUjHrPd3SkCKULUicAoIUZACKlpAQoDZkTZw5SzBgVs4ZMirIOEAUAkr2XMsRSc0Lkhvzku5isMQTrIgsUUHoitSPi14Hsh/8sBrrQa/e0hVo13uyXEsagTkJB/ER7wK2EwQUB/aDpHUBNEhiDSoPEnoF05A+yFmoT1k8A+VCrPQwrlgSjCMteDl69IiNL9j4WfY2Y+On7KdVnBTzbJmIn9mvv45+eq+cxb97zzl/QYclHQo65HTI6DClQ0mHSXN33txd/nxwEbm/yBkd0obWRzrM2suVDY7pYdJqn/TLNuBIhwUd3nPNaXzcrFOtfd3wUTTzNwyWzeTYgaK8bIB0ifK3hoVJB6h8G1S64Xb78tM2pGMULhrMO3Jd9QDvUM/jhtZGJssTlQ7SLepdgrloEKaNemI381mz7C90eN1vky09D8WEPGBoZYN/Q7UC+rxZM91jbFBk6oD6bprDoqG6EfnuSpWhbGOsLl99E+rth+Z7V7L2tnq43Llteyy3g/XXDZlVe1/v4Kt0dNfcWDVrPuA5b65sDPOX5kbW4L9rRJrv6flV83jRTN08edLm1Ae8yic6XLbhTdsCn+3p9ICTuWjru2joV/M/9yA0366EdM83H/SYW2wdjgYdwvpP4+Lv91jehIarRrvfszVrFB0CedsoPW0zc5ytjZhNlxc9heqeP03bYaLHNM7aj+wEzJOAd3jO5x17ZmcfxVPtdgMptgPCoj9omy4P8nvjNzcq/9T8+rIJvj2kO4zxTfPwrA1zw/dmkfTrrM4eW27VEaqOex7bZXAv29nFjiG0VLZjm5v4svga3rbxyG9I9f7dIdzD4a8HgDqW4M3agXbVyOcBypOOWFu0vWPZ9jiTPa/1us3O6mTH6Pah7+CZNPjLNukDunmHwoezN2x8Pi8+EPFHj0bjt/c3kY1fp7M4Gj/JlkVcFquqPsfM0fhNXGVlPomrqqSsLr2M03l6ln1m7+iCFVRjyQ8jkMjxLOaFal5NfmvVP/77P9xFtWZc4lEzL8vF4sPBeZ45FIatOeeAx4inc496CUXqmhlUtahH6xOqEVEDrk+o8MSC65OqPvbVCeiNX+fZ5CIW7B2Yf3rOxm/j54I1S3VLRbk9qSjzVVIZP14uM5B6h3KwgoJ6vBrWzIQ1G2HNQFhDD2vQoaJQFeLrUdSjrEdVj7oeTT3aenT16OuxpidrerKmJ2t6sqYna3qypidrerKmp+rnVf28qp9X6+d3RFrxPxpflJdFdf6v+fLjaHyW5dOYV6ITH8bPxy/GT3ACgh9I2BNoSUmTUA/Gep1Q00EplwSwbAOqOG4w7zHb3UmTeRGT62xynUf8ykUI6z01CCAtbUJNmwdAWqkE8IJMvLKH4czy8rfbdCH0gEhESKiRYq1MtAYQAUAA5jB6dxjKqixWH+MtBDMkGKUTZ30FxgCMFgm11nrBXKbX03SZlpILvYxlni4GxKSR0RjdCMigNJS+H9PkOiM4i5jmy/lyNhwgYWErUJjmiaWupzOJtdTXDIm24TCgOxKPnWXZbBEHRON8Ykk8gOPgBUQALPirXjiLMlvOgMjEqysKWrdDgsK+ph7tg4wkN4kxqh/Un9mXeJXdzSdfSFbx880iywdVnpYuoVaesSFxnjZ+SKSi7rBPjDriif5ZTq7TebwFLneeTgrg+pJSqB/WsOCLjPGJc6EyLEACVAchHvFKq+v0S4w5IcvKIkfYy8rV4n44YBLe0hvBDIfQrKI+eeIRUYyA0MTf4xaksrB7X2EyGpg09gH1snswrYq8nBRlHqePC0oMoMFXsbjL8o+r4cApDV8l1wKzEJgyOlH0TqIH3DROsk832Sq9XETAW77MpnFAmSH7gCs1THmV8KpDD18qmAoi8V4ehoXkKi6ncbqc3ZRDmjsyfORFDRpsQQiqH86rdHF2D0SxIJMf0KTgrTRlJ14iO7EwKZdoJES9gP6RLdMVnEJ8NTQklA4JvYdQcFO6klVIBKKfgrVLbU9JnCTnYpan0zmM/XuAUYK4pTyP7YYkUnFsQ0ceHkrEqAScluj28B098GF8OjDQ6x6J0sQaelOoEo0ILWFgmqtjKUKsQqBdZIPGGIF1bQXHEBzpE418vhfOD979lFpqrje4tEropWEvrpu0XJQrMuxpjDdDZnZwANpv9IbAp7jqB7SYLwnNxpUP6AOESJSkV8NIy6EzCUulN73QoQjHYnD58foyTYuUwrAZ0AEgclh6G+rguhWcI2RjURULFxLuj4SSdbz9/6AeWyFpMsFuwHgBSxL9YH6Ux9aUV9K3BgBEpbMWskqZegH9OI+N9BZL01tpjbxEMO1swumtOixJ/S31iiZhaFchspCRByAqFPoA/dCiDnGM06cOtZQM50mwJ4D6IWUv5UW06WFFget1WuToE5XEHcvWEFVvI/lFNYNp56gG2mp7CsrKntY+25nY0T/rmqSRBIeg+ycK8vnhYENuuwfX6s4hwgsVHk6w8zXvatXRNxH6oT0pQUDL+kRUZ+q7GneO7zXurP+6xt2zPCtvqu9B2n08+jCk6nfZui9m3XbfqwJUPzy+GL/N0+XqhkhP7tn4ycX4abydT+KbZ2ds/IIhGsVBcisd4OMpyVNUD6LMQm7luKBPgRLtfadBXs1nCIT1K4TBKmUeKDqjZnFVocwNBWeXKHFkh15HyqiGzRQUYaEPrZCRQ1NKaWChr8koUfBHEoX0PubLLP9EkHa3pg0nbs3tiVJTXkCfYVGbxR+fTDuvSoi7tujuRBngdYI6YSKZg3D9E4VA8SDCN7ThW7t+s51pA//x++WfpBMi8eITeeyerjxmYT8/6dvkfr87781Am9zVTXNXN8193fT24vua3DvblksEVkpCkBVRMoLinzaQRjjR4ui2HbAYUgFFKyQiUVXDmpDwI6Cqyrqs+6Zo/xdsv40P",
          "dict": {
            "/Filter": "/FlateDecode",
            "/First": 816,
            "/N": 100,
            "/Type": "/ObjStm"
          }
        }
      },
      "obj:3 0 R": {
        "value": {
          "/D": [
            "136 0 R",
            "/XYZ",
            108,
            720,
            null
          ]
        }
      },
      "obj:128 0 R": {
        "stream": {
          "data": "eNrs3W+vHNd92PF9JxT14PJ1XMpSboWgQHRFK05rizTNJEpIM6plhTQjwwEZM1JsgjEhxyJo0RZMy6KsNiJhwpHiv7Jk2JRFtVBtklJQwAVvAudBgSZAH7IHWXRwOrs7e2Z2ZnbO7ueDHwr56t79M0teZb49c+buXQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACApRmlcaAAAAAAlm7UiOMGAAAA0L/RwhxDAAAAgH6MWjXwN+hTVt4AAABglU7zE/tMdklHynAQAAAAYCXP8Rs0mVySjpThIAAAAMDqneAvcrI//KQjZTgIAAAAsGJn962c5g856UgZDgIAAACs2Nl9W+f4ek4un7g//wAAAJD1qX27J/jDTDpShoMAAAAAK3Ne3/rZ/TCvupIyHAQAAABYmfP6Ls7u9Zzhf+7+FgAAAEC+5/Udndqvds8ZVcric/e3AAAAADLST39IeZbFX8asRxjVl/jCmj1gg7ew9IMAAAAADEdvp/Ar1nNGjfTwceg5AAAAsPL0nAY9p8FPJb4vPQcAAACode7fZ8+ZfK7uUkb1dzZ45c2eqMH3D+cgAAAAAAPR83qM1es5zV5YxQ/qOQAAAEBb5/49PF12PaeL16bnAAAAAG2d+/fwdHn1nAVf26xH0HMAAACAts79e3i61e45iS9PzwEAAADaOvfv4eky6jmtvP2eb/K14EEAAAAABkLP6bPn3F3qTdsXPAgAAADAQOg5eo6eAwAAAHnRcxr0nE4PuJ4DAAAAtHXu38PTrUPPmfvUeg4AAADQ1rl/D0+3hj0no4MAAAAADISeo+foOQAAAJCXPnvO3F2F9Zy7eg4AAABQ89y/0/N6PSfl0fQcAAAAoMXT/66fSM+5q+cAAAAArZ7+d/1Ees5dPQcAAACof/rf0al9yrO4X/l
          ...
          AAAAAAAAAAAAAAAAAAAAABgz/4/6jXPIg==",
          "dict": {
            "/BitsPerComponent": 8,
            "/ColorSpace": "/DeviceRGB",
            "/Filter": "/FlateDecode",
            "/Height": 2239,
            "/SMask": "175 0 R",
            "/Subtype": "/Image",
            "/Type": "/XObject",
            "/Width": 1520
          }
        }
      }
    }
  ]
}
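
Note that in this inline form, stream data is base64-encoded, and compressed streams stay compressed. A few lines of Python can recover, for example, object 2's bytes (a sketch against the inline.json produced above):

import base64
import json
import zlib

with open("inline.json", encoding="utf-8") as f:
    doc = json.load(f)

# The second array entry holds the objects, keyed "obj:N G R"
objs = doc["qpdf"][1]
raw = base64.b64decode(objs["obj:2 0 R"]["stream"]["data"])

# Object 2 is /FlateDecode-compressed, so zlib can inflate it
print(zlib.decompress(raw)[:200])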

To skip the manual decoding and have qpdf decode each stream and write it to a separate file, use:

mkdir json
qpdf 1706.03762v7.pdf --json-output --decode-level=generalized --json-stream-data=file json\file.json
Note: This PDF has 5,954 objects, so expect a comparable number of stream files.

Here’s a sample:

{
  "qpdf": [
    {
      "jsonversion": 2,
      "pdfversion": "1.5",
      "pushedinheritedpageresources": false,
      "calledgetallpages": false,
      "maxobjectid": 5957
    },
    {
      "obj:1 0 R": {
        "value": {
          "/D": "u:section.1",
          "/S": "/GoTo"
        }
      },
      "obj:2 0 R": {
        "stream": {
          "datafile": "json\\file.json-2",
          "dict": {
            "/First": 816,
            "/N": 100,
            "/Type": "/ObjStm"
          }
        }
      },
      "obj:3 0 R": {
        "value": {
          "/D": [
            "136 0 R",
            "/XYZ",
            108,
            720,
            null
          ]
        }
      },
      "obj:128 0 R": {
        "stream": {
          "datafile": "json\\file.json-128",
          "dict": {
            "/BitsPerComponent": 8,
            "/ColorSpace": "/DeviceRGB",
            "/Height": 2239,
            "/SMask": "175 0 R",
            "/Subtype": "/Image",
            "/Type": "/XObject",
            "/Width": 1520
          }
        }
      }
    }
  ]
}
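
Since everything is now plain JSON, it's easy to inventory the embedded images and locate the stream file holding each one's decoded bytes (a sketch; paths assume the commands above):

import json

with open(r"json\file.json", encoding="utf-8") as f:
    doc = json.load(f)

# Every image XObject, with the file its decoded bytes were written to
for key, obj in doc["qpdf"][1].items():
    d = obj.get("stream", {}).get("dict", {})
    if d.get("/Subtype") == "/Image":
        print(key, d.get("/Width"), d.get("/Height"),
              d.get("/ColorSpace"), "->", obj["stream"]["datafile"])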

Reconstructing Images from PDFs

From the earlier step, we can reconstruct an image from one of the streams (e.g., file.json-128). I did this with Python and the Pillow library in a Jupyter notebook, using a local .venv environment.

First, install the required packages:

pip install pillow numpy

Then use the following script to reconstruct the image.

import numpy as np
from PIL import Image

def reconstruct_image_from_pdf_stream(stream_file_path, width, height, bits_per_component=8, colorspace="RGB"):
    """
    Reconstruct an image from PDF stream data extracted by qpdf.
    
    Args:
        stream_file_path: Path to the stream data file (e.g., 'json\\file.json-128')
        width: Image width from PDF dictionary
        height: Image height from PDF dictionary
        bits_per_component: Bits per color component (usually 8)
        colorspace: Color space (RGB, CMYK, etc.)
    """
    
    # Read the raw stream data
    with open(stream_file_path, 'rb') as f:
        raw_data = f.read()
    
    print(f"Raw data size: {len(raw_data)} bytes")
    
    # Calculate expected size
    if colorspace == "RGB":
        channels = 3
    elif colorspace == "CMYK":
        channels = 4
    elif colorspace == "Gray":
        channels = 1
    else:
        channels = 3  # Default to RGB
    
    bytes_per_pixel = (bits_per_component * channels) // 8
    expected_size = width * height * bytes_per_pixel
    
    print(f"Expected size: {expected_size} bytes")
    print(f"Image dimensions: {width} x {height}")
    print(f"Channels: {channels}, Bits per component: {bits_per_component}")
    
    try:
        # Method 1: Direct reconstruction (most common case)
        if len(raw_data) == expected_size:
            # Create numpy array from raw data
            if bits_per_component == 8:
                img_array = np.frombuffer(raw_data, dtype=np.uint8)
            elif bits_per_component == 16:
                img_array = np.frombuffer(raw_data, dtype=np.uint16)
                # Convert to 8-bit for PIL
                img_array = (img_array // 256).astype(np.uint8)
            else:
                raise ValueError(f"Unsupported bits per component: {bits_per_component}")
            
            # Reshape to image dimensions
            if channels == 1:
                img_array = img_array.reshape((height, width))
                mode = 'L'  # Grayscale
            elif channels == 3:
                img_array = img_array.reshape((height, width, 3))
                mode = 'RGB'
            elif channels == 4:
                img_array = img_array.reshape((height, width, 4))
                mode = 'CMYK'
            
            # Create PIL Image
            image = Image.fromarray(img_array, mode=mode)
            
            # Convert CMYK to RGB if necessary
            if mode == 'CMYK':
                image = image.convert('RGB')
            
            return image
            
        # Method 2: Try to handle compressed data
        else:
            print("Data size doesn't match expected size. Trying decompression...")
            
            # Try zlib decompression (FlateDecode)
            try:
                import zlib
                decompressed = zlib.decompress(raw_data)
                print(f"Decompressed size: {len(decompressed)} bytes")
                
                if len(decompressed) == expected_size:
                    img_array = np.frombuffer(decompressed, dtype=np.uint8)
                    
                    if channels == 1:
                        img_array = img_array.reshape((height, width))
                        mode = 'L'
                    elif channels == 3:
                        img_array = img_array.reshape((height, width, 3))
                        mode = 'RGB'
                    elif channels == 4:
                        img_array = img_array.reshape((height, width, 4))
                        mode = 'CMYK'
                    
                    image = Image.fromarray(img_array, mode=mode)
                    
                    if mode == 'CMYK':
                        image = image.convert('RGB')
                    
                    return image
                    
            except Exception as e:
                print(f"Zlib decompression failed: {e}")
            
            # Try other decompression methods if needed
            # You might need to handle DCTDecode (JPEG), etc.
            
            raise ValueError("Could not reconstruct image from stream data")
            
    except Exception as e:
        print(f"Error reconstructing image: {e}")
        raise

# Usage example based on your PDF data
def main():
    # Your image parameters from the PDF dictionary
    width = 1520
    height = 2239
    bits_per_component = 8
    colorspace = "RGB"  # DeviceRGB
    
    # Path to your stream data file
    stream_file = r"json\file.json-128"
    
    try:
        # Reconstruct the image
        image = reconstruct_image_from_pdf_stream(
            stream_file, 
            width, 
            height, 
            bits_per_component, 
            colorspace
        )
        
        # Save the reconstructed image
        output_path = "reconstructed_image.png"
        image.save(output_path)
        print(f"Image saved as: {output_path}")
        
        # Optionally display the image
        # image.show()
        
    except Exception as e:
        print(f"Failed to reconstruct image: {e}")
        print("You may need to handle specific PDF filters or compression methods.")

if __name__ == "__main__":
    main()

This is the sample output:

Raw data size: 10209840 bytes
Expected size: 10209840 bytes
Image dimensions: 1520 x 2239
Channels: 3, Bits per component: 8
Image saved as: reconstructed_image.png
C:\ProgramData\Temp\ipykernel_34232\1422296325.py:65: DeprecationWarning: 'mode' parameter is deprecated and will be removed in Pillow 13 (2026-10-15)
  image = Image.fromarray(img_array, mode=mode)
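
As the warning hints, recent Pillow versions infer the mode from the array's shape and dtype, so Image.fromarray(img_array) without the mode argument avoids the deprecation for the grayscale and RGB paths; CMYK is the exception, since a 4-channel uint8 array would otherwise be interpreted as RGBA.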

[Image: scholarscan-reconstructed-image]
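
One caveat: --decode-level=generalized only decodes general-purpose filters such as /FlateDecode. If an image object's dict still lists /Filter /DCTDecode, its stream file already contains a complete JPEG, so you can save it directly instead of reconstructing it (a sketch; the object number here is hypothetical):

# Hypothetical object number; check the /Filter entry in file.json first
with open(r"json\file.json-42", "rb") as src, open("image_42.jpg", "wb") as dst:
    dst.write(src.read())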

Detecting Tables in PDF Content

To detect and extract tables from the PDF, I used PyMuPDF.
Install the packages:

pip install PyMuPDF pandas

Then run the following script to scan pages and extract table content. It supports exporting to CSV, Markdown, and more.

import fitz
import pandas as pd

def extract_tables_from_page(pdf_path, page_number):
    """
    Extract tables from a specific page with enhanced error handling
    """
    doc = fitz.open(pdf_path)
    
    try:
        # Check if page number is valid
        if page_number >= doc.page_count:
            print(f"Error: Page {page_number + 1} doesn't exist. PDF has {doc.page_count} pages.")
            return
        
        page = doc[page_number]
        print(f"Analyzing page {page_number + 1} of {doc.page_count}")
        
        # Find tables
        table_finder = page.find_tables()
        tables = table_finder.tables
        
        if not tables:
            print("No tables found on this page.")
            return
        
        print(f"Found {len(tables)} table(s)")
        
        for i, table in enumerate(tables):
            print(f"\n{'='*50}")
            print(f"TABLE {i + 1}")
            print(f"{'='*50}")
            print(f"Position (bbox): {table.bbox}")
            print(f"Dimensions: {table.row_count} rows × {table.col_count} columns")
            
            try:
                # Extract as list of lists
                print(f"\n--- Raw Data (first 5 rows) ---")
                data_list = table.extract()
                
                if data_list:
                    for j, row in enumerate(data_list[:5]):  # Show first 5 rows
                        print(f"Row {j}: {row}")
                    
                    if len(data_list) > 5:
                        print(f"... and {len(data_list) - 5} more rows")
                else:
                    print("No data extracted")
                    continue
                
                # Convert to pandas DataFrame
                print(f"\n--- DataFrame ---")
                try:
                    df = table.to_pandas()
                    print(df.head())
                    print(f"Shape: {df.shape}")
                    
                    # Save to CSV
                    csv_filename = f"table_page{page_number + 1}_table{i + 1}.csv"
                    df.to_csv(csv_filename, index=False)
                    print(f"Saved to: {csv_filename}")
                    
                except Exception as e:
                    print(f"Could not convert to DataFrame: {e}")
                    # Fallback: create DataFrame manually
                    if data_list and len(data_list) > 1:
                        df = pd.DataFrame(data_list[1:], columns=data_list[0])
                        print("Manual DataFrame creation:")
                        print(df.head())
                
                # Convert to Markdown
                print(f"\n--- Markdown ---")
                try:
                    md_text = table.to_markdown()
                    print(md_text[:500] + ("..." if len(md_text) > 500 else ""))
                    
                    # Save markdown
                    md_filename = f"table_page{page_number + 1}_table{i + 1}.md"
                    with open(md_filename, 'w', encoding='utf-8') as f:
                        f.write(md_text)
                    print(f"Markdown saved to: {md_filename}")
                    
                except Exception as e:
                    print(f"Could not convert to Markdown: {e}")
            
            except Exception as e:
                print(f"Error processing table {i + 1}: {e}")
    
    except Exception as e:
        print(f"Error processing page: {e}")
    
    finally:
        doc.close()

def scan_all_pages_for_tables(pdf_path):
    """
    Scan all pages to find which ones contain tables
    """
    doc = fitz.open(pdf_path)
    
    print(f"Scanning {doc.page_count} pages for tables...")
    pages_with_tables = []
    
    for page_num in range(doc.page_count):
        page = doc[page_num]
        table_finder = page.find_tables()
        tables = table_finder.tables
        
        if tables:
            pages_with_tables.append((page_num, len(tables)))
            print(f"Page {page_num + 1}: {len(tables)} table(s)")
    
    doc.close()
    
    if pages_with_tables:
        print(f"\nSummary: Tables found on {len(pages_with_tables)} page(s)")
        for page_num, table_count in pages_with_tables:
            print(f"  Page {page_num + 1}: {table_count} table(s)")
    else:
        print("No tables found in the entire document")
    
    return pages_with_tables

def extract_table_with_options(pdf_path, page_number, table_index=0):
    """
    Extract a specific table with different formatting options
    """
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    
    table_finder = page.find_tables()
    tables = table_finder.tables
    
    if table_index >= len(tables):
        print(f"Table {table_index} not found. Only {len(tables)} table(s) on page {page_number + 1}")
        doc.close()
        return
    
    table = tables[table_index]
    
    print(f"Extracting Table {table_index + 1} from Page {page_number + 1}")
    print(f"Table bounds: {table.bbox}")
    
    # Different extraction options
    formats = {
        "Raw List": lambda: table.extract(),
        "Pandas DataFrame": lambda: table.to_pandas(),
        "Markdown": lambda: table.to_markdown(),
        "CSV String": lambda: table.to_csv(),
    }
    
    for format_name, extract_func in formats.items():
        try:
            print(f"\n--- {format_name} ---")
            result = extract_func()
            
            if isinstance(result, str):
                print(result[:300] + ("..." if len(result) > 300 else ""))
            elif isinstance(result, pd.DataFrame):
                print(result)
            elif isinstance(result, list):
                for i, row in enumerate(result[:3]):
                    print(f"Row {i}: {row}")
                if len(result) > 3:
                    print(f"... {len(result) - 3} more rows")
            else:
                print(result)
                
        except Exception as e:
            print(f"Error with {format_name}: {e}")
    
    doc.close()

# Usage examples
if __name__ == "__main__":
    pdf_file = "1706.03762v7.pdf"
    
    print("=== EXTRACTING FROM PAGE 10 ===")
    extract_tables_from_page(pdf_file, 9)  # Page 10 (0-indexed)
    
    print("\n=== SCANNING ALL PAGES ===")
    pages_with_tables = scan_all_pages_for_tables(pdf_file)
    
    # Extract from other pages if found
    if pages_with_tables:
        print("\n=== EXTRACTING FROM FIRST TABLE-CONTAINING PAGE ===")
        first_page_with_tables = pages_with_tables[0][0]
        extract_table_with_options(pdf_file, first_page_with_tables)

[Image: scholarscan-table-extracted]
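
If a table has no ruling lines and goes undetected, find_tables() also accepts pdfplumber-style detection options; switching to text-based detection sometimes helps (a sketch to tune per document, not a guaranteed fix):

# Infer rows/columns from text alignment instead of drawn lines
table_finder = page.find_tables(strategy="text")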

The Vibe Coding Challenge

xAI Grok Code Fast 1

With the release of Grok Code Fast 1, now available for free via exclusive launch partners, I gave it a try! For my setup, I installed Cline inside Cursor, an AI-powered code editor.

[Image: scholarscan-grok-code-fast-1]

Unfortunately, after several attempts, the project didn’t run as expected. Below is a sample of the autonomous coding agent setup in Cursor using Cline:

[Image: scholarscan-cline-coding-agent]

Google AI Studio

With Google AI Studio, an easy and free way to start building with Gemini, I tried the same prompts there…

[Image: scholarscan-by-google-ai-studio]

Conclusion

This exploration began with a simple PDF upload challenge and expanded into a deep dive through PDF internals—from inspecting raw hex data, manipulating structure with qpdf, reconstructing embedded images, to extracting tables using PyMuPDF. Along the way, I experimented with new tools like Grok Code Fast 1 and AI Studio to assess their potential in handling real-world document analysis.

If you’re working on similar PDF-related tasks, want to experiment with document internals, or just enjoy tinkering with file formats—hopefully this walkthrough gives you a head start.

Have your own tricks for working with PDFs?
Or found a better way to extract structured data from academic papers?
👉 Drop me a message or share your approach—I’d love to learn more!