From f0d8f5f507103d17d4dcffbc07517ed1335e20ef Mon Sep 17 00:00:00 2001
From: Jamie Lemon
Date: Mon, 16 Jun 2025 17:26:41 +0100
Subject: [PATCH 1/4] Documentation: Elaborates on Table related classes.

---
 docs/page.rst | 145 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 129 insertions(+), 16 deletions(-)

diff --git a/docs/page.rst b/docs/page.rst
index b207b3f5b..ffe80d9b7 100644
--- a/docs/page.rst
+++ b/docs/page.rst
@@ -2329,56 +2329,169 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
 ====================================== =====================================
 **Document Level**                     **Page Level**
 ====================================== =====================================
-*Document.get_page_fonts(pno)*         :meth:`Page.get_fonts`
-*Document.get_page_images(pno)*        :meth:`Page.get_images`
-*Document.get_page_pixmap(pno, ...)*   :meth:`Page.get_pixmap`
-*Document.get_page_text(pno, ...)*     :meth:`Page.get_text`
-*Document.search_page_for(pno, ...)*   :meth:`Page.search_for`
+:meth:`Document.get_page_fonts`        :meth:`Page.get_fonts`
+:meth:`Document.get_page_images`       :meth:`Page.get_images`
+:meth:`Document.get_page_pixmap`       :meth:`Page.get_pixmap`
+:meth:`Document.get_page_text`         :meth:`Page.get_text`
+:meth:`Document.search_page_for`       :meth:`Page.search_for`
 ====================================== =====================================
 
-The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
+.. note::
+
+   Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].*. So they **load and discard the page** on each execution.
+
+   However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: `page.get_fonts` == `page.parent.get_page_fonts(page.number)`.
+
+
+When calling the :ref:`Document` equivalent methods then the page number is sent through as a parameter, e.g.:
+
+`Document.get_page_images(pno)` or `Document.get_page_text(pno)`
+
+.. tip::
+
+   The page number parameter, ``pno``, is a 0-based integer `-∞ < pno < page_count`.
+
+
+
+
+
+Tables and Related Classes
+------------------------------------
+
+The `TableFinder` class is returned by :meth:`Page.find_tables` and has related classes as follows:
 
 .. class:: TableFinder
 
    An object always returned by :meth:`Page.find_tables`. Attributes of interest:
 
-   ... attribute:: tables
+   .. attribute:: tables
 
-      A list of :ref:`Table` objects, each of which represents a table found on the page. Empty list if no table found.
+      A list of :class:`Table` objects, each of which represents a table found on the page. An empty list if no tables are found.
 
-   ... attribute:: page
+   .. attribute:: page
 
       A reference to the :ref:`Page` object.
 
 
 .. class:: Table
 
-   An object representing a table found on the page. Attributes of interest:
+   An object representing a table found on the page.
+
+
+   .. attribute:: page
+
+      A description of the page instance for the table.
+
+      :type: `string`
+
+   .. attribute:: cells
+
+      An array of `Rect` objects for each cell in the table.
+
+      :type: list
+
+
+   .. attribute:: header
+
+      A `TableHeader` object if detected.
+
+      :type: `TableHeader`
+
 
    .. attribute:: bbox
 
       The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
 
-
-   .. attribute:: cells
+      :type: :ref:`Rect`
+
+
+
+   .. attribute:: row_count
+
+      Number of rows in the table.
+
+      :type: int
+
+
+   .. attribute:: col_count
+
+      Number of columns in the table.
+
+      :type: int
+
+
+   .. attribute:: rows
+
+      An array of `TableRow` objects for each row in the table.
+
+      :type: list
+
+
+   .. method:: extract()
+
+      Extracts table data into a list.
+
+      :type: list
+
+   .. method:: to_markdown(clean=False, fill_empty=True)
+
+      Extracts table data into a list.
+
+
+      :arg bool clean: If ``True`` then markdown syntax is removed from cell content.
+      :arg bool fill_empty: If ``True`` then cell content `None` is replaced by the values above (columns) or left (rows) in an effort to approximate row and columns spans.
+
+
+      :type: string
+
+
+   .. method:: to_pandas()
+
+      Return a `pandas DataFrame `_ `DataFrame `_ version of the table.
+
+      :type: pandas DataFrame
 
 .. class:: TableHeader
 
-.. class:: TableRow
+   Dedicated class for table headers.
 
+   .. attribute:: bbox
 
+      The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
 
+      :type:`Rect`
 
-.. note::
+   .. attribute:: cells
 
-   Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].*. So they **load and discard the page** on each execution.
+      A list of tuples for each bbox of a column header.
+
+      :type: list
+
+   .. attribute:: names
+
+      A list of strings with column header text.
+
+      :type: list
+
+   .. attribute:: external
+
+      A boolean indicating whether the header is outside the table cells.
+
+      :type: `bool`
+
+
+.. class:: TableRow
+
+   Dedicated class for table rows.
+
+
+----
 
-   However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.
 
 .. rubric:: Footnotes

From 3671bd7a9f2e13efc462ae5f5252e9947e48d7d5 Mon Sep 17 00:00:00 2001
From: Jamie Lemon
Date: Mon, 16 Jun 2025 17:34:08 +0100
Subject: [PATCH 2/4] Documentation: corrects method description for to_markdown.

---
 docs/page.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/page.rst b/docs/page.rst
index ffe80d9b7..e4ba59192 100644
--- a/docs/page.rst
+++ b/docs/page.rst
@@ -2437,7 +2437,7 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
    .. method:: to_markdown(clean=False, fill_empty=True)
 
-      Extracts table data into a list.
+      Extracts table data into Markdown text format.
 
 
       :arg bool clean: If ``True`` then markdown syntax is removed from cell content.

From f8b6224b8c66bcf7c889124c9f9e84fb3c035dc1 Mon Sep 17 00:00:00 2001
From: Jamie Lemon
Date: Mon, 16 Jun 2025 21:59:19 +0100
Subject: [PATCH 3/4] Documentation: adds corrections to table info in Page.rst.

---
 docs/page.rst | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/docs/page.rst b/docs/page.rst
index e4ba59192..535b6a0da 100644
--- a/docs/page.rst
+++ b/docs/page.rst
@@ -2373,6 +2373,8 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
       A reference to the :ref:`Page` object.
 
+      :type: :ref:`Page`
+
 
 .. class:: Table
 
@@ -2381,9 +2383,9 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
    .. attribute:: page
 
-      A description of the page instance for the table.
+      A back-reference to the owning page.
 
-      :type: `string`
+      :type: :ref:`Page`
 
    .. attribute:: cells
 
@@ -2394,14 +2396,14 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
    .. attribute:: header
 
-      A `TableHeader` object if detected.
+      A `TableHeader` object.
 
       :type: `TableHeader`
 
 
    .. attribute:: bbox
 
-      The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
+      The bounding box of all cells of the table header.
 
       :type: :ref:`Rect`
 
@@ -2431,7 +2433,7 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
    .. method:: extract()
 
-      Extracts table data into a list.
+      Extracts table cell text data into a list.
 
       :type: list
 
@@ -2462,7 +2464,7 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
    .. attribute:: bbox
 
-      The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
+      The bounding box of the union of cells belonging to the table header, given as a tuple (x0, y0, x1, y1). This rectangle contains all table header cells.
 
       :type:`Rect`
 

From 48e2204b07178f9774e4b4366ad2431df1fef246 Mon Sep 17 00:00:00 2001
From: Jamie Lemon
Date: Mon, 16 Jun 2025 22:02:31 +0100
Subject: [PATCH 4/4] Documentation: Typo fix.

---
 docs/page.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/page.rst b/docs/page.rst
index 535b6a0da..32f3e684d 100644
--- a/docs/page.rst
+++ b/docs/page.rst
@@ -2466,7 +2466,7 @@ The `TableFinder` class is returned by :meth:`Page.find_tables` and has related
 
       The bounding box of the union of cells belonging to the table header, given as a tuple (x0, y0, x1, y1). This rectangle contains all table header cells.
 
-      :type:`Rect`
+      :type: :ref:`Rect`
 
    .. attribute:: cells
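
The classes documented in this series can be exercised together roughly as follows. This is a minimal sketch and not part of the patch: it relies only on names the patches themselves document (`Page.find_tables`, `TableFinder.tables`, `Table.bbox`, `Table.row_count`, `Table.col_count`, `Table.header`, `TableHeader.names`, `TableHeader.external`, `Table.extract()`), and the helper name `summarize_tables` is hypothetical.

```python
def summarize_tables(page):
    """Return one summary dict per table detected on a page.

    `page` is assumed to behave like a PyMuPDF Page, i.e. it offers
    find_tables() returning a TableFinder with a `tables` list.
    The function name and the dict keys are illustrative only.
    """
    finder = page.find_tables()      # TableFinder object
    summaries = []
    for table in finder.tables:      # empty list when no table is found
        summaries.append({
            "bbox": tuple(table.bbox),                    # (x0, y0, x1, y1)
            "shape": (table.row_count, table.col_count),  # rows, columns
            "header": list(table.header.names),           # column header text
            "header_external": table.header.external,     # header outside the cells?
            "rows": table.extract(),                      # cell text as nested lists
        })
    return summaries
```

With PyMuPDF installed, a call could look like `summarize_tables(pymupdf.open("input.pdf")[0])`, where `input.pdf` is a placeholder path. Returning plain dicts keeps the sketch independent of pandas; `Table.to_pandas()` or `Table.to_markdown()` could be used instead where those formats are wanted.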