Documentation: Elaborates on Table related classes. #4563

Merged
merged 4 commits into from
Jun 17, 2025
149 changes: 132 additions & 17 deletions docs/page.rst
@@ -2329,56 +2329,171 @@ This is an overview of homologous methods on the :ref:`Document` and on the :ref
====================================== =====================================
**Document Level** **Page Level**
====================================== =====================================
*Document.get_page_fonts(pno)* :meth:`Page.get_fonts`
*Document.get_page_images(pno)* :meth:`Page.get_images`
*Document.get_page_pixmap(pno, ...)* :meth:`Page.get_pixmap`
*Document.get_page_text(pno, ...)* :meth:`Page.get_text`
*Document.search_page_for(pno, ...)* :meth:`Page.search_for`
:meth:`Document.get_page_fonts` :meth:`Page.get_fonts`
:meth:`Document.get_page_images` :meth:`Page.get_images`
:meth:`Document.get_page_pixmap` :meth:`Page.get_pixmap`
:meth:`Document.get_page_text` :meth:`Page.get_text`
:meth:`Document.search_page_for` :meth:`Page.search_for`
====================================== =====================================

The page number "pno" is a 0-based integer `-∞ < pno < page_count`.
.. note::

Most document methods (left column) exist for convenience and are just wrappers for *Document[pno].<page method>*. So they **load and discard the page** on each execution.

However, the first two methods work differently. They only need a page's object definition statement, so the page itself will **not** be loaded. For example, :meth:`Page.get_fonts` is a wrapper the other way round and is defined as follows: `page.get_fonts` == `page.parent.get_page_fonts(page.number)`.
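
The two wrapper directions described in this note can be illustrated with a toy model. The classes ``ToyDocument`` and ``ToyPage`` below are hypothetical stand-ins, not PyMuPDF classes; they only mimic the delegation pattern, assuming it works as the note states.

```python
class ToyPage:
    """Hypothetical stand-in for a page; NOT the PyMuPDF Page class."""

    def __init__(self, parent, number):
        self.parent = parent
        self.number = number

    def get_text(self):
        # Page-level work: needs a loaded page object.
        return self.parent._texts[self.number]

    def get_fonts(self):
        # Wrapper "the other way round": delegates to the document.
        return self.parent.get_page_fonts(self.number)


class ToyDocument:
    """Hypothetical stand-in for a document; NOT the PyMuPDF Document class."""

    def __init__(self, texts, fonts):
        self._texts = texts
        self._fonts = fonts
        self.pages_loaded = 0  # counts how often a page object is created

    def __getitem__(self, pno):
        self.pages_loaded += 1
        return ToyPage(self, pno)

    def get_page_text(self, pno):
        # Convenience wrapper: load the page, call its method, discard it.
        return self[pno].get_text()

    def get_page_fonts(self, pno):
        # Works from document-level data only; no page is loaded.
        return self._fonts[pno]


doc = ToyDocument(
    texts=["first page", "second page"],
    fonts=[["Helvetica"], ["Courier"]],
)
text = doc.get_page_text(1)
loads_after_text = doc.pages_loaded
fonts = doc.get_page_fonts(1)
loads_after_fonts = doc.pages_loaded
```

In this model, ``get_page_text`` creates a page object on every call, while ``get_page_fonts`` leaves the load counter unchanged.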


When calling the equivalent :ref:`Document` methods, the page number is passed as a parameter, e.g.:

`Document.get_page_images(pno)` or `Document.get_page_text(pno)`

.. tip::

The page number parameter, ``pno``, is a 0-based integer satisfying `-∞ < pno < page_count`. Negative values count back from the end of the document.
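
As a rough illustration of the range `-∞ < pno < page_count`, the sketch below maps any acceptable page number to a 0-based index. The helper name ``normalize_pno`` is hypothetical, and the rule it applies (add ``page_count`` while the number is negative) is an assumption about how negative values are resolved, not a quote of the library's code.

```python
def normalize_pno(pno: int, page_count: int) -> int:
    """Map a page number in -inf < pno < page_count to a 0-based index.

    Hypothetical helper for illustration only.
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    if pno >= page_count:
        raise ValueError("pno must be smaller than page_count")
    # Assumed rule: keep adding page_count until the number is non-negative.
    while pno < 0:
        pno += page_count
    return pno


last_page = normalize_pno(-1, 10)
wrapped_twice = normalize_pno(-11, 10)
first_page = normalize_pno(0, 10)
```

Under this assumption, ``-1`` addresses the last page and ``-11`` wraps around to the same page in a 10-page document.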





Tables and Related Classes
------------------------------------

A `TableFinder` object is returned by :meth:`Page.find_tables`. It and its related classes are described below:


.. class:: TableFinder

An object always returned by :meth:`Page.find_tables`. Attributes of interest:

... attribute:: tables
.. attribute:: tables

A list of :ref:`Table` objects, each of which represents a table found on the page. Empty list if no table found.
A list of :class:`Table` objects, each of which represents a table found on the page. An empty list if no tables are found.

... attribute:: page
.. attribute:: page

A reference to the :ref:`Page` object.

:type: :ref:`Page`


.. class:: Table

An object representing a table found on the page. Attributes of interest:
An object representing a table found on the page.

.. attribute:: bbox

The bounding box of the table given as a tuple `(x0, y0, x1, y1)`. This is the rectangle that contains all cells of the table.
.. attribute:: page


A back-reference to the owning page.

:type: :ref:`Page`

.. attribute:: cells

A list of `Rect` objects, one for each cell in the table.

:type: list


.. attribute:: header

A `TableHeader` object.

:type: `TableHeader`


.. attribute:: bbox

The bounding box of all cells of the table header.


:type: :ref:`Rect`



.. attribute:: row_count

Number of rows in the table.

:type: int


.. attribute:: col_count

Number of columns in the table.

:type: int


.. attribute:: rows

A list of `TableRow` objects, one for each row in the table.

:type: list


.. method:: extract()

Extracts the text of the table cells into a list of rows, where each row is a list of cell text values.

:rtype: list

.. method:: to_markdown(clean=False, fill_empty=True)

Extracts table data into Markdown text format.


:arg bool clean: If ``True``, Markdown syntax is removed from the cell content.
:arg bool fill_empty: If ``True``, cells whose content is `None` are filled with the value from the cell above (for columns) or to the left (for rows), in an effort to approximate row and column spans.


:rtype: str
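
The effect of ``fill_empty`` can be approximated in plain Python. The function ``fill_empty_cells`` below is a hypothetical sketch, not the library's implementation; in particular, the order used here (fill from the left first, then from above) is an assumption.

```python
def fill_empty_cells(rows):
    """Return a copy of `rows` with None cells filled from neighbours.

    Hypothetical sketch: fills from the left within each row first,
    then from the cell above within each column.
    """
    cells = [list(row) for row in rows]  # do not modify the input

    # Rows: copy content from left to right into None cells.
    for row in cells:
        for i in range(1, len(row)):
            if row[i] is None:
                row[i] = row[i - 1]

    # Columns: copy content from top to bottom into None cells.
    for j in range(1, len(cells)):
        for i in range(len(cells[j])):
            if cells[j][i] is None and i < len(cells[j - 1]):
                cells[j][i] = cells[j - 1][i]

    return cells


table = [
    ["Region", None],
    [None, "Q1"],
    [None, "Q2"],
]
filled = fill_empty_cells(table)
```

Here the empty cell next to ``"Region"`` is filled from the left, and the empty cells below ``"Region"`` are filled from above, which mimics a cell spanning several rows or columns.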


.. method:: to_pandas()

Returns a `pandas <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_ version of the table.

:rtype: pandas DataFrame



.. class:: TableHeader

.. class:: TableRow

Dedicated class for table headers.

.. attribute:: bbox

The bounding box of the table header, given as a tuple `(x0, y0, x1, y1)`. It is the union of all table header cells.

:type: :ref:`Rect`
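
A bounding box that is the union of several cells can be computed as follows. This is a minimal sketch, assuming each cell is an `(x0, y0, x1, y1)` tuple in the same coordinate system; the helper name ``union_bbox`` is hypothetical and not part of the library.

```python
def union_bbox(cells):
    """Return the smallest (x0, y0, x1, y1) rectangle containing all cells.

    Hypothetical helper; entries that are None (missing cells) are skipped.
    """
    boxes = [c for c in cells if c is not None]
    if not boxes:
        raise ValueError("no cells to join")
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)


header_cells = [
    (50, 100, 150, 120),
    (150, 100, 260, 120),
    None,  # a missing cell is ignored
    (260, 100, 300, 125),
]
header_bbox = union_bbox(header_cells)
```

The result spans from the leftmost and topmost edges to the rightmost and bottommost edges of the given cells.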

.. note::
.. attribute:: cells

Most document methods (left column) exist for convenience reasons, and are just wrappers for: *Document[pno].<page method>*. So they **load and discard the page** on each execution.
A list of bbox tuples, one for each column header.

:type: list

.. attribute:: names

A list of strings with column header text.

:type: list

.. attribute:: external

A boolean indicating whether the header is outside the table cells.

:type: `bool`


.. class:: TableRow

Dedicated class for table rows.


----

However, the first two methods work differently. They only need a page's object definition statement - the page itself will **not** be loaded. So e.g. :meth:`Page.get_fonts` is a wrapper the other way round and defined as follows: *page.get_fonts == page.parent.get_page_fonts(page.number)*.

.. rubric:: Footnotes
