Fix: Add extend_edges function to fix table extraction with one strat text and the other non-text #4878

monchin · 2026-01-25T14:51:40Z

Hi, I referred to PyMuPDF when writing a library to extract PDF tables, and found a table-extraction bug that occurs when one strategy is "text" and the other is not. Please see monchin/tablers#8 for more details.

I have fixed it in my library and found that it also occurs in PyMuPDF, so I'd like to fix it here as well.


JorjMcKie · 2026-01-31T07:49:29Z

We already support parameters add_lines and add_boxes which act like virtual vector graphics and obviously already lead to more edges.
So I fail to see the benefit of adding edges in the proposed way: just use add_lines.
In addition, I do not understand why a feature of this kind should be made dependent on specific detection strategies: If I have "external" information that helps the table algorithm succeed, then I certainly should supply all of it and not care about redundancy considerations: the algorithm is clever enough to drop stuff that it doesn't need.

monchin · 2026-01-31T10:27:41Z

Thank you for your reply!

> So I fail to see the benefit of adding edges in the proposed way: just use add_lines.

I proposed this PR because, IMHO, users want the library to be as easy to use and as correct as possible. add_lines is very flexible, but the problem in this scenario does exist, and users who want correct results in this scenario have to extend the text edges themselves. So why don't we do that for them?

> why a feature of this kind should be made dependent on specific detection strategies

extend_edges is offered in the new code, and anyone can use it however they like. As I said, "one strat text and the other non-text" is a specific configuration, but the problem with this configuration is a general one. With the new code, users get correct results in this scenario without any extra effort, which is why I believe it would be good to extend edges automatically here. If users have other requirements, they can still add other lines, as you said.

JorjMcKie · 2026-01-31T20:47:41Z

I don't understand most of your response.
My argument was that if you have the information that allows you to supply edges, then this information also allows you to supply lines as well - remember: these are pairs of point-likes (p1, p2).
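For illustration, here is a minimal sketch of what "pairs of point-likes" could look like when the column boundaries are already known. The helper name `column_lines` and the coordinates are hypothetical; only the `add_lines` parameter of `find_tables` comes from the discussion above.

```python
def column_lines(xs, y_top, y_bottom):
    """Build one vertical line, as a (p1, p2) pair, per x-coordinate.

    Each point is an (x, y) tuple, which PyMuPDF accepts as a point-like.
    """
    return [((x, y_top), (x, y_bottom)) for x in xs]


# Hypothetical column boundaries for a table spanning y=100..140.
lines = column_lines([50, 175, 300], 100, 140)

# With an open PyMuPDF page, the pairs would be passed like this:
# tabs = page.find_tables(add_lines=lines)
```

This only works when the coordinates are known up front, which is exactly the point under debate in this thread.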

monchin · 2026-02-01T04:18:36Z

Sorry for my poor English. If I have not misunderstood, perhaps the biggest disagreement between us is whether I have the information needed to supply edges.

Say I want to extract all the tables from every page of a document. I have no information about what the tables look like: they may be anywhere on the page, and they may have lines in both directions or in only one. So a user like me, who wants to extract tables as precisely as possible, might run extraction on each page twice: the first time with h_strat lines_strict and v_strat text, and the second time with h_strat text and v_strat lines.
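The two-pass approach described above might be sketched as follows. This assumes that h_strat and v_strat correspond to the `horizontal_strategy` and `vertical_strategy` arguments of `page.find_tables`; the function name `find_tables_two_pass` is made up for this example.

```python
def find_tables_two_pass(page):
    """Run table detection twice with mixed strategies.

    `page` is expected to be a PyMuPDF page (anything with a compatible
    find_tables method works). Returns one list of extracted tables per pass.
    """
    passes = [
        # Pass 1: horizontal edges from drawn lines, vertical edges from text.
        {"horizontal_strategy": "lines_strict", "vertical_strategy": "text"},
        # Pass 2: horizontal edges from text, vertical edges from drawn lines.
        {"horizontal_strategy": "text", "vertical_strategy": "lines"},
    ]
    results = []
    for kwargs in passes:
        finder = page.find_tables(**kwargs)
        results.append([table.extract() for table in finder.tables])
    return results
```

How the results of the two passes are merged or ranked is left open here; the thread does not say.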

So there's a page in a pdf like this

My task is to extract all the tables in a PDF, and a PDF has many pages, so I cannot possibly have the information needed to supply edges for every page. To obtain that information, I would still have to write the same code: use get_drawings to get all the lines, use words_to_edges to get the text edges, and extend those edges myself. If I don't do that and simply extract with one strat "lines_strict" and the other "text", I only get a wrong table, [["1111", "2222"]], on this page. IMO that is too heavy a burden for a user.
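To make "extend these edges" concrete, here is a rough sketch of one possible interpretation: each text-derived vertical edge is stretched until it reaches the outermost drawn horizontal lines that cross its x-position. This is not the code from the PR; the `VEdge` and `HEdge` types, the function name, and the tolerance are all assumptions made for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VEdge:
    x: float
    top: float
    bottom: float


@dataclass(frozen=True)
class HEdge:
    y: float
    x0: float
    x1: float


def extend_vertical_edges(v_edges, h_edges, tol=1.0):
    """Stretch each vertical edge to the y-range of the horizontal edges
    whose x-span contains it (within `tol`). Edges never shrink."""
    extended = []
    for v in v_edges:
        ys = [h.y for h in h_edges if h.x0 - tol <= v.x <= h.x1 + tol]
        if not ys:
            extended.append(v)  # no crossing line: leave unchanged
            continue
        extended.append(
            replace(v, top=min(v.top, min(ys)), bottom=max(v.bottom, max(ys)))
        )
    return extended


# Three drawn rules, and text edges that only cover the middle of the table.
h_edges = [HEdge(100, 50, 300), HEdge(120, 50, 300), HEdge(140, 50, 300)]
v_edges = [VEdge(50, 122, 138), VEdge(175, 122, 138), VEdge(300, 122, 138)]
extended = extend_vertical_edges(v_edges, h_edges)
```

In this example every vertical edge ends up spanning y=100 to y=140, so it intersects all three horizontal rules instead of only the ones next to the text.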

Commit 004447f: Fix: Add extend_edges function to fix table extraction with one strat text and the other non-text
