@monchin

@monchin monchin commented Jan 25, 2026

Hi, I referred to PyMuPDF when writing a library to extract PDF tables, and found a table-extraction bug when one strategy is "text" and the other is not. Please see monchin/tablers#8 for more details.

I have fixed it in my library. Since it also occurs in PyMuPDF, I'd like to fix it here as well.

@JorjMcKie
Collaborator

We already support parameters add_lines and add_boxes which act like virtual vector graphics and obviously already lead to more edges.
So I fail to see the benefit of adding edges in the proposed way: just use add_lines.
In addition, I do not understand why a feature of this kind should be made dependent on specific detection strategies: if I have "external" information that helps the table algorithm succeed, then I certainly should supply all of it and not care about redundancy considerations. The algorithm is clever enough to drop stuff that it doesn't need.

@monchin
Author

monchin commented Jan 31, 2026

Thank you for your reply!

> So I fail to see the benefit of adding edges in the proposed way: just use add_lines.

I proposed this PR because, IMHO, users want the library to be as easy to use and as correct as possible. add_lines is very flexible, but the problem in this scenario does exist, and users who want correct results here have to extend the text edges themselves. So why don't we do that for them?

> why a feature of this kind should be made dependent on specific detection strategies

extend_edges is offered in the new code, and anyone can use it however they like. As I said, "one strategy text and the other non-text" is a specific combination, but the problem within that combination is general. With the new code, users get correct results in this scenario without any effort, which is why I believe it would be good to extend edges automatically here. If users have other requirements, they can still add other lines, as you said.

@JorjMcKie
Collaborator

I don't understand most of your response.
My argument was that if you have the information that allows you to supply edges, then this information also allows you to supply lines as well - remember: these are pairs of point-likes (p1, p2).
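
To illustrate the point about pairs of point-likes, here is a minimal sketch of turning edges into lines suitable for add_lines. It assumes edges are dicts with pdfplumber-style keys (x0, top, x1, bottom); the helper name edges_to_lines and the sample coordinates are hypothetical and not part of PyMuPDF.

```python
def edges_to_lines(edges):
    """Convert edge dicts into (p1, p2) pairs of point-likes.

    Assumption: each edge is a dict with keys x0, top, x1, bottom,
    as in pdfplumber-style table code. This helper is hypothetical.
    """
    lines = []
    for e in edges:
        p1 = (e["x0"], e["top"])      # start point (x, y)
        p2 = (e["x1"], e["bottom"])   # end point (x, y)
        lines.append((p1, p2))
    return lines


# Two made-up vertical text edges, for illustration only.
text_edges = [
    {"x0": 72.0, "top": 100.0, "x1": 72.0, "bottom": 140.0},
    {"x0": 200.0, "top": 100.0, "x1": 200.0, "bottom": 140.0},
]
virtual_lines = edges_to_lines(text_edges)
```

With PyMuPDF available, the result could then be passed as page.find_tables(add_lines=virtual_lines).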

@monchin
Author

monchin commented Feb 1, 2026

Sorry for my poor English. If I haven't misunderstood, perhaps the greatest divergence between us is whether I have the information to supply edges.

Say I want to extract all the tables from every page of a document. I have no information about what the tables look like: they may be anywhere on the page, and may have lines in both directions or in only one. So for a user like me who wants to extract tables as precisely as possible, I might run extraction twice on each page: the first time with h_strat lines_strict and v_strat text, and the second time with h_strat text and v_strat lines.
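
The two-pass approach described above might look roughly like the sketch below. It assumes page is a PyMuPDF Page whose find_tables accepts the horizontal_strategy and vertical_strategy keyword arguments; the wrapper name find_tables_two_pass is hypothetical.

```python
def find_tables_two_pass(page):
    """Run table detection twice with mixed strategies.

    Assumption: `page` exposes find_tables(horizontal_strategy=...,
    vertical_strategy=...) as a PyMuPDF Page does. The wrapper itself
    is hypothetical and only illustrates the workflow.
    """
    passes = [
        # Pass 1: drawn horizontal lines, text-derived vertical edges.
        {"horizontal_strategy": "lines_strict", "vertical_strategy": "text"},
        # Pass 2: text-derived horizontal edges, drawn vertical lines.
        {"horizontal_strategy": "text", "vertical_strategy": "lines"},
    ]
    return [page.find_tables(**kwargs) for kwargs in passes]
```

Each returned finder would then be inspected for its tables, and the caller decides which pass produced the better result for that page.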

Suppose there is a page in a PDF like this:
image
My task is to extract all the tables in a PDF with many pages, so I cannot have the information needed to supply edges for every page. To get that information, I would still have to write the same code: use get_drawings to get all the lines, use words_to_edges to get the text edges, and extend those edges myself. If I don't do that and just extract with one strategy "lines_strict" and the other "text", I only get a wrong table, [["1111", "2222"]], from this page. IMO that's too heavy a burden for a user.
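
As a rough idea of what "extend these edges myself" involves, the sketch below stretches vertical text edges so that they reach the outermost horizontal lines. It is not the code from this PR: the function name extend_vertical_edges, the pdfplumber-style edge keys, and the sample coordinates are all assumptions, and it simplifies by treating every horizontal edge on the page as belonging to one table.

```python
def extend_vertical_edges(v_edges, h_edges):
    """Stretch vertical edges to span all horizontal edges.

    Hypothetical helper. Assumes edges are dicts with keys
    x0, top, x1, bottom, and that all h_edges belong to one table.
    """
    if not h_edges:
        return [dict(e) for e in v_edges]
    top = min(h["top"] for h in h_edges)
    bottom = max(h["bottom"] for h in h_edges)
    extended = []
    for e in v_edges:
        new = dict(e)                        # leave the input untouched
        new["top"] = min(e["top"], top)      # reach the topmost line
        new["bottom"] = max(e["bottom"], bottom)  # reach the lowest line
        extended.append(new)
    return extended


# Made-up data: three ruled lines and two text-derived vertical edges
# that stop short of the top and bottom rules.
h_edges = [
    {"x0": 70.0, "top": y, "x1": 210.0, "bottom": y}
    for y in (90.0, 120.0, 150.0)
]
v_edges = [
    {"x0": 72.0, "top": 95.0, "x1": 72.0, "bottom": 145.0},
    {"x0": 200.0, "top": 95.0, "x1": 200.0, "bottom": 145.0},
]
extended = extend_vertical_edges(v_edges, h_edges)
```

After extension, each vertical edge runs from the first to the last ruled line, so it can intersect every horizontal edge instead of only those inside the text's own extent.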
