Table content is dropped when it contains a formatting tag and include_formatting is enabled.
Example:
<table>
<tr>
<td>
<p><b>GPT-5.4</b></p>
</td>
</tr>
</table>
will become
<table>
<row>
<cell> <p/> </cell>
</row>
</table>
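A minimal script along these lines should reproduce the comparison. It is only a sketch: `HTML` and `extract_xml` are names made up for this report, and trafilatura may return None for a document this small, so the table might need to be embedded in a longer page to trigger extraction at all.

```python
# Sample input from this report, wrapped in a minimal HTML document.
HTML = """<html><body>
<table>
<tr>
<td>
<p><b>GPT-5.4</b></p>
</td>
</tr>
</table>
</body></html>"""


def extract_xml(html, include_formatting):
    """Run trafilatura on `html` and return its XML output (or None)."""
    # Imported here so the module loads even where trafilatura is missing.
    import trafilatura

    return trafilatura.extract(
        html,
        output_format="xml",
        include_tables=True,
        include_formatting=include_formatting,
    )
```

Calling `extract_xml(HTML, False)` and `extract_xml(HTML, True)` and comparing the `<cell>` contents in the two results shows whether the text survives in each mode.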
I have no idea why this happens, so I asked GLM-5, and it said this:
When include_formatting=True, trafilatura keeps formatting tags such as <b>, <strong>, and <i> and converts them to the internal <hi rend="#b"> format.
The problem is in its strip_tags() process:
1. Normal Flow (include_formatting=False):
- <td><p><b>GPT-5.4</b></p></td> → <cell><p>GPT-5.4</p></cell> ✅
2. Problem flow (include_formatting=True):
- <td><p><b>GPT-5.4</b></p></td>
- → Convert to <cell><p><hi rend="#b">GPT-5.4</hi></p></cell>
- → In a cleanup/merge step, the text inside <hi> is handled incorrectly
- → The result becomes <cell><p></p></cell> ❌
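To illustrate the kind of mistake GLM-5 is describing, here is a standard-library sketch of how removing an inline element can lose its text, and how unwrapping it instead keeps the text. This is not trafilatura's code and I have not confirmed that this is the actual cause; `naive_remove` and `unwrap` are hypothetical helpers written only for this example.

```python
import xml.etree.ElementTree as ET


def naive_remove(parent, child):
    """Drop the element outright; its text and tail are lost with it."""
    parent.remove(child)


def unwrap(parent, child):
    """Remove `child` but keep its text, its children, and its tail."""
    index = list(parent).index(child)
    text = child.text or ""
    tail = child.tail or ""
    grandchildren = list(child)
    prev = parent[index - 1] if index > 0 else None

    # The child's leading text moves to wherever text before it lives.
    if prev is None:
        parent.text = (parent.text or "") + text
    else:
        prev.tail = (prev.tail or "") + text

    parent.remove(child)

    # Re-attach the child's own children in the same position.
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)

    # The tail follows the last re-attached child, or joins the moved text.
    if grandchildren:
        last = grandchildren[-1]
        last.tail = (last.tail or "") + tail
    elif prev is None:
        parent.text = (parent.text or "") + tail
    else:
        prev.tail = (prev.tail or "") + tail


SAMPLE = '<cell><p><hi rend="#b">GPT-5.4</hi></p></cell>'

lossy = ET.fromstring(SAMPLE)
naive_remove(lossy.find("p"), lossy.find("p/hi"))
print(ET.tostring(lossy, encoding="unicode"))  # the text is gone

kept = ET.fromstring(SAMPLE)
unwrap(kept.find("p"), kept.find("p/hi"))
print(ET.tostring(kept, encoding="unicode"))  # <cell><p>GPT-5.4</p></cell>
```

If something similar happens to <hi> inside table cells, it would explain the empty <p/> above, but that would need to be checked against the actual table-handling code.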
I want to use trafilatura to extract webpage content and turn it into an ebook to read, so I would like to keep both the formatting and the correct table structure.
For now I can set include_formatting=False to work around the problem, but that is not ideal.
I would be glad to help fix this if anyone can tell me how.