Hi!
I've been using Boilerpipe with Bitextor, and everything had worked fine until I processed a PDF file, specifically this one: the JVM ran out of memory and the execution failed. The error message I got is:
Traceback (most recent call last):
File "BoilerpipeSAXInput.java", line 51, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument
File "BoilerpipeSAXInput.java", line 63, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument
File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.parse
File "org.apache.xerces.parsers.XMLParser.java", line -1, in org.apache.xerces.parsers.XMLParser.parse
File "HTMLConfiguration.java", line 452, in org.cyberneko.html.HTMLConfiguration.parse
File "HTMLConfiguration.java", line 499, in org.cyberneko.html.HTMLConfiguration.parse
File "HTMLScanner.java", line 907, in org.cyberneko.html.HTMLScanner.scanDocument
File "HTMLScanner.java", line 1967, in org.cyberneko.html.HTMLScanner$ContentScanner.scan
File "HTMLScanner.java", line 2291, in org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters
File "DefaultFilter.java", line 152, in org.cyberneko.html.filters.DefaultFilter.characters
File "HTMLTagBalancer.java", line 954, in org.cyberneko.html.HTMLTagBalancer.characters
File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.characters
File "BoilerpipeHTMLContentHandler.java", line 293, in de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters
File "BitSet.java", line 447, in java.util.BitSet.set
File "BitSet.java", line 352, in java.util.BitSet.expandTo
File "BitSet.java", line 337, in java.util.BitSet.ensureCapacity
File "Arrays.java", line 3308, in java.util.Arrays.copyOf
Exception: Java Exception
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/cgarcia/miniconda3/envs/bitextor/lib/python3.8/site-packages/boilerpipe/extract/__init__.py", line 67, in __init__
self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space
To take Bitextor out of the picture, I am attaching the file that Bitextor generated from the PDF. It is an HTML file, and it is the one that triggers the problem. Its size is 9.4 MB; I don't know whether that is large enough to make Boilerpipe fail. The problem is not related to PDF input as such, since I processed other PDFs and they finished without errors.
In the end, I figured out that the problem really was the amount of memory (initially I thought it was a memory leak), which seemed really weird to me for a 9.4 MB file. I fixed the problem by increasing the amount of memory available to the JVM that jpype starts. The total memory that the 9.4 MB HTML file required was ~52 GB! My system has 126 GB, so the default max. heap size of the JVM is about 30 GB (by default the JVM caps the heap at roughly a quarter of physical memory). Since the process needed 52 GB and the max. heap size was 30 GB, it ran out of memory.
The purpose of this issue is to alert other people who might have the same problem, and to ask the following question: do these numbers make sense? I mean, 52 GB of memory for a 9.4 MB HTML file?
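To put the numbers in perspective, here is a quick back-of-the-envelope calculation using only the figures reported in this issue. The one-quarter-of-RAM default heap is an assumption about typical JVM behaviour, not something I measured, and 1 GB is taken as 1024 MB.

```python
# Figures taken from this report.
html_size_mb = 9.4      # size of the attached HTML file
heap_needed_gb = 52     # approximate heap the extraction required
physical_ram_gb = 126   # RAM of the machine

# Blow-up factor: heap needed relative to the input size.
blowup = heap_needed_gb * 1024 / html_size_mb

# Assumption: the JVM defaults to a max heap of about 1/4 of physical RAM.
default_heap_gb = physical_ram_gb / 4

print(f"blow-up factor: ~{blowup:,.0f}x")
print(f"estimated default max heap: ~{default_heap_gb:.1f} GB")
print(f"fits in default heap: {heap_needed_gb <= default_heap_gb}")
```

In other words, the heap requirement is several thousand times the size of the input, which is why the default limit is not enough even on a machine with this much RAM.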
The code which triggers the error:
from boilerpipe.extract import Extractor
text = ""
with open("boilerpipe_error.html") as f:
    for l in f:
        text += l
text = text.strip()
Extractor(extractor='ArticleExtractor', html=text)
The fix (run before the above code; it should work, but I haven't tested it outside of the actual file it comes from, so I might have missed something):
import os
import importlib.util
import jpype

# Reserve 80 GB of heap for boilerpipe (the value is in MB)
boilerpipe_max_heap_size = 80 * 1024 # TODO change this value

if not jpype.isJVMStarted():
    # A negative value means "do not pass -Xmx and keep the JVM default"
    max_heap_size = f"-Xmx{boilerpipe_max_heap_size}M" if boilerpipe_max_heap_size >= 0 else ''
    # Locate the boilerpipe package without importing it, then collect its bundled jars
    boilerpipe_dir = os.path.dirname(importlib.util.find_spec("boilerpipe").origin)
    jars = []
    for top, dirs, files in os.walk(os.path.join(boilerpipe_dir, 'data')):
        for nm in files:
            if nm.endswith(".jar"):
                jars.append(os.path.join(top, nm))
    jpype.addClassPath(os.pathsep.join(jars))
    jargs = [jpype.getDefaultJVMPath()]
    if max_heap_size != '':
        jargs.append(max_heap_size)
    jpype.startJVM(*jargs, convertStrings=False)

# ... run boilerpipe
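Since JVM options cannot be changed once the JVM is running, it may be worth checking which heap limit is actually in effect before running the extraction. The following is a minimal sketch, not something I ran against the attached file; the helper names `max_heap_gib` and `report_jvm_max_heap` are made up for illustration, and the second one assumes JPype is installed and the JVM has already been started.

```python
def max_heap_gib(max_memory_bytes):
    """Convert a byte count (as returned by Runtime.maxMemory()) to GiB."""
    return max_memory_bytes / 1024 ** 3


def report_jvm_max_heap():
    """Print the max heap of the running JVM (needs JPype and a started JVM)."""
    import jpype  # imported here so max_heap_gib() works without JPype

    if not jpype.isJVMStarted():
        raise RuntimeError("start the JVM first")
    runtime = jpype.JClass("java.lang.Runtime").getRuntime()
    print(f"JVM max heap: {max_heap_gib(int(runtime.maxMemory())):.1f} GiB")
```

Calling `report_jvm_max_heap()` right after the JVM is started should print a value close to the one passed with `-Xmx`; a value near a quarter of the physical memory suggests that the JVM was already running with its defaults when the option was set.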