java.lang.OutOfMemoryError: Java heap space

Hi!

I've been using Boilerpipe with [Bitextor](https://github.com/bitextor/bitextor), and everything has worked fine so far. However, when I processed a PDF file, specifically [this one](https://ec.europa.eu/clima/system/files/2016-11/analysis_appendix_en.pdf), I ran out of memory and the execution failed. The error message I got is:

```
Traceback (most recent call last):                                                                                                                                                                           
  File "BoilerpipeSAXInput.java", line 51, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument                                                                                                       
  File "BoilerpipeSAXInput.java", line 63, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument                                                                                                       
  File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.parse                                                                                     
  File "org.apache.xerces.parsers.XMLParser.java", line -1, in org.apache.xerces.parsers.XMLParser.parse                                                                                                     
  File "HTMLConfiguration.java", line 452, in org.cyberneko.html.HTMLConfiguration.parse                                                                                                                     
  File "HTMLConfiguration.java", line 499, in org.cyberneko.html.HTMLConfiguration.parse                                                                                                                     
  File "HTMLScanner.java", line 907, in org.cyberneko.html.HTMLScanner.scanDocument                                                                                                                          
  File "HTMLScanner.java", line 1967, in org.cyberneko.html.HTMLScanner$ContentScanner.scan                                                                                                                  
  File "HTMLScanner.java", line 2291, in org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters                                                                                                        
  File "DefaultFilter.java", line 152, in org.cyberneko.html.filters.DefaultFilter.characters                                                                                                                
  File "HTMLTagBalancer.java", line 954, in org.cyberneko.html.HTMLTagBalancer.characters                                                                                                                    
  File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.characters                                                                                
  File "BoilerpipeHTMLContentHandler.java", line 293, in de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters                                                                                       
  File "BitSet.java", line 447, in java.util.BitSet.set                                                                                                                                                      
  File "BitSet.java", line 352, in java.util.BitSet.expandTo                                                                                                                                                 
  File "BitSet.java", line 337, in java.util.BitSet.ensureCapacity                                                                                                                                           
  File "Arrays.java", line 3308, in java.util.Arrays.copyOf                                                                                                                                                  
Exception: Java Exception                                                                                                                                                                                    
                                                                                                                                                                                                             
The above exception was the direct cause of the following exception:                                                                                                                                         
                                                                                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                           
  File "<stdin>", line 1, in <module>
  File "/home/cgarcia/miniconda3/envs/bitextor/lib/python3.8/site-packages/boilerpipe/extract/__init__.py", line 67, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space                                                                                                                                      
```

To take Bitextor out of the picture, I'm attaching the HTML file that Bitextor generated from the PDF; this HTML file is the one that triggers the problem. It is 9.4 MB, and I don't know whether that is large enough to make Boilerpipe fail. The problem is not related to the PDF format itself, since I processed other PDFs and they finished without errors.

In the end, I figured out that the problem really was the amount of memory (initially I thought it was a memory leak), which seemed very odd to me for a 9.4 MB file. I fixed it by increasing the maximum heap size of the JVM that `jpype` starts. Processing this 9.4 MB HTML file required **~52 GB** of memory! My system has 126 GB, so the default max. heap size of the JVM is 30 GB. Since the process needed 52 GB and the max. heap size was 30 GB, I was running out of memory.

I'm opening this issue to alert other people who might hit the same problem, and to ask the following question: do these numbers make sense? I mean, 52 GB of memory for a 9.4 MB HTML file?
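To put the question in numbers, here is a quick back-of-the-envelope calculation that uses only the figures from this report. Treating 1 GB as 1024 MB is an assumption about the units, and `blowup_factor` is just a hypothetical helper name:

```python
FILE_SIZE_MB = 9.4        # size of the HTML file, as reported above
REQUIRED_HEAP_GB = 52     # approximate heap the run needed, as reported above
DEFAULT_MAX_HEAP_GB = 30  # default JVM max. heap size on the 126 GB machine


def blowup_factor(heap_gb, file_mb):
    """Heap used per unit of input size, treating 1 GB as 1024 MB."""
    return heap_gb * 1024 / file_mb


factor = blowup_factor(REQUIRED_HEAP_GB, FILE_SIZE_MB)
fits_in_default_heap = REQUIRED_HEAP_GB <= DEFAULT_MAX_HEAP_GB

print(f"heap / file size: about {factor:.0f}x")
print(f"fits in default heap: {fits_in_default_heap}")
```

So the heap requirement is several thousand times the size of the input, and it cannot fit in the default heap.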

The code which triggers the error:

```python
from boilerpipe.extract import Extractor

# Read the whole HTML file into a single string
with open("boilerpipe_error.html") as f:
    text = f.read().strip()

Extractor(extractor='ArticleExtractor', html=text)
```

The fix (run it before the code above; it should work, but I haven't tested it outside of the file where I actually use it, so I might have missed something):
```python
import importlib.util
import os

import jpype

# Maximum JVM heap size for boilerpipe, in MB (80 GB here).
# A negative value keeps the JVM default.
boilerpipe_max_heap_size = 80 * 1024 # TODO change this value

if not jpype.isJVMStarted():
    # Directory of the installed boilerpipe package, which bundles its JARs under data/
    package_dir = os.path.dirname(importlib.util.find_spec("boilerpipe").origin)
    jars = []

    for top, dirs, files in os.walk(os.path.join(package_dir, "data")):
        for nm in files:
            if nm.endswith(".jar"):
                jars.append(os.path.join(top, nm))

    jpype.addClassPath(os.pathsep.join(jars))

    jargs = [jpype.getDefaultJVMPath()]

    # Only pass -Xmx when a heap size was requested
    if boilerpipe_max_heap_size >= 0:
        jargs.append(f"-Xmx{boilerpipe_max_heap_size}M")

    jpype.startJVM(*jargs, convertStrings=False)

# ... run boilerpipe
```
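To check that the larger heap was actually picked up, something like the following sketch might help. It asks the running JVM for its maximum heap size through the standard `java.lang.Runtime.maxMemory()` method. The helper names `bytes_to_gib` and `jvm_max_heap_gib` are hypothetical, and I'm assuming the JVM has already been started by the fix above:

```python
def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 1024 ** 3


def jvm_max_heap_gib():
    """Return the running JVM's maximum heap size in GiB.

    Must be called after jpype.startJVM(). The import is local so that
    this snippet can be loaded even where JPype is not installed.
    """
    import jpype

    runtime = jpype.JClass("java.lang.Runtime").getRuntime()
    return bytes_to_gib(int(runtime.maxMemory()))
```

Calling `jvm_max_heap_gib()` right after the fix should report a value close to 80 if the `-Xmx` option was applied, although the JVM may report a slightly different number than the one requested.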

[html.tar.gz](https://github.com/slaveofcode/boilerpipe3/files/7913005/html.tar.gz)

