Unconditional dependency on pandas/numpy increases package size by ~24x (<6MB -> 135MB) #29
Description
Thank you for publishing a client library!
Issue
The here-location-services package currently unconditionally depends on pandas, which depends on numpy, pytz and python-dateutil. On x86-64 Linux (for Python 3.9), these end up being very large (~130MB), with all of the rest of the dependencies being ~5MB. However, pandas is only used for converting the result for two functions associated with the matrix routing API:
here-location-services-python/here_location_services/responses.py
Lines 151 to 182 in 325b4c0
```python
class MatrixRoutingResponse(ApiResponse):
    """A class representing Matrix routing response data."""

    def __init__(self, **kwargs):
        super().__init__()
        self._filters = {"matrix": None}
        for param, default in self._filters.items():
            setattr(self, param, kwargs.get(param, default))

    def to_geojson(self):
        """Return API response as GeoJSON."""
        raise NotImplementedError("This method is not valid for MatrixRoutingResponse.")

    def to_distnaces_matrix(self):
        """Return distnaces matrix in a dataframe."""
        if self.matrix and self.matrix.get("distances"):
            distances = self.matrix.get("distances")
            dest_count = self.matrix.get("numDestinations")
            nested_distances = [
                distances[i : i + dest_count] for i in range(0, len(distances), dest_count)
            ]
            return DataFrame(nested_distances, columns=range(dest_count))

    def to_travel_times_matrix(self):
        """Return travel times matrix in a dataframe."""
        if self.matrix and self.matrix.get("travelTimes"):
            distances = self.matrix.get("travelTimes")
            dest_count = self.matrix.get("numDestinations")
            nested_distances = [
                distances[i : i + dest_count] for i in range(0, len(distances), dest_count)
            ]
            return DataFrame(nested_distances, columns=range(dest_count))
```
It seems unfortunate to require these huge dependencies to be installed for only these two functions when many people are likely not calling them anyway, and when the dependencies seemingly aren't required for any other functionality within this client library.
Potential alternatives
- Have pandas be an optional dependency (for example, via `extras_require={"pandas": ["pandas"]}` in setup.py), and import it on demand in the individual functions that need it. For example:

  ```python
  def to_distnaces_matrix(self):
      """Return distnaces matrix in a dataframe."""
      try:
          from pandas import DataFrame
      except ImportError as e:
          raise ImportError("pandas is not installed, run `pip install here-location-services[pandas]`") from e
      # ... existing implementation as before ...
  ```

  For an example of prior art, this option is what the popular Pydantic library does:
  - 'extra' dependency on `python-dotenv`: https://github.com/samuelcolvin/pydantic/blob/8846ec4685e749b93907081450f592060eeb99b1/setup.py#L134-L137
  - importing from `dotenv` within a function (not at the top level) and catching the `ImportError` to provide additional help to the user: https://github.com/samuelcolvin/pydantic/blob/8846ec4685e749b93907081450f592060eeb99b1/pydantic/env_settings.py#L297-L300
- Remove the pandas dependency totally, and have the functions return the nested lists (`nested_distances`) without converting to a `DataFrame`. A user who wants to use pandas can still convert to a DataFrame themselves: `DataFrame(result.to_distnaces_matrix())` (the `columns=` argument seems to be unnecessary, as doing that call gives the same result AFAICT).
Both of these are probably best considered as breaking changes.
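As a rough illustration of the second alternative, here is a minimal sketch of the reshaping step with no pandas involved. The helper name `to_nested_matrix` and the sample payload are hypothetical, not part of the library; only the slicing logic mirrors the existing methods.

```python
def to_nested_matrix(flat_values, dest_count):
    """Split a flat, row-major list into rows of `dest_count` entries.

    Hypothetical helper: it mirrors the list comprehension already used in
    `to_distnaces_matrix` and `to_travel_times_matrix`, minus the DataFrame.
    """
    if dest_count <= 0:
        raise ValueError("dest_count must be positive")
    return [
        flat_values[i : i + dest_count]
        for i in range(0, len(flat_values), dest_count)
    ]


# Made-up payload in the shape the existing methods read from `self.matrix`.
matrix = {"numDestinations": 3, "distances": [0, 10, 20, 10, 0, 15]}
rows = to_nested_matrix(matrix["distances"], matrix["numDestinations"])
print(rows)  # [[0, 10, 20], [10, 0, 15]]
```

A caller who already has pandas installed could then pass `rows` to `DataFrame(...)` themselves, which keeps the conversion out of the client library.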
Context
We were attempting to use this package in an AWS Lambda, which has strict limits on the size of the code asset; exceeding them results in errors like 'Unzipped size must be smaller than 262144000 bytes' when deploying (relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html#function-configuration-deployment-and-execution "Deployment package (.zip file archive)"). Additionally, larger packages result in slower cold starts: https://mikhail.io/serverless/coldstarts/aws/ .
There are various ways to provide more code beyond the size limits (layers or docker images), but this provides some context for why someone might care about the size of a package and its dependencies. (Those methods are fiddly enough and the cold start impact large enough that we've actually switched away from using this client library for now.)
Package size details
Here are some commands I used to investigate the size impact, leveraging `pip install --target` to install a set of packages to a specific directory:

```shell
uname -a # Linux 322c9a327f85 5.10.104-linuxkit #1 SMP PREEMPT Wed Mar 9 19:01:25 UTC 2022 x86_64 GNU/Linux
python --version # Python 3.9.10
pip install --target=everything here-location-services
pip install --target=deps-pandas requests geojson flexpolyline pyhocon requests_oauthlib pandas
pip install --target=deps-no-pandas requests geojson flexpolyline pyhocon requests_oauthlib
du -sh everything # 135M
du -sh deps-pandas # 134M
du -sh deps-no-pandas # 5.1M
du -sh everything/here_location_services # 484K
```

That is, without pandas, the total installed package size would be 5.1M (deps-no-pandas) + 484K (everything/here_location_services) = ~5.6MB, down from 135MB (everything).
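Where `du` isn't available (or the check should run as part of a build script), a similar measurement can be approximated in Python. This is only a sketch: the function names are made up for illustration, the limit is the number quoted in the Lambda error message above, and summing `st_size` gives apparent file sizes, so the result can differ somewhat from `du`'s block-based disk usage.

```python
import os
from pathlib import Path

# Limit taken from the Lambda deployment error message quoted in this issue.
LAMBDA_UNZIPPED_LIMIT_BYTES = 262_144_000


def dir_size_bytes(root):
    """Sum the apparent sizes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():  # skip symlinks rather than follow them
                total += path.stat().st_size
    return total


def fits_in_lambda(root, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return True if the directory's total size is within `limit` bytes."""
    return dir_size_bytes(root) <= limit


if __name__ == "__main__":
    # Directory names match the `pip install --target=...` commands above.
    for target in ("everything", "deps-pandas", "deps-no-pandas"):
        if os.path.isdir(target):
            size_mib = dir_size_bytes(target) / 2**20
            print(f"{target}: {size_mib:.1f} MiB, fits: {fits_in_lambda(target)}")
```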
Summary of individual packages (reported by `du -sh everything/*`, ignoring the `$package.dist-info` directories that are mostly less than 50k anyway):
| package | size | only required for pandas? |
|---|---|---|
| pandas | 58M | yes |
| numpy.libs | 35M | yes |
| numpy | 33M | yes |
| pytz | 2.8M | yes |
| oauthlib | 1.4M | |
| urllib3 | 872K | |
| dateutil | 748K | yes |
| idna | 496K | |
| here_location_services | 484K | |
| 8 others | 1.5M | |
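To make the split in the table explicit, the rows can be summed by whether they are only needed for pandas. This is a small sketch using the figures exactly as listed above; it assumes `du`'s `K` and `M` suffixes mean KiB and MiB. Since `du -sh` rounds each entry and the `.dist-info` directories are left out, the totals won't exactly match the directory-level numbers.

```python
# (package, size as reported by `du -sh`, only required for pandas?)
SIZES = [
    ("pandas", "58M", True),
    ("numpy.libs", "35M", True),
    ("numpy", "33M", True),
    ("pytz", "2.8M", True),
    ("oauthlib", "1.4M", False),
    ("urllib3", "872K", False),
    ("dateutil", "748K", True),
    ("idna", "496K", False),
    ("here_location_services", "484K", False),
    ("8 others", "1.5M", False),
]

# Assumption: du's suffixes are binary units (K = KiB, M = MiB).
UNIT_TO_MIB = {"K": 1 / 1024, "M": 1.0}


def to_mib(text):
    """Convert a du-style size such as '872K' or '2.8M' to MiB."""
    return float(text[:-1]) * UNIT_TO_MIB[text[-1]]


pandas_only = sum(to_mib(size) for _, size, only in SIZES if only)
remaining = sum(to_mib(size) for _, size, only in SIZES if not only)

print(f"only required for pandas: {pandas_only:.1f} MiB")
print(f"everything else:          {remaining:.1f} MiB")
```

The pandas-only rows come to roughly 130 MiB, in line with the ~130MB figure given at the top of this issue.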