lakeview is a visibility tool for AWS S3-based data lakes.
Think of it as ncdu, but for petabyte-scale data on S3.
Instead of scanning billions of objects using the S3 API (which would require millions of API calls), lakeview uses Athena to query S3 Inventory Reports.
- Aggregate the sizes of directories\* in S3, allowing you to drill down and find what is taking up space.
- Compare sizes between different dates: see how directory sizes change over time between inventory reports.
- _Planned but not yet implemented:_ find the largest duplicates in your directories.

\* S3, being an object store and not a filesystem, doesn't really have a notion of directories, but its API supports so-called "common prefixes".
All capabilities are provided in both a human-consumable web interface and a machine-consumable JSON report - feel free to plug them into your favorite monitoring tool.
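Since S3 exposes only keys and "common prefixes", aggregating directory sizes is essentially string manipulation over object keys. The following is a minimal sketch of that idea (the keys and the `du_by_common_prefix` helper are illustrative, not lakeview's actual implementation, which runs this aggregation in Athena):

```python
from collections import defaultdict

def du_by_common_prefix(objects, prefix="", delimiter="/"):
    """Aggregate object sizes under each common prefix, ncdu-style.

    `objects` is a mapping of S3 key -> size in bytes.
    """
    sizes = defaultdict(int)
    for key, size in objects.items():
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Group under the first path segment past the prefix.
            group = prefix + rest.split(delimiter, 1)[0] + delimiter
        else:
            group = key  # a plain object at this level
        sizes[group] += size
    return dict(sizes)

# Hypothetical inventory: three objects under two "directories".
inventory = {
    "users/alice/data.parquet": 100,
    "users/bob/data.parquet": 250,
    "production/events.parquet": 4000,
}
print(du_by_common_prefix(inventory))
# {'users/': 350, 'production/': 4000}
```

Drilling down is just a matter of re-running the aggregation with a longer `prefix`.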
1. Ensure you have an S3 Inventory set up (preferably as Parquet or ORC)
2. Verify the table is registered in Athena
3. Run lakeview as a standalone Docker container:

   ```shell
   docker run -it -p 5000:5000 \
     -v $HOME/.aws:/home/lakeview/.aws \
     treeverse/lakeview \
     --table <athena table name> \
     --output-location <s3 uri>
   ```

   Note: `<athena table name>` is the name you gave in step 2, and `<s3 uri>` is a location in S3 where Athena can store its results (e.g. `s3://my-bucket/athena/`)
4. Open http://localhost:5000/ and start exploring
To get results as JSON, add `Accept: application/json` to your request headers, or pass `json` as a query string parameter.
- `prefix` (default: `""`): return objects and directories[1] starting with the given prefix
- `delimiter` (default: `"/"`): use this character as a delimiter to group objects under a common prefix
- `date`: date string corresponding to the inventory you'd like to query (`YYYY-MM-DD-00-00` is S3's default structure)
- `compare` (optional): another date string. If present, lakeview will calculate a diff between the two reports for every common prefix and sort the results by the largest absolute diff
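Since these are plain query-string parameters, any HTTP client works. Below is a sketch of assembling a `du` request URL with Python's standard library; the host/port assume the default Docker setup above, and the `du_url` helper with its default argument values is illustrative, not part of lakeview:

```python
from urllib.parse import urlencode

def du_url(base="http://localhost:5000", prefix="", delimiter="/",
           date="2020-08-23-00-00", compare=None, as_json=True):
    """Build a lakeview /du request URL from the documented parameters."""
    params = {"prefix": prefix, "delimiter": delimiter, "date": date}
    if compare:
        params["compare"] = compare
    query = urlencode(params)
    if as_json:
        # `json` is a bare flag: its presence alone selects the JSON report.
        # Alternatively, send an `Accept: application/json` request header.
        query += "&json"
    return f"{base}/du?{query}"

print(du_url(compare="2020-08-22-00-00"))
# http://localhost:5000/du?prefix=&delimiter=%2F&date=2020-08-23-00-00&compare=2020-08-22-00-00&json
```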
Request:

```
http://localhost:5000/du?prefix=&delimiter=%2F&date=2020-08-23-00-00&compare=2020-08-22-00-00&json
```
Response:
```json
{
  "compare": "2020-08-22-00-00",
  "date": "2020-08-23-00-00",
  "delimiter": "/",
  "prefix": "",
  "response": [
    {
      "common_prefix": "users/",
      "diff": 3363690400953,
      "size_left": 231203538669496,
      "size_right": 231203538669496
    },
    {
      "common_prefix": "production/",
      "diff": 2737293183914,
      "size_left": 6238586023266733,
      "size_right": 6238586023266733
    },
    {
      "common_prefix": "staging/",
      "diff": 281953288549,
      "size_left": 367219795944457,
      "size_right": 367219795944457
    },
    ...
  ]
}
```
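Because the `response` array is already sorted by largest absolute diff, feeding it into a monitoring tool is just a matter of iterating over it. A sketch of rendering entries from a payload like the one above into human-readable rows (the payload below is a trimmed copy of the sample response; the helper names are illustrative):

```python
def human_size(n):
    """Render a byte count as a human-readable string."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if abs(n) < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} EB"

def report_rows(payload):
    """Yield (common_prefix, size, diff) rows from a lakeview /du JSON payload."""
    for entry in payload["response"]:
        yield (entry["common_prefix"],
               human_size(entry["size_left"]),
               human_size(entry["diff"]))

payload = {
    "date": "2020-08-23-00-00",
    "compare": "2020-08-22-00-00",
    "response": [
        {"common_prefix": "users/", "diff": 3363690400953,
         "size_left": 231203538669496, "size_right": 231203538669496},
    ],
}
for prefix, size, diff in report_rows(payload):
    print(f"{prefix}\t{size}\t(+{diff})")
```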
Clone the repo, and from the root directory install the dependencies:

```shell
$ pip install -r requirements.txt
```

Then run the server:

```shell
$ python server.py \
    --table <athena table name> \
    --output-location <s3 uri>
```

For a complete reference, run:

```shell
$ python server.py --help
```
lakeview is distributed under the Apache 2.0 license. See the included LICENSE file.
lakeview was originally built (with <3) by Treeverse.
We're actively developing lakeFS as an open source tool that delivers resilience and manageability to object-storage-based data lakes.