voltrondata-labs / benchmarks
Language-independent Continuous Benchmarking (CB) for Apache Arrow
License: MIT License
As {arrowbench} is now capable of putting case_version in tags, {benchmarks} needs to be able to pass it through to Conbench, but currently the only thing we're reading from its result JSON is the real time. This story can include using as much of the JSON as practically makes sense without breaking histories.
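A minimal sketch of what the pass-through might look like; the field names here ("tags", "case_version") are assumptions about the {arrowbench} result JSON, not a confirmed schema:

import json

def extract_case_version(result_path):
    # Hypothetical: read the {arrowbench} result JSON and pull out the
    # case_version tag (if present) so it can be forwarded to Conbench.
    with open(result_path) as f:
        result = json.load(f)
    return result.get("tags", {}).get("case_version")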
Right now, we're catching catastrophic benchmark failures, but not posting them to Conbench: #143 (comment)
After voltrondata-labs/arrow-benchmarks-ci#146 is resolved (likely by eddelbuettel/digest#189 resulting in a patch), we should turn on posting here: https://github.com/voltrondata-labs/benchmarks/blob/main/benchmarks/_benchmark.py#L252-L282 (Doing so before will result in a lot of messages on PRs that there are errored benchmarks, but the cause of the error is out of the committer's scope.)
{arrowbench} has a read_csv benchmark that would be nice to have here.
These are the arguments (the defaults from {arrowbench} are ~what we want to run, though I'm happy to adjust them if we decide that only a subset of sources should be default):
- the source argument
- uncompressed and gzip compressed files as the compression argument
- arrow_table and data_frame as the output argument
- the reader argument should be arrow (the other readers it knows how to test are not important for us and should not be run on conbench)

Right now the CSV reading benchmark is reading a gzip file, which is actually something of a worst-case scenario (for hot-in-cache data) since the decompression becomes a bottleneck.
Also, the benchmark only tests the CSV file reader and not the streaming CSV reader which is used by the datasets API.
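For reference, the distinction between the two readers in pyarrow (the path is a placeholder):

import pyarrow.csv as csv

# File reader: parses the whole file into a Table in one call.
table = csv.read_csv("data.csv")

# Streaming reader: yields record batches incrementally; this is the code
# path the datasets API uses and is not covered by the current benchmark.
reader = csv.open_csv("data.csv")
for batch in reader:
    pass  # consume batches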
Add capability to add a case_version tag depending on parameter values, as in voltrondata-labs/arrowbench#105. This only includes versioning Python benchmarks, not R ones, whose versions will be read and passed through from their JSON in a different story.
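A hypothetical sketch of one way this could look (the mapping and function are illustrative, not how #105 implements it): keep a per-benchmark map from case parameters to a version and merge it into tags.

# Illustrative only: bump the version for a case when its workload changes,
# so histories for that case are not compared across the change.
CASE_VERSIONS = {
    ("parquet", "table"): 2,
}

def case_version_tag(params):
    # params is the tuple of case parameter values for one permutation.
    version = CASE_VERSIONS.get(tuple(params))
    return {"case_version": version} if version is not None else {}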
TPC-H query 21 at scale factor 10 has been regularly failing on machines with less than 64 GB of memory, e.g. https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-t4g-linux-compute/builds/2829#0188b088-42b8-4218-ae4a-93dd0058ce8c

As a solution, remove the scale_factor=10 permutations in get_valid_cases() when there is insufficient memory:
benchmarks/benchmarks/tpch_benchmark.py
Lines 6 to 12 in 606f4fc
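A sketch of the filtering idea (psutil and the exact threshold are assumptions here; the real get_valid_cases() builds the case list itself, this only shows the filter):

import psutil

MIN_BYTES_FOR_SF10 = 64 * 1024**3  # ~64 GB

def drop_large_cases(cases):
    # Drop scale_factor=10 permutations on machines without enough memory.
    if psutil.virtual_memory().total >= MIN_BYTES_FOR_SF10:
        return cases
    return [case for case in cases if case.get("scale_factor") != 10]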
Currently arrowbench is passing R errors through in its result, but they're getting ignored by the old error handling here. We should instead pass them through properly so they show up in Conbench.
Now that voltrondata-labs/arrowbench#33 is merged, could we add the TPC-H benchmarks?
The benchmark name is tpc_h, and it accepts the following parameters; the default values in {arrowbench} are what we would like to call:
{arrowbench} automagically generates the TPC-H test data if it doesn't exist (and reuses it if it does), so we should not need to do anything to put the data anywhere.
When I run all of these permutations on my computer (with maximum cores available, 3 iterations), they take a total of 2.9 minutes, so they shouldn't add a huge amount of time to our benchmark runs, though if we want to cut that down, we could remove one of the formats (either parquet or feather; they are pretty similar).
We will be adding queries as we have the ability to run them, which I think will need PRs to benchmarks, but once we have the structure up it should be easy to add them in (they will each be added to the defaults in {arrowbench} as well).
I have seen something in conbench.ursa.dev that I would love to use as an example scenario: do we have a performance regression, or do we maybe have a methodological weakness?
https://conbench.ursa.dev/benchmarks/4fe411bf67a94bc6aa9787fc0394bd03/
That is, around 2023-01-05 07:49 it was measured with apache/arrow@e5ec942 that the benchmark dataset-selectivity with case permutation 10%, nyctaxi_multi_parquet_s3 took almost two seconds in each of three iterations: [1.951955, 1.846497, 1.891674].

In the previous 1-2 weeks it took ~1.2 seconds.
We should test compression in CSV writing in R once we easily can. Specifically, once this {arrowbench} issue (which in turn depends on this Arrow ticket) is addressed, the compression parameter can be added to CsvWriterBenchmark.
Traceback (most recent call last):
File "/var/lib/buildkite-agent/miniconda3/envs/arrow-commit/bin/conbench", line 5, in <module>
from conbenchlegacy.cli import conbench
File "/var/lib/buildkite-agent/miniconda3/envs/arrow-commit/lib/python3.8/site-packages/conbenchlegacy/cli.py", line 87, in <module>
instance = benchmark()
File "/var/lib/buildkite-agent/builds/ip-172-31-43-254-1/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/benchmarks/benchmarks/cpp_micro_benchmarks.py", line 107, in __init__
os.environ["CONBENCH_PROJECT_PR_NUMBER"] = self.github_info["pr_number"]
File "/var/lib/buildkite-agent/miniconda3/envs/arrow-commit/lib/python3.8/os.py", line 680, in __setitem__
value = self.encodevalue(value)
File "/var/lib/buildkite-agent/miniconda3/envs/arrow-commit/lib/python3.8/os.py", line 750, in encode
raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
Offending code:
benchmarks/benchmarks/cpp_micro_benchmarks.py
Line 107 in 079a920
This only happens during PR benchmarks (i.e. the "ursabot please benchmark" workflow).
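A minimal sketch of a fix for the assignment at that line (assuming github_info is a plain dict and pr_number arrives as an int during PR runs): cast the value to str before putting it in the environment, since os.environ only accepts strings.

# Sketch only; how a missing pr_number should be handled is left open.
pr_number = self.github_info.get("pr_number")
if pr_number is not None:
    os.environ["CONBENCH_PROJECT_PR_NUMBER"] = str(pr_number)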
Per #110 (comment), ideally we want to test core functionality (mostly in _benchmark.py) directly, instead of just via implemented and example benchmarks. Some non-exclusive options:
I wrote a doc about what is required to run the R benchmarks via BenchmarkR with arrowbench::run_benchmark() (which runs for each case) instead of arrowbench::run_one() (which runs for a single case). A major part of this is making sure the right data and metadata from each run flows around correctly such that eventually it can be POSTed to conbench, so the doc devotes a lot of time to benchmark result (at the case level) schemas.
The implication is probably moving to a more standardized and unified form of benchmark result schema across the different levels and languages, so please add comments with opinions on what that might look like.
In #81 we added tests for the Arrow TPC-H benchmarks. As part of that process {arrowbench} will (re)build duckdb ensuring that it has the tpch extension.
If we do the two following things, that testing time should come down to much closer to what it was before:
If we provided data files like data/customer_1.parquet (we could make a very, very small dataset and put it in this place, so long as the column names are the same), the data generation process will be short-circuited.
The tpch extension is also used at the verification stage. I can provide an option to turn off verification in {arrowbench} for testing purposes so that that does not trigger a duckdb re-build.
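For the first of these, a sketch of generating a tiny stand-in file with pyarrow (the column names follow the standard TPC-H customer schema; the exact names and path {arrowbench} expects should be confirmed first):

import pyarrow as pa
import pyarrow.parquet as pq

# One-row placeholder so tests skip TPC-H data generation.
table = pa.table({
    "c_custkey": [1],
    "c_name": ["Customer#000000001"],
    "c_address": ["placeholder"],
    "c_nationkey": [0],
    "c_phone": ["000-000-0000"],
    "c_acctbal": [0.0],
    "c_mktsegment": ["BUILDING"],
    "c_comment": ["placeholder row"],
})
pq.write_table(table, "data/customer_1.parquet")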
It seems the wide-dataframe benchmark expects a benchmarks/data/temp directory that doesn't exist.
$ conbench wide-dataframe
[...]
Traceback (most recent call last):
[...]
File "/home/antoine/arrow/benchmarks/benchmarks/wide_dataframe_benchmark.py", line 29, in run
self._create_if_not_exists(path)
File "/home/antoine/arrow/benchmarks/benchmarks/wide_dataframe_benchmark.py", line 46, in _create_if_not_exists
parquet.write_table(table, path)
File "/home/antoine/arrow/dev/python/pyarrow/parquet/core.py", line 3103, in write_table
with ParquetWriter(
File "/home/antoine/arrow/dev/python/pyarrow/parquet/core.py", line 1010, in __init__
sink = self.file_handle = filesystem.open_output_stream(
File "pyarrow/_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/home/antoine/arrow/benchmarks/benchmarks/data/temp/wide.parquet'. Detail: [errno 2] No such file or directory
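A likely fix (sketch, assuming the benchmark keeps writing to benchmarks/data/temp/): create the directory before writing the file.

import pathlib

# Ensure the temp data directory exists before parquet.write_table() is called.
path = pathlib.Path("benchmarks/data/temp/wide.parquet")
path.parent.mkdir(parents=True, exist_ok=True)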
This will demonstrate the benefit of the async feature. For the moment this probably only makes sense to run on EC2/S3.
Parallel to #62, which is to implement a CSV writer benchmark in Python. An R CSV writer benchmark has already been written in {arrowbench}, so this is just to add it here so it gets run, similar to #37 for reading. Settings should probably be parallel to #37.
Now that there is a CSV writer we should benchmark it:
# munge_compression() is a repo-local helper that normalizes the compression
# parameter for the given format before it is handed to pyarrow.
compression = munge_compression(compression, "csv")
out_stream = pyarrow.output_stream(path, compression=compression)
pyarrow.csv.write_csv(table, out_stream)
We would love to run continuous benchmarks for the Arrow JS library. We already have a benchmark setup with benchmark.js at https://github.com/apache/arrow/blob/master/js/perf/index.js. It would be awesome if there was a native integration into Conbench for results from JavaScript benchmarks.
Could you help us set up the benchmarks for JavaScript?
Arrow ticket for Conbench integration: https://issues.apache.org/jira/browse/ARROW-12690
Example build: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/builds/2347#018890c5-5406-469b-a411-3495092d1fe5
[230606-13:36:49.995] [22163] [benchclients.conbench] INFO: try to perform login
[230606-13:36:49.995] [22163] [benchclients.http] INFO: try: POST to https://conbench.ursa.dev/api/login/
[230606-13:36:50.132] [22163] [benchclients.http] INFO: POST request to https://conbench.ursa.dev/api/login/: took 0.1362 s, response status code: 204
[230606-13:36:50.132] [22163] [benchclients.conbench] INFO: ConbenchClient: initialized
[230606-13:36:50.132] [22163] [benchclients.http] DEBUG: POST request JSON body:
{
"run_id": "fc27335fd7364cd0816346a148bee7f4",
"batch_id": "fc27335fd7364cd0816346a148bee7f4-1n",
"timestamp": "2023-06-06T13:36:49.725925+00:00",
"context": {
"arrow_compiler_flags": "-fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /var/lib/buildkite-agent/.conda/envs/arrow-commit/include -fdiagnostics-color=always",
"benchmark_language": "R"
},
"info": {
"arrow_version": "13.0.0-SNAPSHOT",
"arrow_compiler_id": "GNU",
"arrow_compiler_version": "11.3.0",
"benchmark_language_version": "R version 4.2.3 (2023-03-15)",
"arrow_version_r": "12.0.0.9000"
},
"tags": {
"cpu_count": null,
"engine": "arrow",
"memory_map": false,
"query_id": "TPCH-01",
"scale_factor": 1,
"format": "native",
"language": "R",
"name": "tpch"
},
"optional_benchmark_info": {},
"github": {
"repository": "https://github.com/apache/arrow",
"pr_number": null,
"commit": "3d0172d40dfcf934308e6e1f4249a854004fe824"
},
"stats": {
"data": [
"0.396095",
"0.458458",
"0.453228"
],
"times": [],
"unit": "s",
"time_unit": "s",
"iterations": 3,
"mean": "0.435927",
"median": "0.453228",
"min": "0.396095",
"max": "0.458458",
"stdev": "0.034595",
"q1": "0.424661",
"q3": "0.455843",
"iqr": "0.031182"
},
"machine_info": {
"name": "ec2-m5-4xlarge-us-east-2",
"os_name": "Linux",
"os_version": "4.14.248-189.473.amzn2.x86_64-x86_64-with-glibc2.10",
"architecture_name": "x86_64",
"kernel_name": "4.14.248-189.473.amzn2.x86_64",
"memory_bytes": "65498251264",
"cpu_model_name": "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz",
"cpu_core_count": "8",
"cpu_thread_count": "16",
"cpu_l1d_cache_bytes": "32768",
"cpu_l1i_cache_bytes": "32768",
"cpu_l2_cache_bytes": "1048576",
"cpu_l3_cache_bytes": "37486592",
"cpu_frequency_max_hz": "0",
"gpu_count": "0",
"gpu_product_names": []
},
"run_name": "commit: 3d0172d40dfcf934308e6e1f4249a854004fe824",
"run_reason": "commit"
}
[230606-13:36:50.132] [22163] [benchclients.http] INFO: try: POST to https://conbench.ursa.dev/api/benchmark-results
[230606-13:36:50.371] [22163] [benchclients.http] INFO: POST request to https://conbench.ursa.dev/api/benchmark-results: took 0.2394 s, response status code: 200
[230606-13:36:50.372] [22163] [benchclients.http] INFO: unexpected response. code: 200, body bytes: <[
{
"id": "0647f25ae7197cbf8000d16e33d1a4bf",
"run_id": "d46b8964796e4429b39faf0dc15301ea",
"batch_id": "1cdc3a9dc9d04d7782901ce831f34e85",
"timestamp": "2023-06-06T12:21:00Z",
"tags": {
"name": "ReplaceWithMaskLowSelectivityBench",
"suite": "arrow-compute-vector-replace-benchmark",
"params": "16384/99",
"source": "cpp-micro"
},
"optional_bench ...>
Traceback (most recent call last):
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/bin/conbench", line 8, in <module>
sys.exit(conbench())
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/conbench/cli.py", line 149, in _benchmark
for result, output in benchmark().run(**kwargs):
File "/var/lib/buildkite-agent/builds/aws-ec2-m5-4xlarge-us-east-2-i-09bd8650b4486e15b-1/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/benchmarks/benchmarks/tpch_benchmark.py", line 30, in run
yield self.r_benchmark(command, tags, kwargs, case)
File "/var/lib/buildkite-agent/builds/aws-ec2-m5-4xlarge-us-east-2-i-09bd8650b4486e15b-1/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/benchmarks/benchmarks/_benchmark.py", line 304, in r_benchmark
return self.record(
File "/var/lib/buildkite-agent/builds/aws-ec2-m5-4xlarge-us-east-2-i-09bd8650b4486e15b-1/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/benchmarks/benchmarks/_benchmark.py", line 150, in record
benchmark, output = self.conbench.record(
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/conbench/runner.py", line 333, in record
self.publish(benchmark_result)
File "/var/lib/buildkite-agent/builds/aws-ec2-m5-4xlarge-us-east-2-i-09bd8650b4486e15b-1/apache-arrow/arrow-bci-benchmark-on-ec2-m5-4xlarge-us-east-2/benchmarks/benchmarks/_benchmark.py", line 83, in publish
self.conbench_client.post("/benchmark-results", benchmark)
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/benchclients/http.py", line 164, in post
resp = self._make_request("POST", self._abs_url_from_path(path), 201, json=json)
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/benchclients/http.py", line 205, in _make_request
result = self._make_request_retry_until_deadline(
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/benchclients/http.py", line 266, in _make_request_retry_until_deadline
result = self._make_request_retry_guts(
File "/var/lib/buildkite-agent/.conda/envs/arrow-commit/lib/python3.8/site-packages/benchclients/http.py", line 393, in _make_request_retry_guts
raise RetryingHTTPClientNonRetryableResponse(message=msg, error_response=resp)
benchclients.http.RetryingHTTPClientNonRetryableResponse: POST request to https://conbench.ursa.dev/api/benchmark-results: unexpected HTTP response. Expected code 201, got 200. Leading bytes of body: <[
{
"id": "0647f25ae7197cbf8000d16e33d1a4bf",
"run_id": "d46b8964796e4429b39faf0dc15301ea",
"batch_id": "1cdc3a9dc9d04d7782901ce831f34e8 ...>
stdout:
We started using nightly because:
FAILED benchmarks/tests/test_file_benchmark.py::test_read_r[parquet, snappy, table]
FAILED benchmarks/tests/test_file_benchmark.py::test_read_r[parquet, snappy, dataframe]
FAILED benchmarks/tests/test_file_benchmark.py::test_read_r[feather, lz4, table]
FAILED benchmarks/tests/test_file_benchmark.py::test_read_r[feather, lz4, dataframe]
FAILED benchmarks/tests/test_file_benchmark.py::test_write_r[parquet, snappy, table]
FAILED benchmarks/tests/test_file_benchmark.py::test_write_r[parquet, snappy, dataframe]
FAILED benchmarks/tests/test_file_benchmark.py::test_write_r[feather, lz4, table]
FAILED benchmarks/tests/test_file_benchmark.py::test_write_r[feather, lz4, dataframe]
Exception: Error: NotImplemented: Support for codec 'snappy' not built
Exception: Error: NotImplemented: Support for codec 'lz4' not built
> install.packages("arrow")
Installing package into '/home/jkeane/R/x86_64-pc-linux-gnu-library/4.1'
(as 'lib' is unspecified)
trying URL 'https://packagemanager.rstudio.com/all/__linux__/bionic/latest/src/contrib/arrow_6.0.0.2.tar.gz'
Content type 'binary/octet-stream' length 20338508 bytes (19.4 MB)
==================================================
downloaded 19.4 MB
* installing *binary* package 'arrow' ...
* DONE (arrow)
The downloaded source packages are in
'/tmp/RtmpJLf3lB/downloaded_packages'
> arrow_info()
Error in arrow_info() : could not find function "arrow_info"
> arrow::arrow_info()
Arrow package version: 6.0.0.2
Capabilities:
dataset TRUE
parquet TRUE
json TRUE
s3 FALSE
utf8proc TRUE
re2 TRUE
snappy FALSE
gzip FALSE
brotli FALSE
zstd FALSE
lz4 FALSE
lz4_frame FALSE
lzo FALSE
bz2 FALSE
jemalloc FALSE
mimalloc FALSE
To reinstall with more optional capabilities enabled, see
https://arrow.apache.org/docs/r/articles/install.html
As eventually we want to move the posting of results out of this package, this package needs to be able to save results to JSON in the same fashion {arrowbench} does. There are a few steps to this project, of which this is the first:
conbench.record() (once we've got a separate tool to do so) will likely take a little more work, but allow us to simplify the codebase a bit. This may not happen for a bit, but should be kept in mind during (2) especially so we can end up with a tidy codebase. Again, this task is only (1); the rest above is just for context.
Currently for R benchmarks, this repo passes cpu_count = NULL to run_one() (code), which then does not set the number of CPUs or threads anywhere (it omits that part of the script it creates). When run through higher-level arrowbench interfaces, cpu_count = NULL gets translated by get_default_parameters() to c(1L, parallel::detectCores()), which would create two cases for run_one() and be a problem.
In practice, not calling arrow:::SetCpuThreadPoolCapacity() means we're running with the default, which is the number of cores on the machine (pyarrow.cpu_count()). We should move to specifying this and recording it in tags. Right now the cpu_count key is in tags, but the value is empty. Changing this will break histories, but we should be able to adjust old records based on machine_info.cpu_core_count or machine_info.cpu_thread_count (I'm not exactly sure which we want, but they may not differ for any of the machines we're running on anyway).
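A sketch of making the default explicit and recording it (the pyarrow calls are real API; where exactly the tag gets set in this repo is left open):

import pyarrow

# Make the implicit default explicit and record it, instead of leaving the
# cpu_count tag empty. `tags` stands in for the benchmark's tag dict.
tags = {}
cpu_count = pyarrow.cpu_count()
pyarrow.set_cpu_count(cpu_count)
tags["cpu_count"] = cpu_count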
Because of the shift to running arrowbench directly from arrow-benchmarks-ci, it may be more pragmatic to break things as we switch over and then do the cleanup, but I'm opening this issue here because the problem is presently here, even if the fix ends up being some tweaks in arrowbench defaults and some database cleanup.
14:29:59 ± pytest -v -s -k serialize benchmarks/tests
=============================================================================== test session starts ================================================================================
platform linux -- Python 3.10.8, pytest-7.2.0, pluggy-1.0.0 -- /home/jp/.pyenv/versions/3108-vd-benchmarks/bin/python
cachedir: .pytest_cache
rootdir: /home/jp/dev/voltrondata-labs-benchmarks
collecting ... [221214-14:33:26.893] [361266] [benchmarks._sources] INFO: path does not exist: /home/jp/dev/voltrondata-labs-benchmarks/benchmarks/data/fanniemae_sample.csv
[221214-14:33:26.893] [361266] [benchmarks._sources] INFO: _get_object_url for idx 0
[221214-14:33:26.893] [361266] [benchmarks._sources] INFO: HTTP GET None
Narrowed this down to
benchmarks/benchmarks/_sources.py
Line 437 in 5ea34d7
def _get_object_url(self, idx=0):
if self.paths:
s3_url = pathlib.Path(self.paths[idx])
return (
"https://"
+ s3_url.parts[0]
+ ".s3."
+ self.region
+ ".amazonaws.com/"
+ os.path.join(*s3_url.parts[1:])
)
return self.store.get("source")
where, if self.paths evaluates to False, self.store.get("source") returns None.
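One possible direction (a sketch only, not necessarily the right fix): fail with a clear message when neither local paths nor a remote "source" URL are configured, instead of letting the download code attempt HTTP GET None. The S3 branch is elided below; _s3_url is a hypothetical placeholder for the existing URL construction.

def _get_object_url(self, idx=0):
    if self.paths:
        return self._s3_url(idx)  # existing S3 URL construction (elided)
    url = self.store.get("source")
    if url is None:
        # Surface a clear error instead of a downstream "HTTP GET None".
        raise ValueError("no local paths and no remote 'source' URL configured")
    return url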
One can (re)write a dataset (partitioned or not) without reading the full thing into memory with pyarrow. We currently have a benchmark that runs a filter on datasets.
We should create a new benchmark that is similar to the filtering one, but on top of filtering also writes the results out to a new dataset (instead of pulling them into a table like we do at
We might parameterize this over:
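For reference, a minimal pyarrow sketch of filtering a dataset and writing the result back out without materializing it as a table (the paths, column name, and filter value are illustrative):

import pyarrow.dataset as ds

# Scan the source dataset with a filter; the scanner streams record batches,
# so the filtered data is never collected into a single in-memory Table.
source = ds.dataset("path/to/source", format="parquet")
scanner = source.scanner(filter=ds.field("total_amount") > 10)

# Write the filtered stream to a new (optionally partitioned) dataset.
ds.write_dataset(scanner, "path/to/dest", format="parquet")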