
pdbufr's Introduction

pdbufr

pdbufr is a Python package implementing a Pandas reader for the BUFR format using ecCodes. It supports BUFR edition 3 and 4 files with uncompressed and compressed subsets. It works on Linux, macOS and Windows; the ecCodes C library is the only binary dependency. All modern versions of Python (>= 3.6) and PyPy3 are supported.
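For example, reading a few columns from a BUFR file into a DataFrame looks like this (a minimal sketch; obs.bufr is a placeholder file name):

import pdbufr

df = pdbufr.read_bufr(
    "obs.bufr",
    columns=("latitude", "longitude", "airTemperature"),
)
print(df.head())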

The documentation can be found at https://pdbufr.readthedocs.io/.

License

Copyright 2019- European Centre for Medium-Range Weather Forecasts (ECMWF).

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

pdbufr's People

Contributors

alexamici, iainrussell, nklever, sandorkertesz


pdbufr's Issues

KeyValueNotFoundError: Key/value not found for existing key

Dear @alexamici,

pdbufr looks really nice but it does not work with bufr data e.g. from DWD and Aemet.

df = pdbufr.read_bufr( 
         'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin', 
         columns=('airTemperature')
)

I receive KeyValueNotFoundError: Key/value not found when processing the attached file.

Using bufr_dump works as expected.

In case you do not have time to fix this issue, please let me know where the bug is likely to be and I will try to fix it myself. Thanks a lot and best regards.

Daniel

eccodes version 2.6.0

Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.zip

Do not fail when key value is not available in BUFR message

At the moment, if a BUFR key is present but its value cannot be accessed an eccodes.KeyValueNotFoundError exception is thrown and pdbufr fails without giving details about e.g. the key name involved.

Naturally, pdbufr should not fail in this case; but if it has to, e.g. for a mandatory key such as "edition", it should provide users with a better error message.
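A minimal sketch of the suggested behaviour, assuming an eccodes-style message mapping (read_key is a hypothetical helper, not pdbufr's actual code):

import gribapi.errors

def read_key(message, key, required=False):
    try:
        return message[key]
    except gribapi.errors.KeyValueNotFoundError:
        if required:
            # Name the offending key instead of failing opaquely.
            raise ValueError(f"cannot read value of mandatory BUFR key {key!r}")
        return None  # optional keys are simply skipped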

Related to issue #46

Test wave_1 fails

Test test_wave_1 fails with the following error:

tests/test_20_dataframe.py ............Fxxx [100%]

def test_wave_1():
    columns = ['data_datetime', 'longitude', 'latitude', 'significantWaveHeight']
>   res = pdbufr.read_bufr(TEST_DATA_6, columns=columns)

tests/test_20_dataframe.py:545:

pdbufr/__init__.py:259: in read_bufr
    return pd.DataFrame.from_records(filtered_iterator)
/lib/python3.6/site-packages/pandas/core/frame.py:1584: in from_records
    values += data
pdbufr/__init__.py:249: in filter_stream
    for data_items in extract_observations(subset_items, include_computed=included_keys):
pdbufr/__init__.py:208: in extract_observations
    yield add_computed(header + data_items, include_computed)
pdbufr/__init__.py:187: in add_computed
    (prefix + computed_key, computed_key, getter(observation, '', keys))
pdbufr/__init__.py:111: in datetime_from_bufr
    *[observation[prefix + k] for k in datetime_keys[:4]] + [minute, second],
    nanosecond=nanosecond
E   KeyError: 'year'

pdbufr/__init__.py:111: KeyError

Used:

  • pandas 0.25.2
  • eccodes-python 0.9.3

Do not run two CI tests when pushing to a pull request

Is your feature request related to a problem? Please describe.

When we create a PR in pdbufr and push new code into it two sets of CI tests are started automatically: one for the push and another one for the PR. Since they are identical only the PR CI tests should be running.


Fails to collect all the values of a given key from SYNOP message

What happened?

File c_85.bufr contains a single BUFR message with 2 airTemperature values. The structure is as follows:

[screenshot of the message structure from bufr_dump]

When we try to extract the airTemperature with the following code:

import pdbufr
df = pdbufr.read_bufr("c_85.bufr", columns=("airTemperature"))
print(df)

we only get the second value

   airTemperature
0             0.2

What are the steps to reproduce the bug?

See above.

Version

all

Platform (OS and architecture)

all


Organisation

ECMWF

columns argument does not accept single values

A filter expression such as this:

res = pdbufr.read_bufr(f,
    columns=('airTemperature'),
    filters={'heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform': [2.0]},
    )

does not work because ('airTemperature') is interpreted as a single string rather than a tuple; pdbufr then internally tries to turn it into a list, which in fact creates ['a', 'i', 'r', 'T', 'e', 'm', 'p', 'e', 'r', 'a', 't', 'u', 'r', 'e'] rather than ['airTemperature'].

The current workaround is to ensure a proper tuple is passed, e.g. columns=('airTemperature',)
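Internally, the fix could be a small normalization step; a minimal sketch (normalize_columns is a hypothetical helper, not pdbufr's actual code):

def normalize_columns(columns):
    # A bare string would otherwise be iterated character by character.
    if isinstance(columns, str):
        return [columns]
    return list(columns)

assert normalize_columns('airTemperature') == ['airTemperature']
assert normalize_columns(('airTemperature',)) == ['airTemperature']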

Reading content from an NCEP HRRR model bufr file

I'm a bit naïve about how bufr files are organized, but I hope that pdbufr can read sounding files from NCEP's HRRR model. Can you offer any tips on how to read and filter this type of file with pdbufr, if that is possible?

Here is an example file:
https://noaa-hrrr-bdp-pds.s3.amazonaws.com/hrrr.20210927/conus/bufrsnd.t00z/bufr.000000.2021092700

When I try to read this file with pdbufr,

FILE = "bufr.000000.2021092700"
pdbufr.read_bufr(FILE, columns=('latitude', 'longitude', 'airTemperature'))

I get HashArrayNoMatchError: Hash array no match. I wonder if this has to do with the subsets in the file structure.

When I list the contents of the file with bufr_ls, I get this...

bufr_ls bufr.000000.2021092700 | head

centre                     masterTablesVersionNumber  localTablesVersionNumber   typicalDate                typicalTime                numberOfSubsets            
kwbc                       29                         1                          20000000                   000000                     1                         
kwbc                       29                         1                          20000000                   000000                     1                         
kwbc                       29                         1                          20000000                   000000                     0                         
kwbc                       29                         0                          20210927                   000000                     91                        
kwbc                       29                         0                          20210927                   000000                     91                        
kwbc                       29                         0                          20210927                   000000                     91                        
kwbc                       29                         0                          20210927                   000000                     91                        
kwbc                       29                         0                          20210927                   000000                     91         

Support the count ecCodes key

The count key is generated by ecCodes and tells us the actual index (1-based) of a given message in the BUFR file. With this working we would be able to extract data only from a given message. E.g. we could use this code to get values only from the first message:

columns = ["airTemperature"]
filters = {"count": 1}
res = pdbufr.read_bufr(path, columns=columns, filters=filters)

Alternatively we would just need a separate option to specify the index of the messages we want to read with pdbufr. This option would be quite important for testing and for probing the BUFR contents.

WMO_station_id ignored in filters unless appears in columns

What happened?

Using e.g. SYNOP data:

df = pdbufr.read_bufr(path,
                      columns=["latitude", "longitude", "airTemperatureAt2M"],
                      filters={"WMO_station_id": [30846, 89514]})

does not apply the filter but extracts data from all the messages. To make filters work WMO_station_id has to be added to columns. For other keys this is not required.

What are the steps to reproduce the bug?

See above.

Version

latest

Platform (OS and architecture)

all


Merging separate requests differs from one joint request

Hello,

I started using pdbufr and came across a behaviour I don't understand. Maybe you can help me understand it.

The results differ when I am requesting several variables (e.g. temperature and wind) at once (A) or each variable by itself and merging the two (B).
Why do I not get the same result for both requests?

I am using pdbufr version 0.9.0.
Thanks!

(A)

result_A = pdbufr.read_bufr(file, columns=('ident', 'heightOfStationGroundAboveMeanSeaLevel',
                                           'typicalDate', 'typicalTime',
                                           'airTemperature',
                                           'windSpeed'),
                            filters={'masterTablesVersionNumber': 31})

(B)

result_B_temp = pdbufr.read_bufr(file, columns=('ident', 'heightOfStationGroundAboveMeanSeaLevel',
                                                'typicalDate', 'typicalTime',
                                                'airTemperature'),
                                 filters={'masterTablesVersionNumber': 31})
result_B_wind = pdbufr.read_bufr(file, columns=('ident', 'heightOfStationGroundAboveMeanSeaLevel',
                                                'typicalDate', 'typicalTime',
                                                'windSpeed'),
                                 filters={'masterTablesVersionNumber': 31})
result_B = pd.merge(result_B_temp, result_B_wind,
                    on=['ident', 'heightOfStationGroundAboveMeanSeaLevel', 'typicalDate',
                        'typicalTime'], how='outer')

I get

result_A.tail()
    typicalDate typicalTime  ... airTemperature  windSpeed
738    20210614      083000  ...            NaN        5.7
739    20210614      083000  ...         301.55        1.8
740    20210614      083000  ...            NaN        NaN
741    20210614      083000  ...            NaN        0.8
742    20210614      083000  ...            NaN        2.8
[5 rows x 6 columns]

and

result_B.tail()
    typicalDate typicalTime  ... airTemperature  windSpeed
963    20210614      083000  ...            NaN        NaN
964    20210614      083000  ...         291.75        0.8
965    20210614      083000  ...            NaN        0.8
966    20210614      083000  ...         297.55        2.8
967    20210614      083000  ...            NaN        2.8
[5 rows x 6 columns]

pdbufr does not parse data correctly.

I know dealing with BUFR is a mess, and my first impression of pdbufr is good. But it is not able to deal with one of the main problems of observation data decoded in BUFR. So my intention is to work on this issue together with you, to provide the world with a BUFR reader that is as good as cfgrib.

Take a look into the data I attached and you will find that airTemperature is defined multiple times within one subset. (From my experience the reports are separated into subsets.) So I thought I could use a pdbufr filter to access the right temperature.

df = pdbufr.read_bufr(
         'Z__C_EDZW_20210214100000_bda01,synop_bufr_GER_999999_999999__MW_536.bin',
              columns=('airTemperature'),
              filters={'heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform': [2.0]},
              required_columns=False)

This results in an empty DataFrame. Filtering on timePeriod is also necessary, but that does not work either.

In my own implementation

parsed_bufr_data = subprocess.run(
    f"{os.environ['BUFR_DUMP_PATH']} -jf {local_bufr_file}",
    stdout=subprocess.PIPE,
    check=True,
    shell=True,
).stdout

synop_df = pd.DataFrame(
    json.loads(
        parsed_bufr_data.decode("utf-8", errors="ignore")
    )[SYNOP_DATA_KEY_MESSAGES]
)

I use bufr_dump and parse the output, first as a bytes object and then as JSON, and load it into a DataFrame.
Then I loop through the lines and store each timePeriod and heightOfSensor information to map them to the measures. The rule is that the latest sensor information and/or time period information is valid for the value. I guess this behaviour should be implemented behind the filter function too.
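A minimal sketch of that mapping rule, assuming the subset has already been flattened into ordered (key, value) pairs (flat_subset_items is hypothetical, not the bufr_dump-based code above):

rows = []
sensor_height = None
time_period = None
for key, value in flat_subset_items:
    if key == 'heightOfSensorAboveLocalGroundOrDeckOfMarinePlatform':
        sensor_height = value
    elif key == 'timePeriod':
        time_period = value
    elif key == 'airTemperature':
        # The most recently seen coordinate values apply to this measure.
        rows.append({'airTemperature': value,
                     'sensorHeight': sensor_height,
                     'timePeriod': time_period})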

Why do I name this issue "pdbufr does not parse the data correctly"?
-> It is not clear which airTemperature is parsed (2 m or 0.05 m), but knowing this information is mandatory to parse the data correctly, from my point of view.

Another point: during my investigation of ecCodes+Python and bufr_dump, I found that bufr_dump is much faster compared to using the ecCodes Python interface (or what is suggested in the ecCodes documentation).

@alexamici

Z__C_EDZW_20210214100000_bda01.synop_bufr_GER_999999_999999__MW_536.zip

Message structure not identified correctly

pdbufr uses an in memory cache to identify and reuse the message structure as it is processing the messages in a BUFR file. Cache entries are identified by the following header keys and contain all the keys for a given message (structure):

"edition", "masterTableNumber", "numberOfSubsets","unexpandedDescriptors", "delayedDescriptorReplicationFactor"

So if there is already a cache entry for a given message, the list of keys is taken from the cache instead of using the key iterator to read them from the message.
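A minimal sketch of this caching scheme (illustrative, not pdbufr's actual code):

HEADER_KEYS = ("edition", "masterTableNumber", "numberOfSubsets",
               "unexpandedDescriptors", "delayedDescriptorReplicationFactor")

structure_cache = {}

def keys_for(message):
    # Lists (e.g. unexpandedDescriptors) are converted to tuples to be hashable.
    cache_key = tuple(tuple(v) if isinstance(v, list) else v
                      for v in (message[k] for k in HEADER_KEYS))
    if cache_key not in structure_cache:
        # Cache miss: walk the (slow) key iterator once for this structure.
        structure_cache[cache_key] = list(message.keys())
    return structure_cache[cache_key]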

The following BUFR file contains 2 messages:

https://get.ecmwf.int/repository/test-data/pdbufr/test-data/message_structure_diff_2.bufr

and according to pdbufr their structure is identical, because the values of the keys listed above are the same:

(4, 0, 1, 307096, 22061, 20058, 4024, 13012, 4024, 1, 0)

However, the first message contains more keys, as bufr_dump -p confirms:

This is the end of the first message:

#24#timePeriod=-1
depthOfFreshSnow=MISSING
#25#timePeriod=0

This is the end of the second message:

#20#timePeriod=-1
depthOfFreshSnow=MISSING
#21#timePeriod=0

The bottom line is that the message structure identification mechanism does not work correctly in pdbufr and has to be improved.

UnicodeDecodeError when parsing BUFR file from DWD

I haven't seen an open issue on this, forgive me if that's not the case.

I'm running the master version with eccodes v2.21.0.

I can successfully read the BUFR files from German weather stations here https://opendata.dwd.de/weather/weather_reports/synoptic/germany/ (like @meteoDaniel) but not the international ones here https://opendata.dwd.de/weather/weather_reports/synoptic/international/. In the latter case after doing this

df_stations = read_bufr('/tmp/latest.bin',
          columns=('stationOrSiteName',
                   'latitude',
                   'longitude',
                   'heightOfStationGroundAboveMeanSeaLevel',
                   'year', 'month', 'day', 'hour', 'minute',
                   ))

I get

~/miniconda3/lib/python3.8/site-packages/gribapi/gribapi.py in grib_get_string(msgid, key)
    489     err = lib.grib_get_string(h, key.encode(ENC), values, length_p)
    490     GRIB_CHECK(err)
--> 491     return ffi.string(values, length_p[0]).decode(ENC)
    492 
    493 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

I can successfully see the file content using grib_dump, but I would like to avoid having to dump everything into a JSON first :)

Make it possible to filter out all NaN values

Is your feature request related to a problem? Please describe.

I tried to use the "filters" flag of the read_bufr function to filter out NaN values.
My filter was a very simple lambda function: filter = lambda x: pandas.notna(x)

When I used it to get rid of missing data for a single parameter, it worked fine. But when I used it for many parameters, the returned pandas DataFrame shrank and did not contain the desired data anymore, or was even empty.

I suspect that this is due to the nature of the filter conditions. In the documentation, you mention that they are connected with logical AND: https://pdbufr.readthedocs.io/en/latest/read_bufr.html#combining-conditions

The problem for me is that without filtering I get a quite big DataFrame with many missing values which I have to get rid of afterwards. I've noticed that a lot of columns actually just contain NaN values.

Describe the solution you'd like

It would be nice to have the option to connect conditions with logical OR instead. Maybe that could already solve my problem.

Describe alternatives you've considered

Another solution I can imagine is having the option to apply the equivalent of "df.loc[:, parameter].notna().any()" to each column (parameter) before returning the DataFrame. If this condition returns False for a column, i.e. the column consists only of missing values, the column gets dropped.

Ideally, this would be done before the DataFrame is created internally.

Additional context

My solution for now is to call df.dropna(how="all") on both axes after I've created the DataFrame. But this is not a very efficient way to do it, especially for large amounts of data.
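For reference, the workaround as code, applied after read_bufr has returned:

# Drop rows and columns that consist entirely of missing values:
df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")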

Organisation

Meteo Service weather research

Question: How to extract two different values for pressure

This is a pdbufr (v0.10.2) usage question. Please excuse my novice knowledge of BUFR data; I'm just learning about this data type.

I'm reading a BUFR file that has two values for "pressure" for each observation; the two values are the top and bottom pressure level used to describe one observation.

The command bufr_dump -d <bufr_file.bufr> produces this output showing there are two different pressure levels:

...(more above)

031001  delayedDescriptorReplicationFactor      DELAYED DESCRIPTOR REPLICATION FACTOR [Numeric]
007004  pressure        PRESSURE [Pa]
007004  pressure        PRESSURE [Pa]
103000  103000  103000 [103000]
031001  delayedDescriptorReplicationFactor      DELAYED DESCRIPTOR REPLICATION FACTOR [Numeric]
008023  firstOrderStatistics    FIRST-ORDER STATISTICS [CODE TABLE]
011003  u       U-COMPONENT [m/s]
011004  v       V-COMPONENT [m/s]

...(more below)

But when I read the file with pdbufr, only one value for pressure is returned.

pdbufr.read_bufr(
    FILE,
    columns=["latitude", "longitude", "pressure", "u", "v"],
)

[screenshot of the resulting DataFrame showing a single pressure column]

Is there a way to target a specific pressure value (first or second) or return both?

Allow extracting data with repeated coordinate descriptors

Is your feature request related to a problem? Please describe.

Related to #51

The data in question has the following structure:

[screenshot of the message structure with repeated pressure descriptors]

And we would like to perform:

df = pdbufr.read_bufr(f, 
    columns=("pressure", "pressure", "u", "v"),
)

to get results like this:

49660,55520,2.2,-0.9
49660,55520,0.3,0.4


Allow to specify ecCodes key type

Some ecCodes keys, like centre, can be accessed both as a number and as a string. It would be great if pdbufr could support the ecCodes type notation (see e.g. bufr_ls) to specify the return type of the key values. The notation is simply based on adding a :[d,i,s] suffix to a key to specify a float, int or string return type. For example, for an ECMWF BUFR message, centre:s would result in "ecmf" while centre:i would result in 98.
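Hypothetical usage of the proposed notation (not currently supported by pdbufr; msg.bufr is a placeholder file name):

import pdbufr

# For an ECMWF message, centre:s would yield "ecmf" and centre:i would yield 98.
df = pdbufr.read_bufr("msg.bufr", columns=["centre:s", "centre:i", "latitude"])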

Add missing value handling options

Currently read_bufr does not offer control over missing values during the extraction and we have to filter the resulting Pandas dataframe to remove them.

Option 1

Add option missing_value_policy with the following values: "include", "ignore" (default="include")

df = pdbufr.read_bufr(...., missing_value_policy="ignore")

Option 2

Add option skip_missing as a bool (default=False)

df = pdbufr.read_bufr(...., skip_missing=True)

Option 3

Add option skip_na_values as a bool (default=False)

df = pdbufr.read_bufr(...., skip_na_values=True)

Column with string data not expanded correctly from compressed subsets

Hi,
First, I want to thank you for publishing pdbufr. It is awesome and is saving me so much headache.

I am reading aircraft data from EMADDC and my problem is that pdbufr seems to not be parsing the field aircraftRegistrationNumberOrOtherIdentification properly. To follow along with my example, you can get a sample BUFR file from this page (scroll all the way to the bottom and click EHS or MRAR; the MRAR file is much smaller).

import pdbufr

df = pdbufr.read_bufr(
    'EMADDC_KNMI_MRAR_20210909_1500_20210909_1514.bufr',
    columns=[
        'latitude',
        'aircraftRegistrationNumberOrOtherIdentification',
        'airTemperature',
        'numberOfSubsets',
    ]
)
df

[screenshot of the resulting DataFrame]

As you can see, the aircraft identifier is returned as a list of values in the whole subset instead of listing one item per row like the other variables (e.g., latitude, temperature). As far as I can tell, the first 100 rows are identical lists, then the next 100 rows are identical, etc.

My crude work-around is this: since I know there are a maximum of 100 items in each subset, I build a flat list of identifiers by concatenating the list stored in every 100th row of the DataFrame's aircraftRegistrationNumberOrOtherIdentification column. This is what I expected pdbufr to return for the column.

aircraft_id = []
for i in range(0, len(df), 100):
    subset_list = df['aircraftRegistrationNumberOrOtherIdentification'].iloc[i]
    aircraft_id += subset_list

df['aircraft_id'] = aircraft_id

[screenshot of the DataFrame with the expanded aircraft_id column]


Is this a bug, that pdbufr isn't parsing the aircraftRegistrationNumberOrOtherIdentification correctly, or am I missing a setting or function that unpacks it in this way?

Thanks for your help!

Embed high-level BUFR interface from ecCodes

We wish to remove the ecCodes high-level interface at some point, so in preparation for that, pdbufr should take a copy of it so that we are immune to that change in ecCodes.

can't get some values from ECMWF tf.bufr

What happened?

I have downloaded the .bufr from https://data.ecmwf.int/forecasts/20230905/00z/0p4-beta/oper/. I want to get [time, lat, lon, windspeed, pressure] for "13W", so I try this:
pdbufr.read_bufr(bufrFp, columns=("stormIdentifier", "timePeriod", "latitude", "longitude", "windSpeedAt10M", "pressureReducedToMeanSeaLevel"), filters={"stormIdentifier": "13W"})
but I get an empty DataFrame.
Then I delete "timePeriod" and "windSpeedAt10M" from the columns above, and I get 20 rows:
stormIdentifier latitude longitude pressureReducedToMeanSeaLevel
0 13W 23.4 116.8 99900.0
1 13W 23.5 117.0 100000.0
2 13W 23.5 116.1 100200.0
3 13W 22.8 115.2 100300.0
4 13W 23.4 114.6 100500.0
5 13W 23.4 114.5 100300.0
6 13W 23.5 112.9 100500.0
7 13W NaN NaN NaN
8 13W NaN NaN NaN
9 13W 24.0 112.0 100400.0
10 13W 23.6 111.7 100400.0
11 13W 24.0 109.8 100500.0
12 13W 24.0 109.8 100500.0
13 13W 23.9 109.4 100400.0
14 13W 23.6 107.0 100400.0
15 13W NaN NaN NaN
16 13W 24.4 108.6 100500.0
17 13W 23.7 106.9 100300.0
18 13W 23.7 107.0 100300.0
19 13W 23.7 106.9 100400.0

Then I delete "pressureReducedToMeanSeaLevel" from the columns, like this:
pdbufr.read_bufr(bufrFp, columns=("stormIdentifier", "latitude", "longitude"), filters={"stormIdentifier": "13W"})
and I get 41 rows:
stormIdentifier latitude longitude
0 13W 23.5 117.3
1 13W 23.4 116.8
2 13W 23.4 119.9
3 13W 23.5 117.0
4 13W 25.3 121.1
5 13W 23.5 116.1
6 13W 25.5 120.5
7 13W 22.8 115.2
8 13W 22.8 116.7
9 13W 23.4 114.6
10 13W 19.4 116.8
11 13W 23.4 114.5
12 13W 20.1 113.3
13 13W 23.5 112.9
14 13W 20.8 115.4
15 13W NaN NaN
16 13W NaN NaN
17 13W NaN NaN
18 13W NaN NaN
19 13W 24.0 112.0
20 13W 26.2 111.4
21 13W 23.6 111.7
22 13W 22.4 113.7
23 13W 24.0 109.8
24 13W 19.9 111.4
25 13W 24.0 109.8
26 13W 20.0 109.3
27 13W 23.9 109.4
28 13W 22.4 113.7
29 13W 23.6 107.0
30 13W 20.8 110.6
31 13W NaN NaN
32 13W NaN NaN
33 13W 24.4 108.6
34 13W 21.3 109.5
35 13W 23.7 106.9
36 13W 19.5 108.6
37 13W 23.7 107.0
38 13W 20.0 108.2
39 13W 23.7 106.9
40 13W 19.7 107.9

Oh no, why is this?
I used eccodes.codes_dump() to get a JSON file, but I can't understand the structure because it is nested on multiple levels. I can find 31 occurrences of "timePeriod" in the JSON with Ctrl-F, so why can't I get it in the first case above? And if some rows don't have "timePeriod", why isn't the value NaN? All in all, why do I get a different number of rows with different columns and the same filter?

some files are here:
https://github.com/DaiDai-Dad/temp/blob/main/20230905000000-240h-oper-tf.bufr
https://github.com/DaiDai-Dad/temp/blob/main/output.json

What are the steps to reproduce the bug?

see above

Version

0.11.0

Platform (OS and architecture)

Windows10


Key/value not found error for keys that exist

Hi,

I've recently started using pdbufr in my BUFR decoding. It had been working well until I tested on a larger sample of files, where I encountered the error:

gribapi.errors.KeyValueNotFoundError: Key/value not found

Whilst trying to extract using:

pdbufr.read_bufr(bufr_file, columns=('stationNumber', 'timePeriod', 'maximumWindGustDirection', 'maximumWindGustSpeed'))

The data are BUFR files provided by the Met Office. Using bufr_dump on (one of) the BUFR files I can see that the keys do exist, and I have determined that it is the timePeriod key causing the error:

A sample from bufr_dump showing the keys (they exist for all stations within the file):

[
  { "key" : "timePeriod", "value" : -10, "units" : "min" },
  { "key" : "maximumWindGustDirection", "value" : 190, "units" : "deg" },
  { "key" : "maximumWindGustSpeed", "value" : 4.9, "units" : "m/s" }
],
[
  { "key" : "timePeriod", "value" : -60, "units" : "min" },
  { "key" : "maximumWindGustDirection", "value" : 190, "units" : "deg" },
  { "key" : "maximumWindGustSpeed", "value" : 5.8, "units" : "m/s" }
],
[
  { "key" : "timePeriod", "value" : -180, "units" : "min" },
  { "key" : "maximumWindGustDirection", "value" : 210, "units" : "deg" },
  { "key" : "maximumWindGustSpeed", "value" : 8.4, "units" : "m/s" }
],

I noted issue #24 reporting a similar problem; however, I am using eccodes 2.29.0 built from source with pdbufr 0.9.0. The user in #24 stated that updating solved their issue, but in this case I was already using 2.16.0 and updated to 2.29.0 to see if that resolved it.

Any ideas what might be causing this error, given the keys exist in the file?

Thanks

Test using assert_frame_equal

The current tests generally assume complete numerical equality, down to the 15th decimal place at least, which is not realistic for testing on different systems. Using assert_frame_equal solves this, as it checks only up to the 5th decimal place by default, which is fine for all our values.
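A minimal sketch of the proposed comparison, where res and ref stand for the computed and reference frames:

from pandas.testing import assert_frame_equal

# rtol defaults to 1e-5, so tiny platform-dependent differences pass.
assert_frame_equal(res, ref)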

Memory is accumulated as BUFR messages are processed

For the temp.bufr file (500 KB, a few hundred messages) from the test samples, running a simple extraction results in a peak memory usage of 1.5 GB! Simple tests show that the memory allocated to the message handle is not released after the message is processed in the main message loop.
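A minimal sketch of the expected pattern, using eccodes directly (illustrative; pdbufr's internals differ):

import eccodes

with open("temp.bufr", "rb") as f:
    while True:
        handle = eccodes.codes_bufr_new_from_file(f)
        if handle is None:
            break
        try:
            pass  # process the message here
        finally:
            # Release the handle so memory is not accumulated per message.
            eccodes.codes_release(handle)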

Allow to use immutable message list objects

Is your feature request related to a problem? Please describe.

Currently, unpacking a message internally goes like this:

msg["unpack"] = 1

and the message object is required to be a mutable mapping. The idea is that pdbufr should also be able to use immutable mappings that provide an equivalent unpack method:

msg.unpack()


Organisation

ECMWF

pdbufr can take a very long time to filter certain BUFR files/messages

The following refers to a BUFR file in the ECMWF internal file system under my username: $PERM/public/i0tp_07062020_00.buf. It contains 364,359 messages and 715,925 subsets.
The following call to pdbufr takes around 41 minutes on the HPC:

import pdbufr

DIR = './'
filein = DIR + 'i0tp_07062020_00.buf'

df = pdbufr.read_bufr(
    filein,
    columns=("year", "month", "hour", "minute", "latitude", "longitude",
             "atmosphericPathDelayInSatelliteSignal"),
    filters={"stationOrSiteName": "S3AG-EUME"},
)
print(df)

I compared performance by running a pure eccodes-python script to iterate over the messages and unpack each one. That takes around 5 minutes.

I also modified the pdbufr code so as to avoid performing the actual filtering, and then it took around 5 minutes too.

I could see that certain sets of messages in the BUFR file were taking a lot longer to process than others. After putting a little profiling code into pdbufr and printing the indexes of the messages that were taking longer than 0.01 seconds to process, these were some of the message indexes (0-based) that consistently took longer:

count 36467 time 0.015330487862229347
count 82749 time 0.022938511800020933
count 110001 time 0.10010456619784236
count 110002 time 0.0794135918840766
… # most of the messages in between above and below
count 132562 time 1.6417972878552973
count 132563 time 0.033606610260903835
count 160917 time 0.04080211604014039
…
count 311422 time 0.02709425799548626
count 311423 time 0.010877656750380993
…
count 311619 time 1.4403185020200908
count 311620 time 5.845485555008054
…

Some standout times to process certain messages, which seemed consistent with multiple runs:

count 110774 time 15.490661825053394
count 110804 time 12.531519242096692
count 110878 time 10.05591939855367
count 110905 time 17.22438928298652
count 110909 time 14.911728173028678
count 110910 time 11.457727732136846
count 110911 time 12.165616693906486
count 110912 time 11.780960503034294
count 110913 time 9.73410841403529
count 110959 time 14.38328706100583
count 110960 time 0.0479189301840961
count 110961 time 12.562545870896429
count 110962 time 15.093519839923829
count 110963 time 16.94631726015359
count 110992 time 17.39802599698305
count 110993 time 18.96778273070231
count 110994 time 19.078917557373643
count 110996 time 0.017089318949729204
count 110997 time 13.097122138831764
count 110998 time 15.609233817085624
count 110999 time 12.680324002169073
count 111000 time 12.114437516778708
count 111001 time 19.27112303301692
count 111002 time 11.963714307174087
count 111196 time 25.567047986667603
count 111256 time 30.325322085991502
count 111552 time 39.79854283807799
count 111756 time 58.527211253065616
count 311590 time 16.124957408290356
count 311591 time 13.132242653053254

I copied a single message into $PERM/public/one_msg_110773.bfr (bufr_copy -wcount=110773 i0tp_07062020_00.buf one_msg_110773.bfr; this is the message at index 110774 in my 0-based printouts) and it takes 15 seconds to process by itself with the same pdbufr script. This particular message has what looks like 6632 time periods encoded in it, and the single message occupies almost 140KB on disk.

Interestingly, calling bufr_dump on this message took only 0.6 seconds:

time bufr_dump -j a one_msg_110773.bfr >b.dump

Support rank notation in ecCodes keys

ecCodes supports the #n#key notation to get the n-th value for a given key in a BUFR message. It would be great if pdbufr allowed for this notation. E.g. this code

columns = ["latitude", "longitude", "#1#airTemperature"] 
res = pdbufr.read_bufr(path, columns=columns)

could return only the first airTemperature value from each message in a radiosonde BUFR file.

Improve filter speed

The performance of the BUFR filter should be improved. It is currently 4-5 times slower than the BUFR filter in Metview Python (which is based on a C++ wrapper around ecCodes), and that is itself slower than the bufr_filter ecCodes command line tool. The following test case illustrates the problem:

File test.bufr contains 3927 synop messages and we want to extract the 2m temperature values from it. This is the test code in Metview Python:

import metview as mv
f=mv.read('test.bufr')
gpt = mv.obsfilter(data=f,
    output="csv", 
    parameter='airTemperatureAt2M'
)
res= gpt.to_dataframe()
print(len(res))

and this is the code with pdbufr:

import pdbufr
f = 'test.bufr'
res = pdbufr.read_bufr(f, columns=('latitude', 'longitude', 'airTemperatureAt2M'))
print(len(res))

The execution time is as follows:

  • Metview Python: 2.788s
  • pdbufr: 11.861s

Test failure with pandas 2.1.1

What happened?

Test 'test_sat_compressed_1' fails with this error:

>       assert_frame_equal(res[0:1], ref_1[res.columns])
E       AssertionError: Attributes of DataFrame.iloc[:, 6] (column name="data_datetime") are different
E
E       Attribute "dtype" are different
E       [left]:  datetime64[ns]
E       [right]: datetime64[s]

This test works with pandas 2.0.3, but fails with 2.1.1.
This looks like the reason: pandas-dev/pandas#52212

In our tests, we use various ways to create a reference pandas DataFrame. This particular test uses a way that used to generate an 'ns' datetime, but with the new pandas version it generates an 's' datetime. The solution is simply to specify 'ns' when creating the reference datetimes.
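A minimal sketch of the fix, applied to the reference frame from the failing assertion (illustrative only):

# Pin the reference column to nanosecond resolution so the comparison
# behaves the same on pandas 2.0 and 2.1:
ref_1["data_datetime"] = ref_1["data_datetime"].astype("datetime64[ns]")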

What are the steps to reproduce the bug?

pytest -s -k test_sat_compressed_1

Version

0.11.0

Platform (OS and architecture)

Linux and MacOS

Relevant log output

No response

Accompanying data

No response

Organisation

ECMWF
