tarb / betfair_data

Fast Python Betfair historical data file parser

Home Page: https://betfair-datascientists.github.io/tutorials/jsonToCsvRevisited/

License: MIT License

Rust 81.80% Python 14.93% PowerShell 2.45% Shell 0.82%

python rust betfair betfair-data

betfair_data's Issues

Add streaming_update / streaming_unique_id to Market

Is it possible to add the raw streaming_update to the Market object?

>> market.streaming_update
{"id":"1.196641872","rc":[{"atb":[[1.11,0]],"id":40849650},{"atb":[[1.11,0]],"id":40550684},{"atb":[[1.13,1.99]],"id":40570484}]}

It would also be nice to be able to attach a unique id to the file. I'm not sure how this should be done, or whether it could be set after the object is created:

market.streaming_unique_id = 123

In reference to #1, flumine uses the two fields above to optimise its simulation logic.

Bz2 Files that are missing MarketType are not parsed into CSV. Possible to parse with MarketType = NaN?

I am attempting to parse Betfair historical stream data (bz2 files) to CSV using the betfair_data module in the Betfair Lightweight format.

However, when parsing the bz2 files, I receive the message “(JSON Parse Error) missing required field at line X column Y“, and consequently the markets that trigger this error are not included in my final CSV.

I was wondering whether it would be possible to modify the module so that bz2 files missing the marketType entry are still parsed with market_type = NaN, rather than being excluded.

As a side note, I have confirmed that this issue is indeed arising because in several of the raw bz2 files provided by Betfair, the marketType entry is missing.

Any help that could be provided would be much appreciated!
Cheers!

Skip bad lines

I tried using this module on data I collected manually with betfairlightweight. The problem is that my files contain market book data as the first line of the file (which is necessary to obtain selection names). I thought the parser would just skip over the unrecognized line and perhaps warn about it, but instead it crashes and burns.

Would it be possible to update the parser so that it skips over unrecognized lines instead of tapping out? If the current behaviour is desired, how about adding a skip_bad_lines option to give end users a choice? I would imagine I am not the only person in this situation.
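Until something like skip_bad_lines exists, one possible workaround is to filter the lines before handing them to the parser. This is a minimal sketch using only the standard library; keep_stream_lines is a hypothetical helper, not part of betfair_data, and the sample market book line is an invented shape used purely for illustration.

```python
import json
from typing import Iterable, Iterator


def keep_stream_lines(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield only lines that parse as JSON objects with op == "mcm"."""
    for line in lines:
        try:
            message = json.loads(line)
        except ValueError:
            continue  # not valid JSON (or not valid UTF-8): skip it
        if isinstance(message, dict) and message.get("op") == "mcm":
            yield line


raw = [
    b'{"marketId": "1.23", "runners": []}',  # hypothetical market book line
    b'{"op":"mcm","pt":1,"mc":[]}',          # stream update, kept
    b"not json at all",                      # garbage, dropped
]
cleaned = b"\n".join(keep_stream_lines(raw))
print(cleaned)
```

Whether the cleaned bytes can then be passed straight to the parser depends on how it handles uncompressed input, which is not confirmed here.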

Error when version in marketDefinition greater than uint32

I am getting an error parsing the first line of a self-collected betfair stream. The error is "Failed to parse version field: Value out of range: 4295094959." This appears to be greater than the allowed size of uint32.

I am using betfair-data 0.3.4 (the latest from pip) running in a Miniconda environment on a MacBook Pro M1, macOS Ventura 13.2.1.

The JSON of the line is below.

{
	"mc": [{
		"id": "1.193531652",
		"img": true,
		"marketDefinition": {
			"venue": "Perry Barr",
			"inPlay": false,
			"status": "OPEN",
			"eventId": "31182317",
			"runners": [{
				"id": 42396416,
				"status": "ACTIVE",
				"sortPriority": 1
			}, {
				"id": 42052577,
				"status": "ACTIVE",
				"sortPriority": 2
			}, {
				"id": 38557499,
				"status": "ACTIVE",
				"sortPriority": 3
			}, {
				"id": 36648358,
				"status": "ACTIVE",
				"sortPriority": 4
			}, {
				"id": 39069859,
				"status": "ACTIVE",
				"sortPriority": 5
			}, {
				"id": 38757603,
				"status": "ACTIVE",
				"sortPriority": 6
			}],
			"version": 4295094959,
			"betDelay": 0,
			"complete": true,
			"openDate": "2022-01-18T11:06:00.000Z",
			"timezone": "Europe/London",
			"bspMarket": true,
			"marketTime": "2022-01-18T11:06:00.000Z",
			"marketType": "WIN",
			"regulators": ["MR_INT"],
			"bettingType": "ODDS",
			"countryCode": "GB",
			"eventTypeId": "4339",
			"suspendTime": "2022-01-18T11:06:00.000Z",
			"bspReconciled": false,
			"crossMatching": false,
			"marketBaseRate": 5,
			"discountAllowed": false,
			"numberOfWinners": 1,
			"runnersVoidable": false,
			"turnInPlayEnabled": false,
			"persistenceEnabled": false,
			"numberOfActiveRunners": 6,
			"priceLadderDefinition": {
				"type": "CLASSIC"
			}
		}
	}],
	"op": "mcm",
	"pt": 1642451002939
}
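The range problem can be confirmed with plain arithmetic. The short check below only demonstrates that the reported version exceeds the unsigned 32-bit maximum and fits in 64 bits; fits_unsigned is a hypothetical helper, and this says nothing about how the parser stores the field internally.

```python
def fits_unsigned(value: int, bits: int) -> bool:
    """Return True if value is representable as an unsigned integer of this width."""
    return 0 <= value <= 2**bits - 1


version = 4295094959  # the version from the marketDefinition above

print(2**32 - 1)                   # largest uint32 value
print(fits_unsigned(version, 32))  # does it fit in 32 bits?
print(fits_unsigned(version, 64))  # does it fit in 64 bits?
```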

Only a few market objects per event ID are returned given price data files (tar or bz2)

When using files = bfd.Files(paths), where paths is a list of .tar file paths, only a few market objects are returned per event ID and most are missing.

I also tried using

with open(path, "rb") as file:
    ff = bfd.File(path, file.read())
for market in ff:
    market....

The above returns the same thing. I think it has something to do with the bz2 price files provided by Betfair containing '\n' separators, so the file as a whole is not valid JSON; could that be causing the reader to miss most lines of data?

multiprocessing market files

Could someone provide an example to get around the GIL? I run into "TypeError: cannot pickle 'builtins.File' object" with both joblib and pathos.
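One common way around this kind of pickling error is to send only file paths (plain strings) to the worker processes, build the parser objects inside each worker, and return plain Python data. The sketch below uses a standard-library line count as a stand-in for the real parsing; count_updates and count_all are hypothetical names, not part of betfair_data.

```python
import bz2
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple


def count_updates(path: str) -> Tuple[str, int]:
    """Worker: open the file inside the process and return picklable data."""
    # Stand-in for the real work. With betfair_data, this is where the
    # File object would be created and iterated, so it never crosses
    # a process boundary.
    with bz2.open(path, "rb") as fh:
        return path, sum(1 for line in fh if line.strip())


def count_all(paths: List[str], workers: int = 4) -> List[Tuple[str, int]]:
    """Fan the paths out to worker processes and collect the results."""
    # On platforms that spawn processes (Windows, macOS), call this from
    # under an `if __name__ == "__main__":` guard.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_updates, paths))
```

Only strings go in and tuples come out, so nothing unpicklable has to be transferred.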

last traded price lacks some info due to the logic of the parser

In the Betfair BASIC files, if a horse has not been traded at some timepoint, it doesn't appear in the JSON line for that timepoint,

whereas in your code, IIUC, it is rolled forward to every timepoint. That is technically correct, as it is still the last traded price(!), but we lose the information about when it was actually last traded at that price.

e.g market file "1.212040273.bz2"

For horse id 41133899, the last traded price 2.94 is rolled over into the next update.

with betfair_data
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:20:21.962000', 278.038, False, 'ACTIVE', '2. Crosshaven', 41133899, 3.05, None)
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:21:21.961000', 218.039, False, 'ACTIVE', '2. Crosshaven', 41133899, 2.94, None)
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:22:22.001000', 157.999, False, 'ACTIVE', '2. Crosshaven', 41133899, 2.94, None)
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:23:21.968000', 98.032, False, 'ACTIVE', '2. Crosshaven', 41133899, 3.1, None)

with my R parser (much slower than yours!)

   rc.ltp    rc.id                  pt      mkt.id mc.version
1:   2.92 41133899 2023-04-01 01:15:21 1.212040273         NA
2:   2.94 41133899 2023-04-01 01:17:21 1.212040273         NA
3:   2.92 41133899 2023-04-01 01:18:22 1.212040273 5147358276
4:   3.05 41133899 2023-04-01 01:20:21 1.212040273         NA
5:   2.94 41133899 2023-04-01 01:21:21 1.212040273         NA
6:   3.10 41133899 2023-04-01 01:23:21 1.212040273         NA
7:   3.20 41133899 2023-04-01 01:24:21 1.212040273         NA
8:   3.05 41133899 2023-04-01 01:25:21 1.212040273 5147363556

You can see there is no price update at 01:22:21, as there has been no trade on this horse from 01:21:21 to 01:22:21.

I can't find the logic in your code where you roll forward the last_traded_price, but it would be great if you could fix that or provide something like a "last ever traded price" alongside the real last traded price as it appears in the files, with the correct timestamp.
So maybe one fix would be to have both letp and ltp, with ltp being None when it wasn't in the JSON and letp being the value you roll forward?

Thank you for that cool package

PS: a wrong solution is to record only ltp changes but two equal consecutive values could be legit trades!
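The ltp / letp split suggested above could look roughly like this. It is a sketch over simplified stream messages, not the parser's actual logic: ltp_series is a hypothetical helper, the timestamps are shortened to strings, and the second runner id is invented for the example.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Row = Tuple[Any, Optional[float], Optional[float]]


def ltp_series(
    messages: Iterable[Dict[str, Any]], market_id: str, selection_id: int
) -> Iterator[Row]:
    """Yield (pt, ltp, letp): ltp is None unless present in this message."""
    letp: Optional[float] = None
    for message in messages:
        ltp: Optional[float] = None
        for mc in message.get("mc", []):
            if mc.get("id") != market_id:
                continue
            for rc in mc.get("rc", []):
                if rc.get("id") == selection_id and "ltp" in rc:
                    ltp = rc["ltp"]
        if ltp is not None:
            letp = ltp  # roll forward only when a real value arrived
        yield message["pt"], ltp, letp


MARKET, HORSE = "1.212040273", 41133899
messages = [
    {"pt": "01:20:21", "mc": [{"id": MARKET, "rc": [{"id": HORSE, "ltp": 3.05}]}]},
    {"pt": "01:21:21", "mc": [{"id": MARKET, "rc": [{"id": HORSE, "ltp": 2.94}]}]},
    # No entry for this horse: another (invented) runner traded instead.
    {"pt": "01:22:22", "mc": [{"id": MARKET, "rc": [{"id": 12345, "ltp": 5.0}]}]},
    {"pt": "01:23:21", "mc": [{"id": MARKET, "rc": [{"id": HORSE, "ltp": 3.1}]}]},
]
rows = list(ltp_series(messages, MARKET, HORSE))
for row in rows:
    print(row)
```

With this shape, two consecutive equal ltp values still show up as two separate trades, which addresses the PS above.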

Parser Filter

I am not sure whether this will have any speed benefit in your implementation. However, in flumine we have some code which ignores (doesn't output) data that doesn't meet user requirements, for example inplay and seconds_to_start.

This gets passed down to bflw and gives a huge speed improvement in the Python code, so I am wondering if it would be the same for betfair_data. Here is the code I have added, which does the same but only after the data has been processed by the Rust code:

    def _read_loop(self) -> list:
        # listener_kwargs filtering
        in_play = self.listener_kwargs.get("inplay")
        seconds_to_start = self.listener_kwargs.get("seconds_to_start")
        cumulative_runner_tv = self.listener_kwargs.get("cumulative_runner_tv", False)
        if in_play is None and seconds_to_start is None:
            process_all = True
        else:
            process_all = False
        # process files
        files = betfair_data.bflw.Files(
            [self.market_filter],
            cumulative_runner_tv=cumulative_runner_tv,
            streaming_unique_id=self.stream_id,
        )
        for file in files:
            for update in file:
                if process_all:
                    yield update
                else:
                    for market_book in update:
                        if market_book.status == "OPEN":
                            if in_play:
                                if not market_book.inplay:
                                    continue
                            elif seconds_to_start:
                                _seconds_to_start = (
                                    market_book.market_definition.market_time
                                    - market_book.publish_time
                                ).total_seconds()
                                if _seconds_to_start > seconds_to_start:
                                    continue
                            if in_play is False:
                                if market_book.inplay:
                                    continue
                        yield [market_book]

Rust install required to install Module via PIP

A relatively simple quality-of-life suggestion: in the prerequisites section of the README, add:

Install latest version of Rust:

https://www.rust-lang.org/tools/install

Ensure it is added to the path variable

Open Powershell
Run this command:

$env:Path += ';C:\Users\<USER>\.cargo\bin'

Replace <USER> with your username.

Proceed to install.

Also, it does not seem to work with Python 3.10.x; I had to downgrade to 3.7.12 to get it running.

Refactor API to be more pythonic

Is there any scope to refactor the API/README so that its use is more pythonic? For example, it is currently tricky to read and understand what processing is being done and where:

for file in bfd.Files(paths).iter():
    market_count += 1
    for market in file:
        update_count += 1

Small change to declare the files first before iterating:

files = bfd.Files(paths)

for file in files.iter():
    market_count += 1
    for market in file:
        update_count += 1

This could be improved further by implementing Python's __iter__ method, removing the need to call .iter() entirely.

files = bfd.Files(paths)

for file in files:
    market_count += 1
    for market in file:
        update_count += 1
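For reference, supporting both spellings in pure Python only takes an __iter__ method that delegates to the existing one. The class below is a hypothetical stand-in that yields paths rather than parsed files; how the Rust-backed class would expose the same protocol is not shown here.

```python
from typing import Iterator, List


class PathCollection:
    """Hypothetical stand-in for a Files-like container."""

    def __init__(self, paths: List[str]) -> None:
        self._paths = list(paths)

    def iter(self) -> Iterator[str]:
        # Existing explicit entry point, kept for backwards compatibility.
        return iter(self._paths)

    def __iter__(self) -> Iterator[str]:
        # Lets `for file in files:` work without calling .iter().
        return self.iter()


files = PathCollection(["a.bz2", "b.bz2"])
for file in files:
    print(file)
```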

bflw processing can stay the same

files = bfd.Files(paths)

for file in files.bflw():
    market_count += 1
    for market in file:
        update_count += 1

Is there any reason why the mutable flag isn't available at the Files level?

(IO Error) unsupported file type

Although they are normally gzipped, my raw files are named like this:

resources/PRO-1.170258213
resources/1.170258213
resources/170258213

I see that handle_file validates the extension. I'm not sure what the 'Rust' way of doing things is, but could it auto-detect the file type or, as a last resort, read the file as plain text/JSON?
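Auto-detection of this kind is usually done by looking at the first few bytes of the file rather than its name. A minimal Python sketch of the idea follows; sniff_compression and read_any are hypothetical helpers, and only gzip and bz2 are actually decompressed.

```python
import bz2
import gzip


def sniff_compression(head: bytes) -> str:
    """Guess the container format from the leading magic bytes."""
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"BZh"):
        return "bz2"
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    return "plain"  # last resort: treat as uncompressed text/JSON


def read_any(data: bytes) -> bytes:
    """Return the decompressed payload regardless of the file extension."""
    kind = sniff_compression(data[:4])
    if kind == "gzip":
        return gzip.decompress(data)
    if kind == "bz2":
        return bz2.decompress(data)
    if kind == "zip":
        raise ValueError("zip archives are not handled in this sketch")
    return data


payload = b'{"op":"mcm","pt":1,"mc":[]}\n'
print(sniff_compression(gzip.compress(payload)))
print(sniff_compression(bz2.compress(payload)))
print(sniff_compression(payload))
```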

(JSON Parse Error) unknown field `batb`

I am trying to open the original downloaded 2020 Tennis Advanced data and am getting the error message

(JSON Parse Error) unknown field `batb`, expected one of id, atb, atl, spn, spf, spb, spl, trd, tv, ltp, hc at line 1 column 7

The very same market loads with no error using the original downloaded 2020 Tennis Basic data.

I am OK with (some) extra fields in the Advanced and Pro data not being supported, but those sets also contain more updates for the market. So it would be beneficial to be able to load that data, ignoring the unknown fields, instead of aborting the parse.
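As a stopgap, the unknown fields could be stripped from each line before parsing, at the cost of losing that data and of an extra JSON round trip. The sketch below takes the known field names from the error message above; strip_unknown_runner_fields is a hypothetical helper and the sample line is invented.

```python
import json

# Field names listed in the parser's error message.
KNOWN_RC_FIELDS = {
    "id", "atb", "atl", "spn", "spf", "spb", "spl", "trd", "tv", "ltp", "hc",
}


def strip_unknown_runner_fields(line: str) -> str:
    """Drop runner-change keys the parser does not recognise."""
    message = json.loads(line)
    for mc in message.get("mc", []):
        if "rc" in mc:
            mc["rc"] = [
                {k: v for k, v in rc.items() if k in KNOWN_RC_FIELDS}
                for rc in mc["rc"]
            ]
    return json.dumps(message, separators=(",", ":"))


line = '{"op":"mcm","pt":1,"mc":[{"id":"1.1","rc":[{"batb":[[0,1.5,10]],"id":7}]}]}'
print(strip_unknown_runner_fields(line))
```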

Add support for LZMA compression

Would it be possible to add support for LZMA-compressed ZIP files? I did some benchmarking and found out that LZMA algorithm provides very similar compression ratios compared to BZ2, but decompression time is 3x faster. It is my algorithm of choice for compressing my own collected data and I would love to have this package support it.

Here are the results of some benchmarks. The numbers are: compression ratio, compression time, decompression time. Since the compression is done only once, the compression time is not relevant. However, the format is clearly superior to BZ2 which is what Betfair uses to provide data (probably because they don't bother to decompress it).

Collected data:
GZIP: 15.1 %, 5148 ms, 155 ms.
ZIP: 16.0 %, 1792 ms, 140 ms.
BZIP: 10.4 %, 4441 ms, 995 ms.
LZMA: 10.6 %, 15809 ms, 350 ms.

Betfair PRO data:
GZIP: 11.8 %, 2625 ms, 150 ms.
ZIP: 12.6 %, 884 ms, 121 ms.
BZIP: 7.5 %, 4676 ms, 841 ms.
LZMA: 8.3 %, 12055 ms, 266 ms.
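A benchmark along these lines can be reproduced with the standard library alone. The sketch below is not the script that produced the numbers above: it covers only GZIP, BZIP and LZMA (no ZIP container), uses default compression settings, and runs on a small synthetic sample, so its figures will differ.

```python
import bz2
import gzip
import lzma
import time
from typing import Dict, Tuple


def benchmark(data: bytes) -> Dict[str, Tuple[float, float, float]]:
    """Return {codec: (ratio in %, compress ms, decompress ms)}."""
    codecs = {
        "GZIP": (gzip.compress, gzip.decompress),
        "BZIP": (bz2.compress, bz2.decompress),
        "LZMA": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        assert restored == data  # sanity check: lossless round trip
        ratio = 100.0 * len(packed) / len(data)
        results[name] = (ratio, (t1 - t0) * 1000.0, (t2 - t1) * 1000.0)
    return results


# Synthetic, highly repetitive sample; real stream files compress differently.
sample = b'{"op":"mcm","pt":1642451002939,"mc":[{"id":"1.1","rc":[]}]}\n' * 5000
results = benchmark(sample)
for name, (ratio, c_ms, d_ms) in results.items():
    print(f"{name}: {ratio:.1f} %, {c_ms:.0f} ms, {d_ms:.0f} ms.")
```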

Object creation support - create constructor or turn to DataClass?

Again, apologies if this is just a knowledge gap on my part, but can we create objects directly by passing in values?
i.e. passing in the raw dict data of a native betfairlightweight object

new_market_book = bflw.MarketBook(**market_book._data)

gives TypeError: No constructor defined

This raises the question of whether we should be able to create objects at all, or whether there is no need given that the native bflw library is available.

What version of rust is needed to build?

I see you're using various unstable features but the code won't compile with the latest nightly version of rust

(JSON Parse Error) invalid type: null, expected a borrowed string at line 1 column 22

I recorded a bunch of markets with flumine. About half of them have the first clk as null. bfd can process the file just fine after changing null to "null" or any other string:

{"op":"mcm","clk":null,"pt":1720183153764,"mc":[{"id":"1.230347142","marketDefinition":{"bspMarket":false,"turnInPlayEnabled":true,"persistenceEnabled":true,"marketBaseRate":5,"eventId":"33388788","eventTypeId":"1","numberOfWinners":1,"bettingType":"ODDS","marketType":"OVER_UNDER_35","marketTime":"2024-07-05T15:30:00.000Z","suspendTime":"2024-07-05T15:30:00.000Z","bspReconciled":false,"complete":true,"inPlay":false,"crossMatching":true,"runnersVoidable":false,"numberOfActiveRunners":2,"betDelay":0,"status":"OPEN","runners":[{"status":"ACTIVE","sortPriority":1,"id":1222344},{"status":"ACTIVE","sortPriority":2,"id":1222345}],"regulators":["MR_INT"],"countryCode":"LT","discountAllowed":true,"timezone":"GMT","openDate":"2024-07-05T15:30:00.000Z","version":6004374642 ...
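The manual workaround described above (replacing the null clk with a string) can be automated as a pre-processing step. This is a sketch using only the standard library; patch_null_clk is a hypothetical helper, and the placeholder value is arbitrary since any string reportedly works.

```python
import json


def patch_null_clk(line: str, placeholder: str = "null") -> str:
    """Replace a top-level `"clk": null` with a string placeholder."""
    message = json.loads(line)
    if "clk" in message and message["clk"] is None:
        message["clk"] = placeholder
    return json.dumps(message, separators=(",", ":"))


line = '{"op":"mcm","clk":null,"pt":1720183153764,"mc":[]}'
patched = patch_null_clk(line)
print(patched)
```

Lines whose clk is already a string pass through with their values unchanged.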

Field Typing and missing attributes

Apologies if this is incorrect, as I don't have any clue about how Rust works with Python, but I have spotted the following things in the bflw file:

datetime is incorrectly typed
unused fields (I think they are not implemented) are not correctly typed, such as orders, matches, line max/min, and key line definitions
the order of classes in the file (I'm not sure if this is a non-issue, but my interpreter doesn't like it)
some of the typing on the object references was off

Again, apologies if none of this makes sense for how Rust works with Python. Below I've made some changes and added some comments. It would be good to see whether any of these changes make sense:

from datetime import datetime
from typing import Iterator, Optional, Sequence, List, Dict, Any
import betfair_data

class BflwAdapter(Iterator[betfair_data.File]): ...


class MarketDefinitionRunner:
    adjustment_factor: Optional[float] = None
    bsp: Optional[float] = None
    handicap: float
    name: Optional[str] = None
    removal_date: Optional[datetime] = None
    selection_id: int
    sort_priority: int
    status: str

class MarketDefinition:
    bet_delay: int
    betting_type: str
    bsp_market: bool
    bsp_reconciled: bool
    complete: bool
    country_code: str
    cross_matching: bool
    discount_allowed: bool
    each_way_divisor: Optional[str] = None #TODO added missing
    event_id: str
    event_name: Optional[str] = None # changed from eventName
    event_type_id: str
    in_play: bool
    key_line_definitions: Optional[Dict[str, Any]] = None #TODO added create objs? seems to be different from market_book version
    line_interval: Optional[float] = None #TODO lineInterval: Optional[float] = None
    line_max_unit: Optional[float] = None #TODO lineMaxUnit: Optional[float] = None
    line_min_unit: Optional[float] = None #TODO lineMinUnit: Optional[float] = None
    market_base_rate: float
    market_time: datetime
    market_type: str
    name: Optional[str] = None
    number_of_active_runners: int
    number_of_winners: int
    open_date: datetime
    persistence_enabled: bool
    price_ladder_definition: Dict[str, Any] #TODO create an obj? priceLadderDefinition: Dict[str, str]
    race_type: str # raceType: Optional[str] = None
    regulators: List[str]
    runners: List[MarketDefinitionRunner]
    runners_voidable: bool
    settled_time: Optional[datetime] = None
    status: str
    suspend_time: Optional[datetime] = None
    timezone: str
    turn_in_play_enabled: bool
    venue: Optional[str]
    version: int




class RunnerBook:
    adjustment_factor: float
    ex: betfair_data.RunnerBookEX
    handicap: float
    last_price_traded: Optional[float] = None
    removal_date: Optional[datetime] = None
    selection_id: int
    sp: betfair_data.RunnerBookSP
    status: str
    total_matched: float
    matches: List[Any] = []
    orders: List[Any] = []

class MarketBook:
    bet_delay: int
    bsp_reconciled: bool
    complete: bool
    cross_matching: bool
    inplay: bool
    is_market_data_delayed: bool
    last_match_time: datetime
    market_definition: MarketDefinition
    market_id: str
    number_of_active_runners: int
    number_of_runners: int
    number_of_winners: int
    publish_time: datetime
    publish_time_epoch: int
    runners: List[RunnerBook]
    runners_voidable: bool
    status: str
    total_available: float
    total_matched: float
    version: int
    streaming_snap: bool #TODO added for compatibility
    streaming_unique_id: int #TODO added for compatibility
    streaming_update: Dict[str, Any] #TODO added the last delta message

    _data: Dict[str, Any] #TODO added dict version of market_book object


class File(Iterator[Sequence[MarketBook]]):
    def __init__(self, path: str, bytes: bytes, cumulative_runner_tv: bool = True) -> None: ...
    file_name: str

README references a pyi stub file which doesn't exist

IDEs should automatically detect the types and provide checking and autocomplete. See the pyi stub file for a comprehensive view of the types and methods available.