tarb / betfair_data
Fast Python Betfair historical data file parser
Home Page: https://betfair-datascientists.github.io/tutorials/jsonToCsvRevisited/
License: MIT License
Is it possible to add the raw streaming_update to the Market object?
>> market.streaming_update
{"id":"1.196641872","rc":[{"atb":[[1.11,0]],"id":40849650},{"atb":[[1.11,0]],"id":40550684},{"atb":[[1.13,1.99]],"id":40570484}]}
It would also be nice to be able to add a unique id to the file. I am not sure how this should be done, or whether it could be set after creation:
market.streaming_unique_id = 123
In reference to #1, flumine uses the two fields above to optimise its simulation logic.
I am attempting to parse Betfair historical stream data (bz2 files) to CSV using the module Betfair_Data in the Betfair Lightweight format.
However, when parsing the bz2 files, I receive the message "(JSON Parse Error) missing required field at line X column Y", and the markets that trigger this error are consequently left out of my final CSV.
I was wondering whether the module could be modified so that bz2 files missing the marketType entry are still parsed with market_type = NaN, rather than being excluded.
As a side note, I have confirmed that this issue is indeed arising because in several of the raw bz2 files provided by Betfair, the marketType entry is missing.
Any help that could be provided would be much appreciated!
Cheers!
I tried using this module on data I collected manually with betfairlightweight. The problem is that my files contain market book data as the first line of the file (which is necessary to obtain the selections' names). I expected the parser to skip over the unrecognized line and perhaps warn about it, but instead it fails outright.
Would it be possible to update the parser so that it skips over unrecognized lines instead of aborting? If the current behaviour is desired, how about adding a skip_bad_lines option to give end users a choice? I imagine I am not the only person in this situation.
I am getting an error parsing the first line of a self-collected betfair stream. The error is "Failed to parse version field: Value out of range: 4295094959." This appears to be greater than the allowed size of uint32.
I am using betfair-data 0.3.4 (the latest from pip) running in a Miniconda instance on a Macbook Pro M1, Ventura 13.2.1.
The JSON of the line is below.
{
"mc": [{
"id": "1.193531652",
"img": true,
"marketDefinition": {
"venue": "Perry Barr",
"inPlay": false,
"status": "OPEN",
"eventId": "31182317",
"runners": [{
"id": 42396416,
"status": "ACTIVE",
"sortPriority": 1
}, {
"id": 42052577,
"status": "ACTIVE",
"sortPriority": 2
}, {
"id": 38557499,
"status": "ACTIVE",
"sortPriority": 3
}, {
"id": 36648358,
"status": "ACTIVE",
"sortPriority": 4
}, {
"id": 39069859,
"status": "ACTIVE",
"sortPriority": 5
}, {
"id": 38757603,
"status": "ACTIVE",
"sortPriority": 6
}],
"version": 4295094959,
"betDelay": 0,
"complete": true,
"openDate": "2022-01-18T11:06:00.000Z",
"timezone": "Europe/London",
"bspMarket": true,
"marketTime": "2022-01-18T11:06:00.000Z",
"marketType": "WIN",
"regulators": ["MR_INT"],
"bettingType": "ODDS",
"countryCode": "GB",
"eventTypeId": "4339",
"suspendTime": "2022-01-18T11:06:00.000Z",
"bspReconciled": false,
"crossMatching": false,
"marketBaseRate": 5,
"discountAllowed": false,
"numberOfWinners": 1,
"runnersVoidable": false,
"turnInPlayEnabled": false,
"persistenceEnabled": false,
"numberOfActiveRunners": 6,
"priceLadderDefinition": {
"type": "CLASSIC"
}
}
}],
"op": "mcm",
"pt": 1642451002939
}
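The arithmetic behind the error can be checked directly. This is a minimal sketch, assuming (as the error message suggests, but not confirmed against the parser source) that the version field is read into an unsigned 32-bit integer; the helper name fits_in_u32 is made up for illustration.

```python
# Largest value an unsigned 32-bit integer can hold.
U32_MAX = 2**32 - 1  # 4294967295

# The "version" value from the market definition above.
version = 4295094959


def fits_in_u32(value: int) -> bool:
    """Return True if value is representable as an unsigned 32-bit integer."""
    return 0 <= value <= U32_MAX


print(fits_in_u32(version))  # False
print(version - U32_MAX)     # 127664
```

The value overshoots the 32-bit range by 127664, so a 64-bit integer type would be needed to hold it.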
When using files = bfd.Files(paths), where paths is a list of .tar file paths, only a few market objects per event ID are returned and most are missing.
I also tried using
with open(path, "rb") as file:
    ff = bfd.File(path, file.read())
    for market in ff:
        market....
The above also returns the same thing. I think it has something to do with the bz2 price file provided by Betfair containing '\n' and not being widely recognized as valid JSON, which causes the reader to miss most lines of data?
Could someone provide an example to get around the GIL? I run into "TypeError: cannot pickle 'builtins.File' object" with both joblib and pathos.
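One common way around this kind of pickling error is to send only file paths (plain strings) to the worker processes and return only plain Python data, so that no parser object ever crosses a process boundary. The sketch below illustrates that pattern under stated assumptions: the parsing inside the worker is a standard-library stand-in (bz2 plus json), and the names count_updates and count_all are hypothetical. A betfair_data File could be created inside the worker instead, provided it is not returned.

```python
import bz2
import json
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List


def count_updates(path: str) -> Dict[str, int]:
    """Runs inside a worker process: takes a path, returns a plain dict."""
    counts: Dict[str, int] = {}
    # Stand-in parsing with the standard library. Any parser object created
    # here must stay inside this function; only the dict is sent back.
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            for mc in json.loads(line).get("mc", []):
                counts[mc["id"]] = counts.get(mc["id"], 0) + 1
    return counts


def count_all(paths: List[str], workers: int = 4) -> List[Dict[str, int]]:
    """Fan the paths out to a process pool and collect picklable results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_updates, paths))
```

On platforms that start workers with spawn (Windows, macOS), count_all should be called from under an if __name__ == "__main__": guard.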
In the Betfair BASIC files, if a horse has not been traded at some timepoint, it does not appear in the JSON line for that timepoint, whereas in your code, if I understand correctly, it is rolled forward to every timepoint. That is actually correct, as it is still the last traded price, but we lose the information about when it was actually last traded at that price.
e.g market file "1.212040273.bz2"
horse id 41133899 the last traded price 2.94 is rolled over in the next update
with betfair_data
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:20:21.962000', 278.038, False, 'ACTIVE', '2. Crosshaven', 41133899, 3.05, None)
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:21:21.961000', 218.039, False, 'ACTIVE', '2. Crosshaven', 41133899, 2.94, None)
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:22:22.001000', 157.999, False, 'ACTIVE', '2. Crosshaven', 41133899, 2.94, None)
('1.212040273', 'To Be Placed', 'PLACE', '2023-04-01 01:25:00', '2023-04-01 01:23:21.968000', 98.032, False, 'ACTIVE', '2. Crosshaven', 41133899, 3.1, None)
with my R parser (much slower than yours!)
rc.ltp rc.id pt mkt.id mc.version
1: 2.92 41133899 2023-04-01 01:15:21 1.212040273 NA
2: 2.94 41133899 2023-04-01 01:17:21 1.212040273 NA
3: 2.92 41133899 2023-04-01 01:18:22 1.212040273 5147358276
4: 3.05 41133899 2023-04-01 01:20:21 1.212040273 NA
5: 2.94 41133899 2023-04-01 01:21:21 1.212040273 NA
6: 3.10 41133899 2023-04-01 01:23:21 1.212040273 NA
7: 3.20 41133899 2023-04-01 01:24:21 1.212040273 NA
8: 3.05 41133899 2023-04-01 01:25:21 1.212040273 5147363556
You can see there is no price update at 01:22:21, as there was no trade on this horse between 01:21:21 and 01:22:21.
I can't find the logic in your code where you roll forward the last_traded_price, but it would be great if you could fix that, or provide something like both the last ever traded price and the real last traded price as it appears in the files, with the correct timestamp.
So maybe one fix would be to have both letp and ltp, with ltp set to None when it was not present in the JSON and letp holding the value you roll forward?
Thank you for that cool package
PS: a wrong solution is to record only ltp changes but two equal consecutive values could be legit trades!
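The ltp/letp distinction proposed above can be sketched directly against the raw stream lines. This is an illustration using only the standard library, not betfair_data's implementation; the function name ltp_rows and the row layout are assumptions made for the example.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Tuple

# (pt, market_id, selection_id, ltp, letp)
Row = Tuple[int, str, int, Optional[float], Optional[float]]


def ltp_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one row per runner change found in the raw stream lines.

    ltp:  the price exactly as it appears in this message, None if absent.
    letp: the last price seen so far for the runner (rolled forward).
    """
    last_seen: Dict[Tuple[str, int], float] = {}
    for line in lines:
        if not line.strip():
            continue
        msg = json.loads(line)
        for mc in msg.get("mc", []):
            for rc in mc.get("rc", []):
                key = (mc["id"], rc["id"])
                ltp = rc.get("ltp")
                if ltp is not None:
                    last_seen[key] = ltp
                yield (msg["pt"], mc["id"], rc["id"], ltp, last_seen.get(key))
```

Two consecutive messages carrying the same ltp both produce rows with a non-None ltp, so equal consecutive trades stay distinguishable from a rolled-forward value. A runner that is absent from a message's rc list produces no row at all for that timepoint.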
I am not sure whether this would bring any speed benefit in your implementation, but in flumine we have some code which ignores (does not output) data that doesn't meet user requirements, for example inplay and seconds_to_start.
This gets passed down to bflw and gives a huge speed improvement in the Python code, so I am wondering if it would be the same for betfair_data? Here is the code I have added, which does the same but after the data has been processed by the Rust code:
def _read_loop(self) -> list:
    # listener_kwargs filtering
    in_play = self.listener_kwargs.get("inplay")
    seconds_to_start = self.listener_kwargs.get("seconds_to_start")
    cumulative_runner_tv = self.listener_kwargs.get("cumulative_runner_tv", False)
    if in_play is None and seconds_to_start is None:
        process_all = True
    else:
        process_all = False
    # process files
    files = betfair_data.bflw.Files(
        [self.market_filter],
        cumulative_runner_tv=cumulative_runner_tv,
        streaming_unique_id=self.stream_id,
    )
    for file in files:
        for update in file:
            if process_all:
                yield update
            else:
                for market_book in update:
                    if market_book.status == "OPEN":
                        if in_play:
                            if not market_book.inplay:
                                continue
                        elif seconds_to_start:
                            _seconds_to_start = (
                                market_book.market_definition.market_time
                                - market_book.publish_time
                            ).total_seconds()
                            if _seconds_to_start > seconds_to_start:
                                continue
                        if in_play is False:
                            if market_book.inplay:
                                continue
                    yield [market_book]
A relatively simple quality-of-life update: in the prerequisites section of the README docs, I suggest adding:
Install the latest version of Rust:
https://www.rust-lang.org/tools/install
Ensure it is added to the Path variable:
Open PowerShell
Run this command:
$env:Path += ';C:\Users\<USER>\.cargo\bin'
Replace <USER> with your username.
Proceed to install.
Also, it does not seem to work with Python 3.10.x; I had to downgrade to 3.7.12 to get it to work.
Is there any scope to refactor the API/README so that its use is more Pythonic? For example, it is currently tricky to read and understand what processing is being done and where:
for file in bfd.Files(paths).iter():
    market_count += 1
    for market in file:
        update_count += 1
A small change is to declare the files first before iterating:
files = bfd.Files(paths)
for file in files.iter():
    market_count += 1
    for market in file:
        update_count += 1
This could be improved further by utilising Python's builtin __iter__, removing the requirement to call iter entirely.
files = bfd.Files(paths)
for file in files:
    market_count += 1
    for market in file:
        update_count += 1
bflw processing can stay the same:
files = bfd.Files(paths)
for file in files.bflw():
    market_count += 1
    for market in file:
        update_count += 1
Is there any reason why the mutable flag isn't available at the Files level?
Although normally gzipped my raw files look like this:
resources/PRO-1.170258213
resources/1.170258213
resources/170258213
I see in handle_file it validates the extension. Not sure the 'Rust' way of doing things but can it auto detect file type or as a last resort read as per a text/json file?
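Detection by content rather than by extension can be sketched with magic bytes: bz2 streams start with "BZh" and gzip streams start with the bytes 1f 8b. The following is a Python illustration of the idea only, not how handle_file works; the function name read_stream_file is hypothetical, and it falls back to treating the file as plain text/JSON.

```python
import bz2
import gzip

BZ2_MAGIC = b"BZh"        # start of every bz2 stream
GZIP_MAGIC = b"\x1f\x8b"  # start of every gzip member


def read_stream_file(path: str) -> bytes:
    """Return the decompressed contents of path, whatever its extension."""
    with open(path, "rb") as fh:
        raw = fh.read()
    if raw.startswith(BZ2_MAGIC):
        return bz2.decompress(raw)
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    # Last resort: assume the file is already uncompressed text/JSON.
    return raw
```

With this approach a file named resources/1.170258213 is handled the same way whether it is gzipped, bz2-compressed, or plain.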
I am trying to open the original downloaded 2020 Tennis Advanced data and am getting the error message
(JSON Parse Error) unknown field batb, expected one of id, atb, atl, spn, spf, spb, spl, trd, tv, ltp, hc at line 1 column 7
The very same market loads with no error using the original downloaded 2020 Tennis Basic data.
I am OK with (some) extra fields in the Advanced and Pro data not being supported, but those sets also have more updates for the market. So it would be beneficial to be able to load that data, ignoring the unknown fields, instead of aborting the parsing process.
Would it be possible to add support for LZMA-compressed ZIP files? I did some benchmarking and found out that LZMA algorithm provides very similar compression ratios compared to BZ2, but decompression time is 3x faster. It is my algorithm of choice for compressing my own collected data and I would love to have this package support it.
Here are the results of some benchmarks. The numbers are: compression ratio, compression time, decompression time. Since the compression is done only once, the compression time is not relevant. However, the format is clearly superior to BZ2 which is what Betfair uses to provide data (probably because they don't bother to decompress it).
Collected data:
GZIP: 15.1 %, 5148 ms, 155 ms.
ZIP: 16.0 %, 1792 ms, 140 ms.
BZIP: 10.4 %, 4441 ms, 995 ms.
LZMA: 10.6 %, 15809 ms, 350 ms.
Betfair PRO data:
GZIP: 11.8 %, 2625 ms, 150 ms.
ZIP: 12.6 %, 884 ms, 121 ms.
BZIP: 7.5 %, 4676 ms, 841 ms.
LZMA: 8.3 %, 12055 ms, 266 ms.
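A comparison of this kind can be reproduced roughly with Python's standard library. The sketch below is not the benchmark that produced the numbers above: it uses the gzip, bz2 and lzma modules with default settings, and the lzma module writes an .xz container rather than an LZMA-compressed ZIP, so absolute figures will differ. The function name benchmark is made up for the example.

```python
import bz2
import gzip
import lzma
import time
from typing import Dict, Tuple


def benchmark(data: bytes) -> Dict[str, Tuple[float, float, float]]:
    """Return {codec: (compressed/original ratio, compress s, decompress s)}."""
    results: Dict[str, Tuple[float, float, float]] = {}
    for name, codec in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
        t0 = time.perf_counter()
        packed = codec.compress(data)
        t1 = time.perf_counter()
        unpacked = codec.decompress(packed)
        t2 = time.perf_counter()
        # Guard against a codec silently corrupting the round trip.
        assert unpacked == data
        results[name] = (len(packed) / len(data), t1 - t0, t2 - t1)
    return results
```

Calling benchmark on the raw bytes of one collected market file gives one row per codec in the same order as the figures above: ratio, compression time, decompression time.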
Again, apologies if this is just a knowledge thing, but can we create objects directly by passing in values?
i.e. passing a native betfairlightweight object's raw dict data:
new_market_book = bflw.MarketBook(**market_book._data)
gives TypeError: No constructor defined
This kind of raises the question of whether we should be able to create objects, or whether there is no need with the native bflw library available.
I see you're using various unstable features, but the code won't compile with the latest nightly version of Rust.
I recorded a bunch of markets with flumine. About half of them have the first clk as null. bfd can process the file just fine after changing null to "null" or any other string:
{"op":"mcm","clk":null,"pt":1720183153764,"mc":[{"id":"1.230347142","marketDefinition":{"bspMarket":false,"turnInPlayEnabled":true,"persistenceEnabled":true,"marketBaseRate":5,"eventId":"33388788","eventTypeId":"1","numberOfWinners":1,"bettingType":"ODDS","marketType":"OVER_UNDER_35","marketTime":"2024-07-05T15:30:00.000Z","suspendTime":"2024-07-05T15:30:00.000Z","bspReconciled":false,"complete":true,"inPlay":false,"crossMatching":true,"runnersVoidable":false,"numberOfActiveRunners":2,"betDelay":0,"status":"OPEN","runners":[{"status":"ACTIVE","sortPriority":1,"id":1222344},{"status":"ACTIVE","sortPriority":2,"id":1222345}],"regulators":["MR_INT"],"countryCode":"LT","discountAllowed":true,"timezone":"GMT","openDate":"2024-07-05T15:30:00.000Z","version":6004374642 ...
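Until the parser accepts a null clk, the manual workaround described above (replacing null with a string) can be automated as a preprocessing step. This is a minimal sketch using only the standard library; the function name patch_null_clk and the placeholder value are assumptions, and re-serialising through json may change number formatting (for example 1.10 becomes 1.1) while preserving key order.

```python
import json


def patch_null_clk(line: str, placeholder: str = "null") -> str:
    """Return the stream line with a null clk replaced by a string."""
    msg = json.loads(line)
    if "clk" in msg and msg["clk"] is None:
        msg["clk"] = placeholder
    # Compact separators keep the output close to the recorded format.
    return json.dumps(msg, separators=(",", ":"))
```

Each line of a recorded file would be passed through patch_null_clk before the file is written back out and handed to the parser.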
So apologies if this is incorrect, as I don't have any clue about Rust with Python, but I have spotted the following things in the bflw file:
Again, apologies if none of this makes sense given how Rust works with Python. Below I've made some changes and added some comments. It would be good to see whether any of them make sense:
from datetime import datetime
from typing import Iterator, Optional, Sequence, List, Dict, Any
import betfair_data

class BflwAdapter(Iterator[betfair_data.File]): ...

class MarketDefinitionRunner:
    adjustment_factor: Optional[float] = None
    bsp: Optional[float] = None
    handicap: float
    name: Optional[str] = None
    removal_date: Optional[datetime] = None
    selection_id: int
    sort_priority: int
    status: str

class MarketDefinition:
    bet_delay: int
    betting_type: str
    bsp_market: bool
    bsp_reconciled: bool
    complete: bool
    country_code: str
    cross_matching: bool
    discount_allowed: bool
    each_way_divisor: Optional[str] = None  #TODO added missing
    event_id: str
    event_name: Optional[str] = None  # changed from eventName
    event_type_id: str
    in_play: bool
    key_line_definitions: Optional[Dict[str, Any]] = None  #TODO added create objs? seems to be different from market_book version
    line_interval: Optional[float] = None  #TODO lineMinUnit: Optional[float] = None
    line_max_unit: Optional[float] = None  #TODO lineMaxUnit: Optional[float] = None
    line_min_unit: Optional[float] = None  #TODO lineInterval: Optional[float] = None
    market_base_rate: float
    market_time: datetime
    market_type: str
    name: Optional[str] = None
    number_of_active_runners: int
    number_of_winners: int
    open_date: datetime
    persistence_enabled: bool
    price_ladder_definition: Dict[str, Any]  #TODO create an obj? priceLadderDefinition: Dict[str, str]
    race_type: str  # raceType: Optional[str] = None
    regulators: List[str]
    runners: List[MarketDefinitionRunner]
    runners_voidable: bool
    settled_time: Optional[datetime] = None
    status: str
    suspend_time: Optional[datetime] = None
    timezone: str
    turn_in_play_enabled: bool
    venue: Optional[str]
    version: int

class RunnerBook:
    adjustment_factor: float
    ex: betfair_data.RunnerBookEX
    handicap: float
    last_price_traded: Optional[float] = None
    removal_date: Optional[datetime] = None
    selection_id: int
    sp: betfair_data.RunnerBookSP
    status: str
    total_matched: float
    matches: List[Any] = []
    orders: List[Any] = []

class MarketBook:
    bet_delay: int
    bsp_reconciled: bool
    complete: bool
    cross_matching: bool
    inplay: bool
    is_market_data_delayed: bool
    last_match_time: datetime
    market_definition: MarketDefinition
    market_id: str
    number_of_active_runners: int
    number_of_runners: int
    number_of_winners: int
    publish_time: datetime
    publish_time_epoch: int
    runners: List[RunnerBook]
    runners_voidable: bool
    status: str
    total_available: float
    total_matched: float
    version: int
    streaming_snap: bool  #TODO added for compatibility
    streaming_unique_id: int  #TODO added for compatibility
    streaming_update: Dict[str, Any]  #TODO added the last delta message
    _data: Dict[str, Any]  #TODO added dict version of market_book object

class File(Iterator[Sequence[MarketBook]]):
    def __init__(self, path: str, bytes: bytes, cumulative_runner_tv: bool = True) -> None: ...
    file_name: str
IDEs should automatically detect the types and provide checking and autocomplete. See the pyi stub file for a comprehensive view of the types and methods available.