Azure / AzurePublicDataset
Microsoft Azure Traces
License: Creative Commons Attribution 4.0 International
I found an inconsistency between the numbers reported in the paper and the actual dataset. Can someone please clarify which source is correct?
Section 6.2 of the paper says:
[...] we simulate 336k VM arrivals to a cluster of 880 servers (each with 16 cores and 112 GBytes of RAM) over a period of 1 month.
I assume this describes vmtable.csv, which does contain VM arrivals for 30 days. However, the number of arrivals is over 2 million (2,013,767 to be precise), which equals the number of lines in the CSV file. I confirmed that every line corresponds to a new VM arrival (no duplicates). So, do the simulated 336k arrivals refer to a subset of vmtable.csv? If so, is it a random subset, the top 336k lines, or something else? Or does the paper have a typo, where the authors intended to write "over 2 million" instead of 336k?
Any hint is appreciated.
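The "one line per arrival, no duplicates" check described above can be reproduced with a short sketch like the following. It assumes the VM id is the first CSV column; the three sample rows are made up for illustration and are not taken from vmtable.csv.

```python
import csv
import io

def count_arrivals(rows, id_col=0):
    """Return (number of rows, number of distinct VM ids)."""
    total = 0
    ids = set()
    for row in rows:
        total += 1
        ids.add(row[id_col])
    return total, len(ids)

# Hypothetical sample; for the real file, pass csv.reader(open("vmtable.csv")).
sample = io.StringIO("vm1,0,900\nvm2,300,900\nvm1,600,900\n")
total, distinct = count_arrivals(csv.reader(sample))
print(total, distinct)  # equal counts would mean no duplicate VM ids
```

If both counts match on the real file, every line is a distinct VM, which is the property the question relies on.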
Hello,
I am using the V1 dataset for my research, and I assume that the VM create times in vmtable.csv correspond to the VM request arrival times. Is this a valid assumption? Note that the VM create time is the time when the cloud scheduler created the VM, while the request arrival time is the time when the customer requested the VM.
For example, in the case below, can I assume that the create request for vm1 arrived before the one for vm2?
vm1,0,900
vm2,300,900
My example simplifies the vmtable.csv data format to vm_uuid, create_time, delete_time.
Thanks!
The paper says that the dataset contains events sampled at a 5-minute frequency (300 seconds). This is largely true, except for 27 events in vmtable.csv. In other words, there are 27 VMs created (and deleted) at timestamps that are not multiples of 300. Can someone please clarify the reason?
More precisely, these are the invalid timestamps for VM create events: [2556100, 2002300, 2001400, 2559700, 2393800, 2312200, 2580700, 2001100, 2458900, 1728700, 2271700, 2278900], and these are the invalid timestamps for VM delete events: [2557600, 2005900, 2004400, 2569000, 2395300, 2313100, 2581300, 2079100, 2591500]. Interestingly, if a VM has an invalid create timestamp, then its delete timestamp is always invalid as well.
We can easily subtract 100 seconds from these values to make them valid (multiples of 300), but I wanted to know the source of the invalid data. Here are those 27 events.
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2556100,
3nr+75Hj/9Qur3kI/Tk67pNWXhHhe85FapjDT14TntrfmaNBeei2H/x4kvrxpgVX,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHR7CPmNWMzuQFBJr2apk15A==,2556100,2557600,19.922371,2.3844965,19.922371,Unkown,8,14
tUkxOYVz91Gx/NnyEvQ0tjJvxUhV2q1nNRj7Mm2ezCGtsVF5+42oIvFhfHh5rmpU,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHR7CPmNWMzuQFBJr2apk15A==,2556100,2557600,26.491372,2.4525780000000004,26.491372,Unkown,8,14
1AV898gzrFKOe9rDfJ3r+TtJWwLoVRXU0Qk9r1icoRWbodB1ZYxQss3Yla2dXHTn,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHR7CPmNWMzuQFBJr2apk15A==,2556100,2557600,28.266965,2.7063373333333338,28.266965,Unkown,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2002300,
M1iKrAiDh2evuaTYRekxkMdoM2gA9BrR2ekD42Qex+5Nohnod0LIiawU4S/lO/vl,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHK31tjSgjtgmaIzQmdGvoeQ==,2002300,2005900,99.457229,12.393025769230768,99.457229,Unkown,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2001400,
7QEvDNEskD7Vn5FGaQLUcBs534PoyiEx9G22r6jBPsU+h4kStWJeSbGjHm5RWrOH,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHK31tjSgjtgmaIzQmdGvoeQ==,2001400,2004400,99.510246,16.815538727272727,99.510246,Unkown,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2559700,
Ddy+Wa0CDxcR4cKTBcvl/fyjWIotkXehlCCYhn2vj6Q/HGTSI4q22/EjZmMfdxuq,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQH37Pa+CyFTaBXVr8DLbIYcg==,2559700,2569000,98.854519,7.482932968750001,92.35011,Unkown,8,14
TOut+jfY7mEfJ35fLcvjI8om1g/3pqlv770GpLH2VVDFEGZG5DgbIlBE04HUf2SP,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQH37Pa+CyFTaBXVr8DLbIYcg==,2559700,2569000,100,6.3068014375000017,98.612031,Unkown,8,14
lZq5YmLZyokv7yNodFM9grK9TRRZLFDwiZ99IPVaP+cqoQfIuQ+zWYJsJkah6Imh,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQH37Pa+CyFTaBXVr8DLbIYcg==,2559700,2569000,100,6.9470711250000008,88.296695,Unkown,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2393800,
lAYtdT0/LvAPuUyYRhYzZGqpQYT4s5EYC/Bkx7AUwAu2VY8OWdL/vO60NlQ/lSSx,1pvP5oaK47WSSY0IZRNEQYdTLEx79rf7Gj1isBYW1jDOFGZXLQGTa0V3XnCrLrkB,qNRw2mFobfiF+ZJBeErVPoj/szmLuA7ziVS/8ASNZQ6G2guZy3zELM2we8WN8xTNNBxcdsvUlyOJ6TCnvmhFgQ==,2393800,2395300,99.079248,17.592350833333331,99.079248,Delay-insensitive,1,1.75
c1so8uLO2C9QrADgNAjJbCGNJ96MWEh0aCHA6SbQtT7kBRkujxg2x0Ej2nX6ksn4,1pvP5oaK47WSSY0IZRNEQYdTLEx79rf7Gj1isBYW1jDOFGZXLQGTa0V3XnCrLrkB,qNRw2mFobfiF+ZJBeErVPoj/szmLuA7ziVS/8ASNZQ6G2guZy3zELM2we8WN8xTNNBxcdsvUlyOJ6TCnvmhFgQ==,2393800,2395300,93.921621,18.118107833333337,93.921621,Delay-insensitive,1,1.75
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2312200,
v9tBy+e7vsk/mV2vRF8PSR8R1LwPP5cYikYV+R3xS2uRY978lc0VW+KTufAzk6ZY,8aRytjOt2E+dixkPugZHbKFROou3eQLywft928DTtFP2o3QzFTIxYQ+8r0kdkzvo,ouGLV30FQgVyWsLxmQOgHGq6zLZ4SfGl4sHH8iGOkETohRDh3H3wasbE2+vc3DNF0GTCz+oi8tSzC0i7IouI7w==,2312200,2313100,39.470358,3.3868549999999997,39.470358,Delay-insensitive,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2580700,
VsWB5mzIbZg41Bdmdw6ed1fTYk1kEWOj6orGi/EuiuDfgG7+J5xs1/DAApuYI7an,8aRytjOt2E+dixkPugZHbKFROou3eQLywft928DTtFP2o3QzFTIxYQ+8r0kdkzvo,ouGLV30FQgVyWsLxmQOgHGq6zLZ4SfGl4sHH8iGOkETohRDh3H3wasbE2+vc3DNFlZ2VqzXxHaAcy4VJDShmDA==,2580700,2581300,57.027536,4.7168486666666665,57.027536,Unkown,8,14
XZ+R0ri1WVYWObik2WBIKIFtkftcU/y02IpuSLo9pN1ajUWxFJc+1/tO9nwJ3VUZ,8aRytjOt2E+dixkPugZHbKFROou3eQLywft928DTtFP2o3QzFTIxYQ+8r0kdkzvo,ouGLV30FQgVyWsLxmQOgHGq6zLZ4SfGl4sHH8iGOkETohRDh3H3wasbE2+vc3DNFlZ2VqzXxHaAcy4VJDShmDA==,2580700,2581300,54.404363,4.5038389999999993,54.404363,Unkown,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2001100,
WLo6GCCKYLyxffXWnimWMWOaAWZLbzZX72Ol75CRDW3a8cYav8lert2C7qWlmN9C,310T97LVpEV/JITQpgLC/Iy5XIcUsb89uk8mM/ynF5Wz1i/j54b3k07at8e5yQLr,X4irCXVGXfmnu2Lyjcb4uFIQg3TbnTpT2jhr+tEKgNHAQjn5mFagDSY5Fmz3Uwgu7qFzlwcbawbhPLI81QAc1A==,2001100,2079100,98.984061,4.1162762605364,35.435221,Unkown,8,14
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2458900,
41lJGIyBWIcBTHtKccpv7vhEYzCLPbPOjo5Q+pPAELWS+V6YbzsfMDj6n7OSU709,9dLnlIxhP85BoH/+owh9rU5DwTlEfaVSvXNO2prVDyrwsdZo+SJl+T0X+bq2Pzfv,/eFKgWTCoQpwO+iZwnNEtDfQgpnM3xxwu3XpjWi9AmUoAt6eFpleKgk/1jusHwNAzXBOTsRPdcV50dMIfZpg1w==,2458900,2591500,74.02367,1.8835816726862278,2.96702,Delay-insensitive,2,3.5
FQJ1hlb5frqyiPZxE7vEo+amAH+htFAuTOKIkPjem77NkYzJPz3rpVvDaZeTCSvF,9dLnlIxhP85BoH/+owh9rU5DwTlEfaVSvXNO2prVDyrwsdZo+SJl+T0X+bq2Pzfv,/eFKgWTCoQpwO+iZwnNEtDfQgpnM3xxwu3XpjWi9AmUoAt6eFpleKgk/1jusHwNAzXBOTsRPdcV50dMIfZpg1w==,2458900,2591500,94.590475,8.6293499796839654,37.668553,Unkown,2,3.5
h2qtjy3cMGAdRzMzplCAZD2NmO2kWI1Btw7y4pLtzopN0NwmqCr71uf3Bkf4CUdH,9dLnlIxhP85BoH/+owh9rU5DwTlEfaVSvXNO2prVDyrwsdZo+SJl+T0X+bq2Pzfv,/eFKgWTCoQpwO+iZwnNEtDfQgpnM3xxwu3XpjWi9AmUoAt6eFpleKgk/1jusHwNAzXBOTsRPdcV50dMIfZpg1w==,2458900,2591500,64.899,2.1970899525959355,17.090546,Unkown,4,7
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,1728700,
R3cWZfuI9wgg64U6ZaTsw1IRpJAUNKswXJpoyKh/4FYqD5r48vc1vONTl6kICRke,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.38574,7.3583505804657712,22.211297,Interactive,1,1.75
SIsQxQsvBDFSEv82uxJIl6rHqtf1/R3m6nxrm8xcLtC7gRpwRQijQmL3EE/ORnfS,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.47211,7.0144386666666758,19.408507,Delay-insensitive,1,1.75
X5XiSNqAuYjT3/K7+XPu9HSWUy3pxcdBgpAuyOBwroAGZyyyHz/93SSEnIJX9iUz,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.707285,11.792145633993762,99.204872,Delay-insensitive,1,1.75
L0M1iqD8fhoVHpphFopF7vwlsDX4/6haQghHOjdkkb68EHv93scLJS4SgVON8ZwF,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.837916,7.0444424323948631,18.711566,Delay-insensitive,1,1.75
HuopWsgElJLanqf18dUnzfw0Fr95zbhXzMqqmDJPuUZp9dL5iN1vqZwxabyuchg8,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.984665,6.8779116586722395,17.722304,Interactive,1,1.75
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2271700,
jJNaG2gpVatneFAlqADWMxrFvkKB1W4dXeajmgeQD7zFbvVBcac2JsMcAXG4Zpxj,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.342106,9.8855666054357982,30.959831,Unkown,1,1.75
r3N3BJhx77cJy7UBy/2NnddJC8jQZKJkbtWlzdzxiX6D3UzIOzgIWNQStTzvwSF7,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.819369,10.771859931583904,34.068619,Unkown,1,1.75
rXDVMkBbNoykhuUUUmLQ2G0XJ118I471RejAa4tV1e5uw52+qpfcH+9r30ZXtPWl,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.420505,9.327924819119028,23.877358,Unkown,1,1.75
2Vg9i0duJ+U3rPg+4NgEFE/cULAgBw2HgqqwOVujcbYLo9Hvmw8Lzkz8k/ORI/wr,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.927129,10.054505607310217,21.358711,Unkown,1,1.75
$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2278900,
ClTwuEeII4pj80VpxbsV+Cf7Y95pcRJue0ihQSjcWR9LLqbaWgftjZKjonBVrGaX,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2278900,2591500,99.332977,2.52133971716203,13.029393,Unkown,4,7
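Instead of grepping for each timestamp, the off-grid rows can be found in one pass. This is a minimal sketch, assuming the create and delete timestamps are the fourth and fifth columns (as in the rows listed above); the sample rows use shortened, made-up ids.

```python
import csv
import io

INTERVAL = 300  # sampling interval in seconds, as stated in the paper

def off_grid_rows(rows, create_col=3, delete_col=4, interval=INTERVAL):
    """Yield (vm_id, create, delete) where either timestamp is not a multiple of interval."""
    for row in rows:
        create, delete = int(row[create_col]), int(row[delete_col])
        if create % interval or delete % interval:
            yield row[0], create, delete

def snap_down(ts, interval=INTERVAL):
    """Round a timestamp down to the nearest multiple of interval."""
    return ts - ts % interval

# Hypothetical two-row sample; for the real file, pass csv.reader(open("vmtable.csv")).
sample = io.StringIO(
    "vmA,sub,dep,2556100,2557600,19.9,2.3,19.9,Unkown,8,14\n"
    "vmB,sub,dep,0,900,50.0,5.0,40.0,Interactive,1,1.75\n"
)
bad = list(off_grid_rows(csv.reader(sample)))
print(bad)
print([(snap_down(c), snap_down(d)) for _, c, d in bad])
```

Rounding down is equivalent to subtracting 100 seconds for the timestamps listed above, since each of them is 100 more than a multiple of 300.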
The CSV files invocations_per_function* give the function trigger as HTTP, Queue, or others.
Are these functions triggered by other functions? For example, function A triggers function B over HTTP. If so, could this trace provide such function call chains?
Thanks
One frustrating thing about the trace is that the start weekday and time are unknown. E.g., does time 0 start at 00:00 AM on a Sunday or Monday?
I cannot access the dataset links, for example:
https://azurecloudpublicdataset.blob.core.windows.net/azurepublicdataset/trace_data/vm_cpu_readings/vm_cpu_readings-file-125-of-125.csv.gz
Do I need a specific tool to download the dataset?
Hello everyone,
I've found 5 negative values in the Average field of function_durations_percentiles.anon.d01.csv. For example, in line 35816, the Average is -41692.
Since the Average field is the average execution time (ms) across all invocations in the 24-hour period, how can this value be negative?
Thank you very much in advance
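A scan like the one below would list every negative Average together with its line number. It is only a sketch: the Average column name comes from the question above, while the other columns and values in the sample are placeholders chosen for illustration.

```python
import csv
import io

def negative_averages(lines, field="Average"):
    """Return (line_number, value) for every row whose Average is negative."""
    reader = csv.DictReader(lines)
    hits = []
    for row in reader:
        value = float(row[field])
        if value < 0:
            # line_num counts the header as line 1
            hits.append((reader.line_num, value))
    return hits

# Hypothetical sample; for the real file, pass an open file handle instead.
sample = io.StringIO("HashFunction,Average,Count\nf1,120,10\nf2,-41692,3\n")
hits = negative_averages(sample)
print(hits)
```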
Is there any publicly available dataset or trace containing client requests for low priority VMs (AKA spot instances)? If not, are there any plans to release such a dataset in the near future?
Hello, I would like to ask if there is a unified unit of measurement for CPU values in AzurePublicDatasetV2 with zero and ninety? I look forward to hearing from you! Thank you for the data you provided!
Hi. I noticed that the Azure Functions 2019 trace does not work anymore.
Thank you for releasing the trace.
I ran a simple analysis of the trace and found that the duration in the trace fluctuates considerably.
Does the duration in the Azure Functions Invocation Trace 2021 include the cold start time?
Or does it simply depend on the parameters the user passes to the function, or on the network?
Thank you.
Hi!
In the Azure Trace for Packing 2020, I find that at an arbitrary time (e.g., time == 0), the sum of the CPU allocations (i.e., the core field) of the VMs alive on a machine (e.g., machineId == 0) exceeds 100%. This really confuses me.
Therefore, I want to know whether the machineId field in the vmType table refers to a single physical machine or not.
Any reply will be appreciated.
I can't fetch https://raw.githubusercontent.com/Azure/AzurePublicDataset/master/data/AzureFunctionsInvocationTraceForTwoWeeksJan2021.rar
How can I download this dataset?
Thank you for your work, but I can't get in from the links you provided for both V1 and V2.
Hello, I saw some datasets introduced in the notebook but cannot find them in the downloaded dataset. Should we generate them ourselves, or can you share the scripts that generate them?
For example, I cannot find invocations_per_hour_atc.tsv, which is mentioned in the notebook, in this repo.
Hi,
In the dataset Azure Trace for Packing 2020, I find that the sheet "vm" has a column "vmTypeId", but in the sheet "vmType" there are many VM types with the same "vmTypeId" but different "machineId". So I'm confused about how to identify the type of a VM in the sheet "vm", because it does not contain the machineId information.
Thanks a lot.
Many vm_id string hashes present in the vm_cpu_readings files are not present in vmtable.csv.
For instance the vm_id string hash: +oGQgnS1ILCbrFxxHUaKS64nkPOE4yEn0wZp1kBeSTyht+jQJqQchu963rzw9hZz.
Is there any other vmtable.csv file that has not been published?
Thanks in advance.
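The mismatch can be quantified with a set difference between the ids in a readings file and the ids in vmtable.csv. The sketch below assumes the VM id is the second column of the readings file and the first column of vmtable.csv; both positions are assumptions to check against schema.csv, and the sample rows are made up.

```python
import csv
import io

def column_values(lines, col):
    """Collect the distinct values of one CSV column."""
    return {row[col] for row in csv.reader(lines) if row}

def ids_without_metadata(reading_lines, vmtable_lines,
                         reading_id_col=1, vmtable_id_col=0):
    """Return VM ids that appear in the readings but not in the VM table."""
    seen = column_values(reading_lines, reading_id_col)
    known = column_values(vmtable_lines, vmtable_id_col)
    return sorted(seen - known)

# Hypothetical samples; real inputs would be the opened (gunzipped) CSV files.
readings = io.StringIO("0,vmA,1.0,5.0,2.0\n0,vmB,0.5,3.0,1.0\n300,vmA,1.1,4.0,2.1\n")
vmtable = io.StringIO("vmA,sub1,dep1,0,900\n")
orphans = ids_without_metadata(readings, vmtable)
print(orphans)
```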
Howdy. I have been attempting to access some of the datasets available from the "AzurePublicDatasetV1Links.txt" file with no luck. I have attached the screenshot of the page I am directed to each time I attempt to click on any of the links. I have tried on multiple browsers and devices with no luck, copying the link or clicking on it via my IDE as well. Any response is greatly appreciated.
Hi,
Can you please add a description of the VM core and memory buckets to the AzurePublicDatasetV2 dataset? It only requires including these two URLs in AzurePublicDatasetLinksV2.txt.
I am aware that the exact number of VM cores is not given, as discussed in issue #5, and that VMs are put in one of six buckets based on their cores or memory. However, it seems that the descriptions of these buckets are "missing", even though they were meant to be released.
I say "missing" (in quotes) because even though the description files are not included in AzurePublicDatasetLinksV2.txt, they are available for download on Azure Blob Storage. More precisely, schema.csv mentions that the description of the CPU buckets is available in vm_virtual_core_bucket_definition.csv, which has two fields: bucket and definition. I blindly constructed a path for this file by appending the file name vm_virtual_core_bucket_definition.csv to the parent path, and I was able to download it through the constructed path vm_virtual_core_bucket_definition.csv.
The vm_virtual_core_bucket_definition.csv file describes six buckets. These descriptions match the bucket labels in the "VM Cores Distribution" plot in the Jupyter notebook, which is referenced in the main readme. This match confirms that the file available through Azure Blob Storage is the correct one.
The same applies to the description of the memory buckets: schema.csv mentions vm_memory_bucket_definition.csv, which is not included in AzurePublicDatasetLinksV2.txt but is available for download on Azure Blob Storage, here: vm_memory_bucket_definition.csv.
So, it would be great to update the AzurePublicDatasetLinksV2.txt file to include the URLs for both files (to avoid future guesswork by others):
Let me know if you accept pull requests. I'd be happy to add these two URLs to AzurePublicDatasetLinksV2.txt myself and perhaps add a short description of the buckets to the main readme.
Also, is it accurate to say that the >24 core bucket covers >24 and <=30, and the >64 memory bucket covers >64 and <=70? I noticed these lines in the Jupyter notebook, which suggest that these ranges are correct:
#Transform vmcorecount '>24' bucket to 30 and '>64' to 70
max_value_vmcorecountbucket = 30
max_value_vmmemorybucket = 70
trace_dataframe = trace_dataframe.replace({'vmcorecountbucket':'>24'},max_value_vmcorecountbucket)
trace_dataframe = trace_dataframe.replace({'vmmemorybucket':'>64'},max_value_vmmemorybucket)
Or is this transformation just a cosmetic improvement so that the Jupyter table datatype is int? Having more precise bucket bounds would be helpful.
Finally, is there an external document that describes AzurePublicDatasetV2, like SOSP 2017 paper that describes AzurePublicDatasetV1? It would be useful to reference it in the readme, if any.
Thanks in advance for clarifications!
We're currently using the v1 dataset as part of our research, and we would like to use the v2 as well since it has more - and up to date - data. However, I'm not sure if we're misunderstanding the dataset, but we cannot find a way to identify the core count of the VMs, we only see the bucket they belong to.
We need this information to do our research, so could you help me in figuring out how to get this information from the dataset? Or is it the case that it isn't meant to be available? If so, do you plan on releasing this information on a later date?
I'm a bit puzzled by what looks like an inconsistency in the vmType table. I understand resources are given as ratio of total machine capacity, instead of the absolute number, and unless I've missed something, there is no way of obtaining the capacities of any given machineId.
The problem is that some (in fact, many) resources seem to be inconsistent with each other. For example, if I look at the 'core' resources of rows 1,2,111,112,248,249, I have the following table:
'core' | 'vmTypeId': 0 | 'vmTypeId': 1 |
---|---|---|
'machineId': 0 | 0.020833333333333332 | 0.020833333333333332 |
'machineId': 1 | 0.010416666666666666 | 0.010416666666666666 |
'machineId': 2 | 0.004583333333333333 | 0.0175 |
I would expect that the ratio between the 'core' resources for two vm types be the same for any machine type, since presumably, only the denominator of those numbers changes (corresponding to the total machine capacity). But here, the machine with 'machineId' = 2 breaks this invariant.
Does this mean that the resources for a vm type are actually not constant, and depend on the machine type? Or am I missing something?
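The invariant in question can be checked mechanically: for a fixed pair of VM types, the ratio of their 'core' shares should be the same on every machine if only the machine capacity changes. The sketch below uses the six values quoted above; the nested-dictionary layout and the function names are illustrative choices, not part of the dataset.

```python
import math

# core[machineId][vmTypeId], copied from the rows quoted above
core = {
    0: {0: 0.020833333333333332, 1: 0.020833333333333332},
    1: {0: 0.010416666666666666, 1: 0.010416666666666666},
    2: {0: 0.004583333333333333, 1: 0.0175},
}

def type_ratio_per_machine(core, type_a, type_b):
    """Ratio of the core share of type_a to type_b on each machine."""
    return {m: shares[type_a] / shares[type_b] for m, shares in core.items()}

def machines_breaking_invariant(ratios, reference_machine):
    """Machines whose ratio differs from the ratio on the reference machine."""
    ref = ratios[reference_machine]
    return [m for m, r in ratios.items() if not math.isclose(r, ref, rel_tol=1e-9)]

ratios = type_ratio_per_machine(core, 0, 1)
print(ratios)
print(machines_breaking_invariant(ratios, reference_machine=0))
```

Running the same check over all vmTypeId pairs would show how widespread the deviation is.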
Hello,
We (@HongyuHe) at eth-easl found some discrepancies in the AzureFunctionsDataset2019 trace.
Looking at each day of the 14-day trace, we found many duplicate apps and functions, and some apps and functions with missing duration or memory stats.
day | app_memory_percentiles.anon | function_durations_percentiles.anon | invocations_per_function_md.anon |
---|---|---|---|
d01 | 10 dups (20 rows) | 16 dups (32 rows); 377 apps missing memory stats | 622 functions missing duration stats; 422 apps missing memory stats |
d02 | 13 dups (20 rows) | 18 dups (31 rows); 380 apps missing memory stats | 603 functions missing duration stats; 425 apps missing memory stats |
d03 | 8 dups (16 rows) | 11 dups (22 rows); 386 apps missing memory stats | 633 functions missing duration stats; 429 apps missing memory stats |
d04 | 415 apps missing memory stats | 623 functions missing duration stats; 465 apps missing memory stats | |
d05 | 2 dups (4 rows) | 4 dups (8 rows); 397 apps missing memory stats | 615 functions missing duration stats; 440 apps missing memory stats |
d06 | 1 dup (2 rows); 705 apps missing memory stats | 563 functions missing duration stats; 750 apps missing memory stats | |
d07 | 332 apps missing memory stats | 532 functions missing duration stats; 379 apps missing memory stats | |
d08 | 412 apps missing memory stats | 630 functions missing duration stats; 453 apps missing memory stats | |
d09 | 1 dup (2 rows) | 7 dups (14 rows); 398 apps missing memory stats | 640 functions missing duration stats; 439 apps missing memory stats |
d10 | 3 dups (6 rows) | 4 dups (8 rows); 394 apps missing memory stats | 633 functions missing duration stats; 444 apps missing memory stats |
d11 | 2 dups (4 rows) | 2 dups (4 rows); 388 apps missing memory stats | 652 functions missing duration stats; 436 apps missing memory stats |
d12 | 388 apps missing memory stats | 631 functions missing duration stats; 440 apps missing memory stats | |
d13 | Trace file missing | 1 dup (2 rows) | 576 functions missing duration stats |
d14 | Trace file missing | 524 functions missing duration stats | |
These discrepancies make it hard for us to accurately analyze the trace.
Is it reasonable to treat the duplicates as separate entities, or should we merge them?
Would discarding traces with missing data be the only way to clean up the traces?
We would appreciate it if you could provide a way to clean up these issues. Thanks.
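The two kinds of discrepancy reported above (duplicate keys within a file, and keys present in one file but absent from another) can be counted as in the sketch below. The key values are placeholders; in the real files the keys would presumably be the hashed app and function identifiers, which is an assumption about the column layout.

```python
from collections import Counter

def duplicated_keys(keys):
    """Return {key: occurrences} for keys that appear more than once."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}

def missing_keys(expected, available):
    """Return keys that are expected but have no matching entry."""
    return sorted(set(expected) - set(available))

# Hypothetical keys for one day: functions invoked vs. functions with duration stats.
invoked = ["f1", "f2", "f2", "f3"]
with_durations = ["f1", "f2"]

dups = duplicated_keys(invoked)
missing = missing_keys(invoked, with_durations)
print(len(dups), "dups,", sum(dups.values()), "rows")
print(len(missing), "functions missing duration stats")
```

Whether duplicates should be merged or kept separate is not something the code can decide; it only makes the counts reproducible.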
Hello everyone,
I am using the dataset for research purposes, and I have some questions related to the workload. In vmtable.csv, some VMs are labeled as "interactive", others as "delay-insensitive", and most of them as "unknown". I would like to know how this classification was performed and what the labels mean. E.g., is it safe to assume that the "interactive" workload includes web services?
Related to that, what does a deployment represent? Is it an application? Does it follow the definition of deployments used in container deployment strategies?
Thank you very much in advance.
Hello, is there a relationship between the vm ids in the vmtable in the data set 2019v2 and the vm ids in the vm_cpu_readings/vm_cpu_readings-file-*-of-195.csv.gz table? With the same subscription id can you find the vm id in the vm_cpu_readings/vm_cpu_readings-file-*-of-195.csv.gz file?
Looking forward to your reply, thank you!
Hi,
Is there any way we could get early access to the Azure Function Traces for research? BTW, I am a PhD student at the University of Alberta.
Best,
Nima
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
Hello,
Thank you for sharing the data. However, the link for downloading it is not working.
Update:
However, the individual file link works: https://azurecloudpublicdataset2.blob.core.windows.net/azurepublicdatasetv2/azurefunctions_dataset2019/invocations_per_function_md.anon.d01.csv
This is somewhat related to issue #6, but it is about VMs that have the same create timestamp (instead of the different timestamps in issue #6).
The paper says that the VM data is captured every 5 minutes, i.e., VM creates happen at 300-second intervals. Is it fair to use the line numbers in vmtable.csv to signify the order in which VMs were created? I.e., a lower line number (the VM appears earlier in the .csv file) means the VM was created before the VMs on later lines.
For example, in this simplified version of the vmtable.csv data format (vm_uuid, create_time, delete_time), can I assume vm2 was created after vm1?
vm1,0,900
vm2,0,600
vm3,300,900
vm4,0,900
Note that this is a question about the V1 dataset, although it could apply to V2 as well.
Thanks!
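For reference, this is what "file order as tie-breaker" looks like in practice. Python's sorted() is stable, so rows with equal create_time keep their relative file order. The sketch only shows the mechanics on the four example rows above; whether file order reflects the real creation order is exactly the open question and is not something the code can establish.

```python
from collections import defaultdict

# (vm_uuid, create_time, delete_time), in file order, from the example above
rows = [("vm1", 0, 900), ("vm2", 0, 600), ("vm3", 300, 900), ("vm4", 0, 900)]

# Stable sort: ties on create_time preserve the original (file) order.
by_create = sorted(rows, key=lambda row: row[1])
ordered_ids = [vm for vm, _, _ in by_create]

# Group VMs sharing a create timestamp to see where the order is ambiguous.
groups = defaultdict(list)
for vm, create, _ in rows:
    groups[create].append(vm)
ties = {ts: vms for ts, vms in groups.items() if len(vms) > 1}

print(ordered_ids)
print(ties)
```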
We provide the trace as is, but are willing to help researchers understand and use it. So, please let us know of any issues or questions by sending email to our <>.
Hi,
I was wondering if there is further information about the VM CPU data provided for each vmid:
Virtual machines (VMs) in Azure come in predefined sizes that are called families or series.
An individual VM is often referred to as an instance.
[Figure 1: Example of Azure VM sizes.]
Knowing the VM size, I want to compute the CPU utilization percentage (0-100%) based on the provisioned CPU capacity.
Thanks in advance.
In the AzureTracesForPacking2020 dataset, there are 3414 rows with a starttime of -7157. That would mean these VMs have existed for almost 20 years before the trace logging began. This may just be a logging error/artifact but I wanted to confirm.
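The "almost 20 years" figure follows if starttime is measured in days, which is an assumption here and should be checked against the dataset description. A small sketch of the arithmetic, plus a helper for counting negative start times in a column of values:

```python
DAYS_PER_YEAR = 365.25  # average year length, including leap years

def days_to_years(days):
    """Convert a duration in days to years."""
    return days / DAYS_PER_YEAR

def count_negative_starts(starttimes):
    """Count VMs whose start time lies before the beginning of the trace."""
    return sum(1 for t in starttimes if t < 0)

years = days_to_years(7157)
print(round(years, 1))

# Hypothetical starttime values, for illustration only.
print(count_negative_starts([-7157, -0.5, 0.0, 3.2]))
```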