azure / azurepublicdataset

Microsoft Azure Traces

License: Creative Commons Attribution 4.0 International

Jupyter Notebook 100.00%

azurepublicdataset's Issues

Clarification on simulated 336k VM arrivals

I found an inconsistency between the numbers reported in the paper and the actual dataset. Can someone please clarify which source is correct?

Section 6.2 of the paper says

[...] we simulate 336k VM arrivals to a cluster of 880 servers (each with 16 cores and 112 GBytes of RAM) over a period of 1 month.

I assume this describes vmtable.csv, which indeed contains VM arrivals for 30 days. However, the number of arrivals is over 2 million (2,013,767 to be precise), which equals the number of lines in the CSV file. I confirmed that every line corresponds to a new VM arrival (no duplicates). So, do the simulated 336k arrivals refer to a subset of vmtable.csv? If so, is it a random subset, the top 336k lines, or something else? Or does the paper perhaps contain a typo, where the authors intended to write over 2 million instead of 336k?

Any hint is appreciated.
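
The check described in this issue (row count versus distinct VM ids) can be sketched roughly as follows. This is only a minimal illustration: the assumption that the VM id sits in the first column, and the three sample rows, are hypothetical and not taken from the real file.

```python
import csv
import io

def count_arrivals(rows):
    """Return (number of non-empty rows, number of distinct VM ids)."""
    total = 0
    seen = set()
    for row in rows:
        if not row:
            continue  # skip blank lines
        total += 1
        seen.add(row[0])  # assumption: the VM id is the first column
    return total, len(seen)

# Hypothetical stand-in for vmtable.csv; the real file would be opened with open(...).
sample = io.StringIO("vm1,sub1,dep1,0,900\nvm2,sub1,dep1,300,900\nvm3,sub2,dep2,300,600\n")
total, distinct = count_arrivals(csv.reader(sample))
print(total, distinct)
```

If the two numbers match, every line is a distinct VM, which is what the issue reports for the full file.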

Does VM create timestamp correspond to request arrival time?

Hello,

I am using the V1 dataset for my research, where I assume that the VM create times in vmtable.csv correspond to the VM request arrival times. Is this a valid assumption? Note that the VM create time is the time when the cloud scheduler created the VM, while the request arrival time is the time when the customer requested the VM.

For example, for the case below, can I assume that vm1 create request arrived before vm2?

vm1,0,900
vm2,300,900

My example simplifies the data format of vmtable.csv to vm_uuid, create_time, delete_time.

Thanks!

VM creates with invalid timestamps (w.r.t. 5 min data sampling)

The paper says that the dataset contains events sampled at a 5-minute frequency (= 300 seconds). This is largely true, except for 27 events in vmtable.csv. In other words, there are 27 VMs created (and deleted) with timestamps that are not multiples of 300. Can someone please clarify the reason?

More precisely, these are the invalid timestamps for VM create events: [2556100, 2002300, 2001400, 2559700, 2393800, 2312200, 2580700, 2001100, 2458900, 1728700, 2271700, 2278900], and these are the invalid timestamps for VM delete events: [2557600, 2005900, 2004400, 2569000, 2395300, 2313100, 2581300, 2079100, 2591500]. Interestingly, if a VM has an invalid create timestamp, then its delete timestamp is always invalid as well.

We can easily deduct 100 seconds from these values to make them valid (multiples of 300), but I wanted to know the source of the invalid data. Here are those 27 events.

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2556100,
3nr+75Hj/9Qur3kI/Tk67pNWXhHhe85FapjDT14TntrfmaNBeei2H/x4kvrxpgVX,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHR7CPmNWMzuQFBJr2apk15A==,2556100,2557600,19.922371,2.3844965,19.922371,Unkown,8,14
tUkxOYVz91Gx/NnyEvQ0tjJvxUhV2q1nNRj7Mm2ezCGtsVF5+42oIvFhfHh5rmpU,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHR7CPmNWMzuQFBJr2apk15A==,2556100,2557600,26.491372,2.4525780000000004,26.491372,Unkown,8,14
1AV898gzrFKOe9rDfJ3r+TtJWwLoVRXU0Qk9r1icoRWbodB1ZYxQss3Yla2dXHTn,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHR7CPmNWMzuQFBJr2apk15A==,2556100,2557600,28.266965,2.7063373333333338,28.266965,Unkown,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2002300,
M1iKrAiDh2evuaTYRekxkMdoM2gA9BrR2ekD42Qex+5Nohnod0LIiawU4S/lO/vl,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHK31tjSgjtgmaIzQmdGvoeQ==,2002300,2005900,99.457229,12.393025769230768,99.457229,Unkown,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2001400,
7QEvDNEskD7Vn5FGaQLUcBs534PoyiEx9G22r6jBPsU+h4kStWJeSbGjHm5RWrOH,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQHK31tjSgjtgmaIzQmdGvoeQ==,2001400,2004400,99.510246,16.815538727272727,99.510246,Unkown,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2559700,
Ddy+Wa0CDxcR4cKTBcvl/fyjWIotkXehlCCYhn2vj6Q/HGTSI4q22/EjZmMfdxuq,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQH37Pa+CyFTaBXVr8DLbIYcg==,2559700,2569000,98.854519,7.482932968750001,92.35011,Unkown,8,14
TOut+jfY7mEfJ35fLcvjI8om1g/3pqlv770GpLH2VVDFEGZG5DgbIlBE04HUf2SP,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQH37Pa+CyFTaBXVr8DLbIYcg==,2559700,2569000,100,6.3068014375000017,98.612031,Unkown,8,14
lZq5YmLZyokv7yNodFM9grK9TRRZLFDwiZ99IPVaP+cqoQfIuQ+zWYJsJkah6Imh,+9OPyI+/Eeu5PSXVMdkPw3cB99+uk+YiAwMRGJU1cDm2ESAgTaUXcM091m1HeTX7,TSRTTdb9LRjgp+FpJYUBXBczOvLJLO5ksIDZm6OFgtN4SaFFac8ZhReSK3rVFgQH37Pa+CyFTaBXVr8DLbIYcg==,2559700,2569000,100,6.9470711250000008,88.296695,Unkown,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2393800,
lAYtdT0/LvAPuUyYRhYzZGqpQYT4s5EYC/Bkx7AUwAu2VY8OWdL/vO60NlQ/lSSx,1pvP5oaK47WSSY0IZRNEQYdTLEx79rf7Gj1isBYW1jDOFGZXLQGTa0V3XnCrLrkB,qNRw2mFobfiF+ZJBeErVPoj/szmLuA7ziVS/8ASNZQ6G2guZy3zELM2we8WN8xTNNBxcdsvUlyOJ6TCnvmhFgQ==,2393800,2395300,99.079248,17.592350833333331,99.079248,Delay-insensitive,1,1.75
c1so8uLO2C9QrADgNAjJbCGNJ96MWEh0aCHA6SbQtT7kBRkujxg2x0Ej2nX6ksn4,1pvP5oaK47WSSY0IZRNEQYdTLEx79rf7Gj1isBYW1jDOFGZXLQGTa0V3XnCrLrkB,qNRw2mFobfiF+ZJBeErVPoj/szmLuA7ziVS/8ASNZQ6G2guZy3zELM2we8WN8xTNNBxcdsvUlyOJ6TCnvmhFgQ==,2393800,2395300,93.921621,18.118107833333337,93.921621,Delay-insensitive,1,1.75

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2312200,
v9tBy+e7vsk/mV2vRF8PSR8R1LwPP5cYikYV+R3xS2uRY978lc0VW+KTufAzk6ZY,8aRytjOt2E+dixkPugZHbKFROou3eQLywft928DTtFP2o3QzFTIxYQ+8r0kdkzvo,ouGLV30FQgVyWsLxmQOgHGq6zLZ4SfGl4sHH8iGOkETohRDh3H3wasbE2+vc3DNF0GTCz+oi8tSzC0i7IouI7w==,2312200,2313100,39.470358,3.3868549999999997,39.470358,Delay-insensitive,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2580700,
VsWB5mzIbZg41Bdmdw6ed1fTYk1kEWOj6orGi/EuiuDfgG7+J5xs1/DAApuYI7an,8aRytjOt2E+dixkPugZHbKFROou3eQLywft928DTtFP2o3QzFTIxYQ+8r0kdkzvo,ouGLV30FQgVyWsLxmQOgHGq6zLZ4SfGl4sHH8iGOkETohRDh3H3wasbE2+vc3DNFlZ2VqzXxHaAcy4VJDShmDA==,2580700,2581300,57.027536,4.7168486666666665,57.027536,Unkown,8,14
XZ+R0ri1WVYWObik2WBIKIFtkftcU/y02IpuSLo9pN1ajUWxFJc+1/tO9nwJ3VUZ,8aRytjOt2E+dixkPugZHbKFROou3eQLywft928DTtFP2o3QzFTIxYQ+8r0kdkzvo,ouGLV30FQgVyWsLxmQOgHGq6zLZ4SfGl4sHH8iGOkETohRDh3H3wasbE2+vc3DNFlZ2VqzXxHaAcy4VJDShmDA==,2580700,2581300,54.404363,4.5038389999999993,54.404363,Unkown,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2001100,
WLo6GCCKYLyxffXWnimWMWOaAWZLbzZX72Ol75CRDW3a8cYav8lert2C7qWlmN9C,310T97LVpEV/JITQpgLC/Iy5XIcUsb89uk8mM/ynF5Wz1i/j54b3k07at8e5yQLr,X4irCXVGXfmnu2Lyjcb4uFIQg3TbnTpT2jhr+tEKgNHAQjn5mFagDSY5Fmz3Uwgu7qFzlwcbawbhPLI81QAc1A==,2001100,2079100,98.984061,4.1162762605364,35.435221,Unkown,8,14

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2458900,
41lJGIyBWIcBTHtKccpv7vhEYzCLPbPOjo5Q+pPAELWS+V6YbzsfMDj6n7OSU709,9dLnlIxhP85BoH/+owh9rU5DwTlEfaVSvXNO2prVDyrwsdZo+SJl+T0X+bq2Pzfv,/eFKgWTCoQpwO+iZwnNEtDfQgpnM3xxwu3XpjWi9AmUoAt6eFpleKgk/1jusHwNAzXBOTsRPdcV50dMIfZpg1w==,2458900,2591500,74.02367,1.8835816726862278,2.96702,Delay-insensitive,2,3.5
FQJ1hlb5frqyiPZxE7vEo+amAH+htFAuTOKIkPjem77NkYzJPz3rpVvDaZeTCSvF,9dLnlIxhP85BoH/+owh9rU5DwTlEfaVSvXNO2prVDyrwsdZo+SJl+T0X+bq2Pzfv,/eFKgWTCoQpwO+iZwnNEtDfQgpnM3xxwu3XpjWi9AmUoAt6eFpleKgk/1jusHwNAzXBOTsRPdcV50dMIfZpg1w==,2458900,2591500,94.590475,8.6293499796839654,37.668553,Unkown,2,3.5
h2qtjy3cMGAdRzMzplCAZD2NmO2kWI1Btw7y4pLtzopN0NwmqCr71uf3Bkf4CUdH,9dLnlIxhP85BoH/+owh9rU5DwTlEfaVSvXNO2prVDyrwsdZo+SJl+T0X+bq2Pzfv,/eFKgWTCoQpwO+iZwnNEtDfQgpnM3xxwu3XpjWi9AmUoAt6eFpleKgk/1jusHwNAzXBOTsRPdcV50dMIfZpg1w==,2458900,2591500,64.899,2.1970899525959355,17.090546,Unkown,4,7

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,1728700,
R3cWZfuI9wgg64U6ZaTsw1IRpJAUNKswXJpoyKh/4FYqD5r48vc1vONTl6kICRke,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.38574,7.3583505804657712,22.211297,Interactive,1,1.75
SIsQxQsvBDFSEv82uxJIl6rHqtf1/R3m6nxrm8xcLtC7gRpwRQijQmL3EE/ORnfS,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.47211,7.0144386666666758,19.408507,Delay-insensitive,1,1.75
X5XiSNqAuYjT3/K7+XPu9HSWUy3pxcdBgpAuyOBwroAGZyyyHz/93SSEnIJX9iUz,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.707285,11.792145633993762,99.204872,Delay-insensitive,1,1.75
L0M1iqD8fhoVHpphFopF7vwlsDX4/6haQghHOjdkkb68EHv93scLJS4SgVON8ZwF,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.837916,7.0444424323948631,18.711566,Delay-insensitive,1,1.75
HuopWsgElJLanqf18dUnzfw0Fr95zbhXzMqqmDJPuUZp9dL5iN1vqZwxabyuchg8,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/U5BPXNv2E+3nznEw62DKvw==,1728700,2591500,99.984665,6.8779116586722395,17.722304,Interactive,1,1.75

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2271700,
jJNaG2gpVatneFAlqADWMxrFvkKB1W4dXeajmgeQD7zFbvVBcac2JsMcAXG4Zpxj,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.342106,9.8855666054357982,30.959831,Unkown,1,1.75
r3N3BJhx77cJy7UBy/2NnddJC8jQZKJkbtWlzdzxiX6D3UzIOzgIWNQStTzvwSF7,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.819369,10.771859931583904,34.068619,Unkown,1,1.75
rXDVMkBbNoykhuUUUmLQ2G0XJ118I471RejAa4tV1e5uw52+qpfcH+9r30ZXtPWl,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.420505,9.327924819119028,23.877358,Unkown,1,1.75
2Vg9i0duJ+U3rPg+4NgEFE/cULAgBw2HgqqwOVujcbYLo9Hvmw8Lzkz8k/ORI/wr,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2271700,2591500,99.927129,10.054505607310217,21.358711,Unkown,1,1.75

$ cat ./AzurePublicDataset/data/vmtable.csv | grep ,2278900,
ClTwuEeII4pj80VpxbsV+Cf7Y95pcRJue0ihQSjcWR9LLqbaWgftjZKjonBVrGaX,CnxLqCQQck55lrTwPFeZujdXtunjSl+XQCjuXEx/oxCQoQt+PBDyP7GXOn4PJvH4,6cFiBpT8v5pvI3SSTUYgUWJeAOufhkKa9OjstTZRpaFucOzTVHzbRXE5LUoYr6w/4Z3uImV5pdUcPoWo9Xkwxg==,2278900,2591500,99.332977,2.52133971716203,13.029393,Unkown,4,7
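
The observation that all of these timestamps miss the 300-second grid by the same amount can be checked with a short sketch. The timestamp lists are copied from this issue; the helper name `off_grid` is hypothetical.

```python
SAMPLING_PERIOD = 300  # seconds, the 5-minute sampling interval from the paper

# Timestamps reported in this issue.
create_ts = [2556100, 2002300, 2001400, 2559700, 2393800, 2312200,
             2580700, 2001100, 2458900, 1728700, 2271700, 2278900]
delete_ts = [2557600, 2005900, 2004400, 2569000, 2395300, 2313100,
             2581300, 2079100, 2591500]

def off_grid(timestamps, period=SAMPLING_PERIOD):
    """Map each timestamp that is not a multiple of `period` to its remainder."""
    return {t: t % period for t in timestamps if t % period != 0}

offsets = off_grid(create_ts + delete_ts)
print(sorted(set(offsets.values())))  # the distinct remainders
```

Every listed timestamp has a remainder of 100, which is why subtracting 100 seconds puts all of them back on the grid.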

Is there a function call chain in AzureFunctionDataset2019?

The CSV files invocations_per_function* list the function trigger as HTTP, Queue, or others.

Are any of these functions triggered by another function? For example, function A triggers function B over HTTP. If so, could this trace provide such a function call chain?

Thanks

Trace start weekday

One frustrating thing about the trace is that the start weekday and time are unknown. E.g., does time 0 start at 00:00 AM on a Sunday or Monday?

Can not download the dataset

I cannot access the dataset links, for example:
https://azurecloudpublicdataset.blob.core.windows.net/azurepublicdataset/trace_data/vm_cpu_readings/vm_cpu_readings-file-125-of-125.csv.gz
Do I need a specific tool to download the dataset?

Negative average execution time in function_durations_percentiles.anon.d01.csv

Hello everyone,
I've found 5 negative values in the Average field of function_durations_percentiles.anon.d01.csv.
For example, in line 35816, the Average is -41692.
Since the Average field indicates the average execution time (ms) across all invocations of the 24-hour period, how can this value be negative?
Thank you very much in advance
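
A scan for such rows could look like the sketch below. The column name `Average` comes from this issue; the other column names and the sample rows are hypothetical placeholders, not the real file contents.

```python
import csv
import io

def negative_average_rows(lines):
    """Return (line number, Average) for every row whose Average is negative."""
    reader = csv.DictReader(lines)
    bad = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        average = float(row["Average"])
        if average < 0:
            bad.append((line_no, average))
    return bad

# Hypothetical sample; the real file would be passed as an open file object.
sample = io.StringIO("HashFunction,Average,Count\nf1,120,10\nf2,-41692,3\nf3,15,7\n")
print(negative_average_rows(sample))
```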

Low Priority VM Dataset?

Is there any publicly available dataset or trace containing client requests for low priority VMs (AKA spot instances)? If not, are there any plans to release such a dataset in the near future?

Is the CPU value in AzurePublicDatasetV2 a percentage?

Hello, I would like to ask whether the CPU values in AzurePublicDatasetV2 (for example, values such as zero and ninety) share a single unit of measurement. Are they percentages? I look forward to hearing from you! Thank you for the data you provided!

Azure Functions Trace 2019 link not working

Hi. I noticed that the link to the Azure Functions 2019 trace does not work anymore.

https://azurecloudpublicdataset2.blob.core.windows.net/azurepublicdatasetv2/azurefunctions_dataset2019/azurefunctions-dataset2019.tar.xz

Duration of Azure Functions Invocation Trace 2021

Thank you for releasing the trace.

I have conducted a simple analysis of the trace and found that the durations fluctuate considerably.

Does the duration in Azure Functions Invocation Trace 2021 include the cold start time?

Or does the fluctuation simply come from the parameters users pass to the function, or from the network?

Thank you.

About the definition of the machineId field in Azure Trace for Packing 2020

Hi!
In the Azure Trace for Packing 2020, I find that at a random time (e.g., time == 0), the sum of CPU resource allocation (i.e., core field) of the VMs alive in a machine (e.g., machineId == 0) exceeds 100%. This really confuses me.

Therefore, I want to know whether the machineId field in the vmType table refers to a single physical machine or not.

Any reply will be appreciated.

Fail to download Azure Functions Invocation Trace 2021

I can't fetch https://raw.githubusercontent.com/Azure/AzurePublicDataset/master/data/AzureFunctionsInvocationTraceForTwoWeeksJan2021.rar
How can I download this dataset?

Can't get into the links provided by both V1 and V2

Thank you for your work, but I can't get in from the links you provided for both V1 and V2.

Where is the invocations_per_hour_atc.tsv?

Hello, I saw some datasets introduced in the notebook but cannot find them in the downloaded dataset. Should we generate them ourselves, or can you share the scripts that generate them?

For example, I cannot find invocations_per_hour_atc.tsv, which is mentioned in the notebook, anywhere in this repo.

Azure Trace for Packing 2020. How to identify the vmType?

Hi,

In the Azure Trace for Packing 2020 dataset, I find that the sheet "vm" has a column "vmTypeId", but in the sheet "vmType" there are many VM types with the same "vmTypeId" but different "machineId". So I'm confused about how to identify the type of a VM in the sheet "vm", because it does not contain the machineId information.

Thanks a lot.

Missing vm_ids in vmtables.csv

Many vm_id string hashes present in the vm_cpu_readings files are not present in vmtable.csv.

For instance the vm_id string hash: +oGQgnS1ILCbrFxxHUaKS64nkPOE4yEn0wZp1kBeSTyht+jQJqQchu963rzw9hZz.

Is there any other vmtable.csv file that has not been published?

Thanks in advance.
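
The mismatch can be measured with a set difference, sketched below. The column positions (VM id in column 1 of the readings and column 0 of vmtable.csv) are assumptions to verify against the schema file, and the sample rows are hypothetical.

```python
def missing_vm_ids(reading_rows, vmtable_rows, reading_id_col=1, vmtable_id_col=0):
    """Return VM ids that appear in the CPU readings but not in vmtable."""
    known = {row[vmtable_id_col] for row in vmtable_rows}
    seen = {row[reading_id_col] for row in reading_rows}
    return seen - known

# Hypothetical rows; real rows would come from csv.reader over the (gzip) files.
vmtable_rows = [["vmA", "sub1", "dep1", "0", "900"],
                ["vmB", "sub1", "dep1", "300", "900"]]
reading_rows = [["0", "vmA", "1.0", "5.0", "2.5"],
                ["0", "vmC", "0.5", "3.0", "1.2"],
                ["300", "vmC", "0.4", "2.0", "1.0"]]

print(missing_vm_ids(reading_rows, vmtable_rows))
```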

vm_cpu_readings-file-x-of-125.csv.gz access denied

Howdy. I have been attempting to access some of the datasets available from the "AzurePublicDatasetV1Links.txt" file with no luck. I have attached the screenshot of the page I am directed to each time I attempt to click on any of the links. I have tried on multiple browsers and devices with no luck, copying the link or clicking on it via my IDE as well. Any response is greatly appreciated.

Definition of buckets in AzurePublicDatasetV2

Hi,

Can you please add the descriptions of the VM core and memory buckets to the AzurePublicDatasetV2 dataset? It only requires including two more URLs in AzurePublicDatasetLinksV2.txt.

I am aware that the exact number of VM cores is not given, as discussed in issue #5, and that VMs are put in one of six buckets based on their cores or memory. However, it seems that the descriptions of these buckets are "missing", even though they were meant to be released.

I say "missing" (in quotes) because even though the description files are not listed in AzurePublicDatasetLinksV2.txt, they are available for download from Azure Blob Storage. More precisely, schema.csv mentions that the description of the CPU buckets is available in vm_virtual_core_bucket_definition.csv, which has two fields: bucket and definition. I blindly constructed a path for this file by appending the file name vm_virtual_core_bucket_definition.csv to the parent path, and I was able to download the file through the constructed path.

The vm_virtual_core_bucket_definition.csv file describes six buckets. These descriptions match the bucket labels in the "VM Cores Distribution" plot in the Jupyter notebook, which is referenced in the main readme. This match confirms that the file available through Azure Blob Storage is the correct one.

The same applies to the description of the memory buckets: schema.csv mentions vm_memory_bucket_definition.csv, which is not listed in AzurePublicDatasetLinksV2.txt but is available for download from Azure Blob Storage.

So, it would be great to update AzurePublicDatasetLinksV2.txt file to include URL for both files (to avoid future guesswork by others):

Let me know if you accept pull requests. I'd be happy to include these two URLs in AzurePublicDatasetLinksV2.txt by myself and perhaps add a short description of buckets to the main readme.

Also, is it accurate to say that

core range in bucket 6 is >24 and <=30, and
memory range in bucket 6 is >64 and <=70?

I noticed these lines in the Jupyter notebook, which suggest that these ranges are correct:

#Transform vmcorecount '>24' bucket to 30 and '>64' to 70
max_value_vmcorecountbucket = 30
max_value_vmmemorybucket = 70
trace_dataframe = trace_dataframe.replace({'vmcorecountbucket':'>24'},max_value_vmcorecountbucket)
trace_dataframe = trace_dataframe.replace({'vmmemorybucket':'>64'},max_value_vmmemorybucket)

Or is this transformation just a cosmetic improvement to have the jupyter table datatype as int? Having more precise bucket bounds would be helpful.
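
To make the effect of that transformation concrete, here is a dependency-free sketch in plain Python (not the notebook's pandas call). The values 30 and 70 are taken from the notebook lines quoted above; whether they are real upper bounds or just numeric placeholders is exactly the open question. The sample rows and the helper `to_numeric` are hypothetical.

```python
# Open-ended bucket labels and the numeric stand-ins used in the notebook.
OPEN_ENDED = {
    "vmcorecountbucket": (">24", 30),
    "vmmemorybucket": (">64", 70),
}

def to_numeric(row):
    """Replace open-ended bucket labels with their stand-in and cast the rest to int."""
    out = dict(row)
    for col, (label, placeholder) in OPEN_ENDED.items():
        value = out[col]
        out[col] = placeholder if value == label else int(value)
    return out

# Hypothetical rows with bucket labels stored as strings.
rows = [
    {"vmcorecountbucket": "4", "vmmemorybucket": "8"},
    {"vmcorecountbucket": ">24", "vmmemorybucket": ">64"},
]
converted = [to_numeric(r) for r in rows]
print(converted)
```

After the replacement, a VM in the '>24' bucket is indistinguishable from one that truly has 30 cores, so any statistic computed on the numeric column inherits that ambiguity.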

Finally, is there an external document that describes AzurePublicDatasetV2, like the SOSP 2017 paper that describes AzurePublicDatasetV1? If so, it would be useful to reference it in the readme.

Thanks in advance for clarifications!

How do we identify the number of cores on each of the VMs in the DatasetV2?

We're currently using the v1 dataset as part of our research, and we would like to use the v2 as well since it has more - and up to date - data. However, I'm not sure if we're misunderstanding the dataset, but we cannot find a way to identify the core count of the VMs, we only see the bucket they belong to.

We need this information for our research, so could you help us figure out how to get it from the dataset? Or is it not meant to be available? If so, do you plan on releasing this information at a later date?

Azure Trace for Packing 2020 normalized resources

I'm a bit puzzled by what looks like an inconsistency in the vmType table. I understand resources are given as ratio of total machine capacity, instead of the absolute number, and unless I've missed something, there is no way of obtaining the capacities of any given machineId.

The problem is that some (in fact, many) resources seem to be inconsistent with each other. For example, if I look at the 'core' resources of rows 1,2,111,112,248,249, I have the following table:

machineId	core (vmTypeId 0)	core (vmTypeId 1)
0	0.020833333333333332	0.020833333333333332
1	0.010416666666666666	0.010416666666666666
2	0.004583333333333333	0.0175

I would expect that the ratio between the 'core' resources for two vm types be the same for any machine type, since presumably, only the denominator of those numbers changes (corresponding to the total machine capacity). But here, the machine with 'machineId' = 2 breaks this invariant.

Does this mean that the resources for a vm type are actually not constant, and depend on the machine type? Or am I missing something?
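
The invariant in question can be checked numerically with the values from the table above. This sketch only reproduces the arithmetic; the dictionary layout is a hypothetical representation, not the dataset's format.

```python
# core[machineId][vmTypeId], copied from the table in this issue.
core = {
    0: {0: 0.020833333333333332, 1: 0.020833333333333332},
    1: {0: 0.010416666666666666, 1: 0.010416666666666666},
    2: {0: 0.004583333333333333, 1: 0.0175},
}

# If only the machine capacity (the denominator) changed between machines,
# the ratio between the two VM types would be identical on every machine.
ratios = {machine: by_type[0] / by_type[1] for machine, by_type in core.items()}
for machine, ratio in ratios.items():
    print(machine, round(ratio, 4))
```

Machines 0 and 1 give a ratio of 1, while machine 2 gives roughly 0.26, which is the inconsistency described above.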

AzureFunctionsDataset2019 trace discrepancies

Hello,
We (@HongyuHe) at eth-easl found some discrepancies in the AzureFunctionsDataset2019 trace.
Looking at each day of the 14-day trace, we found many duplicate apps and functions, and some apps and functions with missing duration or memory stats.

day	app_memory_percentiles.anon	function_durations_percentiles.anon	invocations_per_function_md.anon
d01	- 10 dups (20 rows)	- 16 dups (32 rows) - 377 apps missing memory stats	- 622 functions missing duration stats - 422 apps missing memory stats
d02	- 13 dups (20 rows)	- 18 dups (31 rows) - 380 apps missing memory stats	- 603 functions missing duration stats - 425 apps missing memory stats
d03	- 8 dups (16 rows)	- 11 dups (22 rows) - 386 apps missing memory stats	- 633 functions missing duration stats - 429 apps missing memory stats
d04		- 415 apps missing memory stats	- 623 functions missing duration stats - 465 apps missing memory stats
d05	- 2 dups (4 rows)	- 4 dups (8 rows) - 397 apps missing memory stats	- 615 functions missing duration stats - 440 apps missing memory stats
d06		- 1 dup (2 rows) - 705 apps missing memory stats	- 563 functions missing duration stats - 750 apps missing memory stats
d07		- 332 apps missing memory stats	- 532 functions missing duration stats - 379 apps missing memory stats
d08		- 412 apps missing memory stats	- 630 functions missing duration stats - 453 apps missing memory stats
d09	- 1 dup (2 rows)	- 7 dups (14 rows) - 398 apps missing memory stats	- 640 functions missing duration stats - 439 apps missing memory stats
d10	- 3 dups (6 rows)	- 4 dups (8 rows) - 394 apps missing memory stats	- 633 functions missing duration stats - 444 apps missing memory stats
d11	- 2 dups (4 rows)	- 2 dups (4 rows) - 388 apps missing memory stats	- 652 functions missing duration stats - 436 apps missing memory stats
d12		- 388 apps missing memory stats	- 631 functions missing duration stats - 440 apps missing memory stats
d13	Trace file missing	- 1 dup (2 rows)	- 576 functions missing duration stats
d14	Trace file missing		- 524 functions missing duration stats

dup : duplicate Hash Owner, Hash App, Hash Function with different invocations, durations, or memory
missing stats : Function or App with Hash is present in one trace file but missing in another trace file

These discrepancies make it hard for us to accurately analyze the trace.
Is it reasonable to treat the duplicates as separate entities, or should we merge them?
Would discarding traces with missing data be the only way to clean up the traces?
We would appreciate it if you could provide a way to clean up these issues. Thanks.
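
Duplicates in the sense defined above (same owner, app, and function hashes appearing more than once) can be detected with a counter, as in this sketch. The field names `HashOwner`, `HashApp`, and `HashFunction` are assumed spellings of the columns named in this issue, and the sample rows are hypothetical.

```python
from collections import Counter

def duplicate_keys(rows, key_fields=("HashOwner", "HashApp", "HashFunction")):
    """Return {key tuple: occurrence count} for keys that appear more than once."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

# Hypothetical rows; real rows would come from csv.DictReader over a trace file.
rows = [
    {"HashOwner": "o1", "HashApp": "a1", "HashFunction": "f1", "Average": "10"},
    {"HashOwner": "o1", "HashApp": "a1", "HashFunction": "f1", "Average": "12"},
    {"HashOwner": "o2", "HashApp": "a2", "HashFunction": "f2", "Average": "7"},
]
dups = duplicate_keys(rows)
print(len(dups), sum(dups.values()))  # number of duplicated keys, rows involved
```

This only flags the duplicates; whether to merge them or treat them as separate entities remains the question for the maintainers.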

AzurePublicDatasetV2 - Workload categories and VM roles

Hello everyone,
I am using the dataset for research purposes, and I have some questions related to the workload. In vmtable.csv some VMs are labeled as "interactive", others as "delay-insensitive", and most of them as "unknown". I would like to know how this classification was performed and what the labels mean. For example, is it safe to assume that the "interactive" workload includes web services?
Related to that, what does a deployment represent? Is it an application? Does it follow the definition of deployments used in container orchestration?

Thank you very much in advance.

VM ids in the vmtable of the 2019 V2 dataset versus VM ids in the vm_cpu_readings/vm_cpu_readings-file-*-of-195.csv.gz files

Hello, is there a relationship between the VM ids in the vmtable of the 2019 V2 dataset and the VM ids in the vm_cpu_readings/vm_cpu_readings-file-*-of-195.csv.gz files? Given the same subscription id, can the VM id be found in the vm_cpu_readings/vm_cpu_readings-file-*-of-195.csv.gz files?
Looking forward to your reply, thank you!

1xazure

Early Access for Azure Functions Traces

Hi,

Is there any way we could get early access to the Azure Function Traces for research? BTW, I am a PhD student at the University of Alberta.

Best,
Nima

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

Azure Functions Dataset 2019 download link not working

Hello,

Thank you for sharing the data. However, the link for downloading it is not working.

https://azurecloudpublicdataset2.blob.core.windows.net/azurepublicdatasetv2/azurefunctions_dataset2019/azurefunctions-dataset2019.tar.xz

Update:

However, the individual file link works: https://azurecloudpublicdataset2.blob.core.windows.net/azurepublicdatasetv2/azurefunctions_dataset2019/invocations_per_function_md.anon.d01.csv

Does VM order in vmtable.csv signify VM creation order?

This is somewhat related to issue #6, but it concerns VMs that have the same create timestamp (instead of different timestamps, as in issue #6).

The paper says that the VM data is captured every 5 minutes, i.e., VM creates are recorded at 300-second intervals. Is it fair to use the line numbers in vmtable.csv to infer the order in which VMs were created? That is, does a lower line number (appearing earlier in the .csv file) mean that the VM was created before the VMs on later lines?

For example, with the data format of vmtable.csv simplified to vm_uuid, create_time, delete_time, can I assume vm2 was created after vm1?

vm1,0,900
vm2,0,600
vm3,300,900
vm4,0,900

Note that this is a question about the V1 dataset, although it could apply to V2 as well.

Thanks!
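
For reference, this is what the assumption would look like in practice: a stable sort by create time keeps file order among VMs with equal timestamps. Python's `sorted` is stable, so the sketch below relies on that. Whether file order really reflects creation order is the open question; the rows are the simplified example from this issue.

```python
# (vm_uuid, create_time, delete_time), in file order, from the example above.
rows = [
    ("vm1", 0, 900),
    ("vm2", 0, 600),
    ("vm3", 300, 900),
    ("vm4", 0, 900),
]

# sorted() is stable: rows with the same create_time keep their file order.
by_create = sorted(rows, key=lambda row: row[1])
print([row[0] for row in by_create])
```

Under the assumption, the inferred creation order is vm1, vm2, vm4, vm3.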

Email address missed in README

We provide the trace as is, but are willing to help researchers understand and use it. So, please let us know of any issues or questions by sending email to our <>.

What is the VM size for the provided VM CPU data?

Hi,

I was wondering if there is further information about the provided VM CPU data for each vmid:

Virtual machines (VMs) in Azure come in predefined sizes that are called families or series.
An individual VM is often referred to as an instance. (ref.)

What is the VM size (e.g. DS3_v2)?

Fig. 1: Example of Azure VM sizes (image not shown)

Which Azure VM family/series does the instance behind this CPU data belong to?

Knowing the VM size, I want to compute the CPU utilization as a percentage (0-100%) of the provisioned CPU capacity.

Thanks in advance.

AzureTracesForPacking2020 - VM Create Times?

In the AzureTracesForPacking2020 dataset, there are 3414 rows with a starttime of -7157. That would mean these VMs have existed for almost 20 years before the trace logging began. This may just be a logging error/artifact but I wanted to confirm.
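
The "almost 20 years" estimate follows only if starttime is expressed in days, which is an assumption in the short calculation below; the value -7157 is the one reported in this issue.

```python
DAYS_PER_YEAR = 365.25  # average year length, including leap years

start_time = -7157  # reported starttime; unit assumed to be days
years_before_trace = abs(start_time) / DAYS_PER_YEAR
print(round(years_before_trace, 1))
```

Under that assumption, the VMs would have been created about 19.6 years before the trace began.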