
firmament's Issues

Linker error building examples

I'm having trouble getting make examples to work. I get a linker error:

/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'

which I assume means there is some additional shared library I ought to be linking against, but I can't figure out what it is.

Full output starting from a make clean of the repo is below. System is Ubuntu 14.04 with boost 1.55.0 and clang++/llvm 3.4.

Any ideas?

 (master %=) mike@docker1:~/firmament.io/firmament$ make clean
rm -rf build
rm -rf src/generated-cxx/*
rm -rf src/generated-c/*
find src/ -depth -name .setup -type f -delete

 (master %=) mike@docker1:~/firmament.io/firmament$ make all
rm -f build/tests/all_tests.txt
mkdir -p build/tests
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/base all
touch build/tests/all_tests.txt
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/base'
  SETUP   /home/mike/firmament.io/firmament/build/base
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/coco_interference_scores.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/whare_map_stats.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/whare_map_stats.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/coco_interference_scores.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/reference_desc.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/reference_desc.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_vector.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/task_desc.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/task_desc.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_vector.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_desc.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_topology_node_desc.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_desc.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_topology_node_desc.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/task_perf_statistics_sample.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/task_final_report.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/task_final_report.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/machine_perf_statistics_sample.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/machine_perf_statistics_sample.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/task_perf_statistics_sample.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/data_object_name.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/base/job_desc.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/data_object_name.pb.h
  CXX     /home/mike/firmament.io/firmament/build/base/resource_status.o
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/base/job_desc.pb.h
  CXX     /home/mike/firmament.io/firmament/build/base/data_object.o
  PBC     /home/mike/firmament.io/firmament/build/base/coco_interference_scores.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/whare_map_stats.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/reference_desc.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/resource_vector.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/task_desc.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/resource_desc.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/resource_topology_node_desc.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/task_perf_statistics_sample.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/task_final_report.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/machine_perf_statistics_sample.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/data_object_name.pb.o
  PBC     /home/mike/firmament.io/firmament/build/base/job_desc.pb.o
  AR      /home/mike/firmament.io/firmament/build/base/libfirmament_base.a
  TESTLNK /home/mike/firmament.io/firmament/build/tests/base/data_object_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/base/references_test
rm /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_topology_node_desc.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/job_desc.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_desc.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/task_perf_statistics_sample.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/resource_vector.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/machine_perf_statistics_sample.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/task_desc.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/task_final_report.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/coco_interference_scores.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/whare_map_stats.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/data_object_name.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/base/reference_desc.pb.cc
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/base'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/messages all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/messages'
  SETUP   /home/mike/firmament.io/firmament/build/messages
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/test_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/heartbeat_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/test_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/registration_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/heartbeat_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/registration_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_delegation_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_heartbeat_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_delegation_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_info_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_info_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_heartbeat_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_kill_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_spawn_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_kill_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_state_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_state_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_spawn_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/storage_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/storage_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/storage_registration_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/create_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/create_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/delete_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/storage_registration_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/delete_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/lookup_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/copy_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/lookup_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/io_notification_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/copy_message.pb.h
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/messages/base_message.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/io_notification_message.pb.h
  PBC     /home/mike/firmament.io/firmament/build/messages/test_message.pb.o
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/messages/base_message.pb.h
  PBC     /home/mike/firmament.io/firmament/build/messages/heartbeat_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/registration_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/task_delegation_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/task_heartbeat_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/task_info_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/task_kill_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/task_spawn_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/task_state_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/storage_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/storage_registration_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/create_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/delete_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/lookup_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/copy_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/io_notification_message.pb.o
  PBC     /home/mike/firmament.io/firmament/build/messages/base_message.pb.o
  AR      /home/mike/firmament.io/firmament/build/messages/libfirmament_messages.a
rm /home/mike/firmament.io/firmament/src/generated-cxx/messages/storage_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_kill_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_heartbeat_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/delete_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/lookup_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/copy_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/heartbeat_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_state_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/create_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/registration_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/base_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_info_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/io_notification_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_spawn_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/test_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/storage_registration_message.pb.cc /home/mike/firmament.io/firmament/src/generated-cxx/messages/task_delegation_message.pb.cc
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/messages'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/platforms all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/platforms'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/misc all
  SETUP   /home/mike/firmament.io/firmament/build/platforms
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/platforms/common.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/platforms/common.pb.h
  PBC     /home/mike/firmament.io/firmament/build/platforms/common.pb.o
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/misc'
  SETUP   /home/mike/firmament.io/firmament/build/misc
  CXX     /home/mike/firmament.io/firmament/build/misc/generate_trace.o
make -C sim all
  SETUP   /home/mike/firmament.io/firmament/build/platforms/sim
  CXX     /home/mike/firmament.io/firmament/build/platforms/sim/simulated_messaging_adapter.o
  CXX     /home/mike/firmament.io/firmament/build/misc/pb_utils.o
  AR      /home/mike/firmament.io/firmament/build/platforms/sim/libfirmament_platforms_sim.a
make -C unix all
  SETUP   /home/mike/firmament.io/firmament/build/platforms/unix
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/async_tcp_server.o
  CXX     /home/mike/firmament.io/firmament/build/misc/string_utils.o
  CXX     /home/mike/firmament.io/firmament/build/misc/utils.o
  AR      /home/mike/firmament.io/firmament/build/misc/libfirmament_misc.a
  TESTLNK /home/mike/firmament.io/firmament/build/tests/misc/envelope_test
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/common.o
  TESTLNK /home/mike/firmament.io/firmament/build/tests/misc/utils_test
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/procfs_monitor.o
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/misc'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/storage all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/storage'
  SETUP   /home/mike/firmament.io/firmament/build/storage
  CXX     /home/mike/firmament.io/firmament/build/storage/hdfs_bridge.o
  CXX     /home/mike/firmament.io/firmament/build/storage/simple_object_store.o
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/procfs_machine.o
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/signal_handler.o
  CXX     /home/mike/firmament.io/firmament/build/storage/stub_object_store.o
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/stream_sockets_adapter.o
  AR      /home/mike/firmament.io/firmament/build/storage/libfirmament_storage.a
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/storage'
  CXX     /home/mike/firmament.io/firmament/build/platforms/unix/tcp_connection.o
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/engine/executors all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/engine/executors'
  SETUP   /home/mike/firmament.io/firmament/build/engine/executors
  CXX     /home/mike/firmament.io/firmament/build/engine/executors/local_executor.o
  AR      /home/mike/firmament.io/firmament/build/platforms/unix/libfirmament_unix.a
  TESTLNK /home/mike/firmament.io/firmament/build/tests/platforms/unix/procfs_monitor_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/platforms/unix/procfs_machine_test
  CXX     /home/mike/firmament.io/firmament/build/engine/executors/remote_executor.o
  TESTLNK /home/mike/firmament.io/firmament/build/tests/platforms/unix/stream_sockets_adapter_test
  CXX     /home/mike/firmament.io/firmament/build/engine/executors/simulated_executor.o
  CXX     /home/mike/firmament.io/firmament/build/engine/executors/task_health_checker.o
  CXX     /home/mike/firmament.io/firmament/build/engine/executors/topology_manager.o
  TESTLNK /home/mike/firmament.io/firmament/build/tests/platforms/unix/stream_sockets_channel_test
  AR      /home/mike/firmament.io/firmament/build/engine/executors/libfirmament_engine_executors.a
  TESTLNK /home/mike/firmament.io/firmament/build/tests/engine/executors/local_executor_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/engine/executors/topology_manager_test
rm /home/mike/firmament.io/firmament/src/generated-cxx/platforms/common.pb.cc
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/platforms'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/sim/dfs all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/sim/dfs'
  SETUP   /home/mike/firmament.io/firmament/build/sim/dfs
  CXX     /home/mike/firmament.io/firmament/build/sim/dfs/google_block_distribution.o
  CXX     /home/mike/firmament.io/firmament/build/sim/dfs/simulated_dfs.o
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/engine/executors'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/scheduling all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/scheduling'
  SETUP   /home/mike/firmament.io/firmament/build/scheduling
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/scheduling/scheduling_delta.pb.h
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/scheduling/scheduling_delta.pb.h
  CXX     /home/mike/firmament.io/firmament/build/scheduling/common.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/event_driven_scheduler.o
  AR      /home/mike/firmament.io/firmament/build/sim/dfs/libfirmament_sim_dfs.a
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/sim/dfs'
  CXX     /home/mike/firmament.io/firmament/build/scheduling/knowledge_base.o
  PBC     /home/mike/firmament.io/firmament/build/scheduling/scheduling_delta.pb.o
  AR      /home/mike/firmament.io/firmament/build/scheduling/libfirmament_scheduling.a
rm /home/mike/firmament.io/firmament/src/generated-cxx/scheduling/scheduling_delta.pb.cc
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/scheduling'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/scheduling/flow all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/scheduling/flow'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/scheduling/simple all
  SETUP   /home/mike/firmament.io/firmament/build/scheduling/flow
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/coco_cost_model.o
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/scheduling/simple'
  SETUP   /home/mike/firmament.io/firmament/build/scheduling/simple
  CXX     /home/mike/firmament.io/firmament/build/scheduling/simple/simple_scheduler.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/dimacs_add_node.o
  AR      /home/mike/firmament.io/firmament/build/scheduling/simple/libfirmament_scheduling_simple.a
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/scheduling/simple'
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/dimacs_change_arc.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/dimacs_change_stats.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/dimacs_exporter.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/dimacs_new_arc.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/dimacs_remove_node.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/flow_graph.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/flow_graph_arc.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/flow_graph_manager.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/flow_graph_node.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/flow_scheduler.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/json_exporter.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/octopus_cost_model.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/quincy_cost_model.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/sjf_cost_model.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/solver_dispatcher.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/trivial_cost_model.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/random_cost_model.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/void_cost_model.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/wharemap_cost_model.o
  AR      /home/mike/firmament.io/firmament/build/scheduling/flow/libfirmament_scheduling_flow.a
  TESTLNK /home/mike/firmament.io/firmament/build/tests/scheduling/flow/dimacs_exporter_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/scheduling/flow/flow_graph_test
make -C sim all
  SETUP   /home/mike/firmament.io/firmament/build/scheduling/flow/sim
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/sim/google_runtime_distribution.o
  CXX     /home/mike/firmament.io/firmament/build/scheduling/flow/sim/simulated_quincy_cost_model.o
  AR      /home/mike/firmament.io/firmament/build/scheduling/flow/sim/libfirmament_simulated_quincy.a
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/scheduling/flow'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/engine all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/engine'
  SETUP   /home/mike/firmament.io/firmament/build/engine
  DYNLNK  /home/mike/firmament.io/firmament/build/engine/task_lib_inject.so
  CXX     /home/mike/firmament.io/firmament/build/engine/health_monitor.o
  CXX     /home/mike/firmament.io/firmament/build/engine/node.o
  CXX     /home/mike/firmament.io/firmament/build/engine/coordinator_http_ui.o
  CXX     /home/mike/firmament.io/firmament/build/engine/coordinator.o
In file included from /home/mike/firmament.io/firmament/src/engine/coordinator_http_ui.cc:17:
/home/mike/firmament.io/firmament/ext/pb2json-git/pb2json.h:8:16: warning: unused function 'parse_msg' [-Wunused-function]
static json_t *parse_msg(const google::protobuf::Message *msg);
               ^
/home/mike/firmament.io/firmament/ext/pb2json-git/pb2json.h:9:16: warning: unused function 'parse_repeated_field'
      [-Wunused-function]
static json_t *parse_repeated_field(const google::protobuf::Message *msg,const google::protobuf::Reflection * ref,const g...
               ^
  CXX     /home/mike/firmament.io/firmament/build/engine/worker.o
2 warnings generated.
  CXX     /home/mike/firmament.io/firmament/build/engine/task_lib.o
  DYNLNK  /home/mike/firmament.io/firmament/build/engine/coordinator
  DYNLNK  /home/mike/firmament.io/firmament/build/engine/worker
  AR      /home/mike/firmament.io/firmament/build/engine/libfirmament_engine.a
  AR      /home/mike/firmament.io/firmament/build/engine/libfirmament_task_lib.a
  TESTLNK /home/mike/firmament.io/firmament/build/tests/engine/coordinator_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/engine/simple_scheduler_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/engine/worker_test
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/engine'
make  --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/sim all
make[1]: Entering directory `/home/mike/firmament.io/firmament/src/sim'
  SETUP   /home/mike/firmament.io/firmament/build/sim
  GEN     /home/mike/firmament.io/firmament/src/generated-cxx/sim/event_desc.pb.h
  CXX     /home/mike/firmament.io/firmament/build/sim/event_manager.o
  GENC    /home/mike/firmament.io/firmament/src/generated-cxx/sim/event_desc.pb.h
  CXX     /home/mike/firmament.io/firmament/build/sim/google_trace_loader.o
  CXX     /home/mike/firmament.io/firmament/build/sim/knowledge_base_simulator.o
  CXX     /home/mike/firmament.io/firmament/build/sim/simulator.o
  CXX     /home/mike/firmament.io/firmament/build/sim/simulator_bridge.o
  CXX     /home/mike/firmament.io/firmament/build/sim/synthetic_trace_loader.o
  CXX     /home/mike/firmament.io/firmament/build/sim/trace_utils.o
  CXX     /home/mike/firmament.io/firmament/build/sim/google_trace_task_processor.o
  PBC     /home/mike/firmament.io/firmament/build/sim/event_desc.pb.o
  DYNLNK  /home/mike/firmament.io/firmament/build/sim/simulator
  TESTLNK /home/mike/firmament.io/firmament/build/tests/sim/event_manager_test
  TESTLNK /home/mike/firmament.io/firmament/build/tests/sim/simulator_bridge_test
  DYNLNK  /home/mike/firmament.io/firmament/build/sim/google_trace_processor
rm /home/mike/firmament.io/firmament/src/generated-cxx/sim/event_desc.pb.cc
make[1]: Leaving directory `/home/mike/firmament.io/firmament/src/sim'

 (master *%=) mike@docker1:~/firmament.io/firmament$ make --no-print-directory examples
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/base all
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/messages all
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/misc all
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/platforms all
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/engine/executors all
make -C sim all
make[1]: Nothing to be done for `all'.
make[2]: Nothing to be done for `all'.
make -C unix all
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/sim/dfs all
make[2]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/storage all
make[1]: Nothing to be done for `all'.
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/scheduling all
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/scheduling/flow all
make -C sim all
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/scheduling/simple all
make[2]: Nothing to be done for `all'.
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/engine all
make[1]: Nothing to be done for `all'.
make  --no-print-directory - --jobserver-fds=3,5 -j -C /home/mike/firmament.io/firmament/src/examples all
  SETUP   /home/mike/firmament.io/firmament/build/examples/hello_world
  CXX     /home/mike/firmament.io/firmament/build/examples/hello_world/hello_world.o
  DYNLNK  /home/mike/firmament.io/firmament/build/examples/hello_world/hello_world
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 0 has invalid symbol index 11
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 1 has invalid symbol index 12
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 2 has invalid symbol index 2
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 3 has invalid symbol index 2
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 4 has invalid symbol index 11
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 5 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 6 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 7 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 8 has invalid symbol index 12
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 9 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 10 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 11 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 12 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 13 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 14 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 15 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 16 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 17 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 18 has invalid symbol index 13
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_info): relocation 19 has invalid symbol index 21
/usr/bin/ld: /usr/lib/debug/usr/lib/x86_64-linux-gnu/crt1.o(.debug_line): relocation 0 has invalid symbol index 2
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../x86_64-linux-gnu/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [/home/mike/firmament.io/firmament/build/examples/hello_world/hello_world] Error 1

submit job error

Hello, when I get firmament up and running and try to use

python scripts/job/job_submit.py localhost 8080 /bin/sleep 60

the command fails with the following error:

Traceback (most recent call last):
  File "scripts/job/job_submit.py", line 1, in <module>
    from base import job_desc_pb2
ImportError: No module named base

When I then ran make in the scripts directory, it complained that no makefile was found.

I have tried this several times without success. Perhaps there is an error in the documentation?
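The ImportError means Python cannot find a package named base on its module search path; job_desc_pb2 is a protobuf-generated Python module, so the directory containing the generated bindings has to be on PYTHONPATH. A minimal demonstration of the mechanism (the package and module contents below are throwaway stand-ins, not firmament's actual layout):

```shell
# Build a fake "base" package with a stand-in job_desc_pb2 module (demo only).
mkdir -p /tmp/pp_demo/base
touch /tmp/pp_demo/base/__init__.py
echo 'value = 1' > /tmp/pp_demo/base/job_desc_pb2.py

# With the parent directory on PYTHONPATH, the import succeeds:
PYTHONPATH=/tmp/pp_demo python3 -c 'from base import job_desc_pb2; print(job_desc_pb2.value)'
# prints 1
```

Where firmament actually writes its generated Python protobufs is not stated in this issue, so the directory to export is left open here.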

Replace slow task assignment extraction code

The naive task mapping extraction code in quincy_dispatcher.cc is slowing things down at scale (most notably when running the simulator on the full Google trace) since it takes a few seconds to extract the task mappings.

We have an algorithmically superior implementation in Flowlessly, which we should back-port into the Firmament code base, so that all flow solvers benefit from it.

Output from flowlessly solver invalid on second invocation unless --only_read_assignment_changes is specified

Invoking flowlessly in its default (i.e., non-incremental) mode returns output that the dispatcher's parsing logic fails to understand, because node type information is missing.

Steps to reproduce:

  1. Invoke the coordinator as build/engine/coordinator --logtostderr --scheduler flow --flow_scheduling_cost_model 2 --v=1 --flow_scheduling_solver=flowlessly --debug_flow_graph
  2. Submit a job.
  3. Submit another job.
  4. Observe the error in the output:
I0626 16:53:01.913985  6803 solver_dispatcher.cc:191] Writing flow graph debug info into /tmp/firmament-debug/debug_1.dm
I0626 16:53:01.914111  6803 utils.cc:307] External execution of command: ext/flowlessly-git/run_fast_cost_scaling --graph_has_node_types=true --global_update=false --daemon=false
I0626 16:53:01.915642  6803 utils.cc:346] Subprocess with PID 7002 created.
E0626 16:53:01.917798  6803 solver_dispatcher.cc:562] Unknown type of row in flow graph: m 38 24
I0626 16:53:01.917911  6803 utils.cc:370] Subprocess with PID 7002 exited with status 0
I0626 16:53:01.917989  6803 flow_scheduler.cc:135] Applying 0 scheduling deltas...

Inspecting the file in question shows that it is an incremental delta:

$ cat  /tmp/firmament-debug/debug-flow_1.dm
m 38 24
c EOI

... but the system isn't expecting assignment changes to be returned.

This suggests to me that we should either:

  1. make the --only_read_assignment_changes flag implicit when --flow_solver is set to "flowlessly";
  2. not do above, but fail if the flag is now set when --flow_solver is set to "flowlessly";
  3. remove the special cases for flowlessly and cs2 and simply allow the user to specify a solver binary plus the appropriate combination of --incremental_flow and --only_read_assignment_changes, making it their responsibility to get it right;
  4. shelve this until the fast delta extraction code has been back-ported into the main code base, at which point flowlessly can return the entire flow, just as cs2 does.

(3) seems painful for the user, and (4) seems inefficient to me. Maybe go for (2)?

Create common cost model state class.

Currently, all the cost model constructors have a large number of arguments. Create a class to wrap up the common objects required by all the cost models.
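A minimal sketch of what such a wrapper might look like (in Python for brevity; the attribute names are assumptions based on the arguments the cost models commonly take, not the actual Firmament API):

```python
class CostModelState:
    """Hypothetical bundle of the objects that every cost model
    constructor currently receives as individual arguments."""
    def __init__(self, resource_map, task_map, knowledge_base, leaf_res_ids):
        self.resource_map = resource_map      # shared resource descriptors
        self.task_map = task_map              # shared task descriptors
        self.knowledge_base = knowledge_base  # statistics store
        self.leaf_res_ids = leaf_res_ids      # schedulable leaf resources

# A cost model constructor would then take one argument instead of four:
state = CostModelState(resource_map={}, task_map={}, knowledge_base=None,
                       leaf_res_ids=set())
```

This also means adding a new shared object later only touches the wrapper, not every cost model's constructor signature.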

coordinator start failed

I am using the latest code and want to run the coordinator on Ubuntu 14.04, but it fails; the error message is:

2016-05-03 19:57:47.086785, p20959, th139855899261056, ERROR Failed to setup RPC connection to "localhost:8020" caused by:
TcpSocket.cpp: 293: HdfsNetworkConnectException: Connect to "localhost:8020" failed: (errno: 111) Connection refused
    @   Hdfs::Internal::TcpSocketImpl::connect(addrinfo*, char const*, char const*, int)
    @   Hdfs::Internal::TcpSocketImpl::connect(char const*, char const*, int)
    @   Hdfs::Internal::RpcChannelImpl::connect()
    @   Hdfs::Internal::RpcChannelImpl::invokeInternal(std::shared_ptr<Hdfs::Internal::RpcRemoteCall>)
    @   Hdfs::Internal::RpcChannelImpl::invoke(Hdfs::Internal::RpcCall const&)
    @   Hdfs::Internal::NamenodeImpl::invoke(Hdfs::Internal::RpcCall const&)
    @   Hdfs::Internal::NamenodeImpl::getFsStats()
    @   Hdfs::Internal::NamenodeProxy::getFsStats()
    @   Hdfs::Internal::FileSystemImpl::getFsStats()
    @   Hdfs::Internal::FileSystemImpl::connect()
    @   Hdfs::FileSystem::connect(char const*, char const*, char const*)
    @   hdfsBuilderConnect
    @   firmament::store::HdfsDataLocalityManager::HdfsDataLocalityManager(firmament::TraceGenerator*)
    @   firmament::Coordinator::Coordinator()
    @   main
    @   Unknown
    @   Unknown

F0503 19:57:48.098994 20959 hdfs_data_locality_manager.cc:32] Could not connect to HDFS

I don't have HDFS and want to run with local filesystem storage. Any help?

Improve build system

Edit, July 2015: generalized this into a point about the build system as a whole.

We should move the build system over to a somewhat less hacky setup. There are several problems with the current build system:

  1. The use of per-module libraries to aggregate object files (libfirmament_*.a) is convenient for specifying cross-module dependencies, but inhibits us from using different build options for different targets (e.g. AddressSanitizer does not work with the TaskLib .so).
  2. The per-directory Makefiles can get quite messy. For example, the Makefile in src/engine duplicates a lot of trivial code across the coordinator and coordinator_sim targets, and there are long lists of manual dependency specifications.
  3. We cannot easily include or exclude modules from the build. This is problematic for building the simulator and the coordinator separately (rather than always building both), building unit tests separately, and for integrating future adaptors for other orchestration systems.

As a solution, we could either:

  1. Move the build system to automake.
  2. Move the build system to CMake.
  3. Hand-roll a better build system ourselves.

(Mentioning @AdamGleave re prior conversations on this.)

Crash if sysfs network speed pseudofile is unavailable

We have Ubuntu VMs created on OpenStack.
The build went fine, but the coordinator crashes with the error message below, because the file /sys/class/net/eth0/speed contains no data and is read by the function GetMachineCapacity.

The following is the error message:

ubuntu:~/workspace/src/firmament$ sudo build/src/coordinator --listen_uri tcp:10.11.12.111:9091 --task_lib_dir=$(pwd)/build/src/
F1006 17:54:10.509845  1742 procfs_machine.h:64] Check failed: fscanf(input, "%ju ", x) == 1 (-1 vs. 1)
*** Check failure stack trace: ***
    @     0x7f57273fedaa  (unknown)
    @     0x7f57273fece4  (unknown)
    @     0x7f57273fe6e6  (unknown)
    @     0x7f5727401687  (unknown)
    @           0x60c2e4  firmament::platform_unix::ProcFSMachine::readunsigned()
    @           0x60d4c4  firmament::platform_unix::ProcFSMachine::GetMachineCapacity()
    @           0x56790c  firmament::Coordinator::AddResource()
    @           0x5fea8e  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x5680d9  firmament::Coordinator::DetectLocalResources()
    @           0x56d5f1  firmament::Coordinator::Run()
    @           0x55e80f  main
    @     0x7f5725ebcf45  (unknown)
    @           0x5641c0  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

Relevant source code:

void ProcFSMachine::GetMachineCapacity(ResourceVector* cap) {
  // Extract the total available resource capacities on this machine
  MemoryStatistics_t mem_stats = GetMemoryStats();
  cap->set_ram_cap(mem_stats.mem_total / BYTES_TO_MB);
  vector<CPUStatistics_t> cpu_stats = GetCPUStats();
  // Subtract one as we have an additional element for the overall CPU load
  // across all cores
  cap->set_cpu_cores(cpu_stats.size() - 1);
  // Get network interface speed from ProcFS
  string nic_speed_path;
  spf(&nic_speed_path, "/sys/class/net/%s/speed",
      FLAGS_monitor_netif.c_str());
  FILE* nic_speed_fd = fopen(nic_speed_path.c_str(), "r");
  uint64_t speed = 0;
  if (nic_speed_fd) {
    readunsigned(nic_speed_fd, &speed);  // <-- fails here in the CHECK_EQ
    CHECK_EQ(fclose(nic_speed_fd), 0);
  }

Is this an issue with our machines or a problem with Firmament?

I applied the following workaround, and it seems to work. But this readunsigned function is called from a couple of other places, so this is probably not the correct fix, just a temporary workaround.

   inline void readunsigned(FILE* input, uint64_t *x) {
-    CHECK_EQ(fscanf(input, "%ju ", x), 1);
+    -fscanf(input, "%ju ", x);
   }
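A less drastic alternative to silently discarding the return value might be to fall back to a default when the pseudofile is empty or absent. A sketch of the idea (in Python for brevity; the real fix would be in the C++ readunsigned, and the function name here is hypothetical):

```python
def read_unsigned(path, default=0):
    """Parse an unsigned integer from a sysfs pseudofile, falling back
    to `default` when the file is absent, empty, or malformed (as on
    the OpenStack VMs described above), instead of crashing."""
    try:
        with open(path) as f:
            data = f.read().strip()
        return int(data) if data else default
    except (OSError, ValueError):
        return default
```

On real hardware, /sys/class/net/eth0/speed would parse normally; on VMs with an empty pseudofile, the coordinator would record a default speed instead of hitting a fatal CHECK.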

Circular dependencies during `make` process

I ran into a small issue during the make process, as follows (apologies if this is the wrong place to raise it).

$ mkdir build
$ cd build
$ cmake ..
$ make

The problem is shown below,

root@ubuntu:~/firmament/build# make
/usr/bin/ld: //usr/local/lib/libgflags.a(gflags.cc.o): undefined reference to symbol 'pthread_rwlock_wrlock@@GLIBC_2.2.5'
//lib/x86_64-linux-gnu/libpthread.so.0: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[5]: *** [/root/firmament/build/third_party/flowlessly/src/flowlessly/build/flow_scheduler] Error 1
make[4]: *** [src/CMakeFiles/flow_scheduler.dir/all] Error 2
make[3]: *** [all] Error 2

Basically, I suspect it is a circular dependency issue, and that some additional flags are needed in the Makefile. I tried this and other similar solutions, but had no luck.
Any suggestions, please?

HdfsNetworkConnectException: Connect to "localhost:8020" failed

I deployed Firmament according to the "Getting started" tutorial. When I start Firmament, I come across the errors below. I'm not familiar with HDFS, so forgive me if I missed some steps.

cxxly@ubuntu:~/firmament$ sudo build/src/coordinator --listen_uri tcp:133.133.134.130:9001 --task_lib_dir=$(pwd)/build/src/
2016-06-26 07:06:06.825858, p7895, th139744479869056, ERROR Failed to setup RPC connection to "localhost:8020" caused by:
TcpSocket.cpp: 293: HdfsNetworkConnectException: Connect to "localhost:8020" failed: (errno: 111) Connection refused
    @   Hdfs::Internal::TcpSocketImpl::connect(addrinfo*, char const*, char const*, int)
    @   Hdfs::Internal::TcpSocketImpl::connect(char const*, char const*, int)
    @   Hdfs::Internal::RpcChannelImpl::connect()
    @   Hdfs::Internal::RpcChannelImpl::invokeInternal(std::shared_ptr<Hdfs::Internal::RpcRemoteCall>)
    @   Hdfs::Internal::RpcChannelImpl::invoke(Hdfs::Internal::RpcCall const&)
    @   Hdfs::Internal::NamenodeImpl::invoke(Hdfs::Internal::RpcCall const&)
    @   Hdfs::Internal::NamenodeImpl::getFsStats()
    @   Hdfs::Internal::NamenodeProxy::getFsStats()
    @   Hdfs::Internal::FileSystemImpl::getFsStats()
    @   Hdfs::Internal::FileSystemImpl::connect()
    @   Hdfs::FileSystem::connect(char const*, char const*, char const*)
    @   hdfsBuilderConnect
    @   firmament::store::HdfsDataLocalityManager::HdfsDataLocalityManager(firmament::TraceGenerator*)
    @   firmament::Coordinator::Coordinator()
    @   main
    @   Unknown
    @   Unknown

Dockerfile build issues

I tried to install Firmament on CentOS 7 and Ubuntu 14.04, but ultimately failed. It requires many dependencies that must be installed one by one. Is there an easier solution (such as adding a Dockerfile) or a more detailed description of how to install Firmament?

Enabling flow scheduler crashes during resource topology addition

When running the coordinator with a flow scheduler and any cost model, it crashes on startup (as of recent versions):

$ build/engine/coordinator --logtostderr --scheduler=quincy --v=3 --flow_scheduling_cost_model=6 --debug_flow_graph
[...]
F0329 15:29:24.509809 19645 octopus_cost_model.cc:34] Check failed: 'dst_rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fb271dc9996  google::DumpStackTraceAndExit()
    @     0x7fb271dc10fd  google::LogMessage::Fail()
    @     0x7fb271dc2fb2  google::LogMessage::SendToLog()
    @     0x7fb271dc0c9f  google::LogMessage::Flush()
    @     0x7fb271dc384e  google::LogMessageFatal::~LogMessageFatal()
    @           0x7f4554  google::CheckNotNull<>()
    @           0x80de85  firmament::OctopusCostModel::ResourceNodeToResourceNodeCost()
    @           0x837122  firmament::FlowGraph::ConfigureResourceBranchNode()
    @           0x836d40  firmament::FlowGraph::AddResourceNode()
    @           0x84383a  boost::_mfi::mf1<>::operator()()
    @           0x843790  boost::_bi::list2<>::operator()<>()
    @           0x8436e2  boost::_bi::bind_t<>::operator()<>()
    @           0x843488  boost::detail::function::void_function_obj_invoker1<>::invoke()
    @           0x87ba20  boost::function1<>::operator()()
    @           0x87adb2  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x836034  firmament::FlowGraph::AddResourceTopology()
    @           0x83a672  firmament::FlowGraph::UpdateResourceNode()
    @           0x84383a  boost::_mfi::mf1<>::operator()()
    @           0x843790  boost::_bi::list2<>::operator()<>()
    @           0x8436e2  boost::_bi::bind_t<>::operator()<>()
    @           0x843488  boost::detail::function::void_function_obj_invoker1<>::invoke()
    @           0x87ba20  boost::function1<>::operator()()
    @           0x87adb2  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x832ac0  firmament::FlowGraph::UpdateResourceTopology()
    @           0x832a1d  firmament::FlowGraph::AddMachine()
    @           0x7f0f95  firmament::scheduler::QuincyScheduler::UpdateResourceTopology()
    @           0x7f3026  firmament::scheduler::QuincyScheduler::RegisterResource()
    @           0x6f0718  firmament::Coordinator::AddResource()
    @           0x72fe13  boost::_mfi::mf3<>::operator()()
    @           0x72fd58  boost::_bi::list4<>::operator()<>()
    @           0x72fc62  boost::_bi::bind_t<>::operator()<>()
    @           0x72f9bb  boost::detail::function::void_function_obj_invoker1<>::invoke()

It looks to me like we've got two nested BFS traversals going on here, one triggered from FlowGraph::UpdateResourceTopology and one from FlowGraph::AddResourceTopology, which seems a bit funky. I believe this was introduced by 502af61 via FlowGraph::AddMachine, but it's not the root of the problem -- if I change FlowGraph::AddMachine to call FlowGraph::AddResourceTopology directly, things still fail.

Any ideas? (@ICGog?)

Null pointers in Whare-Map and Octopus

I am getting null pointer assertion failures when trying to use two of the cost models: whare-map and octopus. Error is like this:

$ build/engine/coordinator --task_lib_dir=$(pwd)/build/engine/ --listen_uri=tcp:10.0.1.101:55556 --scheduler flow --flow_scheduling_cost_model 6
rm: cannot remove ‘/tmp/firmament-debug/*’: No such file or directory
F1129 15:38:31.345584  1278 octopus_cost_model.cc:191] Check failed: 'rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fe3a3c9fdaa  (unknown)
    @     0x7fe3a3c9fce4  (unknown)
    @     0x7fe3a3c9f6e6  (unknown)
    @     0x7fe3a3ca2687  (unknown)
    @           0x613a9f  firmament::OctopusCostModel::GatherStats()
    @           0x643bd3  firmament::FlowGraphManager::ComputeTopologyStatistics()
    @           0x611a19  firmament::scheduler::FlowScheduler::UpdateCostModelResourceStats()
    @           0x612769  firmament::scheduler::FlowScheduler::RegisterResource()
    @           0x590f71  firmament::Coordinator::AddResource()
    @           0x6a4b05  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x5909d1  firmament::Coordinator::DetectLocalResources()
    @           0x5912d2  firmament::Coordinator::Run()
    @           0x551b95  main
    @     0x7fe3a0b15ec5  (unknown)
    @           0x55179d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

and

$ build/engine/coordinator --task_lib_dir=$(pwd)/build/engine/ --listen_uri=tcp:10.0.1.101:55556 --scheduler flow --flow_scheduling_cost_model 4
rm: cannot remove ‘/tmp/firmament-debug/*’: No such file or directory
F1129 15:40:05.337487  1299 wharemap_cost_model.cc:760] Check failed: 'rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fd8f6eeadaa  (unknown)
    @     0x7fd8f6eeace4  (unknown)
    @     0x7fd8f6eea6e6  (unknown)
    @     0x7fd8f6eed687  (unknown)
    @           0x62ada9  firmament::WhareMapCostModel::GatherStats()
    @           0x643bd3  firmament::FlowGraphManager::ComputeTopologyStatistics()
    @           0x611a19  firmament::scheduler::FlowScheduler::UpdateCostModelResourceStats()
    @           0x612769  firmament::scheduler::FlowScheduler::RegisterResource()
    @           0x590f71  firmament::Coordinator::AddResource()
    @           0x6a4b05  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x5909d1  firmament::Coordinator::DetectLocalResources()
    @           0x5912d2  firmament::Coordinator::Run()
    @           0x551b95  main
    @     0x7fd8f3d60ec5  (unknown)
    @           0x55179d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

The other cost models (that are marked as complete in README.md) work fine.

short description about your work

I have read your PhD thesis; it is brilliant. But it is difficult to understand everything. Is there a concise description of your work that everyone can understand easily? I think that would make the project more attractive.

Ongoing Poseidon/Firmament Work & Concerns

Hi Ionel & Malte, hope things are great with you two. I wanted to give you an update and ask for your expert opinion/advice on one very important issue we have identified while doing detailed throughput performance testing recently. Firstly, we have completed incorporating the following scheduling functionality into the Firmament scheduler:

  1. Node level Affinity/Anti-Affinity using the network flow graph approach, using the similar approach as for regular workloads/tasks with no affinity/anti-affinity requirements.
  2. Pod Level Affinity/Anti-Affinity using a pod-at-a-time, multi-round scheduling approach. We had to optimize the multi-round process somewhat to get better throughput. Currently, we see that Firmament's throughput is approx. 2X better even though we are doing pod-at-a-time processing.
  3. Support for Taints/Tolerations.

Overall, the throughput numbers are definitely in Firmament's favor by a great margin, as we discovered earlier as well. However, there is one caveat in all this. In these tests, the Job size (K8S replica-set, deployment or Jobs) is quite large, with a large number of tasks per job. If the Job size is smaller and we have a great number of Jobs, Firmament's performance really degrades, as you can see in the examples below. This is because the solver has to deal with a great number of arcs in such cases.

Net-net: based on our assessment, the Firmament scheduler definitely does a great job in use cases where the Job size is quite large. This is primarily because equivalence classes act as a mechanism for amortizing the work.

With a large number of Jobs each consisting of a smaller number of tasks, the throughput benefits are not there, due to the large number of arcs drawn in the graph.

The question for both of you is whether there is a way to optimize all this and reduce the number of arcs in the following examples, so that Firmament can be a general-purpose scheduler. Please let us know your thoughts and perspective on all this. I will also create an issue about this in CAMSAS. Thanks.

Node Anti-Affinity Scenario
• Let us assume there are 800 nodes in a cluster.
• In a scheduling run, let us say we are processing 15,200 pods.
• Let us say we use 800 replicate-sets with 19 replicas in each set.
• Let us also assume that we have set limit of 19 arcs between task EC and nodes (using machine individual ECs with each EC of capacity 1). This is essentially to load balance the incoming workloads/pods across multiple nodes (each arc increases incrementally in order to do load distribution across eligible machines).
• In a node anti-affinity use case scenario, let us assume that an incoming pod can go to any remaining 799 nodes as one single node in the cluster has conflict with the node level anti-affinity rule for incoming Pods.
• Accordingly, we end up having no. of arcs in the flow graph as = 799 * 800 * 19 = 12,144,800 arcs.
• Even though there are only 15,200 incoming pods in a scheduling run, we end up creating 12,144,800 arcs in the graph unnecessarily.
• Ideally, we should limit the no of arcs drawn between task EC and nodes (using machine ECs) to lowest cost 15,200 arcs only.

Normal Pod Scenario
• Let us assume there are 800 nodes in a cluster.
• In a scheduling run, let us say we are processing 15,200 pods.
• Let us say we use 3,040 replicate-sets with 5 replicas in each set. Each replica-set uses a unique CPU-Memory combination.
• Let us also assume that we have set limit of 19 arcs between task EC and nodes (using machine individual ECs with each EC of capacity 1). This is essentially to load balance the incoming workloads/pods across multiple nodes (each arc increases incrementally in order to do load distribution across eligible machines).
• Let us assume incoming pods can go to any of the 800 nodes.
• Accordingly, we end up having no. of arcs in the flow graph as = 3,040 * 19 * 800 = 46,208,000 arcs.
• Even though there are only 15,200 incoming pods in a scheduling run, we end up creating 46,208,000 arcs in the graph unnecessarily.
• Ideally, we should limit the no of arcs drawn between task EC and nodes (using machine ECs) to lowest cost 15,200 arcs only.
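The arithmetic in the two scenarios above can be reproduced with a small sketch (the formula is an assumption read off the numbers given: one task EC per replica-set, each fanning out the per-EC arc limit to every eligible machine EC):

```python
def total_arcs(replica_sets, eligible_machines, arcs_per_ec):
    # Arcs drawn between task ECs and machine ECs when each of the
    # `replica_sets` task ECs draws `arcs_per_ec` arcs to each of the
    # `eligible_machines` machine ECs.
    return replica_sets * eligible_machines * arcs_per_ec

# Node anti-affinity scenario: 800 replica-sets, 799 eligible nodes, 19 arcs.
print(total_arcs(800, 799, 19))   # 12144800
# Normal pod scenario: 3,040 replica-sets, 800 eligible nodes, 19 arcs.
print(total_arcs(3040, 800, 19))  # 46208000
```

In both cases, the proposal above amounts to capping the product at the number of incoming pods (15,200) by keeping only the lowest-cost arcs.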

build error

Hello Malte and everyone, I have a problem building Firmament. After running make in the build dir, it returns a 'flowlessly-download' error, yet when I run git clone https://github.com/ICGog/Flowlessly.git manually, it works fine. The flowlessly-download-err.log reads as follows:
fatal: unable to access 'https://github.com/ICGog/Flowlessly.git/': gnutls_handshake() failed: Error in the pull function.
Besides, when downloading gRPC, it gives a similar failure. I have tried many times without success.

Segfault on job submission after protobuf3 transition

As reported by @shivramsrivastava in #50:

After refreshing our branch with the latest changes, we are getting a SIGSEGV:

#0  0x00000000007688ea in google::protobuf::Message::GetDescriptor() const ()
#1  0x000000000089bbdd in google::protobuf::internal::ReflectionOps::Merge(google::protobuf::Message const&, google::protobuf::Message*) ()
#2  0x00000000007688ff in google::protobuf::Message::GetDescriptor() const ()
#3  0x000000000089e594 in google::protobuf::TextFormat::Parser::Parse(google::protobuf::io::ZeroCopyInputStream*, google::protobuf::Message*) ()
#4  0x000000000089e6b4 in google::protobuf::TextFormat::Parser::ParseFromString(std::string const&, google::protobuf::Message*) ()
#5  0x000000000089ed54 in google::protobuf::TextFormat::ParseFromString(std::string const&, google::protobuf::Message*) ()
#6  0x00000000006a5d28 in firmament::webui::CoordinatorHTTPUI::HandleJobSubmitURI (this=0xc51d50, http_request=..., tcp_conn=...)
    at /home/ubuntu/workspace/src/firmament/src/engine/coordinator_http_ui.cc:157

Line where the error occurs.

156       google::protobuf::TextFormat::ParseFromString(job_descriptor_param,
157                                                     &job_descriptor);

Content of 'job_descriptor_param'.

$9 = "name: \"anonymous_job_at_1477412128\"root_task {  name: \"root_task\"  dependencies {    id: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\"    type: CONCRETE    location: \"blob:/tmp/fib_in\"  }  outputs {    id: \"\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\"    type: FUTURE    non_deterministic: true    location: \"blob:/tmp/out1\"  }  outputs {    id: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\"    type: FUTURE    non_deterministic: true    location: \"blob:/tmp/out2\"  }  binary: \"openssl\"  args: \"speed\"  args: \"sha512\"  inject_task_lib: true  resource_request {    cpu_cores: 0.1    ram_cap: 128  }  priority: 5}output_ids: \"\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\"output_ids: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\""

This error relates to the protobuf3 transition: the job submission script still generates Python code using protobuf2, and the text format therefore becomes incompatible.

There are two solutions:

  1. The immediate fix: use protobuf3 for the job submission script.
  2. Longer-term solution: since protobuf3 supports JSON decoding (one of the reasons we moved to it), we can just move the job submission script to send JSON.
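A sketch of what option 2 might look like on the client side (field names copied from the text-format dump above; the exact JSON schema would be whatever protobuf3's JSON mapping produces for JobDescriptor, so treat this as illustrative only):

```python
import json

# Hypothetical JSON job payload. protobuf3's json_format can parse JSON
# directly into a JobDescriptor on the coordinator side, sidestepping the
# protobuf2/protobuf3 text-format incompatibility entirely.
job = {
    "name": "anonymous_job_at_1477412128",
    "rootTask": {
        "name": "root_task",
        "binary": "/bin/sleep",
        "args": ["60"],
    },
}
payload = json.dumps(job)
print(payload)
```

The submission script would then POST this payload instead of a protobuf text-format string.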

Dynamic equivalence class sets are not supported

Our current implementation will exhibit undefined behaviour if a task's equivalence classes change over time. In other words, we assume that the set of ECs for each task is (a) deterministic in the task's fixed properties, and (b) consequently does not change over time.

None of our current scheduling policies (cost models) have dynamic EC sets, but we may want to have them in the future. (For example: it would be conceivable for Kubernetes labels to be used as ECs, but such labels can change over time.)

This issue is primarily to serve as a reminder that dynamically changing ECs aren't supported, and as a starting point for discussion about future support for them.

If we were to support them, we would have to extend the scheduling code (FlowGraphManager, primarily) with code that checks the set of current ECs returned from GetTaskEquivClasses and GetResourceEquivClasses against the current ECs in the flow graph, and updates the graph to reflect any changes (both in the existence and in the arc costs for each equivalence class).
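The check described above amounts to a set difference between the ECs the cost model currently reports for a task and those recorded in the flow graph. A sketch (names hypothetical; in Python for brevity):

```python
def diff_task_equiv_classes(reported_ecs, graph_ecs):
    """Return (to_add, to_remove): ECs newly reported for a task, and
    ECs present in the flow graph but no longer reported. The
    FlowGraphManager would add/remove the corresponding EC nodes and
    arcs, and refresh arc costs for the ECs that remain."""
    reported, known = set(reported_ecs), set(graph_ecs)
    return reported - known, known - reported

# Task previously had ECs {2, 3}; now reports {1, 2, 4}.
to_add, to_remove = diff_task_equiv_classes([1, 2, 4], [2, 3])
# to_add == {1, 4}, to_remove == {3}
```

This would have to run on every scheduling iteration (or on label-change events in a Kubernetes-style integration), which is part of the cost of supporting dynamic EC sets.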

glog not set up correctly

fetch-externals.sh currently does not set up google-glog correctly; it builds it into a local path, but if we link it dynamically, binaries do not work. Instead, fetch-externals.sh should prompt the user to run the necessary commands to install it.

Track active tasks' data separately from "archived" ones'

We currently keep track of all tasks known to a coordinator in the TaskMap_t data structure owned by the Coordinator. This contains tasks in new, runnable, running, completed, failed and various other states. We use it for the web UI, scheduling and the management of task-specific data structures.

However, the flow graph (and, consequently, the cost models) sometimes needs to iterate over all tasks that are currently of interest to the scheduler (i.e., those which are still eligible for scheduling: runnable, running and failed ones), and can get tripped up by "archived" tasks that are still in the task map.

In order to increase the efficiency of such iterations and clear up the semantics, we should de-conflate the two purposes of the task map. There are several options for this:

  1. Establish a separate data structure in the flow scheduler that keeps track of all tasks that are of interest to it.
    • Pros: easy, not a breaking change, compatible with factoring the flow scheduler into a standalone module
    • Cons: duplication of bookkeeping, need to manage another data structure, memory overhead
  2. Re-designate the task map to only contain active tasks, and have an archival map for those that are no longer active.
    • Pros: no memory overhead, clear separation of concerns
    • Cons: major architectural change, need to still manage two data structures, potential for inconsistency
  3. Garbage-collect finished tasks' state at some time after they finish (as in Mesos), and retire any information we want to retain to the knowledge base.
    • Pros: clean solution, also addresses state accumulation issues, clear separation of concerns
    • Cons: invasive change that touches assumptions, needs state migration logic

Interested in views on what the best way forward is.

Starting subordinate coordinators with flow scheduler causes crash

When you start a "master" node, and a "worker" node both using the flow scheduler (e.g.:

Master node starts with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and Worker node starts with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

),

and submit a job (for example, python job_submit.py localhost 8080 /bin/sleep 60 on the master node), it leads to a crash inside the worker node.

Talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow:

What *should* happen is that the subordinate coordinator ends up
running a flow scheduler itself that it can use to schedule in its more
restricted window of visibility into the cluster state; remotely-placed
tasks would have to be reflected in that flow scheduler's flow graph,
which they aren't (hence the error). 

Improve hash functions used to generate IDs.

Currently, we use boost::hash_combine() to generate unique IDs. However, hash_combine is a rather weak mixer that in practice can lead to collisions. One such collision can be obtained by running a simulation with a 1000 tasks/second submission rate using the following command:

./build/sim/simulator --scheduler="flow" --simulation="synthetic"  --solver="flowlessly" --online_factor=1 --logtostderr --run_incremental_scheduler=true --generate_trace --generated_trace_path=/home/icg27/firmament/generated_trace/ --flow_scheduling_cost_model=6 --synthetic_num_machines=1500 --synthetic_jobs_per_second=20 --synthetic_tasks_per_job=50 --trace_path="/" --runtime=1000000000
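For reference, the mixing step that classic boost::hash_combine performs (32-bit size_t variant) is a single xor-add line, reproduced below in Python alongside a much stronger 64-bit mixer (splitmix64) of the kind that could replace it. This is an illustration, not Firmament code:

```python
M32 = 0xFFFFFFFF
M64 = 0xFFFFFFFFFFFFFFFF

def boost_hash_combine(seed, value_hash):
    # The classic boost::hash_combine mixing step (32-bit variant): a
    # weak mixer, which is why ID collisions become likely at scale.
    return (seed ^ ((value_hash + 0x9E3779B9 + (seed << 6) + (seed >> 2))
                    & M32)) & M32

def splitmix64(x):
    # splitmix64: a well-mixed 64-bit finalizer, illustrative of the
    # kind of hash that could replace hash_combine for ID generation.
    x = (x + 0x9E3779B97F4A7C15) & M64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & M64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & M64
    return (x ^ (x >> 31)) & M64
```

At the scale of the simulation above (tens of millions of IDs), the birthday bound on a 32-bit combine makes collisions expected; a 64-bit well-mixed hash pushes the expected collision point out by a factor of ~2^16.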

Code flow/documentation

It would be great if there were some documentation on the code flow.
I tried Doxygen, but it did not generate any docs :-(
If you could give some pointers, like what the main structures are and where to start, that would be great.

Is there a particular reason you chose C++ over other languages like Go or Python?
Do you have any plans to port to other languages, if all the stated features are achievable?

How to repeat the experiments in OSDI 2016 paper using Firmament

Hello, @ms705.

I read your OSDI 2016 paper “Firmament:fast, centralised cluster scheduling at scale”. I plan to replay the experiments in your paper, but I was confused by the following questions.

1 The first experiment I want to replay is the one described in Figure 7 of your paper: measuring the average runtime of MCMF algorithms on clusters of different sizes subsampled from the Google trace. However, I cannot find the relaxation algorithm code in the Firmament source. The CS2 software in the third-party directory seems to implement the cost-scaling algorithm, and Flowlessly supports the cycle cancelling and successive shortest path algorithms. But the following code in solver_dispatcher.cc suggests that I can designate the relax algorithm when running Firmament:
(screenshot of solver_dispatcher.cc omitted)

I tried to run the simulator with --flowlessly_algorithm=relax, and it failed:
(screenshot of the failing command omitted)

The warning log showed that:

(screenshot of the warning log omitted)

So where is the code for the relaxation algorithm? How do I select and distinguish between incremental cost scaling and Quincy's cost scaling via command-line parameters? Another question: how do I set the cluster size when running the Firmament simulator?
I would much appreciate it if you could give me sample commands for replaying the experiment of Figure 7.

2 The second experiment is that of Figure 8. Is the simulated Google cluster taken from the Google trace (12,500 machines)? And how do I push the simulated cluster closer to oversubscription, e.g. 98%?

3 The third experiment is that of Figure 9 in your paper, which describes how contention slows down the relaxation algorithm. Is the cluster you used about 40 machines, or exactly the same simulated cluster as in Figure 8? And how do I create a single job with an increasing number of tasks, ranging from 1,000 to 5,000?

Solving Affinity/Anti-Affinity Complex Constraints Problem

Hi Ionel & Malte, we are creating this issue to bring to your attention the issues we are running into while implementing “xor” flow construct for pod-to-pod anti-affinity functionality.

As regards the “and” flow network construct, it seems the current min-cost flow solver needs to be enhanced into a generalized min-cost flow solver. We are not sure at this time how to accomplish this.

In the meanwhile, we are addressing pod-to-pod affinity/anti-affinity complex constraints by processing one pod at a time using multi-scheduling rounds, as suggested in Malte’s thesis.

Let us know if there is a way to address issues we are running into while implementing “xor” and “and” flow constructs for solving affinity/anti-affinity complex constraints problem. Following is the Google Doc link for all the issues we have encountered so far. Thanks.

https://docs.google.com/document/d/1kdiDXiLJ2glJ35AWRMX2H3YMOznNgkNu8-4sG4z9HRo/edit?usp=sharing

@ms705 @ICGog @shivramsrivastava

Arcs between ECs don't get updated.

Currently, the capacity and cost of the arcs connecting equivalence classes do not get updated. This is not a problem for our current cost models because they don't have arcs between ECs. However, we still want to update the arcs because we may add cost models that require this feature.

Received implausibly large message from tcp error

When I execute the following command
wyb89@ubuntu:~/soft/firmament-master$ ./build/src/coordinator --listen_uri tcp:localhost:8080 --task_lib_dir=./build/src
and then type in "localhost:8080" in my browser
I get the following errors:
E0102 03:07:12.235633 32499 coordinator_http_ui.cc:1323] Failed running the coordinator's HTTP UI due to bind: Address already in use
F0102 03:07:37.936508 32502 stream_sockets_channel.h:454] Check failed: msg_size < 1024*1024 (5135603447292250196 vs. 1048576) Received implausibly large message from tcp:127.0.0.1:49682
*** Check failure stack trace: ***
@ 0x7f0a1533fffd google::LogMessage::Fail()
@ 0x7f0a15341d80 google::LogMessage::SendToLog()
@ 0x7f0a1533fbe3 google::LogMessage::Flush()
@ 0x7f0a1534274e google::LogMessageFatal::~LogMessageFatal()
@ 0x73e99a firmament::platform_unix::streamsockets::StreamSocketsChannel<>::RecvASecondStage()
@ 0x73cfc7 boost::asio::detail::read_op<>::operator()()
@ 0x73d9b4 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
@ 0x72cc5d boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@ 0x7302cd boost::asio::io_service::run()
@ 0x7f0a162a45d5 (unknown)
@ 0x7f0a15a346ba start_thread
@ 0x7f0a145963dd (unknown)
Aborted (core dumped)

I do not know why this happens. Did I do something wrong? Could you please give me some guidance?

Build errors on Ubuntu 15.10

Building Firmament on Ubuntu 15.10 currently fails when using clang as the compiler.

On Ubuntu 15.10, gcc and libstdc++ have been upgraded to version 5+. Along with this came a C++11 ABI change that made backwards-incompatible changes to libstdc++. As a result, only libraries that were compiled against the same libstdc++ ABI can be linked together. The Ubuntu libprotobuf9-v5 package is compiled against the new ABI introduced by gcc5.

Clang does not support this new ABI (and apparently has no immediate intention to do so [1]), so Firmament cannot link the Ubuntu libprotobuf unless we switch to g++ for our compiler. All our dependent libraries also have to use the same ABI (either the new or the old one), and we will have to recompile any libraries that do not match what the others do.

Using libc++ (the LLVM C++ standard library) instead of libstdc++ is unfortunately not a solution to this problem, as it turns out to break when linking libraries that were built against libstdc++ (i.e., almost anything that we depend on). Recompiling our dependencies against libc++ is not easily possible, because many of them rely on gcc-isms.

Finally, building with g++ on 15.10 succeeds, but the coordinator fails with a segfault on launch.

[1] -- https://llvm.org/bugs/show_bug.cgi?id=23529

hello world task failed!

Hi Malte and everyone, I ran into a problem when trying to run the 'hello world' example after installing Firmament on an Ubuntu 14.04 virtual machine. I built Firmament from source, and all 'ctest' tests passed successfully. I started the coordinator with the command
./build/src/coordinator --listen_uri tcp:127.0.0.1:8081 --task_lib_dir=$(pwd)/build/src/
, and could then successfully observe Firmament's status at http://127.0.0.1:8080. However, when I tried to run the 'hello world' example using the command
python job_submit.py localhost 8080 /home/lilelr/opensource/firmament/firmament/build/src/examples/hello_world/hello_world

in the '/firmament/scripts/job' directory, I could see that the job was submitted successfully, but it quickly showed:
E0823 14:33:35.263828 784 task_health_checker.cc:51] Task 2611075011106894433 has failed!

snip20170823_3

snip20170823_1

snip20170823_2

The log coodinator.INFO shows that:

snip20170823_5

From your code in 'hello_world.cc', I guess that if the 'hello world' example job were scheduled successfully, I should see the words 'Hello world' on the terminal. It seems the task failed because it didn't send a heartbeat within 1 minute, so it was killed by the coordinator. Is that right? How can I fix this problem? Or is it just that my virtual machine runs Firmament too slowly? By the way, when I tried to run the example

python job_submit.py localhost 8080 /bin/sleep 60
the same problem occurred again.

Implementing pod anti-affinity

We have implemented a new cost model, modelled on the net cost model. The new cost model, named the "CPU mem cost model", considers CPU and memory requirements for scheduling instead of network bandwidth. We have also designed and implemented soft constraints on top of this CPU mem cost model.

Please refer to the design document and provide your feedback.

In the above design document, we have described the problem with implementing pod anti-affinity. Please (@ms705, @ICGog) provide your input on this design and suggest how we can efficiently implement pod anti-affinity for the CPU mem cost model.

The CPU mem cost model implementation changes are open for review at the GerritHub link below.

Make the coordinator less CPU-spin happy

Currently, the coordinator spins hard on incoming messages in the StreamSocketsAdapter when doing asynchronous receives. This was the easiest way of implementing the functionality, but is not actually required.

Instead, we should rework the StreamSocketsAdapter and StreamSocketsChannel code to use the Boost ASIO proactor pattern correctly (with a finishing async receive callback setting up the next async receive), or move to a libev or select-based implementation.

This isn't a correctness issue, but a performance one: the coordinator currently uses an entire CPU core on each machine. Usually, that's not an issue, but we may as well not waste resources unnecessarily.
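As a minimal sketch of the select-based alternative (in Python for brevity; the socket setup and handler names are invented for illustration and are not Firmament code), the key point is that the thread blocks in the demultiplexer until data arrives, and the read handler stays armed for the next receive instead of busy-polling:

```python
# Sketch of a select-based receive loop: the thread blocks in select()
# until data arrives, instead of spinning on incoming messages.
import selectors
import socket

sel = selectors.DefaultSelector()
received = []

def on_readable(conn):
    # Completion handler: consume the data that arrived; the registration
    # stays in place, which arms the next asynchronous receive.
    data = conn.recv(4096)
    if data:
        received.append(data)
    else:
        # Peer closed the connection: tear down this channel.
        sel.unregister(conn)
        conn.close()

a, b = socket.socketpair()
sel.register(b, selectors.EVENT_READ, on_readable)
a.sendall(b"hello")
a.close()

# The event loop blocks here; no CPU is burned while waiting.
while sel.get_map():
    for key, _ in sel.select(timeout=1):
        key.data(key.fileobj)

print(b"".join(received))  # → b'hello'
```

The Boost ASIO proactor variant is the same shape: the finishing async-receive callback issues the next `async_read`, so the loop only wakes when a completion is actually ready.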

Workaround required for cs2 to work with sparse node IDs

When using the cs2 solver, the maximum node ID in the flow graph output sent to the solver can be no greater than the number of nodes in the flow graph, or the solver fails. In other words, the p N E DIMACS line at the start of the output must contain the maximum node ID used as N.

The workaround fix deployed in ee31f80 is to return the maximum ID in use instead of the true number of nodes, which leads to cs2 implicitly assuming disconnected nodes for the unused IDs. However, this isn't ideal: as part of the workaround, FlowGraph::NumNodes() returns a number greater than the actual number of nodes in the graph. Moreover, the implied disconnected nodes may slow the solver down (although I doubt there is much impact in practice).

Note that this isn't a problem when node IDs are reused: the number of nodes in the flow graph and the maximum ID are in agreement in this case.

Possible comprehensive solutions:

  1. Re-map sparse node IDs into a continuous space before sending them to cs2, and back again (yuck!).
  2. Continue with the disconnected implicit node approach, but re-engineer things so that FlowGraph::NumNodes() returns the correct number again.
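For illustration, option 1 amounts to something like the following sketch (hypothetical Python helper, not Firmament's actual DIMACS exporter): build a dense renumbering of the node IDs actually in use, rewrite the DIMACS lines with it, and keep the inverse map to translate the solver's output back.

```python
def remap_dimacs(lines):
    """Re-map sparse DIMACS node IDs into a dense 1..N space.

    Returns the rewritten lines plus the inverse map needed to
    translate the solver's output back to the original IDs.
    (Illustrative sketch only, not Firmament's exporter code.)
    """
    # First pass: collect every node ID mentioned in node/arc lines.
    ids = set()
    for line in lines:
        parts = line.split()
        if parts[0] == "n":          # "n <id> <supply>"
            ids.add(int(parts[1]))
        elif parts[0] == "a":        # "a <src> <dst> <lo> <hi> <cost>"
            ids.update((int(parts[1]), int(parts[2])))
    dense = {old: new for new, old in enumerate(sorted(ids), start=1)}

    # Second pass: rewrite the lines using the dense IDs.
    out = []
    for line in lines:
        parts = line.split()
        if parts[0] == "p":          # "p min <num_nodes> <num_arcs>"
            parts[2] = str(len(dense))
        elif parts[0] == "n":
            parts[1] = str(dense[int(parts[1])])
        elif parts[0] == "a":
            parts[1] = str(dense[int(parts[1])])
            parts[2] = str(dense[int(parts[2])])
        out.append(" ".join(parts))
    inverse = {new: old for old, new in dense.items()}
    return out, inverse
```

For example, `remap_dimacs(["p min 9 1", "n 2 1", "n 9 -1", "a 2 9 0 1 5"])` rewrites the problem line to `p min 2 1` and renumbers nodes 2 and 9 to 1 and 2.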

See patch in CL 237749.

Best Way to schedule multiple workers?

I notice in the documentation you've written:

To use Firmament across multiple machines, you need to run a coordinator instance on each machine. 
These coordinators can then be arranged in a tree hierarchy, in which each coordinator can schedule 
tasks locally and on its subordinate childrens' resources.

Yet you've also written:

The parent coordinator must already be running. Once both coordinators are up, you will be able to 
see the child resources on the parent coordinator's web UI.

The first passage implies that we can create multiple coordinators (e.g. three), yet the second seems to imply that only two coordinators can be running.

Starting three coordinators in the pattern:

tcp:10.36.75.73:8000 (no parent uri)
tcp:10.36.65.78:8000 (--parent-uri tcp:10.36.75.73:8000 )
tcp:10.36.71.204:8000 (--parent-uri tcp:10.36.75.73:8000 )

Yet the topology map shows only two hosts. I assume this is due to the project being in "alpha stage" as mentioned, which is totally understandable. I just want to make sure I'm not going crazy.

EventDrivenScheduler should not use executors.

The EventDrivenScheduler instantiates, manages, and calls executors. However, the scheduling logic should only make scheduling decisions; it should be agnostic of the types of executors needed to run the tasks.

Overcommit due to max flow priority over min cost by solver

We ran some tests with the CPU memory cost model and observed a scenario in which Firmament overcommits a pod to a resource. After analysing the output flow graph from the solver (CS2), we found that the solver gives maximum flow priority over minimum cost when it has to choose between the two. In the process, it overcommits a pod to a resource, and the pod fails with an "OutOfcpu" status. If minimum cost were given higher priority than maximum flow, the overcommit issue could be avoided.

As this is part of the solver implementation, can you please suggest whether there is any way to make the solver solve the flow graph while giving minimum cost priority over maximum flow?

@ms705 @ICGog

Circular dependency between scheduler unit tests and cost models

We still have a circular dependency in the build system, unfortunately: the cost_models target depends on scheduling, but scheduling contains some unit tests that require libfirmament_cost_models.a.

To reproduce, do a make clean and build:

[...]
  TESTLNK /home/malte/Projects/firmament/build/tests/scheduling/dimacs_exporter_test
  TESTLNK /home/malte/Projects/firmament/build/tests/scheduling/flow_graph_test
clang: error: no such file or directory: '/home/malte/Projects/firmament/build/scheduling/cost_models/libfirmament_cost_models.a'
clang: error: no such file or directory: '/home/malte/Projects/firmament/build/scheduling/cost_models/libfirmament_cost_models.a'
[...]

Possible solutions:

  • Bring cost models back to scheduling (not really a good answer)
  • Break out the parts of scheduling that cost_models depends on and build them as a separate module.
  • Explicitly specify the .o files that the unit tests depend on, rather than going via libfirmament_cost_models.a.

How can I submit tasks and nodes to Firmament?

I have some questions about running Firmament. Could you help me?

  • Question 1: What does the parameter --task-lib-dir mean?

docker run camsas/firmament:dev /firmament/build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:<host>:<port> --task_lib_dir=$(pwd)/build/src

  • Question 2: How do I submit task and node info?

    Our scenario is as follows: (1) T1 and T2 are two tasks; (2) N1 and N2 are two nodes. I just want to know whether the flow scheduler can produce a globally optimized result if T1 has already been scheduled on N1.

scenario

Failed to run the simulator

Hi, I am trying to run the simulation with a synthetic trace, but it is failing.
I have tried both solvers (cs2 and flowlessly) and cost models 0 and 6.
The general configuration I used is shown below:

build-release/src/simulator \
--simulation=synthetic \
--synthetic_num_jobs=100 \
--synthetic_num_machines=10 \
--synthetic_machine_failure_duration=0 \
--synthetic_task_duration=2 \
--synthetic_tasks_per_job=2 \
--runtime=100000000000 \
--scheduler=flow \
--flow_scheduling_cost_model=6 \
--preemption \
--simulated_dfs_type=bounded \
--simulated_block_size=1073741824 \
--max_sample_queue_size=10 \
--solver=cs2 \
--log_solver_stderr \
--max_solver_runtime=100000000000 \
--machine_tmpl_file=../../tests/testdata/mach_16pus.pbin \
--generate_trace \
--generated_trace_path=firmament/results/simu-release/trace-path \
--generate_quincy_cost_model_trace \
--log_dir=firmament/results/simu-release/log \
--quincy_no_scheduling_delay \
--online_factor 1 -v 10


For cost model 0, it fails with the following error:
F0425 18:59:30.177265 24372 trivial_cost_model.cc:139] Check failed: leaf_res_ids_->size() >= FLAGS_num_pref_arcs_task_to_res (0 vs. 1)
The traces:
results/simu-release/trace-path/task_events/part-00000-of-00500.csv
1000000,,1,1,,0,,,,,,,
results/simu-release/trace-path/machine_events/part-00000-of-00001.csv
0,1,0,,,
0,2,0,,,
0,3,0,,,
0,4,0,,,
0,5,0,,,
0,6,0,,,
0,7,0,,,
0,8,0,,,
0,9,0,,,
0,10,0,,,


For cost model 6, it fails with the following error:
*** Error in `firmament/build-release/src/simulator': corrupted double-linked list: 0x000000000133bdb0 ***
The log says:
W0425 18:59:58.946012 24390 trace_generator.cc:264] 100% of tasks are unscheduled
results/simu-release/trace-path/scheduler_events/scheduler_events.csv
1000000,388,0,930,2,0,0,2,2,10,16,0,25,1,0,2,11,1,1,1,2,0,10,0,10,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1000388,358,0,934,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1000746,459,0,1085,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1001205,460,0,1067,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
etc...
results/simu-release/trace-path/task_events/part-00000-of-00500.csv
1000000,,1,1,,0,,,,,,,
1000000,,1,2,,0,,,,,,,
2000000,,2,1,,0,,,,,,,
2000000,,2,2,,0,,,,,,,
3000000,,3,1,,0,,,,,,,
3000000,,3,2,,0,,,,,,,

results/simu-release/trace-path/machine_events/part-00000-of-00001.csv
0,1,0,,,
0,2,0,,,
0,3,0,,,
0,4,0,,,
0,5,0,,,
0,6,0,,,
0,7,0,,,
0,8,0,,,
0,9,0,,,
0,10,0,,,
1012484,10,1,,,
1012484,10,0,,,


The COCO model generates at least some traces, but the TRIVIAL model generates almost none. I am most probably missing some configuration. Could you please help me?

Thanks!

Integrating affinity/anti-affinity aware policy into firmament

Kubernetes already supports inter-pod affinity/anti-affinity in its native scheduler; however, this important feature cannot be found in Firmament. Another limitation is task preemption support: as of now, TaskPreemptionCost is still empty.

So when will affinity/anti-affinity, as well as task preemption, be supported by Firmament? And, if possible, could a technical report be delivered first?

Resource allocation of firmament

Hi:

As far as I know, there are two ways to allocate resources:

  1. Coarse-grained: partition a machine into fixed-size slots, where every slot can run one task (e.g. Hadoop).
  2. Fine-grained resource allocation, like Borg (Borg users request CPU in units of milli-cores, and memory and disk space in bytes).

I have seen that both your work and Quincy use a constant integer K to represent the capacity of a machine, like coarse-grained allocation. But there is some fine-grained resource information in the cost model.
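To make the contrast concrete, here is a toy sketch of the two admission checks (illustrative Python; the field names are invented and this is not Firmament's actual resource model):

```python
# Coarse-grained: a machine exposes K identical slots; a task costs one slot.
def slot_fits(used_slots, K):
    return used_slots + 1 <= K

# Fine-grained (Borg-style): requests are multi-dimensional vectors, and a
# task fits only if every dimension has enough headroom.
def vector_fits(used, request, capacity):
    return all(used[d] + request[d] <= capacity[d] for d in capacity)

# A task needing 0.5 cores / 2 GiB is rejected here because the CPU
# dimension is nearly exhausted, even though plenty of RAM remains.
capacity = {"cpu_cores": 4.0, "ram_bytes": 8 << 30}
used = {"cpu_cores": 3.8, "ram_bytes": 1 << 30}
request = {"cpu_cores": 0.5, "ram_bytes": 2 << 30}
print(vector_fits(used, request, capacity))  # → False
```

Under the slot model, the same machine would happily accept the task as long as a slot was free, which is exactly what makes slot counts (a single K) coarser than vector requests.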

I want to know

  1. How does Firmament represent the resources requested by a task and the resources owned by a machine?
  2. What is the physical meaning of capacity, and how do you get the value of K?

Submitting >1 job crashes the coordinator

This is due to all root tasks currently having the same name, as it is based on hashing the creating task ID (which is zero for the root task on job submission).

The engine, however, assumes that it has never seen a task before, and will CHECK-fail if it sees the same TaskID arriving again.

Required fixes:

  1. Change hashing scheme such that different jobs' root tasks have different names.
  2. Sensibly deal with resubmissions of the same task:
    • no-op if it has already finished and its outputs exist
    • no-op (wait) if it is currently running
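Fix 1 could be sketched roughly as follows (illustrative Python, not the actual C++ task-naming code; the helper name is invented): mix the job ID into the hash so that two jobs' root tasks, which both have a creating task ID of zero, no longer collide.

```python
import hashlib

def root_task_id(job_id, creating_task_id=0):
    # Mix the job ID into the hash so two jobs' root tasks (which both
    # have creating_task_id == 0) hash to distinct names.
    digest = hashlib.sha256(f"{job_id}/{creating_task_id}".encode()).digest()
    # Truncate to a 64-bit task ID, as a flow-graph node ID would need.
    return int.from_bytes(digest[:8], "big")

assert root_task_id("job-A") != root_task_id("job-B")
```

The scheme stays deterministic, so resubmitting the same job still produces the same root task ID, which is what makes the "no-op on resubmission" handling in fix 2 possible.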

Some questions about the trace simulator

@ms705 Hi, I read your OSDI paper and see that you used the Google trace workload to test the Firmament scheduler's task placement latency and job response time. I also see that the trace data CSV files include job and task metadata, so my questions are:

  1. Can the Google trace data be used to test task/job finish times on one machine? Is my understanding right that the simulator is just a process that simulates the scheduler and cannot simulate task/job execution times?
  2. I saw your simulator source code; can a new scheduler (e.g. the Mesos scheduler) be added to your simulator by extending src/sim/simulator_bridge.cc?
  3. So, if I want to test a new scheduler's (e.g. Mesos's) task placement latency, do I have to write simulator code for that scheduler, or can I just use your existing simulator to test its performance? If there are more details on how to write a new simulator for one's own scheduler using the Google trace data, please let me know.

Thank you very much!

Parallel build using make -j fails

There's something wrong with dependencies via intermediate libfirmament_*.a files; when building in parallel mode (e.g. make -j 8), all .o files get made correctly, but linking via the .a files fails as they do not (yet?) exist.

Should investigate this.

[build] ext build target marked as complete even when it fails

If you run make ext and it fails, for example because of missing dependencies, the build system seems to believe the step has completed successfully. If I run make ext again, even without having made any changes, I get:
make: 'ext' is up to date.

I am running Ubuntu 11.10 with GNU make 3.81.
