loghub's Introduction

Loghub

Loghub maintains a collection of system logs, which are freely accessible for AI-driven log analytics research. Some of the logs are production data released from previous studies, while some others are collected from real systems in our lab environment. Wherever possible, the logs are NOT sanitized, anonymized or modified in any way. These log datasets are freely available for research or academic work.

🤗 We proudly announce that the loghub datasets have been downloaded by more than 450 organizations from both industry and academia.

Logs currently available

🔗 Get raw logs via hyperlinks in the Download column.

Dataset | Description | Labeled | Time Span | #Lines | Raw Size | Download
📂 Distributed systems
HDFS_v1 | Hadoop distributed file system log | ✔️ | 38.7 hours | 11,175,629 | 1.47GB | 🔗
HDFS_v2 | Hadoop distributed file system log | | N.A. | 71,118,073 | 16.06GB | 🔗
HDFS_v3 | Instrumented HDFS trace log (TraceBench) | ✔️ | N.A. | 14,778,079 | 2.96GB | 🔗
Hadoop | Hadoop mapreduce job log | ✔️ | N.A. | 394,308 | 48.61MB | 🔗
Spark | Spark job log | | N.A. | 33,236,604 | 2.75GB | 🔗
Zookeeper | ZooKeeper service log | | 26.7 days | 74,380 | 9.95MB | 🔗
OpenStack | OpenStack infrastructure log | ✔️ | N.A. | 207,820 | 58.61MB | 🔗
📂 Super computers
BGL | Blue Gene/L supercomputer log | ✔️ | 214.7 days | 4,747,963 | 708.76MB | 🔗
HPC | High performance cluster log | | N.A. | 433,489 | 32.00MB | 🔗
Thunderbird | Thunderbird supercomputer log | ✔️ | 244 days | 211,212,192 | 29.60GB | 🔗
📂 Operating systems
Windows | Windows event log | | 226.7 days | 114,608,388 | 26.09GB | 🔗
Linux | Linux system log | | 263.9 days | 25,567 | 2.25MB | 🔗
Mac | Mac OS log | | 7.0 days | 117,283 | 16.09MB | 🔗
📂 Mobile systems
Android_v1 | Android framework log | | N.A. | 1,555,005 | 183.37MB | 🔗
Android_v2 | Android framework log | | N.A. | 30,348,042 | 3.38GB | 🔗
HealthApp | Health app log | | 10.5 days | 253,395 | 22.44MB | 🔗
📂 Server applications
Apache | Apache web server error log | | 263.9 days | 56,481 | 4.90MB | 🔗
OpenSSH | OpenSSH server log | | 28.4 days | 655,146 | 70.02MB | 🔗
📂 Standalone software
Proxifier | Proxifier software log | | N.A. | 21,329 | 2.42MB | 🔗

🔥 Citation

Please cite the following paper if you use the loghub datasets in your research.

Publications using loghub datasets

Publication Paper Title
DSN'07 Adam J. Oliner, Jon Stearley. What Supercomputers Say: A Study of Five System Logs. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007.
SOSP'09 Wei Xu, Ling Huang, Armando Fox, David A. Patterson, Michael I. Jordan. Detecting Large-Scale System Problems by Mining Console Logs. ACM Symposium on Operating Systems Principles (SOSP), 2009.
KDD'09 Adetokunbo Makanju, A. Nur Zincir-Heywood, Evangelos E. Milios. Clustering Event Logs Using Iterative Partitioning. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
ISSRE'16 Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. Experience Report: System Log Analysis for Anomaly Detection. IEEE International Symposium on Software Reliability Engineering (ISSRE), 2016.
DSN'16 Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. An Evaluation Study on Log Parsing and Its Use in Log Mining. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016.
ICSE'16 Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen. Log Clustering Based Problem Identification for Online Service Systems. International Conference on Software Engineering (ICSE), 2016.
ICWS'17 Pinjia He, Jieming Zhu, Zibin Zheng, Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed Depth Tree. IEEE International Conference on Web Services (ICWS), 2017.
CCS'17 Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. ACM Conference on Computer and Communications Security (CCS), 2017.
TDSC'18 Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. Towards Automated Log Parsing for Large-Scale Log Data Analysis. IEEE Transactions on Dependable and Secure Computing (TDSC), 2018.
TKDE'18 Min Du, Feifei Li. Spell: Online Streaming Parsing of Large Unstructured System Logs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018.
ASE'19 Jinyang Liu, Jieming Zhu, Shilin He, Pinjia He, Zibin Zheng, Michael R. Lyu. Logzip: Extracting Hidden Structures via Iterative Clustering for Log Compression. IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019.
ICSE'19 Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. Tools and Benchmarks for Automated Log Parsing. International Conference on Software Engineering (ICSE), 2019.
ICSE'22 Zanis Ali Khan, Donghwan Shin, Domenico Bianculli, Lionel Briand. Guidelines for Assessing the Accuracy of Log Message Template Identification Techniques. International Conference on Software Engineering (ICSE), 2022.
ICSE'23 Van-Hoang Le, Hongyu Zhang. Log Parsing with Prompt-based Few-shot Learning. International Conference on Software Engineering (ICSE), 2023.
ICSE'23 Zhenhao Li, Chuan Luo, Tse-Hsun Chen, Weiyi Shang, Shilin He, Qingwei Lin, Dongmei Zhang. Did We Miss Something Important? Studying and Exploring Variable-Aware Log Abstraction. International Conference on Software Engineering (ICSE), 2023.
ICSE'23 Yintong Huo, Yuxin Su, Cheryl Lee, Michael R. Lyu. SemParser: A Semantic Parser for Log Analysis. International Conference on Software Engineering (ICSE), 2023.
WWW'23 Liming Wang, Hong Xie, Ye Li, Jian Tan, John C.S. Lui. Interactive Log Parsing via Light-weight User Feedback. ACM Web Conference, 2023.
TSC'23 Siyu Yu, Pinjia He, Ningjiang Chen, Yifan Wu. Brain: Log Parsing with Bidirectional Parallel Tree. IEEE Transactions on Services Computing (TSC), 2023.

💡 If you use loghub datasets in your paper, please feel free to make a PR to add your paper to the table.

Discussion

You are welcome to join our WeChat group for questions and discussion. Alternatively, you can open a discussion here.

🌈 License

The datasets are freely available for research or academic work. For any usage or distribution of the datasets, please refer to the loghub repository URL https://github.com/logpai/loghub and cite the loghub paper where applicable.

loghub's People

Contributors

pinjiahe, shilinhe, zhujiem

loghub's Issues

WindowsEvent label

I need to mine Windows events to extract the events they contain.
Unfortunately, your Windows events come without labels.
Do you have any metadata about them?
Or do you know of any paper that used these events and prepared labels or metadata for them?

Thanks

IPLoM throws IndexError exception

This only happens when processing Android logs with more than 10M.
The exception causes IPLoM to finish much faster than usual, so the rest of the logs may not have been processed.

Multi-class resources availability

There are some companies like Zebrium around saying that they get an AI to do root cause analysis of logs. I think root cause analysis is not just about splitting the data between normal and abnormal. But I can only find binary classification data. Is there any multi-class data?

Anomaly Items in Dataset?

Hi,

Can I ask whether there is any other dataset including clear abnormal items inside logs except HDFS? Thanks.

I notice that BGL is a labeled dataset, but I could not find clear labels for it in the hub.

Please let me know more details.

Thanks

Redistribution rights of derivatives generated from loghub's free datasets

Hi,

We need to generate a synthetic dataset for an experiment in our upcoming research work. To give any future work in this direction a fair ground for comparison, we want to make the dataset available for download. We are using the sample logs from this repository to generate these logs. However, since no licensing information is available for the sample logs, I wanted to know whether we can host this synthetic dataset in our GitHub repository, or whether loghub can help host it.

Thank you.

What is the rationale for choosing CBS logs as the dataset?

I have done some research on how to examine event logs for digital forensics. I found that security logs are the most closely related to hacker attacks.
When an unauthenticated access or login happens, a record appears in the security logs, while nothing shows up in the CBS logs.

So I doubt whether this kind of dataset is really useful for anomaly detection.

Need help understanding the HDFS_2 log file structure

2016-01-13 07:48:28,240 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x13de8a8372744c, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 1 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
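The line above looks like a common Log4j-style layout: a timestamp with milliseconds, a log level, the fully qualified logger class, and then the free-text message. The following is a minimal sketch under that assumption, not a confirmed specification of the dataset. The function name parse_hdfs2_line and the field names (timestamp, level, component, message) are hypothetical labels chosen for illustration.

```python
import re
from datetime import datetime

# Assumed layout (not confirmed by the dataset authors):
#   "<yyyy-mm-dd> <HH:MM:SS>,<millis> <LEVEL> <logger class>: <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>[^\s:]+): "
    r"(?P<message>.*)$"
)


def parse_hdfs2_line(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # "%f" accepts the three millisecond digits and pads them to microseconds.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return fields
```

Lines that do not match the assumed layout (for example, continuation lines of a stack trace) return None, so a caller can decide whether to skip them or attach them to the previous record.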

Feature (Column) Names of HDFS_1 and HDFS_2 Files

Can you give me information about feature names of the files mentioned in the title? For example

081109 203518 143 INFO dfs.DataNode$DataXceiver: Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010

What does the above row say?
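One plausible reading of the row above is: a date in yymmdd form, a time in HHMMSS form, a numeric process or thread ID, the log level, the component that emitted the line, and the message content. The sketch below splits a line along those lines. The column names (pid, level, component, content), the yymmdd interpretation, and the function name parse_hdfs1_line are assumptions for illustration rather than an official schema.

```python
import re
from datetime import datetime

# Assumed layout: "<yymmdd> <HHMMSS> <pid> <LEVEL> <component>: <content>"
LINE_RE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{6}) (?P<pid>\d+) "
    r"(?P<level>[A-Z]+) (?P<component>[^\s:]+): (?P<content>.*)$"
)
# Block identifiers such as "blk_-1608999687919862906" (the sign is optional).
BLOCK_RE = re.compile(r"blk_-?\d+")


def parse_hdfs1_line(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Combine the date and time columns into a single datetime value.
    fields["timestamp"] = datetime.strptime(
        fields.pop("date") + fields.pop("time"), "%y%m%d%H%M%S"
    )
    fields["pid"] = int(fields["pid"])
    fields["block_ids"] = BLOCK_RE.findall(fields["content"])
    return fields
```

Under this reading, the example row says that at 20:35:18 on 2008-11-09 a DataNode transfer thread logged, at INFO level, that it was receiving block blk_-1608999687919862906 from 10.250.19.102:54106 to 10.250.19.102:50010.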

OpenStack Logs

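Hello Professor He,

I have recently been exploring log analysis and would like to understand in detail what each file in the OpenStack dataset is for.

Are all the instances in the openstack_abnormal file anomalous?

Is there any difference between the openstack_normal1 and openstack_normal2 logs?

Why does the labels file list only 4 instances with injected anomalies, when openstack_abnormal contains far more than 4 instances? What kind of anomalies do the other instances have? Is there anything special about the 4 labeled instances?

Many thanks!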

OpenSSH Logs.

I applied for access to the LogPai team because I was specifically interested in the OpenSSH logs for my work. Is there any way I can get access to them for my dissertation work? If not tons of logs, at least a few MBs would be great. Anything more than the 2K that is present. The help is much appreciated.

Encoding issue with Linux log

I am not sure whether this issue originates here, but when I try to read the current Linux.log (zenodo, md5:6d1802d7778126f21c001c6aa7b6b106) with Python, I get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 20: invalid start byte

Can you confirm that, or is something going wrong on my side?
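The byte 0xbf cannot start a UTF-8 sequence, so an error like this usually means the file contains some bytes that are not valid UTF-8. One way to keep reading is to open the file in binary mode and decode each line defensively. The sketch below tries UTF-8 first and falls back to Latin-1, which maps every byte to a character. The Latin-1 fallback is an assumption about a workable default, not a statement about the encoding the file actually uses, and the helper names decode_line and iter_decoded are hypothetical.

```python
def decode_line(raw):
    """Decode one raw line as UTF-8, falling back to Latin-1 on invalid bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 never fails; bytes such as 0xbf become characters like U+00BF.
        return raw.decode("latin-1")


def iter_decoded(raw_lines):
    """Yield decoded text lines from any iterable of byte strings."""
    for raw in raw_lines:
        yield decode_line(raw.rstrip(b"\r\n"))
```

A file handle opened with open(path, "rb") is an iterable of byte strings, so it can be passed straight to iter_decoded. If exact characters do not matter, opening the file in text mode with errors="replace" is a simpler alternative that substitutes U+FFFD for undecodable bytes.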

Can I redistribute some of the 2K sample logs?

Dear LogPai team,

Thank you for maintaining these wonderful log datasets.

I just wonder if it’s okay to redistribute a few of your 2K logs (e.g., https://github.com/logpai/loghub/blob/master/HDFS/HDFS_2k.log) just as an example log dataset in my replication package.

Though you kindly noted here that “the log datasets are freely available for research purposes”, it’s not clear to me if this includes the redistribution right of the log datasets. If possible, I will present a clear reference to this dataset repository and then include a few sample logs in my replication package as examples.

Looking forward to hearing from you soon.

Thanks,
Donghwan

Where can I get the Windows logs?

I requested the logs on Zenodo, but I did not find the Windows logs that I need. Where can I get them? Please let me know. Thanks a lot.

Make data available via Hugging Face Hub

Very cool dataset, and it is impressive to see how much use and impact it is having! It would be nice if it were also possible to access it via the Hugging Face Hub (https://huggingface.co/datasets). There are a few possible approaches to doing this (and it is also possible to have gated access). Happy to help with this if it is of interest!

Where is the complete HDFS.log raw file

Hey, I'm having trouble finding the complete file, because 2k lines is not enough data for the model I'm testing. Can you provide the link to the complete HDFS.log raw log file?

Hadoop log

Hello!

In LogHub, the Hadoop log dataset is divided into separate parts per application, each with its own ID, and every type is clearly labeled. In LogPub (LogHub 2.0), however, they are all mixed into a single file. I want to get just the WORDCOUNT application part, with the same file path format as in LogHub. Can you help me with that?

Thank you very much!
