
deep-packet's People

Contributors

k1m743hyun · munhouiani · trend-munhou-wong



deep-packet's Issues

Make a prediction

Dear owner,

In the project there are methods for preprocessing the data and for training and evaluating the model. However, it is unclear to me how data should be passed to perform inference (prediction) with the trained model, since in preprocessing app_label and traffic_labels are passed, which would not make sense in this case.

Could you give some tips, an example, or a Jupyter notebook showing how data should be passed to consume the model?

Thanks,
Álex.
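For reference, a minimal inference sketch. It assumes the checkpoint was produced by train_cnn.py and that preprocessing already yields a 1500-byte packet vector scaled to [0, 1]; the import path, checkpoint name, and input shape here are assumptions, not the project's confirmed API:

```python
# Hypothetical sketch; CNN and the checkpoint path stand in for the
# project's ml.model.CNN and the file written by train_cnn.py.
import numpy as np
import torch

from ml.model import CNN  # assumed import path

model = CNN.load_from_checkpoint("model/application_classification.cnn.model")
model.eval()

# One preprocessed packet: 1500 byte values scaled to [0, 1]. No
# app_label/traffic_labels are needed at inference time.
arr = np.zeros(1500, dtype=np.float32)
x = torch.from_numpy(arr).unsqueeze(0)  # add a batch dimension -> (1, 1500)

with torch.no_grad():
    pred = model(x).argmax(dim=1).item()  # predicted label id
print(pred)
```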

About missing .pcap file

I downloaded the original files from the ISCXVPN2016 link, but found that the "torrent01" pcap file is not included. Could you share that pcap file?

Balance the train and test sets

If we have almost the same number of packets for every label, can we skip the undersampling?
And how can we get roughly equal counts not only in the train set, but also in the test set?
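One way to keep both splits balanced (a pandas/scikit-learn sketch, not the repo's Spark pipeline; a DataFrame `df` with a `label` column is assumed): undersample every class to the minority count first, then split with stratification so the ratio holds per label.

```python
# Sketch: balance first, then stratified split, so train AND test stay balanced.
import pandas as pd
from sklearn.model_selection import train_test_split

def balanced_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 0):
    n_min = df["label"].value_counts().min()
    balanced = (
        df.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(n=n_min, random_state=seed))
    )
    return train_test_split(
        balanced, test_size=test_size,
        stratify=balanced["label"], random_state=seed,
    )
```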

Label question

preprocessing.py derives the app label and traffic label from the filename prefix.
However, the prefixes in PREFIX_TO_APP_ID and PREFIX_TO_TRAFFIC_ID do not fully cover the whole dataset.
Does that mean some pcap files only take part in one of the two classification tasks?

About installing some Python libraries

My machine runs Windows 10. I got installation errors when installing pytorch_lightning and petastorm. During preprocessing, running python preprocessing.py -s D:\DP\deeppacket\ISCX数据集\CompletePcap -t processed_data -n 1 on the command line also failed partway through (D:\DP\deeppacket\ISCX数据集\CompletePcap is the path to the complete dataset; processed_data is an empty folder in the same directory as preprocessing.py), leaving some parquet files in processed_data. I don't know whether this is related to the two libraries failing to install or to my own machine.

Some questions about the code

Hello, I am also preparing to reproduce the Deep Packet paper, and this project has been a great help to me. While running your code, a few questions came up that I would like to discuss.
In preprocessing.py: for arr = np.pad(arr, pad_width=(0, pad_width), mode='constant', constant_values=0), my environment reported that mode= cannot be omitted, so I added it; I am not sure whether that is correct. For Parallel(n_jobs=njob), with the default n_jobs=-1 my CPU and memory usage stayed at 100% and after a while I got a MemoryError; limiting it to 4 or 6 jobs avoided the error.
In ml/model.py, training_step() contains y_hat = self(x); my understanding is that this is the same as y_hat = self.forward(x). Is that correct?
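A small self-contained sketch of both points (nothing here is project code):

```python
import numpy as np
import torch
from torch import nn

# 1. np.pad: `mode` defaults to 'constant' in current NumPy, so passing it
#    explicitly is redundant but harmless and portable across versions.
arr = np.pad(np.arange(3), pad_width=(0, 5), mode="constant", constant_values=0)
print(arr)  # [0 1 2 0 0 0 0 0]

# 2. self(x) vs. self.forward(x): calling the module goes through
#    nn.Module.__call__, which runs hooks and then forward(), so the
#    computed output is the same. Prefer self(x) so hooks still fire.
class Tiny(nn.Module):
    def forward(self, x):
        return x * 2

m = Tiny()
x = torch.ones(2)
assert torch.equal(m(x), m.forward(x))
```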

Error when running train_cnn.py

When I tried to train the model on my own dataset, I got the following error:
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep-packet/train_cnn.py", line 25, in main
train_application_classification_cnn_model(data_path, model_path,cls_num)
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep-packet/ml/utils.py", line 289, in train_application_classification_cnn_model
train_cnn(
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep-packet/ml/utils.py", line 230, in train_cnn
trainer.fit(model)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
self.on_run_start(*args, **kwargs)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 206, in on_run_start
self.trainer.reset_train_dataloader(self.trainer.lightning_module)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1515, in reset_train_dataloader
self.train_dataloader = self._data_connector._request_dataloader(RunningStage.TRAINING)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 446, in _request_dataloader
dataloader = source.dataloader()
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 520, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep-packet/ml/utils.py", line 111, in train_dataloader
dataset_dict = datasets.load_dataset(self.hparams.data_path)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/load.py", line 1769, in load_dataset
ds = builder_instance.as_dataset(split=split, ignore_verifications=ignore_verifications, in_memory=keep_in_memory)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/builder.py", line 1066, in as_dataset
datasets = map_nested(
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 444, in map_nested
mapped = [
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 445, in
_single_map_nested((function, obj, types, None, True, None))
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
return function(data_struct)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/builder.py", line 1097, in _build_single_dataset
ds = self._as_dataset(
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/builder.py", line 1168, in _as_dataset
dataset_kwargs = ArrowReader(cache_dir, self.info).read(
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/arrow_reader.py", line 239, in read
return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/arrow_reader.py", line 260, in read_files
pa_table = self._read_files(files, in_memory=in_memory)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/arrow_reader.py", line 203, in _read_files
pa_table = concat_tables(pa_tables) if len(pa_tables) != 1 else pa_tables[0]
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/table.py", line 1778, in concat_tables
return ConcatenationTable.from_tables(tables, axis=axis)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/table.py", line 1484, in from_tables
return cls.from_blocks(blocks)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/table.py", line 1427, in from_blocks
table = cls._concat_blocks(blocks, axis=0)
File "/home/inspur/anaconda3/envs/mulmodel/lib/python3.8/site-packages/datasets/table.py", line 1373, in _concat_blocks
return pa.concat_tables(pa_tables, promote=True)
File "pyarrow/table.pxi", line 5120, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field feature has incompatible types: list<element: double> vs list<item: double>
0%| | 0/1 [00:01<?, ?it/s]
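The ArrowInvalid above is a schema mismatch between parquet part files: one writer named the list child `element`, another named it `item`. One possible workaround (a sketch; the directory name is illustrative) is to rewrite the parts against a single schema before loading:

```python
# Sketch: cast every parquet part to the schema of the first one so
# pyarrow can concatenate them.
import pathlib
import pyarrow.parquet as pq

parts = sorted(pathlib.Path("train.parquet").glob("*.parquet"))
schema = pq.read_schema(parts[0])
for p in parts[1:]:
    table = pq.read_table(p)
    if table.schema != schema:
        pq.write_table(table.cast(schema), p)
```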

About train_cnn

On Windows 10 with Python 3.9, running train_cnn on the downloaded train_test_data raises the following error:
Traceback (most recent call last):
File "E:\Deep-Packet-master\train_cnn.py", line 28, in
main()
File "D:\py\python3.9\lib\site-packages\click\core.py", line 1128, in call
return self.main(*args, **kwargs)
File "D:\py\python3.9\lib\site-packages\click\core.py", line 1053, in main
rv = self.invoke(ctx)
File "D:\py\python3.9\lib\site-packages\click\core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\py\python3.9\lib\site-packages\click\core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "E:\Deep-Packet-master\train_cnn.py", line 20, in main
train_application_classification_cnn_model(data_path, model_path, gpu)
File "E:\Deep-Packet-master\ml\utils.py", line 42, in train_application_classification_cnn_model
train_cnn(c1_kernel_size=4, c1_output_dim=200, c1_stride=3, c2_kernel_size=5, c2_output_dim=200, c2_stride=1,
File "E:\Deep-Packet-master\ml\utils.py", line 32, in train_cnn
model = CNN(hparams).float()
File "E:\Deep-Packet-master\ml\model.py", line 15, in init
self.hparams = hparams
File "D:\py\python3.9\lib\site-packages\torch\nn\modules\module.py", line 1178, in setattr
object.setattr(self, name, value)
AttributeError: can't set attribute
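In recent PyTorch Lightning versions, `hparams` is a read-only property, which is what makes `self.hparams = hparams` raise this AttributeError. A sketch of the usual fix (the surrounding class is abbreviated):

```python
import pytorch_lightning as pl

class CNN(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        # self.hparams = hparams           # fails on recent Lightning
        self.save_hyperparameters(hparams)  # populates self.hparams instead
```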

training fails on VPN dataset with a ValueError

I see `ValueError: Please pass features or at least one example when writing data` at the end of train_cnn when running on the VPN dataset. I have not modified the code. I first hit a NaN error and set under_sampling to False; then I encountered this one.

Here is the detailed output:
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch.

| Name | Type | Params

0 | conv1 | Sequential | 1.0 K
1 | conv2 | Sequential | 200 K
2 | max_pool | MaxPool1d | 0
3 | fc1 | Sequential | 9.9 M
4 | fc2 | Sequential | 20.1 K
5 | fc3 | Sequential | 5.0 K
6 | out | Linear | 867

10.1 M Trainable params
0 Non-trainable params
10.1 M Total params
40.430 Total estimated model params size (MB)
Using custom data configuration train.parquet-2c3be5e9d214c057
Downloading and preparing dataset parquet/train.parquet to /home/rvn/.cache/huggingface/datasets/parquet/train.parquet-2c3be5e9d214c057/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 3663.15it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 565.27it/s]
Traceback (most recent call last):
File "/home/rvn/Deep-Packet/train_cnn.py", line 33, in
main()
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/rvn/Deep-Packet/train_cnn.py", line 25, in main
train_application_classification_cnn_model(data_path, model_path)
File "/home/rvn/Deep-Packet/ml/utils.py", line 117, in train_application_classification_cnn_model
train_cnn(
File "/home/rvn/Deep-Packet/ml/utils.py", line 58, in train_cnn
trainer.fit(model)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 195, in run
self.on_run_start(*args, **kwargs)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 211, in on_run_start
self.trainer.reset_train_dataloader(self.trainer.lightning_module)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1812, in reset_train_dataloader
self.train_dataloader = self._data_connector._request_dataloader(RunningStage.TRAINING)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 453, in _request_dataloader
dataloader = source.dataloader()
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 526, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/rvn/Deep-Packet/ml/model.py", line 101, in train_dataloader
dataset_dict = datasets.load_dataset(self.hparams.data_path)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/load.py", line 1698, in load_dataset
builder_instance.download_and_prepare(
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/builder.py", line 807, in download_and_prepare
self._download_and_prepare(
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/builder.py", line 898, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/builder.py", line 1516, in _prepare_split
num_examples, num_bytes = writer.finalize()
File "/home/rvn/miniconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/arrow_writer.py", line 559, in finalize
raise ValueError("Please pass features or at least one example when writing data")
ValueError: Please pass features or at least one example when writing data
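This ValueError is what `datasets` raises when the parquet split it is asked to build contains zero rows. A quick sanity check before training (a sketch; the path is illustrative):

```python
# Sketch: count the rows in the split before handing it to load_dataset.
import pathlib
import pyarrow.parquet as pq

split = pathlib.Path("train_test_data/application_classification/train.parquet")
total = sum(pq.read_metadata(p).num_rows for p in split.glob("*.parquet"))
print(total)  # 0 reproduces "Please pass features or at least one example"
```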

Question about data labels

Hello, thank you very much for reproducing deep-packet.
While using your preprocess.py, I noticed that when labeling the traffic data by application (app), only Non-VPN entries are included, e.g.

AIM chat

'aim_chat_3a': 0,
'aim_chat_3b': 0,
'aimchat1': 0,
'aimchat2': 0,

But the original ISCXVPN2016 dataset also contains entries such as vpn_aim_chat. Is this VPN traffic intentionally excluded from the app classification task?

Looking forward to your reply.

Approach flawed if ports left in dataset

I just stumbled upon this repository, and it looks like only IP information is removed from the dataset, while port numbers are kept. Are you aware of this? I find it hard to believe that the model actually learned anything beyond mapping network flows (identified by their ports) to traffic classes.
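If ports do need to be removed, here is a hypothetical scapy sketch of masking them before feature extraction (the file name and helper are illustrative, not the repo's preprocessing code):

```python
# Sketch: zero out TCP/UDP ports so the model cannot key on them.
from scapy.all import TCP, UDP, rdpcap

def mask_ports(packet):
    for layer in (TCP, UDP):
        if packet.haslayer(layer):
            packet[layer].sport = 0
            packet[layer].dport = 0
    return packet

packets = [mask_ports(p) for p in rdpcap("capture.pcap")]
```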

GPUs requested but none are available.

My machine has a GeForce GTX 1080Ti GPU, and I set up the local environment following the Dockerfile. train_cnn.py runs fine without the GPU, but fails as soon as the GPU is enabled:

pytorch_lightning.utilities.exceptions.MisconfigurationException: GPUs requested but none are available.

The NVIDIA driver, CUDA, and cuDNN are all installed.
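A quick way to confirm whether PyTorch itself can see the GPU inside the container (a diagnostic sketch; if the container was not started with GPU access, e.g. via the NVIDIA container toolkit, Lightning will raise exactly this error):

```python
import torch

print(torch.__version__)
print(torch.version.cuda)         # None means a CPU-only PyTorch build
print(torch.cuda.is_available())  # False here explains the Lightning error
```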

cannot convert float NaN to integer

Dear owner,

I wanted to test the code, so I ran python create_train_test_set.py -s processed_data -t train_test_data after python preprocessing.py -s data/VPN-PCAPS-01/ -t processed_data. I used the VPN-PCAPS-01.zip dataset from the README.

I got a cannot convert float NaN to integer error at line 57: min_label_count = int(label_count_df["count"].min()). Of course I can change the line to min_label_count = int(label_count_df["count"].min()) if not label_count_df["count"].empty else 0, but I understand this means the packet data is empty and the script would just produce empty output.

If I run the script as python create_train_test_set.py -s processed_data -t train_test_data --under_sampling False (1), then I am able to get some results:

   label   count
0      7   15991
1      6  127016
2      9  269115
3      5   40164
4     10  900984

However, with the model trained on the train_test_data output of (1), I get the error Please pass features or at least one example when writing data in evaluation_cnn.ipynb, cell 6.

How should I run python create_train_test_set.py -s processed_data -t train_test_data after preprocessing.py? The instructions in the current README are incomplete.

Also, it's not clear to me how you assign labels to pcap packets. I saw that you use prefix = path.name.split(".")[0].lower(). I'm not sure that using the file name to set the class for NN training is a good idea. Why don't you use an IP list? How can I verify that the model works correctly if I can't check the labels by IP?

Thanks.

Pre-trained models link is not available

Hi! First, thank you very much for this implementation :D
I'm trying to download the pre-trained models, but the link is not working :( Drive shows "The file you requested does not exist.".

Runtime error

model/application_classification.cnn.model — this file is not found in the project. Do I need to produce it myself?

KeyError: 'length' when running train_cnn.py

Traceback (most recent call last):
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep_pac/train_cnn.py", line 35, in
main(data_dir + 'etc/train_test_data/train.parquet','model/etc.cnn.model',task="app",cls_num=8)
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep_pac/train_cnn.py", line 23, in main
train_application_classification_cnn_model(data_path, model_path,cls_num)
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep_pac/ml/utils.py", line 290, in train_application_classification_cnn_model
train_cnn(
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep_pac/ml/utils.py", line 231, in train_cnn
trainer.fit(model)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 195, in run
self.on_run_start(*args, **kwargs)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 211, in on_run_start
self.trainer.reset_train_dataloader(self.trainer.lightning_module)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1812, in reset_train_dataloader
self.train_dataloader = self._data_connector._request_dataloader(RunningStage.TRAINING)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 453, in _request_dataloader
dataloader = source.dataloader()
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 526, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/inspur/lvzhuo/metrics learning/MultiModel_Plus/compare/deep_pac/ml/utils.py", line 112, in train_dataloader
dataset_dict = datasets.load_dataset(self.hparams.data_path)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/load.py", line 1675, in load_dataset
builder_instance = load_dataset_builder(
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/load.py", line 1478, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/builder.py", line 347, in init
self.info = DatasetInfo.from_directory(self._cache_dir)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/info.py", line 284, in from_directory
return cls.from_dict(dataset_info_dict)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/info.py", line 289, in from_dict
return cls(
{k: v for k, v in dataset_info_dict.items() if k in field_names})
File "", line 20, in init
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/info.py", line 145, in post_init
self.features = Features.from_dict(self.features)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/features/features.py", line 1597, in from_dict
obj = generate_from_dict(dic)
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/features/features.py", line 1280, in generate_from_dict
return {key: generate_from_dict(value) for key, value in obj.items()}
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/features/features.py", line 1280, in
return {key: generate_from_dict(value) for key, value in obj.items()}
File "/home/inspur/anaconda3/envs/deep_packet/lib/python3.10/site-packages/datasets/features/features.py", line 1284, in generate_from_dict
return Sequence(feature=generate_from_dict(obj["feature"]), length=obj["length"])
KeyError: 'length'
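The KeyError: 'length' often points at cached dataset metadata written by a different `datasets` version than the one now installed. A sketch of forcing a clean rebuild by clearing the cache (this wipes all cached HF datasets, so use with care):

```python
# Sketch: remove the default HF datasets cache so load_dataset rebuilds it.
import shutil
from pathlib import Path

cache = Path.home() / ".cache" / "huggingface" / "datasets"
shutil.rmtree(cache, ignore_errors=True)
```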

Provided train_test_set is not correct

There is a statement, "For each of the application and traffic classification tasks, the dataset is first stratified split into train set and test set with the ratio of 80:20", in the blog post https://blog.munhou.com/2020/04/05/Pytorch-Implementation-of-Deep-Packet-A-Novel-Approach-For-Encrypted-Tra%EF%AC%83c-Classi%EF%AC%81cation-Using-Deep-Learning/.
But in fact, for the dataset provided at
https://drive.google.com/file/d/1EF2MYyxMOWppCUXlte8lopkytMyiuQu_/view?usp=sharing
the ratio is 20:80, so the test set is much bigger than the train set.

About the missing data set categories

Hi.

I am trying to use ISCXVPN2016 for data preprocessing and segmentation of training and test sets. But ISCXVPN2016 does not seem to have a torrent01 item.

So I downloaded your processed dataset, but when I checked the numbers, I found the following label distributions in your dataset (category classification):

    label  count                                                                
0       0  12731
1       7  12731
2       6  12731
3       5  12731
4       1  12731
5      10  12731
6       3  12731
7       8  12731
8      11  12731
9       2  12731
10      4  12731

    label    count
0       0    23990
1       7    25344
2       6     8480
3       5   958956
4       1    13582
5      10    53498
6       3    18473
7       8     3260
8      11   179758
9       2  1236595
10      4    14258

It looks like there are only 11 categories instead of 12. May I ask: is this a mistake on my part?

Error calling Spark when running in Docker

Hello. An error occurs when generating the train and test sets.
I don't see a JDK installed in the Docker image. Could you provide JDK installation steps, or a new image?
python create_train_test_set.py -s processed_data -t train_test_data
JAVA_HOME is not set
Traceback (most recent call last):
File "create_train_test_set.py", line 185, in
main()
File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "create_train_test_set.py", line 153, in main
.config('spark.driver.host', '127.0.0.1')
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/conda/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/conda/lib/python3.7/site-packages/pyspark/context.py", line 133, in init
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/opt/conda/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/opt/conda/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/opt/conda/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
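PySpark launches a Java gateway process, so a JDK must be installed and visible. A sketch of pointing the script at one before the SparkSession is built (the JDK path is illustrative; adjust to wherever your JDK lives):

```python
# Sketch: set JAVA_HOME before SparkContext spawns the Java gateway.
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # adjust to your JDK
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.host", "127.0.0.1")
    .getOrCreate()
)
```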

SAE

The model in the original paper is a combination of an SAE and a CNN. However, there is no reference to the SAE in your code. May I ask where I can find the SAE-related code? Thank you for your reply.

under sampling

Hi, I notice that in your code undersampling is performed after the train/test split, and only on the train set. This can make the test set bigger than the train set. Is this a mistake, or is there a reason for doing it this way?

UDP padding

Hello,

I assume you pad UDP with the pad_udp(packet) function because the header-length difference between UDP and TCP is exactly 12 bytes (8 vs. 20), as sketched below.

However, TCP headers may also carry an options field of up to 40 additional bytes. As far as I can see, there is no check for this optional length. Am I overlooking something?

best regards
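For context, a sketch of the padding idea under discussion (scapy-based; it mirrors the Deep Packet paper's description of inserting 12 zero bytes after the 8-byte UDP header to match TCP's 20-byte base header, and indeed does nothing about TCP options):

```python
from scapy.all import UDP
from scapy.packet import Padding

def pad_udp(packet):
    """Insert 12 zero bytes after the UDP header (8 B) to align with TCP (20 B)."""
    if UDP in packet:
        layer_after = packet[UDP].payload.copy()  # keep the original payload
        pad = Padding(load=b"\x00" * 12)
        layer_before = packet.copy()
        layer_before[UDP].remove_payload()
        return layer_before / pad / layer_after
    return packet
```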

about: create_train_test_set.py

Why is test_size=0.2 set in create_train_test_set.py, yet the resulting split (train : test) is not 8:2? Moreover, the test set ends up much larger than the train set.

Error when running train_cnn.py

Traceback (most recent call last):
File "train_cnn.py", line 27, in
main()
File "/home/ubuntu/lxd-storage/qizhipeng/anaconda3/envs/torchEnv/lib/python3.7/site-packages/click/core.py", line 1128, in call
return self.main(*args, **kwargs)
File "/home/ubuntu/lxd-storage/qizhipeng/anaconda3/envs/torchEnv/lib/python3.7/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ubuntu/lxd-storage/qizhipeng/anaconda3/envs/torchEnv/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/lxd-storage/qizhipeng/anaconda3/envs/torchEnv/lib/python3.7/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "train_cnn.py", line 21, in main
train_traffic_classification_cnn_model(data_path, model_path, gpu)
File "/home/ubuntu/lxd-storage/qizhipeng/Deep-Packet/ml/utils.py", line 54, in train_traffic_classification_cnn_model
logger=logger)
File "/home/ubuntu/lxd-storage/qizhipeng/Deep-Packet/ml/utils.py", line 35, in train_cnn
model = CNN(hparams).float()
File "/home/ubuntu/lxd-storage/qizhipeng/Deep-Packet/ml/model.py", line 15, in init
self.hparams = hparams
File "/home/ubuntu/lxd-storage/qizhipeng/anaconda3/envs/torchEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 826, in setattr
object.setattr(self, name, value)
AttributeError: can't set attribute
I searched for a long time but could not find where the problem is; the workflow itself is correct. Do you know the cause?

Pre-trained model

Hello, the link to the pre-trained models has expired again. Could you update it?

About an error when running train_cnn.py

My computer runs Windows 10, and the Python version is 3.9.1.
Running train_cnn.py from the cmd console raises AttributeError: can't set attribute. How can this be fixed?
The command-line arguments used were:
python train_cnn.py -d F:\Deep-Packet-master\testin -m F:\Deep-Packet-master\modeltest -t traffic
Thanks!

Training takes too long

Hi, I tried running your code to train the app classifier, but even the first epoch is taking a very long time. Why is that? Is it normal?

The sizes of the train set and test set are weird

I have followed the steps in your GitHub and blog, and everything went well except for the sizes of the train and test sets. In the folder \train_test_data\application_classification\test.parquet the total data size is 2.49 GB, while \train_test_data\application_classification\train.parquet is only 37.6 MB. Is that OK?
