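I think the error happens when immuneML tries to import the synthetic dataset in AIRR format. The underlying pandas error in the log is "ValueError: Bool column has NA values in column 2", so at least one column in the exported file seems to contain values that cannot be parsed as the expected type.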
I looked at the synthetic AIRR file rep_0.tsv and found that the sequence_id column appears to have problems. This is an example of the file contents:
Any help is appreciated. This is my output:
(immuneml_env) [immuneML]$ immune-ml-quickstart ./quickstart_results/
immuneML quickstart: generating a synthetic dataset...
2024-05-05 20:22:13.029352: Setting temporary cache path to quickstart_results/synthetic_dataset/result/cache
2024-05-05 20:22:13.029383: ImmuneML: parsing the specification...
2024-05-05 20:22:13.752929: Imported repertoire dataset my_synthetic_dataset with 100 examples.
2024-05-05 20:22:13.876557: Full specification is available at quickstart_results/synthetic_dataset/result/full_simulation_specs.yaml.
2024-05-05 20:22:13.876602: ImmuneML: starting the analysis...
2024-05-05 20:22:13.876629: Instruction 1/1 has started.
2024-05-05 20:22:15.137774: Instruction 1/1 has finished.
2024-05-05 20:22:15.151792: Generating HTML reports...
2024-05-05 20:22:15.194902: HTML reports are generated.
2024-05-05 20:22:15.195323: ImmuneML: finished analysis.
immuneML quickstart: finished generating a synthetic dataset.
immuneML quickstart: training a machine learning model...
2024-05-05 20:22:15.201168: Setting temporary cache path to quickstart_results/machine_learning_analysis/result/cache
2024-05-05 20:22:15.201184: ImmuneML: parsing the specification...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 183, in load_sequence_dataframe
df = alternative_load_func(filepath, params)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 159, in alternative_load_func
df = airr.load_rearrangement(filename)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/airr/interface.py", line 103, in load_rearrangement
df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 164, in load_repertoire_as_object
dataframe = ImportHelper.load_sequence_dataframe(filename, params, alternative_load_func)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 187, in load_sequence_dataframe
raise Exception(f"{ex}\n\nImportHelper: an error occurred during dataset import while parsing the input file: {filepath}.\n"
Exception: Bool column has NA values in column 2
ImportHelper: an error occurred during dataset import while parsing the input file: quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/rep_0.tsv.
Please make sure this is a correct immune receptor data file (not metadata).
The parameters used for import are DatasetImportParams(path=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), is_repertoire=True, metadata_file=PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), paired=False, receptor_chains=None, result_path=PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1'), columns_to_load=None, separator='\t', column_mapping={'junction': 'sequences', 'junction_aa': 'sequence_aas', 'v_call': 'v_alleles', 'j_call': 'j_alleles', 'locus': 'chains', 'duplicate_count': 'counts', 'sequence_id': 'sequence_identifiers'}, column_mapping_synonyms=None, region_type=<RegionType.IMGT_CDR3: 'IMGT_CDR3'>, import_productive=True, import_unproductive=None, import_with_stop_codon=False, import_out_of_frame=False, import_illegal_characters=False, metadata_column_mapping=None, number_of_processes=1, sequence_file_size=50000, organism=None, import_empty_nt_sequences=True, import_empty_aa_sequences=False).
For technical description of the error, see the log above. For details on how to specify the dataset import, see the documentation.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 177, in load_repertoire_as_object
raise RuntimeError(f"{ImportHelper.__name__}: error when importing file {metadata_row['filename']}.") from exception
RuntimeError: ImportHelper: error when importing file rep_0.tsv.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 60, in _parse_dataset
dataset = import_cls.import_dataset(params, key)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/IO/dataset_import/AIRRImport.py", line 109, in import_dataset
return ImportHelper.import_dataset(AIRRImport, params, dataset_name)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 49, in import_dataset
dataset = ImportHelper.import_repertoire_dataset(import_class, processed_params, dataset_name)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/ImportHelper.py", line 95, in import_repertoire_dataset
repertoires = pool.starmap(ImportHelper.load_repertoire_as_object, arguments)
File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File ".conda/envs/immuneml_env/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
RuntimeError: ImportHelper: error when importing file rep_0.tsv.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 10, in wrapped
return func(*args, **kwargs)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 70, in _parse_dataset
raise Exception(f"{ex}\n\nAn error occurred while parsing the dataset {key}. See the log above for more details.")
Exception: ImportHelper: error when importing file rep_0.tsv.
An error occurred while parsing the dataset d1. See the log above for more details.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".conda/envs/immuneml_env/bin/immune-ml-quickstart", line 11, in <module>
sys.exit(main())
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 167, in main
quickstart.run(sys.argv[1] if len(sys.argv) == 2 else None)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/workflows/instructions/quickstart.py", line 160, in run
app.run()
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/app/ImmuneMLApp.py", line 44, in run
symbol_table, self._specification_path = ImmuneMLParser.parse_yaml_file(self._specification_path, self._result_path)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 119, in parse_yaml_file
symbol_table, path = ImmuneMLParser.parse(workflow_specification, file_path, result_path)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/ImmuneMLParser.py", line 141, in parse
def_parser_output, specs_defs = DefinitionParser.parse(workflow_specification, symbol_table, result_path)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/definition_parsers/DefinitionParser.py", line 48, in parse
symbol_table, specs_import = ImportParser.parse(specs, symbol_table, result_path)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/dsl/import_parsers/ImportParser.py", line 27, in parse
symbol_table = ImportParser._parse_dataset(key, workflow_specification[ImportParser.keyword][key], symbol_table, result_path)
File ".conda/envs/immuneml_env/lib/python3.8/site-packages/immuneML/util/Logger.py", line 14, in wrapped
raise Exception(f"{e}\n\n"
Exception: ImportHelper: error when importing file rep_0.tsv.
An error occurred while parsing the dataset d1. See the log above for more details.
ImmuneMLParser: an error occurred during parsing in function _parse_dataset with parameters: ('d1', {'format': 'AIRR', 'params': {'is_repertoire': True, 'path': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr'), 'paired': False, 'import_productive': True, 'import_with_stop_codon': False, 'import_out_of_frame': False, 'import_illegal_characters': False, 'region_type': 'IMGT_CDR3', 'separator': '\t', 'column_mapping': {'junction': 'sequences', 'junction_aa': 'sequence_aas', 'v_call': 'v_alleles', 'j_call': 'j_alleles', 'locus': 'chains', 'duplicate_count': 'counts', 'sequence_id': 'sequence_identifiers'}, 'import_empty_nt_sequences': True, 'import_empty_aa_sequences': False, 'metadata_file': PosixPath('quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/metadata.csv'), 'result_path': PosixPath('quickstart_results/machine_learning_analysis/result/datasets/d1')}}, SymbolTable(), PosixPath('quickstart_results/machine_learning_analysis/result')).
For more details on how to write the specification, see the documentation. For technical description of the error, see the log above.
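
A minimal sketch for checking which columns of a file like rep_0.tsv contain empty cells, assuming the file is plain tab-separated text with a header row. The helper name find_columns_with_missing_values and the sample rows are hypothetical and are not taken from immuneML or from the real file. Everything is read as text so that no type conversion can fail the way it does in the traceback above.

```python
import io

import pandas as pd


def find_columns_with_missing_values(tsv_source):
    """Return {column name: number of empty cells} for a tab-separated file.

    tsv_source may be a file path or any file-like object.
    """
    # dtype=str and keep_default_na=False keep every cell as raw text,
    # so empty cells stay "" instead of triggering a dtype conversion error.
    df = pd.read_csv(tsv_source, sep="\t", dtype=str, keep_default_na=False)
    empty_counts = (df == "").sum()
    return {column: int(count) for column, count in empty_counts.items() if count > 0}


# Hypothetical sample: the third column is empty in the second row.
sample = "sequence_id\tsequence\tproductive\n0\tACGT\tT\n1\tTTGA\t\n"
missing = find_columns_with_missing_values(io.StringIO(sample))
print(missing)
```

To run it on the real export, pass the path from the error message (quickstart_results/synthetic_dataset/result/simulation_instruction/exported_dataset/airr/repertoires/rep_0.tsv) instead of the in-memory sample. If the column number in the pandas message is a zero-based index, which is an assumption here, "column 2" would be the third column of the file.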