
Comments (13)

swharden commented on August 25, 2024

here's the code that, when called, loads all sweeps at once

pyABF/src/pyabf/abf.py

Lines 466 to 484 in 06247e0

def _loadAndScaleData(self, fb: BufferedReader):
    """Load data from the ABF file and scale it by its scaleFactor."""
    # read the data from the ABF file
    fb.seek(self.dataByteStart)
    raw = np.fromfile(fb, dtype=self._dtype, count=self.dataPointCount)
    nRows = self.channelCount
    nCols = int(self.dataPointCount / self.channelCount)
    raw = np.reshape(raw, (nCols, nRows))
    raw = np.transpose(raw)
    # convert to float32 so integer data can be scaled
    self.data = raw.astype(np.float32)
    # if the data was originally an int, it must be scaled
    if self._dtype == np.int16:
        for i in range(self.channelCount):
            self.data[i] = np.multiply(self.data[i], self._dataGain[i])
            self.data[i] = np.add(self.data[i], self._dataOffset[i])

A simple fix may be to add an optional argument onlySweep which, if specified, only populates self.data for the requested sweep, then pass that value in here:

pyABF/src/pyabf/abf.py

Lines 602 to 604 in 06247e0

if not "data" in (dir(self)):
    with open(self.abfFilePath, 'rb') as fb:
        self._loadAndScaleData(fb)
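
A minimal sketch of that idea (the onlySweep argument here is hypothetical, and it assumes sweeps are stored contiguously on disk):

def _loadAndScaleData(self, fb: BufferedReader, onlySweep: int = None):
    """Load and scale all data, or only one sweep if onlySweep is given."""
    pointsToRead = self.dataPointCount
    firstByte = self.dataByteStart
    if onlySweep is not None:
        # read only one sweep's worth of points, starting at that sweep's offset
        pointsToRead = int(self.dataPointCount / self.sweepCount)
        firstByte += onlySweep * pointsToRead * self.dataPointByteSize
    fb.seek(firstByte)
    raw = np.fromfile(fb, dtype=self._dtype, count=pointsToRead)
    # ...reshape, transpose, and scale as before...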


shadowk29 commented on August 25, 2024

Another approach using memmap, assuming you can extract data format in a compatible string:

columntypes = np.dtype([('channel_1', header.get_data_format()), ('channel_2', header.get_data_format())]) 
raw_data_channel_1 = np.memmap(filename, dtype=columntypes, offset=header.get_header_bytes())['channel_1']

This just memory-maps the file without loading it into RAM, while still allowing you to index into it like a numpy array downstream, which gives some flexibility.

My own version of this was built in a way that isn't directly mappable to your architecture at this point, but I may be able to port it over.

It also handles the channel count, so there is no need to reshape or even directly interact with the raw bytes: as long as you provide one (name, dtype) tuple per channel, it assumes by default that the data is interleaved and handles the reshaping/slicing internally.

for example, if using 16-bit signed integer adc codes for two channels of data with little-endian bit ordering, you would use
columntypes = np.dtype([('channel_1','<i2'), ('channel_2', '<i2')])

At that point, loading data for channel 1 from T1 to T2 (indices) would be just raw_data_channel_1[T1:T2]
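
Putting those pieces together, a self-contained sketch (the 512-byte offset and the two int16 channels are placeholder assumptions; a real reader would take both from the ABF header):

import numpy as np

DATA_BYTE_START = 512  # assumed for illustration; use the header's real data offset
columntypes = np.dtype([('channel_1', '<i2'), ('channel_2', '<i2')])

# map the interleaved samples; nothing is read from disk until it is indexed
# (in practice, pass shape= from the header's point count as well)
raw = np.memmap("test.abf", dtype=columntypes, mode='r', offset=DATA_BYTE_START)

# pull samples T1:T2 of channel 1 into RAM (ADC-to-unit scaling still applies)
T1, T2 = 1000, 2000
segment = np.asarray(raw['channel_1'][T1:T2], dtype=np.float32)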


swharden commented on August 25, 2024

Here's a test program I'm using to evaluate memory utilization

import sys
import pathlib
import psutil
import pyabf

environment_kb = None

def printMemoryUsage(message=""):
    """Print this process's memory usage (RSS) relative to the baseline."""
    global environment_kb
    kb = psutil.Process().memory_info().rss / 1000
    if not environment_kb:
        environment_kb = kb  # the first call establishes the baseline
    kb = int(kb - environment_kb)
    print(f"{message} {kb:,} KB")


if __name__ == "__main__":
    printMemoryUsage("environment")
    abf = pyabf.ABF("vc_drug_memtest.abf", loadData=False)
    printMemoryUsage("after reading header")
    for i in range(5):
        abf.setSweep(i)
        printMemoryUsage(f"after loading sweep {i}")

It shows how memory usage jumps to its maximum the first time a sweep is loaded:

environment 0 KB
after reading header 344 KB
after loading sweep 0 78,680 KB
after loading sweep 1 78,991 KB
after loading sweep 2 78,995 KB
after loading sweep 3 78,995 KB
after loading sweep 4 78,995 KB


swharden commented on August 25, 2024

Hi @shadowk29, looking at this more closely, it looks like abf.data is expected to always be a 2D numpy array sized to hold the values from all sweeps.

Even if we attempt to read only one sweep from disk, the current architecture of _loadAndScaleData and abf.data means a large amount of memory will be used if lots of ABFs are opened simultaneously. There's probably not a clean fix here that doesn't require significant refactoring, which I'm less inclined to pursue.
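
For reference, abf.data holds one row per channel spanning every sweep in the file, which a quick check makes visible:

abf = pyabf.ABF("test.abf")
print(abf.data.shape)  # (channelCount, dataPointCount / channelCount), all sweeps at once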

However, there are two solutions to your challenge:

  1. Modify your code to load an ABF, extract the piece you need, then remove the ABF object from memory. You keep only the small segment you need, and this problem goes away.

  2. Create a custom GetSweepFromFile() method that bypasses _loadAndScaleData() and does not use SetSweep() at all. I'll explore this a bit and see if I can come up with a working code example.

Both options are nice because they do not require editing the source code or deploying a new package.


swharden commented on August 25, 2024

Regarding #1, here's an example that saves memory by deleting the ABF object (freeing its memory) after it is no longer needed:

import gc
import numpy as np
import pyabf

abf = pyabf.ABF("vc_drug_memtest.abf", loadData=False)
segments = []
for i in range(5):
    abf.setSweep(i)
    segment = np.array(abf.sweepY[100:200])  # copy only the small piece of interest
    segments.append(segment)
del abf
gc.collect()  # release the memory held by the ABF object

environment 0 KB
after reading header 454 KB
after loading sweep 0 78,761 KB
after loading sweep 1 79,097 KB
after loading sweep 2 79,110 KB
after loading sweep 3 79,110 KB
after loading sweep 4 79,110 KB
after unloading ABF 1,064 KB


shadowk29 commented on August 25, 2024

Now try with an ABF file that's bigger than available RAM

swharden commented on August 25, 2024

"Now try with an ABF file that's bigger than available RAM"

lol, what are you guys studying 🤪

I'm working on a method to do this now and am optimistic it will work well! It will be nice to include this in the source and publish a new package after all 🚀


swharden commented on August 25, 2024

Hi @shadowk29, try this! You can run it in your own code. If it works well, I can build it into pyABF for a future release. I'm looking forward to your feedback!

import numpy as np
import pyabf

def getOnlySweep(abf, sweepIndex: int):
    with open(abf.abfFilePath, 'rb') as fb:
        # seek directly to this sweep's first byte
        pointsPerSweep = int(abf.dataPointCount / abf.sweepCount)
        bytesPerSweep = int(pointsPerSweep * abf.dataPointByteSize)
        sweepByteOffset = bytesPerSweep * sweepIndex
        fb.seek(abf.dataByteStart + sweepByteOffset)
        # read just this sweep and reshape to one row per channel
        raw = np.fromfile(fb, dtype=abf._dtype, count=pointsPerSweep)
        nRows = abf.channelCount
        nCols = int(pointsPerSweep / abf.channelCount)
        raw = np.reshape(raw, (nCols, nRows))
        raw = np.transpose(raw)
        raw = raw.astype(np.float32)
        # integer data must be scaled to its true units
        if abf._dtype == np.int16:
            for i in range(abf.channelCount):
                raw[i] = np.multiply(raw[i], abf._dataGain[i])
                raw[i] = np.add(raw[i], abf._dataOffset[i])
        return raw

abf = pyabf.ABF("test.abf", loadData=False)
for i in range(5):
    values = getOnlySweep(abf, i)
    print(values)

When I try this, memory never exceeds 1 MB, so at first glance it seems to be working well:

environment 0 KB
after reading header 278 KB
[[-7.2021 -6.7139 -5.7373 ... -7.8125 -7.5684 -6.5918]]
after reading sweep 0 761 KB
[[-5.249  -5.8594 -7.3242 ... -8.4229 -9.0332 -8.667 ]]
after reading sweep 1 925 KB
[[-7.9346 -7.2021 -7.0801 ... -5.9814 -5.9814 -5.9814]]
after reading sweep 2 925 KB
[[-5.0049 -4.3945 -5.249  ... -7.0801 -6.4697 -6.1035]]
after reading sweep 3 925 KB
[[-6.3477 -7.8125 -8.667  ... -6.8359 -6.4697 -6.2256]]
after reading sweep 4 925 KB


shadowk29 commented on August 25, 2024

The only change I would suggest is to allow loading of a specific part of a sweep.

swharden commented on August 25, 2024

"the only change I would suggest is to allow loading of a specific part of a sweep"

Good idea! I can adapt the method above to look something like:

def getOnlySweep(abf, sweepIndex: int, startTime: float, endTime: float):

Where times are specified in seconds.
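
The heart of that change is converting the requested times into sample offsets, roughly like this:

# sketch: convert a time window (in seconds) into sample counts within the sweep
firstSampleIndex = int(startTime * abf.sampleRate)
sampleCount = int((endTime - startTime) * abf.sampleRate)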



swharden commented on August 25, 2024

Hi @shadowk29, I think this works as expected. Can you double-check it with your large ABFs? If you report that it works well, I'll build it into the ABF class and release a new package 👍

def getOnlySweep(abf: pyabf.ABF, sweepIndex: int, startTime: float = None, endTime: float = None):
    """
    Return values for a sweep (one row per channel) by reading directly from the ABF file.
    Use this to avoid storing large amounts of sweep data in memory.

    ### Parameters
    * `abf` - ABF the sweep data will be read from
    * `sweepIndex` - The sweep number (starting at zero). Note that all channels for this sweep will be returned.
    * `startTime` - Data returned will begin at this time within the sweep (in seconds)
    * `endTime` - Data returned will end at this time within the sweep (in seconds)
    """

    startTime = startTime if startTime else 0
    startTime = max(0, startTime)

    endTime = endTime if endTime else abf.sweepLengthSec
    endTime = min(endTime, abf.sweepLengthSec)

    bytesPerSample = int(abf.dataPointByteSize)
    bytesPerSecond = int(abf.dataPointByteSize * abf.sampleRate)
    samplesPerSweep = int(abf.dataPointCount / abf.sweepCount)
    bytesPerSweep = samplesPerSweep * bytesPerSample
    sweepFirstByte = abf.dataByteStart + bytesPerSweep * sweepIndex
    sweepFirstByte += int(startTime * bytesPerSecond)
    sampleCount = (endTime - startTime) * abf.sampleRate

    with open(abf.abfFilePath, 'rb') as fb:
        fb.seek(sweepFirstByte)
        raw = np.fromfile(fb, dtype=abf._dtype, count=sampleCount)
        nRows = abf.channelCount
        nCols = sampleCount
        raw = np.reshape(raw, (nCols, nRows))
        raw = np.transpose(raw)
        raw = raw.astype(np.float32)
        if abf._dtype == np.int16:
            for i in range(abf.channelCount):
                raw[i] = np.multiply(raw[i], abf._dataGain[i])
                raw[i] = np.add(raw[i], abf._dataOffset[i])
        return raw

abf = pyabf.ABF("test.abf", loadData=False)
for i in range(5):
    values = getOnlySweep(abf, i, 2, 3)  # return only data from seconds 2 through 3
    print(values)


shadowk29 commented on August 25, 2024

Three minor bugs, but otherwise it works well:

  1. sampleCount needs to be cast to int:

  sampleCount = int((endTime - startTime) * abf.sampleRate)

  2. The raw-data load needs to account for the fact that the ABF file might contain more than one channel of interleaved data, by multiplying the number of points to load by the channel count:

  raw = np.fromfile(fb, dtype=abf._dtype, count=abf.channelCount*sampleCount)

  3. Finally, the function should only return data from the requested channel i (which means you can drop the scaling loop and scale just that one channel for a little extra speed):

  return raw[i]
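
Folding those three fixes into the method above gives something like this sketch (the channel argument is a hypothetical addition, and the start-time byte offset is also multiplied by channelCount here since interleaved data packs all channels into each second):

def getOnlySweep(abf, sweepIndex: int, channel: int = 0,
                 startTime: float = None, endTime: float = None):
    startTime = max(0, startTime if startTime else 0)
    endTime = min(endTime if endTime else abf.sweepLengthSec, abf.sweepLengthSec)

    bytesPerSample = int(abf.dataPointByteSize)
    samplesPerSweep = int(abf.dataPointCount / abf.sweepCount)
    sweepFirstByte = abf.dataByteStart + samplesPerSweep * bytesPerSample * sweepIndex
    # interleaved data: each second spans channelCount * sampleRate samples
    sweepFirstByte += int(startTime * abf.sampleRate) * abf.channelCount * bytesPerSample
    sampleCount = int((endTime - startTime) * abf.sampleRate)  # fix 1: cast to int

    with open(abf.abfFilePath, 'rb') as fb:
        fb.seek(sweepFirstByte)
        # fix 2: read channelCount interleaved values per sample point
        raw = np.fromfile(fb, dtype=abf._dtype, count=abf.channelCount * sampleCount)
        raw = np.reshape(raw, (sampleCount, abf.channelCount)).transpose()
        values = raw[channel].astype(np.float32)  # fix 3: return just one channel
        if abf._dtype == np.int16:
            values = values * abf._dataGain[channel] + abf._dataOffset[channel]
        return values

values = getOnlySweep(abf, 0, channel=0, startTime=2, endTime=3)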

