
Comments (13)

swharden commented on August 25, 2024

here's the code that, when called, loads all sweeps at once

pyABF/src/pyabf/abf.py

Lines 466 to 484 in 06247e0

def _loadAndScaleData(self, fb: BufferedReader):
    """Load data from the ABF file and scale it by its scaleFactor."""
    # read the data from the ABF file
    fb.seek(self.dataByteStart)
    raw = np.fromfile(fb, dtype=self._dtype, count=self.dataPointCount)
    nRows = self.channelCount
    nCols = int(self.dataPointCount / self.channelCount)
    raw = np.reshape(raw, (nCols, nRows))
    raw = np.transpose(raw)
    # convert to float32 so integer data can be scaled
    self.data = raw.astype(np.float32)
    # if the data was originally an int, it must be scaled
    if self._dtype == np.int16:
        for i in range(self.channelCount):
            self.data[i] = np.multiply(self.data[i], self._dataGain[i])
            self.data[i] = np.add(self.data[i], self._dataOffset[i])

A simple fix may be to add an optional argument onlySweep which, if specified, only populates self.data for the requested sweep, then pass that value in here:

pyABF/src/pyabf/abf.py

Lines 602 to 604 in 06247e0

if not "data" in (dir(self)):
    with open(self.abfFilePath, 'rb') as fb:
        self._loadAndScaleData(fb)
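
A minimal sketch of that idea (the onlySweep argument here is hypothetical, and it assumes sweeps are stored contiguously on disk):

def _loadAndScaleData(self, fb: BufferedReader, onlySweep: int = None):
    """Load and scale all data, or only one sweep if onlySweep is given."""
    pointsToRead = self.dataPointCount
    firstByte = self.dataByteStart
    if onlySweep is not None:
        # read only one sweep's worth of points, starting at that sweep's offset
        pointsToRead = int(self.dataPointCount / self.sweepCount)
        firstByte += onlySweep * pointsToRead * self.dataPointByteSize
    fb.seek(firstByte)
    raw = np.fromfile(fb, dtype=self._dtype, count=pointsToRead)
    # ...reshape, transpose, and scale as before...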


shadowk29 commented on August 25, 2024

Another approach using memmap, assuming you can extract data format in a compatible string:

columntypes = np.dtype([('channel_1', header.get_data_format()), ('channel_2', header.get_data_format())]) 
raw_data_channel_1 = np.memmap(filename, dtype=columntypes, offset=header.get_header_bytes())['channel_1']

This just memory-maps the file without loading it into RAM, while still allowing you to index into it like a numpy array downstream, which gives some flexibility.

My own version of this was built in a way that isn't directly mappable to your architecture at this point, but I may be able to port it over.

It also handles the channel count, so there is no need to reshape or even directly interact with the raw bytes: as long as you provide one (name, dtype) tuple per channel, it assumes by default that the data is interleaved and handles the reshaping/slicing internally.

for example, if using 16-bit signed integer adc codes for two channels of data with little-endian bit ordering, you would use
columntypes = np.dtype([('channel_1','<i2'), ('channel_2', '<i2')])

At that point, loading data for channel 1 from T1 to T2 (indices) would be just raw_data_channel_1[T1:T2]
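
Putting those pieces together, a self-contained sketch (the 512-byte offset and the two int16 channels are placeholder assumptions; a real reader would take both from the ABF header):

import numpy as np

DATA_BYTE_START = 512  # assumed for illustration; use the header's real data offset
columntypes = np.dtype([('channel_1', '<i2'), ('channel_2', '<i2')])

# map the interleaved samples; nothing is read from disk until it is indexed
# (in practice, pass shape= from the header's point count as well)
raw = np.memmap("test.abf", dtype=columntypes, mode='r', offset=DATA_BYTE_START)

# pull samples T1:T2 of channel 1 into RAM (ADC-to-unit scaling still applies)
T1, T2 = 1000, 2000
segment = np.asarray(raw['channel_1'][T1:T2], dtype=np.float32)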


swharden commented on August 25, 2024

Here's a test program I'm using to evaluate memory utilization

import sys
import pathlib
import psutil
import pyabf

environment_kb = None

def printMemoryUsage(message=""):
    """Print this process's memory usage (RSS) relative to the baseline."""
    global environment_kb
    kb = psutil.Process().memory_info().rss / 1000
    if not environment_kb:
        environment_kb = kb  # the first call establishes the baseline
    kb = int(kb - environment_kb)
    print(f"{message} {kb:,} KB")


if __name__ == "__main__":
    printMemoryUsage("environment")
    abf = pyabf.ABF("vc_drug_memtest.abf", loadData=False)
    printMemoryUsage("after reading header")
    for i in range(5):
        abf.setSweep(i)
        printMemoryUsage(f"after loading sweep {i}")

It shows how memory usage jumps to its maximum the first time a sweep is loaded:

environment 0 KB
after reading header 344 KB
after loading sweep 0 78,680 KB
after loading sweep 1 78,991 KB
after loading sweep 2 78,995 KB
after loading sweep 3 78,995 KB
after loading sweep 4 78,995 KB


swharden commented on August 25, 2024

Hi @shadowk29, looking at this more closely, it looks like abf.data is expected to always be a 2D numpy array sized to hold the values from all sweeps.

Even if we attempt to read only one sweep from disk, the current architecture of _loadAndScaleData and abf.data means a large amount of memory will be used if lots of ABFs are opened simultaneously. There's probably not a clean fix here that doesn't require significant refactoring, which I'm less inclined to pursue.
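
For reference, abf.data holds one row per channel spanning every sweep in the file, which a quick check makes visible:

abf = pyabf.ABF("test.abf")
print(abf.data.shape)  # (channelCount, dataPointCount / channelCount), all sweeps at once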

However, there are two solutions to your challenge:

  1. Modify your code to load an ABF, extract the piece you need, then remove the ABF object from memory. You keep only the small segment you need, and this problem goes away.

  2. Create a custom GetSweepFromFile() method that bypasses _loadAndScaleData() and does not use SetSweep() at all. I'll explore this a bit and see if I can come up with a working code example.

Both options are nice because they do not require editing the source code or deploying a new package.


swharden commented on August 25, 2024

Regarding #1, here's an example that saves memory by deleting the ABF object (freeing its memory) after it is no longer needed:

import gc
import numpy as np
import pyabf

abf = pyabf.ABF("vc_drug_memtest.abf", loadData=False)
segments = []
for i in range(5):
    abf.setSweep(i)
    segment = np.array(abf.sweepY[100:200])  # copy only the small piece of interest
    segments.append(segment)
del abf
gc.collect()  # release the memory held by the ABF object

environment 0 KB
after reading header 454 KB
after loading sweep 0 78,761 KB
after loading sweep 1 79,097 KB
after loading sweep 2 79,110 KB
after loading sweep 3 79,110 KB
after loading sweep 4 79,110 KB
after unloading ABF 1,064 KB


shadowk29 commented on August 25, 2024

Now try with an ABF file that's bigger than available RAM

swharden commented on August 25, 2024

"Now try with an ABF file that's bigger than available RAM"

lol, what are you guys studying 🤪

I'm working on a method to do this now and am optimistic it will work well! It will be nice to include this in the source and publish a new package after all 🚀


swharden commented on August 25, 2024

Hi @shadowk29, try this! You can run it in your own code. If it works well, I can build it into pyABF for a future release. I'm looking forward to your feedback!

import numpy as np
import pyabf

def getOnlySweep(abf, sweepIndex: int):
    with open(abf.abfFilePath, 'rb') as fb:
        # seek directly to this sweep's first byte
        pointsPerSweep = int(abf.dataPointCount / abf.sweepCount)
        bytesPerSweep = int(pointsPerSweep * abf.dataPointByteSize)
        sweepByteOffset = bytesPerSweep * sweepIndex
        fb.seek(abf.dataByteStart + sweepByteOffset)
        # read just this sweep and reshape to one row per channel
        raw = np.fromfile(fb, dtype=abf._dtype, count=pointsPerSweep)
        nRows = abf.channelCount
        nCols = int(pointsPerSweep / abf.channelCount)
        raw = np.reshape(raw, (nCols, nRows))
        raw = np.transpose(raw)
        raw = raw.astype(np.float32)
        # integer data must be scaled to its true units
        if abf._dtype == np.int16:
            for i in range(abf.channelCount):
                raw[i] = np.multiply(raw[i], abf._dataGain[i])
                raw[i] = np.add(raw[i], abf._dataOffset[i])
        return raw

abf = pyabf.ABF("test.abf", loadData=False)
for i in range(5):
    values = getOnlySweep(abf, i)
    print(values)

When I try this, memory never exceeds 1 MB, so at first glance it seems to be working well:

environment 0 KB
after reading header 278 KB
[[-7.2021 -6.7139 -5.7373 ... -7.8125 -7.5684 -6.5918]]
after reading sweep 0 761 KB
[[-5.249  -5.8594 -7.3242 ... -8.4229 -9.0332 -8.667 ]]
after reading sweep 1 925 KB
[[-7.9346 -7.2021 -7.0801 ... -5.9814 -5.9814 -5.9814]]
after reading sweep 2 925 KB
[[-5.0049 -4.3945 -5.249  ... -7.0801 -6.4697 -6.1035]]
after reading sweep 3 925 KB
[[-6.3477 -7.8125 -8.667  ... -6.8359 -6.4697 -6.2256]]
after reading sweep 4 925 KB


shadowk29 commented on August 25, 2024

The only change I would suggest is to allow loading of a specific part of a sweep.

swharden commented on August 25, 2024

"the only change I would suggest is to allow loading of a specific part of a sweep"

Good idea! I can adapt the method above to look something like:

def getOnlySweep(abf, sweepIndex: int, startTime: float, endTime: float):

Where times are specified in seconds.
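
The heart of that change is converting the requested times into sample offsets, roughly like this:

# sketch: convert a time window (in seconds) into sample counts within the sweep
firstSampleIndex = int(startTime * abf.sampleRate)
sampleCount = int((endTime - startTime) * abf.sampleRate)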



swharden commented on August 25, 2024

Hi @shadowk29, I think this works as expected. Can you double-check it with your large ABFs? If you report that it works well, I'll build it into the ABF class and release a new package 👍

def getOnlySweep(abf: pyabf.ABF, sweepIndex: int, startTime: float = None, endTime: float = None):
    """
    Return values for a sweep (one row per channel) by reading directly from the ABF file.
    Use this to avoid storing large amounts of sweep data in memory.

    ### Parameters
    * `abf` - ABF the sweep data will be read from
    * `sweepIndex` - The sweep number (starting at zero). Note that all channels for this sweep will be returned.
    * `startTime` - Data returned will begin at this time within the sweep (in seconds)
    * `endTime` - Data returned will end at this time within the sweep (in seconds)
    """

    startTime = startTime if startTime else 0
    startTime = max(0, startTime)

    endTime = endTime if endTime else abf.sweepLengthSec
    endTime = min(endTime, abf.sweepLengthSec)

    bytesPerSample = int(abf.dataPointByteSize)
    bytesPerSecond = int(abf.dataPointByteSize * abf.sampleRate)
    samplesPerSweep = int(abf.dataPointCount / abf.sweepCount)
    bytesPerSweep = samplesPerSweep * bytesPerSample
    sweepFirstByte = abf.dataByteStart + bytesPerSweep * sweepIndex
    sweepFirstByte += int(startTime * bytesPerSecond)
    sampleCount = (endTime - startTime) * abf.sampleRate

    with open(abf.abfFilePath, 'rb') as fb:
        fb.seek(sweepFirstByte)
        raw = np.fromfile(fb, dtype=abf._dtype, count=sampleCount)
        nRows = abf.channelCount
        nCols = sampleCount
        raw = np.reshape(raw, (nCols, nRows))
        raw = np.transpose(raw)
        raw = raw.astype(np.float32)
        if abf._dtype == np.int16:
            for i in range(abf.channelCount):
                raw[i] = np.multiply(raw[i], abf._dataGain[i])
                raw[i] = np.add(raw[i], abf._dataOffset[i])
        return raw

abf = pyabf.ABF("test.abf", loadData=False)
for i in range(5):
    values = getOnlySweep(abf, i, 2, 3)  # return only data from seconds 2 through 3
    print(values)


shadowk29 commented on August 25, 2024

Three minor bugs, but otherwise it works well:

  1. sampleCount needs to be cast to int:

  sampleCount = int((endTime - startTime) * abf.sampleRate)

  2. The raw-data load needs to account for the fact that the ABF file might contain more than one channel of interleaved data, by multiplying the number of points to load by the channel count:

  raw = np.fromfile(fb, dtype=abf._dtype, count=abf.channelCount*sampleCount)

  3. Finally, the function should only return data from the requested channel i (which means you can drop the scaling loop and scale just that one channel for a little extra speed):

  return raw[i]
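
Folding those three fixes into the method above gives something like this sketch (the channel argument is a hypothetical addition, and the start-time byte offset is also multiplied by channelCount here since interleaved data packs all channels into each second):

def getOnlySweep(abf, sweepIndex: int, channel: int = 0,
                 startTime: float = None, endTime: float = None):
    startTime = max(0, startTime if startTime else 0)
    endTime = min(endTime if endTime else abf.sweepLengthSec, abf.sweepLengthSec)

    bytesPerSample = int(abf.dataPointByteSize)
    samplesPerSweep = int(abf.dataPointCount / abf.sweepCount)
    sweepFirstByte = abf.dataByteStart + samplesPerSweep * bytesPerSample * sweepIndex
    # interleaved data: each second spans channelCount * sampleRate samples
    sweepFirstByte += int(startTime * abf.sampleRate) * abf.channelCount * bytesPerSample
    sampleCount = int((endTime - startTime) * abf.sampleRate)  # fix 1: cast to int

    with open(abf.abfFilePath, 'rb') as fb:
        fb.seek(sweepFirstByte)
        # fix 2: read channelCount interleaved values per sample point
        raw = np.fromfile(fb, dtype=abf._dtype, count=abf.channelCount * sampleCount)
        raw = np.reshape(raw, (sampleCount, abf.channelCount)).transpose()
        values = raw[channel].astype(np.float32)  # fix 3: return just one channel
        if abf._dtype == np.int16:
            values = values * abf._dataGain[channel] + abf._dataOffset[channel]
        return values

values = getOnlySweep(abf, 0, channel=0, startTime=2, endTime=3)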

