
Comments (12)

karlicoss commented on August 21, 2024

Ah sorry -- in this case saw the notification -- just was a bit busy with other things!

could probably just have a main function and run this like python3 -m my.location.fallback.via_prompt

Yeah seems good -- at least to start with!

The choices the user picks would be cached (so they aren't re-prompted)

Hmm, not sure I understood -- what would the user pick here? Seems that it's possible to pick the source automatically, just by using accuracy?

The choices the user picks would be cached (so they aren't re-prompted), so this would need to save to a JSON file or something, e.g. to ~/data/locations/via_prompt.json, ~/data/locations/via_ip.json

Similarly, not sure I understand -- makes sense to have via_prompt.json, to save manually entered locations, but what would be in via_ip.json 🤔

That way it's still all backwards compatible, and only if all sources are SortedLocationIterator do we do a mergesort of sorts.

Yeah, seems like a good idea! Maybe there is even something existing in more_itertools we could reuse
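For reference, the stdlib's heapq.merge already does this kind of lazy k-way merge of sorted iterators (more_itertools has related helpers too). A minimal sketch with a stand-in location type, not the real my.location.common.Location:

```python
# Merging several already-sorted location iterators lazily with heapq.merge.
# Loc is a stand-in for the real Location type.
import heapq
from datetime import datetime, timedelta
from typing import Iterator, NamedTuple

class Loc(NamedTuple):
    dt: datetime
    source: str

def src(name: str, start: datetime, step_min: int, n: int) -> Iterator[Loc]:
    # each source yields locations in ascending dt order
    for i in range(n):
        yield Loc(start + timedelta(minutes=step_min * i), name)

base = datetime(2020, 1, 1)
merged = heapq.merge(
    src("takeout", base, 3, 3),
    src("gpslogger", base + timedelta(minutes=1), 3, 3),
    key=lambda loc: loc.dt,
)
# the merge is lazy: each next() pulls at most one item per source
print([loc.source for loc in merged])
# ['takeout', 'gpslogger', 'takeout', 'gpslogger', 'takeout', 'gpslogger']
```
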

from hpi.

seanbreckenridge commented on August 21, 2024

Perhaps it could be backed by sqlite, maybe even cachew somehow

Yeah, I'm not really set on any storage/strategy yet, will experiment when I get there and see what makes sense

Also maybe makes more sense to keep the data that was bad rather than the data that was good, hopefully it would mean less stuff to keep? And won't require approving "new" data since the last review before it makes it into the location provider.

Ah true, but we would then need some way to find 'unlikely/bad' entries, so you can at least warn the user?

As otherwise things get auto-approved, and you might have incorrect data -- but maybe it should be up to the user to flag bad entries.

Will keep these all open as options while I'm bootstrapping this

I think at this point it could just be a personal overlay

Right, I'll probably add some more documentation for that as well, as another option


stable IDs

yeah, that's something I've thought about a few times (same with IPs above), it would be more convenient for the user. but also, these could all be solved with an hpi overlay, though that's a lot more involved than just adding something to your my.config which automatically filters some entries.

could experiment with adding something to hpi query that lets you add hooks to your my.config, maybe in a my.config.query block which does some checking against the names of the modules? instead of modifying every module to check a filter in my.config, just localize that to hpi query, like:

from typing import Callable, Dict, TypeVar

T = TypeVar("T")

def _custom_filter_function(ip: IP) -> bool:
    if ...:  # whatever condition the user wants
        return False
    return True

class query:
    filters: Dict[str, Callable[[T], bool]] = {
        "my.ip.all": _custom_filter_function,
    }

I don't really want to wrap every module function in a decorator that checks the my.config file to filter out a few entries though, a bit too boilerplate-y

also this isn't typesafe, but I feel that's sort of the tradeoff you have to make for user convenience

perhaps in the future, if that's deemed useful enough to be worth the tradeoff (doubt it, but perhaps), we can add a decorator to every module
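A rough sketch of how that dispatch could work. None of these names (query.filters, apply_config_filter) are an existing HPI API; the Any-typed predicate map illustrates the non-typesafe tradeoff mentioned above:

```python
# Hypothetical sketch of the my.config.query idea: the query layer looks up
# a filter by module name and applies it; not an existing HPI API.
from typing import Any, Callable, Dict, Iterable, Iterator

class query:
    # fully-qualified function name -> predicate (not typesafe, as noted)
    filters: Dict[str, Callable[[Any], bool]] = {}

def apply_config_filter(module_name: str, items: Iterable[Any]) -> Iterator[Any]:
    pred = query.filters.get(module_name)
    if pred is None:
        # no filter registered: pass everything through unchanged
        yield from items
    else:
        yield from (x for x in items if pred(x))

# example with integers standing in for IP entries
query.filters["my.ip.all"] = lambda n: n % 2 == 1
print(list(apply_config_filter("my.ip.all", range(6))))  # [1, 3, 5]
```
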


karlicoss commented on August 21, 2024

Like, each namedtuple/dataclass can define a unique key on the data

Yeah, I've been reluctant to do it so far unless necessary (and in particular, advertise as a stable uuid). I guess my hope was that it could be mostly agnostic, e.g. derived from the timestamp or something. But that could be tricky too, even timestamps might change for various reasons.
I wonder if there is any good prior art we could take inspiration from -- seems like a common problem?

Now, we could alter the EXIF tag but that corrupts the ideal of this file being the original copy (yes I’m going a bit far with this, but date is still tricky to modify and other metadata types are not as forgiving)

Yeah agreed -- plus it's not always feasible to modify the metadata directly, e.g. you could have something coming from google takeout, so you'd have to patch up files every time you do a takeout (as opposed to a programmatic overlay).

What would be better for archival purposes is a record of the correction, maybe a note about why it was wrong, and the system moving forward with the corrected data. This keeps the original untouched, leaves an audit trail, and still lets us view reports without having to think about it again.

Yep, totally agree!

This also slightly feels outside of the realm of HPI (at least for fixing tags on images),

For images in particular, what might make sense is to create a separate repo/CLI tool which manages your images in the custom way we may want,

I guess it depends on how generic we can make it -- if it's possible to make it work in a somewhat agnostic manner, just on top of namedtuples/dataclasses, I don't see why not. But specific UIs/logic to analyse and review data don't have to be in core HPI, yeah.


seanbreckenridge commented on August 21, 2024

not urgent or anything, but just thought I'd ping you @karlicoss since it seems like notifications often don't work on here


seanbreckenridge commented on August 21, 2024

just was a bit busy with other things

all good, this will probably take a while to build anyways

The choices the user picks would be cached (so they aren't re-prompted)

Ah, sorry; didn't totally describe this here. The sources/fallback for a day would all be picked automatically, based on what data is available on the day, which has the highest accuracy etc.

The prompting I'm referring to is to confirm a location based on an IP address, as some of the IP addresses I have from 2012-2013, when geolocated now, return incorrect locations, like in random parts of Europe/islands I've never been to (I'm assuming these IPs were bought out by another company, and the location is now different?)

Could also probably check if the location is within some distance of a known/home location and then auto-confirm it, so you're not prompting for everything, just for what look like outlier addresses
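That distance check could be a plain haversine against a known home point. A stdlib-only sketch, where HOME and the 50 km threshold are made-up values:

```python
# Sketch of the "within some distance of home" auto-confirm check.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # great-circle distance between two (lat, lon) points in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

HOME = (40.7128, -74.0060)  # example home location (NYC)

def needs_confirmation(lat: float, lon: float, threshold_km: float = 50.0) -> bool:
    # only prompt for locations that look like outliers relative to home
    return haversine_km(lat, lon, *HOME) > threshold_km

print(needs_confirmation(40.73, -73.99))    # near home -> False
print(needs_confirmation(48.8566, 2.3522))  # Paris -> True
```
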

If you confirm one of those as correct addresses, then that would be saved to the via_ip.json file so you don't have to be asked again
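A minimal sketch of that confirmation cache. Only the via_ip.json path comes from the thread; the flat ip-to-decision schema and the confirm_ip helper are guesses:

```python
# Cache user confirmations so the same IP isn't prompted twice.
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict

def load_confirmed(path: Path) -> Dict[str, bool]:
    if path.exists():
        return json.loads(path.read_text())
    return {}

def confirm_ip(ip: str, path: Path, ask: Callable[[str], bool]) -> bool:
    # ask() is whatever interactive prompt the CLI ends up using
    cache = load_confirmed(path)
    if ip in cache:
        return cache[ip]
    ok = ask(ip)
    cache[ip] = ok
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cache))
    return ok

# example with a fake prompt that approves everything
p = Path(tempfile.mkdtemp()) / "via_ip.json"
print(confirm_ip("1.2.3.4", p, ask=lambda ip: True))   # prompts, returns True
print(confirm_ip("1.2.3.4", p, ask=lambda ip: False))  # cached, still True
```
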

The same thing may (or might not; I haven't ever gone through the latitude/longitude tags on my images, so will see if it's a problem that needs fixing) be a problem for images in my photos directory which I didn't take (so the location should not be used), but happen to be geotagged

could also let the user define a custom function in their config like:

from typing import Iterable

def filter_ips(ips: Iterable[IP]) -> Iterable[IP]:
    ...  # custom logic the user can define

class location:
    class fallback:
        class via_ip:
            filter_function = filter_ips

so they can remove swaths of IP addresses before/within some date etc., but will see if that's even necessary when I get to that point. I have a good amount of IPs personally, so I should be able to find most of the annoying edge cases:

$ hpi query my.ip.all -s | jq '.addr' -r | unique | wc -l
1882


karlicoss commented on August 21, 2024

The prompting I'm referring to is to confirm a location based on an IP address, as some of the IP addresses I have from 2012-2013 when geolocated now, return incorrect locations

Ah gotcha! So it's basically some tooling/overlay/filter to 'review' the data, good idea.
This seems like a lot of data to sift through though, so not sure how feasible it would really be to go through everything? But yeah, like you mentioned, with some extra logic to filter out obviously correct stuff it might work.

The JSON would end up quite huge though, right? Perhaps it could be backed by sqlite, maybe even cachew somehow.
Also maybe makes more sense to keep the data that was bad rather than the data that was good, hopefully it would mean less stuff to keep? And won't require approving "new" data since the last review before it makes it into the location provider.

could also let the user define a custom function in their config like

I think at this point it could just be a personal overlay? At least to start with/experiment, because it's less intrusive than defining a config etc.
E.g. here I have an example https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/calendar/holidays.py#L1-L14
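The overlay pattern boils down to re-exporting the upstream function under the same name with filtering applied, so consumers don't change. A toy sketch in the spirit of the linked example; the stand-in source and BAD set are illustrative, not real HPI modules:

```python
# In a real overlay this module would live at the same import path as the
# upstream one (e.g. my/ip/all.py in your overlay repo) and do:
#     from upstream my.ip.all import ips as upstream_ips
from typing import Iterator

def upstream_ips() -> Iterator[str]:  # stand-in for the real source
    yield from ["8.8.8.8", "203.0.113.7", "8.8.4.4"]

BAD = {"203.0.113.7"}  # entries you've reviewed and marked as wrong

def ips() -> Iterator[str]:
    # same name/signature as upstream, so consumers don't change
    for ip in upstream_ips():
        if ip not in BAD:
            yield ip

print(list(ips()))  # ['8.8.8.8', '8.8.4.4']
```
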

The same thing may (or might not; I haven't ever gone through the latitude/longitude tags on my images, so will see if it's a problem that needs fixing) be a problem for images in my photos directory which I didn't take (so the location should not be used), but happen to be geotagged

I guess the whole thing could be quite generic in principle (at the very least you could drop datapoints). But it kinda requires stable ids for each piece of data karlicoss/promnesia#173


seanbreckenridge commented on August 21, 2024

Oh, and on the stable IDs, perhaps we can at least move a bit closer to that/define some standard behaviour?

Like, each namedtuple/dataclass can define a unique key on the data, sort of like in google_takeout_parser/models.py. Could be a uuid/key @property or something. If a datasource already provides a unique ID, that can be returned; else we can estimate one with datetimes/sha-hashing the input etc.

may not be able to do that for every source, but at least when it's useful for promnesia/other consumers, there's some sort of standard
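A sketch of that kind of derived key, assuming a sha-hash of the fields as the fallback; Visit and its fields are stand-ins, not google_takeout_parser's actual models:

```python
# Derived "stable id" property: prefer an upstream id, else hash the
# fields that are unlikely to change. Whether dt is actually stable
# enough is the open question discussed above.
import hashlib
from datetime import datetime, timezone
from typing import NamedTuple, Optional

class Visit(NamedTuple):
    url: str
    dt: datetime
    source_id: Optional[str] = None  # some sources already provide an id

    @property
    def key(self) -> str:
        if self.source_id is not None:
            return self.source_id
        raw = f"{self.url}|{self.dt.isoformat()}"
        return hashlib.sha256(raw.encode()).hexdigest()[:16]

v = Visit("https://example.com", datetime(2020, 1, 1, tzinfo=timezone.utc))
print(v.key)  # deterministic: same data -> same key
```
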


Joshfindit commented on August 21, 2024

If you don’t mind the interjection, I think I can highlight something that might be getting missed:

I suspect that @seanbreckenridge and I have a similar desire for absolute truth even if it means putting in the extra work manually.

From that perspective someone can see that in-file metadata can be (and has been) wrong, and that it’s important for an “absolute truth” system to recognize when we’ve marked a correction and to even favour that correction.

For example:

  1. An image is taken and the device records the wrong date into the EXIF tag
  2. As we ingest the unaltered image file, we can see what the date was supposed to be (within about 5 min) because the memory is fresh

Now, we could alter the EXIF tag but that corrupts the ideal of this file being the original copy (yes I’m going a bit far with this, but date is still tricky to modify and other metadata types are not as forgiving)

What would be better for archival purposes is a record of the correction, maybe a note about why it was wrong, and the system moving forward with the corrected data. This keeps the original untouched, leaves an audit trail, and still lets us view reports without having to think about it again.

And the dream of our system being a source of absolute truth stays intact.
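One way the correction record could look, as a sketch; all names here are illustrative, not an existing HPI API:

```python
# Append-only log of corrections applied on top of untouched source data:
# the original stays intact, the corrected view is what consumers see.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List

@dataclass
class Correction:
    key: str            # stable id of the datapoint being corrected
    field: str          # which attribute was wrong
    old: Any
    new: Any
    note: str           # why it was wrong, for the audit trail
    recorded_at: datetime

corrections: List[Correction] = [
    Correction("img123", "dt", "2012-01-01T00:00", "2012-01-01T00:05",
               "camera clock was ~5min slow", datetime.now(timezone.utc)),
]

def apply_corrections(record: Dict[str, Any]) -> Dict[str, Any]:
    # original data stays untouched; consumers see the corrected view
    fixed = dict(record)
    for c in corrections:
        if record.get("key") == c.key:
            fixed[c.field] = c.new
    return fixed

print(apply_corrections({"key": "img123", "dt": "2012-01-01T00:00"}))
# {'key': 'img123', 'dt': '2012-01-01T00:05'}
```
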


seanbreckenridge commented on August 21, 2024

and I have a similar desire for absolute truth even if it means putting in the extra work manually

As we ingest the unaltered image file, we can see what the date was supposed to be (within about 5 min)

true, I would personally be fine with putting in the extra work, but I don't want to put that burden on everyone else using the module

In my opinion at least, HPI is meant to be able to take messy data as input and deal with it, and return (not throw) errors/warnings when it can't (which is also sort of stated in "reasonably defensive and extensible" in the design docs)

This also slightly feels outside of the realm of HPI (at least for fixing tags on images; which is also why I left the comment at the top):

"since this is a bit more state-ful than HPI typically is (this would include a CLI which saves data to disk based on prompting the user)"

Could give the user more than one choice:

  • use all images in a directory and just use everything
  • use all images but warn if something seems unlikely based on surrounding locations
  • manually confirm everything
  • [or something in between using some heuristics/known locations... ?]

but it is getting pretty complicated to maintain all of this in this repo at that point, and it doesn't need to be here

For images in particular, what might make sense is to create a separate repo/CLI tool which manages your images in whatever custom way we may want, probably built on top of an existing project like elodie (or something else, haven't scoured for related projects yet).

That repo would let you import/manage your photos incrementally, and could be used for people who want the absolute truth/to confirm all of their image data, and then we'd just have a file here like my/photos/photo_importer.py which parses the output of that CLI tool

What would be better for archival purposes is a record of the correction, maybe a note about why it was wrong, and the system moving forward with the corrected data. This keeps the original untouched, leaves an audit trail, and still lets us view reports without having to think about it again.

I do think this is a good idea, and will try to implement something like this at some point


seanbreckenridge commented on August 21, 2024

Hmm -- the more I try to implement any sort of confirmation mechanism, it's either:

  • too much work for the user
  • or, has to use other sources to find outliers, which would make this especially difficult to debug if anything goes wrong

I think instead it would be easier to add a filter function to the config, like:

def filter_ips(ip: IP) -> bool:
    ...  # user-defined behaviour here

class location:
    class fallback:
        class via_ip:
            filter = filter_ips
            accuracy: float = 10_000

Sidenote: this filter could also instead be called from my.ip.all, as those are 'incorrect' IPs. Doesn't particularly matter where; just wanted to bring up the idea of the filter in config, as I could see it possibly being used in other sources in the future.

Edit: this does go down the path of adding more boilerplate to every module though, but most likely filtering 'bad' data isn't that common of a use case? I can only think of IP and photos that might use it, but it does increase the size of the all.py file. I could move that to common.py or ip/filter.py instead, so it's clear you're opting into it and you have to add it to your all.py file yourself

And then have some documentation on how to use it. Could even have 'recipes' or something using some shared functions in my.core.common that help prompt the user/save data to disk. I just feel it would be better to actually try to add this to a few sources and see what common patterns emerge, instead of trying to think through all the possibilities here


seanbreckenridge commented on August 21, 2024

Oh also, I tried implementing the SortedLocations thing, but the types get lost when you do a yield from SortedLocations(...), and cachew may interfere with this as well (not sure). Just doing return SortedLocations(...) makes it non-lazy, so you have to read in all locations before you can return the first from the generator, which is pretty bad...

diff --git a/my/location/all.py b/my/location/all.py
index eec4bcc..627fe5d 100644
--- a/my/location/all.py
+++ b/my/location/all.py
@@ -7,7 +7,7 @@ from typing import Iterator
 from my.core import Stats, LazyLogger
 from my.core.source import import_source
 
-from .common import Location
+from my.location.common import Location, SortedLocations
 
 
 logger = LazyLogger(__name__, level="warning")
@@ -15,9 +15,11 @@ logger = LazyLogger(__name__, level="warning")
 
 def locations() -> Iterator[Location]:
     # can add/comment out sources here to disable them, or use core.disabled_modules
-    yield from _takeout_locations()
-    yield from _gpslogger_locations()
-    yield from _ip_locations()
+    yield from SortedLocations.map_sort(
+        _takeout_locations(),
+        _gpslogger_locations(),
+        _ip_locations(),
+    )
 
 
 @import_source(module_name="my.location.google_takeout")
diff --git a/my/location/common.py b/my/location/common.py
index 5b5c33f..b1391fe 100644
--- a/my/location/common.py
+++ b/my/location/common.py
@@ -1,9 +1,14 @@
+from __future__ import annotations
 from datetime import date, datetime
-from typing import Union, Tuple, Optional
+from typing import Union, Tuple, Optional, Iterable, TypeVar
 from dataclasses import dataclass
 
+
 from my.core import __NOT_HPI_MODULE__
 from my.core.compat import Protocol
+from my.core.common import LazyLogger
+
+logger = LazyLogger(__name__, level='warning')
 
 DateIsh = Union[datetime, date, str]
 
@@ -32,3 +37,44 @@ class Location(LocationProtocol):
     accuracy: Optional[float]
     elevation: Optional[float]
     datasource: Optional[str] = None  # which module provided this, useful for debugging
+
+
+L = TypeVar("L", bound=LocationProtocol)
+
+# functionality related to merging sorting iterators of locations
+# using heapq.merge https://docs.python.org/3/library/heapq.html
+
+class SortedLocations(Iterable[L]):
+    def __init__(self, locations: Iterable[L], sort_input: bool = True):
+        self.sort_input = sort_input
+        self._locations: Iterable[L] = locations
+
+        # if this is already a SortedLocations, don't sort again
+        if not isinstance(locations, SortedLocations):
+            if sort_input:
+                self._locations = list(locations)
+                self._locations.sort(key=lambda x: x.dt)
+        self._iter = iter(self._locations)
+
+    def __iter__(self) -> SortedLocations:
+        return self
+
+    def __next__(self) -> L:
+        return next(self._iter)
+
+    @classmethod
+    def map_sort(cls, *locations: Iterable[L]) -> SortedLocations:
+        '''
+        Sorts locations from each iterable, then merge them
+        '''
+        logger.debug(f"Sorting locations from {len(locations)} sources")
+        # if already sorted, should be close to O(1) anyways, so shouldnt matter that much
+        # its safer to sort again, but user can opt out by wrapping their input in
+        # SortedLocations(..., sort_input=False), which will skip the sort in the constructor
+        locs = cls.sort(*(cls(loc) for loc in locations))
+        return cls(locs, sort_input=False)
+
+    @staticmethod
+    def sort(*locations: Iterable[L]) -> Iterable[L]:
+        import heapq
+        return heapq.merge(*locations, key=lambda x: x.dt)
diff --git a/my/location/fallback/via_ip.py b/my/location/fallback/via_ip.py
index 0e8fb05..3bd4775 100644
--- a/my/location/fallback/via_ip.py
+++ b/my/location/fallback/via_ip.py
@@ -4,10 +4,12 @@ Converts IP addresses provided by my.location.ip to estimated locations
 
 REQUIRES = ["git+https://github.com/seanbreckenridge/ipgeocache"]
 
+from datetime import datetime
+
 from my.core import dataclass, Stats
 from my.config import location
 from my.core.warnings import medium
-from datetime import datetime
+from my.location.common import SortedLocations
 
 
 @dataclass
@@ -42,7 +44,7 @@ def fallback_locations() -> Iterator[FallbackLocation]:
 # for compatibility with my.location.via_ip, this shouldnt be used by other modules
 def locations() -> Iterator[Location]:
     medium("via_ip.locations is deprecated, use via_ip.fallback_locations instead")
-    yield from map(FallbackLocation.to_location, fallback_locations())
+    yield from SortedLocations(iter(map(FallbackLocation.to_location, fallback_locations())))
 
 
 def estimate_location(dt: datetime) -> Location:
diff --git a/my/location/google_takeout.py b/my/location/google_takeout.py
index 80b31cb..2aed070 100644
--- a/my/location/google_takeout.py
+++ b/my/location/google_takeout.py
@@ -10,7 +10,7 @@ from my.google.takeout.parser import events, _cachew_depends_on
 from google_takeout_parser.models import Location as GoogleLocation
 
 from my.core.common import mcachew, LazyLogger, Stats
-from .common import Location
+from my.location.common import Location, SortedLocations
 
 logger = LazyLogger(__name__)
 
@@ -20,11 +20,17 @@ logger = LazyLogger(__name__)
     logger=logger,
 )
 def locations() -> Iterator[Location]:
-    for g in events():
-        if isinstance(g, GoogleLocation):
-            yield Location(
-                lon=g.lng, lat=g.lat, dt=g.dt, accuracy=g.accuracy, elevation=None
-            )
+    def _locations() -> Iterator[Location]:
+        for g in events():
+            if isinstance(g, GoogleLocation):
+                yield Location(
+                    lon=g.lng,
+                    lat=g.lat,
+                    dt=g.dt, accuracy=g.accuracy,
+                    elevation=None,
+                    datasource="google_takeout",
+                )
+    return SortedLocations(_locations())
 
 
 def stats() -> Stats:
diff --git a/my/location/gpslogger.py b/my/location/gpslogger.py
index 95f4474..79e3628 100644
--- a/my/location/gpslogger.py
+++ b/my/location/gpslogger.py
@@ -17,7 +17,6 @@ class config(location.gpslogger):
     accuracy: float = 50.0
 
 
-from itertools import chain
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Iterator, Sequence, List
@@ -27,14 +26,17 @@ from more_itertools import unique_everseen
 
 from my.core import Stats, LazyLogger
 from my.core.common import get_files, mcachew
-from .common import Location
+from my.location.common import Location, SortedLocations
 
 
 logger = LazyLogger(__name__, level="warning")
 
 
 def inputs() -> Sequence[Path]:
-    return get_files(config.export_path, glob="*.gpx")
+    # sort by timestamp so that we can merge them in order
+    # gpslogger files can optionally be prefixed by a device id,
+    # so we can't always just sort by name
+    return sorted(get_files(config.export_path, glob="*.gpx", sort=False), key=lambda p: p.stat().st_mtime)
 
 
 def _cachew_depends_on() -> List[float]:
@@ -45,7 +47,11 @@ def _cachew_depends_on() -> List[float]:
 @mcachew(depends_on=_cachew_depends_on, logger=logger)
 def locations() -> Iterator[Location]:
     yield from unique_everseen(
-        chain(*map(_extract_locations, inputs())), key=lambda loc: loc.dt
+        # sort each file, then merge them all together
+        SortedLocations.map_sort(
+            *map(_extract_locations, inputs())
+        ),
+        key=lambda loc: loc.dt,
     )
 
 
@@ -65,6 +71,7 @@ def _extract_locations(path: Path) -> Iterator[Location]:
                         accuracy=config.accuracy,
                         elevation=point.elevation,
                         dt=datetime.replace(point.time, tzinfo=timezone.utc),
+                        datasource="gpslogger",
                     )
 
 
diff --git a/my/time/tz/via_location.py b/my/time/tz/via_location.py
index 6b8e835..c56c8af 100644
--- a/my/time/tz/via_location.py
+++ b/my/time/tz/via_location.py
@@ -75,7 +75,7 @@ class DayWithZone(NamedTuple):
     zone: Zone
 
 
-from my.location.common import LatLon
+from my.location.common import LatLon, SortedLocations
 
 # for backwards compatibility
 def _locations() -> Iterator[Tuple[LatLon, datetime]]:
@@ -99,7 +99,18 @@ def _locations() -> Iterator[Tuple[LatLon, datetime]]:
 # TODO: could use heapmerge or sort the underlying iterators somehow?
 # see https://github.com/karlicoss/HPI/pull/237#discussion_r858372934
 def _sorted_locations() -> List[Tuple[LatLon, datetime]]:
-    return list(sorted(_locations(), key=lambda x: x[1]))
+    is_already_sorted = False
+    try:
+        import my.location.all
+        is_already_sorted = isinstance(my.location.all.locations(), SortedLocations)
+    except Exception:
+        pass
+
+    if is_already_sorted:
+        logger.debug("my.location.all.locations() is already sorted, skipping sort")
+        return list(_locations())
+    else:
+        return list(sorted(_locations(), key=lambda x: x[1]))
 
 
 # Note: this takes a while, as the upstream since _locations isn't sorted, so this
@@ -192,7 +203,7 @@ def _get_day_tz(d: date) -> Optional[pytz.BaseTzInfo]:
 
 # ok to cache, there are only a few home locations?
 @lru_cache(maxsize=None)
-def _get_home_tz(loc) -> Optional[pytz.BaseTzInfo]:
+def _get_home_tz(loc: LatLon) -> Optional[pytz.BaseTzInfo]:
     (lat, lng) = loc
     finder = _timezone_finder(fast=False) # ok to use slow here for better precision
     zone = finder.timezone_at(lat=lat, lng=lng)
@@ -203,7 +214,7 @@ def _get_home_tz(loc) -> Optional[pytz.BaseTzInfo]:
         return pytz.timezone(zone)
 
 
-def _get_tz(dt: datetime) -> Optional[pytz.BaseTzInfo]:
+def get_tz(dt: datetime) -> Optional[pytz.BaseTzInfo]:
     '''
     Given a datetime, returns the timezone for that date.
     '''
@@ -211,16 +222,13 @@ def _get_tz(dt: datetime) -> Optional[pytz.BaseTzInfo]:
     if res is not None:
         return res
     # fallback to home tz
-    from ...location import home
-    loc = home.get_location(dt)
+    from my.location.fallback.via_home import get_location as get_home_location
+    loc = get_home_location(dt)
     return _get_home_tz(loc=loc)
 
-# expose as 'public' function
-get_tz = _get_tz
-
 
 def localize(dt: datetime) -> tzdatetime:
-    tz = _get_tz(dt)
+    tz = get_tz(dt)
     if tz is None:
         # TODO -- this shouldn't really happen.. think about it carefully later
         return dt

You could always just sort the inputs anyways (which is what I tried), since TimSort (the default Python sort) has a fast case when big chunks (or the whole list) are already sorted, but then you're always doing a list(sorted(itr)) on every source before you can start the heapmerge.

Could add some version of this that the user can opt in to, but it's hard to make it automatic
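To make the laziness tradeoff concrete, a small stdlib-only demo (the tracked source is made up): sorted() exhausts its input before returning anything, while heapq.merge pulls roughly one element per source to produce the first result:

```python
import heapq
from typing import Iterator, List

pulled: List[int] = []

def tracked(n: int) -> Iterator[int]:
    # counts how many elements consumers actually pull from this source
    for i in range(n):
        pulled.append(i)
        yield i

# heapq.merge: getting the first element touches only one item per source
merged = heapq.merge(tracked(1000), iter([10**9]))
first = next(merged)
print(first, len(pulled))  # 0 1

# sorted(): the whole input is consumed before anything is returned
pulled.clear()
first_sorted = sorted(tracked(1000))[0]
print(first_sorted, len(pulled))  # 0 1000
```
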


seanbreckenridge commented on August 21, 2024

Since #263 is merged (which was the location fallback system in general), going to close this in favor of #276

