xorbitsai / xorbits
Scalable Python DS & ML, in an API compatible & lightning fast way.
Home Page: https://xorbits.io
License: Apache License 2.0
There are several things to do after moving Mars into xorbits in #195:
When an unsupported API is called, instead of raising a NotImplemented error, xorbits should be able to fall back to libraries like pandas or numpy and behave as expected.
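One way such a fallback could work is sketched below. This is only an illustration, not xorbits' actual implementation: `FallbackModule`, the `native` mapping, and the use of the standard-library `math` module as a stand-in for pandas or numpy are all assumptions made for the example.

```python
import math
import warnings


class FallbackModule:
    """Serve natively implemented functions, falling back to a reference library."""

    def __init__(self, implemented, fallback):
        self._implemented = implemented  # name -> native implementation
        self._fallback = fallback        # library used when a name is missing

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails.
        if name in self._implemented:
            return self._implemented[name]
        try:
            func = getattr(self._fallback, name)
        except AttributeError:
            raise AttributeError(f"{name!r} is not available in either library") from None
        warnings.warn(f"{name} is not implemented natively; falling back", RuntimeWarning)
        return func


# Hypothetical "native" implementation of a single function.
native = {"sqrt": lambda x: x ** 0.5}
mod = FallbackModule(native, math)

print(mod.sqrt(16.0))   # served by the native implementation
print(mod.floor(2.7))   # served by the fallback library, with a warning
```

Emitting a warning on fallback keeps the behavior transparent: the call succeeds, but the user can see that it did not run through the native path.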
K8s starts with a confusing warning: Readiness probe failed: dial tcp 172.31.32.96:15031: connect: connection refused.
Consider extending initialDelaySeconds or using a startupProbe.
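As a rough sketch of the startupProbe option, the snippet below builds the probe section of a container spec as a plain Python dict. The field names (`startupProbe`, `readinessProbe`, `tcpSocket`, `periodSeconds`, `failureThreshold`) are standard Kubernetes probe fields; the helper `build_probes` and the timing values are assumptions for illustration, and the port is the one from the warning above.

```python
import json

SERVICE_PORT = 15031  # port taken from the warning message above


def build_probes(port, startup_budget_seconds=120, period_seconds=5):
    """Return probe settings that tolerate a slow-starting service."""
    return {
        # Kubernetes holds back the readiness probe until the startup probe succeeds.
        "startupProbe": {
            "tcpSocket": {"port": port},
            "periodSeconds": period_seconds,
            # Ceiling division, so the total budget is at least the requested one.
            "failureThreshold": -(-startup_budget_seconds // period_seconds),
        },
        "readinessProbe": {
            "tcpSocket": {"port": port},
            "periodSeconds": period_seconds,
        },
    }


probes = build_probes(SERVICE_PORT)
print(json.dumps(probes, indent=2))
```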
PlotAccessor.__call__ is not implemented.
import xorbits.pandas as pd
df = pd.Series([1, 2, 3])
df.plot()
Note that the issue tracker is NOT the place for general support. For
discussions about development, questions about usage, or any general questions,
contact us on https://discuss.xorbits.io/.
Add CI for installation.
For now, df.dtypes returns a pandas Series and df.columns returns a Mars series; we should return only xorbits data to unify these behaviors.
In an interactive environment, users tend to do exploratory analysis and there is no so-called "final result". Deferred execution may therefore run the same computation more than once. In the following example, lines 0 and 1 are executed twice:
[0]: df = pd.DataFrame({"foo": (1, 2, 3), "bar": (4, 5, 6)})
[1]: df["baz"] = df["bar"] + 3
[2]: df.sum(axis=0)
[3]: df.describe()
Execute named variables eagerly in interactive env.
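The duplicated work, and the effect of executing named variables eagerly, can be illustrated with a toy model. This is not how xorbits schedules work: the `Lazy` class, the `session` helper, and the use of plain lists instead of DataFrames are all assumptions for the sketch.

```python
class Lazy:
    """A deferred value that records how often its computation runs."""

    def __init__(self, name, func, *deps, counter):
        self.name = name
        self.func = func
        self.deps = deps
        self.counter = counter
        self._cached = None
        self._done = False

    def compute(self, reuse=False):
        # With reuse=True, an already computed result is returned as-is.
        if reuse and self._done:
            return self._cached
        args = [dep.compute(reuse) for dep in self.deps]
        self.counter[self.name] = self.counter.get(self.name, 0) + 1
        value = self.func(*args)
        if reuse:
            self._cached, self._done = value, True
        return value


def session(eager):
    """Replay the four input lines and return how often each one ran."""
    counter = {}
    bar = Lazy("line0", lambda: [4, 5, 6], counter=counter)
    baz = Lazy("line1", lambda b: [x + 3 for x in b], bar, counter=counter)
    if eager:
        baz.compute(reuse=True)  # run the named variable as soon as it is defined
    total = Lazy("line2", sum, baz, counter=counter)
    largest = Lazy("line3", max, baz, counter=counter)
    total.compute(reuse=eager)
    largest.compute(reuse=eager)
    return counter


deferred_counts = session(eager=False)
eager_counts = session(eager=True)
print(deferred_counts)
print(eager_counts)
```

In the deferred run, each displayed result recomputes its whole dependency chain, so lines 0 and 1 run twice. In the eager run, they run once and later inputs reuse the stored result.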
Docker image supports multiple Python versions.
Multiple supervisors are not widely used now; to avoid ambiguity, we can replace the example with a single-supervisor one.
Here's an example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
>>> df.groupby('A').plot()
A
bar AxesSubplot(0.125,0.11;0.775x0.77)
foo AxesSubplot(0.125,0.11;0.775x0.77)
dtype: object
I'm thinking of adding a property own_data to indicate whether a xorbits object holds its data directly. For objects created from pandas or numpy, iterating over and printing them doesn't need execution; it is enough to iterate over or print the underlying pandas DataFrame or numpy ndarray.
Here is the list of methods that can skip execution for an "own-data" entity:
__str__
__repr__
Since Mars hasn't implemented __iter__, neither has xorbits.
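A minimal sketch of the own_data idea follows. The class `OwnDataRef` is a hypothetical stand-in, not the real xorbits data reference; a plain list plays the role of a pandas DataFrame or numpy ndarray, and the `executions` counter exists only to make the skipped execution visible.

```python
class OwnDataRef:
    """Toy reference that either holds local data or needs execution."""

    def __init__(self, local_data=None, compute=None):
        self._local_data = local_data  # data the object was created from, if any
        self._compute = compute        # deferred computation otherwise
        self.executions = 0

    @property
    def own_data(self):
        # True when the object was built directly from in-memory data.
        return self._local_data is not None

    def _materialize(self):
        if self.own_data:
            return self._local_data    # no execution needed
        self.executions += 1
        return self._compute()

    def __repr__(self):
        return repr(self._materialize())

    def __str__(self):
        return str(self._materialize())


from_local = OwnDataRef(local_data=[1, 2, 3])
deferred = OwnDataRef(compute=lambda: [2, 4, 6])

print(repr(from_local), from_local.executions)
print(str(deferred), deferred.executions)
```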
I installed xorbits with pip and tried to import xorbits.pandas, which raised the error below:
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xorbits.pandas as pd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/__init__.py", line 17, in <module>
from .core import run
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/core/__init__.py", line 16, in <module>
from .execution import run
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/core/execution.py", line 17, in <module>
from .adapter import MarsEntity, mars_execute
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/core/adapter.py", line 26, in <module>
from .._mars import dataframe as mars_dataframe
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/_mars/dataframe/__init__.py", line 17, in <module>
from .initializer import DataFrame, Series, Index
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/_mars/dataframe/initializer.py", line 20, in <module>
from ..tensor import tensor as astensor, stack
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/_mars/tensor/__init__.py", line 297, in <module>
from . import special
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/_mars/tensor/special/__init__.py", line 18, in <module>
from .err_fresnel import (
File "/home/bji/.local/lib/python3.8/site-packages/xorbits/_mars/tensor/special/err_fresnel.py", line 221, in <module>
@implement_scipy(spspecial.voigt_profile)
AttributeError: module 'scipy.special' has no attribute 'voigt_profile'
I checked the packages installed in my Python environment. The root cause seems to be that I had previously installed scipy==1.3.3.
After I manually updated scipy to 1.4.1, the import worked (it seems that loading xorbits.pandas needs at least scipy 1.4.0).
It can be reproduced with:
pip3 install 'scipy<1.4.0'
pip3 install xorbits
python3 -c 'import xorbits.pandas as pd'
I'd like to provide a PR to fix it later.
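Besides requiring scipy>=1.4.0 at install time, one possible fix is to look the function up by name, so that a missing attribute no longer fails at import time. The sketch below illustrates that idea only: `implement_if_available` and `registry` are hypothetical names, not the decorator shown in the traceback, and a `SimpleNamespace` stands in for an old `scipy.special`.

```python
import types


def implement_if_available(module, name, registry):
    """Register the decorated function only if `module` provides `name`."""

    def decorator(func):
        # The name is a string, so a missing attribute cannot raise here.
        if hasattr(module, name):
            registry[name] = func
        return func

    return decorator


# Stand-in for an old scipy.special that lacks voigt_profile.
old_special = types.SimpleNamespace(erf=lambda x: x)
registry = {}


@implement_if_available(old_special, "erf", registry)
def erf(x):
    return old_special.erf(x)


@implement_if_available(old_special, "voigt_profile", registry)
def voigt_profile(x, sigma, gamma):
    raise NotImplementedError("backend function is missing")


print(sorted(registry))
```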
mars.deploy.oscar.session.py
warning_msg = """
No session found, local session \
will be created in background, \
it may take a while before execution. \
If you want to new a local session by yourself, \
run code below:
```
import mars
mars.new_session()
```
"""
import xorbits.pandas as pd
s = pd.Series([1, 2, 3])
s.loc[0] = 111
TypeError Traceback (most recent call last)
Cell In [3], line 1
----> 1 s.loc[0] = 111
TypeError: 'DataFrameLoc' object does not support item assignment
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df.loc[0] = [11, 22]
TypeError Traceback (most recent call last)
Cell In [5], line 1
----> 1 df.loc[0] = [11, 22]
TypeError: 'DataFrameLoc' object does not support item assignment
Other indexing methods such as iloc, at, and iat have the same problem.
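The TypeError means the indexer object defines no `__setitem__`. The toy classes below show the shape of an indexer that supports both reads and assignment. `ToySeries` and `LocIndexer` are hypothetical and dict-backed; they say nothing about how xorbits or Mars would implement a deferred assignment.

```python
class LocIndexer:
    """Label-based indexer supporting both lookup and assignment."""

    def __init__(self, owner):
        self._owner = owner

    def __getitem__(self, label):
        return self._owner._data[label]

    def __setitem__(self, label, value):
        # Without this method, `obj.loc[label] = value` raises TypeError.
        self._owner._data[label] = value


class ToySeries:
    def __init__(self, values):
        self._data = dict(enumerate(values))

    @property
    def loc(self):
        return LocIndexer(self)


s = ToySeries([1, 2, 3])
s.loc[0] = 111
print(s.loc[0])
```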
DataFrame.iloc returns a Mars object.
>>> import numpy as np
>>> import xorbits.pandas as pd
>>> dates = pd.date_range("20130101", periods=6)
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
>>> df.iloc[3]
Series(op=DataFrameIlocGetItem)
Expected output:
>>> import numpy as np
>>> import xorbits.pandas as pd
>>> dates = pd.date_range("20130101", periods=6)
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
>>> df.iloc[3]
A 0.721555
B -0.706771
C -1.039575
D 0.271860
Name: 2013-01-04 00:00:00, dtype: float64
xorbits currently uses tqdm to show execution progress, which leaves too many bars in the console when there are multiple tasks. A more flexible way should be supported in the future.
np.sort failed for GPU.
In [10]: %time print(np.sort(np.random.rand(100_000_000, gpu=True)))
0%| | 0.00/100 [00:01<?, ?it/s]2023-01-30 09:27:22,974 xorbits._mars.services.scheduling.worker.execution 2867073 ERROR Failed to run subtask FvEe9WQxwQ3wEr5zzgnGcQe8 on band gpu-0
Traceback (most recent call last):
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/services/subtask/worker/processor.py", line 203, in _execute_operand
return execute(ctx, op)
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/core/operand/core.py", line 491, in execute
result = executor(results, op)
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/tensor/base/psrs.py", line 525, in execute
res = ctx[op.outputs[0].key] = _sort(a, op, xp)
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/tensor/base/psrs.py", line 425, in _sort
assert xp is cp
AssertionError
2023-01-30 09:27:22,976 xorbits._mars.services.task.execution.mars.stage 2867073 ERROR Subtask FvEe9WQxwQ3wEr5zzgnGcQe8 errored
Traceback (most recent call last):
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/services/subtask/worker/processor.py", line 203, in _execute_operand
return execute(ctx, op)
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/core/operand/core.py", line 491, in execute
result = executor(results, op)
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/tensor/base/psrs.py", line 525, in execute
res = ctx[op.outputs[0].key] = _sort(a, op, xp)
File "/home/xuyeqin/projects/xorbits/python/xorbits/_mars/tensor/base/psrs.py", line 425, in _sort
assert xp is cp
AssertionError
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100.00/100 [00:01<00:00, 73.21it/s]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
File <timed eval>:1
File ~/projects/xorbits/python/xorbits/utils.py:33, in safe_repr_str.<locals>.inn(self, *args, **kwargs)
31 return getattr(object, f.__name__)(self)
32 else:
---> 33 return f(self, *args, **kwargs)
File ~/projects/xorbits/python/xorbits/core/data.py:223, in DataRef.__str__(self)
221 return self.data._mars_entity.op.data.__str__()
222 else:
--> 223 run(self)
224 return self.data.__str__()
File ~/projects/xorbits/python/xorbits/core/execution.py:42, in run(obj)
40 if isinstance(obj, DataRef):
41 if need_to_execute(obj):
---> 42 mars_execute(_get_mars_entity(obj))
43 else:
44 refs_to_execute = [_get_mars_entity(ref) for ref in obj if need_to_execute(ref)]
File ~/projects/xorbits/python/xorbits/_mars/deploy/oscar/session.py:1875, in execute(tileable, session, wait, new_session_kwargs, show_progress, progress_update_interval, *tileables, **kwargs)
1873 session = get_default_or_create(**(new_session_kwargs or dict()))
1874 session = _ensure_sync(session)
-> 1875 return session.execute(
1876 tileable,
1877 *tileables,
1878 wait=wait,
1879 show_progress=show_progress,
1880 progress_update_interval=progress_update_interval,
1881 **kwargs,
1882 )
File ~/projects/xorbits/python/xorbits/_mars/deploy/oscar/session.py:1669, in SyncSession.execute(self, tileable, show_progress, warn_duplicated_execution, *tileables, **kwargs)
1667 fut = asyncio.run_coroutine_threadsafe(coro, self._loop)
1668 try:
-> 1669 execution_info: ExecutionInfo = fut.result(
1670 timeout=self._isolated_session.timeout
1671 )
1672 except KeyboardInterrupt: # pragma: no cover
1673 logger.warning("Cancelling running task")
File ~/miniconda3/envs/mars/lib/python3.9/concurrent/futures/_base.py:446, in Future.result(self, timeout)
444 raise CancelledError()
445 elif self._state == FINISHED:
--> 446 return self.__get_result()
447 else:
448 raise TimeoutError()
File ~/miniconda3/envs/mars/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
389 if self._exception:
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception
394 self = None
File ~/projects/xorbits/python/xorbits/_mars/deploy/oscar/session.py:1855, in _execute(session, wait, show_progress, progress_update_interval, cancelled, *tileables, **kwargs)
1852 else:
1853 # set cancelled to avoid wait task leak
1854 cancelled.set()
-> 1855 await execution_info
1856 else:
1857 return execution_info
File ~/projects/xorbits/python/xorbits/_mars/deploy/oscar/session.py:106, in ExecutionInfo._ensure_future.<locals>.wait()
105 async def wait():
--> 106 return await self._aio_task
File ~/projects/xorbits/python/xorbits/_mars/deploy/oscar/session.py:954, in _IsolatedSession._run_in_background(self, tileables, task_id, progress, profiling)
948 logger.warning(
949 "Profile task %s execution result:\n%s",
950 task_id,
951 json.dumps(task_result.profiling, indent=4),
952 )
953 if task_result.error:
--> 954 raise task_result.error.with_traceback(task_result.traceback)
955 if cancelled:
956 return
File ~/projects/xorbits/python/xorbits/_mars/services/task/supervisor/processor.py:373, in TaskProcessor.run(self)
371 async with self._executor:
372 async for stage_args in self._iter_stage_chunk_graph():
--> 373 await self._process_stage_chunk_graph(*stage_args)
374 except Exception as ex:
375 self.result.error = ex
File ~/projects/xorbits/python/xorbits/_mars/services/task/supervisor/processor.py:250, in TaskProcessor._process_stage_chunk_graph(self, stage_id, stage_profiler, chunk_graph)
244 tile_context = await asyncio.to_thread(
245 self._get_stage_tile_context,
246 {c for c in chunk_graph.result_chunks if not isinstance(c.op, Fetch)},
247 )
249 with Timer() as timer:
--> 250 chunk_to_result = await self._executor.execute_subtask_graph(
251 stage_id, subtask_graph, chunk_graph, tile_context
252 )
253 stage_profiler.set("run", timer.duration)
255 self._preprocessor.post_chunk_graph_execution()
File ~/projects/xorbits/python/xorbits/_mars/services/task/execution/mars/executor.py:208, in MarsTaskExecutor.execute_subtask_graph(self, stage_id, subtask_graph, chunk_graph, tile_context, context)
206 curr_tile_progress = self._tile_context.get_all_progress() - prev_progress
207 self._stage_tile_progresses.append(curr_tile_progress)
--> 208 return await stage_processor.run()
File ~/projects/xorbits/python/xorbits/_mars/services/task/execution/mars/stage.py:231, in TaskStageProcessor.run(self)
227 if self.subtask_graph.num_shuffles() > 0:
228 # disable scale-in when shuffle is executing so that we can skip
229 # store shuffle meta in supervisor.
230 await self._scheduling_api.disable_autoscale_in()
--> 231 return await self._run()
232 finally:
233 if self.subtask_graph.num_shuffles() > 0:
File ~/projects/xorbits/python/xorbits/_mars/services/task/execution/mars/stage.py:251, in TaskStageProcessor._run(self)
249 if self.error_or_cancelled():
250 if self.result.error is not None:
--> 251 raise self.result.error.with_traceback(self.result.traceback)
252 else:
253 raise asyncio.CancelledError()
File ~/projects/xorbits/python/xorbits/_mars/services/subtask/worker/processor.py:203, in _execute_operand()
198 @enter_mode(build=False, kernel=True)
199 def _execute_operand(
200 self, ctx: Dict[str, Any], op: OperandType
201 ): # noqa: R0201 # pylint: disable=no-self-use
202 try:
--> 203 return execute(ctx, op)
204 except BaseException as ex:
205 # wrap exception in execution to avoid side effects
206 raise ExecutionError(ex).with_traceback(ex.__traceback__) from None
File ~/projects/xorbits/python/xorbits/_mars/core/operand/core.py:491, in execute()
487 else:
488 # Cast `UFuncTypeError` to `TypeError` since subclasses of the former is unpickleable.
489 # The `UFuncTypeError` was introduced by numpy#12593 since v1.17.0.
490 try:
--> 491 result = executor(results, op)
492 succeeded = True
493 if op.stage is not None:
File ~/projects/xorbits/python/xorbits/_mars/tensor/base/psrs.py:525, in execute()
522 if not op.return_indices:
523 if op.kind is not None:
524 # sort
--> 525 res = ctx[op.outputs[0].key] = _sort(a, op, xp)
526 else:
527 # do not sort, prepare for sample by `xp.partition`
528 kth = xp.linspace(
529 max(w - 1, 0), a.shape[op.axis] - 1, num=n, endpoint=False
530 ).astype(int)
File ~/projects/xorbits/python/xorbits/_mars/tensor/base/psrs.py:425, in _sort()
422 return method(axis=axis, kind=kind, order=order)
423 else: # pragma: no cover
424 # cupy does not support structure type
--> 425 assert xp is cp
426 assert order is not None
427 method = a.sort if inplace else partial(cp.sort, a)
AssertionError:
Xorbits currently has trouble with cgroup v2 when collecting process stats; https://github.com/xprobe-inc/xorbits/blob/2e2e624798178b1abcfd7cc74e2c762b1ae50512/python/xorbits/deploy/kubernetes/config.py#L651-L652 should be uncommented once this is fixed.
Using xorbits.pandas.read_csv() does not honor the skiprows keyword argument.
import xorbits.pandas as pd
df = pd.read_csv('file.txt',
sep=' ',
names=('val1', 'val2'),
dtype = {'val1': 'int', 'val2': 'float'},
skiprows=1)
Contents of file.txt:
# This is a comment line
1 2.2
2 4.4
3 6.6
4 8.8
5 11.0
6 13.2
7 15.4
8 17.6
9 19.8
10 22.0
Expected: a DataFrame is generated from the data without errors.
I checked whether the number of fields in the comment line matters, and it does not, but the error appears to occur on the second-to-last field. Changing the data separator doesn't affect this behavior either. Switching the import to vanilla pandas lets the code run fine.
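For reference, the result expected from skiprows=1 on this kind of file can be spelled out with the standard library alone. The `read_rows` helper below is hypothetical and only mirrors the intended semantics (drop the first line, then parse space-separated int/float pairs); it is not how pandas or xorbits parse CSV files, and it uses a shortened copy of the data above.

```python
import csv
import os
import tempfile

# Shortened copy of file.txt from the report above.
CONTENT = "# This is a comment line\n1 2.2\n2 4.4\n3 6.6\n"


def read_rows(path, skiprows=0):
    """Skip the first `skiprows` lines, then parse (int, float) pairs."""
    with open(path, newline="") as fh:
        for _ in range(skiprows):
            next(fh, None)  # discard one physical line
        reader = csv.reader(fh, delimiter=" ")
        return [(int(val1), float(val2)) for val1, val2 in reader]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "w") as fh:
        fh.write(CONTENT)
    rows = read_rows(path, skiprows=1)

print(rows)
```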
Handle version-sensitive packages (e.g. cloudpickle, numpy, pandas) for users when deploying Xorbits on K8s.
Creating issue templates for xorbits.
The "10 minutes to xorbits.numpy" doc requires some third-party libraries such as hdf5; maybe we need to add a note telling users how to install them.
Enhance setup.py to support pip install 'xorbits[aws]'.
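With setuptools, optional dependency groups of this kind are declared through the `extras_require` keyword of `setup()`. The fragment below sketches the shape such a declaration might take. The package listed under "aws" is a placeholder assumption, since the issue does not say which packages the extra should pull in.

```python
# Optional dependency groups, keyed by the name used in `pip install 'xorbits[<name>]'`.
extras_require = {
    "aws": ["boto3"],  # assumption: placeholder, the real list is not given in the issue
}

# Convenience group that installs every optional dependency.
extras_require["all"] = sorted({pkg for pkgs in extras_require.values() for pkg in pkgs})

# These keyword arguments would be passed on as setuptools.setup(**setup_kwargs).
setup_kwargs = {"name": "xorbits", "extras_require": extras_require}
print(setup_kwargs)
```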
In [9]: def raise_repr(arg):
...: raise TypeError(f'Unknown arg {repr(arg)}')
...:
In [10]: raise_repr(DataFrame([1,2,3]))
/Users/hekaisheng/Documents/projects/xorbits/python/xorbits/_mars/deploy/oscar/session.py:2049: UserWarning: No existing session found, creating a new local session now.
warnings.warn(warning_msg)
2022-12-12 15:21:07,638 xorbits._mars.deploy.oscar.local 22244 WARNING Web service started at http://0.0.0.0:21856
100%|██████████████████████████████████████████████████████████████████████████████| 100.00/100 [00:00<00:00, 288.28it/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [10], line 1
----> 1 raise_repr(DataFrame([1,2,3]))
Cell In [9], line 2, in raise_repr(arg)
1 def raise_repr(arg):
----> 2 raise TypeError(f'Unknown arg {repr(arg)}')
TypeError: Unknown arg 0
0 1
1 2
2 3
DataFrame.at returns a mars object.
>>> import xorbits.pandas as pd
>>> df = pd.DataFrame((1, 2, 3))
>>> df.at[0, 0]
Tensor <op=DataFrameLocGetItem, shape=(), key=f5caa0e174b3142b0aa648f305532703_0>
Expected output:
>>> import xorbits.pandas as pd
>>> df = pd.DataFrame((1, 2, 3))
>>> df.at[0, 0]
1
When creating a Kubernetes cluster, xorbits uses the latest xorbits image, which has the required packages installed, so users need a customized image if they want to use other third-party packages. It would be quite useful to support specifying packages and installing them automatically before the cluster is created. The code would look like this: new_cluster(pip_list=["tensorflow"], conda_list=["numba", "lightgbm"]).
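A rough sketch of how the proposed pip_list and conda_list arguments could be turned into install commands to run in each container before the services start. The helper `build_install_commands` is hypothetical; only the two parameter names come from the proposal above.

```python
import shlex


def build_install_commands(pip_list=None, conda_list=None):
    """Translate package lists into shell commands, quoting each requirement."""
    commands = []
    if conda_list:
        packages = " ".join(shlex.quote(p) for p in conda_list)
        commands.append(f"conda install -y {packages}")
    if pip_list:
        packages = " ".join(shlex.quote(p) for p in pip_list)
        commands.append(f"pip install {packages}")
    return commands


commands = build_install_commands(
    pip_list=["tensorflow"], conda_list=["numba", "lightgbm"]
)
print(commands)
```

Quoting each requirement matters once version specifiers such as `numpy>=1.20` appear, since `>` would otherwise be interpreted by the shell.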