Version 0.10.1 (January 22, 2013)
This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.
An undesired API breakage with functions taking the inplace option has been reverted and deprecation warnings added.
API changes
Functions taking an inplace option return the calling object as before. A deprecation message has been added.
Groupby aggregations Max/Min no longer exclude non-numeric data (GH 2700)
Resampling an empty DataFrame now returns an empty DataFrame instead of raising an exception (GH 2640)
The file reader will now raise an exception when NA values are found in an explicitly specified integer column instead of converting the column to float (GH 2631)
DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH 2563)
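Two of the changes above can be checked with a short sketch. The index name, the timezone, and the CSV text below are made up for illustration, and the behavior shown is that of a recent pandas release rather than 0.10.1 itself.

```python
import io

import pandas as pd

# DatetimeIndex.unique keeps the index type, the name, and the timezone.
idx = pd.DatetimeIndex(
    ["2013-01-01", "2013-01-01", "2013-01-02"], tz="UTC", name="when"
)
unique_idx = idx.unique()
print(type(unique_idx).__name__, unique_idx.name, unique_idx.tz)

# An NA value in a column explicitly typed as integer raises
# instead of silently converting the column to float.
csv_text = "a,b\n1,2\n,3\n"  # second data row has a missing value in "a"
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"a": "int64"})
    raised = False
except ValueError as exc:
    raised = True
    print("read_csv raised:", exc)
```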
New features
MySQL support for database (contribution from Dan Allan)
HDFStore
You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.
You can designate (and index) certain columns of a table that you want to be able to query, by passing a list to data_columns.
In [1]: store = pd.HDFStore("store.h5")
In [2]: df = pd.DataFrame(
...: np.random.randn(8, 3),
...: index=pd.date_range("1/1/2000", periods=8),
...: columns=["A", "B", "C"],
...: )
...:
In [3]: df["string"] = "foo"
In [4]: df.loc[df.index[4:6], "string"] = np.nan
In [5]: df.loc[df.index[7:9], "string"] = "bar"
In [6]: df["string2"] = "cool"
In [7]: df
Out[7]:
A B C string string2
2000-01-01 0.469112 -0.282863 -1.509059 foo cool
2000-01-02 -1.135632 1.212112 -0.173215 foo cool
2000-01-03 0.119209 -1.044236 -0.861849 foo cool
2000-01-04 -2.104569 -0.494929 1.071804 foo cool
2000-01-05 0.721555 -0.706771 -1.039575 NaN cool
2000-01-06 0.271860 -0.424972 0.567020 NaN cool
2000-01-07 0.276232 -1.087401 -0.673690 foo cool
2000-01-08 0.113648 -1.478427 0.524988 bar cool
# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])
In [9]: store.select("df", "B>0 and string=='foo'")
# this is the in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]:
A B C string string2
2000-01-02 -1.135632 1.212112 -0.173215 foo cool
Retrieving unique values in an indexable or data column.
# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")
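The replacement mentioned in the comment above can be sketched as follows. This is a minimal illustration, not the documented example: the small frame and the temporary file path are assumptions, and the on-disk part is skipped when the optional PyTables dependency is missing.

```python
import os
import tempfile

import pandas as pd

# Illustrative data; "string" will be stored as a queryable data column.
df = pd.DataFrame(
    {"A": [0.1, 0.2, 0.3, 0.4], "string": ["foo", "bar", "foo", "baz"]},
    index=pd.date_range("2000-01-01", periods=4),
)

unique_strings = None  # stays None when PyTables is not installed
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    try:
        with pd.HDFStore(path) as store:
            store.append("df", df, data_columns=["string"])
            # select_column reads a single column from disk as a Series
            column = store.select_column("df", "string")
            unique_strings = sorted(column.unique())
    except ImportError:
        print("PyTables is not installed; skipping the on-disk example")

print(unique_strings)
```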
You can now store datetime64 in data columns
In [11]: df_mixed = df.copy()
In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")
In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan
In [14]: store.append("df_mixed", df_mixed)
In [15]: df_mixed1 = store.select("df_mixed")
In [16]: df_mixed1
In [17]: df_mixed1.dtypes.value_counts()
You can pass the columns keyword to select to filter the list of returned columns; this is equivalent to passing a Term('columns',list_of_columns_to_filter)
In [18]: store.select("df", columns=["A", "B"])
HDFStore now serializes MultiIndex dataframes when appending tables.
In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
....: ['one', 'two', 'three']],
....: labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
....: [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
....: names=['foo', 'bar'])
....:
In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
....: columns=['A', 'B', 'C'])
....:
In [21]: df
Out[21]:
A B C
foo bar
foo one -0.116619 0.295575 -1.047704
two 1.640556 1.905836 2.772115
three 0.088787 -1.144197 -0.633372
bar one 0.925372 -0.006438 -0.820408
two -0.600874 -1.039266 0.824758
baz two -0.824095 -0.337730 -0.927764
three -0.840123 0.248505 -0.109250
qux one 0.431977 -0.460710 0.336505
two -3.207595 -1.535854 0.409769
three -0.673145 -0.741113 -0.110891
In [22]: store.append('mi', df)
In [23]: store.select('mi')
Out[23]:
A B C
foo bar
foo one -0.116619 0.295575 -1.047704
two 1.640556 1.905836 2.772115
three 0.088787 -1.144197 -0.633372
bar one 0.925372 -0.006438 -0.820408
two -0.600874 -1.039266 0.824758
baz two -0.824095 -0.337730 -0.927764
three -0.840123 0.248505 -0.109250
qux one 0.431977 -0.460710 0.336505
two -3.207595 -1.535854 0.409769
three -0.673145 -0.741113 -0.110891
# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
A B C
foo bar
bar one 0.925372 -0.006438 -0.820408
two -0.600874 -1.039266 0.824758
Multi-table creation via append_to_multiple and selection via select_as_multiple can create/select from multiple tables and return a combined result, by using where on a selector table.
In [19]: df_mt = pd.DataFrame(
....: np.random.randn(8, 6),
....: index=pd.date_range("1/1/2000", periods=8),
....: columns=["A", "B", "C", "D", "E", "F"],
....: )
....:
In [20]: df_mt["foo"] = "bar"
# you can also create the tables individually
In [21]: store.append_to_multiple(
....: {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
....: )
....:
In [22]: store
# individual tables were created
In [23]: store.select("df1_mt")
In [24]: store.select("df2_mt")
# as a multiple
In [25]: store.select_as_multiple(
....: ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
....: )
....:
Enhancements
HDFStore now can read native PyTables table format tables
You can pass nan_rep = 'my_nan_rep' to append, to change the default nan representation on disk (which converts to/from np.nan), this defaults to nan.
You can pass index to append. This defaults to True. This will automagically create indices on the indexables and data columns of the table
You can pass chunksize=an integer to append, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
You can pass expectedrows=an integer to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance.
Select now supports passing start and stop to provide selection space limiting in selection.
Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH 2698)
Allow DataFrame.merge to handle combinatorial sizes too large for 64-bit integer (GH 2690)
Series now has unary negation (-series) and inversion (~series) operators (GH 2686)
DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH 2327)
Series arithmetic operators can now handle constant and ndarray input (GH 2574)
ExcelFile now takes a kind argument to specify the file type (GH 2613)
A faster implementation for Series.str methods (GH 2602)
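The Series operator enhancements listed above (GH 2686, GH 2574) can be illustrated with a small sketch. The values are arbitrary and chosen only so the results are easy to read.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, -2, 3])
flags = pd.Series([True, False, True])

negated = -s        # unary negation
inverted = ~flags   # unary inversion on boolean data

plus_const = s + 10                        # constant (scalar) operand
plus_array = s + np.array([10, 20, 30])    # ndarray operand, applied positionally

print(negated.tolist())
print(inverted.tolist())
print(plus_const.tolist())
print(plus_array.tolist())
```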
Bug Fixes
HDFStore tables can now store float32 types correctly (cannot be mixed with float64 however)
Fixed Google Analytics prefix when specifying request segment (GH 2713).
Function to reset Google Analytics token store so users can recover from improperly set up client secrets (GH 2687).
Fixed groupby bug resulting in segfault when passing in MultiIndex (GH 2706)
Fixed bug where passing a Series with datetime64 values into to_datetime results in bogus output values (GH 2699)
Fixed bug in pattern in HDFStore expressions when pattern is not a valid regex (GH 2694)
Fixed performance issues while aggregating boolean data (GH 2692)
When given a boolean mask key and a Series of new values, Series __setitem__ will now align the incoming values with the original Series (GH 2686)
Fixed MemoryError caused by performing counting sort on sorting MultiIndex levels with a very large number of combinatorial values (GH 2684)
Fixed bug that causes plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH 2683)
Corrected business day subtraction logic when the offset is more than 5 bdays and the starting date is on a weekend (GH 2680)
Fixed C file parser behavior when the file has more columns than data (GH 2668)
Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value
DataFrames with numerical or datetime indices are now sorted prior to plotting (GH 2609)
Fixed DataFrame.from_records error when passed columns, index, but empty records (GH 2633)
Several bugs fixed for Series operations when dtype is datetime64 (GH 2689, GH 2629, GH 2626)
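As an illustration of the business day fix (GH 2680), here is a minimal sketch of subtracting more than five business days from a weekend date. The start date is an arbitrary Saturday, and the result shown reflects a recent pandas release; BDay does not account for holidays.

```python
import pandas as pd

start = pd.Timestamp("2013-01-05")  # a Saturday

# Six business days back: Jan 4, 3, 2, 1, Dec 31, Dec 28
result = start - pd.offsets.BDay(6)

print(start.day_name(), "->", result.date(), result.day_name())
```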
See the full release notes or issue tracker on GitHub for a complete list.