Comparison with R / R libraries
Since pandas aims to provide much of the data manipulation and analysis functionality that people use R for, this page takes a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast operations are; hard numbers and benchmarks are preferable
- Ease of use: whether one tool is easier or harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page also serves as a translation guide for users of these R packages.
Base R
subset
New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than or equal to another column’s values:
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]  # note the comma
In pandas, there are a few ways to perform subsetting. You can use query(), pass a boolean expression directly to the indexing operator, or use boolean indexing with loc:
In [1]: from pandas import DataFrame

In [2]: from numpy.random import randn

In [3]: df = DataFrame({'a': randn(10), 'b': randn(10)})

In [4]: df.query('a <= b')
          a         b
2 -1.950301  0.173875
3 -1.478332 -0.798063
5 -0.806934  0.141070
8  0.084343  0.879800
9 -0.590813  0.465165

In [5]: df[df.a <= df.b]
          a         b
2 -1.950301  0.173875
3 -1.478332 -0.798063
5 -0.806934  0.141070
8  0.084343  0.879800
9 -0.590813  0.465165

In [6]: df.loc[df.a <= df.b]
          a         b
2 -1.950301  0.173875
3 -1.478332 -0.798063
5 -0.806934  0.141070
8  0.084343  0.879800
9 -0.590813  0.465165
For more details and examples see the query documentation.
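Because the session above uses random data, its output differs from run to run. The following is a minimal sketch, using a small hand-written frame (the values are illustrative assumptions, not taken from the original example), that checks that the three subsetting styles select the same rows:

```python
import pandas as pd

# Illustrative data chosen by hand so the result is deterministic.
df = pd.DataFrame({"a": [1.0, 5.0, 3.0, 0.5],
                   "b": [2.0, 4.0, 3.0, 1.5]})

# Three ways to keep the rows where a <= b.
via_query = df.query("a <= b")           # string expression
via_mask = df[df["a"] <= df["b"]]        # boolean mask with []
via_loc = df.loc[df["a"] <= df["b"]]     # boolean mask with .loc

# All three return the same rows with the original index labels.
assert via_query.equals(via_mask)
assert via_mask.equals(via_loc)
print(list(via_query.index))  # [0, 2, 3]
```

Row 1 is dropped because 5.0 > 4.0, and row 2 is kept because the comparison is inclusive.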
with
New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
In [7]: df = DataFrame({'a': randn(10), 'b': randn(10)})

In [8]: df.eval('a + b')
0   -0.316408
1    2.764941
2    2.079059
3   -0.149641
4    1.708174
5   -0.695574
6   -0.513258
7    0.543637
8    1.373293
9    0.466815
dtype: float64

In [9]: df.a + df.b  # same as the previous expression
0   -0.316408
1    2.764941
2    2.079059
3   -0.149641
4    1.708174
5   -0.695574
6   -0.513258
7    0.543637
8    1.373293
9    0.466815
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
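As a deterministic companion to the random-data session, here is a minimal sketch (the frame values are illustrative assumptions) that checks that eval() and ordinary column arithmetic give the same result:

```python
import pandas as pd

# Illustrative data chosen by hand so the result is deterministic.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [10.0, 20.0, 30.0]})

via_eval = df.eval("a + b")     # expression evaluated against the columns
direct = df["a"] + df["b"]      # plain column arithmetic

# Both approaches produce the same values on the same index.
assert via_eval.equals(direct)
print(via_eval.tolist())  # [11.0, 22.0, 33.0]
```

On a frame this small, any speed difference is not meaningful; whether eval() helps in practice likely depends on the size of the data and on the expression engine available, so it is worth timing both forms on your own workload.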