Randomly sample rows in pandas
This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.
Randomly sample rows from a DataFrame
In [1]:
import pandas as pd
In [2]:
link = 'http://bit.ly/uforeports'
ufo = pd.read_csv(link)
In [3]:
ufo.head()
Out[3]:
In [4]:
# sample 3 random rows
# without a seed, each run returns a different random selection
ufo.sample(n=3)
Out[4]:
In [7]:
# you can use random_state for reproducibility
ufo.sample(n=3, random_state=2)
Out[7]:
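To see what random_state does, you can sample twice with the same seed and compare the results. This is only a sketch: `df` and its columns are made-up stand-in data (not the UFO dataset), so that it runs without a network connection.

```python
import pandas as pd

# hypothetical stand-in data, not the UFO dataset
df = pd.DataFrame({"city": list("ABCDEFGH"), "value": range(8)})

# the same seed selects the same rows, in the same order
first = df.sample(n=3, random_state=2)
second = df.sample(n=3, random_state=2)

same_rows = first.equals(second)
print(same_rows)
```

Changing the seed, or leaving it out, would normally give a different selection.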
In [9]:
# sample a fraction of the rows instead of a fixed count
# frac=0.75 returns a random 75% of the rows
ufo.sample(frac=0.75, random_state=99)
Out[9]:
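A small check of what frac returns, again on made-up data (`df` is an assumption for illustration, chosen with 100 rows so the arithmetic is easy to follow). By default sample draws without replacement, so no row should appear twice.

```python
import pandas as pd

# hypothetical stand-in data with 100 rows
df = pd.DataFrame({"value": range(100)})

subset = df.sample(frac=0.75, random_state=99)

n_rows = len(subset)                  # 75% of 100 rows
no_repeats = subset.index.is_unique   # default is sampling without replacement
print(n_rows, no_repeats)
```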
For machine learning train-test split
- You need non-overlapping rows in your train and test sets
In [10]:
train = ufo.sample(frac=0.75, random_state=99)
In [12]:
# sampling another 25% independently could pick rows that are already in train
# instead, keep every row whose index label is not in train (the remaining 25%)
# note: this relies on ufo having unique index labels
test = ufo.loc[~ufo.index.isin(train.index), :]
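It is worth verifying that the two sets really do not overlap and together cover every row. The sketch below repeats the split on made-up data (`df` is a stand-in, not the UFO dataset) and also shows `drop` as an alternative way to get the remaining rows. Both approaches assume the index labels are unique.

```python
import pandas as pd

# hypothetical stand-in data with a default, unique integer index
df = pd.DataFrame({"value": range(100)})

train = df.sample(frac=0.75, random_state=99)
test = df.loc[~df.index.isin(train.index), :]

# alternative: drop the train labels from the original DataFrame
test_alt = df.drop(train.index)

# no shared labels, and the two sets account for every row
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
print(test.equals(test_alt))
```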