Randomly sample rows in pandas

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Randomly sample rows from a DataFrame

In [1]:
import pandas as pd
In [2]:
link = 'http://bit.ly/uforeports'
ufo = pd.read_csv(link)
In [3]:
ufo.head()
Out[3]:
                   City  Colors Reported  Shape Reported  State             Time
0                Ithaca              NaN        TRIANGLE     NY   6/1/1930 22:00
1           Willingboro              NaN           OTHER     NJ  6/30/1930 20:00
2               Holyoke              NaN            OVAL     CO  2/15/1931 14:00
3               Abilene              NaN            DISK     KS   6/1/1931 13:00
4  New York Worlds Fair              NaN           LIGHT     NY  4/18/1933 19:00
In [4]:
# draw 3 random rows
# each run returns a different set of 3 rows
ufo.sample(n=3)
Out[4]:
                 City  Colors Reported  Shape Reported  State             Time
13615       Hillsboro              NaN        TRIANGLE     OR    6/3/1999 0:44
  140  East Palestine              NaN           LIGHT     OH  7/10/1950 20:30
15412      Ceder Lake              RED           LIGHT     IN   12/1/1999 6:00
In [7]:
# pass random_state (a seed) to get the same rows on every run
ufo.sample(n=3, random_state=2)
Out[7]:
            City  Colors Reported  Shape Reported  State              Time
 7236   Mesquite              NaN           OTHER     NV  11/25/1993 15:00
14432  Pittsburg              NaN           OTHER     CA     9/4/1999 0:38
 4559     Mondel              NaN             NaN     NM    6/18/1981 3:00
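
One way to check the reproducibility claim is to draw twice with the same seed and compare the results. The sketch below uses a small made-up DataFrame (`toy` and its columns are placeholders, not the UFO data) so it runs without a download.

```python
import pandas as pd

# Hypothetical stand-in data; any DataFrame behaves the same way
toy = pd.DataFrame({"city": list("ABCDEFGHIJ"), "value": range(10)})

first = toy.sample(n=3, random_state=2)
second = toy.sample(n=3, random_state=2)

# Same seed, so both draws contain the same rows in the same order
same_rows = first.equals(second)
print(same_rows)
```

Leaving out `random_state` in either call would make the comparison depend on chance.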
In [9]:
# sample a fraction of the rows instead of a fixed count
# frac=0.75 returns 75% of the rows
ufo.sample(frac=0.75, random_state=99)
Out[9]:
City Colors Reported Shape Reported State Time
6250 Sunnyvale NaN OTHER CA 12/16/1989 0:00
8656 Corpus Christi NaN NaN TX 9/13/1995 0:10
2729 Mentor NaN DISK OH 8/8/1974 10:00
7348 Wilson NaN LIGHT WI 6/1/1994 1:00
12637 Lowell NaN CIRCLE MA 11/26/1998 10:00
2094 Victorville NaN LIGHT CA 6/6/1971 21:00
15905 Black Canyon City BLUE CIRCLE AZ 2/16/2000 4:45
6792 Houston NaN CHEVRON TX 6/10/1992 23:00
5063 Ely NaN DIAMOND MN 6/15/1984 19:00
16626 Atlantic Ocean NaN NaN NC 6/17/2000 0:35
17030 Portland RED ORANGE LIGHT OR 7/27/2000 3:35
2391 Larchwood NaN DIAMOND IA 6/6/1973 22:00
12210 Castaic GREEN BLUE CIGAR CA 9/23/1998 22:45
11447 Friday Harbor NaN FIREBALL WA 4/22/1998 21:20
14849 Breckenridge NaN OTHER TX 10/15/1999 20:00
11056 Salida NaN LIGHT CA 12/15/1997 20:00
16877 Redland NaN CIRCLE OR 7/10/2000 20:30
9707 Seattle NaN SPHERE WA 11/7/1996 23:30
9811 Bartlett NaN OTHER TN 12/15/1996 23:00
16516 Chattanooga NaN DISK TN 6/1/2000 17:00
3587 Albuquerque NaN OVAL NM 7/24/1977 7:00
9288 Tampa NaN NaN FL 5/6/1996 22:00
16360 Hwy 12 NaN LIGHT WA 5/1/2000 20:00
4718 Overland Park NaN CIGAR KS 6/15/1982 14:00
9825 Woodville NaN TRIANGLE TX 12/17/1996 20:17
3843 Spokane NaN DISK WA 7/14/1978 22:30
17525 Jordan BLUE CIGAR MN 9/26/2000 13:00
7973 Fort Wayne NaN NaN IN 3/30/1995 0:00
16900 Casa GREEN CIGAR AR 7/12/2000 23:00
14709 Oak Brook GREEN LIGHT IL 10/1/1999 1:00
... ... ... ... ... ...
1353 Opa Locka NaN CHEVRON FL 1/1/1967 13:00
3754 Paterson NaN DISK NJ 6/1/1978 21:00
11495 Moultrie NaN VARIOUS GA 5/4/1998 6:00
10053 New York City NaN OVAL NY 3/14/1997 23:59
12868 Portland ORANGE SPHERE OR 1/8/1999 18:00
11491 Portland NaN TRIANGLE OR 5/2/1998 22:45
715 Aurora NaN LIGHT CO 6/1/1962 20:00
18122 San Diego NaN LIGHT CA 12/15/2000 4:05
358 Akron NaN OTHER OH 6/6/1956 22:00
1552 Wheaton RED CIRCLE MD 1/1/1968 23:00
3453 Long Green NaN DISK MD 4/17/1977 16:30
16514 Albuquerque NaN LIGHT NM 6/1/2000 15:00
14848 Newbern NaN LIGHT TN 10/15/1999 19:20
11414 Catawba NaN TRIANGLE OH 4/15/1998 7:45
11330 Okoboji YELLOW GREEN FIREBALL IA 3/20/1998 18:00
10724 Pinckney NaN LIGHT MI 8/20/1997 23:00
7773 Shawnee RED NaN OK 2/7/1995 21:10
8988 Schnecksville BLUE NaN PA 12/20/1995 23:50
5306 San Carlos NaN LIGHT CA 7/31/1985 22:30
8731 San Francisco NaN NaN CA 9/29/1995 12:50
7254 Fort Lauderdale NaN NaN FL 1/1/1994 3:00
3622 Black River Falls NaN LIGHT WI 8/18/1977 19:30
8241 Ann Arbor NaN NaN MI 6/14/1995 1:35
13133 Fresno NaN CIGAR CA 3/4/1999 7:15
7598 Spring Valley NaN LIGHT CA 10/31/1994 18:00
8965 Lynnwood NaN NaN WA 12/6/1995 22:45
4991 Kent NaN NaN WA 12/5/1983 5:00
2740 Niagara Falls NaN TRIANGLE NY 8/15/1974 20:00
11887 Vancouver NaN TRIANGLE WA 7/25/1998 21:00
9809 Issaquah NaN NaN WA 12/14/1996 20:20

13681 rows × 5 columns
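
To make the effect of `frac` easier to see, here is a minimal sketch on a made-up 8-row DataFrame (`toy` is a placeholder, not the UFO data). It also checks that no row is picked twice, which is the default behavior because `sample` draws without replacement unless `replace=True` is passed.

```python
import pandas as pd

# Hypothetical stand-in data with 8 rows
toy = pd.DataFrame({"value": range(8)})

# 75% of 8 rows is 6 rows
part = toy.sample(frac=0.75, random_state=99)
print(len(part))

# Sampling is without replacement by default, so every index label is unique
print(part.index.is_unique)
```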

For machine learning train-test split

  • You need non-overlapping rows in your train and test sets
In [10]:
train = ufo.sample(frac=0.75, random_state=99)
In [12]:
# two independent samples (frac=0.75, then frac=0.25) could pick the same rows
# instead, take every row whose index label is not in train as the test set (the remaining 25%)
test = ufo.loc[~ufo.index.isin(train.index), :]
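
As a sanity check on the split, the sketch below repeats the idea on a made-up 20-row DataFrame (`toy` is a placeholder) and verifies that the two pieces do not overlap and together cover every row. It uses `drop` as an alternative to the `isin` mask; both approaches assume the index labels are unique.

```python
import pandas as pd

# Hypothetical stand-in data with a default, unique integer index
toy = pd.DataFrame({"value": range(20)})

train = toy.sample(frac=0.75, random_state=99)
# Dropping the sampled labels leaves the remaining rows (assumes a unique index)
test = toy.drop(train.index)

print(len(train), len(test))
# No label is shared between the two sets
print(train.index.intersection(test.index).empty)
# Together the two sets account for every row
print(len(train) + len(test) == len(toy))
```

If the index contained duplicate labels, resetting it first with `reset_index(drop=True)` would keep the label-based split from removing more rows than intended.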