Here's how I used pandas to rearrange and clean up the raw tide data from NOAA.
import pandas as pd
Read the data via pandas' flexible read_table function. This returns a DataFrame:
d = pd.read_table('BatteryParkTideData_Cleaned.txt', sep='\s+', parse_dates=[[1, 2]])
d.head()
Date_Time ?Station Pred6 Backup Acoustc 0 2012-10-29 00:00:00 8518750 1.59 4.68 4.65 1 2012-10-29 00:06:00 8518750 1.50 4.55 4.54 2 2012-10-29 00:12:00 8518750 1.40 4.46 4.44 3 2012-10-29 00:18:00 8518750 1.31 4.36 4.33 4 2012-10-29 00:24:00 8518750 1.22 4.28 4.26
In the data I output I don't want dates, I just want the time since the first measurement, which is a simple, easy to work with floating point number.
So next I add a new column to d
that's the difference between each timestamp and the first timestamp:
d['TimeOffset'] = d['Date_Time'] - d['Date_Time'][0]
d.head()
Date_Time ?Station Pred6 Backup Acoustc TimeOffset 0 2012-10-29 00:00:00 8518750 1.59 4.68 4.65 0:00:00 1 2012-10-29 00:06:00 8518750 1.50 4.55 4.54 0:06:00 2 2012-10-29 00:12:00 8518750 1.40 4.46 4.44 0:12:00 3 2012-10-29 00:18:00 8518750 1.31 4.36 4.33 0:18:00 4 2012-10-29 00:24:00 8518750 1.22 4.28 4.26 0:24:00
Now I have the time since the first measurement, but it's still in a datetime format, specifically a timedelta. To convert that to hours I use the total_seconds methods and divide by 3600. I add that to d
as another new column:
d['TimeOffsetHours'] = pd.Series(to.total_seconds() / 3600. for to in d['TimeOffset'])
Finally I write a CSV using the to_csv method. I'm writing only the time since first measurement, predicted measurement, and measurement columns:
d.to_csv('BatteryParkTideData.csv', na_rep='NA', cols=['TimeOffsetHours', 'Pred6', 'Backup', 'Acoustc'], index=False)