Quick Answer

Python with Pandas turns years of weather station CSV archives into actionable insight in a few dozen lines of code. Import your data, set a datetime index, resample to hourly or daily aggregates, compute derived values like dew point and heat index, flag anomalies from sensor drift, and produce charts that actually communicate something useful. This guide walks through the complete workflow with real code examples.

What This Guide Covers

We will cover CSV ingestion with proper datetime parsing, resampling and aggregation, derived weather calculations, trend analysis with rolling averages, anomaly detection for sensor quality control, and visualization with Matplotlib. If your station data lives in a database rather than CSV, the Pandas workflow is nearly identical once the data is in a DataFrame. The techniques here also complement the Station Data Sanity Checks guide, which covers the conceptual side of data quality.

Prerequisites

  • Python 3.9 or later
  • Station data exported as CSV (most weather software supports this)
  • Familiarity with basic Python syntax

Install the required libraries:

pip install pandas matplotlib scipy

Step 1: Import and Parse Your Data

Most station software exports CSV with a header row and timestamps in the first column. A typical WeeWX archive export looks like this:

dateTime,outTemp,outHumidity,barometer,windSpeed,windDir,rain
1711929600,12.3,78,1013.2,5.4,225,0.0
1711929900,12.2,79,1013.1,4.8,220,0.0

Load it into Pandas. The sample above stores Unix epoch seconds, so convert the timestamp column explicitly rather than relying on parse_dates, which expects date strings and would read the integers as nanosecond timestamps:

import pandas as pd

df = pd.read_csv("weather_archive.csv")
df["dateTime"] = pd.to_datetime(df["dateTime"], unit="s", utc=True)
df.set_index("dateTime", inplace=True)
df.sort_index(inplace=True)
df.index = df.index.tz_convert("Europe/Paris")  # adjust to your timezone

If your export uses ISO-formatted timestamp strings instead, pd.read_csv("weather_archive.csv", parse_dates=["dateTime"]) parses them directly.

Getting timezone handling right at the import stage saves hours of confusion later. I have debugged more "my daily max temperature is wrong" reports caused by timezone misalignment than any other single issue.
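One quick sanity check for that misalignment, sketched here with synthetic data (the timezone and peak hour are illustrative): daily maximum temperatures should land in the afternoon, and a systematic offset usually means the index is still in UTC.

```python
import numpy as np
import pandas as pd

# Synthetic week of hourly data whose temperature peaks at 15:00 local time
idx = pd.date_range("2024-06-01", periods=7 * 24, freq="h", tz="Europe/Paris")
temp = pd.Series(20 + 8 * np.sin((idx.hour - 9) / 24 * 2 * np.pi), index=idx)

# For each day, find the hour at which the maximum occurs
peak_hours = temp.groupby(temp.index.date).idxmax().map(lambda t: t.hour)
print(peak_hours)  # all 15 here; a cluster near midnight would suggest a UTC offset bug
```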

Step 2: Resampling and Aggregation

Weather data typically arrives at five-minute intervals. For analysis, you often want hourly, daily, or monthly summaries:

# Daily summaries
daily = df.resample("D").agg({
    "outTemp": ["min", "max", "mean"],
    "outHumidity": "mean",
    "barometer": "mean",
    "windSpeed": ["max", "mean"],
    "rain": "sum",
})

# Flatten multi-level columns
daily.columns = ["_".join(col).strip("_") for col in daily.columns]

Monthly summaries follow the same pattern with resample("ME") (month-end; pandas before 2.2 uses "M"). For climate-normal comparisons, group by month across years:

monthly_normals = daily.groupby(daily.index.month).mean()
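With the normals in hand, departure-from-normal analysis is a one-liner. A sketch with a synthetic daily series (the sinusoidal temperatures stand in for your station's data):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", "2023-12-31", freq="D")
daily_mean = pd.Series(10 + 8 * np.sin(2 * np.pi * (idx.dayofyear - 100) / 365.25),
                       index=idx)

# Month-of-year normals, then each day's departure from its month's normal
normals = daily_mean.groupby(daily_mean.index.month).mean()
departure = daily_mean - daily_mean.index.month.map(normals)
print(departure.describe())
```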

Step 3: Derived Weather Calculations

Raw sensor readings tell part of the story. Derived values tell the rest.

Dew point from temperature and relative humidity (Magnus formula):

import numpy as np

def dew_point(temp_c, rh):
    a, b = 17.27, 237.7
    alpha = (a * temp_c) / (b + temp_c) + np.log(rh / 100.0)
    return (b * alpha) / (a - alpha)

df["dewpoint"] = dew_point(df["outTemp"], df["outHumidity"])

Heat index (simplified Rothfusz regression, valid above 27 °C and 40% RH):

def heat_index(temp_c, rh):
    t_f = temp_c * 9/5 + 32  # convert to Fahrenheit for the standard formula
    hi = (-42.379 + 2.04901523*t_f + 10.14333127*rh
           - 0.22475541*t_f*rh - 0.00683783*t_f**2
           - 0.05481717*rh**2 + 0.00122874*t_f**2*rh
           + 0.00085282*t_f*rh**2 - 0.00000199*t_f**2*rh**2)
    return (hi - 32) * 5/9  # back to Celsius

mask = (df["outTemp"] > 27) & (df["outHumidity"] > 40)
df.loc[mask, "heatIndex"] = heat_index(df.loc[mask, "outTemp"], df.loc[mask, "outHumidity"])

Wind chill (Environment Canada formula, valid below 10 °C and wind above 4.8 km/h):

def wind_chill(temp_c, wind_kmh):
    return (13.12 + 0.6215*temp_c
            - 11.37*wind_kmh**0.16
            + 0.3965*temp_c*wind_kmh**0.16)

mask_wc = (df["outTemp"] < 10) & (df["windSpeed"] * 3.6 > 4.8)
df.loc[mask_wc, "windChill"] = wind_chill(
    df.loc[mask_wc, "outTemp"],
    df.loc[mask_wc, "windSpeed"] * 3.6  # m/s to km/h
)

Step 4: Trend Analysis

Rolling averages smooth out noise and reveal trends:

df["temp_7d"] = df["outTemp"].rolling("7D").mean()
df["pressure_24h"] = df["barometer"].rolling("24h").mean()

For a simple seasonal decomposition, separating the long-term trend from the seasonal cycle and residual noise, a centred one-year rolling mean serves as the trend estimate:

daily_temp = df["outTemp"].resample("D").mean().dropna()
trend = daily_temp.rolling(365, center=True, min_periods=180).mean()
seasonal = daily_temp - trend

A rising trend line in your annual temperature data might indicate genuine climate shift, or it might indicate your temperature sensor has drifted. Cross-reference with the Station Data Sanity Checks methodology before drawing conclusions.

Step 5: Anomaly Detection

Sensor drift and hardware faults produce outliers. Flag them with z-score analysis:

from scipy import stats

temps = df["outTemp"].dropna()
z_scores = stats.zscore(temps)
anomalies = temps[abs(z_scores) > 3]
print(f"Found {len(anomalies)} anomalous temperature readings")

A z-score above 3 means the reading is more than three standard deviations from the mean: either a sensor glitch or a genuinely extreme event worth investigating.
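One caveat: a z-score computed over the whole record penalises genuine seasonal extremes, since a cold winter night can sit three standard deviations from the annual mean with nothing wrong. A rolling variant compares each reading against its recent context instead (a sketch; the 30-day window and threshold of 5 are assumptions to tune for your station):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range("2024-01-01", periods=90 * 24, freq="h")
temp = pd.Series(15 + 10 * np.sin(2 * np.pi * np.arange(len(idx)) / (365.25 * 24))
                 + rng.normal(0, 1, len(idx)), index=idx)
temp.iloc[500] = 60.0  # inject an obvious sensor glitch

# z-score against a rolling 30-day mean and std rather than the whole record
roll = temp.rolling("30D", min_periods=48)
z = (temp - roll.mean()) / roll.std()
print(temp[z.abs() > 5])  # flags the injected glitch
```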

For humidity sensors, which degrade predictably over time (capacitive sensors drift upward), plot a rolling median and look for a steady upward trend that does not correlate with seasonal patterns.
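That check can be sketched with synthetic data standing in for a drifting sensor (the 0.5-points-per-month drift rate and 90-day window are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=3 * 365, freq="D")
days = np.arange(len(idx))

# Seasonal humidity cycle, slow upward sensor drift, measurement noise
rh = pd.Series(70 + 15 * np.sin(2 * np.pi * days / 365.25)
               + 0.5 * days / 30 + rng.normal(0, 5, len(idx)), index=idx)

# Smooth with a rolling median, then difference year-over-year:
# the seasonal cycle cancels and the drift remains
med = rh.rolling("90D", min_periods=30).median()
yoy = (med - med.shift(365)).dropna()
print(yoy.mean())  # about 6 points per year: the drift, not the seasons
```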

Step 6: Visualization

Matplotlib produces publication-ready charts:

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Temperature
axes[0].plot(daily.index, daily["outTemp_max"], color="#e74c3c", linewidth=0.8, label="Max")
axes[0].plot(daily.index, daily["outTemp_min"], color="#3498db", linewidth=0.8, label="Min")
axes[0].fill_between(daily.index, daily["outTemp_min"], daily["outTemp_max"], alpha=0.15)
axes[0].set_ylabel("Temperature (°C)")
axes[0].legend()

# Pressure
axes[1].plot(daily.index, daily["barometer_mean"], color="#2c3e50", linewidth=0.8)
axes[1].set_ylabel("Pressure (hPa)")

# Rainfall
axes[2].bar(daily.index, daily["rain_sum"], color="#27ae60", width=1)
axes[2].set_ylabel("Rainfall (mm)")

axes[2].xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
fig.autofmt_xdate()
plt.tight_layout()
plt.savefig("annual_overview.png", dpi=150)

For wind roses, the windrose package (install it with pip install windrose) provides a clean polar plot:

from windrose import WindroseAxes
import matplotlib.pyplot as plt

ax = WindroseAxes.from_ax()
ax.bar(df["windDir"], df["windSpeed"], normed=True, opening=0.8)
ax.set_legend()
plt.savefig("wind_rose.png", dpi=150)

Common Mistakes

  1. Timezone confusion. Mixing UTC and local time in the same analysis produces shifted daily aggregates. Pick one timezone at import time and stick with it.
  2. Interpolating large gaps. If your station was offline for a week, do not interpolate across the gap; that fabricates data. Use resample().mean(), which naturally produces NaN for missing periods.
  3. Confusing resampling methods. resample("D").mean() gives the daily average. resample("D").last() gives the last observation of the day. For rainfall, you almost always want .sum().
  4. Not handling the first/last partial day. If your data starts at 14:00 on day one, the first daily aggregate only covers half a day. Trim or flag partial days.
  5. Drawing climate conclusions from one station. Your backyard station measures your microclimate. Trends that diverge from regional averages may reflect local factors (new building, tree removal, sensor degradation) rather than broader climate signals.
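Mistakes 2 and 4 have the same cure: count observations per day and mask days that fall short. A sketch assuming 5-minute data, so 288 samples per complete day (the 90 % completeness threshold is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Synthetic series that starts mid-afternoon on the first day
idx = pd.date_range("2024-03-01 14:00", "2024-03-05 23:55", freq="5min")
temp = pd.Series(np.random.default_rng(1).normal(12, 3, len(idx)), index=idx)

daily_mean = temp.resample("D").mean()
counts = temp.resample("D").count()

# Mask days with fewer than 90% of the expected 288 samples
daily_mean_clean = daily_mean.where(counts >= 0.9 * 288)
print(daily_mean_clean)  # first day is NaN: only ten hours of data
```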

FAQ

Can I use Jupyter notebooks for this? Absolutely. Jupyter is excellent for exploratory analysis. The code examples here work identically in notebooks, with the added benefit of inline chart rendering.

What if my data is in a database, not CSV? Use pd.read_sql() with a SQLAlchemy connection string. WeeWX's SQLite database, for example: pd.read_sql("SELECT * FROM archive", "sqlite:///weewx.sdb").
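Expanded into a runnable sketch, with an in-memory SQLite database standing in for a real weewx.sdb (the table and column names follow the WeeWX archive schema; the rows are made up):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE archive (dateTime INTEGER, outTemp REAL)")
con.executemany("INSERT INTO archive VALUES (?, ?)",
                [(1711929600, 12.3), (1711929900, 12.2)])

# pandas accepts a DBAPI connection directly for SQLite, no SQLAlchemy needed
df = pd.read_sql("SELECT * FROM archive", con)
df["dateTime"] = pd.to_datetime(df["dateTime"], unit="s", utc=True)
df = df.set_index("dateTime")
print(df)
```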

How much data do I need for meaningful trend analysis? For seasonal patterns, one full year minimum. For multi-year trend detection, three to five years gives statistically useful baselines. For daily analysis, a few weeks is sufficient.

Should I denormalise my data before analysis? If your data is in multiple tables (main archive + daily summaries), join them first or work from the raw archive. Pandas handles millions of rows efficiently on modern hardware.