Previous
29 June 2018 on posts
My objective here was to see how well I could predict historical crop prices in Ethiopia using solely weather and climate information. I chose to focus my analysis in Amhara, located in the Ethiopian highlands.
The main questions I was trying to answer here were:
How well can prices be predicted using solely climate and weather data?
How does this compare to models that include a lag term (i.e. information about previous prices)?
If models using solely weather and climate are on par with those using a lag term, this gives evidence that there is a relationship between climate and prices that is independent of previous price conditions.
I sourced monthly crop prices for Maize and Wheat from the World Food Program’s online crop prices database.. This contains prices sourced from a variety of different markets within Amhara, which I simply averaged in each month. The final dataset contains monthly prices (with some missing values) from 2005 to 2015.
I then de-trended the prices (using a linear trend) and set them to a 2015 baseline value. The following figure shows the resulting prices for Wheat.
I began with historical daily gridded estimates (0.05 degree resolution) of rainfall (from CHIRPS) and maximum and minimum temperatures (from GDAS). From these, I developed monthly estimates of growing degree days (GDDs) (using base temperatures of 0 and 10 degrees Celsius for Wheat and Maize respectively) and extreme degree days (EDDs) (using a temperature of 29 degrees Celsius). The following figure shows these data over Ethiopia for April 2000. I then “regionalized” these rasters, simply taking the average value over Amhara for each month.
Using these data I created a dataset for each crop with the following covariates:
Column | Description |
---|---|
year | Year |
month | Month |
price | Price in birr / 100kg |
EDDs | Regionalized average EDDs for this month |
GDDs | Regionalized average GDDs for this month |
avg_sum_pre | Regionalized average total precipitation for this month |
rain_lag | Regionalized average total precipitation from the previous year |
EDD_lag | Regionalized average total EDDs from the previous year |
GDD_lag | Regionalized average total GDDs from the previous year |
rain_YTD | Total year-to-date sum of rainfall |
EDD_YTD | Total year-to-date sum of EDDs |
GDD_YTD | Total year-to-date sum of GDDs |
The following figure shows these covariates over time for the two crops.
The first model I fit was a simple linear regression using all covariates. I trained the model to the entire dataset. Note that this isn’t ideal best-practice, as I am evaluating my predictions based on data that the model has already seen, but it’s a first attempt. The following figure shows the predicted vs. actual values using this linear model for Maize and then Wheat.
This is surprisingly good! I’m not able to capture the extreme values, but the model does a reasonable job of explaining the variance in the prices and capturing the peaks and troughs of the time series.
But am I considerably overestimating my predictive accuracy by using the entire dataset?
To get around this, I created a “rolling model”, in which I predicted each month’s price using only the data that had previously been observed. For example, to predict the dataset’s second month’s price I trained a model only on the information from the first month. Each month I added an additional month’s worth of data to the dataset.
How do we do?
Still very well (apart from a spurious month in 2011, which I believe is a sign of the model’s sensitivity to the abnormally low rainfall and high temperatures in the previous year (refer to the figure showing the temporal evolution of the covariates) – but this is not entirely clear). Again, we’re able to capture the general trends in the data. The model does pretty badly at predicting Wheat prices in 2013 (i.e. it predicts a price increase, rather than a price decrease), and again the extreme values are not always predicted, but I’m still pretty impressed.
So, to give a preliminary answer to question #1 above, prices can be predicted reasonably accurately using solely climate information.
Say we’re trying to predict what the price will be next month (in July). We know what the price was now (in June), so why not utilize this information for our prediction? To assess this, I simply added an extra covariate representing the price from the previous month. The following figures show the predictions for Maize and Wheat using a “rolling” linear model with a lag term.
This has improved the model fit considerably! Given we know the previous price, we’re now able to capture the pits and peaks in the dataset with a high amount of accuracy.
This is not unexpected, and gives us a preliminary answer to question #2 above. It seems that the inclusion of the lag term does help improve the predictions considerably. But now let’s assess this quantitatively…
An alternative way of evaluating model performance is to conduct a holdout analysis. Here, I trained a linear model on a random 80% of the data, and then tested it on the remaining 20%. I repeated this 50 times, giving a distribution of predictive accuracy.
I also added a backwards stepwise variable selection (using sklearn.feature_selection.RFE
) to see how the predictive accuracy depends on the number of covariates that are included.
The following figures show: (1) predicted vs actual values (for all 50 holdouts); (2) boxplots of the mean absolute prediction error (MAE) as a function of the number of covariates included; and (3) a boxplot of the MAE of the persistence model (i.e. if the prediction is set equal to last month’s value).
Results are shown for models without and with a lag term (only the results for Maize are shown for brevity, but the results for Wheat are similar).
No lag term:
With lag term:
Some observations:
Based on this final point, I can conclude that I may have some pretty graphs so far, but the models are pretty useless at predicting month-ahead crop yields.
Based on this, I conclude that:
Which variables are the most useful for predicting prices? The following two figures show (for Maize), for each number of covariates in the stepwise selection algorithm, the frequency with which each covariate was selected.
With a lag term (prev_price
):
Without a lag term:
We see that, as expected, when the lag term is included, it is almost always retained as a variable. Following this, it generally seems that the temperature (EDD and GDD) variables are selected more often than the rainfall variables, suggesting that prices might be more closely related to temperature than rainfall.
We can also look at the value of the regression parameters to see the direction with which each coefficient influences prices.
For this, all data was normalized (using sklearn.preprocessing.MinMaxScaler
).
Some observations:
Adding a lag term is slightly in the spirit of an autoregressive (AR) model, which are specifically intended for time series analysis. I’ve never worked with these before, but decided to have a preliminary stab at building some.
The “null” model here is what is called a persistence model, i.e. just predicting the value from the previous period. The AR model is a regression model that predicts/forecasts the price given the history of prices. If an AR model can’t do better than the persistence model, then there’s not much point. A generalization of the AR model is vector autoregression (VAR), which tries to capture dependencies between multiple time series (in this case, price, temperature, and precipitation variables). To be honest, I’m not sure if this is the most valid method for what I’m trying to achieve, but I tried it anyway.
All models were built in python using the statsmodels.tsa.ar_model.AR
and statsmodels.tsa.api.VAR
packages.
Looking at the results for Maize and Wheat (below) something seems to have gone awry… The AR model does okay, and the VAR model terribly. Neither are better than the persistence model.
Since I was able to get good predictions using the regular linear models I’m going to leave this as an open book for now.
While a very simple and rough analysis, I think that this shows potential. Since market prices can directly influence the food security and livelihoods of smallholders, being able to predict these prices ahead of time could be of great use. Some next steps could be: