Globally, wildfires are becoming more frequent and destructive, generating large amounts of smoke that can travel thousands of miles. Improving forecasts of wildfire-driven air pollution is therefore essential, as is providing citizens with more frequent, accurate, and interpretable updates on localized air pollution events. This research proposes a multi-head attention-based deep learning architecture, the SpatioTemporal (ST)-Transformer, to improve spatiotemporal predictions of PM2.5 concentrations in wildfire-prone areas. The ST-Transformer employs a sparse attention mechanism that concentrates on the most useful contextual information across the spatial, temporal, and variable-wise dimensions. The model takes critical driving factors of PM2.5 concentrations as predictors, including wildfire perimeter and intensity, meteorological factors, road traffic, PM2.5 itself, and temporal indicators from the past 24 h. It is trained to perform time series forecasting of PM2.5 concentrations at the EPA's air quality stations in the greater Los Angeles area. Its predictions were compared with those of existing time series forecasting methods and showed better performance, especially in capturing abrupt changes or spikes in PM2.5 concentrations during wildfires. The attention matrix learned by the model allows interpretation of the complex spatial, temporal, and variable-wise dependencies and indicates that the model can differentiate between wildfire and non-wildfire conditions. The ST-Transformer's predictive accuracy and interpretability can help monitor and predict the impacts of wildfire smoke effectively, and the approach may be applicable to other complex spatiotemporal prediction problems.
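To make the idea of sparse attention over spatial, temporal, and variable-wise dimensions concrete, the following is a minimal NumPy sketch and not the ST-Transformer implementation. It assumes a top-k sparsity rule (each query keeps only its `top_k` highest-scoring keys), which is one common form of sparse attention and may differ from the mechanism used in the model. The function name `topk_sparse_attention`, the random projection matrices, and the sizes (3 stations, 5 variables, embedding width 8) are hypothetical; only the 24-hour input window comes from the text. The inputs are random numbers, not air quality data.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Scaled dot-product attention keeping only the top_k scores per query.

    q, k, v: arrays of shape (n_tokens, d). Returns (output, weights).
    Assumption: top-k masking stands in for the paper's sparse attention.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (n_tokens, n_tokens)
    # Per-row threshold: the top_k-th largest score of each query.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)   # drop weaker keys
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                            # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical sizes: 3 stations, 24 hourly steps, 5 input variables.
S, T, V, d = 3, 24, 5, 8
rng = np.random.default_rng(0)

# One embedding per (station, hour, variable) cell, flattened into tokens.
tokens = rng.normal(size=(S, T, V, d)).reshape(-1, d)   # (360, 8)

# Random matrices stand in for learned query/key/value projections.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = topk_sparse_attention(tokens @ Wq, tokens @ Wk, tokens @ Wv, top_k=16)

# Aggregate token-level attention into a station-to-station view:
# average over query hours/variables, sum over key hours/variables.
spatial = w.reshape(S, T, V, S, T, V).mean(axis=(1, 2)).sum(axis=(2, 3))
print(spatial.shape)  # (3, 3); each row sums to 1
```

Flattening the three axes into one token sequence lets a single attention matrix carry spatial, temporal, and variable-wise dependencies at once; summing it over different axes, as in the last step, is one simple way such a matrix could be read for interpretation.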