Why data samples are changing
Hi all,
I have been estimating multiple models with different dependent variables (they all have different lengths) within one PRG file. I have found that when I run them in batch mode, the samples used for the individual estimations are no longer their full lengths. For example, if I run one particular model on its own, the sample used is 250 observations, but when I run all the models together, the sample for that same model is only 230. Could anyone please tell me what is happening here (is the sample size being influenced by the other dependent variables' sample sizes?) and how I can deal with this problem while still using batch processing?
Thank you very much in advance!
anozman
Re: Why data samples are changing
It's difficult to offer any specific suggestions without seeing exactly what you are doing.
Are you reading all of the data in at one time at the beginning of the program, or are you basically chaining together a set of complete, independent programs, each with its own set of DATA, CALENDAR, and related commands?
If it's the latter--that is, if you are executing a set of essentially independent programs one after the other--you are probably running into a situation where the default range defined by the first "program" is not long enough to include all the data used in the later sections. In this case, you could do any of the following (a rough sketch of the first two options follows the list):
* set the initial CALENDAR and ALLOCATE commands to include the full range used by any section of the program;
* use END(RESET) instructions between sections, which clears the data as well as the calendar and default range settings out of memory, giving the next section of the program a clean slate; or
* use SMPL instructions or "start" and "end" parameters as needed to make sure each command is run over the appropriate range.
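For example, the first two options would look roughly like this (a minimal sketch; the frequencies, dates, file names, and series names are just placeholders):

* Option 1: make the initial CALENDAR/ALLOCATE wide enough for every section
calendar(q) 1950:1
allocate 2015:4
open data full.xls
data(format=xls,org=columns) / y1 y2 x1 x2

* Option 2: wipe the slate clean between otherwise independent sections
end(reset)
calendar(m) 1990:1
allocate 2012:12
open data monthly.xls
data(format=xls,org=columns) / y3 x3 x4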
If these suggestions don't help, we would probably need to see the full program to be of any further help.
Regards,
Tom Maycock
Estima
Re: Why data samples are changing
Thanks Tom for your suggestions.
My procedure follows the first scenario you mentioned: I use RATS to read all of the series from one big Excel spreadsheet and do my modelling within that one dataset, since most of the input variables are the same across models. I read all the data in the normal way and did not specify a sample range for each model. I was assuming that each model would pick up the maximum range (the range of the whole dataset), and that if a series is shorter than this range, the estimation would simply start from its first available data point.
I'm just not sure whether that is true, or whether I have misunderstood the way RATS operates.
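To be concrete, the program looks roughly like this (heavily simplified, with placeholder dates and series names):

calendar(q) 1950:1
allocate 2012:4
open data alldata.xls
data(format=xls,org=columns) / y1 y2 y3 x1 x2 x3

* no SMPL or explicit start/end anywhere; each model is just, e.g.,
linreg y1
# constant x1 x2
linreg y2
# constant x1 x3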
Thanks for your help!
anozman
Re: Why data samples are changing
By default, RATS will indeed try to use as many observations as it can for a given estimation. Observations for which the dependent variable or any of the regressors contain a missing value are dropped. Additional observations may be lost if the model includes lags of variables that contain missing values. The situation gets more complex for non-linear estimations, where you may lose observations due to issues like taking the square root of a negative number, division by zero, and so on.
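As a small illustration (hypothetical series names): if y runs over the full 1950:1 to 2012:4 range, but x only starts in 1955:1 and has a single NA at 1990:3, then a regression such as

linreg y
# constant x{1}

should start in 1955:2 (the first period where the lagged x exists) and should skip 1990:4 (where x{1} is missing), so the usable sample ends up smaller than the full range of y.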
I'm not really sure what you meant by running a model "independently" versus running "all the models together" in your first post, but if you are seeing differences in the sample between two estimations that you would expect to be identical, start with the information in the estimation output. Are the reported starting and ending periods the same, but with different numbers of skipped/missing observations? Or do the start/end periods themselves differ? Knowing which of the two you are seeing should help you track down the source of the difference.
You might try using PRINT to display the residuals, the dependent variable, and the regressors for both instances (probably using explicit "start" and "end" parameters to ensure you view the data over the maximum possible range). See where the NAs (missing values) are showing up, and work backwards from the regression to trace the source of any differences.
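For example, something along these lines (a sketch; the names and dates are placeholders):

linreg y1 / resids
# constant x1 x2{1}
* print over the widest possible range so you can see where the NAs fall
print 1950:1 2012:4 y1 resids x1 x2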
If you can't figure out the source of the problem, we'd need to see at least the two program files in question, and it would help to have the data as well.
Regards,
Tom