21st Century Researchers: Stata: Dealing with missing values

Dealing with missing values is probably the first thing you do after labeling your variables. Unfortunately, this is not an easy job and many users use inappropriate means to accomplish it. Let’s start at the very beginning.

Stata 7 and earlier versions have only one missing value for numeric variables, “.” (without the quotes). To exclude missing values in those versions, it is correct to add the following condition:

if variable !=.

Sample code for a set of OLS regressions would look like this:

regress a b if c!=. & d!=.    
regress a b c if d!=.     
regress a b c d

This would be 100% correct if you used an old Stata dataset; however, if your dataset had different missing values, this code would be problematic. Stata 8 and later versions allow you to define different types of missing values, each of which begins with a “.” (without the quotes), such as .a and .b. These extended missing values are not equal to “.”, so a condition like c!=. keeps observations in which c is .a, while regress still drops those observations whenever c is one of the model variables. Therefore, if you have these missing values in your dataset and you use old code like that above, you would probably obtain inconsistent numbers of observations across models.
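
To see the problem concretely, here is a minimal sketch. It assumes the auto dataset that ships with Stata, and recoding rep78 to .a is only my illustration, not part of the original example:

* Load an example dataset (assumption: auto.dta, shipped with Stata)
sysuse auto, clear
* Recode the system missing values of rep78 to the extended missing value .a
replace rep78 = .a if rep78 == .
* Old-style check: .a is not equal to ., so the .a rows are counted as valid
count if rep78 != .
* Version-proof check: mi() is true for ., .a, ..., .z
count if !mi(rep78)

The first count should include every observation, while the second should exclude the rows you just recoded.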

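The correct way to handle this is to use if c <. or if !mi(c). The first works because Stata treats every missing value as larger than any number, with “.” being the smallest missing value; the second uses mi(), a synonym for missing(), which is true for any type of missing value. The revised code would be similar to: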

regress a b if !mi(c) & !mi(d)      
regress a b c if !mi(d)       
regress a b c d
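
If you want to convince yourself that the two conditions select the same observations, a quick check along these lines should do it. Again this is only a sketch on the auto dataset, with rep78 standing in for c:

* Load an example dataset (assumption: auto.dta, shipped with Stata)
sysuse auto, clear
* Mix in an extended missing value so rep78 holds ., .a and ordinary numbers
replace rep78 = .a in 1/3
* Both counts should agree
count if rep78 < .
count if !mi(rep78)
* assert stops with an error if the two conditions ever disagree
assert (rep78 < .) == !mi(rep78)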

What if you have 20 variables in your regression? Such if statements often result in very long lines, thereby reducing the readability of your code. There are two easy ways to overcome this: 1) Create a dummy called “touse” that equals 1 when an observation has valid values for all variables and 0 when at least one of them is missing.

gen touse =!mi(y, a, b, c, d)    
regress y a b if touse     
regress y a b c if touse     
regress y a b c d if touse
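
A quick way to confirm that touse really holds the sample fixed is to compare e(N) across models. The sketch below assumes the auto dataset, and the variable choice and the scalar name n_small are mine:

* Load an example dataset (assumption: auto.dta, shipped with Stata)
sysuse auto, clear
* 1 if all model variables are nonmissing, 0 otherwise
gen byte touse = !mi(price, mpg, weight, rep78)
* Smallest model: store its number of observations
regress price mpg if touse
scalar n_small = e(N)
* Largest model: the number of observations should be unchanged
regress price mpg weight rep78 if touse
assert e(N) == n_small

Without if touse, the first model would use every observation with valid price and mpg, and the assert would be expected to fail.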

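2) If you don’t like this approach, you can also deal with missing values by using nestreg, which adds the blocks of variables one at a time and fits all of the nested models on the same estimation sample: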

nestreg: reg y (a b) (c) (d)


Friday, September 4, 2009
