
Monday, July 27, 2009

Calculating MD5 of binaries without debug symbols

If you compile a binary with gcc with debugging information enabled (-g), the MD5 of the resulting binary will change depending on the name of the directory you compile it in, because gcc records the compilation directory in the DWARF debug info (the DW_AT_comp_dir attribute). This means that if two developers compile the same source code with the same options on the same machine, but each in their own home directory, the MD5s of the resulting binaries may differ.

However, as soon as you strip the binaries, their MD5s will be the same.

Which leads me to this little tool I whipped up to compare two binaries without stripping them:

#!/bin/sh
# Display the MD5 of a file, ignoring any debugging symbols in
# binaries.

# The strip(1)/objcopy(1) commands for removing debugging
# symbols do not support writing to stdout so we need to
# allocate a temp file to write the stripped binary to.
tempfoo=`basename $0`
TMPFILE=`mktemp -q /tmp/${tempfoo}.XXXXXX`
if [ $? -ne 0 ]; then
    echo "$0: Can't create temp file, exiting..."
    exit 1
fi

while [ "$1" != "" ]; do
    # The following line is a cheezy way to accurately
    # reproduce the same error messages as md5(1) when a
    # specified file is unreadable.
    md5 "$1" > /dev/null
    if [ $? -eq 0 ]; then
        # Try to strip symbols from the file on the
        # assumption it is a binary and, if successful,
        # compute the md5 of the stripped file. Note that
        # objcopy -g is equivalent to the strip(1) command
        # here. If objcopy fails to parse the file (i.e.
        # because it is not in ELF format), simply compute
        # the md5 of the whole file since there are no
        # debugging symbols to strip.
        m=`(objcopy -g "$1" "$TMPFILE" >/dev/null 2>&1 && \
            md5 -q "$TMPFILE") || md5 -q "$1"`

        # Output the result in a md5(1)-compatible format.
        echo "MD5($1) = $m"
    fi
    shift
done

rm "$TMPFILE"

Thursday, December 20, 2007

Less code

I was just reading Steve Yegge's rant against code size and realized that he managed to put into words exactly the feelings that have been drawing me to python in recent years. In particular, I managed to mostly skip the Java step in my journey from Pascal, through assembler, up to C, and then the leap to high-level languages including perl and, more recently, python. I don't really know why, but Java never felt "right" -- for anything. To this day, I can't think of many applications for which I would say Java was the best tool for the job. On that point, I think Steve hit the nail on the head when he writes:
Java is like a variant of the game of Tetris in which none of the pieces can fill gaps created by the other pieces, so all you can do is pile them up endlessly.

Hallelujah, brother.

Anyway, I strongly agree with Steve's general points about the merits of small code bases, but I won't go so far as to say that smaller is necessarily always better. Python hits a sweet spot for me (at least for now) between compactness and comprehensiveness. Certainly a good number of problems could be expressed more succinctly in a functional language such as Erlang or Haskell, but you lose readability. In fact, as elegantly as many problems can be expressed in a functional language, the solutions quickly start to look like line noise once the problems exceed textbook examples.

Programming language preferences aside, what I agree with most in Steve's blog post is not so much that more succinct languages are better, but that less code is better. His post is written so as to suggest that Java itself is the problem -- which may well be true -- but he doesn't clarify whether he thinks it is Java the language, or Java the set of libraries.

Python, for example, combines a great set of standard libraries with a language syntax that makes it easy to use those libraries. All the lines of code hidden away in libraries are effectively "free" code: you don't have to manage their complexity. Give me a language that makes leveraging as many libraries as possible painless, and I can glue them together to make great programs with low apparent complexity. In reality, the total line count might be astronomical, but I don't have to manage the complexity of all of it -- just the part I wrote -- so it doesn't matter. Python does a great job here, whereas Java (and C++'s STL) largely get it wrong.

In particular, I would argue that, in addition to python's straightforward syntax, the fact that so many of python's libraries are written in C is a large factor in why they are so easy to use. There may be a huge amount of complexity, and a huge number of lines of code, in the C implementation of a library. However, the API boundary between python and C acts as a sort of line of demarcation -- no complexity inherent in the implementation of the library can leak out into the python API without the programmer explicitly allowing it. That is, the complexity of libraries written in C and callable from python is necessarily encapsulated.
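
To make that line of demarcation concrete, here is a minimal sketch of a CPython extension module (the 2.x C API, contemporary with this post; the example module and its greet function are made up for illustration). However much complexity lives inside the C file, python only ever sees what the method table explicitly exports:

#include <Python.h>

/* However many helper functions, data structures, and lines of C
 * live in this file, none of them are visible from python unless
 * they are explicitly listed in the method table below. */
static PyObject *
example_greet(PyObject *self, PyObject *args)
{
    const char *name;

    if (!PyArg_ParseTuple(args, "s", &name))
        return NULL;
    return PyString_FromFormat("Hello, %s!", name);
}

static PyMethodDef example_methods[] = {
    {"greet", example_greet, METH_VARARGS, "Greet someone by name."},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initexample(void)
{
    Py_InitModule("example", example_methods);
}

From python, all a caller can ever touch is example.greet() -- everything behind it stays encapsulated.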

As a personal anecdote, in one project I work on, we use ctypes to make foreign function calls to a number of Windows APIs. One thing that really bothers me about this technique is that I find myself re-implementing in ctypes a number of data structures that are already defined in C header files; if I make a mistake, I introduce a bug. Ironically, since I could have leveraged more existing code, there would often have been fewer lines of code and less complexity had I just used C to call the APIs. Of course, other parts of the program would have become hugely unwieldy, but the point of this anecdote is that libraries (more specifically, being able to leverage known-good code) can be much more effective at reducing code than the implementation language.

So long as the implementation language isn't Java. Java just sucks. :)

Friday, June 1, 2007

C: Converting struct tm times with timezone to time_t

Both the BSD and GNU standard C libraries have extended the struct tm to include a tm_gmtoff member that holds the offset from UTC of the time represented by the structure. Which might lead you to believe that mktime(3) would honor the time offset indicated by tm_gmtoff when converting to a time_t representation.

Nope.

mktime(3) always assumes the "current timezone" defined by the executing environment. ISO C and POSIX define the semantics of mktime(3), and neither defines a tm_gmtoff member for the tm structure, so it is not surprising that mktime(3) does not honor it.
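
You can see this for yourself with a minimal sketch (the mktime_in() helper is purely for illustration): feed the very same broken-down time to mktime(3) under two different TZ settings and you get two different time_t values.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Run mktime(3) on a copy of the given broken-down time with TZ
 * set to the named timezone. */
static time_t
mktime_in(const char *tz, struct tm tm)
{
    setenv("TZ", tz, 1);
    tzset();
    return mktime(&tm);
}

int
main(void)
{
    struct tm tm = {0};

    tm.tm_year = 107;  /* years since 1900, i.e. 2007 */
    tm.tm_mon = 5;     /* months since January, i.e. June */
    tm.tm_mday = 1;
    tm.tm_hour = 12;
    tm.tm_isdst = -1;  /* let mktime(3) determine DST */

    printf("UTC:     %ld\n", (long)mktime_in("UTC", tm));
    printf("EST5EDT: %ld\n", (long)mktime_in("EST5EDT", tm));
    return 0;
}

The two results differ by four hours -- the offset of US Eastern time (EDT in June) from UTC.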

So, let's say you have a struct tm, complete with a correctly-populated tm_gmtoff field: how do you convert it to a time_t representation?

Many modern C libraries (including glibc and FreeBSD's libc) include a timegm(3) function. No, this function doesn't honor tm_gmtoff either. Instead, timegm(3) converts the struct tm to a time_t just like mktime(3) does, except that it ignores the timezone of the executing environment and always assumes GMT as the timezone.

However, if your libc implements both tm_gmtoff and timegm(3), you are in luck. You just need to use timegm(3) to get the time_t representing the time in GMT and then subtract the offset stored in tm_gmtoff. The tricky part is that calling timegm(3) will modify the struct tm, clearing the tm_gmtoff field to zero (at least it does on the FreeBSD 4.10 machine I'm testing with). Combined with C's lack of a guaranteed evaluation order for operands, this means you need to save tm_gmtoff somewhere so it doesn't get clobbered before you can use it. Something like:

time_t
tm2time(const struct tm *src)
{
    struct tm tmp;

    /* Work on a copy: timegm(3) may clobber the fields of the
     * struct tm it is passed, including tm_gmtoff. */
    tmp = *src;
    return timegm(&tmp) - src->tm_gmtoff;
}

Note that I copy the entire struct tm into a temporary variable. This prevents timegm(3) from clobbering the tm_gmtoff so that we can use it to accurately compute the seconds since the epoch. The copy in tmp gets clobbered, but the copy in src is left intact. Also, by copying the src struct tm into a temporary, we never modify the argument passed in -- which is just a generally friendly thing to do.
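
As a usage sketch, here is how the pieces fit together. I'm assuming a strptime(3) whose %z conversion fills in tm_gmtoff -- that behavior is itself a non-standard extension (glibc and the modern BSDs have it), so verify it on your platform:

#define _GNU_SOURCE  /* glibc: expose strptime(3), timegm(3), tm_gmtoff */
#include <stdio.h>
#include <time.h>

time_t
tm2time(const struct tm *src)
{
    struct tm tmp;

    tmp = *src;
    return timegm(&tmp) - src->tm_gmtoff;
}

int
main(void)
{
    struct tm tm = {0};

    /* Noon at UTC-4 is 16:00 GMT; the -0400 ends up in tm_gmtoff. */
    if (strptime("2007-06-01 12:00:00 -0400",
        "%Y-%m-%d %H:%M:%S %z", &tm) == NULL) {
        fprintf(stderr, "parse failed\n");
        return 1;
    }
    printf("%ld\n", (long)tm2time(&tm));
    return 0;
}

This prints 1180713600, the time_t for 2007-06-01 16:00:00 GMT.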

All that said, the truly pedantic will point out that the ISO C spec does not dictate that time_t must represent seconds (POSIX, for its part, does define it as seconds since the Epoch). However, since we are already depending on two non-standard extensions, it seems reasonable to also depend on the fact that systems implementing timegm(3) and the tm_gmtoff field all implement time_t values in seconds.