Thanks, David.  I thought that was the issue -- that apparent size would 
not include overhead, so I was not able to understand why I was getting 
apparent size that was smaller than ondisk size.  After they moved my data 
to a different array, that difference reversed direction.  This was 
explained to me last night:

"on the old project spaces, zfs did some compression on the data so the 
apparent-size was larger than the ondisk size."

So, compression is also an issue, and I wouldn't have thought of that.

Now that there is no compression, I see that ondisk usage is 20GB more 
than apparent size:

$ \du -sB GB --apparent-size miller
146GB   miller

$ \du -sB GB miller
166GB   miller

$ find miller | wc -l
9908

So there are about 2 million bytes of overhead per file, which seems like 
a lot, to me.  I would think that implies disk blocks of multiple 
megabytes, which seems unlikely.  There must be more that I don't 
understand.
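
As a rough check on that per-file figure, here is a minimal sketch using plain shell arithmetic. The function name overhead_per_entry is made up for illustration; the numbers are copied from the du and find output above. Two caveats: find counts directories as well as files, and du with -B GB appears to round each total up, so the 20GB difference is only good to within about 1GB.

```shell
# Average on-disk overhead per entry, in bytes.
# Arguments: on-disk size in GB, apparent size in GB, number of entries.
# (Hypothetical helper; GB here means 1000*1000*1000 bytes, as with -B GB.)
overhead_per_entry() {
    echo $(( ($1 - $2) * 1000000000 / $3 ))
}

# Figures taken from the du and find output above.
overhead_per_entry 166 146 9908
```

The result comes out at roughly 2 million bytes per entry, which matches the estimate above.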

Regarding your idea (David)...

> As an aside, imho, the 'apparent size' option is really a terrible 
> option to include in 'du' and is a violation of the unix philosophy 
> because it has explicitly NOTHING to do with disk management. But that's 
> neither here nor there.
>
> A better way to get the byte count of a file is
>
> stat --format=%s

...I guess you mean that we should do something like this to get the 
totals for a directory and contents:

$ find miller -print0 | xargs -0 stat --format=%s | awk '{sum+=$1}END{print sum}'
145159848954

OK, that does work, but how horrible is it that I can get exactly the same 
answer like so:

$ du -sb miller
145159848954    miller

Of course it's worse if you want to do multiple directories at once.
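
To show what the multiple-directory case might look like with the find/stat approach, here is a minimal sketch, assuming GNU find, GNU stat and awk. The function names apparent_total and apparent_totals are made up for illustration. Unlike the pipeline above, this version counts regular files only, so its totals can differ slightly from du -sb, which may also count directory entries depending on the du version.

```shell
# Sum the byte counts (st_size) of all regular files under one directory.
# Assumes GNU find and GNU stat, as in the pipeline above.
apparent_total() {
    find "$1" -type f -exec stat --format=%s {} + |
        awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Print one "total<TAB>directory" line per argument, like du -sb does.
apparent_totals() {
    for dir in "$@"; do
        printf '%s\t%s\n' "$(apparent_total "$dir")" "$dir"
    done
}

# Example call (directory names are placeholders):
#   apparent_totals miller some_other_dir
```

Compare that with "du -sb miller some_other_dir", which does the same job in one command.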

That's a violation of unix philosophy?  It isn't true that it has nothing 
to do with disk management.  For example, when moving files between 
systems, it might help a lot to know the actual size.  What if I want to 
make a .tar file from a directory?  How large will that file be?  How much 
space will the files take up on tape?  If I'm using tar for tape backup, I 
think the size will be given by --apparent-size, not by ondisk size.
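
For the tar question, a rough estimate can be built from the apparent sizes. The sketch below assumes the basic tar layout: one 512-byte header per file, file data padded up to a multiple of 512 bytes, and two 512-byte zero blocks at the end of the archive. It ignores directory entries, extra headers for long names, and any padding that tar adds to fill out its final record, so treat it as a lower bound, not an exact answer. The function names tar_member_bytes and tar_estimate are made up for illustration.

```shell
# Bytes that one regular file of the given size occupies inside a tar
# archive: a 512-byte header plus the data padded to a 512-byte boundary.
tar_member_bytes() {
    echo $(( 512 + ($1 + 511) / 512 * 512 ))
}

# Rough archive size for all regular files under a directory, plus the
# two 512-byte end-of-archive blocks. Assumes GNU find and GNU stat.
tar_estimate() {
    find "$1" -type f -exec stat --format=%s {} + |
        awk '{ sum += 512 + int(($1 + 511) / 512) * 512 }
             END { printf "%.0f\n", sum + 1024 }'
}

# Example call (directory name is a placeholder):
#   tar_estimate miller
```

With files averaging several megabytes, the per-file padding is small, so the estimate should land close to the --apparent-size total.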

Mike






On Fri, 4 Apr 2014, David Wagle wrote:

> "apparent size" is the "ls -l" size of the file.
>
> Which is the "right" size for you to use depends on what you're trying
> to do.
>
> Apparent size is nearly useless for managing disks -- which is usually what
> you use du for.
>
> Say my disk has blocks that are 1KB. If I have a file with nothing but
> the letter 'A' in it, that will have an apparent size of 1 byte. But
> because the smallest block size on my disk is 1KB, that 1 byte file will
> USE 1 KB of disk space no matter what because the physical data has to be
> recorded in a block and that block will then be marked 'used.'
>
> As an aside, imho, the 'apparent size' option is really a terrible option
> to include in 'du' and is a violation of the unix philosophy because it has
> explicitly NOTHING to do with disk management. But that's neither here nor
> there.
>
>
> On Fri, Apr 4, 2014 at 2:29 PM, Mike Miller <mbmiller+l at gmail.com> wrote:
>
>> On Tue, 1 Apr 2014, Mike Miller wrote:
>>
>>> On Tue, 1 Apr 2014, Ben wrote:
>>>
>>>> -h will always be different from the actual disk usage, you might also
>>>> want to play around with -B option too.
>>>>
>>>
>>> I've done that.  Using --si -sB GB gives the same result as --si -sh. Did
>>> you think that they would be different?
>>>
>>
>> Thanks for the suggestions.  Now I have answers (below).
>>
>> I was misusing the --si option there.  It should be used *instead* of -h,
>> not in conjunction with it.  These two commands should do the same thing
>> when the volume in "dir" is in the multi-gigabyte range...
>>
>> du -s --si dir
>> du -sB GB dir
>>
>> ...and so should these two commands:
>>
>> du -sh dir
>> du -sB G dir
>>
>> The first pair will report in units of 1000*1000*1000 bytes and the second
>> in units of 1024*1024*1024 bytes.
>>
>>
>>
>>>> What happens when you use --apparent-size option.
>>>> --apparent-size
>>>>   print apparent sizes,  rather  than  disk  usage;  although the
>>>>   apparent  size is usually smaller, it may be larger due to holes
>>>>   in ('sparse') files, internal  fragmentation,  indirect blocks,
>>>>   and the like
>>>>
>>>
>>> I want to try that, but I'm having this problem right now:
>>>
>>> $ ls /project/guanwh
>>> ls: cannot access /project/guanwh: Stale file handle
>>>
>>
>> Yep, you nailed it.  That was the issue.  If I use --apparent-size, the
>> results are consistent.  According to supercomputing staff:
>>
>> "it is not a bug, -b implies --apparent-size, so to compare its output
>> to -sm/sh you have to include --apparent-size with -sm/-sh as well.
>>
>> "when the apparent size is different from the reported size it is not a
>> bug in du but rather a feature of the filesystem :)"
>>
>> Now I just have to figure out which is the right size for me -- apparent
>> or reported.  I guess apparent sizes are the real file sizes.  In this
>> example "dir" has about 10,000 files in it, with about half being 5 KB and
>> half about 29 MB:
>>
>> $ du -s --si dir
>> 162G    dir
>>
>> $ du -s --si --apparent-size dir
>> 143G    dir
>>
>> $ du -sb dir
>> 142038799951    dir
>>
>> $ wc -c dir/* | tail -1
>> 142037349967 total
>>
>>
>> One thing to note:  It seems that du always rounds up.  So if 1.1 GB are
>> used, du will report 2 GB.
>>
>>
>> Mike
>> _______________________________________________
>> TCLUG Mailing List - Minneapolis/St. Paul, Minnesota
>> tclug-list at mn-linux.org
>> http://mailman.mn-linux.org/mailman/listinfo/tclug-list
>>
>