This manuscript presents the most rigorous benchmarking of gene annotation algorithms for metagenomic datasets to date. We compare three programs: GeneMark, MetaGeneAnnotator (MGA), and Orphelia. The comparison is based on their performance on simulated fragments from one hundred species of diverse lineages. We define three types of fragments: one drawn from within coding regions (intra-coding) and two drawn from gene edges. In general, the performance of all three programs improves as fragment length increases. In addition, intra-coding fragments show a lower annotation error than gene-edge fragments in all three programs.

Keywords- Metagenomic; Orphelia; MGA; GeneMark; fragments; sensitivity; specificity; error