High-priority TODO list
=======================

o Looks like IMGT sometimes makes a release where all the FASTA files for
  a given organism are empty. For example, in release 202530-1, all the
  IG*.fasta and TR*.fasta files for Mus_musculus_C57BL6J are empty.
  Problem is that

    install_IMGT_germline_db("202530-1", "Mus_musculus_C57BL6J",
                             without.intdata=TRUE, without.auxdata=TRUE)

  seems to work but creates an empty germline db that is unusable.
  It also breaks 'list_germline_dbs(long.listing=TRUE)'.
  So instead we should fail early and loudly with a clear error message.
  Seems that the early check should be in install_germline_db() or
  create_germline_db().
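
  A minimal sketch of what the early check could look like. The helper
  name stop_if_all_fasta_empty() is hypothetical (not part of igblastr),
  and the sketch assumes that the downloaded FASTA files sit in a single
  directory and that "empty" means zero bytes:

  ```r
  # Hypothetical helper: raise an error if every *.fasta file found in
  # 'fasta_dir' is empty (zero bytes). Otherwise return the paths of the
  # non-empty files invisibly.
  stop_if_all_fasta_empty <- function(fasta_dir)
  {
      fasta_files <- list.files(fasta_dir, pattern="\\.fasta$",
                                full.names=TRUE)
      if (length(fasta_files) == 0L)
          stop("no FASTA files found in ", fasta_dir)
      sizes <- file.size(fasta_files)
      is_empty <- is.na(sizes) | sizes == 0
      if (all(is_empty))
          stop("all ", length(fasta_files), " FASTA files in ", fasta_dir,
               " are empty; refusing to create an empty germline db")
      invisible(fasta_files[!is_empty])
  }
  ```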

o Take a look at:
    install_IMGT_germline_db("202614-2", "Ovis_aries")
    # Error in data.frame(allele_name = allele_names, ...  :
    #   row names contain missing values

o Maybe replace 'without.intdata' and 'without.auxdata' arguments with
  'with.intdata' and 'with.auxdata'. With default of TRUE for 'with.intdata'
  and "auto" for 'with.auxdata' ("auto" meaning "include the computed auxdata
  if it's complete").
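
  One possible way to resolve the proposed "auto" value, as a sketch only.
  The helper resolve_with_auxdata() and its 'auxdata_is_complete' argument
  are hypothetical. How completeness of the computed auxdata is determined
  is left open here:

  ```r
  # Hypothetical helper: turn the proposed 'with.auxdata' argument into a
  # single TRUE or FALSE. 'auxdata_is_complete' is assumed to be computed
  # elsewhere.
  resolve_with_auxdata <- function(with.auxdata="auto", auxdata_is_complete)
  {
      if (identical(with.auxdata, "auto"))
          return(isTRUE(auxdata_is_complete))
      if (!(is.logical(with.auxdata) && length(with.auxdata) == 1L &&
            !is.na(with.auxdata)))
          stop("'with.auxdata' must be TRUE, FALSE, or \"auto\"")
      with.auxdata
  }
  ```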

o Fill missing documentation:
  - add extract_auxdata_from_ogrdb_json() example in
    man/download_OGRDB_germline_json.Rd;
  - document 'with.auxdata' and 'imgt.fasta' arguments of
    install_custom_germline_db();
  - document 'without.auxdata' argument of install_IMGT_germline_db();
  - add example to man/auxdata-IO.Rd;
  - document the *_ndm_data functions (ndm_data-IO.Rd file).

o Support the "engineered mouse" use case.

o Question for the IMGT folks: IMGT human J allele IGLJ2A*01 has a codon
  start set to 1, which is surprising for the various reasons explained
  in the comment preceding .EXCLUDED_IMGT_HUMAN_J_ALLELES in file
  R/install_IMGT_germline_db.R

o Question for William Lees: It's not clear whether the j_cdr3_end reported
  in OGRDB json files is 0-based or 1-based or something else.
  See IMPORTANT NOTE in R/download_OGRDB_germline_json.R for the details.

o Why are some of the V alleles provided by IMGT "truncated" at the 3' end
  before they reach the end of their FWR3?
  This is the case for 34 human V alleles and 82 mouse V alleles.
  To identify them e.g. for mouse:

    db_name <- "IMGT-202614-2.Mus_musculus.IGH+IGK+IGL"
    V_alleles <- load_germline_db(db_name, region_types="V")
    intdata <- load_intdata(db_name)
    stopifnot(identical(names(V_alleles), intdata[ , "allele_name"]))
    seq_lens <- width(V_alleles)  # same as lengths(V_alleles)
    table(seq_lens < intdata$fwr3_end)  # 82 are truncated (out of 865)
    head(subset(intdata, seq_lens < fwr3_end))

  A more straightforward way to identify them is to look at the gapped
  sequences. See how we compute 'has_incomplete_fwr3'
  in '?parse_imgt_fasta_headers' for the details.

o Maybe revisit example 2. in '?compute_V_gene_delineations'.

o The augment_germline_db_*() functions are not doing the right thing
  with the internal and auxiliary data. This needs to be addressed.
  This might require deprecating and replacing them with something
  completely different (e.g. something based on install_custom_germline_db()).

o install_custom_germline_db() must enforce loci-specification (e.g.
  IGH+IGL) in the name of the custom db. Not having the loci embedded
  in the name of a germline db breaks list_germline_dbs(long.listing=TRUE).
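
  A sketch of a possible name check. The helper has_loci_suffix() is
  hypothetical. It assumes that the loci specification is the last
  "."-separated part of the db name (as in
  "IMGT-202614-2.Mus_musculus.IGH+IGK+IGL") and that the valid loci are
  IGH, IGK, IGL, TRA, TRB, TRG, and TRD:

  ```r
  # Hypothetical check: TRUE if 'db_name' ends with a loci specification
  # such as ".IGH+IGL". The "." separator and the set of loci are
  # assumptions.
  has_loci_suffix <- function(db_name)
  {
      locus <- "(IG[HKL]|TR[ABGD])"
      pattern <- paste0("\\.", locus, "(\\+", locus, ")*$")
      grepl(pattern, db_name)
  }
  ```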

o Investigate BIG inconsistency between local IgBLAST and web IgBLAST:
  https://www.ncbi.nlm.nih.gov/igblast/

  Try to run the former with the -remote option to execute search remotely,
  and compare. Collect as much data on this issue as possible and send an
  email to the IgBLAST folks at NCBI (blast-help@ncbi.nlm.nih.gov or
  nlm-support@nlm.nih.gov).

o Mismatch/indel summarization:

  The goal is to summarize information about the mismatches and indels
  between the query sequences (BCR or TCR nucleotide sequences) and the
  germline gene allele sequences that they're aligned to.

  Deliverables:

  - tabulate_mismatches(), tabulate_insertions(), tabulate_deletions():
    take the AIRR-formatted data.frame and return a matrix of counts
    with one row per query sequence (i.e. one row per row in the data.frame)
    and 7 columns: fwr1, cdr1, fwr2, cdr2, fwr3, cdr3, fwr4.
    By default the counts are for the mismatches/indels at the nucleotide
    level. What we count is the number of nucleotides involved in
    mismatches or indels, not the number of events, e.g. an insertion
    of 3 nucleotides counts as 3, not 1.

  - Discussed at Hyrien lab meeting on Oct 29:
    (a) support summarization at amino acid level
    (b) report % identity per CDR/FWR regions

  Questions:

  - Should we also add columns for the V, D, J, C regions?
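
  A rough sketch of the counting rule described above, applied to whole
  alignments only (no split by FWR/CDR region yet). The function name
  count_aln_diffs() is hypothetical. It assumes that the inputs are pairs
  of equal-length gapped strings, e.g. the 'sequence_alignment' and
  'germline_alignment' columns of the AIRR-formatted data.frame, and that
  "-" is the only gap character:

  ```r
  # Hypothetical helper: count nucleotides (not events) involved in
  # mismatches, insertions, and deletions. Returns an integer matrix with
  # one row per query/germline pair.
  count_aln_diffs <- function(query_aln, germline_aln)
  {
      stopifnot(is.character(query_aln), is.character(germline_aln),
                length(query_aln) == length(germline_aln),
                all(nchar(query_aln) == nchar(germline_aln)))
      count1 <- function(i) {
          q <- strsplit(query_aln[[i]], "", fixed=TRUE)[[1L]]
          g <- strsplit(germline_aln[[i]], "", fixed=TRUE)[[1L]]
          c(mismatches=sum(q != "-" & g != "-" & q != g),
            insertions=sum(q != "-" & g == "-"),  # gap in germline
            deletions=sum(q == "-" & g != "-"))   # gap in query
      }
      FUN.VALUE <- c(mismatches=0L, insertions=0L, deletions=0L)
      t(vapply(seq_along(query_aln), count1, FUN.VALUE))
  }

  # An insertion of 3 nucleotides counts as 3:
  count_aln_diffs("ACGGGT", "AC---T")
  ```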


Lower-priority TODO list
========================

o igbrowser() improvements:
  - Display pairwise alignment between BCR query sequence and germline
    V/D/J/C sequences.
  - Take a look at visualization tool from IMGT/V-QUEST for inspiration.

o Maybe implement the following advice given by IgBLAST when using one
  of the num_alignments_V/D/J arguments:

  Warning messages:
  1: In .parse_and_issue_warnings(stderr_file) :
    Warning: To obtain better run time performance, please run blastdb_aliastool
    -seqid_file_in <INPUT_FILE_NAME> -seqid_file_out <OUT_FILE_NAME> and use
    <OUT_FILE_NAME> as the argument to -seqidlist


Things to do at BioC 3.22 release time
======================================

o Advertise igblastr:
  - Announce on various bioc-community Slack channels.
  - Announce on the FH-Data Slack (fhdata.slack.com) on channels
    #r-user-comm and #general.
  - Announce on LinkedIn.
  - Try to get an entry in the next R Journal advertising igblastr.
  - Bioinformatics accepts short articles introducing new software.


Lowest-priority TODO list
=========================

o Older versions of IgBLAST (e.g. version 1.19.0) don't necessarily include
  the same data as the latest version (version 1.22.0). However, we
  always initialize igblastr_cache(LIVE_IGDATA) with the content of
  inst/extdata/igdata_store/ (a.k.a. "the igdata store") regardless of what
  version of IgBLAST is used by igblastr (note that the igdata store contains
  the data included in IgBLAST 1.22.0). This means that, if we use a version
  of IgBLAST that includes different data, then igblastr_cache(LIVE_IGDATA)
  in its original state already differs from the data included in the
  IgBLAST that we are using.
  There are some problems with this:
  1. The content of igblastr_cache(LIVE_IGDATA) is not guaranteed to be
     compatible with all versions of IgBLAST.
  2. If we're using a version of IgBLAST that includes data that differs
     from the igdata store, then from the very start (i.e. before
     any run of update_live_igdata()), igdata_info() shows differences
     between live and original auxiliary files. Running reset_live_igdata()
     of course doesn't help because it's a no-op when
     igblastr_cache(LIVE_IGDATA) is in its original state.
  3. Running update_live_igdata() doesn't help either because the updates
     available at NCBI are for the latest version of IgBLAST.
  Proposed solution:
  (a) Implement an internal helper (e.g. compatible_with_igdata_store())
      that can quickly compare the data included in an installation of
      IgBLAST with the igdata store. Should return TRUE or FALSE to indicate
      whether the data is identical or not.
  (b) When using a non-compatible IgBLAST (could be an internal or external
      installation), reset igblastr_cache(LIVE_IGDATA) with the data included
      in that IgBLAST.
      More generally, calling set_igblast_root() should:
      - reset igblastr_cache(LIVE_IGDATA) with the data included in the
        selected IgBLAST;
      - raise an error if the internal_data/ or optional_file/ subdir is
        missing in the selected IgBLAST;
      - print a message that suggests running update_live_igdata() only
        when selecting a compatible IgBLAST (message should be similar to
        what .print_tip_if_live_igdata_needs_check() does when is.infinite(dt),
        see R/zzz.R).
      Open question: When do we do the above if the IgBLAST to use is an
      external IgBLAST installation selected via IGBLAST_ROOT?
  (c) Disable update_live_igdata() if we're using a non-compatible IgBLAST.
      In this case, the function should raise an error with an error message
      that explains the situation.
  (d) The .onLoad() hook should only print the "igblastr tip" if igblastr
      already has access to an IgBLAST installation (i.e. if get_igblast_root()
      works) and if that installation is compatible with the igdata store. Note
      that since update_live_igdata() is disabled when the selected IgBLAST is
      non-compatible, time_since_live_igdata_last_checked() will always
      return Inf in that case, but we shouldn't even need to call
      time_since_live_igdata_last_checked() in that case.
  (e) Change what igdata_info() displays when the selected IgBLAST is
      non-compatible. For example 'last_checked:' and 'last_updated:' could
      display something like 'checking is disabled' and 'updating is disabled'.
      Or don't display these fields.
  (f) About install_igblast(): ncbi-igblast-1.22.0+.dmg is missing the
      internal_data/ and optional_file/ folders but install_igblast() is
      able to fix this.
      Make sure that the fix is applied when installing version 1.22.0 only.
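
  A possible building block for the compatible_with_igdata_store() helper
  proposed in (a), as a sketch only. The function names dir_md5() and
  same_dir_content() are hypothetical. The idea is to compare two
  directory trees by relative file path and MD5 checksum. Which
  directories to compare (e.g. the internal_data/ and optional_file/
  subdirs) is left to the caller:

  ```r
  # MD5 checksums of all the files under 'dir', named by relative path.
  dir_md5 <- function(dir)
  {
      files <- sort(list.files(dir, recursive=TRUE))
      md5 <- tools::md5sum(file.path(dir, files))
      setNames(unname(md5), files)
  }

  # TRUE if the two directory trees contain the same files with the
  # same content, FALSE otherwise.
  same_dir_content <- function(dir1, dir2)
      identical(dir_md5(dir1), dir_md5(dir2))
  ```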

o Migrate code in R/OGRDB-utils.R from OGRDB API v1 to OGRDB API v2.
  See OGRDB API v2.0.0 Guide here:
  https://github.com/airr-community/ogrdb/blob/master/schema/ogrdb_api_v2_guide.md

o Add igblastp(), a wrapper to the igblastp standalone executable included
  in IgBLAST. Requested by Dr Iman Haddad in an email from Aug 12, 2025.

o Add 'clonotype_out' arg to igblastn(). Add examples in man page and
  vignette that use this functionality.

o It was mentioned that some people use mixeR to analyse TCR sequences.
  How does this compare to using igblastn(..., ig_seqtype="TCR")?

o Add bibliography to vignette. See AuthoringRmdVignettes.Rmd vignette in
  BiocStyle for how to do this.

o Add Seqinfo to Imports (but wait until BioC 3.23 for that). Note
  that we'll still need GenomeInfoDb just for list_ftp_dir().

o Clarify provenance of 1279067_1_Paired_sequences.fasta.gz and its licence.
  Give appropriate credit. See https://opig.stats.ox.ac.uk/webapps/oas/

o One should be able to pass the name of an IMGT germline db to
  install_IMGT_germline_db(), or a vector of names.

o Improve read_igblastn_fmt7_output.Rd man page (e.g. document customized
  format 7 and list_outfmt7_specifiers()) as well as associated unit tests (in
  tests/testthat/test-outfmt7-utils.R).

o Maybe make 'num_threads' an explicit argument with default to 4?
  The doc should show how to specify a higher but still reasonable
  custom value based on detectCores().
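
  Something along these lines could go in the doc, as a sketch only. The
  helper pick_num_threads() and the cap of 8 threads are arbitrary choices
  made for illustration:

  ```r
  library(parallel)

  # Hypothetical helper: use all cores but one, capped at 'max_threads',
  # and never fewer than 1. Falls back to 1 if detectCores() returns NA.
  pick_num_threads <- function(max_threads=8L)
  {
      ncores <- detectCores()
      if (is.na(ncores))
          return(1L)
      max(1L, min(as.integer(max_threads), ncores - 1L))
  }

  num_threads <- pick_num_threads()
  ```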

o Parse $footer part of output format 7.

o Implement parsing of output formats 3 and 4?

o Set environment variable IGDATA so that IgBLAST can find the internal_data
  directory. Note that IGDATA must be set to the **parent** directory of the
  internal_data directory, not to the internal_data directory itself.
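
  For illustration only, using a placeholder path under tempdir() (the
  real location of internal_data/ depends on the IgBLAST installation
  in use):

  ```r
  # 'internal_data_dir' is a placeholder path, not a real IgBLAST location.
  internal_data_dir <- file.path(tempdir(), "igblast", "internal_data")
  dir.create(internal_data_dir, recursive=TRUE, showWarnings=FALSE)

  # IGDATA points to the parent of internal_data/, not to internal_data/.
  Sys.setenv(IGDATA=dirname(internal_data_dir))
  Sys.getenv("IGDATA")
  ```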

o Great resource for how to use AIRR Community Reference germline sets with
  IgBLAST: https://williamdlees.github.io/receptor_utils/_build/html/airrc_sets_with_igblast.html
  In particular, the author seems to be using an OGRDB REST API version 2:
    https://ogrdb.airr-community.org/api_v2
  but where is this API documented?
  Note that internal utility igblastr:::.fetch_germline_set_from_OGRDB()
  (implemented in R/OGRDB-utils.R) uses the OGRDB REST API at
    https://ogrdb.airr-community.org/api
  which is poorly documented and is somewhat confusing.

o Why are some loci/groups missing for some mouse strains at OGRDB?
  For example for strain A/J, only sequences from the light chain (i.e.
  groups IGKV, IGKJ, IGLV, and IGLJ) seem to be available.
  See https://ogrdb.airr-community.org/germline_sets/Mus%20musculus

o Implement install_OGRDB_germline_db(). Will download the germline sequences
  from https://ogrdb.airr-community.org/

