DMTN-304
Processing DP0.2 at FrDF: A comparison with DP0.2 catalogs produced at IDF
Abstract
The purpose of this note is to compare the final catalogs produced by the DP0.2 processing performed at FrDF with the reference catalogs produced at IDF.
Introduction
In this note, we report the results of the comparison between the DP0.2 catalogs produced at FrDF and the reference catalogs produced at IDF.
In the context of Data Preview 0.2 (DP0.2), the Data Release Production pipelines were executed on the DC-2 simulated dataset generated by the Dark Energy Science Collaboration (DESC). This dataset includes 20 000 simulated exposures, representing 300 square degrees of Rubin images with a typical depth of 5 years. DP0.2 ran at the Interim Data Facility (IDF), and the full exercise was independently replicated at FrDF (CC-IN2P3), as described in Le Boulc'h et al. [2024].
In this note we first describe the catalogs and how we retrieved the data. We then report the analysis performed on each table, checking two main objectives: how the source positions on the sky match, and how the source fluxes compare (when applicable).
The data and notebooks used are available in the CC-IN2P3 GitLab.
The catalogs in Qserv
The catalogs have been ingested into the Qserv production instance at FrDF:
dp02_dc2_catalogs_frdf: the catalog produced at CC-IN2P3 (hereafter the FrDF catalog)
dp02_dc2_catalogs: the catalog produced at IDF (hereafter the IDF catalog)
For the FrDF catalog, two tables (TruthSummary and MatchesTruth) are missing because they require post-processing before being ingested into Qserv.
The following image shows the number of rows per table in the FrDF and IDF catalogs. CcdVisit and Visit, produced by pipeline Step 7, were produced twice at FrDF; for our purposes, the FrDF data were filtered to remove the duplicated rows. This problem needs to be addressed: we have to be able to flag invalid tables so they can be detected before the ingestion process.
It is not possible to compare the full catalogs, so for the analysis reported here we used a subsample of both catalogs, selected using a spatial query such as:
SELECT <column1>, <column2>, ..., <columnN> FROM <table> WHERE scisql_s2PtInCircle(<ra>, <decl>, 60.0, -30.0, 0.5) = 1 LIMIT 5000000
We limited the number of retrieved rows to 5M, but for tables with a large number of rows (sources) we also reduced the query radius: a radius of 0.5 degree is too large, and the number of sources in the area defined by the circle largely exceeds the limit we imposed on the number of rows. When this happens, the retrieved catalogs are not comparable because the sources in the tables do not cover the same region, as shown in the following image for the ForcedSource table retrieved in a radius of 0.5 deg, where the objects extracted from FrDF are in red and the objects extracted from IDF are in blue.
To reduce the table size, we also retrieved a subset of columns (ra, dec, and fluxes). Only for a few small tables did we retrieve all columns.
The fluxes were converted to AB magnitudes using the SQL UDF scisql_nanojanskyToAbMag integrated in Qserv.
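This UDF implements the standard AB zero point for fluxes in nanojansky; a minimal Python equivalent, useful for cross-checking values offline (a sketch, not the UDF's actual code), is:

import numpy as np

def nanojansky_to_ab_mag(flux_njy):
    # m_AB = -2.5 log10(f / 3631 Jy); with f in nJy this reduces to +31.4.
    return -2.5 * np.log10(flux_njy) + 31.4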
All the queries used for each table are reported in the query notebook.
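For illustration, a query of this kind could also be submitted programmatically. The sketch below assumes Qserv is exposed through a TAP service that accepts this SQL dialect, and uses pyvo; the service URL is a placeholder, not the actual FrDF endpoint:

import pyvo

# Placeholder endpoint: replace with the actual TAP service in front of Qserv.
service = pyvo.dal.TAPService("https://example.org/api/tap")

query = """
SELECT objectId, coord_ra, coord_dec
FROM dp02_dc2_catalogs_frdf.Object
WHERE scisql_s2PtInCircle(coord_ra, coord_dec, 60.0, -30.0, 0.5) = 1
LIMIT 5000000
"""
result = service.search(query).to_table()  # returns an astropy Table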
The analysis was performed offline: all the tables were retrieved once and stored locally as FITS files (available in the fits directory of this repository). For each table a file named <df>_<table>.fits was generated, and a new column (DF) was added to each table, making it easy to identify the data origin during the analysis.
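A minimal sketch of this bookkeeping with astropy (file names are hypothetical, following the <df>_<table>.fits convention above):

from astropy.table import Table, vstack

tables = []
for df_name in ("frdf", "idf"):  # hypothetical values of <df>
    t = Table.read(f"fits/{df_name}_Object.fits")
    t["DF"] = df_name  # tag every row with its data facility of origin
    tables.append(t)

combined = vstack(tables)  # one table, origin identified by the DF column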
Topcat was used to quickly validate the retrieved datasets and to filter out the duplicated rows in the Visit and CcdVisit tables.
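The same deduplication can also be done programmatically; a sketch with astropy's unique, assuming visitId is the key column (the actual key may differ):

from astropy.table import Table, unique

visits = Table.read("fits/frdf_Visit.fits")    # hypothetical file name
deduplicated = unique(visits, keys="visitId")  # keep one row per visit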
Comparison
For the analysis, an interactive notebook allows selecting the table and the type of plot to generate. For each table we also created a dedicated notebook (available in the notebook directory), and we generated an interactive HTML page with all the plots (coordinates and magnitudes), available in the html directory.
We performed a match between the catalogs to establish a correct correspondence between rows and to avoid spurious results (e.g. comparing sources in different regions of the sky). When the retrieved tables had the same number of rows, we used the astropy module (in that case we reordered the tables so that the rows match). When the numbers of rows differed, we used the Topcat STILTS functions (as implemented in pystilts). We used STILTS because it allows a “symmetric match”, i.e. only one match per source: this is not possible with astropy, where multiple matches can occur and lead to wrong results.
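For the astropy case, the match is a nearest-neighbour search on the sky; a minimal sketch (file and column names assumed, following the conventions above):

import astropy.units as u
from astropy.coordinates import SkyCoord
from astropy.table import Table

t_frdf = Table.read("fits/frdf_Object.fits")  # hypothetical file names
t_idf = Table.read("fits/idf_Object.fits")

frdf = SkyCoord(t_frdf["coord_ra"] * u.deg, t_frdf["coord_dec"] * u.deg)
idf = SkyCoord(t_idf["coord_ra"] * u.deg, t_idf["coord_dec"] * u.deg)

# Nearest neighbour in the IDF catalog for every FrDF source. Several FrDF
# sources may end up pointing to the same IDF source: astropy does not
# enforce the one-to-one ("symmetric") match that STILTS provides.
idx, d2d, _ = frdf.match_to_catalog_sky(idf)
good = d2d < 1 * u.arcsec  # keep only pairs closer than 1 arcsec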
Even under these conditions, matching the ForcedSource table can be very difficult, if not impossible. To better understand this point, see the next figures.
The following figure shows the data retrieved for three tables:
Object (radius = 0.5 deg, 395952 objects)
Source (radius = 0.1 deg, 561385 objects)
ForcedSource (radius = 0.05 deg, 1643290 objects)
ForcedSource has four times more entries than the Object table in a region 100 times smaller.
Looking at the source density, we can see how hard it can be for a matching algorithm to find the right counterpart. The next figures show the number of sources per “pixel” in the different tables (a minimal sketch of this binning is given after the list). The pixel sizes are:
Object table: 12x12 arcsec (we cannot generate smaller pixels)
Source table: 7.2x7.2 arcsec
ForcedSource table: 7.2x7.2 arcsec
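A minimal sketch of how such density maps can be binned with numpy and matplotlib (file and column names assumed; 7.2 arcsec pixels as above):

import matplotlib.pyplot as plt
import numpy as np
from astropy.table import Table

t = Table.read("fits/frdf_ForcedSource.fits")  # hypothetical file name
ra, dec = np.asarray(t["coord_ra"]), np.asarray(t["coord_dec"])

pixel = 7.2 / 3600.0  # pixel size in degrees
bins = [np.arange(ra.min(), ra.max() + pixel, pixel),
        np.arange(dec.min(), dec.max() + pixel, pixel)]
counts, xe, ye = np.histogram2d(ra, dec, bins=bins)

plt.imshow(counts.T, origin="lower", extent=(xe[0], xe[-1], ye[0], ye[-1]))
plt.colorbar(label="sources per pixel")
plt.xlabel("ra [deg]")
plt.ylabel("dec [deg]")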
The number of sources per pixel can be read from the scale on the right: ForcedSource has a very large number of sources per pixel, in some cases more than 3000. Clearly, a matching algorithm using a separation of 1 arcsec to match two points cannot be 100% reliable in this case. Excluding the ForcedSource table, the matching algorithm worked as expected.
Source positions analysis
To analyse how the sky positions of the objects match, we compared ra and dec, and we also analysed the sky separation (i.e. the great-circle distance), estimated with astropy as follows:
import astropy.units as u
from astropy.coordinates import SkyCoord

c1 = SkyCoord(df[ra_1] * u.deg, df[decl_1] * u.deg, frame='icrs')
c2 = SkyCoord(df[ra_2] * u.deg, df[decl_2] * u.deg, frame='icrs')
sep = c1.separation(c2).degree  # great-circle distance in degrees
An example of the distribution of the sky separation is visible in the next figure.
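Such a distribution can be plotted directly from the separations computed above; a minimal sketch:

import matplotlib.pyplot as plt

sep_arcsec = c1.separation(c2).arcsec  # c1, c2 as defined above
plt.hist(sep_arcsec, bins=100)
plt.xlabel("sky separation [arcsec]")
plt.ylabel("number of matched sources")
plt.yscale("log")  # a log scale makes the tail of the distribution visible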
Fluxes (magnitudes) analysis
For the flux comparison, we converted fluxes from nJy to AB magnitudes.
For each table we selected the flux columns and converted them to AB magnitudes using the scisql_nanojanskyToAbMag UDF integrated in Qserv.
Then, for each magnitude, we plotted the histogram and the box plot of the distribution for each catalog, as shown in the next figure.
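A minimal sketch of such a comparison plot, reusing the combined table with the DF column built earlier (mag_g is a hypothetical magnitude column name):

import matplotlib.pyplot as plt

frdf_mag = combined["mag_g"][combined["DF"] == "frdf"]
idf_mag = combined["mag_g"][combined["DF"] == "idf"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist([frdf_mag, idf_mag], bins=50, histtype="step", label=["FrDF", "IDF"])
ax1.set_xlabel("AB magnitude")
ax1.legend()
ax2.boxplot([frdf_mag, idf_mag], labels=["FrDF", "IDF"])
ax2.set_ylabel("AB magnitude")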
References
Quentin Le Boulc'h, Fabio Hernandez, and Gabriele Mainetti. The Rubin Observatory's Legacy Survey of Space and Time DP0.2 processing campaign at CC-IN2P3. EPJ Web Conf., 295:04049, 2024. doi:10.1051/epjconf/202429504049.