Commit

deploy: 69e199e
drcege committed Aug 22, 2024
1 parent 8403546 commit 3ce3325
Showing 5 changed files with 52 additions and 4 deletions.
10 changes: 7 additions & 3 deletions _modules/data_juicer/core/data.html
@@ -271,7 +271,7 @@ <h1>Source code for data_juicer.core.data</h1><div class="highlight"><pre>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">op</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">exporter</span><span class="o">=</span><span class="n">exporter</span><span class="p">,</span> <span class="n">tracer</span><span class="o">=</span><span class="n">tracer</span><span class="p">)</span>
<span class="c1"># record processed ops</span>
<span class="k">if</span> <span class="n">checkpointer</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
-<span class="n">checkpointer</span><span class="o">.</span><span class="n">record</span><span class="p">(</span><span class="n">op</span><span class="o">.</span><span class="n">_of_cfg</span><span class="p">)</span>
+<span class="n">checkpointer</span><span class="o">.</span><span class="n">record</span><span class="p">(</span><span class="n">op</span><span class="o">.</span><span class="n">_op_cfg</span><span class="p">)</span>
<span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s1">&#39;OP [</span><span class="si">{</span><span class="n">op</span><span class="o">.</span><span class="n">_name</span><span class="si">}</span><span class="s1">] Done in </span><span class="si">{</span><span class="n">end</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start</span><span class="si">:</span><span class="s1">.3f</span><span class="si">}</span><span class="s1">s. &#39;</span>
<span class="sa">f</span><span class="s1">&#39;Left </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span><span class="si">}</span><span class="s1"> samples.&#39;</span><span class="p">)</span>
@@ -280,7 +280,7 @@ <h1>Source code for data_juicer.core.data</h1><div class="highlight"><pre>
<span class="n">traceback</span><span class="o">.</span><span class="n">print_exc</span><span class="p">()</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">finally</span><span class="p">:</span>
-<span class="k">if</span> <span class="n">checkpointer</span><span class="p">:</span>
+<span class="k">if</span> <span class="n">checkpointer</span> <span class="ow">and</span> <span class="n">dataset</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s1">&#39;Writing checkpoint of dataset processed by &#39;</span>
<span class="s1">&#39;last op...&#39;</span><span class="p">)</span>
<span class="n">dataset</span><span class="o">.</span><span class="n">cleanup_cache_files</span><span class="p">()</span>
@@ -416,7 +416,11 @@ <h1>Source code for data_juicer.core.data</h1><div class="highlight"><pre>
<span class="w"> </span><span class="sd">&quot;&quot;&quot;Override the cleanup_cache_files func, clear raw and compressed</span>
<span class="sd"> cache files.&quot;&quot;&quot;</span>
<span class="n">cleanup_compressed_cache_files</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
-<span class="k">return</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">cleanup_cache_files</span><span class="p">()</span></div></div>
+<span class="k">return</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">cleanup_cache_files</span><span class="p">()</span></div>
+
+<div class="viewcode-block" id="NestedDataset.load_from_disk"><a class="viewcode-back" href="../../../data_juicer.core.html#data_juicer.core.NestedDataset.load_from_disk">[docs]</a> <span class="nd">@staticmethod</span>
+<span class="k">def</span> <span class="nf">load_from_disk</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">):</span>
+<span class="k">return</span> <span class="n">NestedDataset</span><span class="p">(</span><span class="n">Dataset</span><span class="o">.</span><span class="n">load_from_disk</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kargs</span><span class="p">))</span></div></div>


<span class="k">def</span> <span class="nf">nested_query</span><span class="p">(</span><span class="n">root_obj</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="n">NestedDatasetDict</span><span class="p">,</span> <span class="n">NestedDataset</span><span class="p">,</span>
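The changed `finally` guard in `data.html` can be read as: write a checkpoint only once at least one op has produced a new dataset object. The commit does not state its motivation, so this is a minimal sketch under that assumption. `ToyDataset`, `ToyCheckpointer`, `save_ckpt`, and the two ops are hypothetical stand-ins, not data_juicer code.

```python
class ToyCheckpointer:
    """Hypothetical stand-in that only remembers what it was asked to do."""

    def __init__(self):
        self.recorded_ops = []
        self.saved = []

    def record(self, op_cfg):
        self.recorded_ops.append(op_cfg)

    def save_ckpt(self, dataset):
        self.saved.append(dataset)


class ToyDataset:
    """Hypothetical dataset holding a plain list of rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def process(self, ops, checkpointer=None):
        dataset = self
        try:
            for op in ops:
                # Each op yields a new object, so `dataset` stops being
                # `self` as soon as one op succeeds.
                dataset = ToyDataset(op(dataset.rows))
                if checkpointer is not None:
                    checkpointer.record(op.__name__)
        finally:
            # Same guard as in the diff: skip the checkpoint when no op
            # finished and `dataset` is still the untouched input.
            if checkpointer and dataset is not self:
                checkpointer.save_ckpt(dataset)
        return dataset


def drop_empty(rows):
    # Toy op: keep only non-empty rows.
    return [r for r in rows if r]


def always_fails(rows):
    # Toy op that fails before producing any output.
    raise RuntimeError("op failed")
```

With `drop_empty` the checkpointer receives the processed dataset. With `always_fails` as the first op, the exception propagates and nothing is saved, because `dataset` is still `self` when the `finally` block runs.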
42 changes: 42 additions & 0 deletions data_juicer.core.html
@@ -205,6 +205,48 @@
cache files.</p>
</dd></dl>

<dl class="py method">
<dt class="sig sig-object py" id="data_juicer.core.NestedDataset.load_from_disk">
<em class="property"><span class="pre">static</span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">load_from_disk</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="o"><span class="pre">*</span></span><span class="n"><span class="pre">args</span></span></em>, <em class="sig-param"><span class="o"><span class="pre">**</span></span><span class="n"><span class="pre">kargs</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="_modules/data_juicer/core/data.html#NestedDataset.load_from_disk"><span class="viewcode-link"><span class="pre">[source]</span></span></a><a class="headerlink" href="#data_juicer.core.NestedDataset.load_from_disk" title="Permalink to this definition"></a></dt>
<dd><p>Loads a dataset that was previously saved using [<cite>save_to_disk</cite>] from a dataset directory, or from a
filesystem using any implementation of <cite>fsspec.spec.AbstractFileSystem</cite>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>dataset_path</strong> (<cite>str</cite>) – Path (e.g. <cite>“dataset/train”</cite>) or remote URI (e.g. <cite>“s3://my-bucket/dataset/train”</cite>)
of the dataset directory where the dataset will be loaded from.</p></li>
<li><p><strong>fs</strong> (<cite>fsspec.spec.AbstractFileSystem</cite>, <em>optional</em>) – <p>Instance of the remote filesystem where the dataset will be loaded from.</p>
<p>&lt;Deprecated version=”2.8.0”&gt;</p>
<p><cite>fs</cite> was deprecated in version 2.8.0 and will be removed in 3.0.0.
Please use <cite>storage_options</cite> instead, e.g. <cite>storage_options=fs.storage_options</cite></p>
<p>&lt;/Deprecated&gt;</p>
</p></li>
<li><p><strong>keep_in_memory</strong> (<cite>bool</cite>, defaults to <cite>None</cite>) – Whether to copy the dataset in-memory. If <cite>None</cite>, the
dataset will not be copied in-memory unless explicitly enabled by setting
<cite>datasets.config.IN_MEMORY_MAX_SIZE</cite> to nonzero. See more details in the
[improve performance](../cache#improve-performance) section.</p></li>
<li><p><strong>storage_options</strong> (<cite>dict</cite>, <em>optional</em>) – <p>Key/value pairs to be passed on to the file-system backend, if any.</p>
<p>&lt;Added version=”2.8.0”/&gt;</p>
</p></li>
</ul>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p><ul class="simple">
<li><p>If <cite>dataset_path</cite> is a path of a dataset directory, the dataset requested.</p></li>
<li><p>If <cite>dataset_path</cite> is a path of a dataset dict directory, a <cite>datasets.DatasetDict</cite> with each split.</p></li>
</ul>
</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p>[<cite>Dataset</cite>] or [<cite>DatasetDict</cite>]</p>
</dd>
</dl>
<p>Example:</p>
<p><code class="docutils literal notranslate"><span class="pre">`py</span>
<span class="pre">&gt;&gt;&gt;</span> <span class="pre">ds</span> <span class="pre">=</span> <span class="pre">load_from_disk(&quot;path/to/dataset/directory&quot;)</span>
<span class="pre">`</span></code></p>
</dd></dl>

</dd></dl>

<dl class="py class">
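The new `NestedDataset.load_from_disk` forwards its arguments to `Dataset.load_from_disk` and wraps the result, so callers get a `NestedDataset` back rather than the base type. A minimal sketch of that wrapping pattern follows. Both classes are hypothetical stand-ins that neither import the real `datasets` library nor read from disk.

```python
class Dataset:
    """Hypothetical stand-in for the base dataset class."""

    def __init__(self, rows):
        self.rows = list(rows)

    @staticmethod
    def load_from_disk(dataset_path):
        # Pretend to read from `dataset_path`; always returns the base type.
        return Dataset([{"text": f"sample from {dataset_path}"}])


class NestedDataset(Dataset):
    """Hypothetical subclass that wraps an existing base dataset."""

    def __init__(self, base):
        super().__init__(base.rows)

    @staticmethod
    def load_from_disk(*args, **kwargs):
        # Delegate loading to the base class, then re-wrap the result so
        # the caller keeps the subclass and its extra behaviour.
        return NestedDataset(Dataset.load_from_disk(*args, **kwargs))


base = Dataset.load_from_disk("path/to/dataset/directory")
nested = NestedDataset.load_from_disk("path/to/dataset/directory")
```

Here `base` is a plain `Dataset`, while `nested` is a `NestedDataset` carrying the same rows.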
2 changes: 2 additions & 0 deletions genindex.html
@@ -778,6 +778,8 @@ <h2 id="L">L</h2>
</ul></td>
<td style="width: 33%; vertical-align: top;"><ul>
<li><a href="data_juicer.format.html#data_juicer.format.load_formatter">load_formatter() (in module data_juicer.format)</a>
</li>
<li><a href="data_juicer.core.html#data_juicer.core.NestedDataset.load_from_disk">load_from_disk() (data_juicer.core.NestedDataset static method)</a>
</li>
<li><a href="data_juicer.ops.html#data_juicer.ops.load_ops">load_ops() (in module data_juicer.ops)</a>
</li>
Binary file modified objects.inv
Binary file not shown.
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.
