Merge pull request #127 from begoldsm/master
Merge fixes and improvements from dev.
begoldsm authored Jan 18, 2017
2 parents 8a95ac2 + acdaeb2 commit 985ba77
Showing 87 changed files with 4,269 additions and 3,800 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -10,6 +10,7 @@ matrix:
- python: 3.3
- python: 3.4
- python: 3.5
- python: 3.6

install:
# Install dependencies
44 changes: 34 additions & 10 deletions README.rst
@@ -26,7 +26,7 @@ To play with the code, here is a starting point:
from azure.datalake.store import core, lib, multithread
token = lib.auth(tenant_id, username, password)
adl = core.AzureDLFileSystem(store_name, token)
adl = core.AzureDLFileSystem(token, store_name=store_name)
# typical operations
adl.ls('')
@@ -53,18 +53,42 @@ To play with the code, here is a starting point:
# 16MB chunks
multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)
Command Line Sample Usage
------------------
To interact with the API at a higher-level, you can use the provided
command-line interface in "azure/datalake/store/cli.py". You will need to set
command-line interface in "samples/cli.py". You will need to set
the appropriate environment variables as described above to connect to the
Azure Data Lake Store.
Azure Data Lake Store. Below is a simple sample, with more details beyond.


.. code-block:: bash
python samples\cli.py ls -l
Execute the program without arguments to access documentation.


Contents
========

.. toctree::
api
:maxdepth: 2

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


To start the CLI in interactive mode, run "python azure/datalake/store/cli.py"
To start the CLI in interactive mode, run "python samples/cli.py"
and then type "help" to see all available commands (similar to Unix utilities):

.. code-block:: bash
> python azure/datalake/store/cli.py
> python samples/cli.py
azure> help
Documented commands (type help <topic>):
@@ -83,7 +107,7 @@ familiar with the Unix/Linux "ls" command, the columns represent 1) permissions,

.. code-block:: bash
> python azure/datalake/store/cli.py
> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd 0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1048576 Jul 25 18:33 abc.csv
@@ -103,7 +127,7 @@ named after the remote file minus the directory path.

.. code-block:: bash
> python azure/datalake/store/cli.py
> python samples/cli.py
azure> ls -l
drwxrwx--- 0123abcd 0123abcd 0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1048576 Jul 25 18:33 abc.csv
@@ -124,7 +148,7 @@ For example, listing the entries in the home directory:

.. code-block:: bash
> python azure/datalake/store/cli.py ls -l
> python samples/cli.py ls -l
drwxrwx--- 0123abcd 0123abcd 0 Aug 02 12:44 azure1
-rwxrwx--- 0123abcd 0123abcd 1048576 Jul 25 18:33 abc.csv
-r-xr-xr-x 0123abcd 0123abcd 36 Jul 22 18:32 xyz.csv
@@ -136,7 +160,7 @@ Also, downloading a remote file:

.. code-block:: bash
> python azure/datalake/store/cli.py get xyz.csv
> python samples/cli.py get xyz.csv
2016-08-04 18:57:48,603 - ADLFS - DEBUG - Creating empty file xyz.csv
2016-08-04 18:57:48,604 - ADLFS - DEBUG - Fetch: xyz.csv, 0-36
2016-08-04 18:57:49,726 - ADLFS - DEBUG - Downloaded to xyz.csv, byte offset 0
8 changes: 5 additions & 3 deletions azure-data-lake-store-python.pyproj
@@ -21,12 +21,11 @@
<PtvsTargetsFile>$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\Python Tools\Microsoft.PythonTools.targets</PtvsTargetsFile>
</PropertyGroup>
<ItemGroup>
<Content Include="dev_requirements.txt" />
<Content Include="docs\requirements.txt" />
<Content Include="License.txt" />
<Content Include="requirements.txt" />
</ItemGroup>
<ItemGroup>
<Compile Include="azure\datalake\store\cli.py" />
<Compile Include="azure\datalake\store\core.py" />
<Compile Include="azure\datalake\store\exceptions.py" />
<Compile Include="azure\datalake\store\lib.py" />
@@ -37,8 +36,10 @@
<Compile Include="azure\datalake\__init__.py" />
<Compile Include="azure\__init__.py" />
<Compile Include="docs\source\conf.py" />
<Compile Include="samples\benchmarks.py" />
<Compile Include="samples\cli.py" />
<Compile Include="samples\__init__.py" />
<Compile Include="setup.py" />
<Compile Include="tests\benchmarks.py" />
<Compile Include="tests\conftest.py" />
<Compile Include="tests\fake_settings.py" />
<Compile Include="tests\settings.py" />
@@ -57,6 +58,7 @@
<Folder Include="azure\datalake\store" />
<Folder Include="docs" />
<Folder Include="docs\source" />
<Folder Include="samples\" />
<Folder Include="tests" />
</ItemGroup>
<Import Project="$(PtvsTargetsFile)" Condition="Exists($(PtvsTargetsFile))" />
6 changes: 6 additions & 0 deletions azure-data-lake-store-python.pyproj.user
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
<ProjectView>ProjectFiles</ProjectView>
</PropertyGroup>
</Project>
5 changes: 3 additions & 2 deletions azure/datalake/store/lib.py
@@ -108,11 +108,12 @@ def auth(tenant_id=None, username=None,
if not authority:
authority = 'https://login.microsoftonline.com/'

context = adal.AuthenticationContext(authority +
tenant_id)
if not tenant_id:
tenant_id = os.environ.get('azure_tenant_id', "common")

context = adal.AuthenticationContext(authority +
tenant_id)

if tenant_id is None or client_id is None:
raise ValueError("tenant_id and client_id must be supplied for authentication")
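
The reordering in this hunk matters because the authority URL is built by concatenating `authority` and `tenant_id`. If the concatenation runs before the fallback, a missing `tenant_id` is still `None` at that point. The following is a minimal sketch of the corrected order only, not the library's code: `build_authority_url` is a hypothetical helper, and the `environ` parameter is an assumption added so the lookup can be exercised without touching the real environment.

```python
import os

DEFAULT_AUTHORITY = 'https://login.microsoftonline.com/'


def build_authority_url(tenant_id=None, authority=None, environ=None):
    """Hypothetical helper: resolve both defaults first, then concatenate."""
    if environ is None:
        environ = os.environ
    if not authority:
        authority = DEFAULT_AUTHORITY
    # Fall back to the azure_tenant_id variable, then to "common",
    # BEFORE the value is used to build the URL.
    if not tenant_id:
        tenant_id = environ.get('azure_tenant_id', 'common')
    return authority + tenant_id
```

An explicit argument wins over the environment variable, which in turn wins over the `"common"` default. One consequence of that default, as far as this hunk shows, is that `tenant_id` can no longer be `None` by the time the later `ValueError` check runs, so only a missing `client_id` would still trigger it.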

16 changes: 15 additions & 1 deletion azure/datalake/store/multithread.py
@@ -321,6 +321,12 @@ def __init__(self, adlfs, rpath, lpath, nthreads=None, chunksize=2**28,
delimiter=None, overwrite=False, verbose=True):
if not overwrite and adlfs.exists(rpath):
raise FileExistsError(rpath)

# forcibly remove the target file before execution
# if the user indicates they want to overwrite the destination.
if overwrite and adlfs.exists(rpath):
adlfs.remove(rpath)

if client:
self.client = client
else:
@@ -478,10 +484,18 @@ def put_chunk(adlfs, src, dst, offset, size, buffersize, blocksize, delimiter=No
return nbytes, None


def merge_chunks(adlfs, outfile, files, shutdown_event=None):
def merge_chunks(adlfs, outfile, files, shutdown_event=None, overwrite=False):
try:
# note that it is assumed that only temp files from this run are in the segment folder created.
# so this call is optimized to instantly delete the temp folder on concat.
# if somehow the target file was created between the beginning of upload
# and concat, we will remove it if the user specified overwrite.
if adlfs.exists(outfile):
if overwrite:
adlfs.remove(outfile)
else:
raise FileExistsError(outfile)

adlfs.concat(outfile, files, delete_source=True)
except Exception as e:
exception = repr(e)
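
The check added to `merge_chunks` can be tried in isolation. The sketch below is illustrative only: `InMemoryFS` is a hypothetical stand-in that mimics just the `exists`, `remove` and `concat` calls seen in the hunk, and `merge` is a simplified version of the check without the surrounding exception handling.

```python
class InMemoryFS:
    """Hypothetical stand-in for the remote file system (not the library's API)."""

    def __init__(self, files=None):
        self.files = dict(files or {})

    def exists(self, path):
        return path in self.files

    def remove(self, path):
        del self.files[path]

    def concat(self, outfile, parts, delete_source=False):
        # Join the parts in the order given, then optionally drop them.
        self.files[outfile] = b''.join(self.files[p] for p in parts)
        if delete_source:
            for p in parts:
                del self.files[p]


def merge(fs, outfile, parts, overwrite=False):
    """Refuse to clobber an existing target unless overwrite is set."""
    if fs.exists(outfile):
        if not overwrite:
            raise FileExistsError(outfile)
        fs.remove(outfile)
    fs.concat(outfile, parts, delete_source=True)
```

With `overwrite=True` an existing target is replaced by the concatenated parts; with the default, the call raises `FileExistsError` and leaves the target untouched. The existence check and the concat are separate calls, so this narrows the window for a late-created target file but does not remove it.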
4 changes: 3 additions & 1 deletion azure/datalake/store/transfer.py
@@ -20,6 +20,7 @@
import threading
import time
import uuid
import operator

from .exceptions import DatalakeIncompleteTransferException

Expand Down Expand Up @@ -387,7 +388,8 @@ def _update(self, future):
merge_future = self._submit(
self._merge, self._adlfs, dst,
[chunk for chunk, _ in sorted(cstates.objects,
key=lambda obj: obj[1])])
key=operator.itemgetter(1))],
overwrite=self._parent._overwrite)
self._ffutures[merge_future] = parent
else:
self._fstates[parent] = 'finished'
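
Swapping `lambda obj: obj[1]` for `operator.itemgetter(1)` does not change the sort order; both use the second element of each item as the key. A small sketch with made-up data, assuming each item is a `(name, key)` pair where the key is something orderable such as a byte offset (the real contents of `cstates.objects` are not shown in this diff):

```python
import operator

# Hypothetical chunk records: (chunk name, ordering key such as a byte offset).
chunk_states = [('part_2', 2048), ('part_0', 0), ('part_1', 1024)]

# Sort by the second element, then keep only the chunk names.
ordered = [chunk for chunk, _ in
           sorted(chunk_states, key=operator.itemgetter(1))]

# The lambda form used before the change yields the same ordering.
ordered_lambda = [chunk for chunk, _ in
                  sorted(chunk_states, key=lambda obj: obj[1])]
```

Both lists contain the chunk names in ascending key order, which is the order the merge step then passes to the concat call.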
35 changes: 3 additions & 32 deletions docs/source/index.rst
@@ -3,7 +3,7 @@ azure-datalake-store

A pure-python interface to the Azure Data-lake Storage system, providing
pythonic file-system and file objects, seamless transition between Windows and
POSIX remote paths, high-performance up- and down-loader and CLI commands.
POSIX remote paths, high-performance up- and down-loader.

This software is under active development and not yet recommended for general
use.
@@ -29,8 +29,8 @@ Auth
Although users can generate and supply their own tokens to the base file-system
class, and there is a password-based function in the ``lib`` module for
generating tokens, the most convenient way to supply credentials is via
environment parameters. This latter method is the one used by default in both
library and CLI usage. The following variables are required:
environment parameters. This latter method is the one used by default in the
library. The following variables are required:

* azure_tenant_id
* azure_username
@@ -117,32 +117,3 @@ be transferred, files matching a specific glob-pattern or any particular file.
# download the whole directory structure using 5 threads, 16MB chunks
ADLDownloader(adl, '', 'my_temp_dir', 5, 2**24)
Command Line Usage
------------------

The package provides the above functionality also from the command line
(bash, powershell, etc.). Two principle modes are supported: execution of one
particular file-system operation; and interactive mode in which multiple
operations can be executed in series.

.. code-block:: bash
python cli.py ls -l
Execute the program without arguments to access documentation.


Contents
========

.. toctree::
api
:maxdepth: 2

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
Empty file added samples/__init__.py
Empty file.
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions setup.py
@@ -33,6 +33,7 @@
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'License :: OSI Approved :: MIT License',
],
packages=find_packages(exclude=['tests']),
9 changes: 7 additions & 2 deletions tests/README.md
@@ -1,7 +1,11 @@
To run the test suite:
To run the test suite against the published package:

py.test -x -vvv --doctest-modules --pyargs azure-datalake-store tests

To run the test suite against a local build:
python setup.py develop
py.test -x -vvv --doctest-modules --pyargs azure.datalake.store tests

This test suite uses [VCR.py](https://github.com/kevin1024/vcrpy) to record the
responses from Azure. Borrowing from VCR's
[usage](https://vcrpy.readthedocs.io/en/latest/usage.html#record-modes), this
@@ -22,6 +26,7 @@ environment variables should be defined:

* `azure_username`
* `azure_password`
* `azure_store_name`
* `azure_data_lake_store_name`
* `azure_subscription_id`

Optionally, you may need to define `azure_tenant_id` or `azure_url_suffix`.