[WIP] First DataStreams version #13

Open
wants to merge 3 commits into
base: master

evetion
Collaborator

@evetion evetion commented Jan 10, 2018

Here is a WIP for implementing DataStreams. Still rough around the edges, but I'd like some feedback for the overall direction/API.

What

DataStreams seems to be the way to go in the Julia data ecosystem: it enables streaming conversions between, for example, CSV and SQLite, or DataFrames and DataTables. With it, users can easily read LAS files into DataFrames and back, without needing to know any raw header/point information.

Why

This addresses most of @c42f's comments in #4 about a new API and v0.1:

  • Correct header creation
  • No raw header, but a Schema (with the raw header in the metadata)
  • LasIO.Source provides only the header, points on request with stream!, so no tuple (header, points) anymore
  • LasIO.Source works without FileIO
  • DataStreams enables, well, streaming 😄

TODO

For using the Source:

  • Implement adding scale and offset on the fly for XYZ on stream!
  • Expose CRS if there
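A minimal sketch of what the first item could look like, assuming the usual LAS relation `real = raw * scale + offset`. The helper `real_coord` is hypothetical and not part of LasIO; the raw values, the scale of 0.01 and the zero offset are taken from the demo output in this PR.

```julia
# Hypothetical helper (not LasIO API): convert a raw LAS integer coordinate
# to a real-world coordinate using the header's scale and offset.
real_coord(raw::Integer, scale::Real, offset::Real) = raw * scale + offset

# First three raw x values from the demo file; scale 0.01, offset 0.
raw_x = Int32[28981415, 28981464, 28981512]

# Broadcasting applies the conversion per point, which is what a
# streaming Source could do for each value it hands out.
xs = real_coord.(raw_x, 0.01, 0.0)
```

Applying this per value during `stream!` would avoid a second pass over the columns afterwards.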

For using the Sink:

  • Conversion XYZ Float to Int32 ?
  • Determine scaling/offset ?
  • Implement CRS (and thus VLR)

Discussion

The Sink needs some discussion. Writing currently only works when the Source columns match a LasPoint exactly. Do we fill in missing columns, and thus allow for an invalid point type? Also, the xyz coordinates will arrive as floats, which we need to scale and offset to Int32. Doing this after streaming is detrimental to performance. For float input I would propose:

  • User is required to give a bounding box used for scaling/offset. Precision (for scaling) is a keyword argument set by default to 2 (=> scale=0.01). We give errors/warnings when these things overflow during streaming. Creation of a Sink: LasIO.Sink(filename, bbox; precision=2, crs=nothing)
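To make the proposal concrete, here is a rough sketch of the scaling step for a single axis. Everything in it is an assumption for illustration: `CoordTransform`, `make_transform` and `to_raw` are hypothetical names, and using the bounding box minimum as the offset is just one possible choice (the demo file uses an offset of 0).

```julia
# Hypothetical sketch, not the LasIO API: scale/offset for one axis.
struct CoordTransform
    scale::Float64
    offset::Float64
end

# Derive the scale from the decimal precision and the offset from the
# lower bound of the bounding box (an assumed convention).
function make_transform(lo::Real, hi::Real; precision::Integer=2)
    scale = 10.0^(-precision)          # precision=2 => scale=0.01
    offset = Float64(lo)
    span = (hi - offset) / scale
    span <= typemax(Int32) ||
        throw(ArgumentError("bounding box too large for Int32 at precision $precision"))
    return CoordTransform(scale, offset)
end

# Convert one float coordinate to its raw Int32 value, erroring on overflow
# so the problem surfaces during streaming instead of producing bad points.
function to_raw(t::CoordTransform, v::Real)
    r = round((v - t.offset) / t.scale)
    typemin(Int32) <= r <= typemax(Int32) ||
        throw(OverflowError("coordinate $v does not fit in Int32"))
    return Int32(r)
end

# x range of the demo file
t = make_transform(289814.15, 289818.50)
raw_min = to_raw(t, 289814.15)
raw_max = to_raw(t, 289818.50)
```

Checking the bounding box once up front keeps the per-point work to a subtraction, a division and a range check.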

Further improvements can be made to the point types. Since these are hardly used by this implementation (only for looking up attributes and types for the Schema creation), we could explode some attributes such as the flag_byte into their individual components for better accessibility. I'm not sure what this would do to the performance though.
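As an illustration of exploding the flag_byte, the sketch below unpacks it using the bit layout of LAS point formats 0 to 5 (bits 0-2 return number, bits 3-5 number of returns, bit 6 scan direction flag, bit 7 edge of flight line). The function name `explode_flag_byte` and the NamedTuple return type are assumptions, not existing LasIO code.

```julia
# Hypothetical helper: split the packed flag byte of LAS point formats 0-5
# into its individual components.
function explode_flag_byte(b::UInt8)
    return (
        return_number       = b & 0x07,                 # bits 0-2
        number_of_returns   = (b >> 3) & 0x07,          # bits 3-5
        scan_direction_flag = ((b >> 6) & 0x01) == 0x01, # bit 6
        edge_of_flight_line = ((b >> 7) & 0x01) == 0x01, # bit 7
    )
end

# Every point in the demo file has flag_byte 0x30.
flags = explode_flag_byte(0x30)
```

For 0x30 this yields a return number of 0 and 6 returns, which agrees with the lasinfo report in the demo.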

Demo

julia> s = LasIO.Source("test/srs.las")
LasIO.Source(Data.Schema:
rows: 10  cols: 10
Columns:
 "x"                   Int32  
 "y"                   Int32  
 "z"                   Int32  
 "intensity"           UInt16 
 "flag_byte"           UInt8  
 "raw_classification"  UInt8  
 "scan_angle"          Int8   
 "user_data"           UInt8  
 "pt_src_id"           UInt16 
 "gps_time"            Float64, LasHeader with 10 points.
, IOStream(<file test/srs.las>), "test/srs.las", 759)

julia> d = Data.stream!(s, DataFrame)
DataFrames.DataFrameStream{Tuple{Array{Int32,1},Array{Int32,1},Array{Int32,1},Array{UInt16,1},Array{UInt8,1},Array{UInt8,1},Array{Int8,1},Array{UInt8,1},Array{UInt16,1},Array{Float64,1}}}((Int32[28981415, 28981464, 28981512, 28981560, 28981608, 28981656, 28981703, 28981753, 28981801, 28981850], Int32[432097861, 432097884, 432097906, 432097928, 432097950, 432097971, 432097992, 432098016, 432098038, 432098059], Int32[17076, 17076, 17075, 17074, 17068, 17066, 17063, 17062, 17061, 17058], UInt16[0x0104, 0x0118, 0x0118, 0x0118, 0x0104, 0x00f0, 0x00f0, 0x0118, 0x0118, 0x0104], UInt8[0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30], UInt8[0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02], Int8[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], UInt8[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00], UInt16[0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000], [4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5, 4.99451e5]), String["x", "y", "z", "intensity", "flag_byte", "raw_classification", "scan_angle", "user_data", "pt_src_id", "gps_time"])

julia> Data.close!(d)
10×10 DataFrames.DataFrame
│ Row │ x        │ y         │ z     │ intensity │ flag_byte │ raw_classification │ scan_angle │ user_data │ pt_src_id │ gps_time  │
├─────┼──────────┼───────────┼───────┼───────────┼───────────┼────────────────────┼────────────┼───────────┼───────────┼───────────┤
│ 1   │ 28981415 │ 432097861 │ 17076 │ 0x0104    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 2   │ 28981464 │ 432097884 │ 17076 │ 0x0118    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 3   │ 28981512 │ 432097906 │ 17075 │ 0x0118    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 4   │ 28981560 │ 432097928 │ 17074 │ 0x0118    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 5   │ 28981608 │ 432097950 │ 17068 │ 0x0104    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 6   │ 28981656 │ 432097971 │ 17066 │ 0x00f0    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 7   │ 28981703 │ 432097992 │ 17063 │ 0x00f0    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 8   │ 28981753 │ 432098016 │ 17062 │ 0x0118    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 9   │ 28981801 │ 432098038 │ 17061 │ 0x0118    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │
│ 10  │ 28981850 │ 432098059 │ 17058 │ 0x0104    │ 0x30      │ 0x02               │ 0          │ 0x00      │ 0x0000    │ 4.99451e5 │

julia> Data.reset!(s)

julia> d = Data.stream!(s, LasIO.Sink, "test_final.las")
Stream now at 227
LasIO.Sink{LasIO.LasPoint1}(IOStream(<file test_final.las>), LasHeader with 10 points.
, LasIO.LasPoint1)

julia> Data.close!(d)
LasIO.Sink{LasIO.LasPoint1}(IOStream(<file test_final.las>), LasHeader with 10 points.
, LasIO.LasPoint1)
➜ lasinfo test_final.las
lasinfo (170528) report for test_final.las
reporting all LAS header entries:
  file signature:             'LASF'
  file source ID:             0
  global_encoding:            0
  project ID GUID data 1-4:   00000000-0000-0000-2020-202020202020
  version major.minor:        1.0
  system identifier:          'LasIO.jl datastream             '
  generating software:        'LasIO.jl                        '
  file creation day/year:     10/2018
  header size:                227
  offset to point data:       227
  number var. length records: 0
  point data format:          1
  point data record length:   28
  number of point records:    10
  number of points by return: 10 0 0 0 0
  scale factor x y z:         0.01 0.01 0.01
  offset x y z:               0 0 0
  min x y z:                  289814.15 4320978.61 170.58
  max x y z:                  289818.50 4320980.59 170.76
reporting minimum and maximum for all LAS point record entries ...
  X            28981415   28981850
  Y           432097861  432098059
  Z               17058      17076
  intensity         240        280
  return_number       0          0
  number_of_returns   6          6
  edge_of_flight_line 0          0
  scan_direction_flag 0          0
  classification      2          2
  scan_angle_rank     0          0
  user_data           0          0
  point_source_ID     0          0
  gps_time 499450.805994 499450.806120
number of first returns:        10
number of intermediate returns: 0
number of last returns:         0
number of single returns:       0
WARNING: for return 1 real number of points by return (0) is different from header entry (10).
WARNING: there are 10 points with return number 0
overview over number of returns of given pulse: 0 0 0 0 0 10 0
histogram of classification of points:
              10  ground (2)

@evetion evetion requested review from c42f and visr January 10, 2018 16:44
@evetion
Collaborator Author

evetion commented Jan 10, 2018

I think this also comes close to the comment at https://github.com/FugroRoames/PointClouds.jl:

Perhaps one day PointCloud can be implemented in terms of an underlying DataFrame [..]

julia> s = LasIO.Source("test/srs.las")
julia> d = Data.stream!(s, DataFrame)
julia> d = Data.close!(d)
julia> d[:intensity]
10-element Array{UInt16,1}:
 0x0104
 0x0118
 0x0118
 0x0118
 0x0104
 0x00f0
 0x00f0
 0x0118
 0x0118
 0x0104

@evetion
Collaborator Author

evetion commented Jan 17, 2018

  • Document API changes (add high level interface) by @visr
  • Fail on non-fitting schemas in Sink
  • Use xcoords() and other existing functions
  • Add issues for architecture change to transparently rewrite raw fields
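For the "fail on non-fitting schemas" item, a check along these lines might be enough. This is only a sketch: `DemoPoint` is a hypothetical stand-in for a real LasPoint type (with fewer fields), and `check_schema` is an assumed name, not an existing LasIO function.

```julia
# Hypothetical stand-in for a LasPoint type, reduced to four fields.
struct DemoPoint
    x::Int32
    y::Int32
    z::Int32
    intensity::UInt16
end

# Compare incoming column names and types against the fields of point type T
# and throw if they do not match exactly, in order.
function check_schema(names::AbstractVector, types::AbstractVector, ::Type{T}) where {T}
    expected = collect(zip(String.(fieldnames(T)), fieldtypes(T)))
    actual = collect(zip(String.(names), types))
    if actual != expected
        throw(ArgumentError("schema does not match $(T): expected $(expected), got $(actual)"))
    end
    return true
end

ok = check_schema(["x", "y", "z", "intensity"], [Int32, Int32, Int32, UInt16], DemoPoint)
```

Failing before the first point is written avoids leaving a partially written file behind.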

@visr
Owner

visr commented Jan 15, 2019

I believe that now, a year later, it makes more sense to add support for the new Tables.jl interface instead. Perhaps it would be good to focus on getting #16 in first, and then revisit this, since #16 will also affect the API?

@evetion
Collaborator Author

evetion commented Jan 16, 2019

I'm not sure this clashes with #16; these are two separate approaches in my opinion.

@visr
Owner

visr commented Jan 16, 2019

What do you mean by two separate approaches? That we should have one or the other? I thought we could have both, right?

In any case it might be good to try to get LAS 1.3 and 1.4 support in first, as it is becoming increasingly common.
